EMO-BON Metagenomics: From Backend Integration to Frontend Processing
November 16, 2025
This course should give you a set of minimal examples of how to generate a knowledge graph from a set of RO-Crates and use it as a SPARQL endpoint, either directly or via Python rdflib.
In the second part, focused on a specific use case of EMO-BON data, the pilot implementation of the Virtual Research Environment is introduced, a basic analysis is performed, issues are reported, and an extended analysis is proposed.
What is omitted here is how to organize the data and build the RO-Crates themselves.
Environmental setup
This setup is common to both parts of the tutorial. We will clone two repositories and install one of them. To keep everything exactly the same, we create a designated folder
mkdir emobon_demo
cd emobon_demo
Create a dedicated conda/python environment
# if you are using conda
conda create -n "emobon" python=3.10 # or higher
conda activate emobon
Python resources
# create a folder
mkdir momics-demos
cd momics-demos
# clone the repository into the newly created folder (note the trailing dot)
git clone https://github.com/emo-bon/momics-demos.git .
# install dependencies using pip
pip install -e .
# setup jupyter kernel
ipython kernel install --user --name "emobon"
Backend setup
As we are going to work with EMO-BON metagenomics data, one way is to download some of the RO-Crates manually from the GitHub repository.
Easier, however, is to clone the whole repository. Navigate back to .../emobon_demo/ and do:
mkdir ro-crates
cd ro-crates
git clone https://github.com/emo-bon/analysis-results-cluster-01-crate.git
As you can also see on GitHub, you now have the RO-Crates per sample and the corresponding .ttl files to simplify your life.
We will use the Fuseki server for a SPARQL endpoint. It is a Java application, therefore you might need to install Java first if the java -version command does not return anything or your version is < 17.0.
sudo apt update
sudo apt install -y openjdk-17-jre-headless
java -version
Here comes the Fuseki download itself. Please choose an appropriate folder for this. For a direct download visit this page.
wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-5.6.0.tar.gz
# open the archive
tar -xvf apache-jena-fuseki-5.6.0.tar.gz
On Windows: for Java, if you have winget, you can follow the steps here (not tested by the authors). Otherwise, download the OpenJDK .exe file from here and install it. You might need to close and reopen the command-line window to see the updated version with the java --version command in your PowerShell.
Download the Fuseki 5.6.0 zip file directly from Apache and extract the archive.
Hands-on tutorial
There are many parallel technologies that could be employed at each step. The pipeline here relies on Fuseki for exposing the SPARQL endpoint and uses Python to query the graph.
A RO-Crate is an integrated view through which you can see an entire Research Object; the methods, the data, the output and the outcomes of a project or a piece of work. Linking all this together enables the sharing of research outputs with their context, as a coherent whole. https://www.researchobject.org

Note that we already have some of the EMO-BON RO-Crates locally in /emobon_demo/ro-crates/analysis-results-cluster-01-crate/.
Expose your triples as a SPARQL endpoint accessible over HTTP. Fuseki provides REST-style interaction with your RDF data. Navigate to your downloaded Fuseki files, extract the archive if you have not done so yet, and start the server as follows
cd apache-jena-fuseki-5.6.0/
# start the server
./fuseki-server
This should show you output similar to
15:46:47 INFO Config :: Fuseki Base = /home/david-palecek/coding/apache-jena-fuseki-5.6.0/run
15:46:47 INFO Config :: No databases: dir=/home/david-palecek/coding/apache-jena-fuseki-5.6.0/run/configuration
15:46:48 INFO Config :: UI Base = fuseki-server.jar
15:46:48 INFO Shiro :: Shiro configuration: file:/home/david-palecek/coding/apache-jena-fuseki-5.6.0/run/shiro.ini
Now, opening your localhost, http://localhost:3030/, you should see the following page

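If you prefer to check from Python rather than the browser, a minimal liveness probe (assuming Fuseki runs on the default port 3030) could look like this:
import requests

# the Fuseki UI answers on the root URL; a 200 status means the server is up
resp = requests.get("http://localhost:3030/", timeout=10)
print(resp.status_code)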
Upload Dataset / Graph
In Fuseki, go to new dataset -> give it a name, emobon -> add data -> select the .ttl files and upload them all or one by one. This is a shortcut, because many times you will not have access to the .ttl serialized version of the RO-Crate. In that case, the starting point is the ro-crate-metadata.json file, which is always at the root of every RO-Crate.
import requests

# read the json file and convert it to a graph
file = ".../ro-crate-metadata.json"
with open(file, "r", encoding="utf-8") as f:
    jsonld_text = f.read()

# jsonld_to_rdflib is a helper defined in the accompanying notebook;
# it parses the JSON-LD text into an rdflib Graph
g = jsonld_to_rdflib(jsonld_text)

# upload data to the endpoint (the default Graph Store service of the emobon dataset)
fuseki_url = "http://localhost:3030/emobon/data"
resp = requests.put(
    fuseki_url,
    data=g.serialize(format="turtle").encode("utf-8"),
    headers={"Content-Type": "text/turtle"},
    timeout=60,
)
The full example is in this notebook.
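If you do not want to rely on the helper, recent rdflib versions can parse JSON-LD directly; a minimal sketch of what jsonld_to_rdflib could reduce to, assuming rdflib >= 6.0 (earlier versions need the rdflib-jsonld plugin):
from rdflib import Graph

def jsonld_to_rdflib(jsonld_text: str) -> Graph:
    """Parse a JSON-LD string into an rdflib Graph."""
    g = Graph()
    g.parse(data=jsonld_text, format="json-ld")
    return g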
The direct way from Fuseki is to edit the query under actions. See a more detailed but accessible introduction with examples. When you click query, the default query is shown
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE {
?sub ?pred ?obj .
} LIMIT 10
which matches all the triples in the graph but returns only the first 10. You can, for example, count all the triples instead:
SELECT (COUNT(*) as ?c)
WHERE {
?subject ?predicate ?object .
}
LIMIT 10
Next, list the text/html files:
PREFIX sdo: <http://schema.org/>
SELECT ?x ?dtype
WHERE {
?x sdo:encodingFormat ?dtype .
FILTER regex(str(?dtype), "^text/html", "i")
}
Then retrieve the sdo:downloadUrl of those files:
PREFIX sdo: <http://schema.org/>
SELECT ?x ?dtype ?durl
WHERE {
?x sdo:encodingFormat ?dtype ;
sdo:downloadUrl ?durl .
FILTER regex(str(?dtype), "^text/html", "i")
}
Now you can click the link of one of the Krona files, open it in the browser and, with no surprise, it is a Krona plot.
Now let’s get the real metaGOflow outputs, specifically the SSU taxonomy tables. There are several ways to do it.
1. RO-Crate browser: EMBRC hosts the RO-Crate viewer for the EMO-BON data.
2. Local SPARQL: Write a query to get the sdo:downloadUrl links and put them into your browser, which automatically triggers the download (see the Python sketch after this list). Hint for the query below: match the object with a regex on “SSU-taxonomy-summary”.
PREFIX sdo: <http://schema.org/>
SELECT ?subject ?predicate ?object ?durl
WHERE {
?subject ?predicate ?object .
FILTER regex(str(?object), "SSU-taxonomy-summary", "i")
OPTIONAL { ?object sdo:downloadUrl ?durl }
}
LIMIT 50
3. Use the data version control (DVC) tool: shown in the Python implementation of the above, which follows in the next section. For more on DVC, check its documentation.
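For option 2, here is a minimal Python sketch of the same idea. It assumes the "SSU-taxonomy-summary" string also appears in the sdo:downloadUrl values; if it only matches the file entities themselves, adapt the filter or join on the matching objects as in the query above.
import requests

# ask the local endpoint for the download links of the SSU taxonomy summaries
q = """
PREFIX sdo: <http://schema.org/>
SELECT ?durl
WHERE {
  ?x sdo:downloadUrl ?durl .
  FILTER regex(str(?durl), "SSU-taxonomy-summary", "i")
}
"""
r = requests.get(
    "http://localhost:3030/emobon/query",
    params={"query": q},
    headers={"Accept": "application/sparql-results+json"},
    timeout=60,
)
for binding in r.json()["results"]["bindings"]:
    url = binding["durl"]["value"]
    fname = url.rsplit("/", 1)[-1] or "download"
    # save each table next to the notebook
    with open(fname, "wb") as fh:
        fh.write(requests.get(url, timeout=60).content)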
It is possible to export the tables from Fuseki for subsequent work, but let’s do everything seamlessly from a Jupyter notebook.
First we reproduce what we did so far from a Jupyter notebook which is part of the momics-demos repository, so you already have it locally, with all the dependencies installed, too.
Since we have already ingested the triples from the RO-Crates, we just need to query the existing endpoint
import requests

q = """
SELECT (COUNT(*) AS ?c)
WHERE {
    ?s ?p ?o
}
"""
r = requests.get(
    "http://localhost:3030/emobon/query",
    params={"query": q},
    headers={"Accept": "application/sparql-results+json"},
)
print(r.json())
The returned JSON is relatively easy to convert to a pandas DataFrame (the sparql_json_to_df function below)
import pandas as pd


def sparql_json_to_df(sparql_json):
    """
    Convert a SPARQL SELECT query JSON result to a pandas DataFrame.

    Parameters
    ----------
    sparql_json : dict
        JSON returned by Fuseki / SPARQL endpoint with Accept: application/sparql-results+json

    Returns
    -------
    pd.DataFrame
    """
    vars_ = sparql_json.get("head", {}).get("vars", [])
    rows = []
    for binding in sparql_json.get("results", {}).get("bindings", []):
        row = {}
        for var in vars_:
            # Some results might not bind all variables
            if var in binding:
                row[var] = binding[var]["value"]
            else:
                row[var] = None
        rows.append(row)
    df = pd.DataFrame(rows, columns=vars_)
    return df
The second notebook includes all the steps of setting up the SPARQL endpoint and also combines local queries with the public endpoints of Wikidata and UniProt.
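A quick usage example, chaining it with the count query from above (r is the response object from the previous snippet):
df = sparql_json_to_df(r.json())
print(df)  # a single row with the triple count in column "c"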
There is also a Python wrapper for SPARQL, SPARQLWrapper, which is pip-installable and can be used as a standard Python module but also as a command-line script.
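As a minimal sketch, the same count query issued through SPARQLWrapper instead of raw requests (pointing at the local emobon dataset from above):
from SPARQLWrapper import SPARQLWrapper, JSON

# point the wrapper at the local Fuseki query service
sparql = SPARQLWrapper("http://localhost:3030/emobon/query")
sparql.setQuery("SELECT (COUNT(*) AS ?c) WHERE { ?s ?p ?o }")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
print(results["results"]["bindings"][0]["c"]["value"])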
Because a public SPARQL endpoint with all the EMO-BON data is not available yet, we completely separate this second part from the first part, and the data starting point will be standard .csv tables, just compressed into .parquet files. Since the Blue-Cloud 2026 virtual research environment does not exist yet, we will run all the analysis locally.
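As an illustration of that starting point, loading one of the compressed tables takes only a couple of lines; the file name below is hypothetical, the actual tables ship with the momics-demos workflows:
import pandas as pd

# hypothetical path; the real .parquet tables are bundled with the workflow folders
ssu = pd.read_parquet("data/ssu_taxonomy.parquet")  # needs pyarrow or fastparquet installed
print(ssu.head())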
All dependencies were installed during the setup. Initialize the Jupyter server with
cd momics-demos
python -m jupyterlab
Open wf2_diversity/diversities_panel.ipynb. It looks like, and is, native Python notebook code. However, clicking the panel icon
initiates the dashboard.
After getting familiar with the functionality, pick your favourite combination of taxon level + categorical variable showing a certain pattern in the first two PCA components.
Every dashboard is also available in an interactive mode (..._interactive.ipynb). After discovering one of the main drivers, we would like to see what drives this difference.
Select one sample from each category and try to identify the taxa responsible for the variance.
Possible options:
- One approach could be the PERMANOVA test, removing the taxa one by one. PERMANOVA is implemented in marine-omics-methods (marine-omics on PyPI).
- Alternatively, compute the correlation of each taxon with PC1 and PC2 and rank the features.
- Fit features as vectors onto the PCoA (envfit). This is the most widely used method, especially in ecology (vegan package in R). In Python (via scikit-bio):
# an envfit equivalent is not built into scikit-bio, but you can manually regress
# each feature onto the ordination axes
import pandas as pd
import scipy.stats as st

# abund: samples x taxa abundance table; coords: PCoA sample coordinates
results = {}
for feature in abund.columns:
    slope_x, _, r_x, _, _ = st.linregress(coords["PC1"], abund[feature])
    slope_y, _, r_y, _, _ = st.linregress(coords["PC2"], abund[feature])
    results[feature] = (r_x, r_y)
vectors = pd.DataFrame(results, index=["r_PC1", "r_PC2"]).T
This is free time to try your own ideas with the data, building upon the existing functionality of momics-demos. Do not hesitate to raise issues and reach out.