EMO-BON Metagenomics: From Backend Integration to Frontend Processing

Author

David Paleček dpalecek@ualg.pt

Published

November 17, 2025

November 26, 2025

EMO-BON establishes a long-term omics observatory for marine biodiversity through bimonthly sampling of coastal waters and both soft and hard sediments across more than 20 stations in Europe. Standard operating procedures are applied for sample collection, sequencing, and workflow analysis to ensure consistent taxonomic and functional annotation. FAIR principles are implemented using appropriate ontologies, RO-Crates, and a Python-based data analysis toolkit prepared for deployment in Virtual Research Environments (VREs).

Session 1

Online session held on 3rd November (slides)

Session 2

Faculty of Pharmacy of the University of Porto, 17th November during the ELIXIR Portugal All Hands 2025

This tutorial provides a set of minimal examples showing how to generate a knowledge graph from a set of RO-Crates and use it as a SPARQL endpoint, either directly or via Python rdflib. How to organize the data and build the RO-Crates is beyond the scope of this tutorial, but the Galaxy tutorial is a great starting point.

The second part focuses on a specific use case of EMO-BON data: the pilot implementation of the Virtual Research Environment is introduced and basic visualization and analysis are demonstrated. More extensive resources will be provided once the VRE is fully deployed and operational.

Environment setup

This setup is common to both parts of the tutorial. We will clone two repositories and install one of them. To keep everything exactly the same, we create a designated folder

mkdir emobon_demo
cd emobon_demo

Create a dedicated conda/python environment

# if you are using conda
conda create -n "emobon" python=3.10   # or higher
conda activate emobon

Python resources

Jupyter notebooks are included in the tutorials repository, which we clone locally

# create a folder
mkdir biohap
cd biohap

# clone the repository into newly created folder
git clone https://github.com/Py-ualg/biohap.git

# step into the repository
cd biohap

# install dependencies using pip
pip install -r requirements-biodata.txt

# setup jupyter kernel
ipython kernel install --user --name "biohap"

A separate setup is provided for the Data Analysis Toolkit section.

Backend setup

As we are going to work with EMO-BON metagenomics data, one way is to download some of the RO-Crates from the GitHub repository manually.

It is easier, however, to clone the whole repository. Navigate back to ..../emobon_demo/ and do:

mkdir ro-crates
cd ro-crates

git clone https://github.com/emo-bon/analysis-results-cluster-01-crate.git

As you can also see on GitHub, you now have the RO-Crates per sample and the corresponding .ttl files to simplify your life.

We will use the Fuseki server as a SPARQL endpoint. It is a Java application, therefore you might need to install Java first if the java -version command does not return anything or your version is < 17.0.

sudo apt update
sudo apt install -y openjdk-17-jre-headless
java -version

Here comes the Fuseki download itself. Please choose an appropriate folder for this. For direct download visit this page.

wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-5.6.0.tar.gz

# extract the archive
tar -xvf apache-jena-fuseki-5.6.0.tar.gz

For Java on Windows, if you have winget, you can follow these steps (not tested by the authors). Otherwise, download the OpenJDK .exe installer from Microsoft and install it. You might need to close and reopen the command line window to see the updated version with the java --version command in your PowerShell.

Download the Fuseki 5.6.0 zip file directly from Apache and extract the archive.

Hands-on tutorial

There are many parallel technologies that could be employed in each step. The pipeline here relies on Fuseki for exposing the SPARQL endpoint and uses Python to query the graph.

A RO-Crate is an integrated view through which you can see an entire Research Object; the methods, the data, the output and the outcomes of a project or a piece of work. Linking all this together enables the sharing of research outputs with their context, as a coherent whole. https://www.researchobject.org

Image credit: Goble, C. (2024, February 16). FAIR Digital Research Objects: Metadata Journeys. University of Auckland Seminar, Auckland. Zenodo. https://doi.org/10.5281/zenodo.10710142

Note that we already have some of the EMO-BON RO-Crates locally in /emobon_demo/ro-crates/analysis-results-cluster-01-crate/.

The description of all the contents of a RO-Crate is contained in its root directory in ro-crate-metadata.json, strictly with this name. We cannot upload the ro-crate-metadata.json files directly to the SPARQL endpoint; they first need to be serialized. We do that in the Jupyter notebook.

Initialize the JupyterLab server

# ideally in the biohap folder
python -m jupyterlab

Open biohap/biodata_pt/python_tools/01_fuseki_emobon.ipynb. Run Section 0, imports and function definitions. Section 1 serializes the ro-crate-metadata.json files to Turtle (.ttl), compatible with graph ingestion. Do not forget to change the relative path from your home to the analysis-results-cluster-01-crate folder.
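If you want to see what happens under the hood, a minimal sketch of this serialization step could look like the snippet below; the folder path is illustrative and should be adjusted to your layout.

from pathlib import Path
from rdflib import Graph

# folder with the cloned crates; adjust to your layout
crate_dir = Path("../ro-crates/analysis-results-cluster-01-crate")

# convert every ro-crate-metadata.json into a Turtle file next to it
for meta in crate_dir.rglob("ro-crate-metadata.json"):
    g = Graph()
    g.parse(str(meta), format="json-ld")   # rdflib parses JSON-LD directly
    g.serialize(destination=str(meta.with_suffix(".ttl")), format="turtle")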

Expose your triples as a SPARQL endpoint accessible over HTTP. Fuseki provides REST-style interaction with your RDF data, here running locally on localhost. Navigate to your downloaded Fuseki files, extract the archive if you have not done so yet, and start the server

cd apache-jena-fuseki-5.6.0/

# start the server
./fuseki-server

This should show you output similar to

15:46:47 INFO  Config          :: Fuseki Base = /home/david-palecek/coding/apache-jena-fuseki-5.6.0/run
15:46:47 INFO  Config          :: No databases: dir=/home/david-palecek/coding/apache-jena-fuseki-5.6.0/run/configuration
15:46:48 INFO  Config          :: UI Base = fuseki-server.jar
15:46:48 INFO  Shiro           :: Shiro configuration: file:/home/david-palecek/coding/apache-jena-fuseki-5.6.0/run/shiro.ini

Now, opening the default Fuseki port 3030 at http://localhost:3030/, you should see something like

Fuseki server UI
Upload Dataset / Graph

In Fuseki, go to new dataset -> give it the name emobon -> add data -> select the .ttl files we serialized in the RO-Crates section and upload them all at once or one by one.

How to create a new dataset and upload data to it from Python is shown in Sections 3 and 4 of the biohap/biodata_pt/python_tools/01_fuseki_emobon.ipynb notebook.

Pseudo code for serializing a ro-crate-metadata.json and uploading the data (no new dataset creation; the endpoint URL assumes the emobon dataset created above) would look like

import requests
from rdflib import Graph

# Graph Store endpoint of the emobon dataset (default graph)
FUSEKI_URL = "http://localhost:3030/emobon/data?default"
# optional base IRI used to resolve relative IRIs in the JSON-LD
base = "http://example.org/crate/"

# read the json file and convert it to a graph
with open(".../ro-crate-metadata.json", "r", encoding="utf-8") as f:
    jsonld_text = f.read()

g = Graph()
# rdflib accepts a JSON-LD string as input; base is optional
g.parse(data=jsonld_text, format="json-ld", publicID=base)

# upload the serialized triples to the endpoint as Turtle
resp = requests.put(
    FUSEKI_URL,
    data=g.serialize(format="turtle").encode("utf-8"),
    headers={"Content-Type": "text/turtle"},
    timeout=60,
)
resp.raise_for_status()
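Note that in the SPARQL Graph Store protocol a PUT replaces the contents of the target graph while a POST adds to it, so when accumulating several crates into the same graph you typically want POST (or distinct target graphs).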

The example which integrates fetching the RO-Crates from GitHub directly is in biohap/biodata_pt/python_tools/02_fuseki_emobon_GH.ipynb or online.

The direct way from Fuseki is to edit the query in the actions panel. See a more detailed but accessible introduction with examples. When you click query, the default query is shown

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE {
  ?sub ?pred ?obj .
} LIMIT 10

, which returns the first 10 triples in the graph.

A few more queries to try: count all triples, find entities with a text/html encoding format, and additionally retrieve their download URLs.

# count all triples in the graph
SELECT (COUNT(*) AS ?c)
WHERE {
  ?subject ?predicate ?object .
}

PREFIX sdo: <http://schema.org/>

# entities whose encoding format is text/html
SELECT ?x ?dtype
WHERE {
  ?x sdo:encodingFormat ?dtype .
  FILTER regex(str(?dtype), "^text/html", "i")
}

PREFIX sdo: <http://schema.org/>

# the same, but also retrieving the download URL
SELECT ?x ?dtype ?durl
WHERE {
  ?x sdo:encodingFormat ?dtype ;
    sdo:downloadUrl ?durl .
  FILTER regex(str(?dtype), "^text/html", "i")
}

Now you can click the link of one of the Krona files and open it in the browser; unsurprisingly, it is a Krona plot.

Now let’s get the real metaGOflow outputs, specifically the SSU taxonomy tables. There are several ways to do it.

1. RO-Crate browser: EMBRC hosts the RO-Crate viewer for the EMO-BON data.

2. Local SPARQL: Write a query to get the sdo:downloadUrl links and put them into your browser, which automatically triggers the download. Hint for the exercise below: match a regex of the object on “SSU-taxonomy-summary”.

PREFIX sdo: <http://schema.org/>

SELECT ?subject ?predicate ?object ?durl
WHERE {
  ?subject ?predicate ?object .
  FILTER regex(str(?object), "SSU-taxonomy-summary", "i")
  OPTIONAL { ?object sdo:downloadUrl ?durl }
}
LIMIT 50

3. Use the data version control (DVC) tool: shown in the Python implementation of the above, which follows in the next section (a hypothetical sketch is also given below). For more on DVC, check its documentation.
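As a purely hypothetical sketch (the exact repository layout and file path are assumptions; the notebook in the next section shows the real implementation), fetching one DVC-tracked result table with DVC's Python API could look like:

import io
import pandas as pd
import dvc.api

# hypothetical path of a DVC-tracked result table inside the crate repository
content = dvc.api.read(
    "path/to/SSU-taxonomy-summary.tsv",
    repo="https://github.com/emo-bon/analysis-results-cluster-01-crate",
    mode="r",
)
df = pd.read_csv(io.StringIO(content), sep="\t")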

It is possible to export the tables from Fuseki for subsequent work, but let’s do everything seamlessly from the Jupyter notebook.

First we reproduce the queries in Section 2 of the biohap/biodata_pt/python_tools/01_fuseki_emobon.ipynb notebook (or online).

The JupyterLab server should still be running; if not, start it again in the .../biohap folder.

python -m jupyterlab

Since we have already ingested the triples from the RO-Crates, we just need to query the existing endpoint

q = """
SELECT (COUNT(*) AS ?c) 
WHERE { 
  ?s ?p ?o
}
"""

r = requests.get("http://localhost:3030/emobon/query", params={"query": q}, headers={"Accept": "application/sparql-results+json"})
print(r.json())

In the request target you can see that we query the emobon dataset, which we created earlier. The returned JSON is relatively easy to convert to a DataFrame (the sparql_json_to_df function):

import pandas as pd

def sparql_json_to_df(sparql_json):
    """
    Convert a SPARQL SELECT query JSON result to a pandas DataFrame.

    Parameters
    ----------
    sparql_json : dict
        JSON returned by Fuseki / SPARQL endpoint with Accept: application/sparql-results+json

    Returns
    -------
    pd.DataFrame
    """
    vars_ = sparql_json.get("head", {}).get("vars", [])
    rows = []

    for binding in sparql_json.get("results", {}).get("bindings", []):
        row = {}
        for var in vars_:
            # Some results might not bind all variables
            if var in binding:
                row[var] = binding[var]["value"]
            else:
                row[var] = None
        rows.append(row)

    df = pd.DataFrame(rows, columns=vars_)
    return df
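For example, the count result from the query above converts directly:

df = sparql_json_to_df(r.json())
print(df)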

The point of setting up a SPARQL graph database comes together when combining local EMO-BON queries with public SPARQL endpoints such as Wikidata and UniProt. A demonstration is given in Section 5 of biohap/biodata_pt/python_tools/01_fuseki_emobon.ipynb.

Tip

There is a Python wrapper for SPARQL, SPARQLWrapper, which is pip-installable and can be used as a standard Python module but also as a command line script.
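A minimal sketch, assuming the local emobon dataset from above is still running, of the same count query via SPARQLWrapper:

from SPARQLWrapper import SPARQLWrapper, JSON

# point the wrapper at the local Fuseki query endpoint
sparql = SPARQLWrapper("http://localhost:3030/emobon/query")
sparql.setQuery("SELECT (COUNT(*) AS ?c) WHERE { ?s ?p ?o }")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
print(results["results"]["bindings"][0]["c"]["value"])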

Beyond just retrieving the data, graph-based deep learning is a fast-growing field.


Data Analysis Toolkit

Because a public SPARQL endpoint with all the EMO-BON data is not yet available, we completely separate this second part from the first part, and the data starting point will be standard .csv tables, just compressed into .parquet files. Since the Blue-Cloud 2026 virtual research environment does not exist yet, we will run all the analysis locally. First, clone and install the momics-demos repository (from the emobon_demo folder):

# create a folder
mkdir momics-demos
cd momics-demos

# clone the repository into newly created folder
git clone https://github.com/emo-bon/momics-demos.git

# step into the repository
cd momics-demos

# install dependencies using pip
pip install -e .

# setup jupyter kernel
ipython kernel install --user --name "emobon"

All dependencies were installed during the setup. Initialize the JupyterLab server with

cd momics-demos
python -m jupyterlab

Open wf2_diversity/diversities_panel.ipynb. It looks like, and is, native Python notebook code. However, clicking the panel icon initiates the dashboard.

After getting familiar with the functionality, pick your favourite combination of taxon level and categorical variable that shows a clear pattern on the first two PCA components.

Every dashboard is also available in an interactive mode (..._interactive.ipynb). After discovering one of the main drivers, we would like to see what drives this difference.

Select one sample from each category and try to identify the taxa responsible for the variance.

Possible options

  1. One approach could be a PERMANOVA test, removing the taxa one by one. PERMANOVA is implemented in marine-omics-methods (marine-omics on PyPI).
  2. Alternatively, compute the correlation of each taxon with PC1 and PC2 and rank the features.
  3. Fit features as vectors onto the PCoA/PCA ordination (envfit). This is the most widely used method, especially in ecology (vegan package in R). An envfit equivalent is not built into scikit-bio, but you can manually regress each feature against the ordination axes (see the note after this list):
    # regress each taxon against the ordination axes;
    # `coords` holds the ordination scores (PC1, PC2) and `abund` the per-sample taxa abundances
    import pandas as pd
    import scipy.stats as st

    results = {}
    for feature in abund.columns:
        _, _, r_x, _, _ = st.linregress(coords["PC1"], abund[feature])
        _, _, r_y, _, _ = st.linregress(coords["PC2"], abund[feature])
        results[feature] = (r_x, r_y)

    vectors = pd.DataFrame(results, index=["r_PC1", "r_PC2"]).T
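Taxa with the largest absolute r_PC1 or r_PC2 values are the prime candidates for driving the separation seen in the dashboard; plotting them as arrows over the ordination scores reproduces the familiar envfit-style biplot. Whichever option you choose, it is worth cross-checking the top-ranked taxa between methods.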

This is free time to try your own ideas with the data, building upon the existing functionality of momics-demos. Do not hesitate to raise issues and reach out.