Abundance table preprocessing

Author

Published

May 5, 2025

Work in progress

This is a personal commitment to understand the effect (variance) of abundance table normalization and scaling methods on downstream tasks, which may be differential analysis etc.

TODO

Compare R based methods to the python ones.

Most well-known packages have the normalization methods implemented so raw data tables can be supplied to them, such as QIIME2 or refseq. For EMO-BON analysis, I do not use those (might be a mistake, because of bug risks), so I need to understand them properly.

Load SSU combined taxonomy from 181 EMO-BON samplings.

Code

import pandas as pd
import numpy as np
from skbio.diversity import beta_diversity

#| code-fold: false
# read the data from github
ssu_url = "https://github.com/emo-bon/momics-demos/raw/refs/heads/main/data/parquet_files/metagoflow_analyses.SSU.parquet"

ssu = pd.read_parquet(ssu_url)

# change abundance to int
ssu['abundance'] = ssu['abundance'].astype(int)
ssu.head()

	ref_code	ncbi_tax_id	abundance	superkingdom	kingdom	phylum	class	order	family	genus	species
0	EMOBON00084	2157	7	Archaea	None	None	None	None	None	None	None
1	EMOBON00084	1801616	1	Archaea		Candidatus_Woesearchaeota	None	None	None	None	None
2	EMOBON00084	28890	1	Archaea		Euryarchaeota	None	None	None	None	None
3	EMOBON00084	183968	1	Archaea		Euryarchaeota	Thermococci	None	None	None	None
4	EMOBON00084	192989	3	Archaea		Nanoarchaeota	None	None	None	None	None

Let’s order them by abundance

Code

ssu.sort_values(by='abundance', inplace=True, ascending=False)

ssu

	ref_code	ncbi_tax_id	abundance	superkingdom	kingdom	phylum	class	order	family	genus	species
9377	EMOBON00009	1236	26938	Bacteria		Proteobacteria	Gammaproteobacteria	None	None	None	None
9014	EMOBON00010	1236	26431	Bacteria		Proteobacteria	Gammaproteobacteria	None	None	None	None
10691	EMOBON00008	1236	14402	Bacteria		Proteobacteria	Gammaproteobacteria	None	None	None	None
68804	EMOBON00125	54526	13108	Bacteria		Proteobacteria	Alphaproteobacteria	Pelagibacterales	None	None	None
17321	EMOBON00003	72037	12545	Eukaryota	Metazoa	Arthropoda	Hexanauplia	None	None	None	None
...	...	...	...	...	...	...	...	...	...	...	...
76710	EMOBON00228	12916	1	Bacteria		Proteobacteria	Betaproteobacteria	Burkholderiales	Comamonadaceae	Acidovorax	None
76708	EMOBON00228	48736	1	Bacteria		Proteobacteria	Betaproteobacteria	Burkholderiales	Burkholderiaceae	Ralstonia	None
76702	EMOBON00228	1381133	1	Bacteria		Proteobacteria	Betaproteobacteria			Candidatus_Profftella	None
76698	EMOBON00228	165695	1	Bacteria		Proteobacteria	Alphaproteobacteria	Sphingomonadales	Sphingomonadaceae	Sphingobium	None
76694	EMOBON00228	418853	1	Bacteria		Proteobacteria	Alphaproteobacteria	Sneathiellales	Sneathiellaceae	Sneathiella	Sneathiella_glossodoripedis

111320 rows × 11 columns

Total Sum Scaling (TSS) followed by Square Root Transformation

TSS

converts raw counts into relative abundances, alternative name Relative Abundance Normalization. Simple division by sum of abundances in each sample separately.
Purpose: Adjusts for varying sequencing depths between samples.
reference, McMurdie, P. J., & Holmes, S. (2014). Waste not, want not: why rarefying microbiome data is inadmissible. PLoS computational biology, 10(4), e1003531.

Square root transformation to relative abundances

This is a variance-stabilizing transformation — it reduces the effect of highly abundant taxa and improves comparability across samples.
It’s commonly used before distance-based analyses like Bray–Curtis dissimilarity or ordination (e.g., NMDS, PCoA).
reference, Legendre, P., & Gallagher, E. D. (2001). Ecologically meaningful transformations for ordination of species data. Oecologia, 129(2), 271–280.

Here is a function to pivot the taxonomy:

Code

def pivot_taxonomic_data(df: pd.DataFrame, values_col='abundance') -> pd.DataFrame:
    """
    Prepares the taxonomic data (LSU and SSU tables) for analysis.

    Args:
        df (pd.DataFrame): The input DataFrame containing taxonomic information.

    Returns:
        pd.DataFrame: A pivot table with taxonomic data.
    """
    # Select relevant columns
    df['taxonomic_concat'] = (
        df['ncbi_tax_id'].astype(str) + 
        ';sk_' + df['superkingdom'].fillna('') + 
        ';k_' + df['kingdom'].fillna('') + 
        ';p_' + df['phylum'].fillna('') + 
        ';c_' + df['class'].fillna('') + 
        ';o_' + df['order'].fillna('') + 
        ';f_' + df['family'].fillna('') + 
        ';g_' + df['genus'].fillna('') + 
        ';s_' + df['species'].fillna('')
    )
    pivot_table = df.pivot_table(
        index=['ncbi_tax_id','taxonomic_concat'], 
        columns='ref_code', 
        values=values_col,
    ).fillna(0)
    pivot_table = pivot_table.reset_index()
    # change inex name
    pivot_table.columns.name = None

    return pivot_table

, and methods to calculate to apply various scaling and normalization methods:

def TSS(df, sampleIds='ref_code'):
    """ Calculate TSS"""
    df['abundance_TSS'] = df.groupby(sampleIds)['abundance'].transform(lambda x: x / x.sum())
    return df

Now I want to systematically transform and send downstream

Downstream tasks are

Beta diversity
PCoA
???

ssu = TSS(ssu)

assert ssu[ssu['ref_code'] == 'EMOBON00009']['abundance_TSS'].sum() == 1.0
ssu.head()

	ref_code	ncbi_tax_id	abundance	superkingdom	kingdom	phylum	class	order	family	genus	species	abundance_TSS
9377	EMOBON00009	1236	26938	Bacteria		Proteobacteria	Gammaproteobacteria	None	None	None	None	0.457989
9014	EMOBON00010	1236	26431	Bacteria		Proteobacteria	Gammaproteobacteria	None	None	None	None	0.458879
10691	EMOBON00008	1236	14402	Bacteria		Proteobacteria	Gammaproteobacteria	None	None	None	None	0.216178
68804	EMOBON00125	54526	13108	Bacteria		Proteobacteria	Alphaproteobacteria	Pelagibacterales	None	None	None	0.376753
17321	EMOBON00003	72037	12545	Eukaryota	Metazoa	Arthropoda	Hexanauplia	None	None	None	None	0.337676

Calculate and plot Beta diversity

Code

import seaborn as sns

pivot = pivot_taxonomic_data(ssu, values_col='abundance_TSS')
metric = 'braycurtis'
pivot.head()

	ncbi_tax_id	taxonomic_concat	EMOBON00001	EMOBON00003	EMOBON00004	EMOBON00005	EMOBON00006	EMOBON00007	EMOBON00008	EMOBON00009	...	EMOBON00242	EMOBON00243	EMOBON00244	EMOBON00245	EMOBON00246	EMOBON00247	EMOBON00248	EMOBON00249	EMOBON00250	EMOBON00251
0	2	2;sk_Bacteria;k_;p_;c_;o_;f_;g_;s_	0.018025	0.017335	0.027319	0.047492	0.04602	0.069328	0.060912	0.00692	...	0.017378	0.013971	0.051088	0.016756	0.053871	0.019248	0.041221	0.025231	0.039109	0.026800
1	6	6;sk_Bacteria;k_;p_Proteobacteria;c_Alphaprote...	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.00000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
2	10	10;sk_Bacteria;k_;p_Proteobacteria;c_Gammaprot...	0.000000	0.000000	0.000000	0.000000	0.00000	0.000017	0.000000	0.00000	...	0.000000	0.000000	0.000201	0.000051	0.000145	0.000050	0.000123	0.000118	0.000088	0.000182
3	16	16;sk_Bacteria;k_;p_Proteobacteria;c_Betaprote...	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.00000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
4	18	18;sk_Bacteria;k_;p_Proteobacteria;c_Deltaprot...	0.000000	0.000000	0.000000	0.000000	0.00000	0.000000	0.000000	0.00000	...	0.000097	0.000219	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000

5 rows × 183 columns

Code

beta = beta_diversity(metric, pivot.iloc[:, 2:].T)
sns.heatmap(beta.to_data_frame(), vmin=0, vmax=1.0, cmap="viridis")

Total Sum Scaling (TSS) followed by Square Root Transformation

TSS

Square root transformation to relative abundances

Now I want to systematically transform and send downstream

How do I evaluate difference between methods?