Abundance table preprocessing

Author

David Paleček dpalecek@ualg.pt

Published

May 5, 2025

Work in progress

This is a personal commitment to understand the effect (variance) of abundance table normalization and scaling methods on downstream tasks, which may be differential analysis etc.

Compare R based methods to the python ones.

Most well-known packages have the normalization methods implemented so raw data tables can be supplied to them, such as QIIME2 or refseq. For EMO-BON analysis, I do not use those (might be a mistake, because of bug risks), so I need to understand them properly.

Load SSU combined taxonomy from 181 EMO-BON samplings.

Code
import pandas as pd
import numpy as np
from skbio.diversity import beta_diversity

#| code-fold: false
# read the data from github
ssu_url = "https://github.com/emo-bon/momics-demos/raw/refs/heads/main/data/parquet_files/metagoflow_analyses.SSU.parquet"

ssu = pd.read_parquet(ssu_url)

# change abundance to int
ssu['abundance'] = ssu['abundance'].astype(int)
ssu.head()
ref_code ncbi_tax_id abundance superkingdom kingdom phylum class order family genus species
0 EMOBON00084 2157 7 Archaea None None None None None None None
1 EMOBON00084 1801616 1 Archaea Candidatus_Woesearchaeota None None None None None
2 EMOBON00084 28890 1 Archaea Euryarchaeota None None None None None
3 EMOBON00084 183968 1 Archaea Euryarchaeota Thermococci None None None None
4 EMOBON00084 192989 3 Archaea Nanoarchaeota None None None None None

Let’s order them by abundance

Code
ssu.sort_values(by='abundance', inplace=True, ascending=False)

ssu
ref_code ncbi_tax_id abundance superkingdom kingdom phylum class order family genus species
9377 EMOBON00009 1236 26938 Bacteria Proteobacteria Gammaproteobacteria None None None None
9014 EMOBON00010 1236 26431 Bacteria Proteobacteria Gammaproteobacteria None None None None
10691 EMOBON00008 1236 14402 Bacteria Proteobacteria Gammaproteobacteria None None None None
68804 EMOBON00125 54526 13108 Bacteria Proteobacteria Alphaproteobacteria Pelagibacterales None None None
17321 EMOBON00003 72037 12545 Eukaryota Metazoa Arthropoda Hexanauplia None None None None
... ... ... ... ... ... ... ... ... ... ... ...
76710 EMOBON00228 12916 1 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Comamonadaceae Acidovorax None
76708 EMOBON00228 48736 1 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Burkholderiaceae Ralstonia None
76702 EMOBON00228 1381133 1 Bacteria Proteobacteria Betaproteobacteria Candidatus_Profftella None
76698 EMOBON00228 165695 1 Bacteria Proteobacteria Alphaproteobacteria Sphingomonadales Sphingomonadaceae Sphingobium None
76694 EMOBON00228 418853 1 Bacteria Proteobacteria Alphaproteobacteria Sneathiellales Sneathiellaceae Sneathiella Sneathiella_glossodoripedis

111320 rows × 11 columns

Total Sum Scaling (TSS) followed by Square Root Transformation

TSS

  • converts raw counts into relative abundances, alternative name Relative Abundance Normalization. Simple division by sum of abundances in each sample separately.
  • Purpose: Adjusts for varying sequencing depths between samples.
  • reference, McMurdie, P. J., & Holmes, S. (2014). Waste not, want not: why rarefying microbiome data is inadmissible. PLoS computational biology, 10(4), e1003531.

Square root transformation to relative abundances

  • This is a variance-stabilizing transformation — it reduces the effect of highly abundant taxa and improves comparability across samples.
  • It’s commonly used before distance-based analyses like Bray–Curtis dissimilarity or ordination (e.g., NMDS, PCoA).
  • reference, Legendre, P., & Gallagher, E. D. (2001). Ecologically meaningful transformations for ordination of species data. Oecologia, 129(2), 271–280.

Here is a function to pivot the taxonomy:

Code
def pivot_taxonomic_data(df: pd.DataFrame, values_col='abundance') -> pd.DataFrame:
    """
    Prepares the taxonomic data (LSU and SSU tables) for analysis.

    Args:
        df (pd.DataFrame): The input DataFrame containing taxonomic information.

    Returns:
        pd.DataFrame: A pivot table with taxonomic data.
    """
    # Select relevant columns
    df['taxonomic_concat'] = (
        df['ncbi_tax_id'].astype(str) + 
        ';sk_' + df['superkingdom'].fillna('') + 
        ';k_' + df['kingdom'].fillna('') + 
        ';p_' + df['phylum'].fillna('') + 
        ';c_' + df['class'].fillna('') + 
        ';o_' + df['order'].fillna('') + 
        ';f_' + df['family'].fillna('') + 
        ';g_' + df['genus'].fillna('') + 
        ';s_' + df['species'].fillna('')
    )
    pivot_table = df.pivot_table(
        index=['ncbi_tax_id','taxonomic_concat'], 
        columns='ref_code', 
        values=values_col,
    ).fillna(0)
    pivot_table = pivot_table.reset_index()
    # change inex name
    pivot_table.columns.name = None

    return pivot_table

, and methods to calculate to apply various scaling and normalization methods:

def TSS(df, sampleIds='ref_code'):
    """ Calculate TSS"""
    df['abundance_TSS'] = df.groupby(sampleIds)['abundance'].transform(lambda x: x / x.sum())
    return df

Now I want to systematically transform and send downstream

Downstream tasks are

  • Beta diversity
  • PCoA
  • ???
ssu = TSS(ssu)

assert ssu[ssu['ref_code'] == 'EMOBON00009']['abundance_TSS'].sum() == 1.0
ssu.head()
ref_code ncbi_tax_id abundance superkingdom kingdom phylum class order family genus species abundance_TSS
9377 EMOBON00009 1236 26938 Bacteria Proteobacteria Gammaproteobacteria None None None None 0.457989
9014 EMOBON00010 1236 26431 Bacteria Proteobacteria Gammaproteobacteria None None None None 0.458879
10691 EMOBON00008 1236 14402 Bacteria Proteobacteria Gammaproteobacteria None None None None 0.216178
68804 EMOBON00125 54526 13108 Bacteria Proteobacteria Alphaproteobacteria Pelagibacterales None None None 0.376753
17321 EMOBON00003 72037 12545 Eukaryota Metazoa Arthropoda Hexanauplia None None None None 0.337676

Calculate and plot Beta diversity

Code
import seaborn as sns

pivot = pivot_taxonomic_data(ssu, values_col='abundance_TSS')
metric = 'braycurtis'
pivot.head()
ncbi_tax_id taxonomic_concat EMOBON00001 EMOBON00003 EMOBON00004 EMOBON00005 EMOBON00006 EMOBON00007 EMOBON00008 EMOBON00009 ... EMOBON00242 EMOBON00243 EMOBON00244 EMOBON00245 EMOBON00246 EMOBON00247 EMOBON00248 EMOBON00249 EMOBON00250 EMOBON00251
0 2 2;sk_Bacteria;k_;p_;c_;o_;f_;g_;s_ 0.018025 0.017335 0.027319 0.047492 0.04602 0.069328 0.060912 0.00692 ... 0.017378 0.013971 0.051088 0.016756 0.053871 0.019248 0.041221 0.025231 0.039109 0.026800
1 6 6;sk_Bacteria;k_;p_Proteobacteria;c_Alphaprote... 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 10 10;sk_Bacteria;k_;p_Proteobacteria;c_Gammaprot... 0.000000 0.000000 0.000000 0.000000 0.00000 0.000017 0.000000 0.00000 ... 0.000000 0.000000 0.000201 0.000051 0.000145 0.000050 0.000123 0.000118 0.000088 0.000182
3 16 16;sk_Bacteria;k_;p_Proteobacteria;c_Betaprote... 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4 18 18;sk_Bacteria;k_;p_Proteobacteria;c_Deltaprot... 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000 ... 0.000097 0.000219 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

5 rows × 183 columns

Code
beta = beta_diversity(metric, pivot.iloc[:, 2:].T)
sns.heatmap(beta.to_data_frame(), vmin=0, vmax=1.0, cmap="viridis")

How do I evaluate difference between methods?