Taxa equivalence across heterogeneous microbiome datasets

Determine whether features in different microbiome datasets correspond to the same biological taxa when the datasets were profiled with different measurement techniques. For example, sequencing distinct regions of a marker gene (such as 16S rRNA) can cause the same taxon to be represented by different features across studies. Establish principled criteria or mapping procedures for ascertaining cross-dataset taxon equivalence under these heterogeneous profiling conditions.

Background

A central challenge in applying domain adaptation to biological data is heterogeneity of feature spaces. In microbiome research, differences in measurement techniques (e.g., sequencing distinct regions of a marker gene) can lead to non-aligned feature representations of the same biological taxa across cohorts or studies.

This heterogeneity hampers direct aggregation and alignment of datasets because it is nontrivial to establish correspondence between features that may denote the same underlying taxon. Resolving cross-dataset taxon equivalence is therefore a prerequisite for robust domain adaptation and for discovering domain-invariant biological signals.
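
To make the correspondence problem concrete, the sketch below shows one coarse heuristic: collapsing each dataset's features to a shared taxonomic rank and comparing the resulting lineages. This is an illustrative assumption, not a procedure proposed in the cited paper. The feature IDs, lineages, abundances, and the helper names `collapse_to_rank` and `shared_taxa` are all hypothetical, and the sketch assumes both datasets were annotated against the same reference taxonomy.

```python
from collections import defaultdict

GENUS = 5  # index of the genus rank in a domain-to-genus lineage (assumed layout)


def collapse_to_rank(abundances, taxonomy, rank_index):
    """Sum abundances of features that share a lineage down to rank_index."""
    collapsed = defaultdict(float)
    for feature_id, value in abundances.items():
        lineage = taxonomy.get(feature_id)
        # Skip features that are unannotated or unresolved at the requested rank.
        if lineage is None or len(lineage) <= rank_index or not lineage[rank_index]:
            continue
        collapsed[tuple(lineage[: rank_index + 1])] += value
    return dict(collapsed)


def shared_taxa(collapsed_a, collapsed_b):
    """Return the lineages present in both collapsed tables."""
    return sorted(set(collapsed_a) & set(collapsed_b))


# Hypothetical lineages, for illustration only.
BACTEROIDES = ("Bacteria", "Bacteroidota", "Bacteroidia",
               "Bacteroidales", "Bacteroidaceae", "Bacteroides")
PREVOTELLA = ("Bacteria", "Bacteroidota", "Bacteroidia",
              "Bacteroidales", "Prevotellaceae", "Prevotella")
FAECALIBACTERIUM = ("Bacteria", "Firmicutes", "Clostridia",
                    "Oscillospirales", "Ruminococcaceae", "Faecalibacterium")

# Dataset A: features derived from one marker-gene region (hypothetical IDs).
taxonomy_a = {"a_asv1": BACTEROIDES, "a_asv2": BACTEROIDES, "a_asv3": FAECALIBACTERIUM}
abundances_a = {"a_asv1": 0.25, "a_asv2": 0.25, "a_asv3": 0.5}

# Dataset B: features derived from a different region; b_asv3 resolves only to family.
taxonomy_b = {"b_asv1": BACTEROIDES, "b_asv2": PREVOTELLA, "b_asv3": PREVOTELLA[:5]}
abundances_b = {"b_asv1": 0.5, "b_asv2": 0.25, "b_asv3": 0.25}

collapsed_a = collapse_to_rank(abundances_a, taxonomy_a, GENUS)
collapsed_b = collapse_to_rank(abundances_b, taxonomy_b, GENUS)
common = shared_taxa(collapsed_a, collapsed_b)

for lineage in common:
    print(lineage[-1], collapsed_a[lineage], collapsed_b[lineage])
```

A match at a shared rank is only a candidate correspondence: two features assigned to the same genus may still represent different species or strains, and features that cannot be resolved at the chosen rank are dropped. The heuristic therefore illustrates why principled equivalence criteria are still needed rather than supplying them.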

References

In microbiome research, it can be unclear whether a particular taxa is the same across datasets, especially because sometimes the measurement techniques differ (e.g., taxa are characterized using different regions of a marker gene, such that the same taxa might be represented by different features in different datasets).

Domain adaptation in small-scale and heterogeneous biological datasets (2405.19221 - Orouji et al., 2024), Section 3.2.2: Heterogeneity of features