GenoTEX Benchmark: Automated Gene Expression

Updated 3 July 2026

GenoTEX is an expert-curated benchmark designed to assess the automation of gene expression analysis with real transcriptomic data.
It systematically evaluates end-to-end workflows—from dataset filtering to preprocessing and statistical inference—using metrics like F₁ and AUROC.
The benchmark supports agentic frameworks such as GenoMAS and GenoAgent, driving reproducible, code-driven advancements in genomics.

GenoTEX is an expert-curated, multi-faceted benchmark designed to evaluate the automation of gene expression data analysis by artificial intelligence agents and multi-modal machine learning models. Its key focus is the systematic assessment of end-to-end workflows for gene–trait association (GTA) studies with real transcriptomic data, encompassing dataset selection, preprocessing, and statistical inference. GenoTEX provides a rigorous, reproducible foundation for examining both the scientific validity and technical robustness of automated gene analysis systems, and has been central to the evaluation and advancement of agentic bioinformatics frameworks such as GenoMAS and GenoAgent (Liu et al., 28 Jul 2025, Liu et al., 2024, Kan-Tor et al., 2024).

1. Design Principles and Benchmark Scope

GenoTEX was established to fill gaps in benchmarking the automation of highly technical gene expression pipelines, explicitly aligning with best practices in computational genomics. Unlike synthetic or retrieval-focused evaluations, GenoTEX uses real human genomics data—spanning microarray (GEO) and RNA-seq (TCGA) platforms—and enforces the generation and successful execution of analysis code. A pivotal requirement is that agentic systems not only plan and route logical steps, but also implement, debug, and adapt code to handle the challenges of semi-structured, noisy biological data and high-dimensional sparsity (Liu et al., 28 Jul 2025, Liu et al., 2024).

The benchmark couples open-source datasets and expert-validated annotations, ensuring each automated output can be measured against a curated reference standard. GenoTEX decomposes the gene-trait association workflow into three sequential, interdependent tasks:

Dataset Filtering and Selection: Identifying and ranking omics cohorts that contain traits of interest.
Data Preprocessing: Extraction, normalization, and merging of gene expression matrices and clinical metadata, including standardized probe mapping and batch effect correction.
Statistical Analysis (Gene Identification): Application of regression models (e.g., Lasso for sparse gene selection, linear mixed models) for discovering genes associated with target phenotypes, alongside adjustments for latent confounders.

The datasets comprise 1,384 GTA problems covering 913 human cohorts and 132 clinical traits, all annotated twice by trained bioinformaticians with adjudication for inconsistencies (Liu et al., 28 Jul 2025).
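As a rough illustration of the preprocessing task, the sketch below averages probe-level measurements into gene-level values and links them to a clinical trait. It is a minimal sketch, not the benchmark's pipeline: the identifiers, the toy mapping table, and the choice to average probes that target the same gene are assumptions made for illustration, and normalization and batch effect correction are omitted.

```python
from collections import defaultdict
from statistics import mean


def map_probes_to_genes(probe_expr, probe_to_symbol):
    """Average probe-level rows into gene-level rows; drop unmapped probes."""
    grouped = defaultdict(list)
    for probe, values in probe_expr.items():
        symbol = probe_to_symbol.get(probe)
        if symbol:  # probes without a gene symbol are skipped
            grouped[symbol].append(values)
    # Average, sample by sample, across probes that target the same gene.
    return {gene: [mean(col) for col in zip(*rows)] for gene, rows in grouped.items()}


def merge_with_trait(gene_expr, sample_ids, trait_by_sample):
    """Keep only samples that carry a trait label; return one record per sample."""
    records = []
    for idx, sample in enumerate(sample_ids):
        if sample in trait_by_sample:
            row = {gene: values[idx] for gene, values in gene_expr.items()}
            row["sample"] = sample
            row["trait"] = trait_by_sample[sample]
            records.append(row)
    return records


# Toy inputs (hypothetical probe IDs, sample IDs, and values).
sample_ids = ["S1", "S2", "S3"]
probe_expr = {
    "p1": [1.0, 2.0, 3.0],
    "p2": [3.0, 4.0, 5.0],
    "p3": [0.5, 0.5, 0.5],
    "p4": [9.0, 9.0, 9.0],
}
probe_to_symbol = {"p1": "TP53", "p2": "TP53", "p3": "BRCA1"}  # p4 has no mapping
trait_by_sample = {"S1": 1, "S3": 0}  # S2 has no trait label

gene_expr = map_probes_to_genes(probe_expr, probe_to_symbol)
linked = merge_with_trait(gene_expr, sample_ids, trait_by_sample)
```

Dropping unlabeled samples at merge time keeps the expression and clinical tables aligned, which is what the later statistical step relies on.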

2. Task Definitions, Pipelines, and Ground Truth

Each GenoTEX task models canonical epidemiological and bioinformatics workflows as deployed in real-world genomics research:

Dataset Filtering (DF): Binary classification task—does a dataset possess the required phenotype?
Dataset Selection (DS): Ranking task—in multi-cohort scenarios, choose the dataset matching expert reference.
Preprocessing: Integration and normalization of gene and phenotype tables, probe-to-symbol mapping (GNorm2), batch correction (ComBat), and clinical trait parsing. Downstream, all outputs are unified into standardized pandas DataFrames.
Statistical Analysis: Both unconditional and conditional gene-trait associations are supported, using Lasso regression with hyperparameter tuning or mixed models when batch effects or population structure are detected. Outputs include ranked gene lists, effect estimates, and p-values.
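The statistical analysis step can be illustrated with a small Lasso fit. The sketch below implements cyclic coordinate descent in plain Python and reports the genes with non-zero coefficients. It is a sketch under stated assumptions: the gene names and data are toy values, the penalty `alpha` is fixed rather than tuned, and no confounder adjustment or mixed model is included.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no signal
            # Inner product of gene j with the residual, with gene j's own fit added back.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            for i in range(n):
                residual[i] -= X[i][j] * delta
            w[j] = new_w
    return w


# Toy cohort: 4 samples x 3 centered genes (hypothetical names and values).
genes = ["GENE_A", "GENE_B", "GENE_C"]
X = [
    [1.0, 1.0, 1.0],
    [1.0, -1.0, -1.0],
    [-1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
]
# The trait depends strongly on GENE_A and only weakly on GENE_B.
y = [2.1, 1.9, -1.9, -2.1]

coefficients = lasso_coordinate_descent(X, y, alpha=0.5)
ranked = sorted(zip(genes, coefficients), key=lambda gc: abs(gc[1]), reverse=True)
selected = [gene for gene, coef in ranked if coef != 0.0]
```

With this penalty the weak GENE_B effect is shrunk to exactly zero, which is the sparsity property that makes Lasso suitable for producing short candidate gene lists.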

The expert ground truth is constructed by multiple independent annotations, with rigorous inter-annotator agreement (e.g., F₁ = 94.73% for DF, 90.26% for DS), ensuring the scientific relevance and fidelity of the evaluation (Liu et al., 2024, Liu et al., 28 Jul 2025).
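Agreement scores of this kind can be computed by treating one set of labels as the reference and the other as the prediction. The sketch below shows an F₁ computation for binary dataset filtering labels and an exact-match accuracy for dataset selection; the labels and dataset identifiers are invented for illustration and are not taken from the benchmark.

```python
def f1_score(reference, predicted):
    """F1 of binary labels, with `reference` as the ground truth."""
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def selection_accuracy(reference, predicted):
    """Fraction of problems where the chosen dataset matches the reference."""
    matches = sum(1 for r, p in zip(reference, predicted) if r == p)
    return matches / len(reference)


# Dataset filtering: 1 = cohort contains the trait, 0 = it does not (toy labels).
df_reference = [1, 1, 1, 0, 0, 1]
df_predicted = [1, 1, 0, 0, 1, 1]
df_f1 = f1_score(df_reference, df_predicted)

# Dataset selection: one chosen cohort per problem (hypothetical identifiers).
ds_reference = ["COHORT_A", "COHORT_B", "COHORT_C", "COHORT_D"]
ds_predicted = ["COHORT_A", "COHORT_X", "COHORT_C", "COHORT_D"]
ds_accuracy = selection_accuracy(ds_reference, ds_predicted)
```

The same two functions apply whether the second label set comes from another annotator or from an automated agent.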

3. Evaluation Metrics

GenoTEX employs a suite of rigorous, biologically grounded metrics, each designed to probe a distinct layer of correctness or scientific utility:

Dataset Filtering and Selection: Measured by accuracy (DS) or F₁ score (DF), reflecting the ability to match trained bioinformaticians.
Preprocessing Quality: Quantified using Attribute Jaccard (AJ; overlap of gene/feature names), Sample Jaccard (SJ; overlap of sample identifiers), and the average Pearson correlation over matched features (Corr_avg). These are combined multiplicatively into the Composite Similarity Correlation (CSC):

$\text{CSC} = \mathrm{AJ} \times \mathrm{SJ} \times \mathrm{Corr}_{\mathrm{avg}}$

Statistical Analysis: Precision, recall, F₁, AUROC, Jaccard index, and GSEA enrichment, comparing predicted gene sets against ground truth associations.
Trait Prediction: Accuracy of regression/model-based trait labelling on held-out samples.
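The CSC formula above can be worked through on a toy example. The sketch below computes the two Jaccard terms and the average Pearson correlation for a pair of small gene-by-sample tables and multiplies them. The data layout (nested dictionaries), the handling of constant features (correlation set to 0), and all identifiers are assumptions made for illustration; the benchmark's evaluation scripts may treat edge cases differently.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def composite_similarity(pred, ref):
    """pred and ref map gene -> {sample -> value}; every gene covers all samples."""
    aj = jaccard(pred, ref)
    pred_samples = {s for row in pred.values() for s in row}
    ref_samples = {s for row in ref.values() for s in row}
    sj = jaccard(pred_samples, ref_samples)
    shared_samples = sorted(pred_samples & ref_samples)
    corrs = []
    for gene in sorted(set(pred) & set(ref)):
        x = [pred[gene][s] for s in shared_samples]
        y = [ref[gene][s] for s in shared_samples]
        corrs.append(pearson(x, y))
    corr_avg = mean(corrs) if corrs else 0.0
    return aj, sj, corr_avg, aj * sj * corr_avg


# Toy tables (hypothetical gene and sample identifiers).
ref = {
    "A": {"S1": 1.0, "S2": 2.0, "S3": 3.0, "S4": 4.0},
    "B": {"S1": 2.0, "S2": 4.0, "S3": 6.0, "S4": 8.0},
    "C": {"S1": 5.0, "S2": 5.0, "S3": 6.0, "S4": 7.0},
}
pred = {
    "A": {"S1": 1.0, "S2": 2.0, "S3": 3.0, "S5": 0.0},
    "B": {"S1": 1.0, "S2": 2.0, "S3": 3.0, "S5": 0.0},
    "D": {"S1": 9.0, "S2": 8.0, "S3": 7.0, "S5": 0.0},
}

aj, sj, corr_avg, csc = composite_similarity(pred, ref)
```

Here two of four gene names and three of five sample identifiers overlap, and both matched genes correlate perfectly, so the product is 0.5 × 0.6 × 1.0 = 0.3. Because the terms multiply, a perfect correlation cannot compensate for missing genes or samples.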

For architecture-agnostic, gene-centric evaluations, GenoTEX ("Does your model understand genes?") uses gene embeddings obtained from arbitrary encoders and assesses their informativeness via downstream logistic or linear models on hundreds of classification/regression tasks spanning genomic properties, regulatory functions, localization, biological processes, and protein properties (Kan-Tor et al., 2024).
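The idea of probing frozen embeddings with a simple predictive head can be sketched as follows. The example trains a logistic regression by batch gradient descent on fixed gene vectors and scores held-out genes with AUROC. Everything concrete here is an assumption for illustration: the two-dimensional vectors stand in for encoder outputs, the labels are invented, and the training loop is a bare-bones substitute for whatever solver the benchmark actually uses.

```python
import math


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights on frozen embeddings by gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Frozen embeddings for training genes (hypothetical values) and a binary gene property.
train_X = [[1.0, 0.2], [0.8, -0.1], [0.9, 0.3], [-1.0, 0.1], [-0.7, -0.2], [-0.9, 0.0]]
train_y = [1, 1, 1, 0, 0, 0]
# Held-out genes used only for evaluation.
test_X = [[0.6, 0.4], [0.5, -0.3], [-0.4, 0.2], [-0.8, -0.1]]
test_y = [1, 1, 0, 0]

w, b = train_logistic(train_X, train_y)
test_scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in test_X]
probe_auroc = auroc(test_y, test_scores)
```

Only the probe's weights are trained; the embeddings stay fixed, so differences in the resulting AUROC reflect the encoder rather than the head.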

4. Baselines, Notable Systems, and Comparative Performance

Multiple baselines are defined on GenoTEX for both agentic and representation learning scenarios:

Agentic Baselines: GenoAgent, a multi-role LLM agent ensemble with project management, code review, domain guidance, and flexible self-correction; Biomni (generalist biomedical agent); homogeneous LLM workflows (e.g. Claude Sonnet, OpenAI o3, Gemini 2.5 Pro); classical scripts (Lasso, mixed models); and human expert output (Liu et al., 28 Jul 2025, Liu et al., 2024).
Representation Models: Text LLMs (MTEB-L/S, MPNet), protein LMs (ESM-2), DNA models (DNABERT-2), scRNA-seq models (ScGPT, CellPLM, Geneformer), and baselines (bag-of-words, Gene2Vec) (Kan-Tor et al., 2024).

Key performance results for end-to-end GTA tasks (gene identification, data preprocessing) underline the benchmark's discriminative power:

System	DF Acc (%)	DS Acc (%)	Preproc CSC (%)	Gene F₁ (%)	AUROC	Code exec. (%)	Trait F₁
GenoMAS	84.62	74.27	89.13	60.48	0.81	98.78	71.34
GenoAgent	78.79	71.06	78.52	43.63	0.65	—	71.93
Human Exp.	—	—	—	—	—	—	—

Architecture-agnostic comparison (mean AUROC, ±std):

Model	Genomic	Regulatory	Localization	Biol. Proc	Prot. Prop
MTEB-L	0.75 ± .03	0.78 ± .04	0.76 ± .05	0.73 ± .06	0.82 ± .04
ScGPT-H	0.68 ± .04	0.71 ± .05	0.81 ± .03	0.79 ± .05	0.78 ± .06
ESM-2	0.73 ± .05	0.75 ± .06	0.74 ± .05	0.70 ± .07	0.89 ± .02
DNABERT-2	0.72 ± .06	0.70 ± .07	0.68 ± .08	0.66 ± .07	0.72 ± .05

These results illustrate distinct modality advantages: the text model (MTEB-L) and the protein language model (ESM-2) outperform the expression-based ScGPT-H on genomic and regulatory tasks, whereas ScGPT-H leads on localization and biological process tasks and ESM-2 leads on protein properties (Kan-Tor et al., 2024).
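The per-category leaders can be read off the table programmatically. The short sketch below transcribes the mean AUROC values reported above (standard deviations omitted) and picks the top model in each category; it only restates the table and adds no new results.

```python
# Mean AUROC per model and task category, transcribed from the table above.
mean_auroc = {
    "MTEB-L": {"Genomic": 0.75, "Regulatory": 0.78, "Localization": 0.76,
               "Biol. Proc": 0.73, "Prot. Prop": 0.82},
    "ScGPT-H": {"Genomic": 0.68, "Regulatory": 0.71, "Localization": 0.81,
                "Biol. Proc": 0.79, "Prot. Prop": 0.78},
    "ESM-2": {"Genomic": 0.73, "Regulatory": 0.75, "Localization": 0.74,
              "Biol. Proc": 0.70, "Prot. Prop": 0.89},
    "DNABERT-2": {"Genomic": 0.72, "Regulatory": 0.70, "Localization": 0.68,
                  "Biol. Proc": 0.66, "Prot. Prop": 0.72},
}

categories = ["Genomic", "Regulatory", "Localization", "Biol. Proc", "Prot. Prop"]

# For each category, select the model with the highest mean AUROC.
best_per_category = {
    cat: max(mean_auroc, key=lambda model: mean_auroc[model][cat])
    for cat in categories
}
```

Since the reported standard deviations are of similar size to some of the gaps between models, these leaders should be read as tendencies rather than strict rankings.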

5. Implementation Resources and Usage

GenoTEX is open source. The full evaluation pipeline, problem datasets, and annotation guidelines for the agentic, pipeline-based benchmark are available at https://github.com/Liu-Hy/GenoTEX, and the feature-based model evaluation is available at https://github.com/IBM/genoTEX-benchmark (Liu et al., 2024, Kan-Tor et al., 2024).

Agentic: All source code to configure, run, and evaluate multi-agent workflows on curated problems is included. Outputs are quantitatively compared against ground truth via provided scripts and summary notebooks.
Representation Learning: Installable Python package (pip install genotex), with standardized interfaces for model wrappers and plug-in task extension. Gene encoders are assessed using frozen representation vectors and consistent predictive heads, controlling for fine-tuning differences.

The resource is explicitly extensible—new tasks can be defined by appending gene-level sub-tasks, and model architectures can be benchmarked by subclassing the provided encoder template (Kan-Tor et al., 2024).
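The subclassing pattern might look roughly like the sketch below. The class and function names (`GeneEncoder`, `HashingEncoder`, `collect_embeddings`) are hypothetical and are not the package's actual API; they only illustrate the general shape of an abstract encoder template with a pluggable implementation. The hashing encoder is a deliberately uninformative stand-in for a real model.

```python
from abc import ABC, abstractmethod
import hashlib


class GeneEncoder(ABC):
    """Hypothetical stand-in for an encoder template (not the real package API)."""

    @abstractmethod
    def encode(self, gene_symbols):
        """Return one fixed-length vector (list of floats) per gene symbol."""


class HashingEncoder(GeneEncoder):
    """Toy encoder: derives a deterministic vector from a hash of the gene symbol."""

    def __init__(self, dim=8):
        self.dim = dim

    def encode(self, gene_symbols):
        vectors = []
        for symbol in gene_symbols:
            digest = hashlib.sha256(symbol.encode("utf-8")).digest()
            # Scale the first `dim` bytes of the digest into [0, 1].
            vectors.append([byte / 255.0 for byte in digest[: self.dim]])
        return vectors


def collect_embeddings(encoder, gene_symbols):
    """Run any GeneEncoder and return a gene -> frozen vector mapping."""
    vectors = encoder.encode(gene_symbols)
    if len(vectors) != len(gene_symbols):
        raise ValueError("encoder must return one vector per gene")
    return dict(zip(gene_symbols, vectors))


embeddings = collect_embeddings(HashingEncoder(dim=4), ["TP53", "BRCA1"])
```

Keeping the evaluation code dependent only on the abstract `encode` method is what lets encoders from different modalities be compared with the same predictive heads.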

6. Impact, Critical Challenges, and Future Directions

GenoTEX has established itself as a standard for evaluating the automation of gene expression workflows, emphasizing rigorous, biologically relevant, and code-driven assessments. By combining real-world heterogeneity (e.g., evolving gene nomenclature, high-dimensional sparsity, metadata noise) and demanding complete workflow execution, GenoTEX has shaped the development and credibility of agentic and multi-modal bioinformatics systems (Liu et al., 28 Jul 2025, Liu et al., 2024).

Notable open challenges include:

Instability in code review agent outputs and trait extraction logic across runs (Liu et al., 2024).
Remaining gaps in trait field inference—highlighting the need for tighter integration with biological ontologies and interactive debugging modes.
The challenge of latent confounding, subtle batch effects, and evolving data structure in large-scale clinical genomics.
Model evaluation is limited to “frozen” representations; future work will extend to fine-tuned models and non-linear predictors (e.g., GNNs), as well as cross-species generalization and expanded epigenome annotations (Kan-Tor et al., 2024).

As genomic datasets continue to proliferate and machine learning models become more sophisticated, GenoTEX’s comprehensive and biologically validated design will remain central to benchmarking robust, scientifically trustworthy automation in transcriptomic research.