MedGEN-Bench: AI Benchmark for Gene Embeddings
- MedGEN-Bench is a unified benchmarking suite that evaluates gene-level embeddings from diverse AI models using hundreds of curated prediction tasks.
- It employs an architecture-agnostic protocol with simple linear classifiers to ensure direct comparability, reproducibility, and practical insights across gene properties.
- The benchmark reveals modality-specific strengths and limitations, guiding model selection for genomic properties, regulatory functions, localization, and biological processes.
MedGEN-Bench is a comprehensive benchmarking suite designed to systematically assess how well diverse artificial intelligence models (LLMs, protein and DNA foundation models, single-cell transcriptomics models, and classical baselines) “understand” medically and functionally relevant properties of genes. It does so by evaluating their gene-level embeddings across hundreds of curated prediction tasks. Its architecture-agnostic protocol has been widely adopted in computational genomics and biomedical AI research due to its modularity, broad task coverage, and reproducibility (Kan-Tor et al., 5 Dec 2024).
1. Benchmarking Framework and Rationale
MedGEN-Bench employs a unified, architecture-agnostic approach for model comparison. Instead of direct fine-tuning or task-specific retraining, each model is prompted (or invoked) to generate per-gene embeddings using modality-appropriate strategies. These embeddings, regardless of their source—text, DNA, protein, or expression—are then used as fixed representations for all downstream tasks. Simple classifiers (logistic regression, one-vs-rest for multiclass, multiple binary heads for multilabel) or linear regressors are trained on these embeddings to predict gene properties derived from expert-curated bioinformatics databases. The central assumption is that a robust gene representation encodes core biological knowledge in a manner linearly accessible to such predictors, reflecting the internalization of ground-truth biology (Kan-Tor et al., 5 Dec 2024).
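A minimal sketch of this linear-probe protocol, assuming precomputed embeddings are already available as an array; the variable names and synthetic data below are placeholders, not the benchmark's actual code:

```python
# Frozen gene embeddings are the features; a shallow one-vs-rest logistic-regression
# head is the only trained component. Standardization happens inside the pipeline.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
gene_embeddings = rng.normal(size=(1000, 256))   # stand-in for precomputed per-gene embeddings
labels = rng.integers(0, 3, size=1000)           # stand-in for a multiclass gene property

probe = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
scores = cross_val_score(probe, gene_embeddings, labels, cv=5, scoring="roc_auc_ovr")
print(scores.mean())  # mean cross-validated AUC of the linear probe
```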
2. Task Structure, Ontology, and Sources
MedGEN-Bench’s task universe is derived from five major property families, each anchored in authoritative datasets:
| Family | #Core Tasks (+Binary Subtasks) | Example Properties | Canonical Sources |
|---|---|---|---|
| Genomic Properties | 7 (+79) | Methylation, dosage, chromosome | HGNC, Human Protein Atlas, Geneformer |
| Regulatory Functions | 6 | TF status, PPI count, network pos. | Human TF database, GenePT, HPA |
| Localization | 30 (+70) | Tissue clusters, subcellular loc. | HPA v23 |
| Biological Processes | 29 (+91) | Pathway, disease involvement | Reactome, UniProt, Open Targets, HPA |
| Protein Properties | 3 (+53) | PTMs, ligand binding, domains | UniProt |
Each task is defined as binary, multiclass, or multilabel classification, or as regression. Subtasks (e.g., pathway membership, subcellular localization) are extracted for fine-grained evaluation. All tasks use the intersection of gene symbols recognized by every encoder (≈19,000 protein-coding genes) and are filtered by label prevalence (≥1% coverage) (Kan-Tor et al., 5 Dec 2024).
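The gene and label filtering described above can be sketched as follows; the DataFrame layout (genes as rows, binary labels as columns) and the function name are assumptions for illustration, not the benchmark's actual schema:

```python
# Keep only genes recognized by every encoder and drop labels with <1% positive coverage.
import pandas as pd

def filter_task(labels: pd.DataFrame, shared_genes: set[str], min_prevalence: float = 0.01) -> pd.DataFrame:
    """labels: genes x binary-label matrix indexed by HGNC symbol."""
    # Restrict to the intersection of gene symbols supported by all encoders.
    labels = labels.loc[labels.index.intersection(shared_genes)]
    # Drop labels whose positive rate is below the prevalence threshold.
    prevalence = labels.mean(axis=0)
    return labels.loc[:, prevalence >= min_prevalence]
```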
3. Data Processing, Normalization, and Evaluation
All identifiers are standardized to HGNC gene symbols (via MyGeneInfo), with ambiguous or missing entries excluded. Embeddings are used without external normalization; internal standardization is performed automatically in the classifier pipeline (scikit-learn). Label parsing ensures each multi-label task can be broken into binary subtasks, making the suite highly granular.
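A hedged sketch of such identifier harmonization using the `mygene` client that backs MyGeneInfo; the exact query scopes, fields, and ambiguity handling used by the benchmark may differ:

```python
# Resolve mixed gene identifiers to official HGNC symbols via MyGeneInfo.
import mygene

mg = mygene.MyGeneInfo()
hits = mg.querymany(
    ["ENSG00000141510", "EGFR", "P53"],      # mixed identifiers, for illustration only
    scopes="ensembl.gene,symbol,alias",
    fields="symbol,summary",
    species="human",
)
# Map each query term to an official symbol; entries without a hit are skipped.
symbols = {h["query"]: h["symbol"] for h in hits if "symbol" in h}
```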
Training and evaluation use stratified 5-fold cross-validation for classification (to maintain label balance) and k-fold for regression (task-size dependent). There are no fixed train/validation/test splits; all reported results are average metrics across folds (Kan-Tor et al., 5 Dec 2024).
Evaluation metrics include:
- AUC-ROC (binary and multiclass classification)
- Macro-averaged F1, accuracy, precision, recall, and Hamming loss (multi-label)
- R², RMSE, MAE (regression)
This enables direct metric comparability across tasks and model classes.
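Under these stated choices, the evaluation loop can be approximated with scikit-learn's cross-validation utilities; the data below are synthetic stand-ins and the exact scorer selection is illustrative:

```python
# Stratified 5-fold CV for classification metrics; plain k-fold for regression metrics.
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold, KFold
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))            # stand-in embeddings
y_cls = rng.integers(0, 2, size=500)       # stand-in binary task
y_reg = rng.normal(size=500)               # stand-in regression task

cls_scores = cross_validate(
    LogisticRegression(max_iter=1000), X, y_cls,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring=["roc_auc", "f1_macro", "accuracy", "precision_macro", "recall_macro"],
)
reg_scores = cross_validate(
    LinearRegression(), X, y_reg,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring=["r2", "neg_root_mean_squared_error", "neg_mean_absolute_error"],
)
# Reported numbers are fold averages, e.g. cls_scores["test_roc_auc"].mean()
```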
4. Model Families and Embedding Protocols
MedGEN-Bench supports a diverse array of pretrained embedding sources:
- Text Models: MTEB-L (SFR-Embedding-Mistral, 7.1B params, 4096-dim), MTEB-S (mxbai-embed-large-v1, 109M, 1024-dim), MPNet (420M, 768-dim), Bag-of-Words (TF–IDF, 1024-dim).
  - Genes are presented as: “Gene symbol <SYMBOL> full name <FULL NAME> summary <GENE SUMMARY>.”
- Single-Cell Models: CellPLM (85M, 1024-dim), Geneformer (10.3M, 256-dim), ScGPT (51M/39M, 512-dim), gene2vec (5M, 200-dim).
  - Gene embeddings are taken from model vocabularies or pretrained weights.
- DNA Models: DNABERT-2 (117M, 768-dim).
  - Embeddings are produced from the full genomic sequence, tokenized and aggregated.
- Protein Language Models: ESM-2 (3B, 2560-dim).
  - All canonical protein isoforms are encoded and mean-aggregated.
- Classical Baselines: CountVectorizer/TF–IDF on gene summaries.
All embeddings are “frozen”; only shallow predictive models are learned per task, enforcing comparability and interpretability (Kan-Tor et al., 5 Dec 2024).
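A sketch of the text-modality strategy described above, using the gene-to-text template and a sentence-transformers encoder; the checkpoint name and gene record shown are illustrative stand-ins, not necessarily those used by MedGEN-Bench:

```python
# Render each gene as a short description and encode it with a frozen text model.
from sentence_transformers import SentenceTransformer

genes = [
    {"symbol": "TP53", "name": "tumor protein p53", "summary": "Acts as a tumor suppressor..."},
]
texts = [
    f"Gene symbol {g['symbol']} full name {g['name']} summary {g['summary']}" for g in genes
]

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # stand-in text encoder
gene_embeddings = model.encode(texts)  # one frozen vector per gene, used as-is downstream
```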
5. Principal Experimental Findings
Family-level performance is modality-dependent:
- Genomic Properties & Regulatory Functions: LLMs (MTEB-L/S, MPNet) and protein LMs (ESM-2) lead (AUC ≈ 0.85–0.90), typically outperforming expression-based models by 5–10 AUC points.
- Localization & Biological Processes: scRNAseq-derived models (ScGPT-H, CellPLM) outperform text/protein models (AUC ≈ 0.80–0.85), a reversal of the global trend.
- Protein Properties: ESM-2 achieves best performance (AUC ≈ 0.92); DNA-based models slightly lag; expression models fall behind (AUC ≈ 0.60–0.70).
Within each task family, exceptions to overall trends arise—for instance, text models can outperform others on certain pathways, and expression models excel in subcellular localization (Kan-Tor et al., 5 Dec 2024).
Model similarity analysis (via cosine similarity of per-task AUCs) reveals strong clustering by training modality, highlighting architectural effects on representational content. The choice of MLPs versus logistic regression as downstream predictors has minimal effect on relative outcomes, with per-task results highly correlated, validating the framework’s use of linear classifiers.
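The similarity analysis can be sketched as follows, assuming a models × tasks matrix of AUC scores; the task names and values below are made-up placeholders:

```python
# Treat each model's vector of per-task AUCs as a profile and compare profiles
# with cosine similarity; clustering the result groups models by training modality.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

auc_table = pd.DataFrame(
    {"task_a": [0.90, 0.72, 0.88], "task_b": [0.85, 0.70, 0.83], "task_c": [0.65, 0.81, 0.67]},
    index=["MTEB-L", "ScGPT", "ESM-2"],   # illustrative model rows
)
similarity = pd.DataFrame(
    cosine_similarity(auc_table.values),
    index=auc_table.index, columns=auc_table.index,
)
print(similarity.round(3))
```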
6. Codebase, Usage, and Reproducibility
MedGEN-Bench is maintained at https://github.com/BiomedSciAI/gene-benchmark under the Apache 2.0 license. The repository includes:
- Scripts for automatic data download (Reactome, HPA, UniProt, etc.) and cleaning.
- Tools to map symbols/IDs to gene summaries (MyGeneInfo-backed).
- Wrappers for sentence-transformers, scRNA models, and CSV import of embeddings.
- Configurable benchmarking pipeline (YAML/CLI) for embedding, model fitting, and reporting.
- Notebooks for figure generation and downstream analysis.
The reproducibility protocol (Python 3.9+) consists of data download, optional precomputation of embeddings for the scRNA models, a core benchmarking run driven by model/task configuration, and post-hoc analysis with the provided notebooks (Kan-Tor et al., 5 Dec 2024).
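As a conceptual outline only (this is not the repository's actual CLI, API, or file layout), the embed → fit → report flow might be wired together as below; in practice the embeddings and task labels would come from the repo's download scripts and CSV-import tooling, whereas here they are synthetic stand-ins so the sketch runs as-is:

```python
# Loop over tasks, fit a linear probe on frozen embeddings, and collect a report table.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
genes = [f"GENE{i}" for i in range(300)]
embeddings = pd.DataFrame(rng.normal(size=(300, 64)), index=genes)                  # genes x dims
tasks = {"is_transcription_factor": pd.Series(rng.integers(0, 2, 300), index=genes)}  # hypothetical task

report = {}
for task_name, y in tasks.items():
    shared = embeddings.index.intersection(y.index)   # genes covered by both tables
    report[task_name] = cross_val_score(
        LogisticRegression(max_iter=1000),
        embeddings.loc[shared], y.loc[shared],
        cv=5, scoring="roc_auc",
    ).mean()
print(pd.Series(report, name="mean_cv_auc"))
```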
7. Broader Impact and Future Prospects
MedGEN-Bench enables systematic, quantitative comparison of foundation models’ ability to represent gene biology. Its modularity highlights which biological properties are encoded in various modalities, informing model selection, design, and integration for both basic research and clinical bioinformatics. The benchmark’s structure can be naturally extended to new model classes (e.g., multimodal embeddings) and to additional gene property sets as databases evolve. A plausible implication is the routine use of MedGEN-Bench for foundational model evaluation in gene-centric workflows, especially in biomarker discovery, variant interpretation, and therapeutic development (Kan-Tor et al., 5 Dec 2024).
The reported results expose both the strengths and the modality-specific “blind spots” of leading model types, emphasizing the continued need for architectural innovation and curated benchmark expansion.