GenoMAS: Agentic Genomic Analysis Framework

Updated 3 July 2026

GenoMAS is a dual framework for high-throughput gene expression and GWAS analysis that integrates agentic automation with rigorous statistical modeling.
It employs a heterogeneous ensemble of specialized agents to preprocess, normalize, and analyze large-scale transcriptomic and genetic datasets.
The system supports disease mechanism inference and therapeutic repurposing through dynamic workflow resolution and network-based analysis.

GenoMAS refers to two distinct but conceptually related frameworks for high-throughput genomic data analysis: (1) a multi-agent LLM-based system for automated gene expression analysis and scientific discovery, and (2) a statistical and computational toolkit for gene- and network-oriented analysis of genome-wide association studies (GWAS). Both frameworks target the integration, normalization, and downstream interpretation of large-scale transcriptomic or genetic datasets, with particular emphasis on interpretable modeling, automation, and disease mechanism inference (Liu et al., 28 Jul 2025, Pedroso, 2013, Chen et al., 6 Aug 2025).

1. Agentic Multi-Agent Architecture for Gene Expression Analysis

GenoMAS, as described in recent work (Liu et al., 28 Jul 2025), implements a heterogeneous ensemble of six LLM agents, each specialized in distinct analytic or advisory tasks. The architecture is organized as follows (the GEO and TCGA Data Engineer agents are listed together):

PI Agent (Orchestration): Oversees workflow, cohort identification, parallelization, and completion tracking.
Data Engineer Agents (Programming): GEO and TCGA agents for domain-specific data preprocessing including loading raw files (CEL, TXT, FASTQ), mapping probes to genes, normalization of identifiers, and extraction of clinical metadata.
Statistician Agent: Handles downstream statistical inference, including batch-effect correction and Lasso or linear mixed model regression for gene–trait association analyses (GTA).
Code Reviewer Agent: Isolates and validates code, diagnosing failures and providing approval or diagnostic feedback.
Domain Expert Agent: Offers domain-specific biomedical guidance by generating executable code, especially for complex normalization and variable mapping.

These agents communicate via a typed message-passing protocol: each message includes sender identity, message type, content (code, error traces, data summaries), and target recipient(s). A central message queue enforces partial ordering to avoid dependency cycles. Cognitive diversity is achieved by using multiple LLM backends (Claude Sonnet 4, OpenAI o3, Gemini 2.5 Pro), which improves overall analytic robustness and efficiency.
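
The message structure described above can be sketched as follows. This is a minimal illustration, not the GenoMAS implementation: the names `MessageType`, `Message`, `MessageQueue`, and the agent identifiers are hypothetical, and per-recipient FIFO delivery stands in for the partial ordering that the text attributes to the central queue.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from enum import Enum
from typing import DefaultDict, Deque, Optional, Tuple


class MessageType(Enum):
    # Hypothetical categories; the text only says that messages are typed.
    CODE = "code"
    ERROR_TRACE = "error_trace"
    DATA_SUMMARY = "data_summary"
    REVIEW = "review"


@dataclass(frozen=True)
class Message:
    sender: str                   # identity of the sending agent
    msg_type: MessageType
    content: str                  # code, error trace, or data summary
    recipients: Tuple[str, ...]   # one or more target agents


class MessageQueue:
    """Central queue that keeps per-recipient FIFO order (a simplification)."""

    def __init__(self) -> None:
        self._inboxes: DefaultDict[str, Deque[Message]] = defaultdict(deque)

    def send(self, message: Message) -> None:
        # Place the message in every recipient's inbox.
        for recipient in message.recipients:
            self._inboxes[recipient].append(message)

    def receive(self, agent: str) -> Optional[Message]:
        # Return the oldest undelivered message for `agent`, or None.
        inbox = self._inboxes[agent]
        return inbox.popleft() if inbox else None


queue = MessageQueue()
queue.send(Message("geo_agent", MessageType.CODE, "print('load')", ("code_reviewer",)))
queue.send(Message("code_reviewer", MessageType.REVIEW, "approved", ("geo_agent", "pi_agent")))

first = queue.receive("code_reviewer")
print(first.sender, first.msg_type.value)
```

Making the message immutable (`frozen=True`) lets the same object be placed in several inboxes without one recipient altering what another sees.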

2. Guided-Planning and Dynamic Workflow Resolution

Central to the GenoMAS agentic pipeline is a "guided planning" mechanism that departs from static bioinformatics workflows. Analytical sub-tasks, termed "Action Units," are derived from high-level templates expressed as directed acyclic graphs encoding mandatory, optional, and conditional steps. During execution, agents continuously evaluate:

Success/failure states of prior Action Units
Data-specific idiosyncrasies (e.g., missing or ambiguous entries)
Remaining analytic objectives

At every analytic juncture, agents may (1) advance, (2) revise, (3) bypass, or (4) backtrack the workflow, ensuring both logical coherence and resilience to data heterogeneity. The backtracking mechanism is critical for recovering from cascading errors (e.g., a failed normalization propagating downstream) and keeps the pipeline adaptable to complex real-world datasets (Liu et al., 28 Jul 2025).
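
The four decisions can be illustrated with a small planner loop. This is a sketch under simplifying assumptions, not the published planner: Action Units are held in a linear order rather than a full directed acyclic graph, the decision rule is a fixed heuristic rather than an LLM judgment, and `ActionUnit`, `execute_plan`, `max_revisions`, and `max_backtracks` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ActionUnit:
    name: str
    run: Callable[[Dict], bool]   # returns True on success
    optional: bool = False
    max_revisions: int = 1        # retries allowed before giving up on this unit


def execute_plan(units: List[ActionUnit], state: Dict, max_backtracks: int = 1) -> List[str]:
    """Run units in order and log each advance/revise/bypass/backtrack decision."""
    log: List[str] = []
    attempts = {unit.name: 0 for unit in units}
    backtracks = 0
    i = 0
    while i < len(units):
        unit = units[i]
        attempts[unit.name] += 1
        if unit.run(state):
            log.append(f"advance:{unit.name}")
            i += 1
        elif attempts[unit.name] <= unit.max_revisions:
            log.append(f"revise:{unit.name}")      # retry the same unit
        elif unit.optional:
            log.append(f"bypass:{unit.name}")      # skip an optional step
            i += 1
        elif i > 0 and backtracks < max_backtracks:
            log.append(f"backtrack:{unit.name}")   # redo the previous unit
            attempts[unit.name] = 0
            backtracks += 1
            i -= 1
        else:
            raise RuntimeError(f"mandatory unit failed: {unit.name}")
    return log


def load(state: Dict) -> bool:
    # Toy behaviour: the second load picks a usable annotation file.
    state["loads"] = state.get("loads", 0) + 1
    state["annotation"] = "v1" if state["loads"] == 1 else "v2"
    return True


def map_probes(state: Dict) -> bool:
    return state["annotation"] == "v2"


def batch_correct(state: Dict) -> bool:
    return False  # e.g. no batch labels are available


units = [
    ActionUnit("load", load),
    ActionUnit("map_probes", map_probes),
    ActionUnit("batch_correct", batch_correct, optional=True),
]
log = execute_plan(units, {})
print(log)
```

The backtrack budget bounds the loop: without it, a unit that can never succeed would bounce between itself and its predecessor indefinitely.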

3. Automated Pipeline for Code-Driven Gene Expression and Statistical Modeling

The GenoMAS analytic pipeline is modularized into three principal stages:

Dataset Selection: Application of metadata filters to identify disease/cohort subsets for transcriptomic analysis.
Data Preprocessing: Loading of multiple data types (microarray, RNA-seq); gene/metadata normalization; batch-effect correction (ComBat); imputation of missing data; log transformation and quantile normalization. The resulting expression matrix $X \in \mathbb{R}^{n \times p}$ and clinical covariate matrix $C \in \mathbb{R}^{n \times k}$ are then available for downstream modeling.
Statistical Association: Lasso regression (or linear mixed-effects models) regressing the phenotype vector $y$ on expression $X$, with explicit covariate control. Significant associations are identified by nonzero coefficients and by p-values below a chosen threshold.
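
The Lasso step can be illustrated with a small coordinate-descent solver written against the standard library only. This is a sketch, not the GenoMAS code: it assumes centered columns (no intercept), treats "explicit covariate control" as leaving covariate coefficients unpenalized (an assumption, since the text does not say how control is implemented), and the names `lasso_coordinate_descent` and `penalized` are hypothetical. It minimizes $\frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \lambda \sum_{j \in \text{penalized}} |\beta_j|$.

```python
from typing import List, Optional, Sequence


def soft_threshold(value: float, penalty: float) -> float:
    """Shrink `value` toward zero by `penalty`; exact zero inside the band."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    penalty: float,
    penalized: Optional[Sequence[bool]] = None,
    n_iter: int = 200,
) -> List[float]:
    n, p = len(X), len(X[0])
    if penalized is None:
        penalized = [True] * p
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(v * v for v in column) / n
            if scale == 0.0:
                continue  # constant-zero column carries no information
            # Correlation of column j with the partial residual (beta_j removed).
            rho = sum(v * (r + v * beta[j]) for v, r in zip(column, residual)) / n
            updated = soft_threshold(rho, penalty if penalized[j] else 0.0) / scale
            delta = updated - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(column, residual)]
                beta[j] = updated
    return beta


# Toy data: two "genes" and one covariate with mutually orthogonal columns.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.55, 1.45, -2.45, -1.55]  # 2.0*gene1 + 0.05*gene2 + 0.5*covariate
beta = lasso_coordinate_descent(X, y, penalty=0.1, penalized=[True, True, False])
print(beta)
```

In this toy design the weak gene effect (0.05) falls below the penalty and is set exactly to zero, the strong gene effect is shrunk by the penalty, and the unpenalized covariate keeps its full coefficient.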

This pipeline is evaluated on the GenoTEX benchmark across multiple metrics: data preprocessing quality (Composite Similarity Correlation, CSC), gene identification (precision, recall, F1), and additional calibration indices (Liu et al., 28 Jul 2025).
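
For the gene identification metrics, precision, recall, and F1 follow their standard set-based definitions. The sketch below assumes predicted and reference genes are compared as sets of identifiers; the function name and the gene labels are placeholders, and CSC is not shown because its definition is not given here.

```python
from typing import Iterable, Tuple


def gene_identification_scores(
    predicted: Iterable[str], reference: Iterable[str]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a predicted gene set."""
    predicted_set, reference_set = set(predicted), set(reference)
    true_positives = len(predicted_set & reference_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder gene labels, not benchmark data.
precision, recall, f1 = gene_identification_scores(
    predicted=["G1", "G2", "G4", "G5"],
    reference=["G1", "G2", "G3"],
)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```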

4. Quantitative Performance and Validation

On the GenoTEX benchmark, GenoMAS achieves:

Data Preprocessing: CSC of 89.13% for gene expression preprocessing (GenoAgent 78.52%, Biomni 33.91%).
Gene Identification: F1 = 60.48% (compared to GenoAgent 43.63% and human expert F1 = 71.63%); AUROC = 0.81 (expert: 0.90).
Clinical Metadata Extraction: Substantially lower CSC (32.61%), highlighting difficulties with semi-structured clinical text.

Robustness is supported by large absolute gains over the prior GenoAgent framework (+10.61 percentage points in CSC, +16.85 percentage points in F1) and a task completion success rate of 98.78%. No formal statistical significance testing is reported; however, extensive empirical benchmarking and biological plausibility validation (e.g., associations involving VDR, SOCS1, SLC11A1, with appropriate confounder adjustment) support the approach (Liu et al., 28 Jul 2025).

5. Disease Similarity and Therapeutic Repurposing via Agentic Transcriptomic Networks

An extended application of GenoMAS leverages transcriptomic signatures to construct disease–disease similarity networks at both gene and pathway levels (Chen et al., 6 Aug 2025). Key steps include:

Extraction of transcriptomic signatures via Lasso regression (selecting genes with |β| > 0.05) for each disease–condition pair.
Statistical testing of gene signature overlaps via bidirectional hypergeometric tests, with Benjamini–Hochberg correction to control the false discovery rate (FDR ≤ 0.05).
Pathway-level similarity scoring using joint enrichment analysis (GO:BP, Reactome, KEGG, TF/miRNA/HPO targets), where the aggregated similarity between two disease–condition pairs $i$ and $j$ is $S_{ij} = \sum_{k \in \text{shared}} [\log(1 - p_{i,k}) + \log(1 - p_{j,k})]$, summed over the shared enriched terms $k$ with enrichment p-values $p_{i,k}$ and $p_{j,k}$.
Network construction with nodes corresponding to disease–condition pairs and edge weights reflecting statistical overlap or pathway similarity.
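
The overlap test and FDR step can be sketched with the standard library. This is an illustration under stated assumptions, not the published pipeline: it runs a single one-sided hypergeometric test per pair (the bidirectional variant is not spelled out here), and the signatures, the universe of 20 genes, and the function names are hypothetical toy inputs.

```python
from itertools import combinations
from math import comb
from typing import Dict, List, Set, Tuple


def hypergeom_sf(overlap: int, universe: int, size_a: int, size_b: int) -> float:
    """P(X >= overlap) when size_b genes are drawn from `universe`, size_a of them marked."""
    upper = min(size_a, size_b)
    favourable = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / comb(universe, size_b)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def similarity_edges(
    signatures: Dict[str, Set[str]], universe: int, fdr: float = 0.05
) -> List[Tuple[str, str]]:
    pairs = list(combinations(sorted(signatures), 2))
    p_values = [
        hypergeom_sf(len(signatures[a] & signatures[b]), universe,
                     len(signatures[a]), len(signatures[b]))
        for a, b in pairs
    ]
    q_values = benjamini_hochberg(p_values)
    return [pair for pair, q in zip(pairs, q_values) if q <= fdr]


# Toy signatures over a universe of 20 genes (g0..g19).
signatures = {
    "disease_A": {"g0", "g1", "g2", "g3", "g4"},
    "disease_B": {"g0", "g1", "g2", "g3", "g10"},
    "disease_C": {"g15", "g16", "g17", "g18", "g19"},
}
edges = similarity_edges(signatures, universe=20)
print(edges)
```

Only the pair sharing four of five genes survives the FDR threshold; the disjoint signatures receive p-values of 1 and produce no edge.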

The resultant network reveals non-obvious comorbidities and candidate links for therapeutic repurposing. For example, strong transcriptional connections between Autism Spectrum Disorder and Osteoporosis or Type 1 Diabetes highlight shared mechanistic pathways, supporting rational drug repurposing using pathway and shared gene overlap scores (Chen et al., 6 Aug 2025).

6. Comparative Approach for GWAS–Pathway and Network Analysis

The earlier GenoMAS (Gene-set and Network-oriented Mapping of Association Signals) framework targets GWAS data, elevating analysis from SNP-level signals to genes, gene sets, and molecular interaction networks (Pedroso, 2013). Methodological contributions include:

Multiple gene-level statistics:
- Sidak-corrected minimum p-values
- Fisher-style methods for correlated SNPs (explicit variance formula using pairwise LD)
- Fixed and random-effects z-scores
- Empirical p-values via multivariate normal sampling for LD-aware aggregation (VEGAS-like)
Self-contained and competitive gene-set tests, supporting GMT-formatted pathway collections and protein–protein interaction network modules.
Integrated pipelines for gene mapping, annotation, efficient matrix operations (Perl/PDL), and meta-analysis.
Benchmarking on psychiatric and immune GWAS datasets demonstrates substantial gains in locus recovery and biological interpretability over SNP-level methods.

GenoMAS’s gene- and network-based methodology has uncovered both established and emergent disease pathways in applications to Crohn’s disease, bipolar disorder, and rare disease subtypes (Pedroso, 2013).
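
Two of the gene-level statistics listed above have simple closed forms when SNPs are treated as independent. The sketch below shows only that independent case: the LD-aware variance correction and the multivariate normal sampling described in the text are not reproduced, and the function names and p-values are illustrative. The Sidak form is $p_{\text{gene}} = 1 - (1 - p_{\min})^{k}$ for $k$ SNPs, and Fisher's statistic $-2 \sum_{i} \ln p_i$ is compared with a chi-square distribution with $2k$ degrees of freedom.

```python
from math import exp, log
from typing import Sequence


def sidak_min_p(p_values: Sequence[float]) -> float:
    """Sidak-corrected minimum p-value across the SNPs mapped to one gene."""
    k = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** k


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Fisher's combined p-value, assuming independent SNP p-values."""
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # half of -2 * sum(ln p)
    # Chi-square survival function with 2k degrees of freedom (closed form):
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


snp_p_values = [0.01, 0.20, 0.50]  # illustrative values, not real data
gene_sidak = sidak_min_p(snp_p_values)
gene_fisher = fisher_combined_p(snp_p_values)
print(round(gene_sidak, 6), round(gene_fisher, 6))
```

With correlated SNPs the independence assumption makes both statistics miscalibrated, which is the motivation for the LD-aware variants named in the list.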

7. Limitations, Bottlenecks, and Future Directions

Recognized limitations of current GenoMAS systems include:

Bottlenecks in clinical metadata extraction from heterogeneous or free text fields.
Auditability and model interpretability: although the generated code is human-readable, versioned tracking of how agent code evolves would further support transparency.
Potential scalability challenges for multi-modal (beyond transcriptomics) integration and advanced planning algorithms.
Limited human-in-the-loop feedback and planning sophistication; richer feedback and advanced planners (e.g., search-based dynamic planning) are under ongoing development.
For GWAS applications, limitations may arise from LD reference panel representativeness, software dependencies, and suitability for extremely polygenic or rare variant architectures.

Future work targets expansion to multimodal omics, more granular provenance capture, and more sophisticated planning and feedback mechanisms (Liu et al., 28 Jul 2025, Pedroso, 2013).

GenoMAS exemplifies the convergence of agentic automation, flexible workflow orchestration, and rigorous statistical methodology for scalable, interpretable, and biologically relevant genomic analyses across transcriptomics and GWAS modalities. These advances support both mechanistic hypothesis generation and clinically actionable insights, leveraging reproducible pipelines and network-level representations of disease biology.