Mamba-SSM with LLM Reasoning for Feature Selection: Faithfulness-Aware Biomarker Discovery

Published 15 Apr 2026 in q-bio.QM and cs.AI | (2604.14334v2)

Abstract: Gradient saliency from deep sequence models surfaces candidate biomarkers efficiently, but the resulting gene lists can be contaminated by tissue-composition confounders that degrade downstream classifiers. We study whether LLM chain-of-thought (CoT) reasoning can filter these confounders, and whether reasoning quality is associated with downstream performance. We train a Mamba SSM on TCGA-BRCA RNA-seq and extract the top-50 genes by gradient saliency; DeepSeek-R1 evaluates every candidate with structured CoT to produce a final 17-gene set. On the held-out test split, the raw 50-gene saliency set (no LLM) performs worse than a 5,000-gene variance baseline (AUC 0.832 vs. 0.903), while the LLM-filtered set surpasses it (AUC 0.927), using 294x fewer features. A faithfulness audit (COSMIC CGC, OncoKB, PAM50) shows that 6 of 17 selected genes (35.3%) are validated BRCA biomarkers, while 10 of 16 known BRCA genes present in the input were missed - including FOXA1. This divergence between downstream performance and reasoning faithfulness suggests selective faithfulness in this setting: targeted confounder removal can improve predictive performance without comprehensive recall.

Authors (2)

Summary

The paper demonstrates that integrating Mamba SSM with LLM chain-of-thought reasoning refines biomarker selection by filtering non-causal genes.
The LLM-filtered 17-gene set achieves a higher AUC than the 5,000-gene variance baseline (0.927 vs. 0.903) with 294x fewer features, emphasizing precision over exhaustive recall.
The study highlights challenges in retaining validated markers and calls for improved prompt engineering to better align LLM reasoning with biological causality.

Neuro-Symbolic Feature Selection for Biomarker Discovery in TCGA-BRCA

Problem Setting and Motivation

Feature selection in high-dimensional RNA-seq (>20,000 genes per sample) is particularly difficult in cancer genomics because of widespread confounding from tissue composition, immune infiltration, and batch effects. Saliency-based approaches built on neural models such as Mamba SSM are routinely used to identify candidate biomarkers, but their outputs frequently include non-causal genes that merely correlate with tumor phenotype. The paper asks whether LLM-based chain-of-thought (CoT) reasoning can improve the faithfulness and specificity of feature selection for clinical genomics. The central hypothesis is that a neuro-symbolic pipeline, combining Mamba SSM gradient saliency with structured LLM reasoning, can not only improve downstream classification metrics but also selectively filter confounders while preserving biologically validated drivers.

Methodology Overview

The system comprises four sequential phases: (1) preprocessing TCGA-BRCA RNA-seq data (1,095 tumors, 113 normals, ~20,000 genes); (2) extracting 50 candidate genes via gradient saliency from a trained Mamba SSM; (3) structured CoT reasoning over these candidates with DeepSeek-R1 (7B) using explicit keep/reject criteria; (4) faithfulness benchmarking against ground-truth databases (COSMIC CGC, OncoKB, PAM50).
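Phase (2) can be illustrated with a minimal sketch of gradient-saliency ranking. This is not the paper's implementation: it assumes a logistic-regression stand-in for the Mamba SSM (so the input gradient has a closed form) and assumes that per-gene scores are the mean absolute gradient over samples, an aggregation the summary does not specify. The function names, gene names, weights, and samples are all hypothetical toy values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def saliency_scores(samples, weights, bias):
    """Mean |dp/dx_j| over samples, for the stand-in model p = sigmoid(w.x + b)."""
    totals = [0.0] * len(weights)
    for x in samples:
        p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
        scale = p * (1.0 - p)  # derivative of the sigmoid at this sample's logit
        for j, w in enumerate(weights):
            totals[j] += abs(scale * w)  # gradient of p with respect to feature j
    return [t / len(samples) for t in totals]


def top_k_genes(gene_names, scores, k):
    """Return the k gene names with the largest saliency scores."""
    ranked = sorted(zip(gene_names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy inputs (hypothetical): 4 placeholder genes, 2 samples.
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
weights = [0.1, -2.0, 0.5, 1.5]
samples = [[1.0, 0.2, 0.3, 0.0], [0.5, 1.0, 0.1, 0.4]]

scores = saliency_scores(samples, weights, bias=0.0)
selected = top_k_genes(genes, scores, k=2)
print(selected)
```

For this linear stand-in the ranking reduces to ordering genes by |w_j|. In a deep sequence model the gradient differs per sample and would be obtained by automatic differentiation, with k = 50 as in the pipeline described above.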

For downstream evaluation, three pipelines are compared:

Variance Baseline: top-5,000 high-variance genes (B1)
Saliency Only: top-50 Mamba saliency genes (B2)
LLM-Filtered Selection: 17-gene set from DeepSeek-R1 reasoning (B3)

For each gene set, a classifier with the identical Mamba architecture is retrained and evaluated on the held-out test set (AUC, Accuracy, F1).
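The three evaluation metrics can be sketched in plain Python as follows. This is an illustrative sketch, not the paper's evaluation code: it assumes binary labels (1 for the positive class), computes AUC as the fraction of positive/negative pairs ranked correctly (ties count half), and assumes a 0.5 decision threshold for Accuracy and F1. The labels and scores are toy values.

```python
def auc_roc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_and_f1(labels, scores, threshold=0.5):
    """Accuracy and F1 after thresholding scores (threshold is an assumption)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return correct / len(labels), f1


# Toy predictions (hypothetical values, not from the paper).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.1]

auc = auc_roc(labels, scores)
acc, f1 = accuracy_and_f1(labels, scores)
print(auc, acc, f1)
```

With 1,095 tumors and 113 normals the classes are heavily imbalanced, which is one reason a threshold-free metric such as AUC is informative alongside Accuracy.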

Numerical Performance and Empirical Findings

The LLM-filtered gene set (B3) uses 294x fewer features than the variance baseline (17 vs. 5,000 genes), yet achieves the best downstream performance of the three pipelines:

AUC-ROC: B1 (0.903), B2 (0.832), B3 (0.927)
Accuracy: B1 (0.8785), B2 (0.7247), B3 (0.8907)

Saliency-only selection (B2) substantially underperforms the variance baseline (AUC -0.071 relative to B1), consistent with contamination of the saliency list by confounders. Conversely, neuro-symbolic refinement (B3) improves on the baseline by AUC +0.024, supporting the hypothesis that LLM-based filtering is empirically valuable in this setting.
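The deltas and the reduction factor quoted above follow directly from the reported numbers. The short check below uses only the AUC values and feature counts given in this summary; the dictionary names are illustrative.

```python
# Reported test-set AUC and feature counts for the three pipelines.
auc = {"B1": 0.903, "B2": 0.832, "B3": 0.927}
n_features = {"B1": 5000, "B2": 50, "B3": 17}

# AUC differences relative to the variance baseline (B1).
delta_b2 = round(auc["B2"] - auc["B1"], 3)
delta_b3 = round(auc["B3"] - auc["B1"], 3)

# Feature reduction of the LLM-filtered set relative to the baseline.
reduction = n_features["B1"] / n_features["B3"]

print(delta_b2, delta_b3, int(reduction))
```

The 294x figure is 5,000 / 17 rounded down; relative to the 50-gene saliency set, the LLM filter removes 33 of 50 candidates.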

Faithfulness Audit and Biologically Grounded Analysis

The 17-gene LLM-selected set is cross-referenced against a 101-gene ground truth spanning curated cancer databases. Key findings:

Recall on validated BRCA genes in candidate pool: 0.375 (6/16)
Proportion of selected genes with BRCA evidence: 35.3% (6/17)
False positives (known non-BRCA genes selected): 17.6%
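The audit metrics above can be expressed as set operations. The sketch below is illustrative only: the function name and arguments are hypothetical, the gene identifiers are placeholders rather than real genes, and the sets are constructed solely to reproduce the reported counts. The count of 3 false positives is inferred from 17.6% of 17 and is not stated explicitly in this summary; treating the known non-BRCA genes as a separate set from the 101-gene ground truth is likewise an assumption.

```python
def audit(selected, candidates, validated, known_negative):
    """Recall, precision, and false-positive share of an LLM-selected gene set."""
    selected = set(selected)
    pool_validated = set(candidates) & set(validated)  # validated genes in the pool
    hits = selected & pool_validated
    false_pos = selected & set(known_negative)
    return {
        "recall": len(hits) / len(pool_validated) if pool_validated else 0.0,
        "precision": len(hits) / len(selected) if selected else 0.0,
        "false_positive_share": len(false_pos) / len(selected) if selected else 0.0,
    }


# Placeholder identifiers (NOT real genes), sized to match the reported counts.
candidates = [f"g{i}" for i in range(50)]            # 50 saliency candidates
validated = [f"g{i}" for i in range(16)]             # 16 validated genes in the pool
validated += [f"v{i}" for i in range(85)]            # rest of the 101-gene ground truth
selected = [f"g{i}" for i in range(6)]               # 6 validated genes kept
selected += [f"g{i}" for i in range(16, 27)]         # 11 other genes kept (17 total)
known_negative = ["g16", "g17", "g18"]               # inferred count of 3

result = audit(selected, candidates, validated, known_negative)
print({name: round(value, 3) for name, value in result.items()})
```

Recall is computed against validated genes present in the candidate pool (16), not the full ground truth (101), since the LLM can only keep or reject genes that saliency surfaced.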

Despite missing 62.5% (10/16) of the validated BRCA genes in the candidate pool, including FOXA1, a canonical PAM50 luminal marker, the LLM-filtered set delivers the highest classifier AUC. This supports the paper's key empirical claim for this setting: precision-oriented confounder removal can improve predictive performance without exhaustive recall of disease drivers.

Failure modes include confident rejection of established markers when they are ranked lower by saliency, and retention of non-BRCA genes without evidence, often justified in the CoT by generic or hallucinated pathway language. Importantly, the results indicate selective faithfulness rather than comprehensive biological fidelity, which cautions against equating metric improvement with reasoning correctness.

Practical and Theoretical Implications

This study suggests that neuro-symbolic pipelines are well-suited for high-dimensional omics, where abundant confounding noise makes classical feature ranking insufficient. Structured LLM reasoning can impose precision-oriented filtering, but current LLMs remain susceptible to both omission of validated targets and inclusion of biologically dubious features when prompt constraints are not optimally designed. The divergence between task-level metric improvement and reasoning faithfulness (as measured by biomarker recall) is significant for downstream translational validity.

While the empirical gains are clear on this dataset and test split, generalizability to other cancer types or disease contexts requires further controlled studies. Future work should address domain-sensitive prompt engineering, decision-level parsing, and more reproducible evaluation protocols, as well as a direct comparison with classical methods (LASSO, ElasticNet), which is currently absent from the analysis.

Speculation on Future Developments in AI

The findings underscore the nascent but critical role of LLMs in biomedical informatics pipelines. Continued improvement in faithfulness constraints, automated auditability, and integration with symbolic filters is likely to expand LLM capabilities beyond generic knowledge distillation into actionable translational genomics. Advances in domain-specific pretraining and fine-tuning on large-scale curated biological datasets could mitigate current failure modes (e.g., FOXA1 omission), optimizing both biomarker recall and precision. Better alignment between LLM reasoning chains and biological causality will be essential for regulatory acceptance in clinical settings.

Conclusion

The paper provides evidence that neuro-symbolic integration (Mamba SSM + structured LLM CoT) enables precision-oriented feature selection for biomarker discovery, achieving higher classification performance on TCGA-BRCA with a dramatically reduced feature set. Selective faithfulness, rather than exhaustive recall, appears sufficient for the observed gains, with implications for both methodological rigor and translational adoption in omics-driven precision medicine. Continued development in reasoning faithfulness and robust audit mechanisms will be crucial for broader application.
