Genie-CAT for Protein Hypothesis Generation
- Genie-CAT is an integrated LLM framework that automates mechanistic hypothesis generation in protein design by unifying literature retrieval, structural parsing, electrostatics, and ML redox prediction.
- It employs a dynamic agentic architecture using a Thought–Action–Observation loop to iteratively orchestrate multiple computational tools for evidence-based analysis.
- The system quantitatively links protein sequence and structural context to redox function, producing rapid, interpretable, and testable hypotheses for metalloprotein engineering.
Genie-CAT is a tool-augmented LLM framework engineered to accelerate scientific hypothesis generation in protein design, with primary emphasis on the mechanistic tuning of metalloproteins such as ferredoxins. By orchestrating literature-grounded retrieval-augmented generation (RAG), advanced structural parsing, electrostatic potential computations, and machine learning redox property prediction in a unified, agentic workflow, Genie-CAT enables automated, mechanistically interpretable, and quantitatively robust hypothesis generation linking sequence, structure, and biochemical function (Jacob et al., 24 Nov 2025).
1. Agentic System Architecture
Genie-CAT is deployed as a Streamlit front-end that hosts a LangGraph “ReAct” agent, responsible for dynamic tool orchestration in response to user queries. The agent employs an iterative Thought–Action–Observation loop, where the LLM determines the needed capability (RAG, structure parser, APBS for electrostatics, or the redox ML model), executes structured function calls, and incorporates tool outputs into evolving context. This loop repeats until sufficient evidence and computation allow the LLM to synthesize a final answer. All domain tools operate within a single container, and the system is designed for modular extensibility: new physics-based modules such as quantum mechanics/molecular mechanics (QM/MM), density functional theory (DFT), or molecular dynamics (MD) register as additional tools without modification of core code.
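The Thought–Action–Observation loop and the tool registry can be sketched in plain Python. This is a minimal illustration, not Genie-CAT's code: the registry, the decorator, the stub tools, and `stub_policy` (a stand-in for the LLM's decision step) are all hypothetical names, and no LangGraph API is used.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool registry; names and outputs are placeholders only.
TOOLS: Dict[str, Callable[[str], str]] = {}

def register_tool(name: str):
    """Decorator that adds a tool without modifying the loop below."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("rag")
def rag_tool(query: str) -> str:
    return f"[stub] literature passages for: {query}"

@register_tool("structure_parser")
def structure_tool(pdb_id: str) -> str:
    return f"[stub] residues near Fe atoms in {pdb_id}"

def stub_policy(question: str, observations: List[str]) -> Tuple[str, Optional[str], str]:
    """Stand-in for the LLM: returns (thought, tool name or None, tool input)."""
    plan = [("structure_parser", "1CLF"), ("rag", question)]
    step = len(observations)
    if step < len(plan):
        tool, arg = plan[step]
        return f"I need output from {tool}", tool, arg
    return "I have enough evidence to answer", None, ""

def react_loop(question: str, policy=stub_policy, max_steps: int = 8) -> List[str]:
    """Repeat Thought -> Action -> Observation until the policy stops."""
    observations: List[str] = []
    for _ in range(max_steps):
        thought, tool, arg = policy(question, observations)  # Thought
        if tool is None:                                     # ready to answer
            break
        observations.append(TOOLS[tool](arg))                # Action -> Observation
    return observations

if __name__ == "__main__":
    for obs in react_loop("How do nearby residues tune ferredoxin redox potential?"):
        print(obs)
```

Under this design, adding a new module amounts to decorating one more function with `register_tool`; the loop itself is unchanged.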
2. Core Capabilities
Genie-CAT’s capabilities span four integrated domains:
2.1. Literature-Grounded Reasoning (RAG)
Genie-CAT utilizes a corpus of approximately 1,600 publications on hydrogenases and metalloenzymes, segmented into overlapping 500-character windows. Each window is embedded as a 384-dimensional MiniLM-L6-v2 vector and indexed with FAISS using cosine similarity. For a given question, the system retrieves the top-k relevant segments alongside document-level summaries via a “multiple-abstraction-level RAG” approach and concatenates them in the LLM prompt to ground the response. Retrieval is reported to reduce hallucination and improve correctness: RAG-based answers score 4.38 ± 0.05, against 4.01 ± 0.09 for GPT-5-mini without retrieval augmentation.
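The chunk-and-retrieve step can be illustrated with a small, dependency-free sketch. It is an approximation under stated assumptions: the 100-character overlap is an arbitrary choice (the source gives only the 500-character window size), and a toy bag-of-words vector stands in for the MiniLM embedding and the FAISS index. Only the windowing and the cosine-similarity top-k ranking mirror the description above.

```python
import math
from collections import Counter
from typing import List, Tuple

def chunk_text(text: str, size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into overlapping character windows (overlap is assumed)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a sentence encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query: str, chunks: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

if __name__ == "__main__":
    corpus = [
        "ferredoxin iron sulfur cluster redox potential",
        "streamlit front end for user queries",
        "hydrogenase active site hydrogen bonding",
    ]
    for score, chunk in top_k("iron sulfur redox", corpus, k=2):
        print(f"{score:.2f}  {chunk}")
```

With dense embeddings, cosine similarity is commonly obtained by L2-normalizing the vectors and ranking by inner product.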
2.2. Structural Parsing of PDB Files
Structural analysis incorporates user-uploaded, preloaded, or automatically fetched PDB files (via RCSB identifiers). MDAnalysis extracts atomic positions, identifies Fe atoms as [Fe–S] cluster centers, and computes distances to neighboring residues within a configurable cutoff (6 Å in the case study below). Residues are classified by physicochemical type (polar, nonpolar, charged) and visualized through distance histograms and class-distribution bar charts. Outputs include interactive tables, Matplotlib figures, and explicit residue-level summaries (chain, position, polarity, distance).
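The neighbor search and residue classification can be sketched without any structure library. The example below is illustrative only: it uses hand-written toy coordinates instead of MDAnalysis, one representative point per residue, and a common three-way classification table that is an assumption rather than the tool's actual mapping. The 6.0 Å default mirrors the case study, not a documented tool default.

```python
import math
from collections import Counter
from typing import Dict, List, NamedTuple, Tuple

Coord = Tuple[float, float, float]

class Residue(NamedTuple):
    chain: str
    resid: int
    resname: str
    coord: Coord  # representative position in Å (toy data)

# Assumed classification; conventions differ (e.g., for HIS, CYS, GLY).
RESIDUE_CLASS: Dict[str, str] = {
    **dict.fromkeys(["ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO", "GLY"], "nonpolar"),
    **dict.fromkeys(["SER", "THR", "ASN", "GLN", "TYR", "CYS", "HIS"], "polar"),
    **dict.fromkeys(["ASP", "GLU", "LYS", "ARG"], "charged"),
}

def neighbors_of_fe(fe: Coord, residues: List[Residue], cutoff: float = 6.0) -> List[dict]:
    """List residues within `cutoff` Å of an Fe position, nearest first."""
    rows = []
    for res in residues:
        d = math.dist(fe, res.coord)
        if d <= cutoff:
            rows.append({
                "chain": res.chain,
                "position": res.resid,
                "residue": res.resname,
                "class": RESIDUE_CLASS.get(res.resname, "other"),
                "distance": round(d, 2),
            })
    return sorted(rows, key=lambda r: r["distance"])

def class_counts(rows: List[dict]) -> Counter:
    """Count neighbors per physicochemical class (input to a bar chart)."""
    return Counter(r["class"] for r in rows)

if __name__ == "__main__":
    toy = [
        Residue("A", 22, "ILE", (3.0, 0.0, 0.0)),
        Residue("A", 56, "LEU", (0.0, 4.0, 3.0)),
        Residue("A", 34, "VAL", (10.0, 0.0, 0.0)),
        Residue("A", 40, "ASP", (0.0, 0.0, 5.5)),
    ]
    near = neighbors_of_fe((0.0, 0.0, 0.0), toy)
    for row in near:
        print(row)
    print(dict(class_counts(near)))
```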
2.3. Electrostatic Potential Calculation
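Electrostatic computations assign point charges using Amber ff14SB or an in-house [Fe–S] parameter database (SF4, FES, F3S). The Poisson–Boltzmann equation is solved via APBS on a 3D grid, written here in its linearized form:
$$\nabla \cdot \left[\epsilon(\mathbf{r})\,\nabla \phi(\mathbf{r})\right] - \kappa^2(\mathbf{r})\,\phi(\mathbf{r}) = -4\pi\,\rho(\mathbf{r})$$
where $\epsilon(\mathbf{r})$ is the position-dependent dielectric, $\kappa(\mathbf{r})$ the Debye–Hückel screening, $\phi(\mathbf{r})$ the dimensionless electrostatic potential, and $\rho(\mathbf{r})$ the charge density. Visual outputs include PyMOL scripts for surface electrostatics and differential maps for mutant comparison. Typical runtime per protein is 2–3 minutes.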
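To give a feel for screened electrostatics and for a wild-type versus mutant difference, the sketch below evaluates the analytic Debye–Hückel potential of point charges in a uniform solvent. This is deliberately not what APBS does: it ignores the protein's dielectric boundary and the grid solve entirely. The Bjerrum length (about 7.1 Å) and the Debye-length rule (about 3.04 Å divided by the square root of the molar ionic strength) are approximate values for water near 298 K with a 1:1 salt; the charges and positions are invented.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]
Charge = Tuple[float, Coord]  # (charge in units of e, position in Å)

BJERRUM_LENGTH_A = 7.1  # approximate, water near 298 K

def debye_length_A(ionic_strength_M: float) -> float:
    """Approximate Debye length (Å) for a 1:1 aqueous electrolyte near 298 K."""
    return 3.04 / math.sqrt(ionic_strength_M)

def dh_potential(point: Coord, charges: List[Charge], ionic_strength_M: float = 0.15) -> float:
    """Dimensionless potential (units of kT/e) from screened point charges."""
    kappa = 1.0 / debye_length_A(ionic_strength_M)
    total = 0.0
    for q, pos in charges:
        r = math.dist(point, pos)
        total += q * math.exp(-kappa * r) / r
    return BJERRUM_LENGTH_A * total

if __name__ == "__main__":
    cluster_center = (0.0, 0.0, 0.0)
    wild_type = [(+1.0, (8.0, 0.0, 0.0))]                  # toy charged residue
    mutant = wild_type + [(-1.0, (0.0, 5.0, 0.0))]         # toy added negative charge
    delta = dh_potential(cluster_center, mutant) - dh_potential(cluster_center, wild_type)
    print(f"Debye length at 0.15 M: {debye_length_A(0.15):.2f} Å")
    print(f"Change in potential at cluster center: {delta:.3f} kT/e")
```

A differential map in the sense used above is the same subtraction carried out at every grid point rather than at a single position.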
2.4. Machine Learning Prediction of Redox Properties
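For each [4Fe–4S] cluster $c$, Genie-CAT constructs group-invariant geometric descriptors (distances, angles, triple products) as well as global electrostatic descriptors (for example, the potential at the cluster center, $\phi_0$, and the field vector, $\mathbf{E}$). These features are concatenated into a 57-dimensional vector $\mathbf{x}_c$, which is z-scored and passed to a two-layer MLP (256 and 128 neurons, ReLU activation, dropout 0.1) to predict the cluster midpoint potential:
$$\hat{E}_m(c) = f_\theta(\mathbf{x}_c)$$
The model is trained with a mean-squared-error (MSE) loss over the $N$ training clusters:
$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{c=1}^{N}\left(\hat{E}_m(c) - E_m(c)\right)^2$$
Inference per cluster typically takes about 20 seconds.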
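The prediction path (z-scoring, two hidden layers of 256 and 128 ReLU units, scalar output) and the MSE loss can be sketched in pure Python. The weights here are random, so the output is meaningless as a redox potential; the He-style initialization, the function names, and the omission of a training loop are all choices made for this illustration. Dropout is active only during training, so it does not appear in the inference pass.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights[out][in], biases[out])

def zscore(x: List[float], mean: List[float], std: List[float]) -> List[float]:
    """Standardize each feature with training-set statistics."""
    return [(xi - m) / s if s > 0 else 0.0 for xi, m, s in zip(x, mean, std)]

def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Random weights (He-style scale, an assumed choice) and zero biases."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def init_mlp(rng: random.Random, n_features: int = 57) -> List[Layer]:
    """57 -> 256 -> 128 -> 1, matching the layer sizes given in the text."""
    return [init_layer(n_features, 256, rng), init_layer(256, 128, rng), init_layer(128, 1, rng)]

def dense(x: List[float], layer: Layer) -> List[float]:
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]

def predict(x: List[float], params: List[Layer]) -> float:
    """Inference pass: two ReLU hidden layers, linear scalar output."""
    h1 = relu(dense(x, params[0]))
    h2 = relu(dense(h1, params[1]))
    return dense(h2, params[2])[0]

def mse(pred: List[float], target: List[float]) -> float:
    """Mean-squared error over a batch of predictions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

if __name__ == "__main__":
    rng = random.Random(0)
    params = init_mlp(rng)
    features = [rng.gauss(0.0, 1.0) for _ in range(57)]   # toy descriptor vector
    x = zscore(features, [0.0] * 57, [1.0] * 57)
    print("untrained output:", predict(x, params))
```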
3. Mechanistic Hypothesis Generation
Genie-CAT explicitly links sequence, structural context, and functional redox outcomes to generate testable, mechanistically interpretable hypotheses. After parsing the protein structure, the LLM identifies residues within the distance cutoff of each Fe atom whose physicochemical class is amenable to mutation (e.g., nonpolar → polar). For each candidate mutation, APBS computes the resulting shift in local electrostatic potential ($\Delta\phi$), and the redox ML model predicts the change in midpoint potential ($\Delta E_m$), directly mapping mutation → geometry/electrostatics → function.
3.1. Example Residue Hypotheses
| Mutation | Rationale | $\Delta\phi$ (mV) | Predicted $\Delta E_m$ (mV) |
|---|---|---|---|
| Leu56→Asp | Negative charge near cluster → stabilized oxidized state | +18 | +25 |
| Val34→Asn | Polar side chain introduced in hydrophobic pocket | +12 | +15 |
| Ile22→Glu | Longer, negatively charged side chain | +24 | +32 |
Table: Generated for 1CLF Cluster 1. Each entry includes local geometry (distance to Fe), electrostatic shift (APBS $\Delta\phi$), predicted redox shift, and mechanistic rationale (e.g., H-bonding, dielectric effects).
3.2. Hypothesis Validation and Ranking
Candidate hypotheses are scored by the magnitude of the electrostatic shift $|\Delta\phi|$ (physics-based), the predicted $\Delta E_m$ (ML-based), and a tunable linear composite of the two. Top-ranked hypotheses are accompanied by confidence scores and links to relevant literature on similar mutations.
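A linear composite ranking can be sketched as follows, using the three example mutations from the table in Section 3.1. The source does not give the weights or the exact form of the composite, so the equal weights `w_phys` and `w_ml` and the use of absolute values are assumptions made purely for illustration.

```python
from typing import List, NamedTuple

class Hypothesis(NamedTuple):
    mutation: str
    d_phi_mV: float   # electrostatic shift from the physics-based step
    d_em_mV: float    # predicted midpoint-potential shift from the ML step

def composite_score(h: Hypothesis, w_phys: float = 0.5, w_ml: float = 0.5) -> float:
    """Weighted sum of shift magnitudes; the weights are hypothetical."""
    return w_phys * abs(h.d_phi_mV) + w_ml * abs(h.d_em_mV)

def rank(hyps: List[Hypothesis], w_phys: float = 0.5, w_ml: float = 0.5) -> List[Hypothesis]:
    """Sort hypotheses by composite score, highest first."""
    return sorted(hyps, key=lambda h: composite_score(h, w_phys, w_ml), reverse=True)

# Values copied from the example table in Section 3.1.
CANDIDATES = [
    Hypothesis("Leu56→Asp", 18.0, 25.0),
    Hypothesis("Val34→Asn", 12.0, 15.0),
    Hypothesis("Ile22→Glu", 24.0, 32.0),
]

if __name__ == "__main__":
    for h in rank(CANDIDATES):
        print(f"{h.mutation}: {composite_score(h):.1f}")
```

With equal weights the composite scores are 28.0, 21.5, and 13.5, which ranks Ile22→Glu first, then Leu56→Asp, then Val34→Asn. Shifting weight toward either term changes the scores but not the order for this particular table, since both columns rank the mutations identically.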
4. Quantitative Case Study: Ferredoxin Redox Tuning
A proof-of-concept deployment on PDB 1CLF demonstrates the agentic workflow. Genie-CAT automatically retrieves and parses two [4Fe–4S] clusters, characterizing their environments:
- Cluster A: 5 hydrophobic, 2 polar/charged residues within 6 Å.
- Cluster B: 2 hydrophobic, 5 polar/charged residues within 6 Å.
The system generates and explains the hypothesis that Cluster A’s more hydrophobic environment stabilizes its reduced state, yielding a more negative midpoint potential ($E_m$). Quantitative predictions compare well to reported values:
| Cluster | Environment | Predicted $E_m$ (mV) | Expert Trend | Reported $E_m$ (mV) |
|---|---|---|---|---|
| A | More hydrophobic | –425 | More negative | –420 ± 10 |
| B | More polar | –370 | More positive | –360 ± 15 |
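Predictions differ from the reported central values by 5 mV (Cluster A) and 10 mV (Cluster B), within the reported uncertainties of ±10 mV and ±15 mV, and reproduce the expected ordering of the two clusters.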
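The agreement between predicted and reported values can be checked with simple arithmetic on the table above. The short sketch below recomputes the absolute errors and tests whether each falls inside the reported uncertainty; the dictionary layout and function name are illustrative.

```python
from typing import Dict, Tuple

# (predicted mV, reported mV, reported uncertainty mV), copied from the table.
CLUSTERS: Dict[str, Tuple[float, float, float]] = {
    "A": (-425.0, -420.0, 10.0),
    "B": (-370.0, -360.0, 15.0),
}

def compare(clusters: Dict[str, Tuple[float, float, float]]) -> Dict[str, dict]:
    """Absolute error per cluster and whether it lies within the uncertainty."""
    report = {}
    for name, (pred, reported, unc) in clusters.items():
        err = abs(pred - reported)
        report[name] = {"abs_error_mV": err, "within_uncertainty": err <= unc}
    return report

if __name__ == "__main__":
    for name, row in compare(CLUSTERS).items():
        print(name, row)
    # Trend check: the more hydrophobic cluster A is predicted more negative.
    print("A more negative than B:", CLUSTERS["A"][0] < CLUSTERS["B"][0])
```

The absolute errors come out to 5 mV for Cluster A and 10 mV for Cluster B, both inside the reported uncertainties.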
5. Comparative Analysis and Scope of Genie-CAT
5.1. Comparison to Traditional LLMs and Design Tools
Genie-CAT surpasses an ungrounded general-purpose LLM (GPT-5-mini without RAG) in grounded Q&A metrics (4.38 ± 0.05 vs. 4.01 ± 0.09 mean correctness). It is distinguished from conventional sequence/structure-generation tools such as ProteinMPNN or RFdiffusion by integrating explicit electrostatic computation, ML-based redox prediction, and direct literature grounding. The agentic “ReAct” pipeline enables dynamic, multi-modal evidence synthesis absent in static PLMs and diffusion pipelines.
5.2. Advantages and Limitations
Advantages:
- Mechanistic interpretability via explicit integration of geometric and electrostatic descriptors.
- Rapid end-to-end execution (about 2–3 minutes for electrostatics plus roughly 20 seconds per cluster for ML inference) versus days of manual setup.
- Modular extensibility, allowing seamless addition of new QM/MM, DFT, or MD tools.
Limitations:
- Continuum electrostatics approximates near-metal polarization and omits certain quantum effects.
- Redox ML predictor is trained on [Fe–S] proteins, requiring retraining for other cofactors.
- Single-structure analysis; does not account for conformational ensemble effects.
5.3. Potential Extensions
- Integration of QM/MM cluster models for higher-fidelity local electrostatics.
- Incorporation of MD-derived ensemble averages into feature construction.
- Expansion of the literature and structural corpus to encompass crystal structure databases, patents, and thermodynamic datasets.
- Extension to non-[Fe–S] redox cofactors (e.g., heme, Cu-centers) via parameter and model retraining.
6. Role in Mechanistic Computational Enzyme Design
Genie-CAT exemplifies agentic LLM frameworks in scientific discovery: the synthesis of natural-language reasoning, literature-based grounding, explicit numeric simulation (APBS), and ML-based functional inference transforms the design of metalloproteins from an expert-driven, manual task into an interactive, quantitative, and interpretable computational process. By seamlessly bridging mutation identification, structure-function annotation, and functional outcome prediction, Genie-CAT provides testable, evidence-backed hypotheses for accelerated engineering of metalloproteins (Jacob et al., 24 Nov 2025).