Genie-CAT for Protein Hypothesis Generation

Updated 1 December 2025

Genie-CAT is an integrated LLM framework that automates mechanistic hypothesis generation in protein design by unifying literature retrieval, structural parsing, electrostatics, and ML redox prediction.
It employs a dynamic agentic architecture using a Thought–Action–Observation loop to iteratively orchestrate multiple computational tools for evidence-based analysis.
The system quantitatively links protein sequence and structural context to redox function, producing rapid, interpretable, and testable hypotheses for metalloprotein engineering.

Genie-CAT is a tool-augmented LLM framework engineered to accelerate scientific hypothesis generation in protein design, with primary emphasis on the mechanistic tuning of metalloproteins such as ferredoxins. It orchestrates four capabilities in a unified, agentic workflow: literature-grounded retrieval-augmented generation (RAG), structural parsing, electrostatic potential computation, and machine learning prediction of redox properties. Together these enable automated, mechanistically interpretable, and quantitative hypothesis generation linking sequence, structure, and biochemical function (Jacob et al., 24 Nov 2025).

1. Agentic System Architecture

Genie-CAT is deployed as a Streamlit front-end that hosts a LangGraph “ReAct” agent, responsible for dynamic tool orchestration in response to user queries. The agent employs an iterative Thought–Action–Observation loop, where the LLM determines the needed capability (RAG, structure parser, APBS for electrostatics, or the redox ML model), executes structured function calls, and incorporates tool outputs into evolving context. This loop repeats until sufficient evidence and computation allow the LLM to synthesize a final answer. All domain tools operate within a single container, and the system is designed for modular extensibility: new physics-based modules such as quantum mechanics/molecular mechanics (QM/MM), density functional theory (DFT), or molecular dynamics (MD) register as additional tools without modification of core code.
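
The Thought–Action–Observation loop and the tool-registration idea described above can be illustrated with a minimal, library-free sketch. It does not use LangGraph or an actual LLM: `scripted_policy` is a hypothetical stand-in for the model's decision step, and the tool names, signatures, and return strings are placeholders rather than Genie-CAT's real interfaces.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool registry; names and outputs are placeholders,
# not Genie-CAT's actual tool interfaces.
TOOLS: Dict[str, Callable[[str], str]] = {}


def register_tool(name: str):
    """Decorator that adds a function to the registry without changing the loop."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return wrap


@register_tool("structure_parser")
def structure_parser(pdb_id: str) -> str:
    return f"parsed {pdb_id}: placeholder residue summary"


@register_tool("apbs")
def apbs(pdb_id: str) -> str:
    return f"electrostatics for {pdb_id}: placeholder potential map"


def scripted_policy(question: str, observations: List[str]) -> Tuple[str, str, str]:
    """Stand-in for the LLM: returns (thought, action, action_input)."""
    plan = ["structure_parser", "apbs"]
    if len(observations) < len(plan):
        tool = plan[len(observations)]
        return (f"I need output from {tool}", tool, "1CLF")
    return ("Enough evidence gathered", "finish", "")


def run_agent(question: str, policy, max_steps: int = 5) -> List[Tuple[str, str, str]]:
    """Repeat Thought -> Action -> Observation until the policy says 'finish'."""
    observations: List[str] = []
    trace: List[Tuple[str, str, str]] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, observations)
        if action == "finish":
            trace.append((thought, action, ""))
            break
        observation = TOOLS[action](arg)  # structured function call
        observations.append(observation)  # tool output feeds the next step
        trace.append((thought, action, observation))
    return trace


trace = run_agent("How do the two clusters in 1CLF differ?", scripted_policy)
for thought, action, observation in trace:
    print(f"Thought: {thought} | Action: {action} | Observation: {observation}")
```

In this sketch, adding a new capability only requires another `@register_tool(...)` function; `run_agent` itself stays unchanged, which mirrors the extensibility claim in spirit.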

2. Core Capabilities

Genie-CAT’s capabilities span four integrated domains:

2.1. Literature-Grounded Reasoning (RAG)

Genie-CAT uses a corpus of approximately 1,600 publications on hydrogenases and metalloenzymes, segmented into overlapping 500-character windows. Each segment is embedded as a 384-dimensional MiniLM-L6-v2 vector and indexed with FAISS using cosine similarity. For a given question, the system retrieves the top-k relevant segments alongside document-level summaries via a “multiple-abstraction-level RAG” approach and concatenates them in the LLM prompt to ground its response. This reduces hallucination and improves correctness: RAG-based answers reach a mean correctness score of 4.38 ± 0.05, against 4.01 ± 0.09 for GPT-5-mini without retrieval augmentation.
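
As a rough illustration of the chunk-then-retrieve step, the sketch below splits text into overlapping 500-character windows and ranks chunks by cosine similarity. It is deliberately simplified: a toy bag-of-words vector stands in for the MiniLM embeddings, a plain sort stands in for the FAISS index, and the 100-character overlap, the function names, and the demo sentences are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, size: int = 500, overlap: int = 100) -> List[str]:
    """Split text into windows of `size` characters overlapping by `overlap`."""
    step = size - overlap
    stop = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, stop, step)]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a sentence-embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Invented demo sentences, not drawn from the actual corpus.
demo_chunks = [
    "Ferredoxin redox potential depends on the electrostatic environment of the cluster.",
    "Hydrogenases catalyze the reversible oxidation of molecular hydrogen.",
    "Streamlit is a framework for building data applications.",
]
top = retrieve("What controls ferredoxin redox potential?", demo_chunks, k=2)
windows = chunk_text("x" * 1200)
print(top[0][1])
print(len(windows), [len(w) for w in windows])
```

The retrieved chunks would then be concatenated into the prompt; how the document-level summaries are merged in is not specified here and is left out of the sketch.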

2.2. Structural Parsing of PDB Files

Structural analysis incorporates user-uploaded, preloaded, or automatically fetched PDB files (via RCSB identifiers). MDAnalysis extracts atomic positions, identifies Fe atoms as [Fe–S] cluster centers, and computes distances to neighboring residues within a configurable cutoff (default $R_{\mathrm{cut}} = 6$ Å). Residues are classified by physicochemical type (polar, nonpolar, charged) and visualized through distance histograms and class-distribution bar charts. Outputs include interactive tables, Matplotlib figures, and explicit residue-level summaries (chain, position, polarity, distance).
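
The neighbor search and residue classification can be sketched without MDAnalysis, using plain tuples of coordinates. Everything concrete below is an assumption for illustration: the atom list is invented, the Fe atom is placed at the origin, and the `RESIDUE_CLASS` lookup is one possible (and partial) physicochemical grouping rather than the one the system uses.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical, partial classification table (assumption for this sketch).
RESIDUE_CLASS: Dict[str, str] = {
    "LEU": "nonpolar", "VAL": "nonpolar", "ILE": "nonpolar", "ALA": "nonpolar",
    "SER": "polar", "THR": "polar", "ASN": "polar", "CYS": "polar",
    "ASP": "charged", "GLU": "charged", "LYS": "charged", "ARG": "charged",
}

Atom = Tuple[str, str, int, str, Tuple[float, float, float]]

# Invented atoms: (chain, residue name, residue number, atom name, xyz in Å).
atoms: List[Atom] = [
    ("A", "CYS", 8, "SG", (0.0, 0.0, 2.3)),
    ("A", "LEU", 56, "CD1", (3.0, 0.0, 0.0)),
    ("A", "LEU", 56, "CA", (5.0, 2.0, 0.0)),
    ("A", "ASP", 12, "OD1", (0.0, 4.0, 3.0)),
    ("A", "VAL", 34, "CB", (7.0, 0.0, 0.0)),
]


def neighbors_within(fe_xyz, atoms: List[Atom], r_cut: float = 6.0):
    """Return residues with any atom within r_cut of the Fe atom.

    Each residue is reported once, with its minimum atom-to-Fe distance.
    """
    best: Dict[Tuple[str, str, int], float] = {}
    for chain, resname, resid, _atom_name, xyz in atoms:
        d = math.dist(fe_xyz, xyz)
        if d <= r_cut:
            key = (chain, resname, resid)
            best[key] = min(d, best.get(key, d))
    rows = [(chain, resname, resid, RESIDUE_CLASS.get(resname, "unknown"), dist)
            for (chain, resname, resid), dist in best.items()]
    return sorted(rows, key=lambda row: row[4])


rows = neighbors_within((0.0, 0.0, 0.0), atoms, r_cut=6.0)
class_counts = Counter(row[3] for row in rows)
for chain, resname, resid, cls, dist in rows:
    print(f"{chain} {resname}{resid} {cls} {dist:.2f} Å")
```

The per-residue rows correspond to the residue-level summary (chain, position, polarity, distance), and `class_counts` is the kind of tally a class-distribution bar chart would be drawn from.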

2.3. Electrostatic Potential Calculation

Electrostatic computations assign point charges using Amber ff14SB or an in-house [Fe–S] parameter database (SF4, FES, F3S). The Poisson–Boltzmann equation is solved via APBS on a 3D grid: $-\nabla\!\cdot\!\left[\varepsilon(\mathbf{r})\,\nabla\phi(\mathbf{r})\right] +\kappa^{2}(\mathbf{r})\,\sinh\left[\phi(\mathbf{r})\right]=4\pi\,\rho(\mathbf{r}),$ where $\varepsilon(\mathbf{r})$ is the position-dependent dielectric coefficient, $\kappa$ the Debye–Hückel screening parameter, $\phi$ the dimensionless electrostatic potential, and $\rho$ the charge density. Visual outputs include PyMOL scripts for surface electrostatics and differential maps for mutant comparison. Typical runtime per protein is 2–3 minutes.
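
To give a sense of scale for the screening term, the sketch below computes the bulk Debye length $1/\kappa$ for a 1:1 electrolyte from physical constants. The solvent conditions are assumptions chosen for illustration (water with relative permittivity 78.5 at 298.15 K, ionic strength 0.15 M); the settings actually used in the APBS runs are not stated in the text, and inside the solute the screening term is typically zero.

```python
import math

# CODATA constants in SI units.
EPS0 = 8.8541878128e-12      # vacuum permittivity, F/m
K_B = 1.380649e-23           # Boltzmann constant, J/K
E_CHARGE = 1.602176634e-19   # elementary charge, C
N_A = 6.02214076e23          # Avogadro constant, 1/mol


def debye_length_nm(ionic_strength_molar: float,
                    eps_r: float = 78.5,
                    temperature_k: float = 298.15) -> float:
    """Bulk Debye length (nm) for a 1:1 electrolyte.

    Uses kappa^2 = 2 N_A e^2 I / (eps0 eps_r k_B T), with I in mol/m^3.
    The default eps_r and temperature are assumed values for water near 25 °C.
    """
    ionic_strength_si = ionic_strength_molar * 1000.0  # mol/L -> mol/m^3
    kappa_sq = (2.0 * N_A * E_CHARGE ** 2 * ionic_strength_si
                / (EPS0 * eps_r * K_B * temperature_k))
    return 1e9 / math.sqrt(kappa_sq)


length_physiological = debye_length_nm(0.15)
length_dilute = debye_length_nm(0.01)
print(f"Debye length at 0.15 M: {length_physiological:.3f} nm")
print(f"Debye length at 0.01 M: {length_dilute:.3f} nm")
```

Under these assumed conditions the screening length is below one nanometer at 0.15 M, which is comparable to the 6 Å neighborhood used in the structural analysis.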

2.4. Machine Learning Prediction of Redox Properties

For each [4Fe–4S] cluster $i$, Genie-CAT constructs group-invariant descriptors $\phi_g(s_{i,g}, v_{i,g})\in\mathbb{R}^{18}$ for $g\in\{\mathrm{Fe}, \mathrm{S}_1, \mathrm{S}_2\}$ (distances, angles, triple products), as well as global electrostatic descriptors ($Q_i$, the potential at the cluster center; $C_i$, the field vector; and $\|C_i\|_2$, its magnitude). These features are concatenated into a 57-dimensional vector $x_i$, which is z-scored to give $\tilde{x}_i$ and passed to a two-layer MLP (256 and 128 neurons, ReLU activation, dropout 0.1) that predicts the cluster midpoint potential: $\hat E_i = f_\theta(\tilde{x}_i)$. The model is trained with an MSE loss: $\mathcal{L}(\theta) = \frac{1}{|\mathcal{D}_{\rm train}|} \sum_{i \in \mathcal{D}_{\rm train}} \left(E_i - f_\theta(\tilde{x}_i)\right)^2$. Inference typically takes about 20 seconds per cluster.
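
The forward pass and loss can be sketched in plain Python. This is a minimal illustration with randomly initialized, untrained weights, so the outputs carry no physical meaning. It assumes the two hidden layers (256 and 128 units) are followed by a single linear output unit, and it omits dropout because dropout is normally inactive at inference time. The helper names and the random input are hypothetical.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def zscore(x: List[float], mean: List[float], std: List[float]) -> List[float]:
    """Standardize each feature with training-set statistics."""
    return [(xi - m) / s for xi, m, s in zip(x, mean, std)]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """He-style random initialization; weights here are NOT trained."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer: Layer, x: List[float]) -> List[float]:
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def predict(params: List[Layer], x_tilde: List[float]) -> float:
    """f_theta: ReLU hidden layers followed by a linear output unit."""
    h = x_tilde
    for layer in params[:-1]:
        h = [max(0.0, a) for a in linear(layer, h)]
    return linear(params[-1], h)[0]


def mse(params: List[Layer], xs: List[List[float]], targets: List[float]) -> float:
    """Mean squared error over a set of (features, midpoint potential) pairs."""
    return sum((e - predict(params, x)) ** 2 for x, e in zip(xs, targets)) / len(xs)


rng = random.Random(0)
params = [init_layer(57, 256, rng), init_layer(256, 128, rng), init_layer(128, 1, rng)]

# Invented 57-dimensional input with placeholder normalization statistics.
x_raw = [rng.uniform(-1.0, 1.0) for _ in range(57)]
x_tilde = zscore(x_raw, mean=[0.0] * 57, std=[1.0] * 57)
e_hat = predict(params, x_tilde)
print(f"untrained prediction: {e_hat:.3f}")
```

Training would minimize `mse` over the training set with a gradient-based optimizer, which is outside the scope of this sketch.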

3. Mechanistic Hypothesis Generation

Genie-CAT explicitly links sequence, structural context, and functional redox outcomes to generate testable, mechanistically interpretable hypotheses. After parsing the protein structure, the LLM identifies residues within $R_{\mathrm{cut}}$ of each Fe atom whose physicochemical class is amenable to mutation (e.g., nonpolar → polar). For each candidate mutation, APBS computes the resulting shift in local electrostatic potential ($\Delta\psi$), and the redox ML model predicts the change in midpoint potential ($\Delta E_\mathrm{m}$), directly mapping mutation → geometry/electrostatics → function.

3.1. Example Residue Hypotheses

Mutation	Rationale	$\Delta\psi$ (mV)	Predicted $\Delta E_\mathrm{m}$ (mV)
Leu56→Asp	Negative charge near cluster → stabilized oxidized state	+18	+25
Val34→Asn	Polar side chain introduced in hydrophobic pocket	+12	+15
Ile22→Glu	Longer, negatively charged side chain	+24	+32

Table: Generated for 1CLF Cluster 1. Each entry includes local geometry (distance to Fe), electrostatic shift (APBS $\Delta\psi$), predicted redox shift, and mechanistic rationale (e.g., H-bonding, dielectric effects).

3.2. Hypothesis Validation and Ranking

Candidate hypotheses are scored by the magnitude of the electrostatic shift $|\Delta\psi|$ (physics-based), the predicted $\Delta E_\mathrm{m}$ (ML-based), and a tunable linear composite function $w_1 |\Delta\psi| + w_2 |\Delta E_{\rm m}|$. Top-ranked hypotheses are accompanied by confidence scores and links to relevant literature on similar mutations.
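
The composite ranking can be worked through with the three example mutations from the table in Section 3.1. The equal weights $w_1 = w_2 = 0.5$ are an assumption for illustration, since the text describes the weights only as tunable, and the `Hypothesis` class is a hypothetical container.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    mutation: str
    d_psi_mv: float  # APBS electrostatic shift, mV
    d_em_mv: float   # predicted midpoint-potential shift, mV


def composite_score(h: Hypothesis, w1: float = 0.5, w2: float = 0.5) -> float:
    """w1 * |delta psi| + w2 * |delta E_m|; the weights are assumed values."""
    return w1 * abs(h.d_psi_mv) + w2 * abs(h.d_em_mv)


# Values taken from the example table for 1CLF.
hypotheses: List[Hypothesis] = [
    Hypothesis("Leu56→Asp", 18.0, 25.0),
    Hypothesis("Val34→Asn", 12.0, 15.0),
    Hypothesis("Ile22→Glu", 24.0, 32.0),
]

ranked = sorted(hypotheses, key=composite_score, reverse=True)
for h in ranked:
    print(f"{h.mutation}: score = {composite_score(h):.1f}")
```

With these weights the order is Ile22→Glu, then Leu56→Asp, then Val34→Asn. Because all three example mutations have the same ordering in both columns, any positive choice of weights gives the same ranking here.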

4. Quantitative Case Study: Ferredoxin Redox Tuning

A proof-of-concept deployment on PDB 1CLF demonstrates the agentic workflow. Genie-CAT automatically retrieves and parses two [4Fe–4S] clusters, characterizing their environments:

Cluster A: 5 hydrophobic, 2 polar/charged residues within 6 Å.
Cluster B: 2 hydrophobic, 5 polar/charged residues within 6 Å.

The system generates and explains the hypothesis that Cluster A’s more hydrophobic environment stabilizes its reduced state, yielding a more negative $E_\mathrm{m}$. Quantitative predictions compare well to reported values:

Cluster	Environment	Predicted $E_\mathrm{m}$ (mV)	Expert Trend	Reported $E_\mathrm{m}$ (mV)
A	More hydrophobic	–425	More negative	–420 ± 10
B	More polar	–370	More positive	–360 ± 15

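The predictions deviate from the reported values by 5 mV (Cluster A) and 10 mV (Cluster B), within the reported uncertainties of ±10 and ±15 mV, which supports the model's quantitative fidelity for this case.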
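
The agreement between predicted and reported values can be checked directly from the table. The short sketch below uses only the numbers listed there; the dictionary layout and variable names are hypothetical.

```python
# Predicted and reported midpoint potentials (mV) from the case-study table.
clusters = {
    "A": {"predicted": -425.0, "reported": -420.0, "uncertainty": 10.0},
    "B": {"predicted": -370.0, "reported": -360.0, "uncertainty": 15.0},
}

# Absolute deviation of each prediction from the reported value.
deviations = {
    name: abs(c["predicted"] - c["reported"]) for name, c in clusters.items()
}

# Does each deviation fall inside the reported experimental uncertainty?
within = {
    name: deviations[name] <= clusters[name]["uncertainty"] for name in clusters
}

# Does the predicted ordering match the expert trend (A more negative than B)?
trend_ok = clusters["A"]["predicted"] < clusters["B"]["predicted"]

for name in clusters:
    print(f"Cluster {name}: deviation {deviations[name]:.0f} mV, "
          f"within uncertainty: {within[name]}")
print(f"Predicted trend matches expert trend: {trend_ok}")
```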

5. Comparative Analysis and Scope of Genie-CAT

5.1. Comparison to Traditional LLMs and Design Tools

Genie-CAT surpasses a general-purpose LLM used without retrieval (GPT-5-mini without RAG) in grounded Q&A metrics (4.38 ± 0.05 vs. 4.01 ± 0.09 mean correctness). It is distinguished from conventional sequence/structure-generation tools such as ProteinMPNN or RFdiffusion by integrating explicit electrostatic computation, ML-based redox prediction, and direct literature grounding. The agentic “ReAct” pipeline enables dynamic, multi-modal evidence synthesis that is absent in static protein language models (PLMs) and diffusion pipelines.

5.2. Advantages and Limitations

Advantages:

Mechanistic interpretability via explicit integration of geometric and electrostatic descriptors.
Rapid end-to-end execution (<3 minutes per analysis) versus days of manual setup.
Modular extensibility, allowing seamless addition of new QM/MM, DFT, or MD tools.

Limitations:

Continuum electrostatics approximates near-metal polarization and omits certain quantum effects.
Redox ML predictor is trained on [Fe–S] proteins, requiring retraining for other cofactors.
Single-structure analysis; does not account for conformational ensemble effects.

5.3. Potential Extensions

Integration of QM/MM cluster models for higher-fidelity local electrostatics.
Incorporation of MD-derived ensemble averages into feature construction.
Expansion of the literature and structural corpus to encompass crystal structure databases, patents, and thermodynamic datasets.
Extension to non-[Fe–S] redox cofactors (e.g., heme, Cu-centers) via parameter and model retraining.

6. Role in Mechanistic Computational Enzyme Design

Genie-CAT exemplifies agentic LLM frameworks in scientific discovery: the synthesis of natural-language reasoning, literature-based grounding, explicit numeric simulation (APBS), and ML-based functional inference transforms the design of metalloproteins from an expert-driven, manual task into an interactive, quantitative, and interpretable computational process. By seamlessly bridging mutation identification, structure-function annotation, and functional outcome prediction, Genie-CAT provides testable, evidence-backed hypotheses for accelerated engineering of metalloproteins (Jacob et al., 24 Nov 2025).

References

Jacob et al. (24 Nov 2025). Beyond Protein Language Models: An Agentic LLM Framework for Mechanistic Enzyme Design.
