ChromLLM: Optimizing Chromatography Workflows
- ChromLLM is a domain-specific LLM explicitly trained on chromatography literature and QA pairs to enhance experimental design and process optimization.
- It leverages deep transformer architectures and retrieval-augmented generation to drive a multi-agent system that automates design, execution, and multi-objective optimization.
- Demonstrated in Ginkgo biloba extract purification, ChromLLM reduced development time from seven weeks to one week while meeting strict purity and productivity standards.
ChromLLM is a domain-specific LLM expressly trained for chromatographic process design and optimization. Integrated into the ChromR platform, ChromLLM leverages deep transformer architectures and retrieval-augmented generation to automate and accelerate chromatographic separation workflows in pharmaceutical, chemical, and food industries. With a specialized corpus and fine-tuned QA capabilities, the model drives a multi-agent system that completes end-to-end experimental design, execution, and multi-objective optimization with markedly reduced reliance on human expertise and significant reduction in process development time (Tang et al., 7 Jan 2026).
1. Model Architecture and Training
ChromLLM adopts the Qwen2.5-14B foundation, a transformer-based decoder containing approximately 14 billion parameters, 32 attention heads per layer, and a hidden size of 4096. Rotary positional embeddings (RoPE) are used for sequence encoding. Tokenization is handled via Qwen’s native byte pair encoding (BPE) with a vocabulary of roughly 128,000 tokens.
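As an illustration of the rotary scheme, the following is a minimal NumPy sketch of RoPE applied to a sequence of vectors; the dimensions are illustrative and do not reflect Qwen2.5-14B's actual head size or implementation details:

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embeddings (RoPE) to x of shape (seq_len, d)."""
    seq_len, d = x.shape
    half = d // 2
    # one rotation frequency per dimension pair, decaying geometrically
    freqs = base ** (-2.0 * np.arange(half) / d)        # (half,)
    angles = np.outer(np.arange(seq_len), freqs)        # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # rotate each (x1, x2) coordinate pair by its position-dependent angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
```

Because each pair is only rotated, RoPE preserves vector norms while encoding position into the phase that attention dot-products can compare.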
Domain-specific pretraining employs approximately 75 million tokens sourced from chromatography-related literature (arXiv, PubMed, Scopus), the Chinese Pharmacopoeia, method sections, and patents. The text extraction pipeline used Qwen-Long with a 10-million-token context on Alibaba Cloud Bailian, followed by regex cleaning and similarity-based deduplication. Training follows a standard next-token prediction objective with the AdamW optimizer over one epoch on Huawei Cloud ModelArts, minimizing the cross-entropy loss

$$\mathcal{L}_{\text{pretrain}} = -\sum_{t}\log p_\theta(x_t \mid x_{<t})$$
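The next-token objective can be sketched as a mean cross-entropy over a token sequence; this toy NumPy version operates on raw logits and is not ChromLLM's actual training code:

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean next-token cross-entropy: logits (T, V) predict targets (T,) token ids."""
    z = logits - logits.max(axis=1, keepdims=True)                # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    return -log_probs[np.arange(len(targets)), targets].mean()
```

With uniform logits over a vocabulary of size V, the loss equals log V, the entropy of a random guess.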
Supervised fine-tuning (SFT) is accomplished via LoRA (Low-Rank Adaptation) on 6,801 curated question–answer pairs (multi-turn dialogues), executed in Llama-Factory with a LoRA rank of 8, converging to a final SFT loss of approximately 0.8 on held-out data, where the objective is the cross-entropy over response tokens:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t}\log p_\theta(y_t \mid x, y_{<t})$$
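A minimal NumPy sketch of the LoRA forward pass: the frozen base weight W is augmented with a trainable low-rank product B·A scaled by alpha/r. Shapes and scaling follow the standard LoRA formulation, not Llama-Factory's internals:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0, r=8):
    """y = x @ W + (alpha / r) * x @ A.T @ B.T — frozen base weight W (d_in, d_out)
    plus a trainable low-rank update B @ A with A: (r, d_in), B: (d_out, r)."""
    return x @ W + (alpha / r) * (x @ A.T) @ B.T

# At the start of fine-tuning B is all zeros, so the adapted layer
# reproduces the base model exactly.
```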
2. Tokenization and Chromatographic Knowledge Representation
ChromLLM’s tokenization strategy extends the base BPE vocabulary to encapsulate chromatography expertise. Numeric tokens are segmented into special tokens (“〈num〉”, “.”, “%”) and digit subsequences. Chromatography-specific terminology (e.g., “BV/h”, “resin”, “elution”) is injected as single tokens via vocabulary extension. Complex domain descriptors such as retention times (“t_R=12.5 min”) and solvent gradients (“20%→75% EtOH”) are encoded as sequences of domain tokens combined with numeric sub-tokens.
Numerical parameters such as concentrations and flow rates are normalized and encoded as floating-point strings truncated to two decimals (“1.5”→“1.50”), with explicit unit tokens (“〈BV〉”, “〈h〉”, “〈%〉”) preserving contextual metadata. During prompt construction, ranges (“[0.5,1.5] BV/h”) are bracketed to denote continuous design variables. Learned embedding vectors are trained for newly introduced tokens, but no external knowledge-graph embeddings are utilized; all domain expertise is captured by the fine-tuned weights and a retrieval-augmented knowledge base.
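The numeric/unit encoding described above can be sketched as follows; the token names are illustrative stand-ins, since the actual extended vocabulary is internal to ChromLLM:

```python
def encode_param(value, unit):
    """Encode a numeric process parameter as a token sequence:
    a <num> marker, the value formatted to two decimals, then one
    token per unit component (token names are hypothetical)."""
    tokens = ["<num>", f"{value:.2f}"]
    tokens += [f"<{u}>" for u in unit.split("/")]
    return tokens
```

For example, a flow rate of 1.5 BV/h becomes the sequence `<num>`, `1.50`, `<BV>`, `<h>`.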
3. ChromR Multi-Agent System Integration
ChromLLM serves as the domain knowledge engine at the core of the ChromR multi-agent architecture, which comprises four functional agents orchestrated by a Qwen-series planner LLM on Dify 1.5.1, delivered via FastAPI.
| Agent | Function | Internal Components |
|---|---|---|
| A (“Expert”) | Knowledge answering | ChromLLM + RAG + literature retrieval |
| B (DOE) | Experimental design | Prompt parser + pyDOE2 executor |
| C (Exec) | Experimental execution | Python-API to hardware, serial/REST |
| D (Analysis) | Data analysis & optimization | Regression + NSGA-II/III optimizer |
- Agent A: Processes user queries by intent detection, passage retrieval via RAG, and answer synthesis (parameter, resin, solvent recommendations).
- Agent B: Converts parameter ranges and material batch ID inputs into experiment tables via prompt parsing and DOE methods (Definitive Screening Design, Central Composite, Box-Behnken).
- Agent C: Translates experiment tables into hardware commands for automated batch execution, using Python-to-serial API, RS-232 for pump/valve control, and REST interactions with a MySQL sensor backend.
- Agent D: Applies stepwise regression (p=0.05), computes design spaces, and runs NSGA-II/III multi-objective optimizations, yielding predictive models and Pareto fronts.
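The agent hand-offs can be sketched as a simple sequential dispatch; the agent names and payload types below are hypothetical simplifications of the Dify-orchestrated workflow, not ChromR's actual API:

```python
def run_workflow(query, agents):
    """Route a user request through the four ChromR agents in sequence."""
    plan = agents["expert"](query)       # A: knowledge answering, parameter ranges
    table = agents["doe"](plan)          # B: DOE experiment table
    results = agents["exec"](table)      # C: automated batch execution
    return agents["analysis"](results)   # D: regression + NSGA-II/III optimization
```

In the real system the planner LLM decides routing dynamically; the fixed sequence here only illustrates the design→execute→optimize flow.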
4. Experimental Design and Optimization Algorithms
ChromLLM-assisted ChromR employs rigorous methodologies for chromatographic process design:
- Experimental Design (DOE): Utilizes DSD for 6 factors, 2 dummies, and 3 center points (total: 20 runs). Experiment tables are generated programmatically (e.g., via pyDOE2).
```python
from pyDOE2 import dsd  # definitive screening design generator

X = dsd(n_factors=6, center=3)   # 6-factor DSD with 3 center points
assign_batches(X, batch_list)    # assign runs to material batch IDs
```
- Response Surface Modeling (RSM): Models responses using second-order polynomials that incorporate material properties:

$$y = \beta_0 + \sum_i \beta_i x_i + \sum_i \beta_{ii} x_i^2 + \sum_{i<j} \beta_{ij} x_i x_j + \varepsilon$$

Stepwise term selection (p < 0.05) and partial least squares (PLS) fitting are applied, and goodness-of-fit statistics are reported.
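The second-order model form can be illustrated with a two-factor example fitted by ordinary least squares; the paper's pipeline uses stepwise selection and PLS, so plain OLS here only demonstrates the polynomial structure:

```python
import numpy as np

def quadratic_design_matrix(X):
    """Second-order RSM terms for two factors: 1, x1, x2, x1^2, x2^2, x1*x2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

def fit_rsm(X, y):
    """Ordinary least-squares fit of the quadratic response surface."""
    beta, *_ = np.linalg.lstsq(quadratic_design_matrix(X), y, rcond=None)
    return beta
```

A full 3-level factorial in two factors (9 runs) is sufficient to estimate all six coefficients, which is why DSD-style designs with center points support quadratic models.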
- Multi-Objective Optimization: Targets simultaneous maximization of purity and productivity,

$$\max_{\mathbf{x}}\ \big(\mathrm{Purity}(\mathbf{x}),\ \mathrm{Productivity}(\mathbf{x})\big),$$

subject to lower bounds on purity and productivity and box constraints on the design variables.
Solved by NSGA-II with a population of 2000 and 1000 iterations, returning Pareto sets in less than one minute per batch.
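The core of NSGA-II's selection is non-dominated sorting; the sketch below extracts only the first Pareto front under maximization (NSGA-II additionally ranks successive fronts and applies crowding-distance selection):

```python
def pareto_front(points):
    """Return the non-dominated subset of (objective1, objective2, ...) tuples
    when maximizing every objective."""
    front = []
    for p in points:
        # p is dominated if some other point is at least as good in every objective
        dominated = any(
            all(q[i] >= p[i] for i in range(len(p))) and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return front
```

Each surviving point represents a purity/productivity trade-off that no other candidate improves on in both objectives at once.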
- Bayesian Optimization (Optional): A Gaussian Process surrogate with the Expected Improvement (EI) acquisition function,

$$\mathrm{EI}(\mathbf{x}) = \mathbb{E}\big[\max\big(0,\ f(\mathbf{x}) - f^{+}\big)\big],$$

where $f^{+}$ is the best objective value observed so far. This route was not employed in the EGBL case because the RSM models were already sufficiently accurate.
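EI has a closed form under a Gaussian posterior; a minimal sketch for maximization, assuming posterior mean mu and standard deviation sigma at a candidate point:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for maximization under a Gaussian posterior N(mu, sigma^2),
    with f_best the best objective value observed so far."""
    if sigma == 0.0:
        return max(0.0, mu - f_best)        # no uncertainty: improvement is deterministic
    z = (mu - f_best) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))            # standard normal cdf
    return (mu - f_best) * Phi + sigma * phi
```

EI balances exploitation (high mu) against exploration (high sigma), which is why it pairs naturally with a GP surrogate.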
5. Application Case Study and Performance Evaluation
ChromLLM’s efficacy was demonstrated in the chromatographic purification of Ginkgo biloba leaf extract (EGBL). The conventional process paradigm—literature review, resin screening, OFAT, DOE, and validation—requires approximately seven weeks. ChromR, powered by ChromLLM, reduced process development to approximately one week, effecting a seven-fold acceleration.
All fractions met Pharmacopoeia standards (FG ≥ 24%, TT ≥ 6%) and exceeded productivity thresholds (TT ≥ 50 mg/h, FG ≥ 200 mg/h). Model predictions stayed within 5% of experimental values over 10 optimization batches and 3 out-of-sample batches. Statistically significant improvements in purity and productivity were observed against baseline methods (paired t-test).
6. Innovations, Limitations, and Future Directions
ChromLLM introduces the first domain-specific LLM trained on 75 million chromatography tokens and ∼7,000 QA pairs with retrieval augmentation. Its coupling with a multi-agent system, local knowledge base, real-time hardware control, and end-to-end workflow forms a broadly applicable paradigm for chromatographic process design.
Limitations include:
- No support for continuous gradient elution (absence of online mixer hardware).
- Sample preparation and component analysis (e.g., HPLC-MS) remain manual.
- Restriction to polynomial RSM models; mechanistic or hybrid simulations are not currently integrated.
A plausible implication is that integration of online gradient control, automated sample preparation and analysis, and physics-based column models, together with extension to modalities such as preparative liquid chromatography (prep-HPLC) or supercritical fluid chromatography (SFC), will increase the generality and degree of automation of future ChromLLM deployments (Tang et al., 7 Jan 2026).