
ChromLLM: Optimizing Chromatography Workflows

Updated 14 January 2026
  • ChromLLM is a domain-specific LLM explicitly trained on chromatography literature and QA pairs to enhance experimental design and process optimization.
  • It leverages deep transformer architectures and retrieval-augmented generation to drive a multi-agent system that automates design, execution, and multi-objective optimization.
  • Demonstrated in Ginkgo biloba extract purification, ChromLLM reduced development time from seven weeks to one week while meeting strict purity and productivity standards.

ChromLLM is a domain-specific LLM expressly trained for chromatographic process design and optimization. Integrated into the ChromR platform, ChromLLM leverages deep transformer architectures and retrieval-augmented generation to automate and accelerate chromatographic separation workflows in pharmaceutical, chemical, and food industries. With a specialized corpus and fine-tuned QA capabilities, the model drives a multi-agent system that completes end-to-end experimental design, execution, and multi-objective optimization with markedly reduced reliance on human expertise and significant reduction in process development time (Tang et al., 7 Jan 2026).

1. Model Architecture and Training

ChromLLM adopts the Qwen2.5-14B foundation, a decoder-only transformer with approximately 14 billion parameters. The published Qwen2.5-14B configuration specifies 48 layers, a hidden size of 5120, and grouped-query attention with 40 query heads and 8 key-value heads per layer. Rotary positional embeddings (RoPE) are used for sequence encoding. Tokenization is handled via Qwen’s native byte-level byte pair encoding (BPE) with a vocabulary of roughly 152,000 tokens.

Domain-specific pretraining employs approximately 75 million tokens sourced from chromatography-related literature (arXiv, PubMed, Scopus), the Chinese Pharmacopoeia, method sections, and patents. The text extraction pipeline used Qwen-Long with a 10-million-token context on Alibaba Cloud Bailian, followed by regex cleaning and similarity-based deduplication. Training follows a standard next-token prediction objective, using the AdamW optimizer at a learning rate of $2 \times 10^{-5}$ over one epoch on Huawei Cloud ModelArts, with a post-pretraining cross-entropy loss function:

$$L_{pp} = -\sum_{i} \log P(w_i \mid w_1,\ldots,w_{i-1})$$
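To make the objective concrete, the following minimal sketch evaluates this loss for a toy sequence. The function name `next_token_loss` and the example probabilities are assumptions for illustration only; in actual training the probabilities come from the model's softmax output.

```python
import math


def next_token_loss(token_probs):
    """Sum of negative log-probabilities of the observed tokens.

    token_probs[i] stands for P(w_i | w_1, ..., w_{i-1}), the probability
    the model assigned to the token that actually occurred at position i.
    """
    return -sum(math.log(p) for p in token_probs)


# Toy sequence of three tokens (illustrative values, not from the source).
probs = [0.5, 0.25, 0.5]
loss = next_token_loss(probs)

print(loss)               # total loss, equal to log(16)
print(loss / len(probs))  # per-token average, the form usually reported
```

Because the loss is a sum of log terms, it equals the negative log of the product of the per-token probabilities, which is why the toy value comes out as log(16).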

Supervised fine-tuning (SFT) is accomplished via LoRA (Low-Rank Adaptation) on 6,801 curated question–answer pairs (multi-turn dialogues), executed in Llama-Factory with a learning rate of $5 \times 10^{-5}$ and a LoRA rank of 8, converging to a final SFT loss of approximately 0.8 on held-out data:

$$L_{sft} = -\sum_{(q,a)} \sum_t \log P(a_t \mid q, a_{<t})$$
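The low-rank adaptation used for this fine-tuning step can be sketched as follows. This is a generic illustration of the LoRA update $W + (\alpha / r)\,BA$, not the Llama-Factory implementation; the scaling factor `alpha`, the toy matrices, and the 1024 × 1024 layer size are assumptions, since the source states only the rank of 8.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    w is d_out x d_in (frozen), b is d_out x rank, a is rank x d_in.
    Only a and b are trained.
    """
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_trainable_params(d_out, d_in, rank):
    """Number of trainable parameters added to one adapted weight matrix."""
    return rank * (d_in + d_out)


# Rank-1 toy example so the arithmetic is easy to follow.
w = [[1, 0], [0, 1]]
b = [[1], [2]]
a = [[3, 4]]
adapted = lora_effective_weight(w, a, b, alpha=2, rank=1)

# Hypothetical 1024 x 1024 projection adapted at the rank reported in the text.
added = lora_trainable_params(1024, 1024, rank=8)
print(adapted, added, 1024 * 1024)
```

The parameter count shows why the approach is cheap: the rank-8 factors add 16,384 trainable values to a matrix that holds over a million frozen ones.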

2. Tokenization and Chromatographic Knowledge Representation

ChromLLM’s tokenization strategy extends the base BPE vocabulary to capture chromatography expertise. Numbers are segmented into special tokens (“〈num〉”, “.”, “%”) and digit subsequences. Chromatography-specific terms (e.g., “BV/h”, “resin”, “elution”) are added as single tokens via vocabulary extension. Complex domain descriptors such as retention times (“t_R=12.5 min”) and solvent gradients (“20%→75% EtOH”) are encoded as sequences of domain tokens combined with numeric sub-tokens.

Numerical parameters such as concentrations and flow rates are normalized and encoded as floating-point strings with two decimal places (e.g., “1.50”), with explicit unit tokens (“〈BV〉”, “〈h〉”, “〈%〉”) preserving contextual metadata. During prompt construction, ranges (“[0.5,1.5] BV/h”) are bracketed to denote continuous design variables. Embedding vectors are learned for the newly introduced tokens, but no external knowledge-graph embeddings are used; all domain expertise resides in the fine-tuned weights and a retrieval-augmented knowledge base.
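As a rough illustration of the segmentation described in this section, the sketch below keeps a few domain terms as single tokens and splits numbers into digits, “.”, and “%”. The term list, the regular expression, and the function name `sketch_tokenize` are assumptions for illustration. The actual ChromLLM tokenizer extends Qwen’s BPE vocabulary, and the source does not specify where the “〈num〉” marker is placed, so that marker is omitted here.

```python
import re

# Hypothetical domain vocabulary: only the three examples named in the text.
DOMAIN_TERMS = ["BV/h", "resin", "elution"]

# Domain terms are tried first (longest first), then single digits,
# then "." and "%", then runs of letters, then any other visible character.
_PATTERN = re.compile(
    "|".join(re.escape(t) for t in sorted(DOMAIN_TERMS, key=len, reverse=True))
    + r"|\d|[.%]|[A-Za-z_]+|\S"
)


def sketch_tokenize(text):
    """Split text into illustrative tokens; whitespace is dropped."""
    return _PATTERN.findall(text)


flow_tokens = sketch_tokenize("elution at 1.50 BV/h")
gradient_tokens = sketch_tokenize("20%→75% EtOH")
print(flow_tokens)
print(gradient_tokens)
```

Putting the domain terms ahead of the generic alternatives is what keeps “BV/h” from being split at the slash.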

3. ChromR Multi-Agent System Integration

ChromLLM serves as the domain knowledge engine at the core of the ChromR multi-agent architecture, which comprises four functional agents orchestrated by a Qwen-series planner LLM on Dify 1.5.1, delivered via FastAPI.

| Agent | Function | Internal Components |
| --- | --- | --- |
| A (“Expert”) | Knowledge answering | ChromLLM + RAG + literature retrieval |
| B (DOE) | Experimental design | Prompt parser + pyDOE2 executor |
| C (Exec) | Experimental execution | Python-API to hardware, serial/REST |
| D (Analysis) | Data analysis & optimization | Regression + NSGA-II/III optimizer |

  • Agent A: Processes user queries by intent detection, passage retrieval via RAG, and answer synthesis (parameter, resin, solvent recommendations).
  • Agent B: Converts parameter ranges and material batch ID inputs into experiment tables via prompt parsing and DOE methods (Definitive Screening Design, Central Composite, Box-Behnken).
  • Agent C: Translates experiment tables into hardware commands for automated batch execution, using Python-to-serial API, RS-232 for pump/valve control, and REST interactions with a MySQL sensor backend.
  • Agent D: Applies stepwise regression (p=0.05), computes design spaces, and runs NSGA-II/III multi-objective optimizations, yielding predictive models and Pareto fronts.
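One small piece of Agent B's job, turning a bracketed range into physical factor settings, might look like the sketch below. The function names `parse_range` and `decode_level` and the coded-level convention (−1, 0, +1 mapped linearly onto the range) are assumptions for illustration; the source does not describe the prompt parser's internals.

```python
import re

_RANGE = re.compile(
    r"\s*\[\s*([-+]?\d*\.?\d+)\s*,\s*([-+]?\d*\.?\d+)\s*\]\s*(.*)"
)


def parse_range(spec):
    """Parse a string such as '[0.5,1.5] BV/h' into (low, high, unit)."""
    m = _RANGE.fullmatch(spec)
    if m is None:
        raise ValueError(f"not a bracketed range: {spec!r}")
    low, high = float(m.group(1)), float(m.group(2))
    if low >= high:
        raise ValueError("lower bound must be below upper bound")
    return low, high, m.group(3).strip()


def decode_level(coded, low, high):
    """Map a coded DOE level in [-1, 1] to a physical value in [low, high]."""
    center = (low + high) / 2
    half_width = (high - low) / 2
    return center + coded * half_width


low, high, unit = parse_range("[0.5,1.5] BV/h")
settings = [decode_level(level, low, high) for level in (-1, 0, 1)]
print(settings, unit)
```

Working in coded levels keeps the design matrix independent of units, so the same table can be reused when the parameter ranges change.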

4. Experimental Design and Optimization Algorithms

ChromLLM-assisted ChromR employs rigorous methodologies for chromatographic process design:

  • Experimental Design (DOE): Uses a Definitive Screening Design (DSD) with 6 factors, 2 dummy factors, and 3 center points (20 runs in total). Experiment tables are generated programmatically (e.g., via pyDOE2).
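
    # Illustrative pseudocode: `dsd` and `assign_batches` are placeholders.
    # pyDOE2's documented generators do not appear to include a DSD routine.
    from pyDOE2 import dsd
    X = dsd(n_factors=6, center=3)   # design matrix: 6 factors, 3 center points
    assign_batches(X, batch_list)    # attach a material batch ID to each run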
  • Response Surface Modeling (RSM): Models responses using second-order polynomials that incorporate material properties:

    $$Y = b_0 + \sum_{i} b_i X_i + \sum_{i \leq j} b_{ij} X_i X_j + \sum_k c_k Z_k$$

    Terms are selected stepwise ($p < 0.05$) and fitted by partial least squares (PLS); reported $R^2$ values range from 0.85 to 0.96.

  • Multi-Objective Optimization: Targets simultaneous maximization of purity and productivity:

    $$\max F(x) = [f_1(x), f_2(x)] = [\text{purity}, \text{productivity}]$$

    Subject to constraints on total flavonol glycosides (FG) and terpene trilactones (TT):

    $$f_{\mathrm{purity,TT}} \geq 6\%,\quad f_{\mathrm{purity,FG}} \geq 24\%,\quad f_{\mathrm{prod,TT}} \geq 50\;\mathrm{mg/h},\quad f_{\mathrm{prod,FG}} \geq 200\;\mathrm{mg/h}$$

    Solved by NSGA-II with a population of 2000 and 1000 iterations, returning Pareto sets in less than one minute per batch.

  • Bayesian Optimization (Optional): Uses a Gaussian process surrogate with the Expected Improvement (EI) acquisition function, where $x^+$ denotes the best point observed so far:

    $$a(x) = \mathbb{E}[\max(0, f(x) - f(x^+))]$$

    This step was not used in the Ginkgo biloba leaf extract (EGBL) case because the RSM models already performed well.
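For the design-generation step in this section, the sketch below shows one standard way to build a definitive screening design without a library routine: fold over a conference matrix and append center runs. It uses the Paley construction for 8 columns (6 factors plus 2 dummy factors). The function names are assumptions, and this is not necessarily how ChromR generates its tables. The fold-over contributes 16 runs, so three center runs give 19 and four give 20; the source does not spell out the accounting behind its 20-run total, so the number of center runs is left as a parameter.

```python
def paley_conference_matrix(q=7):
    """Order-(q + 1) conference matrix via the Paley construction.

    Valid here for a prime q with q % 4 == 3; q = 7 gives order 8.
    """
    residues = {(k * k) % q for k in range(1, q)}

    def chi(a):
        # Quadratic character modulo q: 0, +1 (residue) or -1 (non-residue).
        a %= q
        if a == 0:
            return 0
        return 1 if a in residues else -1

    n = q + 1
    c = [[0] * n for _ in range(n)]
    for j in range(1, n):
        c[0][j] = 1
        c[j][0] = -1
    for i in range(q):
        for j in range(q):
            c[i + 1][j + 1] = chi(i - j)
    return c


def definitive_screening_design(n_center=1):
    """Stack C, -C and n_center all-zero rows (coded levels -1, 0, +1)."""
    c = paley_conference_matrix(7)
    mirrored = [[-v for v in row] for row in c]
    centers = [[0] * len(c) for _ in range(n_center)]
    return c + mirrored + centers


design = definitive_screening_design(n_center=3)
columns = list(zip(*design))
print(len(design), len(columns))
```

Each non-center run holds exactly one factor at its middle level, and the columns are mutually orthogonal, which is what lets main effects be estimated independently of one another.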

5. Application Case Study and Performance Evaluation

ChromLLM’s efficacy was demonstrated in the chromatographic purification of Ginkgo biloba leaf extract (EGBL). The conventional workflow (literature review, resin screening, one-factor-at-a-time (OFAT) experiments, DOE, and validation) requires approximately seven weeks. ChromR, powered by ChromLLM, reduced process development to approximately one week, a roughly seven-fold acceleration.

All fractions met Pharmacopoeia standards (FG ≥ 24%, TT ≥ 6%) and exceeded productivity thresholds (TT ≥ 50 mg/h, FG ≥ 200 mg/h). Model predictions agreed with experimental results to within 5% over 10 optimization batches and 3 out-of-sample batches. Statistically significant improvements in purity and productivity were observed against baseline methods ($p < 0.01$, paired t-test).
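The acceptance criteria above can be expressed as a feasibility check followed by a non-dominated filter. The sketch below is a plain Pareto filter, not NSGA-II itself; the four candidate operating points are invented for illustration and are not results from the source, and the dictionary keys are hypothetical names.

```python
# Thresholds quoted in the text (purity in %, productivity in mg/h).
THRESHOLDS = {
    "purity_TT": 6.0,
    "purity_FG": 24.0,
    "prod_TT": 50.0,
    "prod_FG": 200.0,
}
OBJECTIVES = tuple(THRESHOLDS)


def is_feasible(candidate):
    """True if every response meets or exceeds its threshold."""
    return all(candidate[key] >= limit for key, limit in THRESHOLDS.items())


def dominates(a, b, objectives=OBJECTIVES):
    """True if a is no worse than b everywhere and better somewhere."""
    return (all(a[k] >= b[k] for k in objectives)
            and any(a[k] > b[k] for k in objectives))


def pareto_front(candidates):
    """Feasible candidates not dominated by any other feasible candidate."""
    feasible = [c for c in candidates if is_feasible(c)]
    return [c for c in feasible
            if not any(dominates(other, c) for other in feasible)]


# Invented operating points, for illustration only.
candidates = [
    {"name": "A", "purity_TT": 6.5, "purity_FG": 25.0, "prod_TT": 60.0, "prod_FG": 220.0},
    {"name": "B", "purity_TT": 7.0, "purity_FG": 26.0, "prod_TT": 55.0, "prod_FG": 210.0},
    {"name": "C", "purity_TT": 6.2, "purity_FG": 24.5, "prod_TT": 58.0, "prod_FG": 215.0},
    {"name": "D", "purity_TT": 5.8, "purity_FG": 27.0, "prod_TT": 70.0, "prod_FG": 250.0},
]
front = pareto_front(candidates)
print([c["name"] for c in front])
```

In this toy set, D is rejected for missing the TT purity threshold and C is dominated by A, leaving A and B as the trade-off between productivity and purity.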

6. Innovations, Limitations, and Future Directions

ChromLLM is presented as the first chromatography-specific LLM, trained on approximately 75 million domain tokens and roughly 7,000 QA pairs, with retrieval augmentation. Its coupling with a multi-agent system, a local knowledge base, real-time hardware control, and an end-to-end workflow is proposed as a broadly applicable paradigm for process design.

Limitations include:

  • No support for continuous gradient elution (absence of online mixer hardware).
  • Manual sample preparation and component analysis (e.g., HPLC-MS).
  • Restriction to polynomial RSM models; mechanistic or hybrid simulations are not currently integrated.

A plausible implication is that integrating online gradient control, automated sample preparation and analysis, and physics-based column models, and extending the approach to modalities such as preparative liquid chromatography (prep-HPLC) or supercritical fluid chromatography (SFC), would increase the generality and degree of automation of future ChromLLM deployments (Tang et al., 7 Jan 2026).
