
TinyChemVL: Efficient Chemical VLM

Updated 16 November 2025
  • TinyChemVL is a chemical vision-language model that integrates visual and textual data to enable accurate molecular and reaction-level reasoning.
  • It employs a novel visual token reduction mechanism that decreases tokens by up to 16x, eliminating redundancy while preserving key chemical features.
  • The model achieves superior performance with high accuracy in reaction recognition and prediction tasks, showing improved metrics over state-of-the-art baselines.

TinyChemVL is a chemical vision-language model (VLM) designed to efficiently integrate visual and textual information for complex reasoning in cheminformatics. Addressing the computational inefficiency and limited task scope of prior chemical VLMs, TinyChemVL introduces an architecture and training paradigm that reduces visual token redundancy and expands capabilities beyond molecular-level tasks to encompass reaction-level reasoning. The model, with approximately 4 billion parameters, demonstrates improved accuracy and speed relative to state-of-the-art baselines, particularly through its visual token reduction pipeline and the ChemRxn-V benchmark for reaction-centric evaluation.

1. Model Architecture and Modality Fusion

TinyChemVL employs the InternVL2.5-4B backbone, partitioned into a vision encoder and a language model:

  • Vision Encoder: Utilizes a ViT-MLP trunk (InternViT-300M-448px-V2.5), incorporating token merge/prune layers between the self-attention and feedforward modules to facilitate sequence-length reduction.
  • Language Model: Deploys the InternLM2.5 LLM, implementing the full Transformer decoder stack and comprising roughly 3.6 billion parameters.
  • Modality Fusion: Visual tokens $\{v_1, \dots, v_T\}$, projected to match the LLM hidden size, are prepended to the tokenized text instruction:

$$\text{LLM input} = [\langle \text{VIS} \rangle, v_1, \ldots, v_T, \langle \text{TXT} \rangle, t_1, \ldots, t_N]$$

enabling joint self-attention over both modalities and supporting cross-modal chemical reasoning.
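
The fusion step can be sketched compactly. The following is a minimal, illustrative PyTorch fragment, not the released implementation; the module names (vision_encoder, projector, embed_tokens) and the learned marker embeddings are assumptions for illustration.

```python
import torch

def build_llm_input(image, text_ids, vision_encoder, projector, embed_tokens,
                    vis_marker, txt_marker):
    """Prepend projected visual tokens to the text embedding sequence.

    vis_marker / txt_marker: learned [1, 1, d_llm] embeddings for <VIS> / <TXT>.
    """
    vis_feats = vision_encoder(image)      # [B, T, d_vis] visual tokens
    vis_tokens = projector(vis_feats)      # MLP projection -> [B, T, d_llm]
    txt_tokens = embed_tokens(text_ids)    # [B, N, d_llm] text embeddings
    B = vis_tokens.size(0)
    # [<VIS>, v_1..v_T, <TXT>, t_1..t_N]: one sequence, so LLM self-attention
    # spans both modalities jointly.
    return torch.cat([vis_marker.expand(B, -1, -1), vis_tokens,
                      txt_marker.expand(B, -1, -1), txt_tokens], dim=1)
```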

2. Visual Token Reduction Mechanism

A central feature of TinyChemVL is its visual token reduction, which decreases the token burden by a factor of 16 relative to ChemVLM (from ~1280 to ~80 tokens per 800×800 image). The process consists of several sequential steps:

| Step | Operation | Key formula / policy |
| --- | --- | --- |
| a | Patch extraction & tiling | Each image is split into 448×448 tiles plus a thumbnail; each tile yields $(448/28)^2 = 256$ tokens |
| b | Adaptive token scoring | $A = \mathrm{Softmax}(QK^\top/\sqrt{d})$, $\mathrm{Score}_i = \frac{A_{1,i+1}\|V_{i+1}\|}{\sum_j A_{1,j+1}\|V_{j+1}\|}$ |
| c | Prune vs. merge decision | $S_\ell = \mathrm{var}(\{\mathrm{Score}_i\})$; prune if $S_\ell \leq \tau$, else merge ($\tau = 10^{-5}$) |
| d | Top-$K$ token pruning | Retain the $K$ highest-scoring tokens; discard the remainder |
| e | Bipartite soft matching | Pair tokens by cosine similarity and merge each pair by weighted average |
| f | Proportional attention | $A = \mathrm{Softmax}(QK^\top/\sqrt{d} + \log s)$, where $s$ tracks the original-token count per merged token |

This sequence adaptively prunes uninformative background tokens while merging those representing similar molecular substructures, culminating in an efficient summarization of chemical images.
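
The scoring and routing logic (steps b–f) can be sketched as follows. This is an illustrative reconstruction from the formulas in the table above, not the paper's released code; the greedy merge is a simplified stand-in for full bipartite soft matching, and all function names are assumptions.

```python
import torch
import torch.nn.functional as F

def score_tokens(attn, values):
    """Step b: score patch tokens by CLS attention weighted by value norms.

    attn:   [B, H, 1+T, 1+T] = Softmax(QK^T / sqrt(d)), CLS token at index 0
    values: [B, 1+T, d] value vectors
    """
    cls_attn = attn.mean(dim=1)[:, 0, 1:]        # A_{1,i+1}, averaged over heads
    v_norm = values[:, 1:, :].norm(dim=-1)       # ||V_{i+1}||
    raw = cls_attn * v_norm
    return raw / raw.sum(dim=-1, keepdim=True)   # normalized Score_i

def reduce_tokens(tokens, scores, keep_k, tau=1e-5):
    """Steps c-e: per the paper's policy, prune when score variance is <= tau,
    otherwise merge down to keep_k tokens."""
    out = []
    for b in range(tokens.size(0)):
        s, x = scores[b], tokens[b]
        if s.var() <= tau:                         # Step c: variance test
            out.append(x[s.topk(keep_k).indices])  # Step d: top-K pruning
        else:
            # Step e (simplified): greedily average the most similar pair until
            # keep_k tokens remain -- a stand-in for bipartite soft matching.
            while x.size(0) > keep_k:
                sim = F.normalize(x, dim=-1) @ F.normalize(x, dim=-1).T
                sim.fill_diagonal_(-1.0)
                i, j = divmod(sim.argmax().item(), sim.size(1))
                merged = 0.5 * (x[i] + x[j])
                rest = [k for k in range(x.size(0)) if k not in (i, j)]
                x = torch.cat([x[rest], merged.unsqueeze(0)], dim=0)
            out.append(x)
    return torch.stack(out)

def proportional_attention(q, k, sizes, d):
    """Step f: bias logits by log(s), where s counts the original tokens
    absorbed into each merged token."""
    logits = q @ k.transpose(-2, -1) / d ** 0.5 + sizes.log().unsqueeze(-2)
    return logits.softmax(dim=-1)
```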

3. Molecular and Reaction-Level Task Array

TinyChemVL is trained and evaluated on a two-tier task suite, encompassing both molecular-level and reaction-level tasks:

  • Molecular-Level:
  1. Molecule recognition (img2smiles): $f_{\text{recog}}: I \to \text{SMILES}$, trained with cross-entropy loss.
  2. Property prediction (img2property): $f_{\text{prop}}: I \to \mathbb{R}^7$ (MW, LogP, TPSA, HBD, HBA, RB, QED), trained with summed MSE; these targets are directly computable with RDKit, as sketched at the end of this section.
  3. Molecular image generation: img2img seeks a molecule with improved LogP; property2img targets specified chemical properties via code generation.
  • Reaction-Level:
  1. Reaction recognition (rxn-recognition): $f_{\text{rxn-rec}}: I \to \text{SMILES}_R > \text{SMILES}_S > \text{SMILES}_P$, inferring reactants, reagents/solvents, and products in reaction-SMILES notation.
  2. Reaction prediction (rxn-prediction): $f_{\text{rxn-pred}}: I_{\text{reactants}} \to \text{SMILES}_{\text{products}}$, generating product SMILES from reactant images.

This expanded scope supports both image-based structure recognition and multi-component chemical reasoning.
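
As referenced above, the seven property targets of img2property are directly computable, which makes the task automatically verifiable. A minimal sketch using standard RDKit descriptor calls (the exact descriptor variants used by the paper are an assumption based on the abbreviations above):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def molecular_properties(smiles: str) -> dict:
    """Compute the seven img2property targets (MW, LogP, TPSA, HBD, HBA, RB, QED)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return {
        "MW":   Descriptors.MolWt(mol),              # molecular weight
        "LogP": Descriptors.MolLogP(mol),            # Crippen logP
        "TPSA": Descriptors.TPSA(mol),               # topological polar surface area
        "HBD":  Descriptors.NumHDonors(mol),         # hydrogen-bond donors
        "HBA":  Descriptors.NumHAcceptors(mol),      # hydrogen-bond acceptors
        "RB":   Descriptors.NumRotatableBonds(mol),  # rotatable bonds
        "QED":  QED.qed(mol),                        # drug-likeness score
    }

# Example: aspirin
print(molecular_properties("CC(=O)Oc1ccccc1C(=O)O"))
```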

4. ChemRxn-V Benchmark Design and Evaluation Metrics

The ChemRxn-V benchmark operationalizes reaction-level evaluation for VLMs with two 5,000-sample test sets:

  • Recognition: Each scheme’s reactants, reagents/solvents, and products are independently predicted and scored with RDKit-fingerprint Tanimoto similarity, averaged and weighted by molecule count; Exact Match (EM) is reported for completely correct schemes.
  • Prediction: Outputs are compared to ground-truth product molecules using average Tanimoto similarity and Tanimoto@1.0 (the percentage of predictions matching the reference exactly).
  • Complexity Stratification: Sample selection is balanced across reaction sizes (simple: 2–3 molecules; complex: >5 molecules).

This structure enables nuanced quantification of both recognition fidelity and predictive capability across chemically diverse scenarios.
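
The benchmark’s similarity scoring maps onto standard RDKit calls. Below is a minimal sketch of the per-molecule Tanimoto metric and the two aggregate scores for the prediction track, assuming RDKit’s default topological fingerprint; the molecule-count weighting used in the recognition track is omitted:

```python
from rdkit import Chem, DataStructs

def tanimoto(pred_smiles: str, gt_smiles: str) -> float:
    """RDKit-fingerprint Tanimoto similarity between prediction and reference."""
    pred, gt = Chem.MolFromSmiles(pred_smiles), Chem.MolFromSmiles(gt_smiles)
    if pred is None or gt is None:
        return 0.0  # an unparsable prediction scores zero
    fp_pred, fp_gt = Chem.RDKFingerprint(pred), Chem.RDKFingerprint(gt)
    return DataStructs.TanimotoSimilarity(fp_pred, fp_gt)

def evaluate(preds: list[str], refs: list[str]) -> dict:
    """Average Tanimoto similarity plus Tanimoto@1.0 (exact-match rate)."""
    sims = [tanimoto(p, r) for p, r in zip(preds, refs)]
    return {
        "avg_sim": sum(sims) / len(sims),
        "tanimoto@1.0": sum(s == 1.0 for s in sims) / len(sims),
    }
```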

5. Experimental Results: Accuracy and Efficiency

Performance summaries substantiate TinyChemVL’s improvements over prior models:

| Task | Best Baseline | TinyChemVL (4B) |
| --- | --- | --- |
| Molecule recognition | ChemVLM-8B: Avg Sim = 81.7%, Tanimoto@1.0 = 57.7% | Avg Sim = 91.2%, Tanimoto@1.0 = 77.4% |
| SMILES OCR | ChemDFM-X (13B): Avg Sim = 90.9%, Tanimoto@1.0 = 77.6% | Avg Sim = 89.5%, Tanimoto@1.0 = 75.6% |
| Property prediction (lower is better) | ChemMLLM: MW = 789.7, QED = 0.008 | MW = 488.0, QED = 0.003 |
| Image generation | GPT-4o: MW MSE = 7,633 | MW MSE = 1,620 |
| Reaction recognition | ChemDFM-X: Avg Sim = 28.3%, EM = 3.2% | Avg Sim = 93.4%, EM = 67.9% |
| Reaction prediction | n/a | Avg Sim = 78.9%, Tanimoto@1.0 = 52.4% |
| Inference speed | ChemVLM-8B: 7.41 samples/s (896 visual tokens) | 11.84 samples/s (108 visual tokens) |

Notably, TinyChemVL’s token reduction yields a 30% inference speed-up and a 68% reduction in full SFT time relative to comparable baselines, while matching or exceeding their accuracy, particularly on the reaction-level tasks that prior models did not address.

6. Ablations and Token Reduction Analysis

Ablation studies probe the impact of visual token counts:

  • At 16 tokens per image (the default), ChemOCR Tanimoto@1.0 = 77.4%, rxn-recognition = 62.7%, rxn-prediction = 52.4%.
  • At 4 tokens per image, metrics decrease only marginally (ChemOCR: 76.2%, rxn-rec: 59.5%, rxn-pred: 50.1%), suggesting robustness to aggressive token sparsification, although ~16 tokens remains optimal for reaction-level reasoning.

Qualitative visualizations (Figure 1) reveal that token pruning consistently eliminates background noise, while token merging preserves chemically salient features even when combining spatially distant but structurally similar molecular subregions—supporting the premise that token-adaptive compression does not compromise core reasoning capacity.

7. Context and Implications

TinyChemVL demonstrates that token-efficient VLM architectures can achieve state-of-the-art performance on both molecular and reaction tasks, suggesting a pathway toward scalable chemical AI that maintains both computational tractability and high-fidelity chemical reasoning. By combining visual pruning/merging with an expanded task definition (through ChemRxn-V), this model sets a precedent for future chemical vision-language systems to optimize both architecture and benchmark design for scientific utility. A plausible implication is that further architectural innovations in token selection and fusion may enable even broader application domains while reducing the computational barriers associated with large-scale chemical image analysis.
