TinyChemVL: Efficient Chemical VLM
- The paper introduces a novel chemical vision-language model that leverages adaptive token reduction to sharply cut the number of visual tokens processed while preserving key molecular detail.
- It employs a multimodal transformer architecture that fuses high-resolution visual encoding with an autoregressive language decoder to generate SMILES and natural language outputs.
- Experimental results show state-of-the-art performance on reaction recognition and prediction tasks with improved inference speed and reduced training time compared to baseline models.
TinyChemVL is a chemical vision-language model (VLM) designed to address the computational inefficiency and limited reasoning scope of prior VLMs applied to the chemical domain. By integrating an adaptive visual token reduction strategy and supporting reaction-level tasks via a multimodal transformer architecture, TinyChemVL achieves state-of-the-art accuracy on both molecular-level tasks and complex reaction recognition and prediction, with marked gains in efficiency. The model’s innovations, together with the ChemRxn-V benchmark for vision-based reaction tasks, aim to advance vision-driven chemical informatics and automated reaction understanding (Zhao et al., 9 Nov 2025).
1. Model Architecture
TinyChemVL builds on the ViT-MLP-LLM paradigm with an InternVL2.5-4B backbone, resulting in a 4 billion-parameter multimodal transformer. The architecture comprises:
- Vision Encoder: A Vision Transformer processes high-resolution chemical images with a dynamic-resolution scheme that uses 448×448 tiles plus a downsampled thumbnail (see the tiling sketch after this list), preserving essential molecular detail while mitigating non-informative backgrounds.
- Visual-Language Fusion: A lightweight projection layer aligns visual patch embeddings to the LLM’s feature space. Visual tokens are prepended to the language input and fused through the decoder’s standard self-attention over the concatenated sequence.
- Language Decoder: A 2.5B-parameter autoregressive LLM (InternLM) capable of generating SMILES, code, or natural language outputs.
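As a rough illustration of the dynamic-resolution input described above, the sketch below splits an image into 448×448 tiles plus a downsampled thumbnail. It assumes PIL; `max_tiles` and the grid-snapping rule are simplifications of InternVL's aspect-ratio search, not the exact preprocessing pipeline.

```python
from PIL import Image

TILE = 448  # tile side length consumed by the vision encoder

def tile_image(img: Image.Image, max_tiles: int = 6):
    """Split an image into 448x448 tiles plus a global thumbnail.

    Simplified: snaps the image to the nearest tile-aligned grid under a
    tile budget; InternVL's actual policy searches candidate aspect ratios.
    """
    w, h = img.size
    cols = max(1, min(round(w / TILE), max_tiles))
    rows = max(1, min(round(h / TILE), max_tiles // cols))
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = img.resize((TILE, TILE))  # downsampled global view
    return tiles + [thumbnail]
```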
The only modification to the core transformer lies in the visual token reduction block, applied between the attention and feed-forward sub-layers. The novel “proportional attention” mechanism re-weights each token by its cumulative representation (“token mass”) following merge/prune operations, modifying the scaled softmax to

$$\mathrm{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + \log s\right) V,$$

where $d$ is the key dimension and $s$ tracks token mass after reduction.
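This re-weighting is compact in code. Below is a minimal PyTorch sketch of proportional attention, assuming a single head and a `token_mass` vector counting how many original tokens each surviving token represents; the function name and shapes are illustrative, not the paper's implementation.

```python
import torch

def proportional_attention(q, k, v, token_mass):
    """Scaled dot-product attention with a log token-mass bias on the keys.

    q, k, v: (num_tokens, d) tensors for a single head.
    token_mass: (num_tokens,) float tensor; entry i counts how many original
    tokens the i-th surviving token represents after merge/prune.
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5   # standard scaled dot product
    logits = logits + token_mass.log()            # bias each key by log(mass)
    return torch.softmax(logits, dim=-1) @ v
```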
2. Visual Token Reduction Mechanism
The token reduction strategy in TinyChemVL combines adaptive merge/prune operations for efficient yet information-preserving processing:
- Token Scoring: Each image token receives a score
$$s_i = \Big(\frac{1}{N}\sum_{j=1}^{N} A_{j,i}\Big)\,\lVert V_i \rVert,$$
where $A$ is the attention map and $V_i$ the value embedding of token $i$.
- Token Pruning: Tokens are pruned by retaining the top-K based on score.
- Token Merging (BSM): Tokens are divided into two sets and merged using Bipartite Soft Matching, pairing tokens by cosine similarity and averaging matched pairs weighted by token mass.
- Proportional Attention: After reduction, “token mass” is updated and employed in attention scaling.
- Adaptive Policy: The operation in each encoder layer, prune or merge, is determined by the variance $\sigma^2$ of the token scores $s_i$ in that layer, intuitively pruning when scores are clearly separated and merging when tokens look interchangeable (sketched below):
$$\text{op} = \begin{cases} \text{prune}, & \sigma^2(s) > \tau \\ \text{merge}, & \sigma^2(s) \le \tau, \end{cases}$$
with threshold $\tau$.
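A simplified sketch of one per-layer reduction step follows, assuming PyTorch. The score follows the reconstruction above (mean attention received times value norm), the merge is a minimal Bipartite Soft Matching in the spirit of ToMe, and the gate's direction (prune on high score variance) is an assumption; names like `reduce_layer` are illustrative.

```python
import torch
import torch.nn.functional as F

def token_scores(attn, v):
    """s_i = (mean attention received by token i) * ||V_i||."""
    return attn.mean(dim=0) * v.norm(dim=-1)

def bipartite_soft_merge(x, mass, r):
    """Merge the r most similar token pairs (minimal ToMe-style BSM).

    Tokens alternate into sets A and B; each A-token is matched to its most
    cosine-similar B-token, and the r best pairs are mass-weighted averaged.
    """
    a, b = x[0::2].clone(), x[1::2].clone()
    ma, mb = mass[0::2].clone(), mass[1::2].clone()   # mass is a float tensor
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T
    best, match = sim.max(dim=-1)                  # best B partner per A-token
    merged = best.topk(min(r, a.size(0))).indices  # r most similar A-tokens
    keep = torch.ones(a.size(0), dtype=torch.bool)
    keep[merged] = False
    for i in merged.tolist():                      # fold A-tokens into partners
        j = match[i]
        total = ma[i] + mb[j]
        b[j] = (ma[i] * a[i] + mb[j] * b[j]) / total
        mb[j] = total
    return torch.cat([a[keep], b]), torch.cat([ma[keep], mb])

def reduce_layer(x, mass, attn, v, k, r, tau):
    """Variance-gated policy: prune when score variance exceeds tau, else merge."""
    s = token_scores(attn, v)
    if s.var() > tau:                              # peaked scores: clear winners
        idx = s.topk(k).indices
        return x[idx], mass[idx]
    return bipartite_soft_merge(x, mass, r)        # flat scores: merge instead
```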
This process reduces the average visual token count from approximately 896 (as in ChemVLM/InternVL) to 108, i.e., to roughly 1/8 of the original count overall and 1/16 within molecular regions.
| Method | Tokens per Image | Token Reduction Ratio |
|---|---|---|
| InternVL2.5-4B, ChemVLM | ∼896 | Baseline |
| TinyChemVL | ∼108 | 1/8 overall, 1/16 molecular |
3. Reaction-Level Tasks and ChemRxn-V Benchmark
TinyChemVL expands the scope of chemical VLMs by targeting reaction-level tasks and introducing the ChemRxn-V benchmark, comprising:
- Reaction Recognition: Ingests full reaction images; the output is a SMILES-encoded reaction with components separated by “>” (reactants > reagents+solvents > products) and individual molecules separated by “.”. Evaluation uses a weighted average of RDKit fingerprint similarities and full-reaction SMILES exact match (EM).
- Reaction Prediction: Inputs include only reactants, reagents, and solvents (products withheld); the output is the products’ SMILES string. Metrics include average fingerprint similarity and Tanimoto@1.0.
ChemRxn-V delivers 5,000 stratified test samples per task, applying Mol-Instructions evaluation: component-wise fingerprint similarity and overall weighted averages. Both tasks are trained with the standard autoregressive cross-entropy loss
$$\mathcal{L} = -\sum_{t} \log p_\theta\!\left(y_t \mid y_{<t},\, x\right),$$
where $x$ is the input image and $y$ the target SMILES sequence.
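Under the hood, these metrics reduce to standard RDKit operations. The following sketch (assuming RDKit is installed; function names are illustrative) computes Morgan-fingerprint Tanimoto similarity, canonical-SMILES exact match, and splits a “>”-separated reaction string into components.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def tanimoto(smiles_pred: str, smiles_ref: str) -> float:
    """Morgan-fingerprint Tanimoto similarity between two SMILES strings."""
    mols = [Chem.MolFromSmiles(s) for s in (smiles_pred, smiles_ref)]
    if any(m is None for m in mols):     # unparsable SMILES scores zero
        return 0.0
    fp_pred, fp_ref = (AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)
                       for m in mols)
    return DataStructs.TanimotoSimilarity(fp_pred, fp_ref)

def exact_match(smiles_pred: str, smiles_ref: str) -> bool:
    """Canonical-SMILES exact match (closely tracks Tanimoto@1.0)."""
    a, b = Chem.MolFromSmiles(smiles_pred), Chem.MolFromSmiles(smiles_ref)
    return None not in (a, b) and Chem.MolToSmiles(a) == Chem.MolToSmiles(b)

def split_reaction(rxn_smiles: str):
    """Split 'reactants>reagents>products' into per-component molecule lists."""
    return [part.split(".") if part else [] for part in rxn_smiles.split(">")]
```

For example, `split_reaction("CCO.CC(=O)O>>CCOC(C)=O")` yields reactants `["CCO", "CC(=O)O"]`, no reagents, and the product `["CCOC(C)=O"]`.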
4. Training and Inference Efficiency
Adopting aggressive visual token reduction, TinyChemVL achieves significant training and inference speed enhancements:
- Inference (ChemOCR):
- ChemVLM-8B: 7.41 samples/s, ∼896 tokens/input
- InternVL2.5-4B: 9.11 samples/s, ∼894 tokens/input
- TinyChemVL: 11.84 samples/s, ∼108 tokens/input
- Training (8 × A100 80GB, 1.5 epochs):
- InternVL2.5-4B: ∼47 hours
- TinyChemVL: ∼15 hours
Larger batch sizes (up to 4× relative to baselines) are supported under identical memory constraints, facilitating faster convergence and more efficient supervised fine-tuning.
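For context, the throughput figures above are samples processed per second; a minimal timing harness of the following form (with `model.generate` as a hypothetical stand-in for the actual inference call) suffices to reproduce such measurements.

```python
import time

def throughput(model, samples):
    """Average samples per second for sequential generation (illustrative)."""
    start = time.perf_counter()
    for sample in samples:
        model.generate(sample)    # hypothetical inference entry point
    return len(samples) / (time.perf_counter() - start)
```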
5. Experimental Performance and Ablations
5.1 Molecular-Level Tasks
TinyChemVL demonstrates state-of-the-art performance and parameter efficiency:
| Task | TinyChemVL | ChemVLM-8B | ChemDFM-X (13B) |
|---|---|---|---|
| OCR (Avg Tanimoto / @1.0) | 91.2% / 77.4% | 81.7% / 57.7% | 70.9% / 36.5% |
| img2smiles (Avg / @1.0) | 89.5% / 75.6% | — | 90.9% / 77.6% |
| Property Prediction (MW, MSE) | 488 | ~790 | — |
| property2img (MW, MSE) | 1620 | — | 7633 (GPT-4o) |
5.2 Reaction-Level Tasks (ChemRxn-V)
| Task | TinyChemVL | GPT-4o |
|---|---|---|
| Recognition (Avg Sim. / EM) | 93.4% / 67.9% | 19.1% / 0.1% |
| Prediction (Avg Sim. / @1.0) | 78.9% / 52.4% | 30.4% / 1.4% |
5.3 Token Reduction Ablation
Reducing to 4 tokens/image slightly degrades task performance:
- ChemOCR Tanimoto@1.0: 77.4% → 76.2% (−1.2)
- Reaction Recognition EM: 67.9% → 64.7% (−3.2)
- Reaction Prediction Tanimoto@1.0: 52.4% → 50.1% (−2.3)
6. Conclusions, Limitations, and Future Directions
TinyChemVL establishes that vision-language co-design, specifically aggressive yet adaptive token reduction, enables efficient and accurate multimodal modeling in chemistry. The introduction of ChemRxn-V demonstrates the viability of direct image-to-reaction prediction, showing ∼79% similarity when predicting products solely from visual input. However, excessive token reduction (e.g., 4 tokens/image) impairs performance, and molecular image generation via code remains contingent on external packages such as RDKit. Ongoing research aims to extend coverage to more complex reaction classes, multi-step synthesis planning, and joint fine-tuning with large-scale textual reaction corpora to further enhance vision-based chemical reasoning (Zhao et al., 9 Nov 2025).