
ChemRxn-V: VLM Reaction Benchmark

Updated 16 November 2025
  • ChemRxn-V is a benchmark for reaction-level tasks that evaluates both visual recognition and chemical predictive reasoning from imagery.
  • It quantifies performance using metrics like fingerprint similarity and exact match for reactants, reagents, and products.
  • The benchmark drives advances in automated chemical analysis by highlighting challenges in visual token reduction and hierarchical reasoning.

ChemRxn-V is a benchmark specifically designed for evaluating reaction-level tasks in the context of vision-language models (VLMs) applied to chemical imagery. Serving as a diagnostic and evaluation suite, ChemRxn-V addresses the limitations of previous molecular benchmarks that have concentrated on isolated molecular identification while neglecting the complex reasoning required for full chemical reactions from visual input. This benchmark enables a quantitative assessment of both recognition and predictive reasoning capacities in models that process depictions of chemical reactions, contributing to advances in automated chemical analysis and synthesis planning.

1. Motivation and Scope

Traditional VLMs in chemistry have focused on molecule-level recognition—mapping an isolated structural image to a molecular identifier (e.g., SMILES)—or employed text-only datasets, overlooking the visual heterogeneity inherent to reaction diagrams. ChemRxn-V was introduced to fill this methodological gap by targeting more demanding reaction-level tasks that require decomposition, context integration, and mechanistic prediction starting from molecular images. The guiding philosophy is to evaluate models not merely for image understanding, but for their ability to infer complex relationships—reactant-to-product mappings, reagent interpretation, and mechanistic reasoning—directly from visual cues (Zhao et al., 9 Nov 2025).

2. Task Formulation and Benchmarked Subtasks

ChemRxn-V formalizes two principal tasks:

  • Reaction Recognition: Given a single chemical image $I$ containing reactants, reagents/solvents, and products, the model must map $I \mapsto (S_r, S_s, S_p)$, where $S_r$, $S_s$, and $S_p$ are SMILES strings for reactants, solvents/reagents, and products, respectively. Successful performance requires accurate segmentation and assignment of role-labeled molecular entities.
  • Reaction Prediction: Provided an image $I^r$ depicting only reactants and reagents, the model must directly predict the major product in SMILES form, i.e., $I^r \mapsto S_p$. This assessment demands not just recognition but chemical reasoning, requiring the model to infer plausible products based only on the molecular visual content and contextual chemistry (Zhao et al., 9 Nov 2025).
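The two task interfaces above can be sketched as simple input/output contracts. The container and field names below are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container for one ChemRxn-V sample; the field names are
# illustrative, not the benchmark's released data format.
@dataclass
class ReactionSample:
    image_path: str        # rendered reaction diagram I
    reactants: List[str]   # ground-truth SMILES S_r
    reagents: List[str]    # ground-truth SMILES S_s (solvents/reagents)
    products: List[str]    # ground-truth SMILES S_p

def recognition_target(sample: ReactionSample):
    """Reaction Recognition: I -> (S_r, S_s, S_p)."""
    return (sample.reactants, sample.reagents, sample.products)

def prediction_target(sample: ReactionSample):
    """Reaction Prediction: reactant/reagent image -> S_p only."""
    return sample.products

# Toy example: acid-catalyzed esterification of ethanol and acetic acid.
sample = ReactionSample(
    image_path="rxn_00001.png",
    reactants=["CCO", "CC(=O)O"],
    reagents=["OS(=O)(=O)O"],   # sulfuric acid catalyst
    products=["CCOC(C)=O"],     # ethyl acetate
)
print(recognition_target(sample))
print(prediction_target(sample))
```

Recognition requires emitting all three role-labeled components, while prediction withholds the product and grades only the inferred $S_p$.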

3. Dataset Composition

The ChemRxn-V benchmark dataset is curated from the ORDerly test split and comprises 5,000 samples for reaction recognition and 5,000 samples for reaction prediction. The samples are stratified by reaction length, ensuring coverage across simple to complex transformations. All imagery is generated via synthetic renderings using RDKit or Indigo, standardizing graphical features and ensuring task clarity. Each data point contains high-resolution depictions suitable for transformer-based vision encoders. An important caveat is the exclusive use of synthetic images, which, while ideal for algorithmic evaluation, may present domain adaptation challenges for real-world, hand-drawn, or noisy laboratory data (Zhao et al., 9 Nov 2025).
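The stratified-by-reaction-length sampling described above can be sketched in a few lines. This is a minimal illustration, assuming "reaction length" is approximated by molecule count on the reactant side of a reaction SMILES string; the benchmark's actual stratification criterion may differ:

```python
from collections import defaultdict
import random

def stratify_by_length(reactions, max_bin=3):
    """Group reaction SMILES ('reactants>reagents>products') into strata
    keyed by reactant molecule count, clamping long reactions into the
    top bin. A stand-in for ChemRxn-V's reaction-length stratification."""
    strata = defaultdict(list)
    for rxn in reactions:
        reactant_side = rxn.split(">")[0]
        n_mols = len(reactant_side.split("."))
        strata[min(n_mols, max_bin)].append(rxn)
    return strata

def balanced_sample(strata, per_bin, seed=0):
    """Draw up to per_bin reactions from each stratum, so simple and
    complex transformations are both covered."""
    rng = random.Random(seed)
    out = []
    for key in sorted(strata):
        pool = strata[key]
        out.extend(rng.sample(pool, min(per_bin, len(pool))))
    return out

reactions = [
    "CCO.CC(=O)O>OS(=O)(=O)O>CCOC(C)=O",  # 2 reactant molecules
    "CCO>>CC=O",                          # 1 reactant molecule
]
strata = stratify_by_length(reactions)
subset = balanced_sample(strata, per_bin=1)
print(sorted(strata), len(subset))
```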

4. Evaluation Metrics and Methodology

Performance on ChemRxn-V is measured with a suite of chemically rigorous metrics:

  • Fingerprint Similarity: Each predicted component (reactant, solvent/reagent, product) is compared to the ground truth using RDKit Tanimoto similarity computed over molecular fingerprints, then weighted by the number of molecules per component.
  • Exact Match (EM): Successful recognition requires that all three predicted components exactly match their corresponding ground-truth SMILES.
  • Prediction Metrics: For reaction prediction, average similarity and Acc@1.0 (identical structure, i.e., Tanimoto similarity of 1.0) are reported for product identification.

$$\text{Score}_{\text{avg}} = \frac{1}{N}\sum_{i=1}^{N} \text{Tanimoto}\left(f_{i,\text{pred}},\, f_{i,\text{true}}\right)$$
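The scoring logic can be illustrated in plain Python. The real pipeline computes Tanimoto similarity over RDKit molecular fingerprints; here fingerprints are represented abstractly as sets of on-bit indices, and `reaction_score` is an assumed form of the molecule-count weighting described above, not the benchmark's exact implementation:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of
    on-bit indices: |A intersect B| / |A union B|."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def reaction_score(component_sims, component_counts):
    """Combine per-component similarities (reactants, reagents,
    products), weighting each component by its molecule count."""
    total = sum(component_counts)
    return sum(s * n for s, n in zip(component_sims, component_counts)) / total

# Two toy fingerprints sharing 2 of 6 total on-bits.
fp_pred = {0, 1, 2, 3}
fp_true = {2, 3, 4, 5}
print(round(tanimoto(fp_pred, fp_true), 3))  # -> 0.333

# Weighted combination: 2 reactants at 0.9, 1 reagent at 1.0, 1 product at 0.5.
print(reaction_score([0.9, 1.0, 0.5], [2, 1, 1]))  # -> 0.825
```

Exact match then simply requires every component's canonical SMILES to be identical to the ground truth, i.e., a Tanimoto of 1.0 on every molecule.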

This suite enables fine-grained comparison of partial and holistic model performance on recognition and predictive reasoning (Zhao et al., 9 Nov 2025).

5. Empirical Results and Baseline Comparison

When evaluated on ChemRxn-V, the TinyChemVL model achieves 93.4% average similarity and 67.9% exact match on recognition tasks, notably surpassing ChemDFM-X (28.3% / 3.2%) and providing a strong diagnostic contrast to earlier VLMs that were limited to molecule-level benchmarks. In the reaction prediction setting, TinyChemVL attains 78.9% average similarity and 52.4% Acc@1.0, establishing a baseline for direct visual reaction prediction, as no prior VLMs attempted this task. These results empirically substantiate the benchmark's difficulty and its utility for distinguishing between recognition-only and reasoning-capable chemical models (Zhao et al., 9 Nov 2025).

Model       Recognition Avg Sim (%)   Recognition EM (%)   Prediction Avg Sim (%)   Prediction Acc@1.0 (%)
TinyChemVL  93.4                      67.9                 78.9                     52.4
ChemDFM-X   28.3                      3.2                  --                       --

6. Benchmark Design Rationale and Advancements

The conceptual advance embodied by ChemRxn-V lies in its co-design with both architectural and task-level innovations in chemical VLMs. The benchmark’s complexity compels models to extract fine-grained local structure from visual input, robustly segment molecular actors, and perform schematic reasoning—all in a context where input modalities are dominated by low-information background pixels. The benchmark thereby motivates methods such as adaptive visual token reduction and hierarchical reasoning modules, as implemented in TinyChemVL (Zhao et al., 9 Nov 2025).
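Because reaction diagrams are dominated by low-information background pixels, token pruning can be sketched as a variance ranking over patch features. This is a toy illustration of the general idea of adaptive visual token reduction, not TinyChemVL's actual mechanism:

```python
import numpy as np

def reduce_visual_tokens(patches, keep_ratio=0.25):
    """Toy adaptive token reduction: score each patch embedding by its
    feature variance (near-uniform background patches score low) and
    keep only the top fraction. Returns indices of kept tokens.

    patches: (N, D) array of patch features."""
    scores = patches.var(axis=1)
    k = max(1, int(len(patches) * keep_ratio))
    keep = np.argsort(scores)[-k:]
    return np.sort(keep)

# A mostly blank "image": 16 flat background patches, 4 structured ones.
rng = np.random.default_rng(0)
bg = np.full((16, 8), 0.95) + rng.normal(0, 1e-3, (16, 8))
fg = rng.normal(0, 1.0, (4, 8))
patches = np.concatenate([bg, fg])
kept = reduce_visual_tokens(patches, keep_ratio=0.2)
print(kept)  # the high-variance, structure-bearing patches survive
```

A fixed `keep_ratio` is the simplest policy; the adaptive-threshold direction discussed below would instead learn how many tokens to retain per image.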

A plausible implication is that continued development along the axes defined by ChemRxn-V—incorporating more heterogeneous imagery, increasing reaction complexity, and extending to multi-step synthesis—will further strain purely vision-based chemical models and necessitate richer reasoning architectures.

7. Limitations and Future Outlook

Several limitations apply to ChemRxn-V. Its reliance on synthetic molecule diagrams generated by RDKit or Indigo precludes assessment of performance on real-world reaction depictions (e.g., hand-drawn, scanned, or photographic images). The current scope is single-step transformations, while industrial and medicinal chemistry often demand multistep synthesis route reasoning. The fixed construction of reaction role assignment may not generalize to reactions with ambiguous or non-standard roles. Toward future development, proposed directions include learning adaptive reduction thresholds for visual token strategies, incorporating reinforcement or uncertainty signals, and expanding the benchmark to encompass laboratory-condition imagery and more intricate synthetic sequences (Zhao et al., 9 Nov 2025).
