MolSight: Dual Approaches in Molecular Informatics
- MolSight is a dual-strategy framework that combines a deep learning-based Optical Chemical Structure Recognition system for converting chemical diagrams into SMILES with a microfluidic ETs platform for single-molecule quantification.
- The OCSR component employs an EfficientViT-L1 encoder and Transformer decoder with SMILES pre-training, multi-granularity fine-tuning, and reinforcement learning to achieve state-of-the-art accuracy on chemical structure benchmarks.
- The microfluidic ETs approach leverages entropic escape-time measurements to accurately determine molecular size, shape, binding kinetics, and dynamic behavior at sub-zeptomole sensitivity.
MolSight encompasses two distinct and advanced approaches in molecular informatics: (1) a deep learning-based Optical Chemical Structure Recognition (OCSR) framework for automated extraction of chemical information from images (Zhang et al., 21 Nov 2025), and (2) a high-throughput microfluidic platform (escape-time stereometry, or ETs) for quantitative single-molecule characterization of size, shape, interactions, and dynamics on a chip (Zhu et al., 13 May 2025). Each implementation addresses a critical, but separate, axis in molecular science. The following sections detail their architectures, methodologies, and principal contributions.
1. MolSight for Optical Chemical Structure Recognition
MolSight is an end-to-end neural framework targeting the OCSR problem: converting chemical structure diagrams into machine-actionable molecular representations. The workflow integrates three synergistic components: SMILES captioning pre-training, multi-granularity supervision, and semantic-aware reinforcement learning.
1.1 Architecture
- Encoder: EfficientViT-L1 (≈53 million parameters), ingesting 512×512 pixel images, produces multi-scale convolutional features with linear-attention fusion; these are linearly mapped as cross-attention “keys” and “values” for the decoder.
- Decoder: Six-layer Transformer employing rotary position embeddings (RoPE), SwiGLU feed-forward blocks, and RMSNorm, autoregressively predicts SMILES tokens, attending both previous outputs and image features.
- Prediction Heads:
- SMILES head projects decoder hidden state to the chemical vocabulary.
- Chemical bond classification head uses concatenated pairwise atom token representations to predict bond types via a 2-layer MLP with softmax.
- Atom-coordinate localization head, after two decoder-specialized Transformer layers, outputs discrete x and y coordinates and an uncertainty scale parameter.
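The bond classification head described above can be illustrated with a minimal pure-Python sketch. It only demonstrates the "concatenate two atom token representations, apply a 2-layer MLP, then softmax" pattern; the hidden width, the ReLU activation, the random initialization, and the `BOND_TYPES` label set are assumptions made for illustration and are not taken from the source.

```python
import math
import random

# Hypothetical label set; the source does not list the bond classes.
BOND_TYPES = ["none", "single", "double", "triple", "aromatic", "wedge", "dash"]


def linear(x, weights, bias):
    """Dense layer: weights is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def init_layer(out_dim, in_dim, rng):
    """Small random weights and zero bias (illustrative initialization only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def bond_probs(h_i, h_j, params):
    """Concatenate two atom token states and classify the bond between them."""
    (w1, b1), (w2, b2) = params
    pair = h_i + h_j  # list concatenation: [h_i ; h_j]
    hidden = [max(0.0, v) for v in linear(pair, w1, b1)]  # ReLU is an assumption
    return softmax(linear(hidden, w2, b2))


rng = random.Random(0)
d_model, d_hidden = 8, 16  # toy sizes, not the model's real dimensions
params = (
    init_layer(d_hidden, 2 * d_model, rng),
    init_layer(len(BOND_TYPES), d_hidden, rng),
)
h_a = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
h_b = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
probs = bond_probs(h_a, h_b, params)
predicted = BOND_TYPES[probs.index(max(probs))]
```

Because the pair is formed by concatenation, `bond_probs(h_a, h_b, ...)` and `bond_probs(h_b, h_a, ...)` generally differ; whether the actual model symmetrizes the two orderings is not stated in the source.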
1.2 Training Paradigm
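- SMILES Pre-training: Self-supervised, using noisy SMILES-caption datasets. Cross-entropy loss trains the encoder–decoder to map images to SMILES sequences, instilling a chemical “language modeling” bias. In the standard autoregressive form, for an image $I$ and target tokens $y_1, \dots, y_T$:
  $$\mathcal{L}_{\text{SMILES}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, I)$$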
- Multi-Granularity Fine-Tuning: On annotated datasets (e.g., PubChem-1M, USPTO-680K), two auxiliary tasks are introduced:
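- Bond Classification: Predicts bond types among atom token pairs, with cross-entropy loss over the annotated bond type $b_{ij}$ of each atom pair $(i, j)$, given their token representations $h_i$ and $h_j$:
  $$\mathcal{L}_{\text{bond}} = -\sum_{(i,j)} \log p_\theta(b_{ij} \mid h_i, h_j)$$
- Atom Localization: Predicts binned positions with a Laplace likelihood. For a predicted location $\mu$ and scale $b$, the negative log-likelihood per atom coordinate $x$ is:
  $$-\log p(x \mid \mu, b) = \log(2b) + \frac{|x - \mu|}{b}$$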
The composite loss combines the SMILES, bond-classification, and atom-localization losses. Empirically, decoupling the coordinate head Ø_coord (freezing the encoder and decoder and adding extra Transformer layers for the head) resolves cross-task interference and improves localization.
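The Laplace negative log-likelihood used for atom localization is simple enough to write out directly. The sketch below is illustrative only: the number of bins (`n_bins=64`), the normalization of coordinates to [0, 1], and the sharing of one scale parameter between the x and y axes are assumptions, not details confirmed by the source.

```python
import math


def laplace_nll(target, mu, scale):
    """Negative log-likelihood of `target` under Laplace(mu, scale)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return math.log(2.0 * scale) + abs(target - mu) / scale


def to_bin(coord, n_bins=64):
    """Map a coordinate normalized to [0, 1] onto a discrete bin index."""
    return min(n_bins - 1, max(0, int(coord * n_bins)))


def atom_localization_loss(pred, target_xy, n_bins=64):
    """Sum the NLL over the x and y axes for one atom.

    pred is (mu_x, mu_y, scale) in bin units; target_xy is normalized to [0, 1].
    """
    mu_x, mu_y, scale = pred
    tx, ty = (to_bin(c, n_bins) for c in target_xy)
    return laplace_nll(tx, mu_x, scale) + laplace_nll(ty, mu_y, scale)


# A confident, accurate prediction is cheaper than a confident, wrong one.
good = atom_localization_loss((32.0, 16.0, 0.5), (0.5, 0.25))
bad = atom_localization_loss((40.0, 16.0, 0.5), (0.5, 0.25))
```

The scale parameter acts as a learned uncertainty: a larger scale reduces the penalty for a given position error but pays a fixed `log(2 * scale)` cost, so the model is discouraged from being uniformly unsure.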
- Semantic-Aware Reinforcement Learning with Group Relative Policy Optimization (GRPO):
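- Treats the decoder as a policy $\pi_\theta$ that, for each input image $I$, generates a group of $G$ candidate SMILES sequences as trajectories $o_1, \dots, o_G$.
- The GRPO objective, written here in the standard formulation of the algorithm, is:
  $$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\big) - \beta\, D_{\text{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big]\Big)\right]$$
  where $\rho_{i,t} = \pi_\theta(o_{i,t} \mid I, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid I, o_{i,<t})$ is the token-level probability ratio and $\hat{A}_i = (r_i - \operatorname{mean}(r_1, \dots, r_G)) / \operatorname{std}(r_1, \dots, r_G)$ is the group-normalized advantage of trajectory $o_i$ with reward $r_i$.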
- Reward combines the Tanimoto similarity between predicted/ground-truth Morgan fingerprints and a stereochemistry term (+1 for exact InChIKey match, +0.3 for atom-count match, +0.1 for otherwise valid SMILES).
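The reward and the group-relative normalization at the heart of GRPO can be sketched in a few lines. This is an illustration under stated assumptions: fingerprints are represented as plain Python sets of on-bit indices rather than computed by a cheminformatics toolkit, the stereochemistry tiers are read as mutually exclusive, and the two reward components are simply added. The source does not specify how the terms are combined.

```python
import statistics


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


def stereo_term(inchikey_match, atom_count_match, valid_smiles):
    """Tiered bonus; treating the tiers as exclusive is an assumption."""
    if inchikey_match:
        return 1.0
    if atom_count_match:
        return 0.3
    if valid_smiles:
        return 0.1
    return 0.0


def reward(pred_bits, true_bits, inchikey_match, atom_count_match, valid_smiles):
    """Similarity plus stereo bonus (plain sum is an assumption)."""
    return tanimoto(pred_bits, true_bits) + stereo_term(
        inchikey_match, atom_count_match, valid_smiles
    )


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group of three sampled candidates for one image.
truth = {1, 2, 3, 4}
group = [
    reward({1, 2, 3, 4}, truth, True, True, True),    # exact match
    reward({1, 2, 3, 9}, truth, False, True, True),   # right atoms, wrong stereo
    reward({7, 8}, truth, False, False, True),        # valid but unrelated
]
advantages = group_relative_advantages(group)
```

Because advantages are computed relative to the group, no separate value network is needed: a candidate is reinforced only to the extent that it beats the other samples drawn for the same image.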
2. Stereo-200K Dataset for Stereochemical Recognition
The Stereo-200K dataset, assembled for the RL stage, consists of 200,000 stereoisomeric molecules. Structures are rendered by RDKit in five style presets (classic, blue/gray backgrounds, line thickness variations, atom indices). The molecules were filtered from PubChem CID: 2,000,000 for diversity, including chiral centers ("@", "@@"), cis-trans bonds ("/", "\"), wedge and dash bonds, varied ring conformations, and spatial arrangements. This enables benchmarking of subtle stereo cues and avoids overfitting to canonical renderings (Zhang et al., 21 Nov 2025).
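The SMILES stereo markers mentioned above can be spotted with a simple string heuristic. The sketch below is not the dataset's actual filtering procedure, which the source does not describe; it only counts the marker characters, and the function names are hypothetical. A real pipeline would parse the molecule with a cheminformatics toolkit instead.

```python
import re

# One chirality tag is either "@" or "@@"; match each run once.
_CHIRAL_TAG = re.compile(r"@+")


def stereo_features(smiles):
    """Count stereo markers in a SMILES string (string-level heuristic)."""
    return {
        "chiral_centers": len(_CHIRAL_TAG.findall(smiles)),
        "directional_bonds": smiles.count("/") + smiles.count("\\"),
    }


def has_stereo(smiles):
    """True if the string carries any chirality or cis-trans marker."""
    features = stereo_features(smiles)
    return features["chiral_centers"] > 0 or features["directional_bonds"] > 0


candidates = ["CCO", "C[C@H](N)C(=O)O", "C[C@@H](N)C(=O)O", "F/C=C/F", "F/C=C\\F"]
stereo_subset = [s for s in candidates if has_stereo(s)]
```

Here ethanol (`CCO`) is dropped, while both alanine enantiomers and both 1,2-difluoroethene isomers are kept.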
3. Experimental Performance and Ablation Findings
MolSight sets the state-of-the-art on public and synthetic OCSR benchmarks (USPTO, Maybridge UoB, CLEF-2012, JPO; Staker, ChemDraw, Indigo, Stereo-2K). Key metrics:
- USPTO: Exact SMILES accuracy 94.0% overall, 85.1% on stereochemical molecules (MolScribe: 69.0%).
- JPO: 66.7% exact SMILES match (+9.1% over MolScribe), 68.7% graph accuracy.
- Tanimoto fingerprint similarity consistently >90% across benchmarks.
- Robust to rotations, shearing, and low-resolution image degradations.
- As a frozen feature extractor on MoleculeNet, the image encoder matches or exceeds ROC-AUC of ImageNet backbones, rivaling graph-based pre-training.
Ablations demonstrate: bond-classification head increases SMILES accuracy from 93.7%→94.2%; jointly optimizing localization hurts SMILES accuracy, while decoupling and extending Ø_coord recovers and improves stereo inference; RL-based GRPO on Stereo-200K raises Stereo-2K in-domain stereo accuracy from 80.1%→87.1% and out-of-domain CLEF stereo from 71.0%→80.6% (Zhang et al., 21 Nov 2025).
4. Microfluidic MolSight: Escape-Time Stereometry (ETs) for Molecular Quantification
In parallel, the MolSight (ETs) approach enables direct measurement of molecular size, shape, binding, and dynamic behavior using a photonic microfluidic chip (Zhu et al., 13 May 2025):
- Chip and Imaging: Thermally bonded Si/SiO₂-glass substrates, nanoslit (h₁ = 10–70 nm), and periodic cavities (d ≃ 220–360 nm, Dₚ ≃ 550 nm) imaged via widefield epifluorescence.
- Principle: Single-fluorophore-labeled molecules undergo diffusion; pockets act as entropic traps. A weighted least-squares fit to the escape-time distribution yields the mean escape time.
- Calibration and Quantitation:
- Proteins of known size (insulin, carbonic anhydrase, etc.) yield a calibration of mean escape time against size, allowing extraction of geometric parameters.
- Small-molecule derivatives demonstrate scaling (down to ∼10 Da resolution for 1 kDa molecules).
- DNA/RNA helical form: rise per base-pair extracted from escape-time measurements.
- Applications: Detects size range 0.5 kDa–500 kDa; concentration sensitivity ~10 fM. Determines kinetic/thermodynamic binding parameters (dissociation constants down to sub-nM), tracks reactions in real time, and resolves molecular heterogeneity via single-molecule statistics. Multi-height measurements permit inference of both hydrodynamic radius and bounding-sphere diameter at ≲10% precision.
- Clinical Sensitivity: Insulin receptor conformation/state is directly discerned in serum at 10⁻²¹ mol loading and sub-nanoliter volumes, with single-molecule precision over physiologically relevant insulin concentrations. Copy-to-copy reproducibility is within 1% (Zhu et al., 13 May 2025).
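The "Principle" item above extracts a mean escape time from a weighted least-squares fit. A minimal sketch of such a fit follows, under assumptions the source does not state: escape times are taken to follow a single-exponential distribution, the fit is a straight line through the log of the histogram counts, each bin is weighted by its count (the approximate inverse variance of a log Poisson count), and sparsely populated bins are discarded. The function name and all parameter values are hypothetical.

```python
import math
import random


def fit_mean_escape_time(times, bin_width, min_count=5):
    """Estimate tau from a weighted linear fit of ln(count) against time.

    Assumes counts fall off as exp(-t / tau). Bins with fewer than
    `min_count` events are dropped to limit low-count bias.
    """
    counts = {}
    for t in times:
        index = int(t // bin_width)
        counts[index] = counts.get(index, 0) + 1

    xs, ys, ws = [], [], []
    for index, n in sorted(counts.items()):
        if n >= min_count:
            xs.append((index + 0.5) * bin_width)  # bin centre
            ys.append(math.log(n))
            ws.append(float(n))  # weight ~ 1 / var(ln n) for Poisson counts

    total = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    s_xx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    s_xy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = s_xy / s_xx  # slope of ln(count) vs t equals -1 / tau
    return -1.0 / slope


# Synthetic check: simulate escape events with a known mean of 50 ms.
rng = random.Random(1)
true_tau = 0.050
samples = [rng.expovariate(1.0 / true_tau) for _ in range(20000)]
tau_hat = fit_mean_escape_time(samples, bin_width=0.005)
```

On the synthetic data the estimate should land close to the simulated 50 ms. With real measurements, finite exposure time and photobleaching would distort the shortest and longest bins, which is one reason a weighted fit is preferable to a simple average.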
5. Limitations and Future Directions
MolSight OCSR:
- The SMILES-M extension for Markush structures is dependent on ad-hoc delimiters; a graph-based or more principled notation would expand generality.
- RL with GRPO post-training, while effective, demands substantial compute and is sensitive to reward hyperparameters; automation or curriculum-based reward addition is an open challenge.
- Full 3D molecular coordinate reconstruction (beyond 2D atom localization in the image plane) and systematic ring-conformation prediction remain unresolved.
- Integration with LLMs for chemically informed question answering is identified as a promising avenue for expansion (Zhang et al., 21 Nov 2025).
MolSight ETs:
- The chip-based approach is contingent on precise sample preparation and passivation steps; calibration is sensitive to device geometry.
- Broader molecular classes and more complex mixtures beyond those tested may introduce additional analytical challenges.
A plausible implication is that future applications will increasingly leverage both approaches: deep learning-based OCSR for annotation and mining of large-scale chemical datasets, and ETs microfluidics for physical validation, dynamics, and diagnostics at the single-molecule level.
6. Comparative Summary
| MolSight (OCSR) (Zhang et al., 21 Nov 2025) | MolSight (ETs) (Zhu et al., 13 May 2025) |
|---|---|
| Automated extraction from images (OCSR) | Direct single-molecule quantification |
| Deep learning: EfficientViT-L1 + Transformer decoder | Microfluidic chip, entropic escape-time measurement |
| Multi-task and RL-based training | Physical/chemical property measurement (size, binding, reactions) |
| State-of-the-art on 2D and stereochemistry benchmarks | 0.5% precision, sub-zeptomole sensitivity, 3D shape inference |
Both approaches, while technically unrelated, contribute substantially to chemical informatics by automating either the extraction of chemical meaning from visual sources or the direct characterization of molecules in solution, and can be seen as complementary elements in an emerging digital-physical molecular analytics stack.