CoLLaMo: Collaborative Molecular Language Model
- CoLLaMo is a molecular language model that unifies 1D SELFIES strings, 2D graphs, and 3D conformations into a single molecule-token space.
- It features a multi-level modality-collaborative projector employing relation-aware cross-attention to merge diverse molecular cues and mitigate hallucination.
- Empirical evaluations show CoLLaMo outperforms baseline models in tasks such as molecule captioning, property QA, motif counting, and IUPAC naming.
CoLLaMo (Collaboration-aware LLM for Molecules) is a large molecular language model (LMLM) designed for robust, multimodal molecular language understanding and generation. It addresses core limitations of prior LMLMs, particularly hallucination and poor robustness arising from inadequate modality fusion. CoLLaMo incorporates 1D sequential SELFIES representations, 2D molecular graph topologies, and 3D molecular conformations into a unified “molecule token” space through a multi-level, relation-aware modality-collaborative projector. This design enables fine-grained, relation-guided attention and supports downstream tasks such as molecule captioning, property QA, motif counting, and IUPAC naming. CoLLaMo is evaluated with molecule-centric hallucination metrics and LLM-based caption quality scores, showing substantially improved generalization and performance over strong baselines, including domain-specific GNNs and large GPT models (Park et al., 18 Jan 2026).
1. Architectural Overview
The CoLLaMo pipeline consists of three core modules:
- Modality-specific Encoders:
- 1D SELFIES strings are embedded via token lookup.
- 2D molecular graphs are encoded with a pretrained GIN.
- 3D conformers are encoded with a pretrained Uni-Mol encoder.
- Multi-level Modality-collaborative Projector (Co-Proj):
Relation-aware cross-attention modules fuse the 1D, 2D, and 3D atomwise representations and pool them into a unified molecule-token sequence.
- LLM Backbone:
A frozen LLaMA2-7B model receives the molecule tokens concatenated with textual instructions and generates output via auto-regressive decoding.
The unified architecture is robust to missing modalities by randomly dropping streams during training and injecting learned modality embeddings. This prevents the model from overfitting to any single molecular cue.
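The dropout-based robustness mechanism can be sketched as follows. This is a minimal NumPy sketch with hypothetical shapes and parameter names (`D`, `modality_emb`, `p_drop` are illustrative); the real model uses trained modality-specific encoders and learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical hidden size

# Stand-ins for learned per-modality embeddings.
modality_emb = {m: rng.normal(size=D) for m in ("1d", "2d", "3d")}

def encode(features, modality):
    """Placeholder for a modality-specific encoder: here it just adds
    the learned modality embedding to each atomwise feature vector."""
    return features + modality_emb[modality]

def fuse_with_dropout(streams, p_drop=0.3, train=True):
    """Randomly drop whole modality streams during training, always
    keeping at least one, then concatenate atomwise embeddings."""
    names = list(streams)
    if train:
        kept = [n for n in names if rng.random() >= p_drop]
        if not kept:  # never drop every stream
            kept = [rng.choice(names)]
    else:
        kept = names
    feats = [encode(streams[n], n) for n in kept]
    return np.concatenate(feats, axis=0), kept
```

At inference all available streams are kept; during training each stream survives independently, which forces the projector to remain useful when any single cue is missing.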
2. Multi-level Modality-collaborative Projector and Relation-aware Attention
At each of the L projection layers (L chosen empirically), CoLLaMo computes:
- For each modality:
- 1D: Raw token embeddings.
- 2D/3D: Transformer self-attention blocks with “Co-Attn” bias.
The Co-Attn(Q,K,V) operation incorporates two additive bias matrices:
- A spatial bias: a learnable lookup indexed by the shortest-path graph distance between atoms i and j.
- A 3D bias: derived from a Gaussian basis kernel (GBK) expansion of the Euclidean interatomic distances.
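The two additive biases can be illustrated with a minimal NumPy sketch. The basis centers, widths, and projection weights stand in for learned parameters, and the exact parameterization in the paper may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gbk_bias(dists, centers, widths, weights):
    """Expand pairwise Euclidean distances into K Gaussian basis
    functions, then project to one scalar bias per atom pair."""
    basis = np.exp(-((dists[..., None] - centers) ** 2) / (2.0 * widths**2))
    return basis @ weights  # (n_atoms, n_atoms)

def co_attn(Q, K, V, spatial_bias, dist_bias):
    """Scaled dot-product attention with two additive relation-aware
    biases: a learned shortest-path bias and the GBK distance bias."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + spatial_bias + dist_bias
    return softmax(scores) @ V
```

The spatial bias would be gathered from a learnable table indexed by clipped shortest-path distances, e.g. `spatial_table[np.minimum(sp_dist, max_dist)]`; with both biases zeroed, `co_attn` reduces to plain self-attention.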
After self-attention and residual connections within each modality, the atomwise embeddings from all modalities are concatenated.
A fixed set of learnable pooling queries aggregates information by cross-attending over the concatenated sequence. The outputs over all layers, after an MLP mapping, form the molecular representation fed to the LLM. Modality dropout and modality-embedding injection during training further enhance generalization under modality sparsity.
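The query-based pooling step can be sketched in NumPy; the number of queries `M` and dimensions are hypothetical, and the learned queries are stand-ins for trained parameters.

```python
import numpy as np

def pool_tokens(queries, tokens):
    """Cross-attend M fixed pooling queries onto the concatenated
    atomwise sequence, returning M pooled molecule tokens."""
    d = queries.shape[-1]
    scores = queries @ tokens.T / np.sqrt(d)   # (M, n_atoms)
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)   # each row sums to 1
    return attn @ tokens                       # (M, d)
```

Because each attention row is a convex combination, every pooled token lies within the range of the atomwise values, which keeps the pooled representation on the same scale as its inputs.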
3. Molecule-centric Evaluation Metrics
CoLLaMo introduces two new automatic metrics to address the limitations of token-based measures (e.g., BLEU):
- CHARM (Caption Hallucination Assessment w.r.t. Molecule):
CHARM identifies captioned molecular entities that do not exist in the input structure (using BERN2 entity extraction above a confidence threshold).
- RCHARM (Recall-oriented CHARM):
A recall-oriented variant that penalizes captions omitting key molecular details present in the reference.
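The precision/recall idea behind the two metrics can be expressed as a set-based sketch with hypothetical entity sets; the actual pipeline uses BERN2 extraction and the paper's exact normalization, which may differ.

```python
def charm(caption_entities, molecule_entities):
    """Fraction of captioned entities not grounded in the molecule
    (lower is better): a precision-style hallucination rate."""
    if not caption_entities:
        return 0.0
    hallucinated = caption_entities - molecule_entities
    return len(hallucinated) / len(caption_entities)

def rcharm(caption_entities, molecule_entities):
    """Fraction of true molecular entities missing from the caption
    (lower is better): a recall-style omission rate."""
    if not molecule_entities:
        return 0.0
    missed = molecule_entities - caption_entities
    return len(missed) / len(molecule_entities)
```

For example, a caption mentioning a nitro group absent from the structure raises CHARM, while omitting a hydroxyl group that is present raises RCHARM.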
- LLM-based Caption Judge Score:
GPT-4o evaluates generated captions for "factual informativeness" and "alignment with reference" on a 0-5 scale using structured prompts with SELFIES, reference, and candidate captions. These scores correlate more closely with chemical correctness than BLEU/METEOR.
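A structured judging prompt of this kind might be assembled as below; the wording and field layout are hypothetical, and the paper's exact template is not reproduced.

```python
def judge_prompt(selfies, reference, candidate):
    """Assemble a structured caption-judging prompt (illustrative
    wording, not the paper's exact template)."""
    return (
        "Rate the candidate molecule caption on a 0-5 scale for "
        "factual informativeness and alignment with the reference.\n"
        f"SELFIES: {selfies}\n"
        f"Reference caption: {reference}\n"
        f"Candidate caption: {candidate}\n"
        "Answer with a single integer from 0 to 5."
    )
```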
4. Training Procedure
CoLLaMo training is conducted in two stages:
- Stage 1: Molecule–Language Alignment (64K steps)
- Data: Mol-Instructions, GPT-3.5-enriched PubChemQC captions.
- Objective: Next-token cross-entropy.
- Trained components: Modality encoders + Co-Proj (LLM frozen).
- Stage 2: Multitask Instruction Tuning (900K steps)
- Data: Captioning, descriptive property QA, computed property QA, motif counting, and IUPAC naming splits.
- Objective: Next-token cross-entropy on the output, conditioned on the molecule tokens and the instruction.
- Trained components: Co-Proj + LLM (with LoRA).
Optimization uses AdamW with the paper's default learning rate and weight decay; the modality encoders are frozen in this second stage.
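The two-stage schedule above can be captured in a configuration sketch; the field names and component labels are illustrative, not the paper's.

```python
# Hypothetical configuration mirroring the two training stages
# described above (field names are illustrative).
STAGES = [
    {
        "name": "alignment",
        "steps": 64_000,
        "trainable": ["encoders", "co_proj"],
        "frozen": ["llm"],
        "data": ["Mol-Instructions", "PubChemQC captions"],
    },
    {
        "name": "instruction_tuning",
        "steps": 900_000,
        "trainable": ["co_proj", "llm_lora"],
        "frozen": ["encoders", "llm_base"],
        "data": ["captioning", "property_qa", "motif_counting", "iupac"],
    },
]

def trainable_params(stage):
    """Return which component groups receive gradients in a stage."""
    return set(stage["trainable"])
```

A training loop would iterate the stages in order, switching which parameter groups are unfrozen between stages.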
5. Downstream Tasks and Empirical Results
CoLLaMo is evaluated against GPT-4, GPT-4o, o1-mini (zero-/few-shot ICL), LLaMA2-7B, Mol-Instructions, specialist models (2D-MoLM, 3D-MoLM), and Uni-Mol. It demonstrates improved molecular modality generalization and outperforms baselines across diverse tasks:
| Task/Model | BLEU | METEOR | Motif MAE | IUPAC BLEU | CHARM (%) | Judge (0–5) |
|---|---|---|---|---|---|---|
| CoLLaMo (full) | 44.5 | 70.5 | 1.31 | 59.8 | 58.5 | 2.52 |
| GPT-4o (ICL) | 31.1 | 57.1 | 2.45 | 39.4 | 77.1 | 2.17 |
| LLaMo (2D-only) | — | — | — | — | 64.7 | 2.15 |
| 3D-MoLM | 26.13 | 52.15 | — | — | — | — |
MAE is reported for all computed properties, and motif-counting outputs are 100% valid. All specialist and generalist CoLLaMo variants surpass baselines on both BLEU/METEOR and the molecule-centric metrics.
6. Ablation Studies and Component Analysis
- Modality ablation:
Captioning BLEU: 1D only: 33.2; 2D: 38.0; 3D: 35.8; no Co-Proj (independent CrossAttn): 35.7; full Co-Proj (1D+2D+3D): 40.1.
- Component ablation:
w/o Co-Attention bias: 36.6 (–3.5); w/o modality embeddings: 38.9 (–1.2); w/o modality dropout: 37.8 (–2.3); full: 40.1.
- Missing modalities:
The Co-Proj model yields BLEU 28.0 with only the 1D stream, versus 2.4 with no modality input; restoring 1D+2D recovers 39.1; all three modalities reach 40.1.
These ablations establish the substantive contribution of collaborative cross-attention, relation-aware biasing, modality dropout, and modality embeddings to performance and robustness.
7. Significance, Limitations, and Future Directions
CoLLaMo confirms that relation-aware collaborative attention across molecular modalities offers substantial gains in both interpretability and robustness for LMLMs. By fusing sequence, graph topology, and 3D geometry representations, CoLLaMo avoids modality over-reliance and supports informative, entity-grounded molecular descriptions.
Notably, although CoLLaMo shows strong generalization even with partial modality input, reliance on pretrained GNN and 3D encoders may limit adaptation to novel representations or unstructured inputs. CHARM/RCHARM and LLM-based judge scores provide more chemically relevant evaluations than standard n-gram metrics, but further work may refine entity extraction and judge reliability. A plausible implication is that modality-collaborative attention mechanisms—augmented with robust molecular-specific metrics—will be essential for scalable molecular LLMs addressing complex tasks in property prediction and automated description generation (Park et al., 18 Jan 2026).