Relation-aware Transformers: CoLLaMo
- Relation-aware Transformers are models that integrate explicit relational reasoning into attention mechanisms, enhancing data efficiency and systematic fusion of multimodal inputs.
- CoLLaMo employs modality-specific encoders and a multi-level collaborative projector to fuse 1D, 2D, and 3D molecular information, improving molecular LLM performance and robustness.
- Innovative techniques such as symmetry reduction and dual attention enable invariant representations and effective separation of sensory features from relational cues.
Relation-aware Transformers encompass a spectrum of architectural advances in the Transformer paradigm, targeting explicit relational reasoning, symmetry exploitation, and multimodal relation integration. The CoLLaMo system—introduced in "Improving Large Molecular LLM via Relation-aware Multimodal Collaboration"—exemplifies this trend for molecular language modeling, utilizing specialized relation-aware modality-collaborative attention to fuse 1D, 2D, and 3D molecular information. Parallel developments address relational processing in self-attention through invariant representations or dual-stream designs to improve data efficiency and generalization in both general and domain-specific contexts (François et al., 21 Feb 2026, Altabaa et al., 2024, Park et al., 18 Jan 2026).
1. Motivation and Scope of Relation-aware Transformers
Standard Transformers, while successful across modalities, provide no explicit computational mechanism for relational reasoning or systematic integration of multiple relation-rich modalities. In the molecular domain, this limitation manifests as hallucination and lack of robustness in large molecular LLMs (LMLMs) due to insufficient modeling of atomic/topological and spatial correlations across 1D, 2D, and 3D data sources (Park et al., 18 Jan 2026). Beyond molecules, conventional self-attention captures feature aggregation, but struggles to encode and manipulate structured relationships among objects, tokens, or entities, motivating the construction of architectures that disentangle and explicitly process relational information (Altabaa et al., 2024, François et al., 21 Feb 2026).
2. Architectural Foundations: CoLLaMo and Relation-aware Attention
CoLLaMo ("Cooperative LLM for Molecules") is structurally characterized by:
- Modality-specific encoders: A TokenEmbed module for 1D SELFIES strings, a GIN-based graph encoder for 2D graphs, and an SE(3)-equivariant UniMol encoder for 3D conformations. These produce modality-specific token representations for the 1D, 2D, and 3D inputs, respectively.
- Multi-level modality-collaborative projector ("Co-Proj"): At each encoder level, representations from all modalities are fused; a set of learnable query vectors attends to the fused space via cross-attention, producing level-wise molecule representations.
- Relation-aware collaborative attention ("Co-Attn"): For 2D and 3D inputs, attention weights incorporate pairwise graph distances (embedded as learned bias terms) and Euclidean/geometric distance kernels (mapped to attention biases via learnable Gaussian bases and a projection).
- Fusion pipeline: The concatenated multimodal representations are attended by the learnable queries and stacked across levels, yielding molecule tokens for the subsequent LLM.
- Modality dropout and learnable modality embeddings: To promote robustness and prevent over-reliance on any single modality, modalities are randomly masked during training and a learnable embedding is added to each active modality's representations.
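The modality dropout mechanism can be sketched as follows; this is a minimal illustration of the idea (random whole-modality masking plus per-modality embedding addition), with the function name, dropout rate, and tensor layout chosen here for clarity rather than taken from the paper.

```python
import torch

def modality_dropout(feats, mod_embeds, p_drop=0.3, training=True):
    """Randomly drop whole modalities during training and add a learnable
    modality embedding to each surviving modality's tokens.
    Illustrative sketch; names and rates are not from the paper."""
    # feats: list of (n_tokens, d) tensors, one per modality (1D/2D/3D)
    # mod_embeds: list of learnable (d,) modality embeddings
    keep = [not (training and torch.rand(()).item() < p_drop) for _ in feats]
    if not any(keep):                          # never drop every modality
        keep[int(torch.randint(len(feats), (1,)).item())] = True
    return [x + e for x, e, k in zip(feats, mod_embeds, keep) if k]
```

At inference time (`training=False`) all modalities are kept, so a model trained this way degrades gracefully when a modality is genuinely missing.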
This structure is coupled with a variant of Transformer attention that explicitly incorporates relation features (e.g., shortest-path distances on graphs, spatial kernels on conformers) into the attention score computation: relation-derived bias terms are added to the query-key scores before softmax normalization, and output vectors are composed by the usual value aggregation.
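A single-head sketch of this relation-biased attention is given below, assuming additive biases from an embedded graph distance and a Gaussian-basis kernel over Euclidean distances. The class and layer names (`RelationAwareAttention`, `spd_bias`, `gauss_proj`) and the hyperparameters are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RelationAwareAttention(nn.Module):
    """Single-head attention whose scores are biased by pairwise relations:
    an embedded shortest-path (graph) distance for 2D structure and a
    Gaussian-basis kernel over Euclidean distances for 3D conformers.
    Illustrative sketch of the Co-Attn idea, not the exact design."""
    def __init__(self, d, max_spd=16, n_gauss=8):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.max_spd = max_spd
        self.spd_bias = nn.Embedding(max_spd + 1, 1)       # 2D graph-distance bias
        self.centers = nn.Parameter(torch.linspace(0.0, 10.0, n_gauss))
        self.widths = nn.Parameter(torch.ones(n_gauss))    # 3D Gaussian bases
        self.gauss_proj = nn.Linear(n_gauss, 1)            # kernel -> scalar bias

    def forward(self, x, spd, dist):
        # x: (n, d) atom features; spd: (n, n) integer graph distances;
        # dist: (n, n) Euclidean distances between conformer atoms
        scores = self.q(x) @ self.k(x).T / x.shape[-1] ** 0.5
        scores = scores + self.spd_bias(spd.clamp(max=self.max_spd)).squeeze(-1)
        g = torch.exp(-((dist.unsqueeze(-1) - self.centers) / self.widths) ** 2)
        scores = scores + self.gauss_proj(g).squeeze(-1)
        return torch.softmax(scores, dim=-1) @ self.v(x)
```

Because the relational information enters as a bias on the scores, the standard softmax-and-aggregate structure of attention is preserved.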
3. Theoretical Advances: Invariant and Relational Reformulations
Two principal lines of theory-driven relational Transformer design underpin and contextualize the CoLLaMo architecture.
Symmetry-reduced Transformers (François et al., 21 Feb 2026):
- Standard Transformer parameterizations are shown to contain continuous symmetries: orthogonal transformations of the model (residual-stream) space, invertible basis changes within each head's query/key and value/output subspaces, and discrete permutations of heads.
- Symmetry reduction replaces coordinate-dependent representations with invariant relational quantities:
- The Gram matrix XXᵀ of token representations encodes all orthogonally invariant pairwise relations between tokens.
- The parameter composites W_Q W_Kᵀ and W_V W_O serve as symmetry-invariant carriers of the attention parameters.
- Attention is reformulated as a function of these invariants, with scores computed from Gram-matrix entries and parameter composites, implemented via MLPs or dot products.
- Optimization is recast on the quotient space of invariants, removing redundant parameter updates, with architectural modifications that replace head-specific matrices with shared, low-rank relational kernels subject to rank constraints.
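These invariances can be checked numerically in a few lines. The sketch below verifies, on toy shapes, that attention scores are unchanged by an invertible basis change in a head's query/key subspace (depending on the parameters only through the composite W_Q W_Kᵀ), and that the Gram matrix is unchanged by orthogonal moves of model space.

```python
import numpy as np

rng = np.random.default_rng(0)
d, dk, n = 6, 4, 3
X = rng.normal(size=(n, d))                 # token representations
Wq, Wk = rng.normal(size=(d, dk)), rng.normal(size=(d, dk))

def scores(Wq, Wk):
    """Pre-softmax attention scores for a single head."""
    return (X @ Wq) @ (X @ Wk).T / np.sqrt(dk)

# invertible basis change A in the query/key subspace leaves scores unchanged
A = rng.normal(size=(dk, dk))
assert np.allclose(scores(Wq, Wk), scores(Wq @ A, Wk @ np.linalg.inv(A).T))

# equivalently, scores depend on the parameters only through Wq @ Wk.T
assert np.allclose(scores(Wq, Wk), X @ (Wq @ Wk.T) @ X.T / np.sqrt(dk))

# the Gram matrix X @ X.T is invariant under X -> X @ O for orthogonal O
O, _ = np.linalg.qr(rng.normal(size=(d, d)))
assert np.allclose((X @ O) @ (X @ O).T, X @ X.T)
```

This is why optimizing the composites (or functions of the Gram matrix) removes the redundant directions in parameter space that the raw factored parameterization carries.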
Dual Attention Transformers (DAT) (Altabaa et al., 2024):
- Explicitly split sensory (feature) and relational (pairwise, symmetric) streams in each attention block.
- Sensory heads process standard self-attention; relational heads operate on pairwise relation values, explicitly parameterizing relations (e.g., similarities, orderings) as kernelized inner products of learned feature maps.
- After headwise processing, sensory and relational outputs are concatenated and linearly projected, restoring standard Transformer block structure.
- This explicit separation enables efficient exploitation of relational inductive biases and supports higher-order relational reasoning.
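The two-stream structure can be sketched as below. For brevity, each relation dimension is modeled here as a rank-1 kernel (an elementwise product of learned feature maps); the DAT paper's parameterization is richer, so treat this as a simplified illustration of routing relations rather than features.

```python
import torch
import torch.nn as nn

class DualAttentionBlock(nn.Module):
    """Simplified dual-attention layer: a sensory head performs standard
    self-attention over token features, while a relational head routes
    pairwise relation values built from learned feature maps.
    A sketch of the DAT idea, not the paper's exact design."""
    def __init__(self, d, d_rel=4):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.phi1, self.phi2 = nn.Linear(d, d_rel), nn.Linear(d, d_rel)
        self.proj = nn.Linear(d + d_rel, d)   # fuse sensory + relational streams

    def forward(self, x):
        attn = torch.softmax(self.q(x) @ self.k(x).T / x.shape[-1] ** 0.5, dim=-1)
        sensory = attn @ self.v(x)                          # feature stream
        # r[i, j, l]: l-th relation between tokens i and j (rank-1 kernels here)
        r = self.phi1(x).unsqueeze(1) * self.phi2(x).unsqueeze(0)
        relational = torch.einsum('ij,ijl->il', attn, r)    # relation stream
        return self.proj(torch.cat([sensory, relational], dim=-1))
```

Note that the relational stream's output dimension is the number of relation channels, not the model width, which is what makes the relational content explicit and inspectable per head.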
4. Training, Objectives, and Optimization
CoLLaMo employs frozen or LoRA-tuned LLMs for autoregressive or instruction-guided outputs, with multi-part objectives:
- Cross-entropy loss for molecule captioning, IUPAC naming, and descriptive property QA
- MSE for computed (quantitative) property QA and MAE for motif counting
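A combined objective of this shape can be written as a weighted sum; the helper below is a hedged sketch (the weighting scheme and head names are assumptions, not reported hyperparameters).

```python
import torch
import torch.nn.functional as F

def multitask_loss(lm_logits, lm_targets, prop_pred, prop_true,
                   count_pred, count_true, w_prop=1.0, w_count=1.0):
    """Illustrative combined objective: cross-entropy for text-generation
    tasks (captioning, naming, descriptive QA), MSE for computed-property
    QA, and MAE (L1) for motif counting. Weights are assumptions."""
    ce = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                         lm_targets.view(-1))
    mse = F.mse_loss(prop_pred, prop_true)    # quantitative property QA
    mae = F.l1_loss(count_pred, count_true)   # motif counting
    return ce + w_prop * mse + w_count * mae
```
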
Evaluations use both standard NLP metrics (BLEU, METEOR) and novel molecule-centric hallucination/precision scores (CHARM, RCHARM) to account for domain-specific informativeness.
For symmetry-reduced and dual-attention models, optimizers include AdamW with conventional schedules, warmup, and regularization. Notably, in symmetry-reduced architectures, gradients are projected onto symmetry-irrelevant (functionally meaningful) subspaces or optimization is performed directly in the space of relational invariants.
5. Empirical Results, Ablations, and Model Robustness
CoLLaMo demonstrates improved performance across molecule-centric tasks (Park et al., 18 Jan 2026):
- Molecule captioning: BLEU/METEOR of 44.5/70.5, outperforming LLaMo and GPT-4o (ICL).
- IUPAC naming: BLEU/METEOR of 59.8/77.0.
- Descriptive QA: BLEU 45.62 versus 26.13 for strong baselines.
- Computed QA: substantial MAE reductions in LogP and HOMO–LUMO gap prediction.
- Motif counting: MAE of 1.31 (lower is better).
- Hallucination metrics (CHARM/RCHARM) and GPT-4o judged factuality also favor CoLLaMo over alternatives.
Ablations indicate that removing Co-Attn, modality embeddings, or modality dropout consistently degrades results. The modality-collaborative projector is essential for robust inference; in missing-modality scenarios, CoLLaMo maintains high accuracy, while non-collaborative approaches collapse.
Dual Attention Transformers (DAT) (Altabaa et al., 2024) empirically validate relational attention benefits across synthetic (relational games), symbolic (math), vision (ViT), and large language modeling tasks. Data/sample efficiency, scaling laws, and qualitative interpretability (headwise relational patterns) consistently favor explicit relational architectures.
6. Significance, Limitations, and Open Questions
Relation-aware Transformers, including CoLLaMo, symmetry-reduced models, and dual-attention architectures, systematically reduce or remove the need for models to "discover" relational structure, providing inductive biases for coordinate-free, relation-centric processing (François et al., 21 Feb 2026, Altabaa et al., 2024, Park et al., 18 Jan 2026). The explicitness of relational content improves data efficiency, generalization (especially systematic combinatorial generalization), and interpretability.
However, overhead remains a consideration: relational kernels can increase the per-head parameter count relative to the factored projections they replace, requiring careful rank/budget control, and integrating fully relational processing with scalable hardware implementations remains an open challenge. Theoretical developments have largely proceeded ahead of extensive empirical scaling, with some models reporting primarily conceptual results and proposing further evaluation on optimization dynamics, convergence, and robustness.
A plausible implication is that domain-specialized relation-aware attention modules, such as those in CoLLaMo, are likely to generalize well to other scientific and structured data modalities, contingent on appropriately engineered pairwise and higher-order feature representations.
7. Comparative Overview
| Model/Framework | Key Innovation | Impact Area |
|---|---|---|
| CoLLaMo (Park et al., 18 Jan 2026) | Relation-aware modality-collaborative projector and Co-Attn | Molecular multimodal LLMs |
| Symmetry-reduced Transformers (François et al., 21 Feb 2026) | Invariant Gram and parameter composites under continuous and discrete symmetries | General relational learning, model redundancy |
| Dual Attention Transformers (DAT) (Altabaa et al., 2024) | Explicit sensory and relational attention; fusion stream | Relational games, vision, language modeling |
Each approach enforces or exploits relational structure distinctly: CoLLaMo through domain-aligned modality fusion, symmetry-reduced Transformers by geometric invariance, and DAT through architectural decomposition of feature and relation streams.
Relation-aware Transformers constitute a broad, theoretically principled, and empirically substantiated direction for integrating relational processing directly into the architectural and optimization fabric of large attention-based models. Ongoing work is focused on scaling, robustness analyses, interpretability of learned relations, and domain-specific generalization.