Atomic Capability Pre-Training

Updated 22 January 2026
  • Atomic capability pre-training is a paradigm that trains models to acquire, disentangle, and recombine fundamental, indivisible skills across various domains.
  • It leverages domain-specific tokenization and modular techniques to ensure balanced exposure and efficient compositional generalization in domains such as molecular modeling and robotics.
  • Empirical results demonstrate improved performance, with significant gains in accuracy, data efficiency, and transferability across vision-language, chemical, and reasoning challenges.

Atomic capability pre-training is a paradigm wherein foundational models are explicitly trained to acquire, disentangle, and compose fundamental skills ("atomic capabilities") as a precursor to complex, multi-faceted tasks. This principle finds rigorous instantiation across diverse fields—including molecular modeling, multimodal vision-language reasoning, robotics, and symbolic reasoning LMs—each adapting the notion of atomicity to its native granularity (chemical elements, visual skills, action units, arithmetic operations). Atomic pre-training strategies systematically scaffold higher-order generalization and data efficiency by ensuring exhaustive exposure to basic capabilities and controlling their recombination.

1. Formalization of Atomic Capabilities Across Domains

Atomic capability denotes the minimal, indivisible skill or primitive recognized in a domain-specific ontology of tasks:

  • Molecular modeling: The atomic level refers to individual chemical elements and their combinations (SMILES tokenization at atom granularity), while the substructure level represents recurring combinations of atoms (functional groups, rings) (Ding et al., 2023).
  • Vision-language: Atomic visual skills include color attribution, object recognition, counting, and spatial reasoning. COMPACT formalizes 10 such capabilities, defining compositional complexity k as the number of atomic skills jointly required by a task (Wu et al., 30 Apr 2025).
  • Robotics: Atomic actions are purified motion units (e.g., grasp, push, open) extracted from long-horizon manipulation sequences, ensuring temporal localization for disentangled learning (Zhang et al., 2 Apr 2025).
  • Reasoning LMs: Atomic operations are basic arithmetic (addition, subtraction, multiplication, division) cast as edges in a DAG underlying compositional word problems (Zhang et al., 8 Dec 2025); a minimal sketch appears at the end of this section.
  • ML potentials: Atomic descriptors are per-atom representations; atomic property pre-training ensures generalization across chemistry and physics tasks by leveraging energy and force supervision at the atomic scale (Shoghi et al., 2023, Zhang et al., 2023).

Table 1. Examples of Atomic Capabilities by Domain

Domain             | Atomic Capability    | Example Instance
Molecular modeling | Atom-level token     | 'C', 'N', 'O' in SMILES
Vision-language    | Object recognition   | "What object is present?"
Robotics           | Atomic action unit   | "grasp"
Reasoning LMs      | Arithmetic primitive | v_i = v_j + v_k

Atomicity thus provides a controlled substrate for capability accumulation and further compositional generalization.
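
To make the reasoning-LM formalization concrete, the following minimal sketch builds a toy compositional word-problem skeleton as a DAG whose edges are atomic arithmetic operations and reports op(G), the operation count used as the complexity variable. The data structures and sampling procedure are illustrative assumptions, not the exact construction of Zhang et al. (8 Dec 2025).

```python
import operator
import random

# Atomic arithmetic primitives; each DAG edge applies one of these.
ATOMIC_OPS = {"+": operator.add, "-": operator.sub,
              "*": operator.mul, "/": operator.truediv}

def sample_problem_dag(num_ops: int, seed: int = 0):
    """Sample a toy problem: leaf values plus steps of the form v_i = v_j OP v_k.

    Nodes are topologically ordered, so every step references earlier nodes only.
    """
    rng = random.Random(seed)
    values = {0: rng.randint(1, 9), 1: rng.randint(1, 9)}  # given quantities
    steps = []
    for i in range(2, 2 + num_ops):
        while True:
            op = rng.choice(list(ATOMIC_OPS))
            j, k = rng.sample(range(i), 2)
            if not (op == "/" and values[k] == 0):          # avoid division by zero
                break
        values[i] = ATOMIC_OPS[op](values[j], values[k])
        steps.append((i, op, j, k))
    return {0: values[0], 1: values[1]}, steps

def op_count(steps) -> int:
    """op(G): the number of atomic operations the problem requires."""
    return len(steps)

def solve(leaves, steps):
    """Recover the answer by executing the atomic operations in order."""
    values = dict(leaves)
    for i, op, j, k in steps:
        values[i] = ATOMIC_OPS[op](values[j], values[k])
    return values[max(values)]

leaves, steps = sample_problem_dag(num_ops=3)
print(f"op(G) = {op_count(steps)}, answer = {solve(leaves, steps):.2f}")
```

Controlling num_ops during synthetic generation is what allows a pre-training corpus to be graded by compositional depth, with op(G) serving as the difficulty knob.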

2. Pre-training Methodologies for Atomic Capabilities

Atomic capability pre-training strategies share several architectural and procedural components, adapted to the particulars of their domain:

a. Data Construction and Tokenization

  • Molecular: BPE-style tokenizers recursively merge frequent SMILES substrings but can stochastically decompose substructure tokens back into atomic tokens via Bernoulli-masked dropout, yielding a hybrid atomic/substructure granularity. During pre-training, p_drop = 0.2 replaces 20% of substructure tokens with their constituent atoms; for purely atomic downstream tasks, p_drop = 1 (Ding et al., 2023). A minimal sketch of this dropout scheme follows this list.
  • Vision-language: COMPACT samples combinations of k atomic capabilities per image and generates QA pairs via LLM prompting, with k-wise balancing and QA verification (Wu et al., 30 Apr 2025).
  • Robotics: Video segments are automatically filtered and re-annotated to contain a single atomic action, guaranteeing semantic purity before visual-language contrastive learning (Zhang et al., 2 Apr 2025).
  • Reasoning LMs: Synthetic problems are generated so that each requires explicit recovery of a sequence of atomic operations, with operation-count complexity op(G) as a controlling variable (Zhang et al., 8 Dec 2025).
  • Atomic ML potentials: Supervision is on per-atom energies and forces, across a broad multi-task mixture of chemical/material datasets (Shoghi et al., 2023, Zhang et al., 2023).
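
As a concrete illustration of the hybrid tokenization above, the sketch below greedily matches a toy substructure vocabulary and applies Bernoulli-masked dropout with probability p_drop to decompose matched substructures back into atom-level tokens. The vocabulary, regular expression, and greedy matching are illustrative assumptions rather than the exact AdaMR tokenizer.

```python
import random
import re

# Toy substructure vocabulary: frequent SMILES fragments that a BPE-style
# merge procedure might learn (assumed here purely for illustration).
SUBSTRUCTURE_VOCAB = sorted(["c1ccccc1", "C(=O)O", "C(=O)", "OC"],
                            key=len, reverse=True)

# Atom-granularity SMILES tokens: bracket atoms, two-letter elements, singles.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Cl|Br|Si|@@|[A-Za-z0-9=#\(\)\+\-\\/@%]")

def atomize(fragment: str) -> list[str]:
    """Decompose a substructure token into atom-level tokens."""
    return ATOM_PATTERN.findall(fragment)

def tokenize(smiles: str, p_drop: float, seed: int = 0) -> list[str]:
    """Greedy substructure matching with Bernoulli-masked atom dropout.

    With probability p_drop a matched substructure is replaced by its atomic
    tokens; p_drop = 1.0 therefore yields a purely atomic tokenization.
    """
    rng = random.Random(seed)
    tokens, i = [], 0
    while i < len(smiles):
        for frag in SUBSTRUCTURE_VOCAB:
            if smiles.startswith(frag, i):
                # Bernoulli mask decides substructure vs. atomic granularity.
                tokens.extend(atomize(frag) if rng.random() < p_drop else [frag])
                i += len(frag)
                break
        else:
            match = ATOM_PATTERN.match(smiles, i)
            token = match.group() if match else smiles[i]
            tokens.append(token)
            i += len(token)
    return tokens

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(tokenize(aspirin, p_drop=0.2))   # mostly substructure tokens
print(tokenize(aspirin, p_drop=1.0))   # purely atomic tokens
```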

b. Model Objectives and Architectures

  • Sequence Modeling: Sequence-to-sequence canonicalization objectives (e.g., mapping generic to canonical SMILES) or contrastive next-token prediction are used to ground atomic capability (Ding et al., 2023, Zhang et al., 8 Dec 2025).
  • Contrastive and Compositional Learning: Hierarchical CLIP and recombination losses align video representations with disentangled text semantics across atomic partitions (subject, action, object) (Zhang et al., 2 Apr 2025). Visual-language pre-training employs cross-entropy over text tokens and balanced compositional sampling (Wu et al., 30 Apr 2025).
  • Multi-task Losses: ML potential pre-training optimizes jointly across energy and force regression heads, with explicit task weights and structure-wise reductions to ensure per-atom balance and scalability (Shoghi et al., 2023, Zhang et al., 2023).
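
A minimal sketch of such a multi-task objective for atomic ML potentials, assuming a per-atom-normalized energy term, a per-atom force term, and fixed task weights; the cited frameworks define their own reductions, weighting schedules, and additional heads. In practice the force predictions are often obtained as negative gradients of the predicted energy with respect to atomic positions.

```python
import torch

def multitask_atomic_loss(pred_energy, true_energy,
                          pred_forces, true_forces,
                          n_atoms,
                          w_energy: float = 1.0, w_force: float = 10.0):
    """Weighted energy + force regression with per-atom balancing.

    pred_energy, true_energy: (B,) total energies per structure.
    pred_forces, true_forces: (N, 3) forces for all atoms in the batch.
    n_atoms: (B,) number of atoms in each structure.

    The energy term is normalized per atom so large structures do not
    dominate; the force term is already per-atom by construction.
    """
    energy_loss = torch.mean(((pred_energy - true_energy) / n_atoms) ** 2)
    force_loss = torch.mean((pred_forces - true_forces) ** 2)
    return w_energy * energy_loss + w_force * force_loss

# Toy batch: 2 structures with 3 and 5 atoms (8 atoms total).
n_atoms = torch.tensor([3.0, 5.0])
pred_e, true_e = torch.randn(2), torch.randn(2)
pred_f, true_f = torch.randn(8, 3), torch.randn(8, 3)
loss = multitask_atomic_loss(pred_e, true_e, pred_f, true_f, n_atoms)
print(f"multi-task loss: {loss.item():.4f}")
```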

3. Interaction with Model Architectures and Learning Dynamics

Atomic capability pre-training leverages flexible architectures capable of both fine-grained and compositional feature synthesis:

  • Transformers: Used for both molecule generation (AdaMR: 12-layer encoder-decoder) and vision-language alignment (temporal-difference Transformer in RoboAct-CLIP) (Ding et al., 2023, Zhang et al., 2 Apr 2025).
  • Graph Neural Networks: Form the backbone of interatomic potential pre-training, supporting invariant and equivariant message passing and remaining compatible with both self-supervised and supervised atomic objectives (Shoghi et al., 2023, Cui et al., 2023, Zhang et al., 2023).
  • Feature Disentanglement: Modular disentanglement heads (subject, action, object) with orthogonality penalties, feature-bank recombination, and hierarchical contrastive objectives ensure atomicity is represented in a decoupled latent structure (Zhang et al., 2 Apr 2025); see the sketch after this list.
  • Curricular Scheduling: In reasoning LMs, atomic pre-training is often followed by mid-training at the “edge of competence” and RL-based post-training, targeting tasks just beyond the base model’s solved set (Zhang et al., 8 Dec 2025).
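
A minimal sketch of the feature-disentanglement component, assuming simple linear subject/action/object heads over a pooled video feature and a pairwise cosine-based orthogonality penalty; the actual RoboAct-CLIP heads and penalty form may differ.

```python
import torch
import torch.nn as nn

class DisentangledHeads(nn.Module):
    """Project a shared video feature into subject / action / object factors."""

    def __init__(self, feat_dim: int = 512, head_dim: int = 128):
        super().__init__()
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, head_dim)
            for name in ("subject", "action", "object")
        })

    def forward(self, feats: torch.Tensor) -> dict[str, torch.Tensor]:
        return {name: head(feats) for name, head in self.heads.items()}

def orthogonality_penalty(factors: dict[str, torch.Tensor]) -> torch.Tensor:
    """Penalize cross-correlation between factor embeddings so each head
    captures a distinct atomic component (subject vs. action vs. object)."""
    names = list(factors)
    penalty = torch.tensor(0.0)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a = nn.functional.normalize(factors[names[i]], dim=-1)
            b = nn.functional.normalize(factors[names[j]], dim=-1)
            penalty = penalty + (a * b).sum(dim=-1).abs().mean()
    return penalty

model = DisentangledHeads()
video_feats = torch.randn(4, 512)            # batch of pooled clip features
factors = model(video_feats)
loss_orth = orthogonality_penalty(factors)   # added to the contrastive loss
print(f"orthogonality penalty: {loss_orth.item():.4f}")
```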

4. Empirical Metrics and Effects of Atomic Capabilities

The efficacy of atomic capability pre-training is measured by a range of metrics highlighting generalization, data efficiency, and compositionality:

  • Generative Metrics (Molecular): Validity, uniqueness, and novelty of generated molecules. AdaMR achieves 90.7% validity, 99.1% uniqueness, and 93.2% novelty on ZINC250K at 100% atomic granularity, outperforming previous models (Ding et al., 2023). A sketch of how these metrics are computed follows this list.
  • Compositional Generalization (Vision-Language): COMPACT reaches substantial (+94%, +83% relative) gains over full-scale VIT baselines on tasks requiring k ≥ 4 atomic skills, despite using only 10% of the data (Wu et al., 30 Apr 2025).
  • Robotics Success Rates: RoboAct-CLIP yields 76.5% success versus 64.5% (MPI-Base) on multi-object manipulation, with controlled ablations (–10.5pp for removing temporal-differencing, –6.5pp for removing disentanglement) (Zhang et al., 2 Apr 2025).
  • Atomic ML Potentials: Pre-training can cut force MAEs by >50% and dramatically accelerate fine-tuning (GPIP: achieves SchNet accuracy with an order-of-magnitude less DFT data; JMP: 59% error improvement and up to 12× reduced fine-tuning compute) (Shoghi et al., 2023, Cui et al., 2023).
  • Reasoning LMs: “Headroom” after atomic pre-training is necessary: if OOD performance is saturated, RL does not extend capabilities (pass@128 increases only when pre-training stops short of the hardest tasks). RL gains of +42pp pass@128 on OOD–edge and +15pp on OOD–hard are reported only when atomic pre-training has installed but not saturated the required primitive operations (Zhang et al., 8 Dec 2025).
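
The molecular generative metrics cited above can be computed roughly as in the sketch below, which uses RDKit to test chemical validity and to canonicalize SMILES before deduplication; the exact evaluation protocols in the cited papers (filtering rules, canonicalization settings) may differ.

```python
from rdkit import Chem

def generative_metrics(generated: list[str], training_set: set[str]):
    """Validity, uniqueness, and novelty of generated SMILES strings.

    validity   = fraction of samples parseable by RDKit
    uniqueness = fraction of valid molecules that are distinct (canonical form)
    novelty    = fraction of unique molecules absent from the training set
                 (training_set is assumed to hold canonical SMILES)
    """
    canonical = []
    for smi in generated:
        mol = Chem.MolFromSmiles(smi)          # None if the SMILES is invalid
        if mol is not None:
            canonical.append(Chem.MolToSmiles(mol))
    unique = set(canonical)
    novel = unique - training_set
    validity = len(canonical) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(canonical) if canonical else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty

train = {"CCO"}
samples = ["CCO", "CCO", "C1=CC=CC=C1", "not_a_smiles"]
print(generative_metrics(samples, train))     # (0.75, 0.666..., 0.5)
```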

5. Synergy, Transfer, and Multi-task Considerations

A principal finding is that atomic capability pre-training synergizes with higher-granularity representations, mid-training, and targeted RL/imitation to support multi-task and cross-domain transfer:

  • Hybrid Granularity: AdaMR demonstrates that mixed atomic/substructure pre-training benefits both generative accuracy (atoms) and property prediction (substructures); unified models maintain strong performance across divergent downstreams (Ding et al., 2023).
  • Data Efficiency: COMPACT and GPT-based reasoning curricula illustrate that explicit compositional control during pre-training—covering all atomic skills and their combinations in a balanced schedule—accelerates overall skill acquisition and extends generalization with lower corpus scale (Wu et al., 30 Apr 2025, Zhang et al., 8 Dec 2025).
  • Multi-task Scaling: JMP and DPA-2 pre-train on heterogeneous datasets spanning molecules and materials, using shared representations for universal atomic priors. This enables order-of-magnitude reductions in labeled data for new domains and stable scaling to hundreds of millions of parameters (Shoghi et al., 2023, Zhang et al., 2023).
  • Transfer and Distillation: In atomic ML, large pre-trained models can be distilled into lightweight surrogates after few-shot fine-tuning, transferring atomic priors efficiently; t-SNE analyses show per-atom embeddings structure according to chemical and geometric environment irrespective of the initial DFT labeling (Zhang et al., 2023).
  • Process-level Rewarding: For reasoning LMs, process-verified (not outcome-only) RL rewards prevent reward hacking and robustify extrapolative reasoning. This is essential for pushing capabilities beyond atomic pre-training via post-training (Zhang et al., 8 Dec 2025).
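
The distinction between outcome-only and process-verified rewards can be sketched as follows. The verify_step function here is a hypothetical stand-in for whatever process verification the cited work employs; it simply re-executes each claimed atomic operation before any credit is assigned.

```python
def outcome_reward(trace: list[dict], final_answer, gold_answer) -> float:
    """Outcome-only reward: 1 if the final answer matches, regardless of how
    it was reached. Vulnerable to reward hacking via shortcut traces."""
    return 1.0 if final_answer == gold_answer else 0.0

def verify_step(step: dict) -> bool:
    """Hypothetical process verifier: re-execute one atomic operation and
    check the model's claimed intermediate value."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    return ops[step["op"]](step["lhs"], step["rhs"]) == step["claimed"]

def process_reward(trace: list[dict], final_answer, gold_answer) -> float:
    """Process-verified reward: credit only if every atomic step checks out
    AND the final answer is correct."""
    if not all(verify_step(step) for step in trace):
        return 0.0
    return 1.0 if final_answer == gold_answer else 0.0

# Gold problem: (2 + 3) * 4 = 20.  This trace reaches 20 with wrong steps.
trace = [{"op": "+", "lhs": 2, "rhs": 3, "claimed": 6},    # wrong intermediate
         {"op": "*", "lhs": 6, "rhs": 4, "claimed": 20}]   # wrong arithmetic
print(outcome_reward(trace, 20, 20))   # 1.0  (rewards the hack)
print(process_reward(trace, 20, 20))   # 0.0  (step verification fails)
```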

6. Guidelines and Best Practices

Extracted recommendations for designing and deploying atomic capability pre-training frameworks:

  1. Define a minimal, exhaustive atomic skill set relevant to the domain. Explicit enumeration and semantic filtering are necessary to ensure coverage and purity (Wu et al., 30 Apr 2025, Zhang et al., 2 Apr 2025).
  2. Balance coverage and combinatorial sampling. Uniform k-complexity distributions avoid overfitting to shallow compositions, facilitating robust generalization (Wu et al., 30 Apr 2025); see the sketch after this list.
  3. Apply atomic-level objectives and regularization. For molecular and ML potential models, use hybrid atom/substructure tokenizers and loss formulations that prevent any single granularity from dominating (Ding et al., 2023, Shoghi et al., 2023).
  4. Exploit process-level (intermediate) feedback when possible. In reasoning and robotics, both sequence-level and step-level signals guide the precise composition of atomic capabilities (Zhang et al., 8 Dec 2025, Zhang et al., 2 Apr 2025).
  5. Leverage lightweight, modular model components for disentanglement. Disentanglement heads and temporal-differencing modules isolate atomic capabilities, reducing confounding and improving transfer (Zhang et al., 2 Apr 2025).
  6. Amortize compute and enable efficient fine-tuning/distillation. Atomic pre-trained backbones accelerate adaptation to new data and allow for compact downstream surrogates (Shoghi et al., 2023, Zhang et al., 2023).
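
Guideline 2 can be made concrete with a sampler that draws the compositional complexity k uniformly before drawing a combination of k atomic skills, so deeper compositions are not under-represented. The skill list below mixes the four capabilities named in this article with placeholder entries, and the sampler itself is an illustrative assumption rather than the COMPACT procedure.

```python
import random
from collections import Counter

# Atomic visual skills named in the text; a full COMPACT-style set has 10
# (the remaining entries here are placeholders for illustration).
ATOMIC_SKILLS = ["color attribution", "object recognition", "counting",
                 "spatial reasoning", "skill_5", "skill_6", "skill_7",
                 "skill_8", "skill_9", "skill_10"]

def sample_training_tasks(n_tasks: int, k_max: int = 4, seed: int = 0):
    """Sample task specifications with a uniform distribution over the
    compositional complexity k, then a uniform combination of k skills."""
    rng = random.Random(seed)
    tasks = []
    for _ in range(n_tasks):
        k = rng.randint(1, k_max)                       # uniform over complexity
        skills = tuple(sorted(rng.sample(ATOMIC_SKILLS, k)))
        tasks.append(skills)
    return tasks

tasks = sample_training_tasks(10_000)
by_k = Counter(len(t) for t in tasks)
print(dict(sorted(by_k.items())))   # roughly equal counts for k = 1..4
```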

A plausible implication is that explicit atomic capability pre-training will become foundational across domains striving for multi-task, data-efficient, and compositional generalization. The universality of the atomic paradigm is evidenced by convergence in methodology, empirical metrics, and representational strategies across otherwise disparate research areas.
