AtomDisc: Atom-Level Discretization Frameworks

Updated 25 May 2026
  • AtomDisc is a unified set of frameworks that discretizes atomic configurations for generative modeling, trajectory mapping, and molecular LLM tokenization.
  • It employs coupled continuous and discrete diffusion processes, alongside quasi-classical trajectory techniques, to model chemical kinetics and atomic structures.
  • The approach enhances interpretability and performance in chemical modeling by integrating score-based neural networks, state-space mapping, and graph-based tokenization.

AtomDisc refers to a set of frameworks and algorithms for atom-level discretization, generation, and analysis of molecular, atomistic, and chemical kinetic systems. The term encompasses three distinct but fundamentally related research directions: (1) score-based generative modeling of atomistic structures via coupled continuous and discrete diffusion (“AtomDisc” within AGeDi), (2) a state-space trajectory-resolution framework in theoretical chemical kinetics (quasi-classical trajectory approaches), and (3) atom-level tokenizer quantization schemes for molecular LLMs. Each domain utilizes atom-level discretization to enable structured, interpretable, and scalable modeling across chemical domains.

1. AtomDisc as Joint Score-Based Atomistic Diffusion

The generative AtomDisc framework models the joint probability of an atomic configuration $\mathcal{M} = (\mathbf{R}, \mathbf{z}, \mathbf{S})$, where $\mathbf{R} \in \mathbb{R}^{3\times N}$ encodes atomic positions, $\mathbf{z} \in \{0,1,\dots,z_{\max}\}^N$ encodes atomic types (including a mask state), and $\mathbf{S}$ is a fixed periodic cell. The central idea is to execute two parallel, independent diffusion processes during both training and sampling:

  • A continuous-time stochastic differential equation (SDE) on atomic positions $\mathbf{R}$, using either variance-preserving or variance-exploding SDEs,
  • A continuous-time discrete-state Markov chain (the "AtomDisc" process) on atomic types $\mathbf{z}$, formulated as an absorbing-mask process.
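The two forward (noising) processes listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AGeDi implementation: the geometric variance-exploding schedule, the constant type-noise rate, the use of index 0 as the mask state, and the function names are all hypothetical choices made for the example. Positions are stored as an N×3 list for readability.

```python
import math
import random

MASK = 0  # assumption: index 0 of the type alphabet is the absorbing mask state


def sigma_bar(t, rate=5.0):
    """Integrated type-noise level for a constant rate sigma(t) = rate (illustrative)."""
    return rate * t


def noise_positions(positions, t, sigma_min=0.01, sigma_max=10.0, rng=random):
    """Variance-exploding perturbation R_t = R_0 + sigma(t) * eps, eps ~ N(0, I)."""
    sigma_t = sigma_min * (sigma_max / sigma_min) ** t  # geometric schedule (assumed)
    noisy = [[x + sigma_t * rng.gauss(0.0, 1.0) for x in atom] for atom in positions]
    return noisy, sigma_t


def noise_types(types, t, rng=random):
    """Absorbing-mask process: each atom is masked independently with
    probability 1 - exp(-sigma_bar(t)); unmasked atoms keep their type."""
    p_mask = 1.0 - math.exp(-sigma_bar(t))
    return [MASK if rng.random() < p_mask else z for z in types]


rng = random.Random(0)
R0 = [[0.0, 0.0, 0.0], [1.4, 0.0, 0.0], [0.0, 1.4, 0.0]]
z0 = [78, 28, 78]  # Pt, Ni, Pt
Rt, sigma_t = noise_positions(R0, t=0.5, rng=rng)
zt = noise_types(z0, t=0.5, rng=rng)
```

Because the type process only moves atoms into the mask state, every entry of `zt` is either the original type or `MASK`, which is what makes the reverse process an "unmasking" problem.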

For positions, denoising score matching is employed to train a GNN-based score network $\mathbf{s}_\theta(\mathcal{M}_t, t) \approx \nabla_{\mathbf{R}_t} \log p_t(\mathbf{R}_t)$. Atom types are diffused using a generator $Q_t = \sigma(t) Q$ (with only transitions from physical types into a mask state), and the score model learns to predict the 'concrete score' by minimizing a score-entropy loss. The overall objective is a weighted sum of position and type score losses:

$$\mathcal{L}_{\text{total}} = \lambda_R \mathcal{L}^R(\theta) + \lambda_z \mathcal{L}^z(\theta)$$
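To make the position term concrete, the sketch below computes a denoising score matching loss for a Gaussian perturbation kernel and combines it with a type loss through the weights $\lambda_R$ and $\lambda_z$. It is a hedged illustration: the $\sigma_t^2$ loss weighting, the function names, and the `score_fn` callable are assumptions for the example, and the score-entropy type loss is represented only by a number passed in, since its exact form is not given here.

```python
import random


def dsm_position_loss(score_fn, R0, sigma_t, rng=random):
    """Denoising score matching for the kernel N(R_t; R_0, sigma_t^2 I).

    The regression target is grad log p(R_t | R_0) = -(R_t - R_0) / sigma_t^2
    = -eps / sigma_t. The squared error is weighted by sigma_t^2 (a common
    choice, assumed here) and averaged over all coordinates.
    """
    eps = [[rng.gauss(0.0, 1.0) for _ in atom] for atom in R0]
    Rt = [[x + sigma_t * e for x, e in zip(atom, e_atom)]
          for atom, e_atom in zip(R0, eps)]
    pred = score_fn(Rt, sigma_t)  # hypothetical score model: same shape as Rt
    total, count = 0.0, 0
    for p_atom, e_atom in zip(pred, eps):
        for p, e in zip(p_atom, e_atom):
            target = -e / sigma_t
            total += sigma_t ** 2 * (p - target) ** 2
            count += 1
    return total / count


def total_loss(loss_R, loss_z, lambda_R=1.0, lambda_z=1.0):
    """Weighted sum of the position and type losses."""
    return lambda_R * loss_R + lambda_z * loss_z
```

A score function that returns the exact conditional score drives `dsm_position_loss` to zero (up to rounding), which is a convenient sanity check when wiring up a real network.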

Sampling applies joint denoising with classifier-free guidance, allowing structured steering toward properties such as crystal symmetry or element composition (Rønne et al., 24 Jul 2025).
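Classifier-free guidance combines a conditional and an unconditional score estimate at each denoising step. The sketch below uses one common convention, $\mathbf{s} = (1 + w)\,\mathbf{s}_{\text{cond}} - w\,\mathbf{s}_{\text{uncond}}$; the convention and the function name are assumptions for illustration, and the source may parameterize the guidance weight differently.

```python
def guided_score(score_cond, score_uncond, w):
    """Classifier-free guidance on per-atom score vectors.

    w = 0 recovers the conditional score; larger w pushes samples further
    toward the conditioning signal (for example a target symmetry).
    """
    return [
        [(1.0 + w) * c - w * u for c, u in zip(c_atom, u_atom)]
        for c_atom, u_atom in zip(score_cond, score_uncond)
    ]


s_cond = [[1.0, 2.0, 3.0]]
s_uncond = [[0.5, 0.0, 1.0]]
s_guided = guided_score(s_cond, s_uncond, w=1.0)
```

In practice both estimates come from the same network, evaluated once with the condition and once with it dropped.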

2. State-Space Approach in Atom–Diatom Reaction Dynamics

Within rarefied-gas reaction dynamics, AtomDisc denotes a comprehensive, trajectory-validated state-space mapping, assigning transition probabilities to all atom–diatom collisions of the type $\mathrm{A} + \mathrm{BC}(v, j) \rightarrow \text{products}$, where $v$ and $j$ are the initial diatom vibrational and rotational quantum numbers. For each specific initial $(v, j)$ and collision energy $E$, a large ensemble of quasi-classical trajectories is propagated on "vetted," global, reactive potential-energy surfaces (PESs). Final state quantum numbers $(v', j')$ are extracted by analysis of actions and angular momenta.

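Transition probabilities, $P(v, j \rightarrow v', j'; E)$, are constructed as normalized trajectory counts, furnishing state-resolved probability maps. Integrated over impact parameter, these yield state-to-state cross sections $\sigma_{vj \to v'j'}(E)$, and further Boltzmann-averaged values provide thermal rate coefficients, which in the standard form read

$$k_{vj \to v'j'}(T) = \sqrt{\frac{8}{\pi \mu (k_B T)^3}} \int_0^\infty \sigma_{vj \to v'j'}(E)\, E\, e^{-E / k_B T}\, dE$$

where $\mu$ is the reduced mass of the A + BC collision pair and $k_B$ is the Boltzmann constant.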
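The chain from trajectory outcomes to rate coefficients can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: final states are represented as hypothetical `(channel, v', j')` tuples, impact parameters are assumed to be sampled uniformly in $b^2$ up to `b_max` (so that $\sigma = \pi b_{\max}^2 P$), and the thermal average uses the standard Maxwell-Boltzmann expression in reduced units with a simple trapezoid rule.

```python
import math
from collections import Counter


def transition_probabilities(final_states):
    """Normalized trajectory counts: P(final) = N_final / N_total."""
    counts = Counter(final_states)
    n_total = len(final_states)
    return {state: c / n_total for state, c in counts.items()}


def cross_sections(probabilities, b_max):
    """sigma = pi * b_max^2 * P, valid when b is sampled uniformly in b^2."""
    area = math.pi * b_max ** 2
    return {state: area * p for state, p in probabilities.items()}


def thermal_rate(energies, sigmas, kT, mu):
    """k(T) = sqrt(8 kT / (pi mu)) / (kT)^2 * integral of sigma(E) E exp(-E/kT) dE,
    evaluated with the trapezoid rule on the supplied energy grid."""
    f = [s * e * math.exp(-e / kT) for e, s in zip(energies, sigmas)]
    integral = sum(
        0.5 * (f[i] + f[i + 1]) * (energies[i + 1] - energies[i])
        for i in range(len(energies) - 1)
    )
    return math.sqrt(8.0 * kT / (math.pi * mu)) / kT ** 2 * integral


# Toy ensemble for one initial (v, j) and one collision energy
outcomes = [("inelastic", 1, 4)] * 6 + [("exchange", 0, 2)] * 3 + [("elastic", 0, 0)]
probs = transition_probabilities(outcomes)
sigma = cross_sections(probs, b_max=2.0)
```

For an energy-independent cross section the integral reduces to $\sigma \langle v \rangle$ with $\langle v \rangle = \sqrt{8 k_B T / (\pi \mu)}$, which gives a quick check on the numerical quadrature.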

This state-space perspective enables direct closure of master equations for nonequilibrium kinetic modeling and suggests principled zone-based or machine-learned dimension reduction for tractable simulation in computational fluid dynamics (CFD) and direct simulation Monte Carlo (DSMC). Results demonstrate robust mapping across PES types and the ability to diagnose reactivity and energy flow by summarizing 10⁸+ trajectory events into interpretable data structures (Vijayakumar et al., 10 Dec 2025).

3. AtomDisc as Atom-Level Tokenizer for Molecular LLMs

In molecular LLMs, AtomDisc enables structure-aware tokenization of atom-level local environments. Each atom in a molecular graph is embedded via a graph neural network encoder (GIN), incorporating features such as atomic number, degree, and local topology. These embeddings are discretized using vector quantization (VQ-VAE) with a codebook of $K$ centers, where each atom embedding $\mathbf{h}_i$ is assigned the index of its nearest codebook vector $\mathbf{e}_k$ in feature space:

$$t_i = \arg\min_{k \in \{1, \dots, K\}} \lVert \mathbf{h}_i - \mathbf{e}_k \rVert_2$$
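The nearest-neighbor assignment can be written directly. The sketch below is a plain-Python illustration with a toy two-dimensional codebook; the function name and the data are hypothetical, and a real tokenizer would operate on GNN embeddings with a learned codebook.

```python
def quantize(embeddings, codebook):
    """Assign each atom embedding the index of its nearest codebook vector
    (squared Euclidean distance; ties resolve to the lowest index)."""
    tokens = []
    for h in embeddings:
        dists = [sum((a - b) ** 2 for a, b in zip(h, e)) for e in codebook]
        tokens.append(min(range(len(codebook)), key=dists.__getitem__))
    return tokens


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]          # K = 3 toy centers
atom_embeddings = [[0.9, 0.1], [0.1, 0.8], [0.1, 0.1]]   # one row per atom
tokens = quantize(atom_embeddings, codebook)              # [1, 2, 0]
```

Indices here are zero-based, as is conventional in code; the resulting integer sequence is what gets passed on as the molecule's atom-level tokens.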

The resulting tokens $t_i$ are integrated into large language models (LLMs, e.g., LLaMA-2-7B) by aligning codebook vectors with the LLM's embedding space via a trainable two-layer MLP projector. Downstream adapters (LoRA) are fine-tuned for tasks including property prediction, quantum regression, and molecular sequence generation.
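A two-layer MLP projector of the kind described above maps each codebook vector into the LLM's embedding dimension. The following is a minimal forward-pass sketch only: the ReLU activation, the layer widths, the random initialization, and the class name are assumptions for illustration, and training of the projector is omitted.

```python
import random


def linear(x, weights, bias):
    """Dense layer: y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class TwoLayerProjector:
    """Hypothetical projector: codebook dim -> hidden -> LLM embedding dim."""

    def __init__(self, d_code, d_hidden, d_llm, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, cols ** -0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.w1, self.b1 = init(d_hidden, d_code), [0.0] * d_hidden
        self.w2, self.b2 = init(d_llm, d_hidden), [0.0] * d_llm

    def __call__(self, code_vector):
        hidden = relu(linear(code_vector, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)


toy_codebook = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.5, -0.5, 2.0]]
projector = TwoLayerProjector(d_code=4, d_hidden=8, d_llm=16)
token_ids = [1, 0, 1]
llm_inputs = [projector(toy_codebook[t]) for t in token_ids]
```

Each entry of `llm_inputs` has the LLM's embedding width, so projected atom tokens can be placed alongside ordinary text-token embeddings in the input sequence.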

Empirically, incorporating AtomDisc tokens yields state-of-the-art results on MoleculeNet (average ROC-AUC 84.7%), competitive quantum regression metrics (QM9: MAE for HOMO/LUMO/GAP 0.0033/0.0032/0.0042), and substantial gains in molecular generation tasks. Removing atom-level tokens at either pre-training or fine-tuning produces consistent performance drops (10–20% in absolute match) (Zhang et al., 28 Nov 2025).

4. Applications: Interpolation, Property Steering, and Interpretability

Practical applications of AtomDisc span diverse chemical domains. In generative modeling, interpolating atomic-type embeddings allows the creation of bimetallic clusters (Pt–Ni, Pd–Ag) not present in the training data, with generated structures lying smoothly between reference compositions. For two-dimensional materials, conditioning on layer-group symmetry and employing classifier-free guidance enables fine control over generated crystallographic symmetries, achieving up to 90% target-accuracy for optimal guidance weights. In reaction kinetics, the trajectory-based AtomDisc maps characterize the channel yields (elastic, inelastic, exchange, atomization) as functions of energy and quantum state, providing a basis for predictive non-equilibrium models.

For molecular LLMs, the interpretability of AtomDisc tokens is evidenced by chemically coherent token clusters (observed in t-SNE), sharp associations with functional groups, and direct attention on chemically relevant motifs. Token entropy analyses confirm low entropy for "pure" functional group assignments and mixture behavior in chemically ambiguous environments.
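A token entropy analysis of this kind can be sketched by computing, for each token, the Shannon entropy of the functional-group labels of the atoms assigned to it. The labels, the pairing format, and the function name below are hypothetical; the source does not specify how the analysis was implemented.

```python
import math
from collections import Counter, defaultdict


def token_entropies(assignments):
    """Shannon entropy (bits) of functional-group labels per token.

    `assignments` is an iterable of (token_id, label) pairs, one per atom.
    Entropy 0 means the token always maps to a single functional group.
    """
    by_token = defaultdict(Counter)
    for token_id, label in assignments:
        by_token[token_id][label] += 1
    entropies = {}
    for token_id, counts in by_token.items():
        n = sum(counts.values())
        entropies[token_id] = sum((c / n) * math.log2(n / c) for c in counts.values())
    return entropies


pairs = [
    (7, "carbonyl"), (7, "carbonyl"), (7, "carbonyl"),   # "pure" token
    (12, "amine"), (12, "amide"),                        # ambiguous token
]
entropy = token_entropies(pairs)  # {7: 0.0, 12: 1.0}
```

Low values flag tokens that behave like clean functional-group markers, while values near $\log_2$ of the number of labels indicate mixed environments.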

5. Software and Implementation

The AtomDisc generative framework is realized in the open-source AGeDi Python package (GPLv3). The architecture is modular, with the main components:

  • AtomsGraph: atom structure representation and equivariant neighbor graph construction,
  • ScoreModel: equivariant GNN with separate heads for position/type scores and optional properties,
  • Noiser: forward and reverse diffusion processes for atomic positions and types,
  • Diffusion: orchestrates training, batch-wise noising, loss aggregation, and joint sampling with guidance.

Extensions are supported through backbones (e.g., PaiNN, EGNN, NequIP), property conditioning, or alternative discrete diffusion schemes. Full documentation and resources are available at https://github.com/nronne/agedi and https://agedi.readthedocs.io (Rønne et al., 24 Jul 2025).

AtomDisc tokenizer workflows for LLMs are structured as discrete modules: GNN encoding (MoleculeSTM-GIN), VQ-VAE training, LLM embedding alignment via MLP projector, and modular LoRA adapter fine-tuning. Representative pseudocode blocks for graph encoding, quantization, and token projection are included in the original work to enable reproducibility (Zhang et al., 28 Nov 2025).

6. Significance and Outlook

AtomDisc unifies advanced atom-level discretization approaches across generative molecular modeling, non-equilibrium kinetic simulations, and LLM-driven chemical informatics. By explicitly modeling and leveraging atomistic environments (via score-based neural diffusion, trajectory-validated state-space analysis, or structure-aware tokenization), AtomDisc frameworks enable interpretability, algorithmic flexibility, and empirical gains. Their modular extensibility (in AGeDi, tokenizer design, and zone-grained kinetic mapping) suggests AtomDisc methodologies will be central to data-driven materials discovery, molecular generative design, and accurate simulation of complex chemical kinetics.
