UniGenX: Unified Scientific Data Generation

Updated 23 November 2025
  • UniGenX is a unified framework that combines autoregressive sequence modeling with conditional diffusion to generate symbolic and continuous scientific data with high accuracy.
  • It integrates an AR transformer decoder and a diffusion head, enabling joint modeling of discrete tokens like chemical formulas and continuous values such as atomic coordinates.
  • Empirical evaluations show significant improvements on crystal structure prediction, small-molecule conformation generation, and conditional molecule generation benchmarks.

UniGenX is a unified framework for scientific data generation that combines autoregressive (AR) next-token prediction with conditional diffusion-based generative modeling. This architecture addresses the longstanding challenge of generating scientific data containing both symbolic sequences (such as chemical formulas, SMILES strings, and special markers) and high-precision continuous quantities (such as atomic coordinates and lattice vectors). By integrating an AR transformer decoder and a conditional diffusion head, UniGenX achieves the sequence flexibility typical of LLMs while simultaneously meeting the stringent precision requirements demanded by scientific tasks (Zhang et al., 9 Mar 2025).

1. Motivations and Problem Setting

Scientific data generation demands the simultaneous handling of two central requirements:

  • Extreme Numerical Precision: Continuous-valued quantities (e.g., atomic positions, molecular conformations, lattice parameters) must be modeled with sub-angstrom accuracy.
  • Flexible Sequence Modeling: The input space is inherently multimodal, comprising both symbolic tokens (e.g., chemical elements, SMILES, natural-language prompts) and numerical tokens.

Traditional AR models, such as GPT and its derivatives, excel in flexible and long-context sequence modeling but typically underperform on high-precision continuous value generation. Conversely, diffusion models offer high accuracy on complex continuous outputs but lack mechanisms to efficiently handle discrete sequences and contextual conditioning on symbolic tokens. UniGenX bridges these gaps by unifying a transformer-based AR backbone for sequence modeling with a lightweight conditional diffusion process, thus enabling joint modeling of words and numbers with scientific-grade precision.

2. Architectural Principles

UniGenX treats all data modalities as a single sequence of tokens, where each token $x_t$ is either symbolic (words, atom symbols, special markers like <bos>, <eos>) or numerical (e.g., positions in $\mathbb{R}^3$).

  • Input Representation: Symbolic tokens are mapped via learned embeddings $e_w(x_t) \in \mathbb{R}^{d_{\text{model}}}$; numerical tokens are mapped through a linear projection $e_v(x_t) = W_v x_t$.
  • AR Backbone: A causal transformer decoder processes the token embeddings and produces a hidden state $h_t = \text{TransformerDecoder}(e(x_{1 \ldots t-1}))$ for each position.
  • Parallel Output Heads:
    • Discrete Head: For symbolic positions, a linear projection and softmax readout produce a categorical distribution over the vocabulary: $p(x_t \mid x_{<t}) = \text{softmax}(W_o h_t + b_o)$.
    • Diffusion Head: For numerical tokens, a conditional diffusion network (an MLP with AdaLayerNorm and residual blocks) predicts denoising targets at each step of the diffusion process, conditioned on the AR context $h_t$.

A summary of data flow is presented in the table below:

| Component | Input | Output/Operation |
| --- | --- | --- |
| Embedding Layer | $x_t$ (symbolic) | $e_w(x_t)$ (lookup) |
| Embedding Layer | $x_t$ (numerical) | $e_v(x_t) = W_v x_t$ (linear projection) |
| Transformer Decoder | $e(x_{1 \ldots t-1})$ | $h_t$ |
| Discrete Output Head | $h_t$ | Logits for softmax ($x_t$ discrete) |
| Diffusion Output Head | $h_t$, noisy $x_t^n$ | Predicted noise / denoised value ($x_t$ continuous) |

This unified token sequence framework supports the blending of symbolic and continuous modalities, enabling direct handling of mixed scientific data streams (Zhang et al., 9 Mar 2025).
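
To make the shared-backbone, dual-head design concrete, the following minimal PyTorch-style sketch wires a token embedding table and a linear value projection into a causal transformer whose hidden states feed the discrete softmax head (the diffusion head is sketched in Section 3). Module names, layer sizes, and the three-dimensional value width are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class UnifiedBackbone(nn.Module):
    """Decoder-only transformer over mixed symbolic/numeric tokens (illustrative sketch)."""

    def __init__(self, vocab_size, d_model=512, n_layers=8, n_heads=8, num_dim=3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)       # e_w: lookup for symbolic tokens
        self.num_proj = nn.Linear(num_dim, d_model)              # e_v(x) = W_v x for numeric tokens
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)   # causal mask below makes it a decoder
        self.discrete_head = nn.Linear(d_model, vocab_size)      # logits for the softmax readout

    def forward(self, token_ids, values, is_numeric):
        # token_ids: (B, T) long, with a placeholder id at numeric positions
        # values:    (B, T, num_dim) float, zeros at symbolic positions
        # is_numeric:(B, T) bool mask selecting numeric positions
        e = torch.where(is_numeric.unsqueeze(-1),
                        self.num_proj(values),
                        self.word_emb(token_ids))
        T = e.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=e.device), diagonal=1)
        h = self.backbone(e, mask=causal)   # h[:, t] summarizes x_{1..t}; targets are shifted later
        logits = self.discrete_head(h)      # categorical distribution for discrete positions
        return h, logits
```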

3. Mathematical Formulation and Conditioning

Autoregressive Prediction

Let $x = (x_1, \ldots, x_T)$ denote the mixed token sequence. The AR probability factorizes as

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{<t}).$$

For discrete tokens, $p(x_t \mid x_{<t})$ is parameterized via the transformer decoder and softmax as above.
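
As an illustration, consider a hypothetical crystal serialized as the mixed sequence $x = (\texttt{<bos>}, \text{Na}, \text{Cl}, \ell, r_1, r_2, \texttt{<eos>})$, where the element symbols are symbolic tokens and the lattice descriptor $\ell$ and atomic coordinates $r_1, r_2$ are continuous; this serialization is assumed for illustration rather than taken from the paper. Grouping the factors by head (each conditional is still on the full prefix $x_{<t}$) gives

$$p(x) = \underbrace{p(\text{Na} \mid \cdot)\, p(\text{Cl} \mid \cdot)\, p(\texttt{<eos>} \mid \cdot)}_{\text{softmax head}} \;\times\; \underbrace{p(\ell \mid \cdot)\, p(r_1 \mid \cdot)\, p(r_2 \mid \cdot)}_{\text{diffusion head}}.$$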

Conditional Diffusion for Numerical Tokens

For a continuous token $x_t \in \mathbb{R}^k$, the denoising diffusion probabilistic model (DDPM) formulation is as follows, with $n$ indexing diffusion steps:

  • Forward process: $q(x_t^n \mid x_t) = \mathcal{N}(x_t^n; \alpha_n x_t, \sigma_n^2 I)$, i.e., $x_t^n = \alpha_n x_t + \sigma_n \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
  • The diffusion head $\epsilon_\theta(x_t^n, n; h_t)$ predicts the noise component, conditioned on the AR context $h_t$.
  • Inference uses the reverse diffusion step:

$$x_t^{n-1} = \frac{1}{\alpha_n}\left(x_t^n - \sigma_n \hat{\epsilon}\right) + \text{noise}.$$
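
The diffusion head described above (an MLP conditioned on $h_t$ through AdaLayerNorm) can be approximated by the hedged sketch below; the hidden width, step embedding, and scale/shift conditioning are assumptions standing in for the paper's exact AdaLayerNorm-plus-residual-block design.

```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Noise-prediction MLP conditioned on the AR context h_t (illustrative sketch)."""

    def __init__(self, num_dim=3, d_model=512, d_hidden=256, n_steps=1000):
        super().__init__()
        self.step_emb = nn.Embedding(n_steps, d_hidden)        # embedding of the diffusion step n
        self.in_proj = nn.Linear(num_dim, d_hidden)             # embed the noisy value x_t^n
        self.cond_proj = nn.Linear(d_model, 2 * d_hidden)       # AdaLN-style scale/shift from h_t
        self.norm = nn.LayerNorm(d_hidden, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.SiLU(),
                                 nn.Linear(d_hidden, num_dim))

    def forward(self, x_noisy, step, h):
        # x_noisy: (N, num_dim), step: (N,) long, h: (N, d_model)
        z = self.in_proj(x_noisy) + self.step_emb(step)
        scale, shift = self.cond_proj(h).chunk(2, dim=-1)
        z = self.norm(z) * (1 + scale) + shift                   # condition on the AR context
        return self.mlp(z)                                        # predicted noise epsilon_hat
```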

Joint Objective

Joint training involves discrete cross-entropy loss and diffusion mean-squared error:

$$L = L_{\text{AR}} + \lambda\, L_{\text{diff}},$$

where

$$L_{\text{AR}} = -\sum_{t:\, x_t \text{ discrete}} \log p(x_t \mid x_{<t}), \qquad L_{\text{diff}} = \sum_{t:\, x_t \text{ continuous}} \mathbb{E}_{n,\epsilon}\!\left[\left\| \epsilon - \epsilon_\theta(\alpha_n x_t + \sigma_n \epsilon,\, n;\, h_t) \right\|^2\right].$$

  • Forward Conditioning: The AR context vector $h_t$ encodes all prior symbolic and numeric information and serves as the conditioning input to the diffusion head.
  • Backward Feedback: Once a continuous output has been sampled, it is appended to the sequence as an observed value for subsequent AR and diffusion predictions, ensuring tight coupling across token types. A training-step sketch combining the two losses follows below.
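
The sketch below shows one joint training step under these definitions, reusing the hypothetical UnifiedBackbone and DiffusionHead classes from the earlier snippets; the toy noise schedule and the single sampled diffusion step per continuous position are simplifications of the procedure described in Section 4.

```python
import torch
import torch.nn.functional as F

def joint_loss(backbone, diff_head, token_ids, values, is_numeric, lam=1.0, n_steps=1000):
    """One training step of L = L_AR + lambda * L_diff (illustrative sketch)."""
    h, logits = backbone(token_ids, values, is_numeric)

    # Teacher forcing: position t is predicted from the context ending at t-1.
    h_ctx, logits_ctx = h[:, :-1], logits[:, :-1]
    tgt_ids, tgt_vals, tgt_is_num = token_ids[:, 1:], values[:, 1:], is_numeric[:, 1:]

    # L_AR: cross-entropy over discrete target positions only.
    disc = ~tgt_is_num
    l_ar = F.cross_entropy(logits_ctx[disc], tgt_ids[disc])

    # L_diff: noise-prediction MSE over continuous target positions; Algorithm 1
    # samples M diffusion steps per position, a single step keeps this sketch short.
    x0 = tgt_vals[tgt_is_num]                            # (N, num_dim) clean values
    ctx = h_ctx[tgt_is_num]                              # (N, d_model) AR conditioning
    step = torch.randint(0, n_steps, (x0.size(0),), device=x0.device)
    t = (step.float() + 1) / (n_steps + 1)
    alpha = torch.cos(0.5 * torch.pi * t).unsqueeze(-1)  # toy noise schedule (assumption)
    sigma = torch.sin(0.5 * torch.pi * t).unsqueeze(-1)
    eps = torch.randn_like(x0)
    eps_hat = diff_head(alpha * x0 + sigma * eps, step, ctx)
    l_diff = F.mse_loss(eps_hat, eps)

    return l_ar + lam * l_diff
```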

4. Training and Inference Procedures

Training and sampling are conducted as follows:

  • Training (Algorithm 1):
  1. Embed the entire sequence of symbolic and numeric tokens.
  2. Pass through the causal transformer to obtain hidden states.
  3. Compute cross-entropy loss for all discrete positions.
  4. For each continuous position, sample $M$ diffusion steps, predict the noise, and accumulate the losses.
  5. Sum the per-type losses: $L = L_{\text{AR}} + \lambda L_{\text{diff}}$.
  • Inference (Algorithm 2):
    • Run the AR decoder to produce the context $h$ for the next token.
    • If the next token is discrete, sample from the softmax; if continuous, run the reverse diffusion chain conditioned on $h$.
    • Append each output to the growing sequence and continue until <eos>.

This autoregressive-diffusion alternation achieves effective generation across mixed modalities without the need for separate models.
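
The alternation can be sketched as the generation loop below. It assumes a hypothetical <num> marker token emitted by the discrete head to signal that a continuous value follows, reuses the toy noise schedule from the training sketch, and implements the schematic reverse step from Section 3; none of these details are taken from the released code.

```python
import math
import torch

@torch.no_grad()
def generate(backbone, diff_head, ids, vals, is_num, eos_id, num_marker_id,
             n_steps=50, max_len=256):
    """Alternating AR/diffusion sampling loop (illustrative sketch, batch size 1)."""
    # ids: (1, T) long; vals: (1, T, num_dim) float; is_num: (1, T) bool.
    # num_marker_id is a hypothetical "<num>" token signalling a continuous value.
    while ids.size(1) < max_len:
        h, logits = backbone(ids, vals, is_num)
        h_last = h[:, -1]                                        # AR context for the next token
        next_id = logits[:, -1].softmax(-1).multinomial(1)       # sample the discrete head

        if next_id.item() == num_marker_id:
            # Continuous token: run the reverse diffusion chain conditioned on h_last.
            x = torch.randn(1, vals.size(-1), device=ids.device)
            for n in reversed(range(n_steps)):
                t = (n + 1) / (n_steps + 1)
                alpha, sigma = math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)
                step = torch.tensor([n], device=ids.device)
                eps_hat = diff_head(x, step, h_last)
                x = (x - sigma * eps_hat) / alpha                # deterministic part of the update
                if n > 0:                                        # schematic "+ noise" at the new level
                    x = x + math.sin(0.5 * math.pi * n / (n_steps + 1)) * torch.randn_like(x)
            next_val = x.unsqueeze(1)
            next_is_num = torch.ones(1, 1, dtype=torch.bool, device=ids.device)
        else:
            next_val = torch.zeros(1, 1, vals.size(-1), device=ids.device)
            next_is_num = torch.zeros(1, 1, dtype=torch.bool, device=ids.device)

        # Backward feedback: append the output and continue until <eos>.
        ids = torch.cat([ids, next_id], dim=1)
        vals = torch.cat([vals, next_val], dim=1)
        is_num = torch.cat([is_num, next_is_num], dim=1)
        if next_id.item() == eos_id:
            break
    return ids, vals, is_num
```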

5. Empirical Evaluation and Performance

Material Crystal Structure Prediction

Benchmark datasets: MP-20 ($\leq 20$ atoms), Carbon-24, MPTS-52 ($\leq 52$ atoms). In the fine-tuned setting with 400M parameters, UniGenX demonstrates:

  • MP-20: match rate ↑ 10 percentage points versus FlowMM, RMSD ↓ 35%.
  • Carbon-24: match rate ↑ 28 pp, RMSD ↓ 45%.
  • MPTS-52: match rate ↑ 120 pp, RMSD ↓ 62%.

Performance scales with sequence length, indicating strong long-range context modeling even though the architecture is not explicitly equivariant (symmetry is handled through data augmentation instead).

De Novo Material Generation

For 10,000 unconditional samples, metrics include validity, recall/precision coverage, and Wasserstein distance (Wdist) for density and electron count. UniGenX achieves Wdist values of 0.065 and 0.04 within 200 sampling steps, respective improvements of 73% and 52% over the prior state of the art.

Small-Molecule Conformation Generation

On the GEOM-QM9 and GEOM-Drugs datasets, metrics are Coverage (COV) and Matching (MAT) at thresholds of 0.5 Å (QM9) and 1.25 Å (Drugs). UniGenX matches or surpasses DMCG on median metrics and produces lower MAE in ensemble energy prediction (0.146 eV versus 0.432 eV for DMCG). Generated ensembles closely match empirical distributions of ground-truth energy and HOMO/LUMO gaps.

Conditional Molecule Generation

On QM9, the property-conditioned regime uses both LDM and EDM sampling. UniGenX (100M parameters) is trained in all-in-one mode (no per-property retraining) and achieves new state-of-the-art MAEs across five quantum properties, with best-in-class EDM results on four out of six targets and a 53.6% reduction in dipole moment MAE.

Unified Training and Cross-Domain Generalization

Pretraining a single 100M parameter UniGenX model on mixed material (5M) and molecule (1M) data, then finetuning on specific benchmarks, yields superior or competitive results to domain-specialized state-of-the-art methods, demonstrating cross-domain generalization across materials and small molecules.

6. Strengths, Limitations, and Extensions

Strengths:

  • High numerical accuracy for continuous scientific data via conditional diffusion head.
  • Native support for sequence-level multimodal generation (words and numbers in a single AR framework).
  • Strong context modeling on long and complex sequences without the need for explicitly equivariant architectures.
  • Unified training for diverse data, supporting extensibility to other domains (proteins, DNA, energies, forces, natural-language instructions).

Limitations:

  • Diffusion head introduces additional model complexity and increased sampling steps compared to pure AR architectures.
  • No explicit equivariance/invariance; performance instead relies on data augmentation, with potential underperformance on symmetry-intensive tasks.
  • Diffusion scheduler hyperparameters (e.g., EDM mean/std) require careful tuning.

Potential Extensions:

  • Incorporation of SE(3) equivariance or symmetry-aware attention mechanisms.
  • Expansion to more complex scientific domains, including protein backbone and side-chain generation, RNA folding, and multi-protein assemblies.
  • Replacement of DDPM with faster, flow-matching objectives to accelerate sampling.
  • Integration with generalist LLMs to enable richer natural-language-to-structure generation.

7. Context and Impact

UniGenX introduces a hybrid generative approach that leverages both autoregressive and diffusion modeling to address the unique constraints of scientific data generation. It advances the state of the art in mixed-modality scientific generative modeling, with validated gains on crystal structure prediction, de novo material generation, small-molecule conformation generation, and conditional molecule generation. Its versatility and unified architecture suggest broad applicability across scientific data types and modalities (Zhang et al., 9 Mar 2025).

References

  • Zhang et al., "UniGenX: Unified Scientific Data Generation," 9 Mar 2025.