UniGenX: Unified Scientific Data Generation
- UniGenX is a unified framework that combines autoregressive sequence modeling with conditional diffusion to generate symbolic and continuous scientific data with high accuracy.
- It integrates an AR transformer decoder and a diffusion head, enabling joint modeling of discrete tokens like chemical formulas and continuous values such as atomic coordinates.
- Empirical evaluations show significant improvements in material crystal prediction, small-molecule conformation, and conditional molecule generation benchmarks.
UniGenX is a unified framework for scientific data generation that combines autoregressive (AR) next-token prediction with conditional diffusion-based generative modeling. This architecture addresses the longstanding challenge of generating scientific data containing both symbolic sequences (such as chemical formulas, SMILES strings, and special markers) and high-precision continuous quantities (such as atomic coordinates and lattice vectors). By integrating an AR transformer decoder and a conditional diffusion head, UniGenX achieves the sequence flexibility typical of LLMs while simultaneously meeting the stringent precision requirements demanded by scientific tasks (Zhang et al., 9 Mar 2025).
1. Motivations and Problem Setting
Scientific data generation demands the simultaneous handling of two central requirements:
- Extreme Numerical Precision: Continuous-valued quantities (e.g., atomic positions, molecular conformations, lattice parameters) must be modeled with sub-angstrom accuracy.
- Flexible Sequence Modeling: The input space is inherently multimodal, comprising both symbolic tokens (e.g., chemical elements, SMILES, natural-language prompts) and numerical tokens.
Traditional AR models, such as GPT and its derivatives, excel in flexible and long-context sequence modeling but typically underperform on high-precision continuous value generation. Conversely, diffusion models offer high accuracy on complex continuous outputs but lack mechanisms to efficiently handle discrete sequences and contextual conditioning on symbolic tokens. UniGenX bridges these gaps by unifying a transformer-based AR backbone for sequence modeling with a lightweight conditional diffusion process, thus enabling joint modeling of words and numbers with scientific-grade precision.
2. Architectural Principles
UniGenX treats all data modalities as a single sequence of tokens, where each token is either symbolic (words, atom symbols, special markers like <bos>, <eos>) or numerical (e.g., positions).
- Input Representation: Symbolic tokens are mapped via a learned embedding table $E$; numerical tokens are mapped through a learned linear projection $W_{\text{num}}$.
- AR Backbone: A causal transformer decoder processes the token embeddings and produces a hidden state $h_t$ for each position.
- Parallel Output Heads:
- Discrete Head: For symbolic positions, a linear projection and softmax readout produce a categorical distribution over the vocabulary: $p(x_{t+1} \mid x_{\le t}) = \mathrm{softmax}(W_{\text{out}} h_t)$.
- Diffusion Head: For numerical tokens, a conditional diffusion network (an MLP with AdaLayerNorm and residual blocks) predicts denoising targets at each step of the diffusion process, conditioned on the AR context $h_t$.
A summary of data flow is presented in the table below:
| Component | Input | Output/Operation |
|---|---|---|
| Embedding Layer | symbolic token $x_t$ | $e_t = E[x_t]$ (lookup) |
| Embedding Layer | numerical token $x_t$ | $e_t = W_{\text{num}} x_t$ (linear) |
| Transformer Decoder | embeddings $e_{\le t}$ | hidden state $h_t$ |
| Discrete Output Head | $h_t$ | logits for softmax ($x_{t+1}$ discrete) |
| Diffusion Output Head | $h_t$, noisy $x_{t+1}^{(k)}$ | predicted noise / denoised value ($x_{t+1}$ continuous) |
This unified token sequence framework supports the blending of symbolic and continuous modalities, enabling direct handling of mixed scientific data streams (Zhang et al., 9 Mar 2025).
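A minimal PyTorch-style sketch of this data flow is given below. The module choices (an `nn.TransformerEncoder` with a causal mask standing in for the AR decoder, and a small MLP standing in for the AdaLayerNorm diffusion head) and all dimensions are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class UniGenXSketch(nn.Module):
    """Sketch of the unified backbone: shared causal transformer with a discrete head and a diffusion head."""

    def __init__(self, vocab_size=512, num_dim=3, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)      # symbolic tokens: embedding lookup
        self.num_embed = nn.Linear(num_dim, d_model)            # numerical tokens: linear projection
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # made causal via the attention mask below
        self.discrete_head = nn.Linear(d_model, vocab_size)     # logits for the softmax readout
        # Stand-in for the conditional diffusion head (AdaLayerNorm and residual blocks omitted).
        self.diffusion_head = nn.Sequential(
            nn.Linear(d_model + num_dim + 1, d_model), nn.SiLU(), nn.Linear(d_model, num_dim)
        )

    def forward(self, sym_ids, num_vals, is_numeric):
        """sym_ids: (B, T) int ids (placeholder id at numeric positions);
        num_vals: (B, T, num_dim) floats; is_numeric: (B, T) bool mask."""
        emb = torch.where(is_numeric.unsqueeze(-1),
                          self.num_embed(num_vals),
                          self.tok_embed(sym_ids))
        T = emb.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(emb, mask=causal_mask)                # hidden states h_t
        return h, self.discrete_head(h)

    def denoise(self, x_k, k, h):
        """Predict the noise in x_k at diffusion step k, conditioned on the AR context h."""
        k_feat = k.float().unsqueeze(-1) / 1000.0               # crude step embedding (assumption)
        return self.diffusion_head(torch.cat([h, x_k, k_feat], dim=-1))
```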
3. Mathematical Formulation and Conditioning
Autoregressive Prediction
Let $x = (x_1, \dots, x_T)$ denote the mixed token sequence. The AR probability is factorized as

$$p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}).$$
For discrete tokens, $p_\theta(x_t \mid x_{<t})$ is parameterized via the transformer decoder and softmax as above.
Conditional Diffusion for Numerical Tokens
For a numerical token $x_t$, the denoising diffusion probabilistic model (DDPM) formulation is:
- Forward process: $x_t^{(k)} = \sqrt{\bar{\alpha}_k}\, x_t^{(0)} + \sqrt{1-\bar{\alpha}_k}\,\epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_k = \prod_{s \le k}(1 - \beta_s)$.
- The diffusion head, $\epsilon_\phi\big(x_t^{(k)}, k, h_t\big)$, predicts the noise component, conditioned on the AR context $h_t$.
- Inference uses the reverse diffusion step (with $\alpha_k = 1 - \beta_k$):

$$x_t^{(k-1)} = \frac{1}{\sqrt{\alpha_k}}\left(x_t^{(k)} - \frac{1 - \alpha_k}{\sqrt{1 - \bar{\alpha}_k}}\,\epsilon_\phi\big(x_t^{(k)}, k, h_t\big)\right) + \sigma_k z, \qquad z \sim \mathcal{N}(0, I).$$
Joint Objective
Joint training combines a discrete cross-entropy loss with a diffusion mean-squared-error loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{diff}},$$

where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy summed over discrete positions and $\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{k,\epsilon}\big[\lVert \epsilon - \epsilon_\phi(x^{(k)}, k, h) \rVert^2\big]$ is accumulated over continuous positions.
- Forward Conditioning: The AR context vector $h_t$ encodes all prior symbolic and numeric information and is used as the conditioning context for the diffusion head.
- Backward Feedback: After sampling a continuous output, it is appended as an observed value for future AR and diffusion predictions, ensuring tight coupling across token types.
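A sketch of the corresponding per-position diffusion training term is given below, assuming a linear beta schedule and $K = 1000$ steps; `eps_model` could be the `denoise` method sketched earlier, and the tensor shapes are illustrative.

```python
import torch

def diffusion_loss(x0, h, eps_model, K=1000):
    """DDPM noise-prediction loss for one continuous token x0, conditioned on AR context h.
    x0: (B, D) clean values; h: (B, H) hidden states; eps_model(x_k, k, h) -> (B, D) predicted noise."""
    betas = torch.linspace(1e-4, 0.02, K)                 # assumed linear noise schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # cumulative product \bar{alpha}_k
    k = torch.randint(0, K, (x0.size(0),))                # random diffusion step per sample
    a_bar = alpha_bar[k].unsqueeze(-1)                    # (B, 1)
    eps = torch.randn_like(x0)                            # eps ~ N(0, I)
    x_k = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps  # forward (noising) process
    return ((eps - eps_model(x_k, k, h)) ** 2).mean()     # MSE term contributing to L_diff
```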
4. Training and Inference Procedures
Training and sampling are conducted as follows:
- Training (Algorithm 1):
- Embed the entire sequence of symbolic and numeric tokens.
- Pass through the causal transformer to obtain hidden states.
- Compute cross-entropy loss for all discrete positions.
- For each continuous position, sample diffusion steps, predict noise, and accumulate losses.
- Sum per-type losses: $\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{diff}}$.
- Inference (Algorithm 2):
- Run the AR decoder to produce the context $h_t$ for the next token.
- If the next token is discrete, sample from softmax; if continuous, run the diffusion reverse chain conditioned on $h_t$.
- Append each output to the growing sequence and continue until <eos> is generated.
This autoregressive-diffusion alternation achieves effective generation across mixed modalities without the need for separate models.
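The alternation can be summarized by the decoding loop below; `next_hidden`, `expects_numeric`, `sample_discrete`, and `reverse_diffusion` are hypothetical helpers standing in for the AR decoder step, the task grammar, softmax sampling, and the conditional reverse chain, respectively.

```python
def generate(model, prompt, max_len=256, eos_id=2):
    """Illustrative mixed-modality decoding: discrete tokens via softmax sampling,
    continuous tokens via the conditional reverse diffusion chain."""
    seq = list(prompt)                             # mixed list of token ids and float vectors
    for _ in range(max_len):
        h_t = next_hidden(model, seq)              # hypothetical: AR hidden state for the next position
        if expects_numeric(seq):                   # hypothetical: does the sequence grammar expect a number?
            x = reverse_diffusion(model, h_t)      # run the reverse chain conditioned on h_t
            seq.append(x)                          # backward feedback: the sample re-enters the context
        else:
            tok = sample_discrete(model, h_t)      # sample from the softmax over the vocabulary
            if tok == eos_id:
                break
            seq.append(tok)
    return seq
```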
5. Empirical Evaluation and Performance
Material Crystal Structure Prediction
Benchmark datasets: MP-20 (up to 20 atoms per unit cell), Carbon-24, and MPTS-52 (up to 52 atoms per unit cell). In fine-tuned settings with 400M parameters, UniGenX demonstrated:
- MP-20: improved match rate over FlowMM and lower RMSD.
- Carbon-24: improved match rate and lower RMSD.
- MPTS-52: improved match rate and lower RMSD.
Performance scales with sequence length, revealing strong long-range context modeling even without architecturally enforced equivariance (data augmentation is used instead).
De Novo Material Generation
For 10,000 unconditional samples, metrics include validity, recall/precision coverage, and Wasserstein distance (Wdist) for density and electron count. UniGenX achieves Wdist values of 0.065 and 0.04 in 200 steps, respective improvements of 73% and 52% over the prior state of the art.
Small-Molecule Conformation Generation
On GEOM-QM9 and GEOM-Drugs datasets, metrics are Coverage (COV) and Matching (MAT) at 0.5 Å (QM9) and 1.25 Å (Drugs). UniGenX matches or surpasses DMCG on median metrics and produces lower MAE in ensemble energy prediction (0.146 eV versus 0.432 eV for DMCG). Generated ensembles closely match empirical distributions of ground-truth energy and HOMO/LUMO gaps.
Conditional Molecule Generation
On QM9, the property-conditioned regime uses both LDM and EDM sampling. UniGenX (100M parameters) is trained in all-in-one mode (no per-property retraining) and achieves new state-of-the-art MAEs across five quantum properties, with best-in-class EDM results on four out of six targets and a 53.6% reduction in dipole moment MAE.
Unified Training and Cross-Domain Generalization
Pretraining a single 100M parameter UniGenX model on mixed material (5M) and molecule (1M) data, then finetuning on specific benchmarks, yields superior or competitive results to domain-specialized state-of-the-art methods, demonstrating cross-domain generalization across materials and small molecules.
6. Strengths, Limitations, and Extensions
Strengths:
- High numerical accuracy for continuous scientific data via the conditional diffusion head.
- Native support for sequence-level multimodal generation (words and numbers in a single AR framework).
- Strong context modeling on long and complex sequences without the need for explicitly equivariant architectures.
- Unified training for diverse data, supporting extensibility to other domains (proteins, DNA, energies, forces, natural-language instructions).
Limitations:
- The diffusion head introduces additional model complexity and requires more sampling steps than pure AR architectures.
- No explicit equivariance/invariance; performance instead relies on data augmentation, with potential underperformance on symmetry-intensive tasks.
- Diffusion scheduler hyperparameters (e.g., EDM mean/std) require careful tuning.
Potential Extensions:
- Incorporation of SE(3) equivariance or symmetry-aware attention mechanisms.
- Expansion to more complex scientific domains, including protein backbone and side-chain generation, RNA folding, and multi-protein assemblies.
- Replacement of DDPM with faster, flow-matching objectives to accelerate sampling.
- Integration with generalist LLMs to enable richer natural-language-to-structure generation.
7. Context and Impact
UniGenX introduces a hybrid generative paradigm that couples autoregressive and diffusion modeling to address the unique constraints of scientific data generation. It advances the state of the art in mixed-modality scientific generative modeling, achieving validated gains on material crystal structure prediction, de novo generation, small-molecule conformation, and conditional molecule generation. Its versatility and unified architecture suggest broad applicability across scientific data types and modalities (Zhang et al., 9 Mar 2025).