OMatG-IRL: RL for Material Generation
- The paper introduces a surrogate stochastic policy that enables RL on pretrained velocity-only flow models without explicit score computation.
- It leverages group-relative policy optimization with PPO-style clipping and KL regularization to guide energy minimization in crystal structure prediction.
- Empirical results demonstrate a reduction in energy per atom and a ≥10× sampling efficiency gain while maintaining competitive structure match rates.
Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL) is a reinforcement learning (RL) framework that enables the direct steering of continuous-time generative models for crystalline materials toward explicit downstream objectives, such as minimizing energy per atom, at inference time. It achieves this by introducing a stochastic policy gradient approach that operates on pretrained velocity fields, thereby circumventing the need for explicit score computation—an obstacle previously limiting RL methods in this class of generative models. OMatG-IRL allows objective-driven sampling from generative models using only velocity predictions, leverages surrogate stochastic dynamics, and reinforces policies using group-relative policy optimization and KL regularization, while maintaining data-distributional properties and sampling efficiency competitive with score-dependent approaches (Hoellmer et al., 31 Jan 2026).
1. Foundations: Continuous-Time Generative Models for Crystals
Crystalline materials are represented as tuples $(\mathbf{a}, \mathbf{f}, \mathbf{L})$, where:
- $\mathbf{a} \in \mathbb{Z}^{N}$ are atomic numbers (fixed per inference for crystal structure prediction, CSP),
- $\mathbf{f} \in [0, 1)^{N \times 3}$ are fractional atomic positions,
- $\mathbf{L} \in \mathbb{R}^{3 \times 3}$ is the lattice matrix.
The generative process constructs $(\mathbf{f}, \mathbf{L})$ conditioned on $\mathbf{a}$. Training employs the stochastic interpolant (SI) formalism, which specifies a stochastic path

$$x_t = \alpha_t x_0 + \beta_t x_1 + \gamma_t z, \qquad z \sim \mathcal{N}(0, I),$$

where $x_0$ is sampled from a simple base distribution, $x_1$ is sampled from the data, and the time-dependent coefficients $\alpha_t, \beta_t, \gamma_t$ interpolate between the two. A neural network $b_\theta(x_t, t)$ is trained to predict the conditional velocity $\partial_t x_t = \dot{\alpha}_t x_0 + \dot{\beta}_t x_1 + \dot{\gamma}_t z$ by minimizing the mean squared error

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1,\, z}\!\left[\, \big\| b_\theta(x_t, t) - \partial_t x_t \big\|^2 \,\right].$$
At inference, the learned drift field defines the ODE

$$\frac{\mathrm{d}x_t}{\mathrm{d}t} = b_\theta(x_t, t), \qquad t \in [0, 1],$$

with $x_0$ drawn from the base distribution. In score-based models, an auxiliary denoiser $\eta_\theta(x_t, t)$ reconstructs the latent noise $z$, yielding an SDE with time-dependent noise and score terms (the score follows as $s_\theta = -\eta_\theta / \gamma_t$).
These models excel at learning stability via data-matching but cannot natively incorporate explicit objectives—such as energy minimization—into generation.
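As a concrete illustration, the interpolation and sampling steps above can be sketched in a few lines of numpy. This is a toy 1-D example: the coefficient schedules and the stand-in drift `b_toy` are illustrative assumptions, not the paper's choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative SI coefficient schedules (assumptions, not the paper's choices):
# alpha_t decays 1 -> 0, beta_t grows 0 -> 1, gamma_t vanishes at both endpoints.
alpha = lambda t: 1.0 - t
beta = lambda t: t
gamma = lambda t: 0.1 * np.sqrt(t * (1.0 - t))

def interpolant(x0, x1, z, t):
    """Stochastic interpolant x_t = alpha_t*x0 + beta_t*x1 + gamma_t*z."""
    return alpha(t) * x0 + beta(t) * x1 + gamma(t) * z

def euler_sample(b_theta, x0, n_steps=50):
    """Integrate the inference ODE dx/dt = b_theta(x, t) with explicit Euler."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + b_theta(x, k * dt) * dt
    return x

# Toy drift standing in for a trained velocity network: its flow contracts
# base samples toward "data" concentrated at 2.0.
b_toy = lambda x, t: 2.0 - x

x0 = rng.standard_normal(1000)                     # base distribution
x_mid = interpolant(x0, np.full(1000, 2.0), rng.standard_normal(1000), 0.5)
x1 = euler_sample(b_toy, x0)                       # generated samples
print(float(x_mid.mean()), float(x1.mean()))       # samples drift from 0 toward 2.0
```

In the real model, `b_toy` is replaced by the trained network acting jointly on fractional coordinates and the lattice.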
2. Obstacles to Direct Policy-Gradient RL in Velocity-Only Flow Models
Traditional policy-gradient RL interprets the generative sampler as an MDP, with transitions parameterized by a policy $\pi_\theta(x_{t+\Delta t} \mid x_t)$ and a terminal reward $r(x_1)$. Policy-gradient estimators rely on the likelihood-ratio trick:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ r(x_1) \sum_{t} \nabla_\theta \log \pi_\theta(x_{t+\Delta t} \mid x_t) \right],$$

where $r(x_1)$ encodes the terminal reward.
In score-based diffusion models, the transition kernel is Gaussian, with its mean determined by the drift $b_\theta$ and the score $s_\theta$. Obtaining log-likelihoods or their gradients thus necessitates access to the score. In contrast, velocity-based (flow-matching) models admit only deterministic ODE trajectories, resulting in degenerate (delta-function) transition densities and rendering the standard policy-gradient estimator ill-defined. Injecting naive stochasticity breaks the marginal distributions. Therefore, standard policy-gradient RL is not directly applicable to flow-matching or SI-based generative models that lack an explicit score.
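The degeneracy can be made concrete with a toy 1-D numpy sketch (all numbers illustrative): a Gaussian transition kernel has a finite, differentiable log-density, but as its variance shrinks toward the deterministic ODE limit the density collapses to a delta function, leaving the likelihood-ratio estimator nothing to work with.

```python
import numpy as np

def gaussian_logpdf(x_next, mean, sigma):
    """Log-density of a Gaussian transition kernel N(mean, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x_next - mean) ** 2 / (2 * sigma**2)

x_t, dt = 0.5, 0.02
drift = -x_t                      # toy velocity prediction b_theta(x_t, t)
mean = x_t + drift * dt           # deterministic Euler step

# Score-based sampler: finite sigma -> finite, usable log-probability.
lp_stochastic = gaussian_logpdf(mean + 0.01, mean, sigma=0.1)

# Velocity-only ODE limit: as sigma -> 0 the log-density at the mean blows up
# (and is -inf off the mean); a delta function admits no policy gradient.
lps = [gaussian_logpdf(mean, mean, sigma) for sigma in (1e-1, 1e-4, 1e-8)]
print(lp_stochastic, lps)
```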
3. OMatG-IRL: RL on Velocity Fields via Surrogate Stochastic Policies
OMatG-IRL addresses these challenges by constructing a surrogate stochastic policy around a pretrained velocity-only flow model. The core approach:
- Introduces additive Gaussian noise with a fixed schedule $\sigma_t$ to the velocity-based transitions, yielding the Euler–Maruyama step

$$x_{t+\Delta t} = x_t + b_\theta(x_t, t)\, \Delta t + \sigma_t \sqrt{\Delta t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

- Defines a reference stochastic policy $\pi_{\mathrm{ref}}(x_{t+\Delta t} \mid x_t)$ as a Gaussian centered at the deterministic step with variance proportional to $\sigma_t^2 \Delta t$.
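A minimal numpy sketch of this construction (toy 1-D state, illustrative noise schedule; function names are assumptions, not the released API): each noisy Euler step is a Gaussian transition whose log-density, unlike the deterministic ODE step, is available in closed form for policy-gradient estimation.

```python
import numpy as np

rng = np.random.default_rng(1)

def surrogate_step(x, t, dt, b_theta, sigma):
    """One Euler-Maruyama step: x' = x + b(x,t)*dt + sigma*sqrt(dt)*eps."""
    mean = x + b_theta(x, t) * dt
    return mean + sigma * np.sqrt(dt) * rng.standard_normal(np.shape(x)), mean

def transition_logprob(x_next, mean, sigma, dt):
    """Closed-form Gaussian log-density of the surrogate transition."""
    var = sigma**2 * dt
    return -0.5 * np.log(2 * np.pi * var) - (x_next - mean) ** 2 / (2 * var)

b_toy = lambda x, t: -x           # stand-in for a pretrained velocity network
sigma, dt = 0.05, 0.02            # small, fixed reference noise schedule

# Roll out one stochastic trajectory and accumulate its log-likelihood,
# a quantity the noise-free ODE sampler cannot provide.
x, logp = 1.0, 0.0
for k in range(50):
    x, mean = surrogate_step(x, k * dt, dt, b_toy, sigma)
    logp += transition_logprob(x, mean, sigma, dt)
print(float(x), float(logp))
```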
The RL-improved policy $\pi_\theta$ is initialized from $\pi_{\mathrm{ref}}$ and optionally learns its own noise schedule. In the CSP setting, reward functions are defined as the negative energy per atom from a surrogate model; invalid structures incur strongly negative rewards. Policy improvement uses group-relative policy optimization (GRPO): trajectories are grouped by composition, and a group-relative advantage is computed,

$$A_i = \frac{r_i - \operatorname{mean}\!\big(\{r_j\}_{j=1}^{G}\big)}{\operatorname{std}\!\big(\{r_j\}_{j=1}^{G}\big)}.$$
This is used in a PPO-style clipped RL objective, with an added KL regularizer keeping $\pi_\theta$ close to $\pi_{\mathrm{ref}}$, resulting in the total objective

$$\mathcal{J}(\theta) = \mathbb{E}\!\left[ \min\!\big( \rho_t A_i,\; \operatorname{clip}(\rho_t,\, 1-\epsilon,\, 1+\epsilon)\, A_i \big) \right] - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right), \qquad \rho_t = \frac{\pi_\theta(x_{t+\Delta t} \mid x_t)}{\pi_{\mathrm{ref}}(x_{t+\Delta t} \mid x_t)}.$$
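A numpy sketch of the group-relative advantage and the clipped objective (toy numbers; the per-sample KL estimator below is an illustrative simplification, not necessarily the paper's exact estimator):

```python
import numpy as np

def grpo_advantages(rewards):
    """Standardize rewards within one composition group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def ppo_kl_objective(logp_new, logp_ref, adv, clip_eps=0.2, kl_weight=0.01):
    """PPO-style clipped surrogate minus a KL penalty toward the reference policy.

    The KL term uses the simple per-sample estimator logp_new - logp_ref,
    an illustrative choice.
    """
    ratio = np.exp(logp_new - logp_ref)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * adv, clipped * adv)
    return float(np.mean(surrogate - kl_weight * (logp_new - logp_ref)))

# Toy group of four trajectories for one composition: rewards are
# negative surrogate energies per atom.
rewards = [-1.2, -0.8, -1.0, -0.6]
adv = grpo_advantages(rewards)
obj = ppo_kl_objective(np.array([0.1, 0.0, -0.1, 0.2]), np.zeros(4), adv)
print(adv, obj)
```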
A notable property, justified by Girsanov's theorem, is that a small noise scale $\sigma_t$ induces only small changes to the marginal distributions, preserving the original generative performance while providing the stochasticity required for RL-based optimization.
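The effect can be checked numerically on a toy linear flow (an illustrative sketch, not the paper's experiment): terminal statistics of the noisy sampler converge to those of the deterministic ODE as the noise scale shrinks.

```python
import numpy as np

rng = np.random.default_rng(2)

def integrate(x0, sigma, n_steps=100):
    """Euler(-Maruyama) integration of dx = -x dt + sigma dW; sigma=0 is the ODE."""
    x, dt = x0.copy(), 1.0 / n_steps
    for _ in range(n_steps):
        x = x - x * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

x0 = rng.standard_normal(20000)
ode = integrate(x0, sigma=0.0)

# Gap between noisy and noise-free terminal standard deviations shrinks
# as sigma -> 0: small noise barely perturbs the marginals.
gaps = {sigma: abs(float(integrate(x0, sigma).std() - ode.std()))
        for sigma in (0.5, 0.05)}
print(gaps)
```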
4. Algorithmic Structure and Implementation
OMatG-IRL operates as an inference-time RL procedure applied to a pretrained generative model. The primary algorithm comprises:
- Initializing with pretrained weights, a small reference noise scale, and RL-specific hyperparameters (learning rate, group size, PPO clip, KL weight).
- Sampling compositions and generating trajectory batches per composition under the current stochastic policy.
- Computing energy-based rewards, group-relative advantages, and storing all transitions.
- Optimizing the PPO-style policy objective and KL regularization using standard stochastic gradient descent (e.g., Adam).
- Optionally learning time-dependent velocity annealing schedules via a residual neural net, with annealing-relevant rewards (e.g., based on cRMSE instead of energy).
This approach allows OMatG-IRL to reinforce the generative policy with respect to black-box objectives (e.g., predicted energy) and to learn nontrivial time-dependent schedules that outperform handcrafted annealing.
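The loop above can be sketched end to end with toy stand-ins (numpy, 1-D state; every function and constant here is an illustrative stub, the clipping and KL terms are omitted for brevity, and the learned quantity is a single residual-drift scalar rather than a network):

```python
import numpy as np

rng = np.random.default_rng(3)

G, N_T, DT, SIGMA, LR = 8, 20, 0.05, 0.1, 0.005  # group size, steps, schedules

def b_theta(x, t, w):
    """Toy 'velocity network': pretrained drift -x plus a learnable residual w."""
    return -x + w

def rollout(w):
    """One stochastic trajectory under the surrogate policy."""
    x, xs, means = 1.0, [], []
    for k in range(N_T):
        mean = x + b_theta(x, k * DT, w) * DT
        x = mean + SIGMA * np.sqrt(DT) * rng.standard_normal()
        xs.append(x); means.append(mean)
    return np.array(xs), np.array(means)

surrogate_energy = lambda x: (x - 0.3) ** 2   # toy objective: minimum at x = 0.3

w = 0.0
for _ in range(200):
    outs = [rollout(w) for _ in range(G)]
    rewards = np.array([-surrogate_energy(xs[-1]) for xs, _ in outs])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # group-relative
    # REINFORCE ascent: d log pi / d w = sum_k (x_k - mean_k) * DT / var,
    # since each transition mean shifts linearly in w.
    var = SIGMA**2 * DT
    grad = sum(a * np.sum((xs - means) * DT / var)
               for a, (xs, means) in zip(adv, outs))
    w += LR * grad / G
print(float(w))  # residual drift steers terminal states toward the low-energy region
```

With the contracting toy drift, the optimizer settles on a small negative residual that shifts terminal states toward the reward optimum, mirroring how the real procedure steers generated crystals toward low surrogate energy.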
5. Empirical Results in Crystal Structure Prediction
Experiments were conducted on the MP-20 dataset (∼45,000 inorganic crystals with at most 20 atoms per unit cell, spanning 89 elements). CSP is formulated as generating $(\mathbf{f}, \mathbf{L})$ given a fixed composition $\mathbf{a}$. Diversity arises naturally from conditioning on a broad composition set. The principal evaluation metrics include match rate (fraction matching relaxed reference structures), normalized RMSD, predicted energy per atom, and invalid energy rate.
Baseline models (e.g., OMatG Trig-SDE-Gamma with velocity annealing over 740 integration steps) achieve a ~69% match rate; omitting annealing drops this to ~60%. OMatG-IRL is evaluated in both score-based (using the denoiser $\eta_\theta$) and velocity-based (using only $b_\theta$ with surrogate noise) variants. Both approaches reduce mean energy per atom by roughly 0.5 eV, preserve or slightly improve RMSD, and maintain ~60% match rates with only 50 integration steps, yielding an order-of-magnitude improvement in sampling efficiency (≥10× faster sampling).
In further experiments, OMatG-IRL learns residual time-dependent annealing schedules that match the cRMSE accuracy of heavily annealed models while using a fraction of the integration steps (50 versus 740), and that generalize to even lower step counts with minimal degradation. Handcrafted annealing fails in this low-step regime.
| Model/Setting | Match Rate | Mean Energy Δ (eV/atom) | N_t (Steps) | Sampling Speedup |
|---|---|---|---|---|
| OMatG Trig-SDE-Gamma | ~69% | Baseline | 740 | 1× |
| OMatG No Annealing | ~60% | Baseline | 50 | 15× |
| OMatG-IRL (Score/Velocity) | ~60% | –0.5 eV (improved) | 50 | 15× |
6. Limitations and Scope of Method
OMatG-IRL is an inference-time framework and avoids retraining the underlying generative model; however, it requires conducting many RL rollouts per composition in order to estimate group-relative advantages, imposing additional computational overhead. In CSP, diversity emerges from composition conditioning. For de-novo materials generation (DNG), handling discrete atom-type flows and enforcing diversity requires further methodological extensions, as simple conditioning may not suffice.
The current approach learns residual velocities (or annealing schedules), but it is plausible to combine this with joint learning of the noise schedule, or to transition to off-policy RL to maximize sample reuse. The surrogate policy perturbs only the velocity, but extensions to full SI models that also use the denoiser $\eta_\theta$ are straightforward; the score-based variant already implements this. Integration with end-to-end pipelines, including composition generation and constraint satisfaction, would be required for multi-objective inverse materials design.
7. Technical Contributions and Future Directions
The key technical contributions of OMatG-IRL are:
- Introduction of a surrogate stochastic policy (via SDE perturbation) for pretrained velocity-only models that preserves marginal distributions up to small noise-dependent corrections while enabling RL-based exploration.
- Deployment of group-relative policy optimization (GRPO), PPO-style clipping, and KL regularization to enforce energy-based objectives during inference.
- Extension of RL to velocity-only models—without explicit score computations—addressing an open gap in flow-matching and SI generative modeling for scientific applications.
- Automated learning of time-dependent velocity-annealing schedules, leading to substantial (≥10×) sampling-efficiency gains, with matching or improved CSP accuracy.
A plausible implication is that OMatG-IRL can serve as a blueprint for RL-driven, property-optimized generative modeling in broader scientific domains, provided similar data-conditional flows are available. The methodology is positioned for expansion to multi-property, constraint-aware, and de-novo materials design scenarios (Hoellmer et al., 31 Jan 2026).