
OMatG-IRL: RL for Material Generation

Updated 7 February 2026
  • The paper introduces a surrogate stochastic policy that enables RL on pretrained velocity-only flow models without explicit score computation.
  • It leverages group-relative policy optimization with PPO-style clipping and KL regularization to guide energy minimization in crystal structure prediction.
  • Empirical results demonstrate a reduction in energy per atom and a ≥10× sampling efficiency gain while maintaining competitive structure match rates.

Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL) is a reinforcement learning (RL) framework that enables the direct steering of continuous-time generative models for crystalline materials toward explicit downstream objectives, such as minimizing energy per atom, at inference time. It achieves this by introducing a stochastic policy gradient approach that operates on pretrained velocity fields, thereby circumventing the need for explicit score computation—an obstacle previously limiting RL methods in this class of generative models. OMatG-IRL allows objective-driven sampling from generative models using only velocity predictions, leverages surrogate stochastic dynamics, and reinforces policies using group-relative policy optimization and KL regularization, while maintaining data-distributional properties and sampling efficiency competitive with score-dependent approaches (Hoellmer et al., 31 Jan 2026).

1. Foundations: Continuous-Time Generative Models for Crystals

Crystalline materials are represented as tuples $(A, X, L)$, where:

  • $A \in \mathbb{N}^N$ are atomic numbers (fixed per inference for crystal structure prediction, CSP),
  • $X \in [0,1)^{N \times 3}$ are fractional atomic positions,
  • $L \in \mathbb{R}^{3 \times 3}$ is the lattice matrix.

The generative process constructs $(X, L)$ conditioned on $A$. Training employs the stochastic interpolant (SI) formalism, which specifies a stochastic path:

$$x_t = \alpha(t)\, x_0 + \beta(t)\, x_1 + \gamma(t)\, z, \quad z \sim \mathcal{N}(0, I),$$

where $x_0 \sim \rho_0$ is sampled from a simple base distribution, $x_1 \sim \rho_1$ is sampled from the data, and the time-dependent coefficients $\alpha, \beta, \gamma$ interpolate between the two. A neural network $b^\theta(t, x_t)$ is trained to predict the conditional velocity $\partial_t x_t$ by minimizing the mean squared error:

$$\mathcal{L}_b(\theta) = \mathbb{E}_{t, z, x_0, x_1} \left\| b^\theta(t, x_t) - \partial_t x_t \right\|^2.$$
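The interpolant and its velocity target can be sketched in a few lines of NumPy. This is an illustration only: it assumes the simple coefficients $\alpha(t) = 1-t$, $\beta(t) = t$, $\gamma(t) = \sqrt{t(1-t)}$ (a common SI choice; the paper's Trig-SDE-Gamma schedule differs), and the function names are hypothetical.

```python
import numpy as np

def interpolant_sample(x0, x1, t, z):
    """Return x_t and the conditional velocity target d/dt x_t.

    Illustrative coefficients: alpha=1-t, beta=t, gamma=sqrt(t(1-t)).
    Valid for t strictly inside (0, 1) because gamma'(t) diverges at the ends.
    """
    alpha, beta, gamma = 1.0 - t, t, np.sqrt(t * (1.0 - t))
    dalpha, dbeta = -1.0, 1.0
    dgamma = (1.0 - 2.0 * t) / (2.0 * np.sqrt(t * (1.0 - t)))
    x_t = alpha * x0 + beta * x1 + gamma * z
    velocity_target = dalpha * x0 + dbeta * x1 + dgamma * z
    return x_t, velocity_target

def flow_matching_loss(b_theta, x0, x1, t, z):
    """Mean-squared error between the model velocity and the target."""
    x_t, target = interpolant_sample(x0, x1, t, z)
    return np.mean((b_theta(t, x_t) - target) ** 2)
```

In practice the expectation over $t, z, x_0, x_1$ is estimated with minibatches and the loss is backpropagated through a neural $b^\theta$; here `b_theta` is any callable.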

At inference, the learned drift field defines the ODE:

$$\mathrm{d}X_t = b^\theta(t, X_t)\, \mathrm{d}t,$$

with $X_0 \sim \rho_0$. In score-based models, an auxiliary denoiser $z^\theta(t, x) \approx z$ reconstructs $\nabla_x \log \rho_t(x)$, yielding an SDE with time-dependent noise and score terms.
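Inference then amounts to integrating the ODE above from $t=0$ to $t=1$. A minimal Euler sketch (the paper's sampler, step schedule, and periodic handling of fractional coordinates may differ):

```python
import numpy as np

def sample_ode(b_theta, x0, n_steps=50):
    """Euler integration of dX_t = b_theta(t, X_t) dt from t=0 to t=1."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + b_theta(t, x) * dt
    return x
```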

These models excel at learning stability via data-matching but cannot natively incorporate explicit objectives—such as energy minimization—into generation.

2. Obstacles to Direct Policy-Gradient RL in Velocity-Only Flow Models

Traditional policy-gradient RL interprets the generative sampler as an MDP, with transitions parameterized by a policy $\pi^\theta(x_{t+\Delta t} \mid x_t)$ and a terminal reward $r(x_{t=1})$. Policy-gradient estimators rely on the likelihood-ratio trick:

$$\nabla_\theta \mathcal{J}_\mathrm{RL} = \mathbb{E}_{\tau}\left[\sum_t \nabla_\theta \log \pi^\theta(x_{t+\Delta t} \mid x_t)\, R(\tau)\right],$$

where $R(\tau)$ encodes the terminal reward.

In score-based diffusion models, the transition kernel $p^\theta(x_{t+\Delta t} \mid x_t)$ is Gaussian, with its mean determined by $b^\theta$ and $\nabla_x \log p_t(x)$. Obtaining log-likelihoods or their gradients thus necessitates access to the score. In contrast, velocity-based (flow-matching) models admit only deterministic ODE trajectories $x_{t+\Delta t} = x_t + b^\theta(t, x_t)\, \Delta t$, resulting in degenerate (delta-function) densities and rendering the standard policy-gradient estimator ill-defined. Injecting stochasticity naively breaks the marginal distributions. Therefore, standard policy-gradient RL is not directly applicable to flow-matching or SI-based generative models that lack an explicit score.
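The failure mode is concrete: a Gaussian transition density is well defined for any positive noise level, but it collapses as the noise vanishes. A small sketch (function name is illustrative, not from the paper):

```python
import numpy as np

def gaussian_transition_logprob(x_next, x, b, t, dt, sigma):
    """log pi(x_next | x) for a Gaussian step centered on the Euler update.

    With sigma > 0 the density is well defined; as sigma -> 0 the kernel
    collapses to a delta function and the log-likelihood of any off-mean
    point diverges to -inf, which is why the likelihood-ratio estimator
    fails for deterministic ODE steps.
    """
    mean = x + b(t, x) * dt
    var = sigma ** 2 * dt
    resid = x_next - mean
    d = x.size
    return -0.5 * (np.sum(resid ** 2) / var + d * np.log(2 * np.pi * var))
```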

3. OMatG-IRL: RL on Velocity Fields via Surrogate Stochastic Policies

OMatG-IRL addresses these challenges by constructing a surrogate stochastic policy around a pretrained velocity-only flow model. The core approach:

  • Introduces additive Gaussian noise with a fixed schedule $\sigma_\mathrm{ref}(t)$ into the velocity-based transitions, yielding:

$$x_{t+\Delta t} = x_t + b^{\theta_\mathrm{ref}}(t, x_t)\, \Delta t + \sigma_\mathrm{ref}(t) \sqrt{\Delta t}\, \xi, \quad \xi \sim \mathcal{N}(0, I).$$

  • Defines a reference stochastic policy $\pi_\mathrm{ref}$ as a Gaussian centered at the deterministic step with variance proportional to $\sigma^2_\mathrm{ref}(t)\, \Delta t$.
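The surrogate transition above can be sketched as a single Euler-Maruyama step; the function name and signature are illustrative, not the paper's API:

```python
import numpy as np

def surrogate_policy_step(b_ref, x, t, dt, sigma_ref, rng):
    """One step of the surrogate stochastic policy.

    The mean is the deterministic Euler update from the pretrained velocity
    field b_ref; Gaussian noise with schedule sigma_ref(t) makes the
    transition density non-degenerate, so log-probabilities exist.
    """
    xi = rng.standard_normal(x.shape)
    mean = x + b_ref(t, x) * dt
    return mean + sigma_ref(t) * np.sqrt(dt) * xi
```

Setting `sigma_ref` to zero recovers the deterministic ODE step exactly.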

The RL-improved policy $\pi^\theta$ is initialized from $\pi_\mathrm{ref}$ and optionally learns its own noise schedule. In the CSP setting, reward functions are defined as $r(x) = -E(x)$ (energy per atom from a surrogate model); invalid structures incur strongly negative rewards. Policy improvement uses group-relative policy optimization (GRPO): trajectories are grouped by composition, and a group-relative advantage is computed,

$$\hat A^i = \frac{r^i - \mathrm{mean}\{r^j\}}{\mathrm{std}\{r^j\}}.$$
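In code, the group-relative advantage is just a within-group standardization of rewards (a small epsilon in the denominator, added here for numerical safety, is an implementation detail):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (same composition):
    zero mean, unit standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```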

This advantage enters a PPO-style clipped RL objective, with an added KL regularizer keeping $\pi^\theta$ close to $\pi_\mathrm{ref}$, resulting in the total objective:

$$\max_\theta\; \mathcal{J}_{\mathrm{GRPO}}(\theta) + \mathcal{J}_{\mathrm{KL}}(\theta).$$
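A minimal NumPy sketch of this combined objective, assuming per-transition log-probabilities are already available. The per-sample KL estimate `logp_new - logp_ref` is a simple choice for illustration; the paper's exact regularizer, clip range, and KL weight may differ.

```python
import numpy as np

def grpo_kl_objective(logp_new, logp_old, logp_ref, advantages,
                      clip_eps=0.2, kl_weight=0.01):
    """PPO-style clipped surrogate plus a KL penalty toward the reference
    policy. Returns the scalar objective to maximize."""
    ratio = np.exp(logp_new - logp_old)           # importance ratio
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    kl = logp_new - logp_ref                      # crude per-sample KL estimate
    return np.mean(surrogate - kl_weight * kl)
```

When the new, old, and reference policies coincide, the objective reduces to the mean advantage; large policy updates are cut off by the clip.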

A notable property, justified by Girsanov's theorem, is that a small $\sigma_\mathrm{ref}$ ensures only $O(\sigma^2)$ changes to the marginal distributions, preserving the original generative performance while providing the stochasticity required for RL-based optimization.

4. Algorithmic Structure and Implementation

OMatG-IRL operates as an inference-time RL procedure applied to a pretrained generative model. The primary algorithm comprises:

  • Setting pretrained weights, small reference noise, and RL-specific hyperparameters (learning rate, group size, PPO clip, KL weight).
  • Sampling compositions and generating trajectory batches per composition under the current stochastic policy.
  • Computing energy-based rewards, group-relative advantages, and storing all transitions.
  • Optimizing the PPO-style policy objective and KL regularization using standard stochastic gradient descent (e.g., Adam).
  • Optionally learning time-dependent velocity annealing schedules via a residual neural net, with annealing-relevant rewards (e.g., based on cRMSE instead of energy).
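The loop above can be illustrated end to end with a deliberately tiny toy: a single scalar residual drift `u` is optimized with REINFORCE-style updates and group-relative advantages against a 1-D surrogate reward. This sketch omits PPO clipping, the KL term, and the neural network (all present in the full method), and every name in it is hypothetical.

```python
import numpy as np

def rollout(u, sigma, n_steps, dt, group_size, rng):
    """Sample a group of 1-D trajectories under the stochastic policy
    x' = x + u*dt + sigma*sqrt(dt)*xi, tracking the policy score d(log pi)/du."""
    x = np.zeros(group_size)
    score = np.zeros(group_size)
    for _ in range(n_steps):
        noise = sigma * np.sqrt(dt) * rng.standard_normal(group_size)
        x = x + u * dt + noise
        score += noise / sigma ** 2  # derivative of the Gaussian step log-density w.r.t. u
    return x, score

def train(n_iters=100, lr=0.01, sigma=0.1, n_steps=5, dt=0.2,
          group_size=64, target=1.0, seed=0):
    """Steer the terminal sample toward `target` using group-relative advantages."""
    rng = np.random.default_rng(seed)
    u = 0.0  # learnable residual drift, initialized at the reference (zero)
    for _ in range(n_iters):
        x1, score = rollout(u, sigma, n_steps, dt, group_size, rng)
        rewards = -(x1 - target) ** 2                       # toy "energy" reward
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        u += lr * np.mean(adv * score)                      # policy-gradient ascent
    return u
```

With total integration time $n\_steps \cdot dt = 1$, the optimal residual drift is $u \approx 1$, and the loop converges there from $u = 0$.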

This approach allows OMatG-IRL to reinforce the generative policy with respect to black-box objectives (e.g., predicted energy) and to learn nontrivial time-dependent schedules that outperform handcrafted annealing.

5. Empirical Results in Crystal Structure Prediction

Experiments were conducted on the MP-20 dataset (∼45,000 inorganic crystals, $N \leq 20$ atoms, 89 elements). CSP is formulated as generating $(X, L)$ given fixed $A$. Diversity arises naturally from conditioning on a broad composition set. The principal evaluation metrics include match rate (fraction matching relaxed reference structures), normalized RMSD, predicted energy per atom, and invalid energy rate.

Baseline models (e.g., OMatG Trig-SDE-Gamma with $N_t = 740$ and velocity annealing) achieve a $\sim 69\%$ match rate; omitting annealing ($N_t = 50$) yields $\sim 60\%$. OMatG-IRL is evaluated in both score-based (using denoiser $z^\theta$) and velocity-based (using only $b^\theta$ with surrogate noise) variants. Both variants reduce mean energy per atom by $\approx 0.5$ eV, preserve or slightly improve RMSD, and maintain match rates ($\sim 60\%$) with only $N_t = 50$ integration steps, yielding an order-of-magnitude (≥10×) improvement in sampling efficiency.

In further experiments, OMatG-IRL learns residual time-dependent annealing schedules $s^\theta(t)$, matching the accuracy of heavily annealed models (e.g., $\mathrm{METRe} \approx 70\%$, cRMSE $\approx 0.19$) using a fraction of the integration steps ($N_t = 100$ versus $N_t = 950$), and generalizing down to $N_t = 10$ with minimal degradation. Handcrafted annealing fails in this low-step regime.

| Model/Setting | Match Rate | Mean Energy Δ (eV/atom) | $N_t$ (Steps) | Sampling Speedup |
|---|---|---|---|---|
| OMatG Trig-SDE-Gamma | ~69% | Baseline | 740 | 1× |
| OMatG No Annealing | ~60% | Baseline | 50 | 15× |
| OMatG-IRL (Score/Velocity) | ~60% | −0.5 (improved) | 50 | 15× |

6. Limitations and Scope of Method

OMatG-IRL is an inference-time framework and avoids retraining the underlying generative model; however, it requires conducting many RL rollouts per composition in order to estimate group-relative advantages, imposing additional computational overhead. In CSP, diversity emerges from composition conditioning. For de-novo materials generation (DNG), handling discrete atom-type flows and enforcing diversity requires further methodological extensions, as simple conditioning may not suffice.

The current approach learns residual velocities (or annealing schedules), but it is plausible to combine this with joint learning of the noise schedule, or to transition to off-policy RL to maximize sample reuse. The surrogate policy only perturbs the velocity, but extensions to full SI models encompassing the denoiser $z^\theta$ are straightforward; the score-based variant already implements this. Integration with end-to-end pipelines, including composition generation and constraint satisfaction, would be required for multi-objective inverse materials design.

7. Technical Contributions and Future Directions

The key technical contributions of OMatG-IRL are:

  1. Introduction of a surrogate stochastic policy (via SDE perturbation) for pretrained velocity-only models that preserves marginal distributions up to $O(\sigma^2)$, enabling RL-based exploration.
  2. Deployment of group-relative policy optimization (GRPO), PPO-style clipping, and KL regularization to enforce energy-based objectives during inference.
  3. Extension of RL to velocity-only models—without explicit score computations—addressing an open gap in flow-matching and SI generative modeling for scientific applications.
  4. Automated learning of time-dependent velocity-annealing schedules, leading to substantial (≥10×) sampling-efficiency gains, with matching or improved CSP accuracy.

A plausible implication is that OMatG-IRL can serve as a blueprint for RL-driven, property-optimized generative modeling in broader scientific domains, provided similar data-conditional flows are available. The methodology is positioned for expansion to multi-property, constraint-aware, and de-novo materials design scenarios (Hoellmer et al., 31 Jan 2026).
