
OMatG-IRL: RL for Material Generation

Updated 7 February 2026
  • The paper introduces a surrogate stochastic policy that enables RL on pretrained velocity-only flow models without explicit score computation.
  • It leverages group-relative policy optimization with PPO-style clipping and KL regularization to guide energy minimization in crystal structure prediction.
  • Empirical results demonstrate a reduction in energy per atom and a ≥10× sampling efficiency gain while maintaining competitive structure match rates.

Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL) is a reinforcement learning (RL) framework that enables the direct steering of continuous-time generative models for crystalline materials toward explicit downstream objectives, such as minimizing energy per atom, at inference time. It achieves this by introducing a stochastic policy gradient approach that operates on pretrained velocity fields, thereby circumventing the need for explicit score computation—an obstacle previously limiting RL methods in this class of generative models. OMatG-IRL allows objective-driven sampling from generative models using only velocity predictions, leverages surrogate stochastic dynamics, and reinforces policies using group-relative policy optimization and KL regularization, while maintaining data-distributional properties and sampling efficiency competitive with score-dependent approaches (Hoellmer et al., 31 Jan 2026).

1. Foundations: Continuous-Time Generative Models for Crystals

Crystalline materials are represented as tuples $(A, X, L)$, where:

  • $A \in \mathbb{N}^N$ are atomic numbers (fixed per inference for crystal structure prediction, CSP),
  • $X \in [0,1)^{N \times 3}$ are fractional atomic positions,
  • $L \in \mathbb{R}^{3 \times 3}$ is the lattice matrix.

The generative process constructs $(X, L)$ conditioned on $A$. Training employs the stochastic interpolant (SI) formalism, which specifies a stochastic path:

$$x_t = \alpha(t)\, x_0 + \beta(t)\, x_1 + \gamma(t)\, z, \quad z \sim \mathcal{N}(0, I),$$

where $x_0 \sim \rho_0$ is sampled from a simple base distribution, $x_1 \sim \rho_1$ is sampled from the data, and the time-dependent coefficients $\alpha, \beta, \gamma$ interpolate between the two. A neural network $b^\theta(t, x_t)$ is trained to predict the conditional velocity $\partial_t x_t$ by minimizing the mean squared error:

$$\mathcal{L}_b(\theta) = \mathbb{E}_{t, z, x_0, x_1} \left\| b^\theta(t, x_t) - \partial_t x_t \right\|^2.$$
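The interpolant and its velocity target can be sketched in a few lines of NumPy. This is an illustration only: it assumes the simple coefficients $\alpha(t) = 1-t$, $\beta(t) = t$, $\gamma(t) = \sqrt{t(1-t)}$ (a common SI choice; the paper's Trig-SDE-Gamma schedule differs), and the function names are hypothetical.

```python
import numpy as np

def interpolant_sample(x0, x1, t, z):
    """Return x_t and the conditional velocity target d/dt x_t.

    Illustrative coefficients: alpha=1-t, beta=t, gamma=sqrt(t(1-t)).
    Valid for t strictly inside (0, 1) because gamma'(t) diverges at the ends.
    """
    alpha, beta, gamma = 1.0 - t, t, np.sqrt(t * (1.0 - t))
    dalpha, dbeta = -1.0, 1.0
    dgamma = (1.0 - 2.0 * t) / (2.0 * np.sqrt(t * (1.0 - t)))
    x_t = alpha * x0 + beta * x1 + gamma * z
    velocity_target = dalpha * x0 + dbeta * x1 + dgamma * z
    return x_t, velocity_target

def flow_matching_loss(b_theta, x0, x1, t, z):
    """Mean-squared error between the model velocity and the target."""
    x_t, target = interpolant_sample(x0, x1, t, z)
    return np.mean((b_theta(t, x_t) - target) ** 2)
```

In practice the expectation over $t, z, x_0, x_1$ is estimated with minibatches and the loss is backpropagated through a neural $b^\theta$; here `b_theta` is any callable.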

At inference, the learned drift field defines the ODE:

$$\mathrm{d}X_t = b^\theta(t, X_t)\, \mathrm{d}t,$$

with $X_0 \sim \rho_0$. In score-based models, an auxiliary denoiser $z^\theta(t, x) \approx z$ reconstructs $\nabla_x \log \rho_t(x)$, yielding an SDE with time-dependent noise and score terms.
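Inference then amounts to integrating the ODE above from $t=0$ to $t=1$. A minimal Euler sketch (the paper's sampler, step schedule, and periodic handling of fractional coordinates may differ):

```python
import numpy as np

def sample_ode(b_theta, x0, n_steps=50):
    """Euler integration of dX_t = b_theta(t, X_t) dt from t=0 to t=1."""
    x, dt = x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + b_theta(t, x) * dt
    return x
```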

These models excel at learning stability via data-matching but cannot natively incorporate explicit objectives—such as energy minimization—into generation.

2. Obstacles to Direct Policy-Gradient RL in Velocity-Only Flow Models

Traditional policy-gradient RL interprets the generative sampler as an MDP, with transitions parameterized by a policy $\pi^\theta(x_{t+\Delta t} \mid x_t)$ and a terminal reward $r(x_{t=1})$. Policy-gradient estimators rely on the likelihood-ratio trick:

$$\nabla_\theta \mathcal{J}_\mathrm{RL} = \mathbb{E}_{\tau}\left[\sum_t \nabla_\theta \log \pi^\theta(x_{t+\Delta t} \mid x_t)\, R(\tau)\right],$$

where $R(\tau)$ encodes the terminal reward.

In score-based diffusion models, the transition kernel $p^\theta(x_{t+\Delta t} \mid x_t)$ is Gaussian, with its mean determined by $b^\theta$ and $\nabla_x \log p_t(x)$. Obtaining log-likelihoods or their gradients thus necessitates access to the score. In contrast, velocity-based (flow-matching) models admit only deterministic ODE trajectories $x_{t+\Delta t} = x_t + b^\theta(t, x_t)\, \Delta t$, resulting in degenerate (delta-function) densities and rendering the standard policy-gradient estimator ill-defined. Injecting stochasticity naively breaks the marginal distributions. Therefore, standard policy-gradient RL is not directly applicable to flow-matching or SI-based generative models that lack an explicit score.
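The failure mode is concrete: a Gaussian transition density is well defined for any positive noise level, but it collapses as the noise vanishes. A small sketch (function name is illustrative, not from the paper):

```python
import numpy as np

def gaussian_transition_logprob(x_next, x, b, t, dt, sigma):
    """log pi(x_next | x) for a Gaussian step centered on the Euler update.

    With sigma > 0 the density is well defined; as sigma -> 0 the kernel
    collapses to a delta function and the log-likelihood of any off-mean
    point diverges to -inf, which is why the likelihood-ratio estimator
    fails for deterministic ODE steps.
    """
    mean = x + b(t, x) * dt
    var = sigma ** 2 * dt
    resid = x_next - mean
    d = x.size
    return -0.5 * (np.sum(resid ** 2) / var + d * np.log(2 * np.pi * var))
```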

3. OMatG-IRL: RL on Velocity Fields via Surrogate Stochastic Policies

OMatG-IRL addresses these challenges by constructing a surrogate stochastic policy around a pretrained velocity-only flow model. The core approach:

  • Introduces additive Gaussian noise with a fixed schedule $\sigma_\mathrm{ref}(t)$ into the velocity-based transitions, yielding:

$$x_{t+\Delta t} = x_t + b^{\theta_\mathrm{ref}}(t, x_t)\, \Delta t + \sigma_\mathrm{ref}(t) \sqrt{\Delta t}\, \xi, \quad \xi \sim \mathcal{N}(0, I).$$

  • Defines a reference stochastic policy $\pi_\mathrm{ref}$ as a Gaussian centered at the deterministic step with variance proportional to $\sigma^2_\mathrm{ref}(t)\, \Delta t$.
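The surrogate transition above can be sketched as a single Euler-Maruyama step; the function name and signature are illustrative, not the paper's API:

```python
import numpy as np

def surrogate_policy_step(b_ref, x, t, dt, sigma_ref, rng):
    """One step of the surrogate stochastic policy.

    The mean is the deterministic Euler update from the pretrained velocity
    field b_ref; Gaussian noise with schedule sigma_ref(t) makes the
    transition density non-degenerate, so log-probabilities exist.
    """
    xi = rng.standard_normal(x.shape)
    mean = x + b_ref(t, x) * dt
    return mean + sigma_ref(t) * np.sqrt(dt) * xi
```

Setting `sigma_ref` to zero recovers the deterministic ODE step exactly.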

The RL-improved policy $\pi^\theta$ is initialized from $\pi_\mathrm{ref}$ and optionally learns its own noise schedule. In the CSP setting, reward functions are defined as $r(x) = -E(x)$ (energy per atom from a surrogate model); invalid structures incur strongly negative rewards. Policy improvement uses group-relative policy optimization (GRPO): trajectories are grouped by composition, and a group-relative advantage is computed,

$$\hat A^i = \frac{r^i - \mathrm{mean}\{r^j\}}{\mathrm{std}\{r^j\}}.$$
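In code, the group-relative advantage is just a within-group standardization of rewards (a small epsilon in the denominator, added here for numerical safety, is an implementation detail):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group (same composition):
    zero mean, unit standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```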

This advantage enters a PPO-style clipped RL objective, with an added KL regularizer keeping $\pi^\theta$ close to $\pi_\mathrm{ref}$, resulting in the total objective:

$$\max_\theta\; \mathcal{J}_{\mathrm{GRPO}}(\theta) + \mathcal{J}_{\mathrm{KL}}(\theta).$$
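A minimal NumPy sketch of this combined objective, assuming per-transition log-probabilities are already available. The per-sample KL estimate `logp_new - logp_ref` is a simple choice for illustration; the paper's exact regularizer, clip range, and KL weight may differ.

```python
import numpy as np

def grpo_kl_objective(logp_new, logp_old, logp_ref, advantages,
                      clip_eps=0.2, kl_weight=0.01):
    """PPO-style clipped surrogate plus a KL penalty toward the reference
    policy. Returns the scalar objective to maximize."""
    ratio = np.exp(logp_new - logp_old)           # importance ratio
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    kl = logp_new - logp_ref                      # crude per-sample KL estimate
    return np.mean(surrogate - kl_weight * kl)
```

When the new, old, and reference policies coincide, the objective reduces to the mean advantage; large policy updates are cut off by the clip.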

A notable property, justified by Girsanov's theorem, is that a small $\sigma_\mathrm{ref}$ ensures only $O(\sigma^2)$ changes to the marginal distributions, preserving the original generative performance while providing the stochasticity required for RL-based optimization.

4. Algorithmic Structure and Implementation

OMatG-IRL operates as an inference-time RL procedure applied to a pretrained generative model. The primary algorithm comprises:

  • Setting pretrained weights, small reference noise, and RL-specific hyperparameters (learning rate, group size, PPO clip, KL weight).
  • Sampling compositions and generating trajectory batches per composition under the current stochastic policy.
  • Computing energy-based rewards, group-relative advantages, and storing all transitions.
  • Optimizing the PPO-style policy objective and KL regularization using standard stochastic gradient descent (e.g., Adam).
  • Optionally learning time-dependent velocity annealing schedules via a residual neural net, with annealing-relevant rewards (e.g., based on cRMSE instead of energy).
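The loop above can be illustrated end to end with a deliberately tiny toy: a single scalar residual drift `u` is optimized with REINFORCE-style updates and group-relative advantages against a 1-D surrogate reward. This sketch omits PPO clipping, the KL term, and the neural network (all present in the full method), and every name in it is hypothetical.

```python
import numpy as np

def rollout(u, sigma, n_steps, dt, group_size, rng):
    """Sample a group of 1-D trajectories under the stochastic policy
    x' = x + u*dt + sigma*sqrt(dt)*xi, tracking the policy score d(log pi)/du."""
    x = np.zeros(group_size)
    score = np.zeros(group_size)
    for _ in range(n_steps):
        noise = sigma * np.sqrt(dt) * rng.standard_normal(group_size)
        x = x + u * dt + noise
        score += noise / sigma ** 2  # derivative of the Gaussian step log-density w.r.t. u
    return x, score

def train(n_iters=100, lr=0.01, sigma=0.1, n_steps=5, dt=0.2,
          group_size=64, target=1.0, seed=0):
    """Steer the terminal sample toward `target` using group-relative advantages."""
    rng = np.random.default_rng(seed)
    u = 0.0  # learnable residual drift, initialized at the reference (zero)
    for _ in range(n_iters):
        x1, score = rollout(u, sigma, n_steps, dt, group_size, rng)
        rewards = -(x1 - target) ** 2                       # toy "energy" reward
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        u += lr * np.mean(adv * score)                      # policy-gradient ascent
    return u
```

With total integration time $n\_steps \cdot dt = 1$, the optimal residual drift is $u \approx 1$, and the loop converges there from $u = 0$.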

This approach allows OMatG-IRL to reinforce the generative policy with respect to black-box objectives (e.g., predicted energy) and to learn nontrivial time-dependent schedules that outperform handcrafted annealing.

5. Empirical Results in Crystal Structure Prediction

Experiments were conducted on the MP-20 dataset (∼45,000 inorganic crystals, $N \leq 20$ atoms, 89 elements). CSP is formulated as generating $(X, L)$ given fixed $A$. Diversity arises naturally from conditioning on a broad composition set. The principal evaluation metrics include match rate (fraction matching relaxed reference structures), normalized RMSD, predicted energy per atom, and invalid energy rate.

Baseline models (e.g., OMatG Trig-SDE-Gamma with $N_t = 740$ and velocity annealing) achieve a $\sim 69\%$ match rate; omitting annealing ($N_t = 50$) yields $\sim 60\%$. OMatG-IRL is evaluated in both score-based (using denoiser $z^\theta$) and velocity-based (using only $b^\theta$ with surrogate noise) variants. Both variants reduce mean energy per atom by $\approx 0.5$ eV, preserve or slightly improve RMSD, and maintain match rates ($\sim 60\%$) with only $N_t = 50$ integration steps, yielding an order-of-magnitude (≥10×) improvement in sampling efficiency.

In further experiments, OMatG-IRL learns residual time-dependent annealing schedules $s^\theta(t)$, matching the accuracy of heavily annealed models (e.g., $\mathrm{METRe} \approx 70\%$, cRMSE $\approx 0.19$) using a fraction of the integration steps ($N_t = 100$ versus $N_t = 950$), and generalizing down to $N_t = 10$ with minimal degradation. Handcrafted annealing fails in this low-step regime.

| Model/Setting | Match Rate | Mean Energy Δ (eV/atom) | $N_t$ (Steps) | Sampling Speedup |
|---|---|---|---|---|
| OMatG Trig-SDE-Gamma | ~69% | Baseline | 740 | 1× |
| OMatG No Annealing | ~60% | Baseline | 50 | 15× |
| OMatG-IRL (Score/Velocity) | ~60% | −0.5 (improved) | 50 | 15× |

6. Limitations and Scope of Method

OMatG-IRL is an inference-time framework and avoids retraining the underlying generative model; however, it requires conducting many RL rollouts per composition in order to estimate group-relative advantages, imposing additional computational overhead. In CSP, diversity emerges from composition conditioning. For de-novo materials generation (DNG), handling discrete atom-type flows and enforcing diversity requires further methodological extensions, as simple conditioning may not suffice.

The current approach learns residual velocities (or annealing schedules), but it is plausible to combine this with joint learning of the noise schedule, or to transition to off-policy RL to maximize sample reuse. The surrogate policy only perturbs the velocity, but extensions to full SI models encompassing the denoiser $z^\theta$ are straightforward; the score-based variant already implements this. Integration with end-to-end pipelines, including composition generation and constraint satisfaction, would be required for multi-objective inverse materials design.

7. Technical Contributions and Future Directions

The key technical contributions of OMatG-IRL are:

  1. Introduction of a surrogate stochastic policy (via SDE perturbation) for pretrained velocity-only models that preserves marginal distributions up to $O(\sigma^2)$, enabling RL-based exploration.
  2. Deployment of group-relative policy optimization (GRPO), PPO-style clipping, and KL regularization to enforce energy-based objectives during inference.
  3. Extension of RL to velocity-only models—without explicit score computations—addressing an open gap in flow-matching and SI generative modeling for scientific applications.
  4. Automated learning of time-dependent velocity-annealing schedules, leading to substantial (≥10×) sampling-efficiency gains, with matching or improved CSP accuracy.

A plausible implication is that OMatG-IRL can serve as a blueprint for RL-driven, property-optimized generative modeling in broader scientific domains, provided similar data-conditional flows are available. The methodology is positioned for expansion to multi-property, constraint-aware, and de-novo materials design scenarios (Hoellmer et al., 31 Jan 2026).
