Flow-matching -- efficient coarse-graining of molecular dynamics without forces (2203.11167v4)

Published 21 Mar 2022 in physics.comp-ph, cs.LG, physics.bio-ph, and physics.chem-ph

Abstract: Coarse-grained (CG) molecular simulations have become a standard tool to study molecular processes on time- and length-scales inaccessible to all-atom simulations. Parameterizing CG force fields to match all-atom simulations has mainly relied on force-matching or relative entropy minimization, which require many samples from costly simulations with all-atom or CG resolutions, respectively. Here we present flow-matching, a new training method for CG force fields that combines the advantages of both methods by leveraging normalizing flows, a generative deep learning method. Flow-matching first trains a normalizing flow to represent the CG probability density, which is equivalent to minimizing the relative entropy without requiring iterative CG simulations. Subsequently, the flow generates samples and forces according to the learned distribution in order to train the desired CG free energy model via force matching. Even without requiring forces from the all-atom simulations, flow-matching outperforms classical force-matching by an order of magnitude in terms of data efficiency, and produces CG models that can capture the folding and unfolding transitions of small proteins.

Citations (43)

Summary

  • The paper presents a two-stage flow-matching framework that bypasses direct force calculations in coarse-graining, significantly reducing data noise.
  • It employs a normalizing flow with latent-variable augmentation, followed by teacher-student force matching, achieving up to a 70× improvement in data efficiency.
  • The method accurately reproduces equilibrium thermodynamics and folding kinetics in biomolecules, validated on systems like chignolin and capped alanine.

The paper presents a novel two-stage methodology for generating coarse-grained (CG) force fields via "flow-matching." By leveraging deep generative modeling with normalizing flows, the approach circumvents the need for expensive force calculations from atomistic simulations and for iterative CG re-sampling.

The key aspects of the work are as follows:

  • Dual Objective Framework:
    • The first stage trains a normalizing flow (an invertible neural network model) to approximate the CG equilibrium probability density via maximum-likelihood density estimation. This is equivalent to minimizing the relative entropy between the flow-generated CG distribution and the marginal distribution obtained from all-atom molecular dynamics (MD) trajectories, without requiring any iterative CG simulations. Notably, the flow is trained in an augmented latent-variable space, which relaxes strict bijectivity restrictions and thereby enhances its expressiveness in representing the complex, multimodal distributions characteristic of high-dimensional molecular systems (a minimal sketch of this stage is given after this list).
    • In the second stage, termed "teacher-student force-matching," the trained flow is used to generate abundant synthetic samples along with their corresponding forces, computed as the negative gradient of the energy implied by the flow (equivalently, proportional to the gradient of its log-density). These forces, which are substantially less noisy than instantaneous all-atom forces, are then used to train a separate CG potential model (implemented via a modified CGnet architecture) through variational force-matching. The loss is the mean-squared error between the flow-generated forces and the forces predicted by the CG potential, i.e., the negative gradient of the learned potential. This two-step scheme transfers the information contained in the flow model (the "teacher") into a physicochemically motivated CG force field (the "student"), enabling more efficient learning (a corresponding sketch of this stage follows the list).
  • Advantages over Classical Methods:
    • Classical force matching requires storing or recomputing atomistic forces, which are inherently noisy, while iterative relative-entropy minimization demands expensive CG re-sampling during training. In contrast, the proposed method requires neither atomistic forces nor iterative re-sampling, which significantly improves data efficiency and mitigates noise.
    • The flow-matching framework efficiently generates uncorrelated samples and controls the sampling quality in regions of low Boltzmann probability, thereby reducing the effective data cost. In benchmark examples, the method attains comparable or superior performance using less than 10% of the data required by standard force-matching approaches. For instance, in the case of chignolin, the Flow-CGnet achieves performance equivalent to a CGnet trained on all available samples, corresponding to a ~70× data efficiency improvement.
  • Demonstrative Applications:
    • The method is first validated on capped alanine (alanine dipeptide), where the CG mapping is performed at the level of the backbone atoms and main-chain torsion angles. Quantitative metrics, namely the Kullback–Leibler divergence and the mean-squared error (MSE) between discretized free energies over the Ramachandran (ϕ/ψ) space, show that the flow and the subsequent Flow-CGnet recover the all-atom free-energy landscape more accurately than a traditional CGnet trained with classical force matching in low-data regimes (a sketch of this evaluation appears after the list).
    • The approach is also applied to four fast-folding proteins (chignolin, Trp-cage, the α/β protein BBA, and the villin headpiece) at one-bead-per-residue resolution using the Cα positions. The resulting CG potentials capture both the equilibrium folded conformations (root-mean-square deviations within 2.5 Å of the experimental structures) and the essential thermodynamic features, including folding free-energy barriers and relative state populations. Furthermore, the CG simulations reproduce the folding kinetics and the sequence of native-contact formation across distinct protein segments, matching the mechanism revealed by extensive all-atom MD simulations.
  • Discussion on Limitations and Future Directions:
    • Although the flow-matching strategy significantly enhances data efficiency and circumvents the need for atomistic force information, the authors note that the global internal-coordinate representations used in current normalizing-flow architectures become increasingly sensitive when scaling to larger macromolecules: small deviations in torsional angles propagate along the chain and can induce steric clashes not present in the training data.
    • The paper suggests potential future avenues, including the design of coupling flows with equivariant neural networks operating in Cartesian space that are informed by internal coordinates. An eventual goal is to extend transferability across molecules differing in size and topology by sharing parameters within the CG potential.
  • Technical Contributions:
    • The formulation combines aspects of force matching and relative-entropy minimization within a single machine-learning framework, in which the flow model serves primarily as an efficient density estimator that indirectly provides mean forces for training the CG potential.
    • Detailed derivations show that the gradients obtained from the variational force-matching objective are unbiased with respect to the parameters of the flow model. Moreover, the use of latent variable augmentations improves the capacity of the flow to capture multimodal free energy surfaces despite the imposed invertibility constraints.
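
To make the two-stage procedure concrete, the following is a minimal sketch of the first (density-estimation) stage: a small RealNVP-style coupling flow trained by maximum likelihood on CG coordinates. It illustrates the idea only; the paper's flow operates on internal coordinates with latent-variable augmentation and a more elaborate architecture, and the names `AffineCoupling`, `Flow`, and `train_flow` are illustrative rather than taken from the authors' code.

```python
# Minimal stage-1 sketch: a small RealNVP-style coupling flow trained by
# maximum likelihood on CG coordinates (shape: n_frames x dim). Illustrative
# only; the paper's flow uses internal coordinates and latent augmentation.
import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        # Transform one half of the coordinates conditioned on the other half;
        # return the transformed vector and the log|det Jacobian|.
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                       # bounded scales for stability
        y2 = x2 * torch.exp(s) + t
        return torch.cat([x1, y2], dim=-1), s.sum(dim=-1)

class Flow(nn.Module):
    def __init__(self, dim, n_layers=6):
        super().__init__()
        self.dim = dim
        self.layers = nn.ModuleList([AffineCoupling(dim) for _ in range(n_layers)])

    def log_prob(self, x):
        # Change of variables: log p(x) = log N(z; 0, I) + sum_k log|det J_k|.
        z, log_det = x, torch.zeros(x.shape[0], device=x.device)
        for i, layer in enumerate(self.layers):
            if i % 2:                           # alternate which half is updated
                z = torch.flip(z, dims=[1])
            z, ld = layer(z)
            log_det = log_det + ld
        base = -0.5 * (z ** 2).sum(dim=-1) - 0.5 * self.dim * math.log(2 * math.pi)
        return base + log_det

def train_flow(flow, cg_coords, n_epochs=200, lr=1e-3):
    # Maximum-likelihood training of the flow on mapped all-atom samples;
    # no CG simulation and no atomistic forces are needed here.
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = -flow.log_prob(cg_coords).mean()  # negative log-likelihood
        loss.backward()
        opt.step()
    return flow
```

Maximizing the likelihood of the CG samples is equivalent, up to a data-dependent constant, to minimizing the relative entropy between the mapped all-atom distribution and the flow, which is why this stage needs no iterative CG simulations.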
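
A sketch of the second stage in the same spirit follows: the trained flow acts as the teacher, supplying synthetic samples together with low-noise forces obtained by differentiating its log-density, while a separate CG potential (the student) is fitted by variational force matching. The code assumes a trained flow exposing `sample(n)` and `log_prob(x)` (the sketch above provides only `log_prob`; sampling would additionally require the inverse coupling pass or a flow library); `CGPotential`, `kT`, and the network sizes are illustrative stand-ins, not the paper's CGnet.

```python
# Minimal stage-2 sketch (teacher-student force matching). Assumes a trained
# flow with sample(n) and log_prob(x); CGPotential is a generic stand-in for
# the CG free-energy model, not the paper's CGnet.
import torch
import torch.nn as nn

def flow_forces(flow, x, kT=1.0):
    # Teacher forces: with U(x) = -kT * log q_flow(x), the force is
    # F(x) = -dU/dx = kT * d(log q_flow)/dx, obtained here by autograd.
    x = x.detach().requires_grad_(True)
    log_q = flow.log_prob(x).sum()
    (grad,) = torch.autograd.grad(log_q, x)
    return kT * grad

class CGPotential(nn.Module):
    # Student: a scalar energy network U_theta(x); forces are its negative gradient.
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forces(self, x):
        x = x.detach().requires_grad_(True)
        u = self.net(x).sum()
        (grad,) = torch.autograd.grad(u, x, create_graph=True)
        return -grad                            # F = -dU/dx

def train_student(flow, potential, n_steps=1000, batch=256, lr=1e-3):
    opt = torch.optim.Adam(potential.parameters(), lr=lr)
    for _ in range(n_steps):
        x = flow.sample(batch).detach()         # uncorrelated synthetic samples
        f_teacher = flow_forces(flow, x)        # low-noise reference forces
        f_student = potential.forces(x)
        loss = ((f_student - f_teacher) ** 2).mean()   # variational force matching
        opt.zero_grad()
        loss.backward()
        opt.step()
    return potential
```

Because the student's forces are the negative gradient of a scalar energy network, the mean-squared force-matching loss can be backpropagated through the double differentiation into the potential's parameters, and no atomistic forces enter the objective at any point.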
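
For the capped-alanine benchmark, the reported comparison can be reduced to histogram statistics on the Ramachandran plane: discretize the reference and model ϕ/ψ samples, convert populations to free energies, and compare. The sketch below is only indicative of such an evaluation; the bin count, the temperature factor `kT`, and the treatment of empty bins are assumptions rather than the paper's exact protocol.

```python
# Indicative comparison of a CG model to the all-atom reference on the
# Ramachandran plane; bin count, kT, and empty-bin handling are assumptions.
import numpy as np

def ramachandran_metrics(phi_psi_ref, phi_psi_model, n_bins=60, kT=1.0, eps=1e-12):
    """Both inputs: arrays of shape (n_frames, 2) with dihedrals in radians."""
    edges = [np.linspace(-np.pi, np.pi, n_bins + 1)] * 2
    h_ref, _, _ = np.histogram2d(phi_psi_ref[:, 0], phi_psi_ref[:, 1], bins=edges)
    h_mod, _, _ = np.histogram2d(phi_psi_model[:, 0], phi_psi_model[:, 1], bins=edges)
    p, q = h_ref / h_ref.sum(), h_mod / h_mod.sum()

    # KL divergence D(p || q) over bins populated by the reference.
    mask = p > 0
    kl = np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps)))

    # Free energies up to an additive constant, F = -kT ln p; compare bins
    # populated in both distributions after removing the mean offset.
    both = (p > 0) & (q > 0)
    f_ref = -kT * np.log(p[both]); f_ref -= f_ref.mean()
    f_mod = -kT * np.log(q[both]); f_mod -= f_mod.mean()
    mse = np.mean((f_mod - f_ref) ** 2)
    return kl, mse
```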

In summary, the paper provides a comprehensive, technically sophisticated strategy for efficient bottom-up coarse-graining in molecular dynamics. By eliminating the dependence on atomistic forces and iterative re-sampling, the flow-matching method stands out in terms of data efficiency and has been demonstrated to accurately reproduce both the equilibrium thermodynamics and the kinetic pathways of small biomolecular systems.