Need for perceptual or adversarial losses in diffusion autoencoders

Determine whether diffusion autoencoders for protein structure reconstruction and generation can obviate the need for perceptual or adversarial losses, and specify the regimes under which such auxiliary losses are necessary or unnecessary to ensure semantic consistency and biophysical plausibility.

Background

The paper employs a diffusion autoencoder for protein structures, trained with a flow-matching objective, and discusses inference techniques such as classifier annealing that keep samples on the data manifold while maintaining prompt fidelity. In computer vision, perceptual and adversarial losses have historically been used to enforce semantic consistency and realistic textures; diffusion autoencoders were motivated in part by the goal of reducing reliance on such losses.
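One way to make the question concrete is to write the training objective with explicit auxiliary terms. The sketch below assumes a standard conditional flow-matching loss on a linear interpolation path between Gaussian noise x_0 and a data structure x_1. It also assumes a hypothetical encoder E_phi that produces the latent, a reconstruction x-hat decoded from that latent, and hypothetical weights lambda_perc and lambda_adv. The paper's exact parameterization may differ.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\mathrm{data}},\; t \sim \mathcal{U}[0, 1],\\
\mathcal{L}_{\mathrm{FM}}(\theta, \phi) &= \mathbb{E}_{t, x_0, x_1}\Big[\big\lVert v_\theta\big(x_t, t, E_\phi(x_1)\big) - (x_1 - x_0)\big\rVert^2\Big],\\
\mathcal{L}(\theta, \phi) &= \mathcal{L}_{\mathrm{FM}}(\theta, \phi) + \lambda_{\mathrm{perc}}\,\mathcal{L}_{\mathrm{perc}}(\hat{x}, x_1) + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}}(\hat{x}).
\end{aligned}
```

In this notation, the open question is whether setting lambda_perc = lambda_adv = 0 is enough to obtain semantically consistent, biophysically plausible reconstructions. It also asks in which regimes nonzero weights are needed.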

The authors note that, in the biomolecular setting, small structural artifacts can render a molecule invalid even when RMSD to the reference remains low. They explicitly state that it is unclear whether diffusion autoencoders truly eliminate the need for perceptual or adversarial losses, and they point to divergent practice in recent work: Sargent et al. (2025) use perceptual losses during training, whereas Chen et al. (2025) do not. This motivates a precise determination of when such losses are required.
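The gap between a low RMSD and physical validity can be illustrated with a toy example. The chain is a straight line of 100 C-alpha atoms spaced 3.8 Å apart, which is the typical distance between consecutive C-alpha atoms in a trans peptide. In a hypothetical "reconstruction," a single atom is shifted by 1.0 Å. The names `rmsd` and `ca_ca_violations` are illustrative helpers, and the 0.5 Å tolerance is an assumed threshold. Neither is the paper's evaluation protocol.

```python
import math

# Typical distance between consecutive C-alpha atoms in a trans peptide chain (Angstrom).
CA_CA_IDEAL = 3.8


def rmsd(a, b):
    """RMSD between two equal-length coordinate lists, computed WITHOUT superposition.

    Optimal (Kabsch) superposition can only lower this value, so it is an upper bound.
    """
    assert len(a) == len(b)
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


def ca_ca_violations(coords, tol=0.5):
    """Return indices i where |d(i, i+1) - 3.8 A| exceeds tol (tol is an illustrative threshold)."""
    bad = []
    for i in range(len(coords) - 1):
        d = math.dist(coords[i], coords[i + 1])
        if abs(d - CA_CA_IDEAL) > tol:
            bad.append(i)
    return bad


# Toy 100-residue "backbone": C-alpha atoms on a straight line, 3.8 A apart.
n = 100
reference = [(CA_CA_IDEAL * i, 0.0, 0.0) for i in range(n)]

# Hypothetical reconstruction: identical except one atom shifted 1.0 A along the chain axis.
reconstruction = list(reference)
x, y, z = reconstruction[50]
reconstruction[50] = (x + 1.0, y, z)

print(f"RMSD: {rmsd(reference, reconstruction):.2f} A")
print("C-alpha spacing violations at bonds:", ca_ca_violations(reconstruction))
```

The global RMSD is only about 0.1 Å. However, the two bonds around the shifted atom become roughly 4.8 Å and 2.8 Å long. A 4.8 Å spacing cannot occur between consecutive residues. An aggregate metric therefore averages away exactly the kind of local defect that auxiliary losses are meant to penalize.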

References

Part of the original motivation behind diffusion autoencoders was to obviate the need for perceptual and adversarial losses. Whether this is true is still a little unclear; (Sargent et al., 2025) uses perceptual losses during the training process but (Chen et al., 2025) does not.

Adaptive Protein Tokenization (2602.06418, Dilip et al., 6 Feb 2026), Appendix J (Classifier annealing)