One-step Latent-free Image Generation
- One-step latent-free image generation is a process that directly transforms noise into realistic images in a single network pass, bypassing iterative methods or latent space decompositions.
- Recent techniques leverage GAN-based distillation, flow matching with explicit velocity fields, and soft embeddings to achieve state-of-the-art fidelity and accelerated sampling.
- Recent advances demonstrate lower FID scores while surfacing open challenges such as fine-detail rendering and high-resolution synthesis, opening avenues for efficient, multi-scale, and multimodal applications.
One-step latent-free image generation denotes the family of generative modeling techniques capable of mapping pure noise directly to high-fidelity images in a single network forward pass, without reliance on auxiliary latent variables, iterative denoising, or stepwise ODE/SDE integration. This paradigm breaks the traditional tradeoff between sample fidelity and sampling speed in diffusion, flow, and masked diffusion models by enabling “Gaussian-to-pixel” mappings with the efficiency, parameter sharing, and flexibility needed for contemporary large-scale applications. Recent research has produced a range of architectures and training regimes that realize true one-step, latent-free image generation, leveraging innovations across GAN-based distillation, flow-matching with explicit velocity fields, score-matching loss constructions, distributional objectives, and differentiable discrete token representations.
1. Fundamental Challenges in One-Step, Latent-Free Image Generation
The standard generative modeling pipeline for high-fidelity image synthesis decomposes the mapping from white noise to realistic images into either multi-step transformations (diffusion or flow models) or high-capacity latent-space generative models (VAE/tokenizer-based). This decomposition mitigates the high-dimensional, multimodal nature of image distributions by diffusing noise over many steps or projecting images into compressed representations that are easier to model. One-step latent-free approaches seek to directly parameterize the highly nontrivial transformation from isotropic noise to the data manifold, circumventing sequential inference or pre-encoded latents.
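The contrast between the two regimes can be sketched in a toy 1-D case where everything is known in closed form. The snippet below (an illustrative example, not drawn from any cited paper) uses a Gaussian target so the probability-flow velocity and the exact one-step map are both analytic: a multi-step sampler integrates the ODE over many Euler steps, while a one-step generator collapses the whole trajectory into a single map.

```python
import numpy as np

# Toy 1-D illustration (hypothetical setup): target N(0, sigma^2), prior N(0, 1).
# Under the linear interpolation z_t = (1-t)*x + t*eps, the marginal
# probability-flow velocity has a closed form, so we can contrast multi-step
# ODE sampling with the one-step map a latent-free generator would learn.
sigma = 0.5

def velocity(z, t):
    """Closed-form marginal velocity E[eps - x | z_t = z] for this toy case."""
    a, b = 1.0 - t, t
    return (b - a * sigma**2) * z / (a**2 * sigma**2 + b**2)

def multi_step_sample(eps, n_steps=2000):
    """Euler-integrate the probability-flow ODE from noise (t=1) to data (t=0)."""
    z, dt = eps, 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        z = z - dt * velocity(z, t)   # step toward t = 0
    return z

def one_step_sample(eps):
    """One-step generator: for this linear case the exact map is eps -> sigma*eps."""
    return sigma * eps

eps = 1.0
print(multi_step_sample(eps), one_step_sample(eps))  # both ~ sigma
```

For real image distributions no such closed form exists, which is precisely why learning the one-step map directly is hard: the network must absorb the entire nonlinear trajectory that the iterative sampler traverses step by step.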
Attaining state-of-the-art fidelity in this setting is nontrivial due to (a) the path integration nature of traditional score-based and flow-based models, (b) the optimization mismatch between student and teacher in instance-level distillation, and (c) the lack of explicit lower-dimensional bottlenecks to regularize generation. Discrete token-based generators introduce additional challenges due to the non-differentiability of argmax or sampling operations and the loss of gradient flow necessary for post-distillation fine-tuning (Zhu et al., 26 Sep 2025).
2. Distributional Distillation and Exploiting Innate Diffusion Representations
“Diffusion Models Are Innate One-Step Generators” (Zheng et al., 2024) establishes that pre-trained diffusion U-Nets inherently possess the representational capacity necessary for single-step image generation. Instead of traditional instance-wise knowledge distillation, GAN Distillation at the Distributional level (GDD) reframes the training objective: a one-step student generator is trained adversarially via an exclusive distributional loss, matching the data distribution by fooling a discriminator without reference to per-sample teacher paths. This avoids the “local minimum mismatch” characteristic of instance-wise methods, permitting one-step students to achieve or exceed their multi-step teachers’ fidelity.
Layer-wise activation analysis reveals that most convolutional blocks specialize in specific temporal segments of the diffusion process. Consequently, one can freeze the majority of convolutional filters (∼85.8 %) and only fine-tune normalization layers, input/output layers, Q/K/V projections, and skip layers—a process referred to as “GDD-Innate” (GDD-I)—yielding even lower FID scores (e.g., CIFAR-10 FID 1.54). All experiments are performed in pixel space with no latent variables, confirming the “latent-free” nature.
Sampling is dramatically accelerated: image generation reduces to a single forward pass $x = G_\theta(z)$ with $z \sim \mathcal{N}(0, I)$. GDD and GDD-I consistently outperform all prior one-step and even many multi-step samplers across datasets such as CIFAR-10, FFHQ, AFHQv2, and ImageNet-64 (Zheng et al., 2024).
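The GDD-I recipe amounts to a name-based partition of the pre-trained network's parameters. A minimal sketch of that selection logic is shown below; the parameter names and patterns are illustrative placeholders, not taken from any specific codebase.

```python
# Hypothetical sketch of GDD-I-style selective fine-tuning: freeze most
# convolutional weights of a pre-trained diffusion U-Net and leave only
# normalization, input/output, Q/K/V projection, and skip layers trainable.
TRAINABLE_PATTERNS = ("norm", "conv_in", "conv_out", "to_q", "to_k", "to_v", "skip")

def select_trainable(param_names):
    """Return the subset of parameter names left trainable under GDD-I."""
    return [n for n in param_names
            if any(p in n for p in TRAINABLE_PATTERNS)]

# Illustrative parameter names for a small U-Net.
params = [
    "conv_in.weight", "down.0.conv1.weight", "down.0.norm1.weight",
    "mid.attn.to_q.weight", "mid.attn.to_k.weight", "mid.attn.to_v.weight",
    "up.1.conv2.weight", "up.1.skip_conv.weight", "conv_out.weight",
]
trainable = select_trainable(params)
frozen = [n for n in params if n not in trainable]
print(trainable)  # norm / in-out / QKV / skip layers stay trainable
print(frozen)     # interior convolutional filters stay frozen
```

In a real framework the same partition would be applied by toggling gradient flags per parameter group; the essential point is that the bulk of the convolutional capacity is reused as-is.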
3. MeanFlow and ODE-Driven Pixel-Space One-Step Sampling
“Pixel MeanFlow” (pMF) (Lu et al., 29 Jan 2026) advances latent-free one-step generation by formulating the pixel-space generator as a mapping from noise to the image manifold, with the loss enforced in velocity space to preserve the correct generative flow dynamics. Separating the network output space (the denoised image $\hat{x}$) from the loss space (the mean velocity $u$) is fundamental: the network predicts $\hat{x}_\theta(z_t, r, t)$, interpreted as a denoised image, while the training loss compares the derived mean velocity field to the true instantaneous velocity induced by the ODE.
Key relationships:
- $z_t = (1 - t)x + t\epsilon$, $v_t = \epsilon - x$, $z_r = z_t - (t - r)\,u(z_t, r, t)$.
- $u_\theta(z_t, r, t) = \dfrac{z_t - \hat{x}_\theta(z_t, r, t)}{t - r}$ (network output to velocity conversion).
- $u_{\mathrm{tgt}} = v_t - (t - r)\,\mathrm{sg}\!\left(v_t\,\partial_z u_\theta + \partial_t u_\theta\right)$ (full “MeanFlow” velocity; “sg” denotes stop-gradient).
The pMF models, implemented as high-capacity DiT-style transformers with patch-wise tokenization and auxiliary image heads, surpass prior work at scale: on ImageNet 256×256 the best one-step model reaches FID 2.22, while at 512×512, FID 2.48 is achieved—without auxiliary latent spaces or iterative solvers. Notably, only a single forward pass from pure noise $z_1 = \epsilon \sim \mathcal{N}(0, I)$ is required at inference (Lu et al., 29 Jan 2026).
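The output/loss-space split has a simple mechanical consequence: converting the predicted denoised image to a mean velocity and then taking one Euler-style step lands exactly back on that prediction, so a single call with $(t, r) = (1, 0)$ maps noise straight to a sample. The toy 1-D sketch below (an assumed linear-Gaussian setup with an oracle denoiser, not the paper's implementation) checks this identity numerically.

```python
import numpy as np

# Toy 1-D Gaussian case: target N(0, sigma^2), prior N(0, 1), flow
# z_t = (1-t)*x + t*eps. The "network" here is an oracle denoiser.
sigma = 0.5

def s(t):
    """Marginal standard deviation along the flow at time t."""
    return np.sqrt((1.0 - t)**2 * sigma**2 + t**2)

def x_hat_oracle(z_t, r, t):
    """Ideal denoised prediction for this linear case: rescale to the time-r marginal."""
    return s(r) / s(t) * z_t

def mean_velocity(z_t, r, t):
    """Network-output-to-velocity conversion: u = (z_t - x_hat) / (t - r)."""
    return (z_t - x_hat_oracle(z_t, r, t)) / (t - r)

eps = 1.3                       # draw from the prior (t = 1)
u = mean_velocity(eps, 0.0, 1.0)
x = eps - (1.0 - 0.0) * u       # one-step sample: z_0 = z_1 - (1 - 0) * u
print(x, sigma * eps)           # one-step output matches the exact map
```

The loss is still imposed on $u$, not on $\hat{x}$ directly, which is what keeps the trained map consistent with the underlying ODE dynamics rather than merely regressing images.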
4. Alternative One-Step Distillation and Consistency-Based Methods
Score Implicit Matching (SIM) (Luo et al., 2024) presents a data-free, theoretically grounded approach to one-step distillation rooted in score-based divergences and the Score-Gradient Theorem. SIM alternately trains a generator and a student score model to match the marginal scores of a pre-trained teacher model, achieving robust single-step fidelity (e.g., CIFAR-10 FID 2.06). This loss construction collapses the trajectory-inversion challenge into a tractable, expectation-based gradient, bypassing the intractable Jacobians of the student’s implicit distribution.
Self-Cooperative Diffusion GANs (YOSO) (Luo et al., 2024) introduce a self-cooperative adversarial approach, where the generator is regularized by denoising and consistency losses in addition to GAN objectives. For text-to-image, YOSO incorporates latent perceptual loss, latent discriminators, and informative prior initialization, yielding both unconditional and text-conditional models with high-fidelity, rapid one-step synthesis.
Self-Corrected Flow Distillation (Dao et al., 2024) further integrates adversarial and consistency losses within the flow-matching framework, adding truncated consistency, reflow, and bidirectional losses. These components ensure that one-step outputs are both sharp and consistent with the teacher, with FID scores beating prior flow-distilled baselines on CelebA-HQ and COCO benchmarks.
5. Discrete Token-Based One-Step Models and Soft Embeddings
For discrete image synthesis, Soft-Di[M]O (Zhu et al., 26 Sep 2025) addresses the inability of prior one-step masked diffusion distillations to support adversarial or reward-based refinement due to non-differentiable token samplers. By replacing hard token samples with “soft embeddings”—continuous, differentiable mixtures over embedding matrices—Soft-Di[M]O enables end-to-end gradient flow through (a) backbone distillation with distributional losses, (b) adversarial GAN training, (c) differentiable reward tuning (e.g., for CLIP/ImageReward), and (d) test-time embedding optimization (TTEO).
Soft embeddings show high representation fidelity, stable gradients versus Gumbel-Softmax or REINFORCE, and robust plug-and-play compatibility with teacher backbones/tokenizers. Empirically, GAN-refined Soft-Di[M]O sets new state-of-the-art one-step FID (1.56) on ImageNet-256, and reward-tuned models consistently surpass multi-step masked diffusion teachers in prompt adherence and aesthetic metrics. TTEO further enhances text-to-image prompt alignment without training-time overhead (Zhu et al., 26 Sep 2025).
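The core trick is easy to state concretely: rather than embedding the argmax token (which blocks gradients), the generator's token probabilities mix the rows of the embedding matrix, giving an embedding that is a smooth function of the logits. The sketch below uses illustrative shapes and values, not the paper's code.

```python
import numpy as np

def softmax(x, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    e = np.exp((x - x.max()) / tau)
    return e / e.sum()

# Illustrative vocabulary of 8 tokens with 4-dim embeddings.
rng = np.random.default_rng(0)
V, d = 8, 4
E = rng.normal(size=(V, d))          # tokenizer embedding matrix
logits = np.array([0.2, -1.0, 1.5, 0.3, -0.5, 0.0, 0.7, -2.0])

hard = E[np.argmax(logits)]          # non-differentiable path (argmax)
soft = softmax(logits) @ E           # differentiable soft embedding
sharp = softmax(logits, tau=0.05) @ E  # low temperature -> approaches hard

print(np.abs(sharp - hard).max())    # small: the soft embedding sharpens to hard
```

Because `soft` is a smooth function of `logits`, adversarial, reward-based, and test-time objectives can all backpropagate through it, which is exactly the capability that hard token sampling denies.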
6. Comparative Performance and Practical Considerations
A comparative summary of prominent latent-free one-step methods is provided below.
| Method | FID (CIFAR-10, 1-step) | FID (ImageNet-256, 1-step) | Latent-Free | GAN-Based | Train Cost / Scale |
|---|---|---|---|---|---|
| GDD-I (Zheng et al., 2024) | 1.54 | 1.16 (64×64) | Yes | Yes | 6 hr / 8×A100 |
| pMF-H/16 (Lu et al., 29 Jan 2026) | – | 2.22 | Yes | No | 360 epochs |
| SIM (Luo et al., 2024) | 2.06 | – | Yes | No | 100k iters |
| YOSO (Luo et al., 2024) | 3.82 | – | Yes | Yes | 61M param |
| SCFlow (Dao et al., 2024) | 8.06 (CelebA-HQ) | 22.09 (COCO-1step) | No (latent) | Yes | – |
| Soft-Di[M]O (Zhu et al., 26 Sep 2025) | – | 1.56 (w/GAN) | Yes (discrete) | Yes | – |
GDD-I and pMF lead in continuous pixel space; Soft-Di[M]O dominates among discrete image/token models, primarily due to its gradient-friendly soft embedding design. Methods leveraging adversarial losses (GDD, YOSO, SCFlow, Soft-Di[M]O) consistently report sharper outputs and lower FID in the one-step regime.
Notable trade-offs include model size (pMF relies on large transformer backbones), training sample efficiency (e.g., GDD-I achieves SOTA with only 5M images), latent-space vs. pixel-space operation (SCFlow’s VAE bottleneck may limit ultimate fidelity), and flexibility for reward-based or post-hoc fine-tuning (unique to Soft-Di[M]O due to soft embeddings).
7. Limitations, Open Challenges, and Prospective Directions
The remaining limitations of current one-step, latent-free generators include:
- Large backbone requirements (pMF, Soft-Di[M]O) that raise computational and parameter costs, potentially limiting mobile or edge deployment (Lu et al., 29 Jan 2026).
- Occasional failures in fine detail rendering, especially for hands, faces, and high-resolution textures (noted for SIM and YOSO) (Luo et al., 2024, Luo et al., 2024).
- Challenges in scaling to ultra-high resolutions, video, and other modalities where the low-dimensional manifold assumption may not fully hold (Lu et al., 29 Jan 2026, Dao et al., 2024).
- For methods operating in discrete or latent spaces, ultimate fidelity is sometimes bounded by VAE or tokenizer capacity, necessitating further improvements or hybridization with pixel-space approaches (Dao et al., 2024, Zhu et al., 26 Sep 2025).
Potential research avenues include model size distillation for lightweight deployment, hybrid architectures combining one-step mapping with 2–3 step correctors, application to non-image modalities, and refined tokenization for discrete generative methods.
In conclusion, one-step latent-free image generation constitutes an emerging set of methodologies that have closed the performance gap with multi-step diffusion, matching or surpassing prior art in both continuous and discrete domains, while dramatically reducing sampling cost and unlocking new flexibility in train-time and inference refinements (Zheng et al., 2024, Lu et al., 29 Jan 2026, Luo et al., 2024, Luo et al., 2024, Dao et al., 2024, Zhu et al., 26 Sep 2025).