
Coarse-to-Fine Latent Diffusion (CFLD)

Updated 1 January 2026
  • The topic introduces CFLD as a generative paradigm that refines coarse semantic latents before fine-grained detail latents.
  • It employs hierarchical latent factorization, asynchronous denoising, and cascaded conditioning to boost sample quality and convergence speed.
  • Empirical evaluations indicate CFLD achieves state-of-the-art results in image, audio, 3D, and graph generation tasks.

Coarse-to-Fine Latent Diffusion (CFLD) denotes a broad class of generative modeling paradigms that exploit the hierarchical or multi-scale structure of data in the latent space, enabling structured, principled generation from semantic or global structure down to local or high-frequency detail. CFLD combines hierarchical latent representations (often decoupling semantic and textural axes, or coarse and fine resolutions) with tailored diffusion processes, typically applying asynchronous or stagewise denoising and explicit hierarchical conditioning. Theoretical analysis and empirical studies indicate that this multi-stage approach improves sample quality, convergence speed, and controllability across a wide array of domains, including image, audio, trajectory, 3D, and graph generation.

1. Fundamental Principles of Coarse-to-Fine Latent Diffusion

CFLD leverages the observation that many classes of data (natural images, 3D scenes, trajectories, proteins) are characterized by information concentrated at multiple scales or factors—frequently a separation between high-level semantics (e.g., shape, pose, global composition) and fine-grained detail (e.g., texture, local structure). Standard latent diffusion models often ignore this ordering, applying synchronous denoising or mixing structure and detail in a single process. CFLD formalizes and exploits a generative ordering in which:

  • Coarse (semantic, low-frequency) latents are formed and refined ahead of fine (textural, high-frequency) latents.
  • Hierarchy is imposed explicitly via multi-level latent codes, decoupled noise schedules, or cascades of latent/conditional denoisers.
  • Asynchronous or causal denoising schedules ensure that coarse/semantic latent variables provide a conditional anchor or guiding signal during the refinement of fine/detail latents.
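
As a minimal illustration of this ordering constraint, consider the following sketch, which maps a global sampling step to a pair of timesteps in which the semantic latent always leads. The flow-matching convention (t = 0 is noise, t = 1 is data) follows the notation of Section 3; the function name and the offset default are illustrative placeholders, not settings from any cited paper.

```python
def dual_timesteps(step: int, total_steps: int, delta: float = 0.2):
    """Map a global sampling step to a (t_s, t_z) pair with t_s >= t_z.

    Convention (flow matching): t = 0 is pure noise, t = 1 is clean data,
    so a larger t means a cleaner latent. `delta` is the semantic lead
    (the Δt of the text); its default here is an arbitrary placeholder.
    """
    t_z = step / total_steps        # texture progress runs from 0 to 1
    t_s = min(t_z + delta, 1.0)     # semantic latent stays `delta` ahead
    return t_s, t_z
```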

This paradigm is instantiated in diverse forms—semantic-first asynchronous denoising (Pan et al., 4 Dec 2025), cascaded latent transforms (Kutsuna, 25 Dec 2025, Woo et al., 1 Oct 2025), spectrum-preserving coarsening for graphs (Osman et al., 1 Dec 2025), progressive refinement for images and scenes (Zhong et al., 20 Nov 2025, Meng et al., 2024), and hybrid text/image/audio conditioning (Lu et al., 2024, Liu et al., 2024, Zhao et al., 30 Apr 2025).

2. Architectural Realizations and Design Variants

CFLD is realized through a variety of architectural and algorithmic components, notably:

  • Two-Path or Multi-Path Latent Factorization: Separation into semantic and texture latents, as in Semantic-First Diffusion (SFD): concatenation of a compact semantic latent (from a semantic VAE) and a standard VAE texture latent, each treated with independent or offset noise schedules (Pan et al., 4 Dec 2025).
  • Hierarchical Latent Trees and Levelwise Tokenization: Construction of multi-level latent variables, e.g., for 3D scenes via latent-tree TUDFs (Meng et al., 2024) or for images/videos via scale-independent multi-level tokenization (Zhong et al., 20 Nov 2025), with coarse-to-fine attention and decoding order.
  • Stagewise or Cascaded Conditioning: Cascaded frameworks in graph (Osman et al., 1 Dec 2025), trajectory (Guo et al., 8 Jul 2025), protein structure (Han et al., 2024), and hand pose (Woo et al., 1 Oct 2025), where a coarse output (e.g., skeleton, segmentation, VQ-code, joint configuration) serves as condition or guide for fine-level diffusion.
  • Asynchronous Denoising and Dual Timestep Scheduling: Mechanistically, asynchronous denoising is implemented using dual time indices or offsets (e.g., Δt in (Pan et al., 4 Dec 2025)), with deterministic or dynamically adapted scheduling ensuring semantic latents are always denoised to a cleaner state than textural counterparts.
  • Coarse-Fine Cross-Attention and Bias Injection: Biasing of cross-attention queries/keys with multi-granular features (e.g., hybrid-granularity attention for pose synthesis (Lu et al., 2024), textually co-guided skeleton feature diffusion (Zhao et al., 30 Apr 2025)), or injection of coarsened features in U-Net and Transformer layers at appropriate stages.

Collectively, these designs enforce a strict or soft causality between hierarchical or disentangled latent factors, anchoring high-frequency generation on robust, semantically meaningful structure.
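
As one concrete rendering of the cascaded-conditioning pattern described above, the PyTorch sketch below modulates a fine-stage denoiser block with a coarse latent via FiLM-style normalization; the class name, shapes, and layer choices are illustrative assumptions rather than an architecture taken from the cited works.

```python
import torch
import torch.nn as nn

class FiLMConditionedBlock(nn.Module):
    """Fine-stage residual block modulated by a coarse latent (FiLM).

    Hypothetical sketch: the coarse latent is projected to per-channel
    scale and shift parameters, so each fine denoising step is anchored
    on the current coarse structure.
    """

    def __init__(self, channels: int, coarse_dim: int):
        super().__init__()
        self.norm = nn.GroupNorm(8, channels)  # channels must be divisible by 8
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.film = nn.Linear(coarse_dim, 2 * channels)  # -> (scale, shift)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine: (B, C, H, W) texture features; coarse: (B, coarse_dim) code.
        scale, shift = self.film(coarse).chunk(2, dim=-1)
        h = self.norm(fine)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return fine + self.conv(torch.relu(h))
```

A block of this kind could sit at every resolution of the fine branch, so that each update of the texture latent sees the current state of the coarse latent.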

3. Mathematical Formalism and Training Procedures

The mathematical backbone of CFLD blends standard DDPM/score-based latent diffusion with factorized, hierarchically coupled latent variables and modified objective formulations:

  • Independent or Conditional Diffusion Processes: For a two-part latent $[s_1, z_1]$, separate diffusion schedules are parametrized as $t_s, t_z$ with $t_s \geq t_z$ (semantic leads texture) (Pan et al., 4 Dec 2025). Multi-level latents undergo level-wise, possibly independently parametrized diffusion (Zhong et al., 20 Nov 2025, Meng et al., 2024).
  • Asynchronous Flow-Matching and Denoising Losses: SFD-style velocity prediction models the conditional flow $v_\theta([s_{t_s}, z_{t_z}], [t_s, t_z], y)$, with training loss

$$\mathcal{L}_{\mathrm{vel}} = \mathbb{E}\|z_v - (z_1 - z_0)\|^2 + \beta \cdot \mathbb{E}\|s_v - (s_1 - s_0)\|^2,$$

often augmented with representation-alignment terms (e.g., REPA loss (Pan et al., 4 Dec 2025)); a code sketch of this objective follows the list below.

  • ELBO-Coupled Residual Modeling: Two-stage frameworks (e.g., Residual Prior Diffusion) rigorously couple the coarse prior (VAE) and the residual diffusion via a joint ELBO, reducing training to standard noise or velocity-prediction objectives with auxiliary variables that directly accelerate convergence (Kutsuna, 25 Dec 2025).
  • Cascaded/Hierarchical Conditioning: At each scale, the fine-level diffusion model is explicitly conditioned on the state of the coarser level, often via cross-attention or FiLM-based normalization (Zhong et al., 20 Nov 2025, Woo et al., 1 Oct 2025, Meng et al., 2024).
  • Stagewise Losses and Reconstruction Terms: Each stage includes both generative (MSE, cross-entropy, KL) and alignment/contrastive terms (for semantic or textual consistency), with empirical ablations consistently demonstrating the importance of each for controlling overfitting, hallucination, and detail fidelity (Pan et al., 4 Dec 2025, Lu et al., 2024, Zhao et al., 30 Apr 2025).
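
A minimal PyTorch rendering of the asynchronous velocity loss from the flow-matching bullet above might look as follows; the model signature, latent shapes, and the β and Δt defaults are assumptions for illustration, not values from (Pan et al., 4 Dec 2025).

```python
import torch
import torch.nn.functional as F

def asynchronous_velocity_loss(model, s1, z1, y, beta=0.5, delta=0.2):
    """Sketch of the asynchronous flow-matching objective L_vel above.

    s1: (B, D) clean semantic latents; z1: (B, C, H, W) clean texture
    latents; y: conditioning. Signature and defaults are illustrative.
    """
    b = s1.shape[0]
    # Sample the texture timestep, then give the semantic latent a lead.
    t_z = torch.rand(b, device=z1.device)
    t_s = (t_z + delta).clamp(max=1.0)

    # Linear interpolation between noise (t = 0) and data (t = 1).
    s0, z0 = torch.randn_like(s1), torch.randn_like(z1)
    s_t = (1 - t_s.view(-1, 1)) * s0 + t_s.view(-1, 1) * s1
    z_t = (1 - t_z.view(-1, 1, 1, 1)) * z0 + t_z.view(-1, 1, 1, 1) * z1

    # Model predicts velocities for both latents given both timesteps.
    s_v, z_v = model(s_t, z_t, t_s, t_z, y)

    # L_vel = E||z_v - (z1 - z0)||^2 + beta * E||s_v - (s1 - s0)||^2
    return F.mse_loss(z_v, z1 - z0) + beta * F.mse_loss(s_v, s1 - s0)
```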

Pseudocode and algorithmic details provided in the primary references formalize both the training and inference pipelines, e.g., the three-phase denoising loop in SFD (Pan et al., 4 Dec 2025), cascaded mesh/joint reconstruction (Woo et al., 1 Oct 2025), and hierarchical coarse-to-fine U-Net sampling (Zhong et al., 20 Nov 2025, Meng et al., 2024).
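
One plausible reading of such a coarse-to-fine inference loop, written as plain Euler integration of learned velocities, is sketched below. The three phases fall out of the timestep offset rather than being hand-coded; names and defaults are illustrative and not taken verbatim from the cited pseudocode.

```python
import torch

@torch.no_grad()
def sample_cfld(model, s, z, y, steps_per_unit=50, delta=0.2):
    """Hypothetical three-phase coarse-to-fine sampler (Euler integration).

    `s` and `z` start as Gaussian noise. Because the semantic timestep
    leads by `delta`, the loop passes through a semantic-only phase, a
    joint phase, and a texture-only phase.
    """
    dt = 1.0 / steps_per_unit
    total = int((1.0 + delta) * steps_per_unit)  # cover t in [0, 1 + delta]
    for i in range(total):
        u = i * dt
        t_s = min(u, 1.0)            # semantic leads, then saturates at 1
        t_z = max(u - delta, 0.0)    # texture waits, then follows behind
        s_v, z_v = model(s, z, t_s, t_z, y)
        if t_s < 1.0:
            s = s + dt * s_v         # phases 1-2: refine the semantics
        if u >= delta:
            z = z + dt * z_v         # phases 2-3: refine the texture
    return s, z
```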

4. Empirical Evaluation and Quantitative Evidence

CFLD yields robust performance gains in both sample quality and efficiency, supported by extensive evaluations:

  • Accelerated Convergence and SOTA FID: SFD achieves up to 100x faster convergence than the original DiT and a superior FID of 1.04 on ImageNet with classifier-free guidance or AutoGuidance (Pan et al., 4 Dec 2025). Asynchronous denoising yields a substantial FID improvement over synchronous baselines (e.g., 5.24 → 3.03).
  • Improved Fidelity in Conditional and Structured Tasks: Pose-guided person image synthesis achieves lower FID (6.804) and improved LPIPS/SSIM/PSNR compared to all prior baselines, with qualitatively superior reconstruction in extreme poses and garment detail (Lu et al., 2024). Skeleton action feature diffusion boosts downstream classifier generalization (Zhao et al., 30 Apr 2025).
  • State-of-the-Art for High-Dimensional Structured Outputs: Hierarchical 3D scene generation delivers lower FID and higher coverage and diversity (COV up 20%, FID down from 59 to 13) (Meng et al., 2024); protein backmapping reaches lower RMSD and maintains chemical validity compared to both coordinate-space and VQ-based backmapping (Han et al., 2024).
  • Combinatorial Structure and Cross-scale Metric Balance: Graph CFLD (LGDC) matches or outperforms autoregressive and one-shot alternatives, preserving both global spectral and local motif statistics with lower complexity (e.g., A.Ratio, V.U.N. scores (Osman et al., 1 Dec 2025)).
  • Heterogeneous Modalities and Robust Generalization: CFLD architectures span modalities—audio reconstruction from brain signals (Liu et al., 2024), trajectory synthesis with privacy/utility trade-off (Guo et al., 8 Jul 2025), and image generation with adaptively decoupled complexity from scale (Zhong et al., 20 Nov 2025). Ablation studies consistently show degradation when the hierarchy or scheduling is omitted.

5. Connections to Existing Diffusion Paradigms and Theoretical Considerations

CFLD advances beyond prior diffusion frameworks in several aspects:

  • Contrast with Classic Latent Diffusion (LDM): While LDMs operate purely in latent space, CFLD imposes explicit hierarchy or factorization (by semantics, resolution, or structure), with dependency-structured noise schedules and causal attention or conditioning (Kutsuna, 25 Dec 2025, Pan et al., 4 Dec 2025).
  • Relationship to Residual, Super-resolution, and Multi-Stage Diffusion: CFLD subsumes earlier notions of residual super-resolution (e.g., ResShift, Resfusion) but without requiring paired input or test-time conditioning; the coarse prior is intrinsic, and fine detail generation is fully or partially decoupled (Kutsuna, 25 Dec 2025); one possible formalization is sketched after this list.
  • Theoretical Acceleration and Optimization Guarantees: Introduction of auxiliary variables aligned to the coarse prior demonstrably reduces prediction error (as coarse prior quality improves) and enables rapid convergence, especially under limited sampling steps or complex data geometry (Kutsuna, 25 Dec 2025).
  • Potential for Further Hierarchical Generalization: Several CFLD instantiations (e.g., DCS-LDM (Zhong et al., 20 Nov 2025), LT3SD (Meng et al., 2024)) permit arbitrary extension to multi-level, scale-independent, or even adaptive-depth generative chains, supporting trade-offs between computational budget and output detail.
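
To make the residual-prior contrast concrete, one possible factorization (in our notation, not necessarily that of (Kutsuna, 25 Dec 2025)) is:

```latex
% Illustrative residual-prior factorization (notation is ours): a coarse
% prior p_phi proposes x_c, and the diffusion model p_theta only has to
% generate the residual r = x - x_c, conditioned on the coarse proposal.
p(x \mid y) = \int p_\phi(x_c \mid y)\, p_\theta(x - x_c \mid x_c, y)\, \mathrm{d}x_c
```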

6. Applications and Generalization Across Domains

CFLD methodologies underpin advances across diverse tasks:

  • Image and Video Synthesis: Hierarchical/semantic-first latent diffusion sets new benchmarks in conditional and unconditional image generation, video frame modeling, and high-efficiency large-scale synthesis (Pan et al., 4 Dec 2025, Zhong et al., 20 Nov 2025).
  • Structured Object and Scene Generation: CFLD enables high-fidelity 3D scene, protein structure, and articulated object synthesis, outperforming baselines on domain-specific statistical and geometric metrics (Meng et al., 2024, Han et al., 2024, Woo et al., 1 Oct 2025).
  • Graph and Trajectory Modeling: CFLDs for combinatorial structures integrate spectrum-preserving coarsening with local motif expansion (Osman et al., 1 Dec 2025), while in spatiotemporal domains they reconcile efficiency, data validity, and privacy via hierarchical latent diffusion (Guo et al., 8 Jul 2025).
  • Audio and Multimodal Decoding: Multi-stage CFLD outperforms direct or single-stage latent decoding for reconstructing high-dimensional audio from brain signals and for extracting robust semantic priors in cross-modal tasks (Liu et al., 2024).

The generality of CFLD is further demonstrated by its adaptability: any domain supporting a semantically meaningful hierarchy or factorization admits CFLD instantiation, potentially with dynamic scheduling, alternate semantic compressors (PCA, CLIP), or more adaptive driving signals (Pan et al., 4 Dec 2025).
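
As a toy example of such an alternate semantic compressor, the NumPy sketch below builds a PCA encode/decode pair over flattened latents; it is a stand-in for a learned semantic VAE, meant only to illustrate how pluggable the coarse code is, and nothing here is taken from a cited implementation.

```python
import numpy as np

def pca_semantic_compressor(latents: np.ndarray, k: int = 32):
    """Build a toy PCA encode/decode pair over flattened latents.

    `latents` is an (N, D) matrix, e.g. flattened VAE latents; the top-k
    principal components act as the compact coarse/semantic code.
    """
    mean = latents.mean(axis=0)
    # SVD of the centered data yields the principal directions in vt.
    _, _, vt = np.linalg.svd(latents - mean, full_matrices=False)
    components = vt[:k]                          # (k, D) basis

    def encode(x: np.ndarray) -> np.ndarray:     # D -> k coarse code
        return (x - mean) @ components.T

    def decode(s: np.ndarray) -> np.ndarray:     # k -> D approximation
        return s @ components + mean

    return encode, decode
```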

7. Limitations, Open Questions, and Future Directions

Despite empirical success, limitations persist:

  • Dependency on Coarse Prior Quality: Both convergence speed and final fidelity hinge on the expressive power of the semantic or coarse prior model—weak priors cap achievable detail or diversity, especially under small-step inference (Kutsuna, 25 Dec 2025).
  • Architectural Overhead and Complexity: Multi-stage CFLDs incur overhead from integrating and synchronizing multiple encoders/decoders, latent schedules, and auxiliary losses, and can be sensitive to hyperparameters such as temporal offset (Δt) or weighting terms (Pan et al., 4 Dec 2025).
  • Potential for Further Hierarchical Refinement: Exploring deeper or dynamically adaptive hierarchies, embedding CFLD within LDM pipelines, or jointly learning more interpretable/disentangled latent factors remains open.
  • Broader Modalities and Structure-aware Control: Extending the causal and asynchronous principles of CFLD to language, physics, or agent-based data may yield novel avenues for hierarchical generative control.

Advances in this family of models continuously blur the boundary between generic diffusion frameworks and task/domain-tailored hierarchies, with CFLD remaining a key theoretical and empirical driver in the evolution of data-driven generative modeling across the sciences and engineering (Pan et al., 4 Dec 2025, Kutsuna, 25 Dec 2025, Zhong et al., 20 Nov 2025, Lu et al., 2024, Woo et al., 1 Oct 2025, Liu et al., 2024, Zhao et al., 30 Apr 2025, Osman et al., 1 Dec 2025, Meng et al., 2024, Han et al., 2024, Lee et al., 2022, Guo et al., 8 Jul 2025).
