Two-Stage Geometric/Generative Architectures

Updated 23 March 2026

Two-stage geometric/generative architectures are a paradigm that decouples global structure estimation from fine-detail synthesis.
They utilize a deterministic geometry stage for stable scaffold extraction followed by a conditional generative stage to refine high-frequency details.
This modular approach enables precise control, improved interpretability, and performance gains across complex vision and 3D modeling tasks.

A two-stage geometric/generative architecture decomposes a generation or prediction task into an explicit geometric or structural stage followed by a generative (typically stochastic or high-dimensional) refinement or synthesis stage. The paradigm separates the estimation or synthesis of global structure ("geometry") from the generation of fine-scale details or the final output ("generation"), combining the strengths of model-based and learning-based techniques. It is now foundational across computer vision, graphics, 3D shape modeling, layout synthesis, and scientific domains.

1. Rationale and Conceptual Foundations

The motivation for two-stage geometric/generative architectures arises from the intrinsic ill-posedness and multi-scale character of many inverse or synthesis problems. In domains such as monocular depth estimation, scene layout generation, or 3D object synthesis, there exist severe ambiguities: multiple 3D or compositional explanations may yield identical observed signals. The traditional paradigms fall into either direct discriminative regression (stable, but lacking physical priors and sample efficiency) or generative modeling (capturing data priors but often producing stochastic, unstable, or semantically inconsistent outputs for geometry-centric tasks).

A two-stage decomposition leverages the observation that:

Coarse, global geometric structure is often more stable, lower-dimensional, and admits deterministic inference or planning.
Fine, high-dimensional details (texture, high-frequency geometry, or stochastic pattern) can be robustly synthesized conditionally on the geometric scaffold, using powerful generative models.

Recent works such as Lotus-2 for geometric dense prediction (He et al., 30 Nov 2025), UltraShape 1.0 for 3D shape generation (Jia et al., 24 Dec 2025), and dual-branch video-text models for scene understanding (Wu et al., 19 Mar 2026) exemplify this philosophy. The approach is rooted in classical multi-level design patterns but is now formalized and optimized within deep generative model frameworks.

2. Architectural Patterns and Methodological Variants

A representative two-stage architecture has the following high-level decomposition:

Geometry Stage: Extracts or predicts explicit global structure. Forms include:
- Coarse depth/normal field prediction (e.g., via deterministic regression or rectified-flow prediction (He et al., 30 Nov 2025))
- Scene layout graph or box set from language or object list (LLM-based plan (Koch et al., 10 Nov 2025, Huang et al., 2024))
- Latent vector or codebook representation of 3D object structure (VQ-VAE (Rasoulzadeh et al., 2024), vector-set DiT (Jia et al., 24 Dec 2025))
- Symmetry group action sequences (e.g., the wreath process for drawings (Borsa et al., 2015))
- Physics-informed skeletons or spectral shape tokens (e.g., GHD for aneurysm meshes (Ding et al., 15 May 2025))
- Tokenized feature representations from a generative world simulator (e.g., video diffusion backbones in VEGA-3D (Wu et al., 19 Mar 2026))
Generative/Refinement Stage: Enhances, samples, or completes the output, conditioned on the structure from Stage 1. Typical modes include:
- Constrained flow or diffusion refinement (single-step to multi-step deterministic flows (He et al., 30 Nov 2025), hierarchical DDPM upsampling (Rasoulzadeh et al., 2024), voxel refinement (Jia et al., 24 Dec 2025))
- Layout-conditioned diffusion or GAN-based image generation (GLIGEN/ControlNet (Koch et al., 10 Nov 2025, Huang et al., 2024))
- Graph/message-passing-based boundary realization from centroid-determined topology (GFLAN (Abouagour et al., 18 Dec 2025))
- GAN or autoencoder-based mesh or flow field generation conditioned on geometric features (Xie et al., 2022, Khan et al., 2024)
- Adaptive feature fusion for multimodal reasoning (token-level gated fusion (Wu et al., 19 Mar 2026))
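The decomposition above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not any cited system: the names `Scaffold`, `geometry_stage`, `generative_stage`, and `two_stage` are hypothetical, block averaging stands in for a deterministic geometry predictor, and seeded Gaussian noise stands in for a conditional generative model.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Scaffold:
    """Coarse structure handed from stage 1 to stage 2 (hypothetical container)."""
    coarse: List[float]
    factor: int  # number of fine samples the refiner emits per coarse value


def geometry_stage(signal: List[float], factor: int = 4) -> Scaffold:
    """Deterministic stage: block-average the input into a coarse scaffold."""
    blocks = [signal[i:i + factor] for i in range(0, len(signal), factor)]
    return Scaffold(coarse=[sum(b) / len(b) for b in blocks], factor=factor)


def generative_stage(scaffold: Scaffold, seed: int,
                     detail_scale: float = 0.05) -> List[float]:
    """Stochastic stage: add zero-mean detail around each coarse value."""
    rng = random.Random(seed)
    out: List[float] = []
    for value in scaffold.coarse:
        noise = [rng.gauss(0.0, detail_scale) for _ in range(scaffold.factor)]
        mean = sum(noise) / len(noise)
        # Remove the block mean so the detail cannot alter the coarse structure.
        out.extend(value + n - mean for n in noise)
    return out


def two_stage(signal: List[float], seed: int) -> Tuple[Scaffold, List[float]]:
    """Run stage 1 once, then sample stage 2 conditioned on its output."""
    scaffold = geometry_stage(signal)
    return scaffold, generative_stage(scaffold, seed)
```

In this sketch, different seeds give different fine-scale outputs while the scaffold stays identical, and re-averaging any sample recovers the scaffold up to floating-point error. That is the control property the two-stage split is meant to provide.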

A schematic table summarizes several instantiations:

Domain/Problem	Geometry Stage	Generative/Refinement Stage
Monocular depth/normal estimation	Core predictor, deterministic flow	Multi-step flow refiner, deterministic sharpener (He et al., 30 Nov 2025)
3D shape generation	VQ encoder / vector-set latent	Hierarchical diffusion (voxel or chunk-based) (Jia et al., 24 Dec 2025, Rasoulzadeh et al., 2024)
Layout/image synthesis	LLM/graph-based layout planning	Layout-conditioned diffusion or GAN image synthesis (Koch et al., 10 Nov 2025)
Scene understanding	World simulator via video diffusion	Token-level fusion in MLLM (Wu et al., 19 Mar 2026)
Fluid/smoke illustration	LCS skeleton prediction (U-Net)	GAN velocity field synthesis conditioned on LCS (Xie et al., 2022)

3. Mathematical Formulation and Loss Design

Two-stage geometric/generative models are typically composed of explicitly structured objective terms and architectures for each stage:

Geometry Stage Loss: Often a deterministic regression, MSE or cross-entropy over latent representations, coordinates, or structured plan variables; may include physics-informed regularization (e.g., local continuity (He et al., 30 Nov 2025), morphing-energy alignment (Ding et al., 15 May 2025), geometric operator augmentation (Khan et al., 2024)).
Generative Stage Loss: Frequently a standard generative modeling loss conditional on the structural output, e.g., DDPM loss, adversarial (GAN) loss, reconstruction losses, or diversity/quality-promoting DPP terms (Khan et al., 2024).

Example for deterministic+refinement in Lotus-2 (He et al., 30 Nov 2025): $\begin{aligned} &\mathcal{L}_{\rm core} = \bigl\lVert \hat{\mathbf z}^y - \mathbf z^y \bigr\rVert^2 \\ &\mathcal{L}_{\rm sharp} = \bigl\lVert g_\psi(\mathbf z_t,t) - (\mathbf z^{y_c} - \mathbf z^{y_f}) \bigr\rVert^2 \end{aligned}$

Physics-aware enrichment (e.g., in PaDGAN-GO (Khan et al., 2024)): $\mathrm{GO}(\mathcal G) = [P(\mathcal G),\,\mathcal M(\mathcal G),\,\mathcal K(\mathcal G),\,\mathcal F_T(\mathcal G)]$

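The DPP term is added to the adversarial loss with weight $\gamma_1$, using $q(x)$ defined as the $\ell_1$ norm of the operator vector: $\mathcal L_{\mathrm{GAN}} + \gamma_1 \mathcal L_{\mathrm{DPP}}(q(x)=\|\mathrm{GO}(x)\|_1)$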

A defining property is that gradients typically do not flow across the stage boundary during training. Instead, the first stage's output serves as a fixed or lightly updated condition for the second.
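Stage-wise training without gradients crossing the boundary can be illustrated with a toy scalar example. This sketch is an assumption-laden analogy rather than the training procedure of any cited method: `fit_scalar`, `mse`, the synthetic data, and the choice of a linear "geometry" model with a quadratic "detail" model are all hypothetical. The second model is fitted to the residual left by the first, which is loosely similar in spirit to a sharpener regressing a difference between coarse and fine targets.

```python
from typing import List


def fit_scalar(features: List[float], targets: List[float],
               lr: float = 0.1, steps: int = 200) -> float:
    """Gradient descent on mean squared error for the model w * feature."""
    w = 0.0
    n = len(features)
    for _ in range(steps):
        grad = sum(2.0 * (w * f - t) * f for f, t in zip(features, targets)) / n
        w -= lr * grad
    return w


def mse(preds: List[float], targets: List[float]) -> float:
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


# Toy data: a linear trend (global structure) plus curvature (detail).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [2.0 * x + 0.5 * x * x for x in xs]

# Stage 1: fit only the coarse linear trend, then freeze it.
w_geo = fit_scalar(xs, ys)
coarse = [w_geo * x for x in xs]

# Stage 2: the coarse output enters as plain data, so the gradient of the
# stage-2 loss never reaches w_geo. Only w_gen is updated here.
residual = [y - c for y, c in zip(ys, coarse)]
w_gen = fit_scalar([x * x for x in xs], residual)
final = [c + w_gen * x * x for c, x in zip(coarse, xs)]

print("stage-1 error:", mse(coarse, ys))
print("two-stage error:", mse(final, ys))
```

Because `w_geo` is never revisited, any error it makes is inherited by the second stage. This is the trade-off a fixed or lightly updated condition implies.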

4. Empirical Evidence and Comparative Strengths

Experimental results across various domains consistently demonstrate advantages for this two-stage decomposition:

Performance in Low-Data Regimes: Lotus-2 achieves state-of-the-art monocular depth estimation using only 59K training images (<1% of leading discriminative methods' datasets) (He et al., 30 Nov 2025). UltraShape 1.0 produces watertight, normal-consistent 3D shapes with better Chamfer distance and F-score than CLAY or LATTICE with only 120K meshes (Jia et al., 24 Dec 2025).
Fidelity and Diversity: Hierarchical upsampling and refinement (e.g., ArchComplete (Rasoulzadeh et al., 2024), UltraShape 1.0 (Jia et al., 24 Dec 2025)) consistently yield higher geometric detail and coverage than pure single-stage autoregressive or end-to-end models.
Enforceability of Constraints: Explicit geometric planning allows precise control over object count, spatial arrangement, or clinical shape metrics (e.g., LLM-based layout yields object recall 99.9% vs. 57.2% for direct methods (Koch et al., 10 Nov 2025); AneuG's morphological conditioning (Ding et al., 15 May 2025)).
Interpretability and Editability: Intermediate geometric representations enable robust user or expert edit loops, incremental design, and explainable planning (e.g., PlantoGraphy’s chain-of-thought LLMs and decoupled layout/image stages (Huang et al., 2024), BuildingBlock’s JSON-rule interface (Huang et al., 7 May 2025)).

Limitations are also manifest:

Error Propagation: The refinement or generative stage may be unable to recover from errors in the initial geometric stage, with limited recourse for global correction (He et al., 30 Nov 2025).
Non-End-to-End Training: Without joint gradient optimization, the weaker stage may bottleneck overall model performance (Liu et al., 4 Mar 2026).
Domain-Specific Expertise: Stage design often requires intricate physics or geometry knowledge (e.g., GHD, morphing energy, explicit graph construction).

5. Application Domains and Generalization

Two-stage geometric/generative architectures are now foundational in diverse areas:

Dense prediction: Depth, normal, and reflectance estimation from images (He et al., 30 Nov 2025)
3D content creation: Architectural model generation (Rasoulzadeh et al., 2024), shape synthesis (Jia et al., 24 Dec 2025), hybrid PCG/learning workflows (Huang et al., 7 May 2025)
Structured layout/image synthesis: Layout-aware image generation (Koch et al., 10 Nov 2025), landscape rendering (Huang et al., 2024), omni-directional image synthesis with geometric distortion correction (Nakata et al., 2024)
Domain-specific modeling: Fluid pattern illustration (sketch→LCS→velocity field) (Xie et al., 2022), vascular mesh generation with clinical control (Ding et al., 15 May 2025), crystal structure prediction (LLM → flow-based decoder) (Liu et al., 4 Mar 2026)
Scene understanding and embodied AI: Video diffusion-derived 3D priors for multimodal models (Wu et al., 19 Mar 2026), floor plan generation (topology→geometry) (Abouagour et al., 18 Dec 2025)

This separation of topological/scaffold reasoning from metric/detail synthesis has permitted advanced modeling of tasks previously resistant to either discriminative or generative modeling alone.

6. Theoretical and Practical Implications

The structural decoupling characteristic of two-stage architectures has several implications:

Task Factorization: The explicit geometric stage aligns with mechanisms of structured reasoning, explicit planning, and symbolic representation; subsequent generative modeling can focus statistical capacity on conditional diversity and realism.
Reuse of Pretrained Models: Pretrained generative models (diffusion, VQGAN, transformers) can be exploited as powerful priors selectively, as in Lotus-2’s deterministic rectified-flow for geometry (He et al., 30 Nov 2025) or VEGA-3D’s feature extractor (Wu et al., 19 Mar 2026).
Modularity and Extensibility: Stages can be independently swapped or upgraded (e.g., improved LLM or flow backbone in Lang2Str (Liu et al., 4 Mar 2026), novel edge-aware GNNs in GFLAN (Abouagour et al., 18 Dec 2025)).
Constraint Integration: Hard geometric, semantic, or physics-based constraints can be enforced or monitored at the structural stage, which is difficult in monolithic generative frameworks.
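Constraint checking at the structural stage can be made concrete with a small layout validator. This is a minimal sketch under assumptions: the `Box` format, the unit canvas, and the three rules (object count, bounds, no overlap) are hypothetical and are not taken from any cited system. The checks run on the intermediate layout before any image or mesh is generated.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in normalized canvas coordinates (hypothetical format)."""
    label: str
    x: float
    y: float
    w: float
    h: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the boxes share interior area (touching edges do not count)."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)


def check_layout(boxes: List[Box], required: Dict[str, int]) -> List[str]:
    """Return readable violations; an empty list means the layout passes."""
    violations: List[str] = []
    counts = Counter(b.label for b in boxes)
    for label, n in required.items():
        if counts[label] != n:
            violations.append(f"count: expected {n} x {label}, got {counts[label]}")
    for b in boxes:
        if b.x < 0 or b.y < 0 or b.x + b.w > 1.0 or b.y + b.h > 1.0:
            violations.append(f"bounds: {b.label} leaves the unit canvas")
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            if overlaps(a, b):
                violations.append(f"overlap: {a.label} / {b.label}")
    return violations


good = [Box("tree", 0.1, 0.1, 0.3, 0.3), Box("bench", 0.5, 0.5, 0.4, 0.2)]
bad = [Box("tree", 0.1, 0.1, 0.5, 0.5), Box("bench", 0.4, 0.4, 0.7, 0.2)]

print(check_layout(good, {"tree": 1, "bench": 1}))
print(check_layout(bad, {"tree": 2, "bench": 1}))
```

A layout that fails these checks can be rejected or re-planned before the generative stage runs. A monolithic generator offers no equivalent intermediate object to inspect.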

Trade-offs persist regarding error robustness, the tension between flexibility and control (e.g., prompt fidelity vs. layout fidelity in layout-to-image systems (Koch et al., 10 Nov 2025)), and the degree to which staged optimization can match ideal end-to-end correctness.

7. Future Directions and Open Challenges

Active research problems include:

End-to-End Differentiability and Joint Optimization: Bridging the divide between the geometric and generative stages for global optimality, e.g., with differentiable controllers or backprop-against-structure mechanisms (Liu et al., 4 Mar 2026).
Adaptive Refinement Scheduling: Learning how and when to allocate computation between structure and detail, such as dynamic step counts or learned schedules in detail-sharpener modules (He et al., 30 Nov 2025).
Broader Generative Priors: Extracting structure from text-conditioned or multimodal diffusions (beyond purely visual backbones) (He et al., 30 Nov 2025).
Domain Generalization: Extending spectral or group-theoretic geometric encodings (e.g., GHD, wreath processes) to arbitrary topologies or scales (Borsa et al., 2015, Ding et al., 15 May 2025).
Human-in-the-loop and Editability: Supporting interactive feedback, iterative refinement, and collaborative editing workflows across stages (Huang et al., 2024, Huang et al., 7 May 2025).
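As one concrete reading of the adaptive refinement scheduling item above, the number of refinement steps can be chosen per input rather than fixed. The sketch below uses a hand-written stopping rule (stop when the largest update falls below a tolerance), which is a simple heuristic and not a learned schedule. `refine_adaptive`, `toy_step`, and the tolerance values are hypothetical names and settings chosen for illustration.

```python
from typing import Callable, List, Tuple


def refine_adaptive(state: List[float],
                    step_fn: Callable[[List[float]], List[float]],
                    tol: float = 1e-3,
                    max_steps: int = 50) -> Tuple[List[float], int]:
    """Apply step_fn until the largest per-element change drops below tol."""
    for used in range(1, max_steps + 1):
        new_state = step_fn(state)
        change = max(abs(n - s) for n, s in zip(new_state, state))
        state = new_state
        if change < tol:
            return state, used
    return state, max_steps


def toy_step(state: List[float]) -> List[float]:
    """Toy refiner: move each value halfway toward a fixed detailed signal."""
    detail = [1.0, -1.0, 0.5]
    return [s + 0.5 * (d - s) for s, d in zip(state, detail)]


# A scaffold that is already close needs fewer refinement steps than a poor one.
_, easy_steps = refine_adaptive([0.99, -0.99, 0.49], toy_step)
_, hard_steps = refine_adaptive([0.0, 0.0, 0.0], toy_step)
print(easy_steps, hard_steps)
```

A learned schedule would replace the fixed tolerance test with a predictor of how many steps a given scaffold needs. The control flow around the refiner could stay the same.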

Two-stage geometric/generative pipelines currently represent a state-of-the-art paradigm for both interpretable and high-fidelity synthesis and prediction in structured, spatially complex domains.