Lotus-2 Framework: Deterministic Geometric Prediction
- Lotus-2 is a deterministic two-stage framework that repurposes diffusion model priors with rectified-flow adaptations for accurate dense geometric prediction.
- It employs a Core Predictor for clean-data geometry prediction and a Detail Sharpener to refine high-frequency information, ensuring globally coherent outputs.
- The framework achieves state-of-the-art zero-shot generalization on benchmark datasets like NYUv2, KITTI, and ETH3D while using significantly less supervision.
Lotus-2 is a deterministic two-stage framework for geometric dense prediction that leverages the world priors embedded within large pre-trained diffusion backbones for pixel-wise inference of depth and surface normals from single images. It introduces rectified-flow adaptations and local continuity constraints in order to synthesize globally coherent and sharp geometric structures, outperforming stochastic generative baselines and discriminative models even when trained on orders of magnitude less supervision. Lotus-2 achieves state-of-the-art zero-shot generalization in monocular depth and competitive performance in normal estimation, demonstrating the efficacy of deterministic world priors for physically plausible inference (He et al., 30 Nov 2025).
1. Motivation: Deterministic World Priors from Diffusion Models
Dense geometric prediction (recovering per-pixel depth, normals, or other structural cues from an RGB image) faces fundamental ill-posedness due to the non-injective mapping between 2D observations and 3D scene configurations. Conventional discriminative regressors (CNNs, ViTs) depend on large-scale supervision, exhibit declining robustness on out-of-distribution instances, and offer limited physical reasoning. In contrast, recent large-scale diffusion models such as FLUX.1, Stable Diffusion, and related architectures encode powerful implicit world priors through exposure to massive image-text datasets.
Stochastic diffusion pipelines, optimized for high-fidelity and diverse image generation, are suboptimal for deterministic geometric inference: they introduce noise-induced output variance, blurry structures, and a requirement for ensembling. Lotus-2 addresses this by repurposing pretrained generative weights for deterministic, accurate prediction, extracting clean mappings and refining structure in a noise-free protocol.
2. Mathematical Foundations
2.1 Rectified-Flow and Latent-Space Adaptation
Given an image $x$ and its geometric annotation $y$ (depth or normals), their VAE latent codes are denoted $z^x$ and $z^y$. Rectified flow models deterministic transport from a source latent $z_0$ to a target latent $z_1$ via the ODE $\mathrm{d}z_t = v(z_t, t)\,\mathrm{d}t$ for $z_t = t\,z_1 + (1-t)\,z_0$ and ideal (constant) velocity $v^\ast = z_1 - z_0$. The neural network $g_\theta$ is trained with the flow-matching objective

$$\mathcal{L}_{\mathrm{RF}} = \mathbb{E}_{t,\,z_0,\,z_1}\big\|(z_1 - z_0) - g_\theta(z_t, t)\big\|^2,$$

with $g_\theta$ the FLUX DiT backbone wrapped in spatially efficient Pack/Unpack operations.
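For concreteness, the following is a minimal PyTorch sketch of this flow-matching objective; `g_theta` is a placeholder callable standing in for the Pack/DiT/Unpack pipeline, and the tensor shapes are illustrative assumptions rather than the released implementation.

```python
import torch

def flow_matching_loss(g_theta, z_src, z_tgt):
    """Rectified-flow matching loss between source and target latents.

    g_theta : callable(z_t, t) -> predicted velocity, same shape as z_t
    z_src, z_tgt : (B, C, H, W) latent codes from the VAE encoder
    """
    b = z_src.shape[0]
    t = torch.rand(b, device=z_src.device).view(b, 1, 1, 1)  # t ~ U(0, 1)
    z_t = t * z_tgt + (1.0 - t) * z_src                      # straight-line interpolant
    v_star = z_tgt - z_src                                   # ideal (constant) velocity
    v_pred = g_theta(z_t, t.flatten())                       # network velocity prediction
    return ((v_star - v_pred) ** 2).mean()
```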
2.2 Stage 1: Core Predictor
Lotus-2 replaces stochastic adaptation with a deterministic image-to-geometry rectified flow, taking the image latent $z^x$ as source and the geometry latent $z^y$ as target:

$$z_t = t\,z^y + (1-t)\,z^x,$$

with matching loss

$$\mathcal{L}_{\text{flow}} = \mathbb{E}_t\big\|(z^y - z^x) - g_\theta(z_t, t)\big\|^2.$$

Optimal performance emerges for single-step adaptation ($T=1$), yielding clean-data prediction with

$$\hat{z}^y = f_{\mathrm{LCM}}\big(g_\theta(z^x)\big), \qquad \mathcal{L}_{\text{core}} = \big\|\hat{z}^y - z^y\big\|^2.$$

Here $f_{\mathrm{LCM}}$ denotes the Local Continuity Module (LCM), which applies two convolutions with a GELU activation in between:

$$f_{\mathrm{LCM}}(h) = \mathrm{Conv}\big(\mathrm{GELU}(\mathrm{Conv}(h))\big).$$

This suppresses grid artifacts inherent to FLUX's latent manipulations, enforcing local smoothness across patch boundaries.
Core Algorithm (Editor's term: Core Predictor Chain)
- Input: image latent $z^x$ from the VAE encoder.
- DiT backbone: Pack → DiT → Unpack, producing the feature $h = \mathrm{Unpack}\big(\mathrm{DiT}(\mathrm{Pack}(z^x))\big)$.
- LCM: two convs + GELU yields $\hat{z}^y = f_{\mathrm{LCM}}(h)$ (see the sketch below).
- Loss: $\mathcal{L}_{\text{core}} = \big\|\hat{z}^y - z^y\big\|^2$ (clean-data regression).
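A minimal PyTorch sketch of a Local Continuity Module as described above (two convolutions with a GELU in between); the kernel size and channel handling are assumptions, not values specified by the paper.

```python
import torch
import torch.nn as nn

class LocalContinuityModule(nn.Module):
    """Two convolutions with a GELU activation, smoothing patch-boundary artifacts."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Refine the unpacked DiT feature map into the clean geometry latent.
        return self.conv2(self.act(self.conv1(h)))
```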
2.3 Stage 2: Detail Sharpener
Single-step regression captures global structure but loses high-frequency detail. Lotus-2 therefore introduces a constrained multi-step rectified flow that decouples the coarse geometry latent $z^c$ (the Stage-1 output) from the fine annotation latent $z^f$ (the ground-truth geometry latent):

$$z_t = t\,z^c + (1-t)\,z^f,$$

with matching loss

$$\mathcal{L}_{\text{sharp}} = \mathbb{E}_t\big\|(z^c - z^f) - g_\theta(z_t, t)\big\|^2.$$

At inference the learned flow is integrated from the coarse end ($t=1$) toward the fine end ($t=0$). No stochastic noise is introduced; refinement operates within the manifold prescribed by $z^c$.
Training Pseudocode
```
Input: FLUX DiT backbone g_θ (LoRA-adapted), coarse/fine pairs {(zᶜ_i, zᶠ_i)}, T′ = 10
For each minibatch:
    Sample t ∈ {1/T′, 2/T′, ..., 1}
    Compute z_t = t·zᶜ + (1 − t)·zᶠ
    Predict flow = g_θ(z_t, t)
    Compute L = ‖(zᶜ − zᶠ) − flow‖²
    Backprop and update LoRA parameters
```
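A hedged, runnable PyTorch rendering of the pseudocode above; `g_theta` is any velocity network with the shown signature (a stand-in for the LoRA-adapted FLUX DiT), and the discrete timestep grid $T'=10$ follows the listing.

```python
import torch

def sharpener_step(g_theta, optimizer, z_coarse, z_fine, T_prime: int = 10):
    """One Detail-Sharpener training step on a minibatch of coarse/fine latent pairs."""
    b = z_coarse.shape[0]
    # Sample t uniformly from the discrete grid {1/T', 2/T', ..., 1}.
    k = torch.randint(1, T_prime + 1, (b,), device=z_coarse.device)
    t = (k.float() / T_prime).view(b, 1, 1, 1)
    z_t = t * z_coarse + (1.0 - t) * z_fine      # interpolate between coarse and fine
    flow = g_theta(z_t, t.flatten())             # predicted velocity
    loss = ((z_coarse - z_fine - flow) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                             # updates only the LoRA parameters
    return loss.item()
```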
3. Implementation Details
- Backbone: public FLUX.1 VAE+DiT base; the VAE yields $1/8$-grid latents, which Pack/Unpack reshapes into patch tokens ($4\times$ fewer spatial tokens, $4\times$ more channels).
- DiT-base: 300M frozen params; LoRA adapters (rank-128 for depth, rank-256 for normals) in all attention/MLP sublayers, 1.5M trainable params.
- Dataset: 59K samples (Hypersim 39K indoor, Virtual KITTI 20K street).
- Optimization: Adam, batch size 64, 8× NVIDIA H100 (80 GB); 20K iterations for the core predictor, 15K for the sharpener, with early stopping; 30 min per stage.
- Inference: single pass through the core predictor for the coarse latent $z^c$; up to 10 Euler refinements toward the fine latent $z^f$; VAE decode for the final geometry output (see the sketch below).
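An illustrative sketch of this two-stage inference path; `core_predictor`, `sharpener`, and `vae` are placeholder modules (not the released API), and the integration direction follows the Stage-2 parameterization above, stepping from the coarse latent at $t=1$ toward the fine latent at $t=0$.

```python
import torch

@torch.no_grad()
def predict_geometry(core_predictor, sharpener, vae, image, n_refine: int = 10):
    """Two-stage inference: coarse clean-latent prediction, then Euler refinement."""
    z_x = vae.encode(image)            # image latent
    z = core_predictor(z_x)            # Stage 1: single-pass coarse geometry latent
    # Stage 2: integrate the learned flow from t = 1 (coarse) to t = 0 (fine).
    dt = 1.0 / n_refine
    for k in range(n_refine):
        t = 1.0 - k * dt
        t_vec = torch.full((z.shape[0],), t, device=z.device)
        v = sharpener(z, t_vec)        # predicted velocity (coarse-minus-fine direction)
        z = z - dt * v                 # Euler step toward the fine end of the flow
    return vae.decode(z)               # decode to the final depth / normal map
```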
4. Quantitative Results: Depth and Normal Estimation
4.1 Monocular Depth (each cell: AbsRel↓ / δ₁↑ in %, affine alignment)
| Method | Train Data | NYUv2 | KITTI | ETH3D | ScanNet | DIODE | Avg Rank |
|---|---|---|---|---|---|---|---|
| DepthAnything V2 | 62.6 M | 4.5/97.9 | 7.4/94.6 | 13.1/86.5 | 4.2/97.8 | 26.5/73.4 | 7.3 |
| MoGe-2 | 8.9 M | 3.6/98.0 | 11.8/89.2 | 16.6/81.5 | 3.5/98.2 | 39.3/70.0 | 10.4 |
| Diffusion-E2E-FT | 74 K | 5.4/96.5 | 9.6/92.1 | 6.4/95.9 | 5.8/96.5 | 30.3/77.6 | 7.1 |
| Marigold (LCM) | 74 K | 6.1/95.8 | 9.8/91.8 | 6.8/95.6 | 6.9/94.6 | 30.7/77.5 | 10.5 |
| Lotus-2 | 59 K | 4.1/97.6 | 6.7/94.5 | 4.6/98.1 | 4.2/97.6 | 22.1/75.2 | 3.6 |
Lotus-2 ranks highest overall and sets a new state of the art on three of the five datasets while using less than 1% of the training data of the leading discriminative models.
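For reference, a minimal NumPy sketch of the reported depth metrics: least-squares affine (scale-and-shift) alignment followed by AbsRel and δ₁; the paper's exact alignment protocol may differ, and the function name is illustrative.

```python
import numpy as np

def depth_metrics(pred, gt, valid=None):
    """AbsRel and delta_1 after least-squares affine alignment of pred to gt."""
    if valid is None:
        valid = gt > 0
    p, g = pred[valid].astype(np.float64), gt[valid].astype(np.float64)
    # Solve min_{s,b} ||s*p + b - g||^2 for scale s and shift b.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, g, rcond=None)
    p = np.clip(s * p + b, 1e-6, None)           # avoid non-positive aligned depths
    abs_rel = np.mean(np.abs(p - g) / g)
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return abs_rel, delta1
```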
4.2 Surface Normals (each cell: mean angular error↓ / % within 11.25°↑)
| Method | Train Data | NYUv2 | ScanNet | iBims-1 | Sintel | Avg Rank |
|---|---|---|---|---|---|---|
| OASIS | 110 K | 29.2/23.8 | 32.8/15.4 | 32.6/23.5 | 43.1/7.0 | 13.5 |
| Omnidata V2 | 12.2 M | 17.2/55.5 | 16.2/60.2 | 18.2/63.9 | 40.5/14.7 | 8.1 |
| Diff-E2E-FT | 74 K | 16.5/60.4 | 14.7/66.1 | 16.1/69.7 | 33.5/22.3 | 3.4 |
| StableNormal | 250 K | 18.6/53.5 | 17.1/57.4 | 18.2/65.0 | 36.7/14.1 | 8.4 |
| MoGe-2 | 8.9 M | 14.7/62.3 | 12.8/68.4 | 14.7/70.4 | 29.3/24.8 | 1.1 |
| Lotus-2 | 59 K | 16.9/59.0 | 14.2/66.8 | 15.4/70.4 | 30.3/27.6 | 2.9 |
Lotus-2 matches or exceeds prior noise-based refiners, demonstrating robust high-frequency edge preservation and spectral power recovery.
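Similarly, a sketch of the surface-normal metrics (mean angular error in degrees and the fraction of pixels within 11.25°); it assumes normal maps of shape (H, W, 3) that can be unit-normalized, which is an assumption about the evaluation format rather than the paper's code.

```python
import numpy as np

def normal_metrics(pred, gt, valid=None):
    """Mean angular error (degrees) and fraction of pixels with error < 11.25 deg."""
    p = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + 1e-8)
    g = gt / (np.linalg.norm(gt, axis=-1, keepdims=True) + 1e-8)
    cos = np.clip(np.sum(p * g, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    if valid is not None:
        err = err[valid]
    return err.mean(), np.mean(err < 11.25)
```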
5. Ablation Studies
5.1 Core Predictor Configurations (Depth)
| Configuration | NYU | KITTI | ETH3D | ScanNet |
|---|---|---|---|---|
| Stochastic-DA | 8.26/93.47 | 13.20/78.20 | 17.38/77.84 | 9.37/91.57 |
| + Deterministic-DA | 7.81/94.26 | 10.21/89.90 | 10.77/94.76 | 8.49/92.90 |
| + Single-Step (T=1) | 5.91/96.94 | 8.83/92.09 | 5.86/96.95 | 7.12/96.33 |
| + Clean-Data Prediction | 4.38/97.63 | 6.84/94.33 | 4.98/97.55 | 4.45/97.53 |
| + Local Continuity Module (LCM) | 4.13/97.61 | 6.58/94.68 | 4.63/98.00 | 4.17/97.58 |
| (w/o Pack-Unpack) * | 4.82/97.38 | 6.97/94.20 | 5.73/97.25 | 4.72/97.17 |
| + Detail Sharpener | 4.12/97.62 | 6.73/94.49 | 4.64/98.10 | 4.19/97.60 |
Each increment (deterministic adaptation, single-step prediction, clean-data regression, and the LCM) yields measurable accuracy gains and fewer artifacts. Removing Pack/Unpack improves local smoothness but reduces efficiency and the utility of the pretrained prior, lowering accuracy across the benchmarks.
5.2 Training Step Analysis
Reducing the number of training time-steps down to a single step ($T=1$) improves depth accuracy across all data regimes, establishing single-step regression as optimal.
5.3 Detail Sharpener Effect
Qualitative analysis shows edge sharpening without hallucination; quantitative spectral analysis indicates retention and recovery of high-frequency detail with no loss in core metrics.
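As an illustration of how such a spectral comparison could be computed, the following is a generic radially averaged power-spectrum sketch for depth maps; it is not the paper's exact analysis, and the binning scheme is an assumption.

```python
import numpy as np

def radial_power_spectrum(depth, n_bins: int = 64):
    """Radially averaged power spectrum of a depth map, for high-frequency comparisons."""
    f = np.fft.fftshift(np.fft.fft2(depth - depth.mean()))
    power = np.abs(f) ** 2
    h, w = depth.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2)                      # radial frequency per pixel
    bins = np.linspace(0, r.max(), n_bins + 1)
    idx = np.digitize(r.ravel(), bins) - 1
    spectrum = np.bincount(idx, weights=power.ravel(), minlength=n_bins)[:n_bins]
    counts = np.bincount(idx, minlength=n_bins)[:n_bins]
    return spectrum / np.maximum(counts, 1)                   # mean power per frequency band
```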
5.4 Number of Refinement Steps
| # steps | Avg AbsRel↓/δ₁↑ |
|---|---|
| 0 | 4.13/97.61 |
| 5 | 4.12/97.62 |
| 10 | 4.12/97.63 |
No significant gain is observed beyond the first few refinement steps, suggesting rapid convergence in practical use.
6. Significance and Implications
Lotus-2 demonstrates that latent world priors in diffusion models can be harnessed for stable, accurate geometric reasoning through deterministic rectified-flow mappings rather than stochastic generation protocols. Its dual-stage architecture—single-step clean-data core followed by manifold-restricted detail sharpening—achieves high fidelity in both global structure and high-frequency detail with dramatically reduced supervision. This suggests that future dense prediction frameworks may benefit from deterministic utilization of generative model priors and refined noise-free adaptation mechanisms, especially as foundation model scales continue to rise.
A plausible implication is that similar deterministic protocols could generalize to related ill-posed inference tasks, provided powerful backbone priors and modular refinement mechanisms are available. The integration of lightweight continuity modules further invites investigation into artifact suppression in latent-driven generative pipelines.
7. Related Work and Positioning
Lotus-2 builds upon the traditions of rectified-flow modeling [liu2022flow, lipman2022flow], latent diffusion adaptation (FLUX.1), and large pre-trained DiT architectures. The introduction of clean-data prediction and a manifold-constrained sharpener positions Lotus-2 distinctly against stochastic sampler-based fine-tuning paradigms (e.g., Marigold, GeoWizard), which suffer from instability and require extensive ensembling. In ablation and direct comparison, Lotus-2 achieves superior zero-shot generalization using less than 1% of the data required by leading methods, validating its protocol for efficient adaptation of generative world priors (He et al., 30 Nov 2025).