Lotus-2 Framework: Deterministic Geometric Prediction

Updated 3 December 2025
  • Lotus-2 is a deterministic two-stage framework that repurposes diffusion model priors with rectified-flow adaptations for accurate dense geometric prediction.
  • It employs a Core Predictor for clean-data depth estimation and a Detail Sharpener to refine high-frequency information, ensuring globally coherent outputs.
  • The framework achieves state-of-the-art zero-shot generalization on benchmark datasets like NYUv2, KITTI, and ETH3D while using significantly less supervision.

Lotus-2 is a deterministic two-stage framework for dense geometric prediction that leverages the world priors embedded in large pre-trained diffusion backbones to infer per-pixel depth and surface normals from a single image. It introduces rectified-flow adaptations and local continuity constraints to synthesize globally coherent, sharp geometric structures, outperforming both stochastic generative baselines and discriminative models while training on orders of magnitude less supervision. Lotus-2 achieves state-of-the-art zero-shot generalization in monocular depth estimation and competitive performance in normal estimation, demonstrating the efficacy of deterministic world priors for physically plausible inference (He et al., 30 Nov 2025).

1. Motivation: Deterministic World Priors from Diffusion Models

Dense geometric prediction—recovering per-pixel depth, normals, or other structural cues from an RGB image—is fundamentally ill-posed because the mapping from 2D observations to 3D scene configurations is non-injective. Conventional discriminative regressors (CNNs, ViTs) depend on large-scale supervision, lose robustness on out-of-distribution instances, and are limited in physical reasoning. In contrast, recent large-scale diffusion models such as FLUX.1, Stable Diffusion, and related architectures encode powerful implicit world priors through exposure to massive image-text datasets.

Stochastic diffusion pipelines, optimized for high-fidelity and diverse image generation, are suboptimal for deterministic geometric inference: they introduce noise-induced output variance, blurry structures, and a requirement for ensembling. Lotus-2 addresses this by repurposing pretrained generative weights for deterministic, accurate prediction, extracting clean mappings and refining structure in a noise-free protocol.

2. Mathematical Foundations

2.1 Rectified-Flow and Latent-Space Adaptation

Given an image $x$ and its geometric annotation $y$ (depth or normals), the corresponding VAE latent codes are denoted $z^x = E(x)$ and $z^y = E(y)$. Rectified flow models deterministic transport from a source $\mathbf{z}_1 \sim p_1$ to a target $\mathbf{z}_0 \sim p_0$ via the ODE

$$\frac{d\mathbf{z}_t}{dt} = v(\mathbf{z}_t, t), \qquad \mathbf{z}_t = t\,\mathbf{z}_1 + (1-t)\,\mathbf{z}_0,$$

with ideal velocity $v^*(\mathbf{z}_t) = \mathbf{z}_1 - \mathbf{z}_0$. The neural network $f_\theta(\cdot, t)$ is trained with

$$L_t = \big\|(\mathbf{z}_1 - \mathbf{z}_0) - f_\theta(\mathbf{z}_t, t)\big\|^2,$$

using the FLUX DiT backbone and its spatially efficient Pack/Unpack operations.
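For concreteness, this velocity-matching objective can be sketched in a few lines of PyTorch; the `model(z_t, t)` interface and tensor shapes below are illustrative assumptions, not the released implementation.

```python
import torch

def rectified_flow_loss(model, z1, z0, t):
    """Velocity-matching loss L_t = ||(z1 - z0) - f_theta(z_t, t)||^2.

    model : callable taking (z_t, t) and returning a predicted velocity
    z1, z0: source/target latents of shape (B, C, H, W)
    t     : time steps in [0, 1], shape (B,)
    """
    # Linear interpolation z_t = t * z1 + (1 - t) * z0
    t_ = t.view(-1, 1, 1, 1)
    z_t = t_ * z1 + (1.0 - t_) * z0
    # Ideal (constant) velocity along the straight transport path
    v_star = z1 - z0
    v_pred = model(z_t, t)
    return ((v_star - v_pred) ** 2).mean()
```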

2.2 Stage 1: Core Predictor

Lotus-2 replaces stochastic adaptation with a deterministic image-to-geometry rectified flow,

$$\mathbf{z}_t = t\,\mathbf{z}^x + (1-t)\,\mathbf{z}^y, \qquad v = \mathbf{z}^x - \mathbf{z}^y,$$

with matching loss

$$L_t^\text{DA} = \big\|(\mathbf{z}^x - \mathbf{z}^y) - f_\theta(\mathbf{z}_t, t)\big\|^2.$$

Optimal performance emerges for single-step adaptation ($T = 1$, $t = 1$), yielding clean-data prediction with

$$L^\text{core} = \big\| \mathbf{z}^y - \Lambda\bigl(f_\theta(\mathbf{z}^x, 1)\bigr)\big\|^2.$$

Here, $\Lambda(\cdot)$ denotes the Local Continuity Module (LCM), which applies two $3\times3$ convolutions with a GELU activation in between:

$$\Lambda(h) = \phi_2 * \mathrm{GELU}(\phi_1 * h).$$

This suppresses the grid artifacts inherent to FLUX's latent manipulations, enforcing local smoothness across patch boundaries.
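A minimal PyTorch sketch of such a two-convolution continuity module, assuming the latent feature has `channels` channels (padding and layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class LocalContinuityModule(nn.Module):
    """Lambda(h) = phi_2 * GELU(phi_1 * h): two 3x3 convs with a GELU in between."""

    def __init__(self, channels: int):
        super().__init__()
        self.phi1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.phi2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Smooths features across neighbouring latent patches to suppress grid artifacts
        return self.phi2(self.act(self.phi1(h)))
```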

Core Algorithm (Editor's term: Core Predictor Chain)

  • Input: image latent $z^x$
  • DiT backbone: Pack $\rightarrow$ DiT $\rightarrow$ Unpack, producing feature $h$
  • LCM: two $3\times3$ convs + GELU, yielding $\hat{z}^y$
  • Loss: $L^\text{core} = \|\mathbf{z}^y - \hat{z}^y\|^2$ (a training-step sketch follows below)
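A hedged sketch of one Stage-1 training step under these definitions; the `backbone` and `lcm` callables stand in for the Pack→DiT→Unpack pipeline and the LCM, and are not the paper's actual code.

```python
import torch

def core_predictor_step(backbone, lcm, z_x, z_y, optimizer):
    """One clean-data training step: z^x -> DiT feature h -> LCM -> predicted z^y."""
    # Single-step, noise-free forward pass at t = 1 (the input is the image latent itself)
    t = torch.ones(z_x.shape[0], device=z_x.device)
    h = backbone(z_x, t)        # Pack -> DiT -> Unpack, returns a latent-shaped feature
    z_y_hat = lcm(h)            # Local Continuity Module output
    loss = ((z_y - z_y_hat) ** 2).mean()   # L^core

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```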

2.3 Stage 2: Detail Sharpener

Single-step regression captures global structure but loses high-frequency detail. Lotus-2 therefore introduces a constrained multi-step rectified flow that decouples coarse geometry $z^{y_c}$ from the fine annotation $z^{y_f}$:

$$\mathbf{z}_t = t\,\mathbf{z}^{y_c} + (1-t)\,\mathbf{z}^{y_f}, \qquad v = \mathbf{z}^{y_c} - \mathbf{z}^{y_f},$$

with matching loss

$$L_t^\text{sharp} = \big\| (\mathbf{z}^{y_c} - \mathbf{z}^{y_f}) - g_\theta(\mathbf{z}_t, t) \big\|^2, \qquad t \in \{1/T', \ldots, 1\}, \; T' = 10.$$

No stochastic noise is introduced; refinement operates within the manifold prescribed by $\mathbf{z}^{y_c}$.

Training Pseudocode

```
Input: FLUX DiT backbone g_θ (LoRA-adapted), coarse/fine pairs {(zᶜ_i, zᶠ_i)}, T′ = 10
For each minibatch:
    Sample t ∈ {1/T′, 2/T′, ..., 1}
    Compute z_t = t·zᶜ + (1 − t)·zᶠ
    Predict flow = g_θ(z_t, t)
    Compute L = ‖(zᶜ − zᶠ) − flow‖²
    Backprop and update LoRA parameters
```
At inference: obtain $\hat{z}^{y_c}$ from the core predictor, then run $\leq 10$ Euler steps of $g_\theta$ to reach the final refined prediction.
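A minimal sketch of that deterministic Euler refinement, assuming `g_theta(z, t)` returns the predicted coarse-to-fine velocity $v = z^{y_c} - z^{y_f}$ (variable names are illustrative):

```python
import torch

@torch.no_grad()
def refine(g_theta, z_coarse, num_steps: int = 10):
    """Deterministic Euler integration from t = 1 (coarse latent) to t = 0 (fine latent)."""
    z = z_coarse
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((z.shape[0],), 1.0 - i * dt, device=z.device)
        v = g_theta(z, t)   # predicted velocity z^{y_c} - z^{y_f}
        z = z - dt * v      # step toward the fine latent
    return z
```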

3. Implementation Details

  • Backbone: public FLUX.1 VAE + DiT base; the VAE yields $1/8$-grid latents (spatial size $HW/4$, channels $\times 4$).
  • DiT base: $\sim$300M frozen parameters; LoRA adapters (rank 128 for depth, rank 256 for normals) in all attention/MLP sublayers, $\sim$1.5M trainable parameters.
  • Dataset: $\sim$59K samples (Hypersim, 39K indoor; Virtual KITTI, 20K street), at resolutions $576 \times 768$ and $352 \times 1216$.
  • Optimization: Adam, learning rate $1 \times 10^{-4}$, batch size 64, 8$\times$ NVIDIA H100 (80 GB); $\sim$20K iterations for the core predictor, $\sim$15K for the sharpener, with early stopping; $\sim$30 min per stage.
  • Inference: a single pass through the core predictor for $z^{y_c}$; up to 10 Euler refinement steps for $z^{y_f}$; VAE decode to the final RGB-encoded geometry map.
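As an illustration of the adapter setup, the following hedged sketch uses the Hugging Face peft library on a stand-in module; the dummy layer and the target-module names are assumptions, not the paper's configuration.

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

class DummyAttention(nn.Module):
    """Stand-in for one attention sublayer of the frozen DiT backbone (illustrative only)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x):
        return self.to_q(x) + self.to_k(x) + self.to_v(x)

backbone = DummyAttention()
lora_config = LoraConfig(
    r=128,                                    # rank 128 for depth (the paper uses 256 for normals)
    lora_alpha=128,
    target_modules=["to_q", "to_k", "to_v"],  # assumed attention projection names
)
backbone = get_peft_model(backbone, lora_config)
backbone.print_trainable_parameters()         # only the LoRA weights are trainable
```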

4. Quantitative Results: Depth and Normal Estimation

4.1 Monocular Depth (AbsRel↓, δ₁↑, affine alignment)

| Method | Train Data | NYUv2 | KITTI | ETH3D | ScanNet | DIODE | Avg Rank |
|---|---|---|---|---|---|---|---|
| DepthAnything V2 | 62.6 M | 4.5/97.9 | 7.4/94.6 | 13.1/86.5 | 4.2/97.8 | 26.5/73.4 | 7.3 |
| MoGe-2 | 8.9 M | 3.6/98.0 | 11.8/89.2 | 16.6/81.5 | 3.5/98.2 | 39.3/70.0 | 10.4 |
| Diffusion-E2E-FT | 74 K | 5.4/96.5 | 9.6/92.1 | 6.4/95.9 | 5.8/96.5 | 30.3/77.6 | 7.1 |
| Marigold (LCM) | 74 K | 6.1/95.8 | 9.8/91.8 | 6.8/95.6 | 6.9/94.6 | 30.7/77.5 | 10.5 |
| Lotus-2 | 59 K | 4.1/97.6 | 6.7/94.5 | 4.6/98.1 | 4.2/97.6 | 22.1/75.2 | 3.6 |

Lotus-2 ranks highest overall, setting a new state of the art on three of five datasets while using less than 1% of the training data of leading discriminative models.
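For reference, the affine-aligned AbsRel and δ₁ numbers above are conventionally computed as in the following sketch (least-squares scale-and-shift alignment; standard practice, not code from the paper):

```python
import numpy as np

def affine_aligned_depth_metrics(pred, gt):
    """AbsRel and delta_1 after least-squares scale/shift alignment of pred to gt."""
    p, g = pred.reshape(-1), gt.reshape(-1)
    valid = g > 0
    p, g = p[valid], g[valid]
    # Solve min_{s,b} ||s*p + b - g||^2
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, g, rcond=None)
    p_aligned = s * p + b
    abs_rel = np.mean(np.abs(p_aligned - g) / g)
    delta1 = np.mean(np.maximum(p_aligned / g, g / p_aligned) < 1.25)
    return abs_rel, delta1
```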

4.2 Surface Normals (mean angle error↓, percent <11.25°↑)

| Method | Train Data | NYUv2 | ScanNet | iBims-1 | Sintel | Avg Rank |
|---|---|---|---|---|---|---|
| OASIS | 110 K | 29.2/23.8 | 32.8/15.4 | 32.6/23.5 | 43.1/7.0 | 13.5 |
| Omnidata V2 | 12.2 M | 17.2/55.5 | 16.2/60.2 | 18.2/63.9 | 40.5/14.7 | 8.1 |
| Diff-E2E-FT | 74 K | 16.5/60.4 | 14.7/66.1 | 16.1/69.7 | 33.5/22.3 | 3.4 |
| StableNormal | 250 K | 18.6/53.5 | 17.1/57.4 | 18.2/65.0 | 36.7/14.1 | 8.4 |
| MoGe-2 | 8.9 M | 14.7/62.3 | 12.8/68.4 | 14.7/70.4 | 29.3/24.8 | 1.1 |
| Lotus-2 | 59 K | 16.9/59.0 | 14.2/66.8 | 15.4/70.4 | 30.3/27.6 | 2.9 |

Lotus-2 matches or exceeds prior noise-based refiners, demonstrating robust high-frequency edge preservation and spectral power recovery.
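Analogously, the mean angular error and the fraction of pixels within 11.25° can be computed as in this sketch (standard definitions; array layout assumed):

```python
import numpy as np

def normal_metrics(pred, gt):
    """Mean angular error (degrees) and % of pixels with error < 11.25 degrees.

    pred, gt: arrays of shape (H, W, 3) containing (near-)unit surface normals.
    """
    # Normalize and take the per-pixel dot product
    p = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + 1e-8)
    g = gt / (np.linalg.norm(gt, axis=-1, keepdims=True) + 1e-8)
    cos = np.clip(np.sum(p * g, axis=-1), -1.0, 1.0)
    err_deg = np.degrees(np.arccos(cos))
    return err_deg.mean(), 100.0 * np.mean(err_deg < 11.25)
```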

5. Ablation Studies

5.1 Core Predictor Configurations (depth, AbsRel↓/δ₁↑)

| Configuration | NYUv2 | KITTI | ETH3D | ScanNet |
|---|---|---|---|---|
| Stochastic-DA | 8.26/93.47 | 13.20/78.20 | 17.38/77.84 | 9.37/91.57 |
| + Deterministic-DA | 7.81/94.26 | 10.21/89.90 | 10.77/94.76 | 8.49/92.90 |
| + Single-Step (T=1) | 5.91/96.94 | 8.83/92.09 | 5.86/96.95 | 7.12/96.33 |
| + Clean-Data Prediction | 4.38/97.63 | 6.84/94.33 | 4.98/97.55 | 4.45/97.53 |
| + Local Continuity Module (LCM) | 4.13/97.61 | 6.58/94.68 | 4.63/98.00 | 4.17/97.58 |
| (w/o Pack-Unpack)* | 4.82/97.38 | 6.97/94.20 | 5.73/97.25 | 4.72/97.17 |
| + Detail Sharpener | 4.12/97.62 | 6.73/94.49 | 4.64/98.10 | 4.19/97.60 |

Each increment from stochastic adaptation to single-step, clean-data, and LCM leads to measurable accuracy gains and reduction of artifacts. Removing Pack/Unpack improves local smoothness but reduces efficiency and pretrained prior utility.

5.2 Training Step Analysis

Reducing the number of training time-steps from 50 to 1 improves depth accuracy across all data regimes, establishing single-step regression as optimal.

5.3 Detail Sharpener Effect

Qualitative analysis shows edge sharpening without hallucination; quantitative spectral analysis indicates retention and recovery of high-frequency detail with no loss in core metrics.
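One simple way to quantify such high-frequency retention is to compare the share of Fourier power above a radial-frequency cutoff before and after sharpening; the sketch below illustrates that style of analysis and is an assumed procedure, not the paper's exact protocol.

```python
import numpy as np

def highfreq_power_ratio(depth_map, cutoff=0.25):
    """Fraction of spectral power above a normalized radial frequency cutoff.

    depth_map: 2-D array (e.g. a predicted depth map).
    """
    f = np.fft.fftshift(np.fft.fft2(depth_map - depth_map.mean()))
    power = np.abs(f) ** 2
    h, w = depth_map.shape
    yy, xx = np.mgrid[:h, :w]
    # Normalized radial frequency of each spectral bin
    r = np.hypot((yy - h / 2) / h, (xx - w / 2) / w)
    return power[r > cutoff].sum() / power.sum()
```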

5.4 Number of Refinement Steps

| # steps | Avg AbsRel↓ / δ₁↑ |
|---|---|
| 0 | 4.13/97.61 |
| 5 | 4.12/97.62 |
| 10 | 4.12/97.63 |

No significant gain is observed beyond 10 steps, suggesting rapid convergence in practical use.

6. Significance and Implications

Lotus-2 demonstrates that latent world priors in diffusion models can be harnessed for stable, accurate geometric reasoning through deterministic rectified-flow mappings rather than stochastic generation protocols. Its dual-stage architecture—single-step clean-data core followed by manifold-restricted detail sharpening—achieves high fidelity in both global structure and high-frequency detail with dramatically reduced supervision. This suggests that future dense prediction frameworks may benefit from deterministic utilization of generative model priors and refined noise-free adaptation mechanisms, especially as foundation model scales continue to rise.

A plausible implication is that similar deterministic protocols could generalize to related ill-posed inference tasks, provided powerful backbone priors and modular refinement mechanisms are available. The integration of lightweight continuity modules further invites investigation into artifact suppression in latent-driven generative pipelines.

Lotus-2 builds upon the traditions of rectified-flow modeling (Liu et al., 2022; Lipman et al., 2022), latent diffusion adaptation (FLUX.1), and large pre-trained DiT architectures. The introduction of clean-data prediction and a manifold-constrained sharpener positions Lotus-2 distinctly against stochastic sampler-based fine-tuning paradigms (e.g., Marigold, GeoWizard), which suffer from instability and require extensive ensembling. In ablations and direct comparisons, Lotus-2 achieves superior zero-shot generalization using less than 1% of the data required by leading methods, validating its protocol for efficient adaptation of generative world priors (He et al., 30 Nov 2025).
