Lotus-2 Framework: Deterministic Geometric Prediction
- Lotus-2 is a deterministic two-stage framework that repurposes diffusion model priors with rectified-flow adaptations for accurate dense geometric prediction.
- It employs a Core Predictor for clean-data geometry prediction and a Detail Sharpener to refine high-frequency information, ensuring globally coherent outputs.
- The framework achieves state-of-the-art zero-shot generalization on benchmark datasets like NYUv2, KITTI, and ETH3D while using significantly less supervision.
Lotus-2 is a deterministic two-stage framework for geometric dense prediction that leverages the world priors embedded within large pre-trained diffusion backbones for pixel-wise inference of depth and surface normals from single images. It introduces rectified-flow adaptations and local continuity constraints in order to synthesize globally coherent and sharp geometric structures, outperforming stochastic generative baselines and discriminative models even when trained on orders of magnitude less supervision. Lotus-2 achieves state-of-the-art zero-shot generalization in monocular depth and competitive performance in normal estimation, demonstrating the efficacy of deterministic world priors for physically plausible inference (He et al., 30 Nov 2025).
1. Motivation: Deterministic World Priors from Diffusion Models
Dense geometric prediction (recovering per-pixel depth, normals, or other structural cues from an RGB image) faces fundamental ill-posedness due to the non-injective mapping between 2D observations and 3D scene configurations. Conventional discriminative regressors (CNNs, ViTs) depend on large-scale supervision, exhibit declining robustness on out-of-distribution instances, and offer limited physical reasoning. In contrast, recent large-scale diffusion models such as FLUX.1, Stable Diffusion, and related architectures encode powerful implicit world priors through exposure to massive image-text datasets.
Stochastic diffusion pipelines, optimized for high-fidelity and diverse image generation, are suboptimal for deterministic geometric inference: they introduce noise-induced output variance, blurry structures, and a requirement for ensembling. Lotus-2 addresses this by repurposing pretrained generative weights for deterministic, accurate prediction, extracting clean mappings and refining structure in a noise-free protocol.
2. Mathematical Foundations
2.1 Rectified-Flow and Latent-Space Adaptation
Given an image $x$ and its geometric annotation $y$ (depth or normals), their VAE latent codes are denoted $z^x$ and $z^y$. Rectified flow models deterministic transport from a source latent $z_0$ to a target latent $z_1$ via the ODE $\mathrm{d}z_t = v(z_t, t)\,\mathrm{d}t$ for $z_t = t\,z_1 + (1-t)\,z_0$ and ideal (constant) velocity $v^\ast = z_1 - z_0$. The neural network $g_\theta$ is trained with the flow-matching objective

$$\mathcal{L}_{\mathrm{RF}} = \mathbb{E}_{t,\,z_0,\,z_1}\big\|(z_1 - z_0) - g_\theta(z_t, t)\big\|^2,$$

with $g_\theta$ the FLUX DiT backbone wrapped in spatially efficient Pack/Unpack operations.
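For concreteness, the following is a minimal PyTorch sketch of this flow-matching objective; `g_theta` is a placeholder callable standing in for the Pack/DiT/Unpack pipeline, and the tensor shapes are illustrative assumptions rather than the released implementation.

```python
import torch

def flow_matching_loss(g_theta, z_src, z_tgt):
    """Rectified-flow matching loss between source and target latents.

    g_theta : callable(z_t, t) -> predicted velocity, same shape as z_t
    z_src, z_tgt : (B, C, H, W) latent codes from the VAE encoder
    """
    b = z_src.shape[0]
    t = torch.rand(b, device=z_src.device).view(b, 1, 1, 1)  # t ~ U(0, 1)
    z_t = t * z_tgt + (1.0 - t) * z_src                      # straight-line interpolant
    v_star = z_tgt - z_src                                   # ideal (constant) velocity
    v_pred = g_theta(z_t, t.flatten())                       # network velocity prediction
    return ((v_star - v_pred) ** 2).mean()
```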
2.2 Stage 1: Core Predictor
Lotus-2 replaces stochastic adaptation with a deterministic image-to-geometry rectified flow, taking the image latent $z^x$ as source and the geometry latent $z^y$ as target:

$$z_t = t\,z^y + (1-t)\,z^x,$$

with matching loss

$$\mathcal{L}_{\text{flow}} = \mathbb{E}_t\big\|(z^y - z^x) - g_\theta(z_t, t)\big\|^2.$$

Optimal performance emerges for single-step adaptation ($T=1$), yielding clean-data prediction with

$$\hat{z}^y = f_{\mathrm{LCM}}\big(g_\theta(z^x)\big), \qquad \mathcal{L}_{\text{core}} = \big\|\hat{z}^y - z^y\big\|^2.$$

Here $f_{\mathrm{LCM}}$ denotes the Local Continuity Module (LCM), which applies two convolutions with a GELU activation in between:

$$f_{\mathrm{LCM}}(h) = \mathrm{Conv}\big(\mathrm{GELU}(\mathrm{Conv}(h))\big).$$

This suppresses grid artifacts inherent to FLUX's latent manipulations, enforcing local smoothness across patch boundaries.
Core Algorithm (Editor's term: Core Predictor Chain)
- Input: image latent $z^x$ from the VAE encoder.
- DiT backbone: Pack → DiT → Unpack, producing the feature $h = \mathrm{Unpack}\big(\mathrm{DiT}(\mathrm{Pack}(z^x))\big)$.
- LCM: two convs + GELU yields $\hat{z}^y = f_{\mathrm{LCM}}(h)$ (see the sketch below).
- Loss: $\mathcal{L}_{\text{core}} = \big\|\hat{z}^y - z^y\big\|^2$ (clean-data regression).
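A minimal PyTorch sketch of a Local Continuity Module as described above (two convolutions with a GELU in between); the kernel size and channel handling are assumptions, not values specified by the paper.

```python
import torch
import torch.nn as nn

class LocalContinuityModule(nn.Module):
    """Two convolutions with a GELU activation, smoothing patch-boundary artifacts."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Refine the unpacked DiT feature map into the clean geometry latent.
        return self.conv2(self.act(self.conv1(h)))
```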
2.3 Stage 2: Detail Sharpener
Single-step regression captures global structure but loses high-frequency detail. Lotus-2 therefore introduces a constrained multi-step rectified flow that decouples the coarse geometry latent $z^c$ (the Stage-1 output) from the fine annotation latent $z^f$ (the ground-truth geometry latent):

$$z_t = t\,z^c + (1-t)\,z^f,$$

with matching loss

$$\mathcal{L}_{\text{sharp}} = \mathbb{E}_t\big\|(z^c - z^f) - g_\theta(z_t, t)\big\|^2.$$

At inference the learned flow is integrated from the coarse end ($t=1$) toward the fine end ($t=0$). No stochastic noise is introduced; refinement operates within the manifold prescribed by $z^c$.
Training Pseudocode
```
Input: FLUX DiT backbone g_θ (LoRA-adapted), coarse/fine pairs {(zᶜ_i, zᶠ_i)}, T′ = 10
For each minibatch:
    Sample t ∈ {1/T′, 2/T′, ..., 1}
    Compute z_t = t·zᶜ + (1 − t)·zᶠ
    Predict flow = g_θ(z_t, t)
    Compute L = ‖(zᶜ − zᶠ) − flow‖²
    Backprop and update LoRA parameters
```
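A hedged, runnable PyTorch rendering of the pseudocode above; `g_theta` is any velocity network with the shown signature (a stand-in for the LoRA-adapted FLUX DiT), and the discrete timestep grid $T'=10$ follows the listing.

```python
import torch

def sharpener_step(g_theta, optimizer, z_coarse, z_fine, T_prime: int = 10):
    """One Detail-Sharpener training step on a minibatch of coarse/fine latent pairs."""
    b = z_coarse.shape[0]
    # Sample t uniformly from the discrete grid {1/T', 2/T', ..., 1}.
    k = torch.randint(1, T_prime + 1, (b,), device=z_coarse.device)
    t = (k.float() / T_prime).view(b, 1, 1, 1)
    z_t = t * z_coarse + (1.0 - t) * z_fine      # interpolate between coarse and fine
    flow = g_theta(z_t, t.flatten())             # predicted velocity
    loss = ((z_coarse - z_fine - flow) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                             # updates only the LoRA parameters
    return loss.item()
```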
3. Implementation Details
- Backbone: public FLUX.1 VAE+DiT base; the VAE yields $1/8$-grid latents, which Pack/Unpack reshapes into patch tokens ($4\times$ fewer spatial tokens, $4\times$ more channels).
- DiT-base: 300M frozen params; LoRA adapters (rank-128 for depth, rank-256 for normals) in all attention/MLP sublayers, 1.5M trainable params.
- Dataset: 59K samples (Hypersim 39K indoor, Virtual KITTI 20K street).
- Optimization: Adam, batch size 64, 8× NVIDIA H100 (80 GB); 20K iterations for the core predictor, 15K for the sharpener, with early stopping; 30 min per stage.
- Inference: single pass through the core predictor for the coarse latent $z^c$; up to 10 Euler refinements toward the fine latent $z^f$; VAE decode for the final geometry output (see the sketch below).
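An illustrative sketch of this two-stage inference path; `core_predictor`, `sharpener`, and `vae` are placeholder modules (not the released API), and the integration direction follows the Stage-2 parameterization above, stepping from the coarse latent at $t=1$ toward the fine latent at $t=0$.

```python
import torch

@torch.no_grad()
def predict_geometry(core_predictor, sharpener, vae, image, n_refine: int = 10):
    """Two-stage inference: coarse clean-latent prediction, then Euler refinement."""
    z_x = vae.encode(image)            # image latent
    z = core_predictor(z_x)            # Stage 1: single-pass coarse geometry latent
    # Stage 2: integrate the learned flow from t = 1 (coarse) to t = 0 (fine).
    dt = 1.0 / n_refine
    for k in range(n_refine):
        t = 1.0 - k * dt
        t_vec = torch.full((z.shape[0],), t, device=z.device)
        v = sharpener(z, t_vec)        # predicted velocity (coarse-minus-fine direction)
        z = z - dt * v                 # Euler step toward the fine end of the flow
    return vae.decode(z)               # decode to the final depth / normal map
```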
4. Quantitative Results: Depth and Normal Estimation
4.1 Monocular Depth (each cell: AbsRel↓ / δ₁↑ in %, affine alignment)
| Method | Train Data | NYUv2 | KITTI | ETH3D | ScanNet | DIODE | Avg Rank |
|---|---|---|---|---|---|---|---|
| DepthAnything V2 | 62.6 M | 4.5/97.9 | 7.4/94.6 | 13.1/86.5 | 4.2/97.8 | 26.5/73.4 | 7.3 |
| MoGe-2 | 8.9 M | 3.6/98.0 | 11.8/89.2 | 16.6/81.5 | 3.5/98.2 | 39.3/70.0 | 10.4 |
| Diffusion-E2E-FT | 74 K | 5.4/96.5 | 9.6/92.1 | 6.4/95.9 | 5.8/96.5 | 30.3/77.6 | 7.1 |
| Marigold (LCM) | 74 K | 6.1/95.8 | 9.8/91.8 | 6.8/95.6 | 6.9/94.6 | 30.7/77.5 | 10.5 |
| Lotus-2 | 59 K | 4.1/97.6 | 6.7/94.5 | 4.6/98.1 | 4.2/97.6 | 22.1/75.2 | 3.6 |
Lotus-2 ranks highest overall and sets a new state of the art on three of the five datasets while using less than 1% of the training data of the leading discriminative models.
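For reference, a minimal NumPy sketch of the reported depth metrics: least-squares affine (scale-and-shift) alignment followed by AbsRel and δ₁; the paper's exact alignment protocol may differ, and the function name is illustrative.

```python
import numpy as np

def depth_metrics(pred, gt, valid=None):
    """AbsRel and delta_1 after least-squares affine alignment of pred to gt."""
    if valid is None:
        valid = gt > 0
    p, g = pred[valid].astype(np.float64), gt[valid].astype(np.float64)
    # Solve min_{s,b} ||s*p + b - g||^2 for scale s and shift b.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, g, rcond=None)
    p = np.clip(s * p + b, 1e-6, None)           # avoid non-positive aligned depths
    abs_rel = np.mean(np.abs(p - g) / g)
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)
    return abs_rel, delta1
```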
4.2 Surface Normals (each cell: mean angular error↓ / % within 11.25°↑)
| Method | Train Data | NYUv2 | ScanNet | iBims-1 | Sintel | Avg Rank |
|---|---|---|---|---|---|---|
| OASIS | 110 K | 29.2/23.8 | 32.8/15.4 | 32.6/23.5 | 43.1/7.0 | 13.5 |
| Omnidata V2 | 12.2 M | 17.2/55.5 | 16.2/60.2 | 18.2/63.9 | 40.5/14.7 | 8.1 |
| Diff-E2E-FT | 74 K | 16.5/60.4 | 14.7/66.1 | 16.1/69.7 | 33.5/22.3 | 3.4 |
| StableNormal | 250 K | 18.6/53.5 | 17.1/57.4 | 18.2/65.0 | 36.7/14.1 | 8.4 |
| MoGe-2 | 8.9 M | 14.7/62.3 | 12.8/68.4 | 14.7/70.4 | 29.3/24.8 | 1.1 |
| Lotus-2 | 59 K | 16.9/59.0 | 14.2/66.8 | 15.4/70.4 | 30.3/27.6 | 2.9 |
Lotus-2 matches or exceeds prior noise-based refiners, demonstrating robust high-frequency edge preservation and spectral power recovery.
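Similarly, a sketch of the surface-normal metrics (mean angular error in degrees and the fraction of pixels within 11.25°); it assumes normal maps of shape (H, W, 3) that can be unit-normalized, which is an assumption about the evaluation format rather than the paper's code.

```python
import numpy as np

def normal_metrics(pred, gt, valid=None):
    """Mean angular error (degrees) and fraction of pixels with error < 11.25 deg."""
    p = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + 1e-8)
    g = gt / (np.linalg.norm(gt, axis=-1, keepdims=True) + 1e-8)
    cos = np.clip(np.sum(p * g, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    if valid is not None:
        err = err[valid]
    return err.mean(), np.mean(err < 11.25)
```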
5. Ablation Studies
5.1 Core Predictor Configurations (Depth)
| Configuration | NYU | KITTI | ETH3D | ScanNet |
|---|---|---|---|---|
| Stochastic-DA | 8.26/93.47 | 13.20/78.20 | 17.38/77.84 | 9.37/91.57 |
| + Deterministic-DA | 7.81/94.26 | 10.21/89.90 | 10.77/94.76 | 8.49/92.90 |
| + Single-Step (T=1) | 5.91/96.94 | 8.83/92.09 | 5.86/96.95 | 7.12/96.33 |
| + Clean-Data Prediction | 4.38/97.63 | 6.84/94.33 | 4.98/97.55 | 4.45/97.53 |
| + Local Continuity Module (LCM) | 4.13/97.61 | 6.58/94.68 | 4.63/98.00 | 4.17/97.58 |
| (w/o Pack-Unpack) * | 4.82/97.38 | 6.97/94.20 | 5.73/97.25 | 4.72/97.17 |
| + Detail Sharpener | 4.12/97.62 | 6.73/94.49 | 4.64/98.10 | 4.19/97.60 |
Each increment (deterministic adaptation, single-step prediction, clean-data regression, and the LCM) yields measurable accuracy gains and fewer artifacts. Removing Pack/Unpack improves local smoothness but reduces efficiency and the utility of the pretrained prior, lowering accuracy across the benchmarks.
5.2 Training Step Analysis
Reducing the number of training time-steps down to a single step ($T=1$) improves depth accuracy across all data regimes, establishing single-step regression as optimal.
5.3 Detail Sharpener Effect
Qualitative analysis shows edge sharpening without hallucination; quantitative spectral analysis indicates retention and recovery of high-frequency detail with no loss in core metrics.
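As an illustration of how such a spectral comparison could be computed, the following is a generic radially averaged power-spectrum sketch for depth maps; it is not the paper's exact analysis, and the binning scheme is an assumption.

```python
import numpy as np

def radial_power_spectrum(depth, n_bins: int = 64):
    """Radially averaged power spectrum of a depth map, for high-frequency comparisons."""
    f = np.fft.fftshift(np.fft.fft2(depth - depth.mean()))
    power = np.abs(f) ** 2
    h, w = depth.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2)                      # radial frequency per pixel
    bins = np.linspace(0, r.max(), n_bins + 1)
    idx = np.digitize(r.ravel(), bins) - 1
    spectrum = np.bincount(idx, weights=power.ravel(), minlength=n_bins)[:n_bins]
    counts = np.bincount(idx, minlength=n_bins)[:n_bins]
    return spectrum / np.maximum(counts, 1)                   # mean power per frequency band
```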
5.4 Number of Refinement Steps
| # steps | Avg AbsRel↓/δ₁↑ |
|---|---|
| 0 | 4.13/97.61 |
| 5 | 4.12/97.62 |
| 10 | 4.12/97.63 |
No significant gain is observed beyond the first few refinement steps, suggesting rapid convergence in practical use.
6. Significance and Implications
Lotus-2 demonstrates that latent world priors in diffusion models can be harnessed for stable, accurate geometric reasoning through deterministic rectified-flow mappings rather than stochastic generation protocols. Its dual-stage architecture—single-step clean-data core followed by manifold-restricted detail sharpening—achieves high fidelity in both global structure and high-frequency detail with dramatically reduced supervision. This suggests that future dense prediction frameworks may benefit from deterministic utilization of generative model priors and refined noise-free adaptation mechanisms, especially as foundation model scales continue to rise.
A plausible implication is that similar deterministic protocols could generalize to related ill-posed inference tasks, provided powerful backbone priors and modular refinement mechanisms are available. The integration of lightweight continuity modules further invites investigation into artifact suppression in latent-driven generative pipelines.
7. Related Work and Positioning
Lotus-2 builds upon the traditions of rectified-flow modeling [liu2022flow, lipman2022flow], latent diffusion adaptation (FLUX.1), and large pre-trained DiT architectures. The introduction of clean-data prediction and a manifold-constrained sharpener positions Lotus-2 distinctly against stochastic sampler-based fine-tuning paradigms (e.g., Marigold, GeoWizard), which suffer from instability and require extensive ensembling. In ablation and direct comparison, Lotus-2 achieves superior zero-shot generalization using less than 1% of the data required by leading methods, validating its protocol for efficient adaptation of generative world priors (He et al., 30 Nov 2025).