
Core Predictor for Geometric Dense Mapping

Updated 3 December 2025
  • Core Predictor is a deterministic model that maps image latent codes to geometric annotations using a single-step rectified flow, ensuring globally coherent 3D estimations.
  • It employs a DiT-based module combined with a Local Continuity Module (LCM) to suppress artifacts and promote smooth, structure-preserving regression.
  • Integrated in the two-stage Lotus-2 pipeline, it delivers state-of-the-art performance with minimal training data, significantly improving accuracy in benchmark depth estimation.

A core predictor is a deterministic model component designed to map image representations to per-pixel geometric annotations (such as depth or surface normals) within the Lotus-2 geometric dense prediction framework. Unlike stochastic diffusion models optimized for diverse image generation, the core predictor leverages a single-step rectified-flow mechanism for stable, structure-preserving regression from image latent to annotation latent. This design provides globally coherent geometric predictions while preventing grid artifacts, serving as the foundational first stage in a two-stage pipeline for high-fidelity geometric reasoning (He et al., 30 Nov 2025).

1. Background and Motivation

Recovering dense 3D geometry from a single image is fundamentally ill-posed, as disparate 3D configurations can yield equivalent 2D projections. Discriminative regression models rely on extensive annotation (e.g., millions of samples in DepthAnything or MoGe-2) but remain susceptible to errors in rare or ambiguous cases. Modern diffusion models such as Stable Diffusion and FLUX possess implicit world priors acquired from large-scale image-text pretraining; however, their expressive capacity is primarily harnessed through stochastic, multi-step sampling—suboptimal for deterministic geometric inference. The Lotus-2 architecture reconceptualizes such diffusion backbones as deterministic world priors, with the core predictor providing a direct, noise-free mapping tailored to geometric regression (He et al., 30 Nov 2025).

2. Rectified-Flow Formulation and Core Predictor Architecture

The core predictor replaces the usual stochastic (noise-injected) sampling of diffusion models with a single-step, deterministic rectified flow formulated in a variational autoencoder (VAE) latent space. Let $E$ and $D$ denote the frozen VAE encoder and decoder, mapping images $x$ to latents $\mathbf{z}^x$ and ground-truth geometric annotations $y$ to latents $\mathbf{z}^y$. The core predictor constructs a straight-line latent interpolation, evaluated at $t=1$:

$$\mathbf{z}_t = t\,\mathbf{z}^x + (1-t)\,\mathbf{z}^y, \qquad \mathbf{v} = \mathbf{z}^x - \mathbf{z}^y$$

Empirically, optimal regression is achieved at $t=1$, meaning that the mapping is performed directly from the image latent: $\mathbf{z}_1 = \mathbf{z}^x$. The predictor is implemented as a DiT-based module $f_\theta$ (inherited from the FLUX architecture), followed by a Local Continuity Module (LCM) for artifact suppression.
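The straight-line interpolation and its constant velocity can be sketched numerically. This is a minimal illustration with placeholder latent arrays (the shapes are arbitrary, not the actual FLUX latent dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
z_x = rng.standard_normal((4, 32, 32))  # image latent (illustrative shape)
z_y = rng.standard_normal((4, 32, 32))  # annotation latent

def interpolate(z_x, z_y, t):
    """Straight-line rectified-flow interpolation: z_t = t*z_x + (1-t)*z_y."""
    return t * z_x + (1 - t) * z_y

v = z_x - z_y  # constant velocity along the straight latent path

# At t = 1 the interpolant is exactly the image latent, so the predictor
# maps directly from z_x without any injected noise:
assert np.allclose(interpolate(z_x, z_y, 1.0), z_x)

# A single deterministic Euler step along -v recovers the annotation latent:
assert np.allclose(z_x - v, z_y)
```

The straight path is what makes a single-step traversal exact: the velocity is constant, so one Euler step from $\mathbf{z}^x$ lands precisely on $\mathbf{z}^y$.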

The prediction objective eschews residual learning in favor of clean-data regression, as residual-based losses introduce appearance artifacts into geometric outputs. The core loss is:

$$L_{\text{core}} = \left\|\mathbf{z}^y - \Lambda(f_\theta(\mathbf{z}^x, 1))\right\|^2$$

where $\Lambda(\cdot)$ is the LCM, defined as two sequential $3 \times 3$ convolutional layers with GELU activation, promoting local smoothness and removing checkerboard effects from intermediate pixel-shuffling operations.
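A minimal single-channel sketch of the LCM is given below, using hand-rolled NumPy convolutions rather than a deep-learning framework. The weights are random placeholders, and placing the GELU between the two convolutions is an assumption (the paper specifies two sequential $3 \times 3$ convs with GELU but not the exact ordering); the real module operates on multi-channel latents:

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def conv3x3(x, w, b):
    """Single-channel 3x3 convolution with zero padding (shape-preserving)."""
    h, wd = x.shape
    xp = np.pad(x, 1)
    out = np.empty_like(x)
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w) + b
    return out

def lcm(x, w1, b1, w2, b2):
    """Local Continuity Module: two sequential 3x3 convs with a GELU in between."""
    return conv3x3(gelu(conv3x3(x, w1, b1)), w2, b2)

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 8))              # toy single-channel feature map
w1, w2 = rng.standard_normal((2, 3, 3)) * 0.1
out = lcm(z, w1, 0.0, w2, 0.0)
print(out.shape)                             # spatial size is preserved: (8, 8)
```

Because each output pixel is a weighted average over a $3 \times 3$ neighborhood, stacking two such layers gives every output a $5 \times 5$ receptive field, which is what smooths away the checkerboard pattern left by pixel shuffling.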

3. Inference and Practical Implementation

The core predictor is utilized as the initial stage in the Lotus-2 inference pipeline. Its single-step deterministic mapping ensures rapid and globally coherent estimation. The pipeline operates as follows:

# E, D: frozen VAE encoder/decoder; core: trained core predictor, LCM(f_theta(., 1))
def lotus2_core_inference(x):
    z_x = E.encode(x)            # encode input image into the VAE latent space
    z_yc = core.predict(z_x)     # deterministic single-step core prediction
    y_hat = D.decode(z_yc)       # decode annotation latent into a geometric map
    return y_hat

The output of the core predictor exhibits robust global structure, free from grid artifacts—functioning as an accurate, noise-free geometric regressor. Optionally, in the second Lotus-2 stage, this output is further refined by a manifold-constrained, multi-step detail sharpener.

4. Quantitative Evaluation and Ablation Results

Core predictor efficacy is substantiated through ablation studies and benchmarking on standard depth estimation datasets. The ablation on NYUv2, KITTI, ETH3D, and ScanNet confirms the following progressive improvements (as measured by AbsRel error):

| Component | NYUv2 AbsRel ↓ | KITTI AbsRel ↓ | ETH3D AbsRel ↓ | ScanNet AbsRel ↓ |
|---|---|---|---|---|
| Stochastic-DA (multi-step, noisy) | 8.26 | 13.20 | 17.38 | 9.37 |
| + deterministic flow | 7.81 | 10.21 | 10.77 | 8.49 |
| + single-step ($T=1$) | 5.91 | 8.83 | 5.86 | 7.12 |
| + clean-data prediction | 4.38 | 6.84 | 4.98 | 4.45 |
| + LCM (local continuity) | 4.13 | 6.58 | 4.63 | 4.17 |

Introducing determinism (single-step flow), eschewing residual targets, and applying LCM collectively yield substantial gains in geometric accuracy and artifact suppression. Removing Pack–Unpack operations without LCM degrades performance, evidencing LCM's essential role. Final-stage refinement (detail sharpener) achieves marginal further improvements (He et al., 30 Nov 2025).
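For reference, the AbsRel metric used throughout the table is the mean absolute relative depth error; the tabulated values appear to be scaled by 100 (an assumption based on common benchmark conventions). A minimal sketch:

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    valid = gt > 0                     # depth must be positive to be valid
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

gt   = np.array([2.0, 4.0, 5.0])      # ground-truth depths
pred = np.array([2.2, 3.8, 5.0])      # predicted depths
print(abs_rel(pred, gt))              # per-pixel errors 0.1, 0.05, 0.0; mean ≈ 0.05
```

Because the error is normalized by ground-truth depth, AbsRel penalizes a 10 cm mistake far more at 1 m than at 10 m, which matches the scale-ambiguity inherent in monocular estimation.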

5. Significance, Extensions, and Theoretical Implications

The core predictor operationalizes diffusion world priors in a fully deterministic, interpretable fashion, enabling highly data-efficient geometric prediction—state-of-the-art results were achieved using only 59K synthetic training samples (<1% of leading baselines). The two-stage decoupled mechanism isolates global structure prediction in Stage 1 (core predictor) from high-frequency detail recovery in Stage 2 (detail sharpener), circumventing instability traditionally seen in end-to-end diffusion adaptation.

Extensions of the core predictor mechanism (single-step deterministic core, followed by manifold-constrained detail refinement) are applicable to diverse dense labeling tasks including albedo estimation, optical flow, and semantic segmentation. Adopting advanced rectified-flow backbones (e.g., SD 3.x or AuraFlow) is anticipated to further enhance fidelity. A plausible implication is that the deterministic recasting of pretrained diffusion models may become a foundational paradigm for data-efficient dense prediction across vision tasks (He et al., 30 Nov 2025).

6. Summary

The core predictor within Lotus-2 represents a distinctive deterministic, single-step regressor from image latent to annotation latent, built atop rectified-flow theory and DiT architectures. It achieves global geometric coherence, artifact-free output, and robust performance on challenging benchmarks with minimal data requirements. Utilization of the core predictor as a standalone estimator or as the initialization for geometric detail refinement demonstrates the broader applicability and theoretical import of deterministic world priors for computer vision dense prediction (He et al., 30 Nov 2025).
