MonoSelfRecon: Self-Supervised 3D Scene Reconstruction

Updated 14 March 2026

MonoSelfRecon is a self-supervised framework for explicit 3D indoor scene reconstruction from monocular RGB videos, using a voxel-based signed distance function and Neural Radiance Field guidance.
It leverages a 2D-to-3D encoder with attentional view fusion and GRU-based sequential fusion to aggregate multi-view features robustly across temporal fragments.
It reports stronger depth-prediction and mesh-reconstruction results than prior self-supervised methods on ScanNet, and transfers to 7Scenes without additional labels.

MonoSelfRecon is a framework for explicit 3D scene reconstruction from monocular RGB videos under purely self-supervised training, providing generalizable mesh outputs for indoor environments. It achieves the longstanding goal of combining pure self-supervision, across-scene generalization, and explicit mesh generation, using a voxel-based signed distance function (SDF) as its core 3D representation, augmented by a generalizable Neural Radiance Field (NeRF) to guide training through novel self-supervised consistency losses. MonoSelfRecon is not limited to a specific neural architecture and establishes new performance benchmarks for self-supervised depth prediction and 3D mesh reconstruction on realistic scene datasets (Li et al., 2024).

1. System Overview and Motivation

MonoSelfRecon addresses limitations of previous monocular 3D reconstruction approaches that suffer from one or more of the following: requirement for full supervision, poor generalization to unseen environments, or reliance on implicit representations. Its innovation is to reconstruct explicit 3D meshes by fusing features from monocular RGB videos, without any ground truth depth or SDF labels.

The framework takes a sequence of RGB frames and known camera poses and processes them via three principal modules:

A 2D-to-3D encoder that extracts and fuses image features into a volumetric representation over a voxel grid by combining attentional view fusion and sequential fragment fusion using a GRU.
A voxel-SDF decoder that regresses multi-scale signed distance values for each voxel, enabling mesh extraction with Marching Cubes at inference.
A generalizable NeRF decoder, conditioned on image features, that renders color and depth to supervise SDF predictions and enforce cross-view consistency.

All training supervision is internally synthesized through loss functions operating on photometric, geometric, and structural consistency principles. At test time, only the voxel-SDF decoder is active for mesh extraction.

2. Architectural Components

2.1 2D-3D Encoder with Attentional View Fusion

The encoder processes $N$ RGB keyframes $\{I_i, T_i\}_{i=1}^N$ (images and poses) through a ResNet-18 backbone, outputting multi-scale 2D image features. These are spatially projected, via camera intrinsics and extrinsics, onto a local 3D voxel grid of size $L^3$ for each fragment. Attentional view fusion is then performed: features corresponding to a voxel from all $N$ views are concatenated and passed through a lightweight Vision Transformer (layernorm, multi-head attention, MLP), weighting each view’s contribution and aggregating via pooling. The resulting fused feature $z(x) \in \mathbb{R}^C$ for voxel $x$ encodes the local appearance and geometry.
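
The lifting step can be illustrated with a minimal sketch, assuming a pinhole camera, a 3x4 world-to-camera matrix per view, and nearest-pixel feature lookup. The names `project` and `lift_voxel_feature` and the dictionary layout of a view are hypothetical, and a plain mean over visible views stands in for the attentional fusion described above.

```python
def project(point, world_to_cam, intrinsics, width, height):
    """Project a world-space point with a pinhole model; None if not visible."""
    fx, fy, cx, cy = intrinsics
    # Apply the 3x4 extrinsic matrix [R | t] to move the point into the camera frame.
    x, y, z = (sum(r * p for r, p in zip(row[:3], point)) + row[3]
               for row in world_to_cam)
    if z <= 0:  # point lies behind the camera
        return None
    u, v = fx * x / z + cx, fy * y / z + cy
    if not (0 <= u < width and 0 <= v < height):
        return None
    return u, v


def lift_voxel_feature(voxel_center, views):
    """Average the 2D features that a voxel projects onto across all views.

    Assumption: a plain mean replaces the attention-weighted pooling.
    """
    gathered = []
    for view in views:
        feat = view["features"]  # H x W grid of C-dimensional lists
        height, width = len(feat), len(feat[0])
        uv = project(voxel_center, view["world_to_cam"], view["intrinsics"],
                     width, height)
        if uv is not None:
            u, v = uv
            gathered.append(feat[int(v)][int(u)])  # nearest-pixel lookup
    if not gathered:
        return None  # voxel is not observed by any view
    channels = len(gathered[0])
    return [sum(f[c] for f in gathered) / len(gathered) for c in range(channels)]
```

Voxels that no view observes return `None`, which mirrors the sparse treatment of the volume: only observed voxels carry features into the later stages.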

2.2 GRU-based Sequential Fragment Fusion

To handle sequences longer than a single fragment, MonoSelfRecon divides the video into spatial/temporal fragments. Each is processed sequentially; at each pyramid level $\ell \in \{1,2,3\}$, fragment features $G_t^{(\ell)}$ are fused into global hidden-state volumes $H_t^{(\ell)}$ via a 3D sparse-convolutional GRU:

$\begin{aligned} z_t &= \sigma\left(\mathrm{SC}\left([H_{t-1}, G_t]; W_z\right)\right), \\ r_t &= \sigma\left(\mathrm{SC}\left([H_{t-1}, G_t]; W_r\right)\right), \\ \tilde{H}_t &= \tanh\left(\mathrm{SC}\left([r_t \odot H_{t-1}, G_t]; W_h\right)\right), \\ H_t &= (1 - z_t) \odot H_{t-1} + z_t \odot \tilde{H}_t, \end{aligned}$

where $\mathrm{SC}$ denotes a sparse 3D convolution, $\sigma$ is the sigmoid function, $\odot$ is the elementwise product, and the level superscript $(\ell)$ is omitted for brevity. This mechanism permits incremental, coarse-to-fine volumetric aggregation as new video fragments are viewed.
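
The gating arithmetic can be traced for a single voxel with a minimal sketch. It assumes a dense linear map in place of the sparse 3D convolution, and the function names and the `params` dictionary are hypothetical; only the update equations themselves come from the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, inputs):
    """Dense stand-in for the sparse convolution SC: one dot product per output channel."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def gru_fuse(h_prev, g_t, params):
    """Fuse the fragment feature g_t into the hidden state h_prev of one voxel."""
    hg = h_prev + g_t  # list concatenation, i.e. [H_{t-1}, G_t]
    z = [sigmoid(v) for v in linear(params["Wz"], params["bz"], hg)]  # update gate
    r = [sigmoid(v) for v in linear(params["Wr"], params["br"], hg)]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + g_t  # [r_t * H_{t-1}, G_t]
    h_tilde = [math.tanh(v) for v in linear(params["Wh"], params["bh"], gated)]
    # Convex blend of the old state and the candidate state.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_tilde)]
```

With all weights and biases at zero, both gates equal 0.5 and the candidate state is zero, so the hidden state is simply halved. A strongly negative update-gate bias leaves the previous state almost unchanged, which is how the GRU can preserve geometry already fused from earlier fragments.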

2.3 Voxel-SDF Decoder

Each pyramid level’s final 3D feature volume is mapped to voxel-wise signed distance values by a decoder $f_{\mathrm{SDF}}^{(\ell)}$ (a sparse conv layer plus linear map):

$S^{(\ell)}(x) = f_{\mathrm{SDF}}^{(\ell)}\left(H^{(\ell)}(x)\right)$

Three levels of voxel grids are used, from coarsest to finest. After all fragments are fused, the final SDF grid is processed via Marching Cubes, extracting the zero-level-set mesh.
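
To make the zero-level-set idea concrete, the sketch below locates surface points on a dense SDF grid by finding sign changes along grid edges and interpolating linearly, which is the vertex-placement rule that Marching Cubes applies. It is a simplified stand-in that returns points only, not triangles; the function name and the nested-list grid layout are assumptions.

```python
def zero_crossings(sdf, voxel_size):
    """Return surface points where the SDF changes sign along grid edges."""
    nx, ny, nz = len(sdf), len(sdf[0]), len(sdf[0][0])
    points = []
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                a = sdf[i][j][k]
                # Visit the three edges leaving this voxel in +x, +y and +z.
                for di, dj, dk in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    ii, jj, kk = i + di, j + dj, k + dk
                    if ii >= nx or jj >= ny or kk >= nz:
                        continue
                    b = sdf[ii][jj][kk]
                    if (a < 0) != (b < 0):
                        t = a / (a - b)  # linear interpolation to the zero level
                        points.append(((i + t * di) * voxel_size,
                                       (j + t * dj) * voxel_size,
                                       (k + t * dk) * voxel_size))
    return points
```

Because the crossing position is interpolated from the two signed distances, the recovered surface is not restricted to voxel centers, even though the SDF itself is stored on a discrete grid.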

2.4 Generalizable NeRF Decoder

Parallel to the voxel pathway, a generalizable NeRF decoder, closely following MPI-NeRF, conditions on per-plane 2D features from the encoder at multiple view-aligned disparities $\{d_k\}_{k=1}^{K}$. For each plane and pixel, features are extracted and passed through an MLP $F_\theta$ with standard positional encoding $\gamma(\cdot)$ to yield color and density:

$(c, \sigma) = F_\theta\left(f, \gamma(d_k), \gamma(v)\right)$

where $f$ is the sampled image feature and $v$ is the view direction. Volume rendering produces predicted RGB $\hat{C}$ and depth $\hat{D}$. Global scene awareness is achieved by conditioning on image features rather than per-scene codes, facilitating generalization across scenes.
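
The volume-rendering step can be sketched with the standard NeRF compositing rule, in which each sample contributes with weight equal to the transmittance so far times its own opacity. This is a minimal sketch for one ray with samples ordered front to back; the function name and argument layout are hypothetical, and the plane-based sampling of the MPI-style decoder is not modeled.

```python
import math


def render_ray(sigmas, colors, depths):
    """Composite per-sample density and color into one RGB value and one depth."""
    rgb = [0.0, 0.0, 0.0]
    depth = 0.0
    transmittance = 1.0  # fraction of light that has survived so far
    for i, (sigma, color, t) in enumerate(zip(sigmas, colors, depths)):
        # Spacing to the next sample; the last sample gets a very large interval.
        delta = depths[i + 1] - t if i + 1 < len(depths) else 1e10
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha
        rgb = [acc + weight * ch for acc, ch in zip(rgb, color)]
        depth += weight * t
        transmittance *= 1.0 - alpha
    return rgb, depth
```

The same weights produce both color and depth, so the rendered depth becomes a by-product of fitting color; that depth is what the NeRF branch can pass on as guidance to the SDF branch.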

3. Self-Supervised Losses

MonoSelfRecon achieves supervision through four complementary loss terms, all computable from monocular RGB input with known poses:

3.1 SDF-Photometric Rendering Loss

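Photometric consistency is enforced by linking SDF-predicted surface points across image pairs. A voxel center $x$ with predicted SDF value $S(x)$ is backprojected to a 3D point $p$, reprojected to both source and target frames, and the photometric error is computed:

$\mathcal{L}_{\text{photo}} = \left| I_s\left(\pi_s(p)\right) - I_t\left(\pi_t(p)\right) \right|$

averaged over valid correspondences, where $\pi_s$ and $\pi_t$ denote projection into the source image $I_s$ and the target image $I_t$.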
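
A minimal sketch of this reprojection error follows, assuming grayscale images stored as nested lists, a shared pinhole intrinsics tuple, and nearest-pixel sampling in place of bilinear interpolation. All function names and the dictionary layout of a frame are hypothetical.

```python
def project(point, world_to_cam, intrinsics):
    """Pinhole projection of a world point; None if the point is behind the camera."""
    fx, fy, cx, cy = intrinsics
    x, y, z = (sum(r * p for r, p in zip(row[:3], point)) + row[3]
               for row in world_to_cam)
    if z <= 0:
        return None
    return fx * x / z + cx, fy * y / z + cy


def sample(image, uv):
    """Nearest-pixel lookup in a grayscale image; None outside the image."""
    if uv is None:
        return None
    col, row = int(round(uv[0])), int(round(uv[1]))
    if 0 <= row < len(image) and 0 <= col < len(image[0]):
        return image[row][col]
    return None


def photometric_loss(points, source, target, intrinsics):
    """Mean absolute intensity difference over points visible in both frames."""
    errors = []
    for p in points:
        i_s = sample(source["image"], project(p, source["world_to_cam"], intrinsics))
        i_t = sample(target["image"], project(p, target["world_to_cam"], intrinsics))
        if i_s is not None and i_t is not None:  # keep valid correspondences only
            errors.append(abs(i_s - i_t))
    return sum(errors) / len(errors) if errors else 0.0
```

If a predicted surface point is at the wrong depth, its two projections land on pixels that show different parts of the scene, and the intensity difference grows; this is what lets the loss supervise geometry without depth labels.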

3.2 Planar-Consistency Loss

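Indoor surfaces are often planar. Superpixels $\{\mathrm{SP}_m\}$ are identified per frame, and the corresponding surface points $p \in \mathrm{SP}_m$ are fit to planes $(n_m, b_m)$ via least squares. Co-planarity is enforced by:

$\mathcal{L}_{\text{plane}} = \sum_m \sum_{p \in \mathrm{SP}_m} \left| n_m^\top p + b_m \right|$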
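
The plane fit and the resulting residual can be sketched for one superpixel as follows. The sketch assumes the plane is written as $n^\top p = 1$ (that is, the offset is fixed to $-1$, which excludes planes through the camera origin) and solves the 3x3 normal equations directly; the parameterization and all function names are assumptions for illustration.

```python
def solve3(m, v):
    """Solve a 3x3 linear system with Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    solution = []
    for col in range(3):
        # Replace one column of m by v and take the ratio of determinants.
        mc = [[v[r] if c == col else m[r][c] for c in range(3)] for r in range(3)]
        solution.append(det(mc) / d)
    return solution


def fit_plane(points):
    """Least-squares n minimizing sum (n . p - 1)^2 via the normal equations."""
    ata = [[sum(p[i] * p[j] for p in points) for j in range(3)] for i in range(3)]
    atb = [sum(p[i] for p in points) for i in range(3)]
    return solve3(ata, atb)


def planar_loss(points):
    """Mean absolute plane residual of the points in one superpixel."""
    n = fit_plane(points)
    return sum(abs(sum(ni * pi for ni, pi in zip(n, p)) - 1.0)
               for p in points) / len(points)
```

Points that already lie on a common plane give a residual of zero, while a bump or dent inside the superpixel raises it, so minimizing the term flattens the predicted surface within each superpixel.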

3.3 Depth-Consistency (NeRF Guidance) Loss

The depth predicted by NeRF ($\hat{D}$) is scale-aligned with SDF-derived pseudo-depth ($D_{\text{SDF}}$) by a global factor $s$:

$\mathcal{L}_{\text{depth}} = \left| s \, \hat{D} - D_{\text{SDF}} \right|$

which ties the NeRF and voxel-SDF branches to geometric agreement.
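
One common way to obtain a global scale factor is the ratio of median depths, and the sketch below uses it to align the two depth maps before taking a mean absolute difference. Median scaling is an assumption here, since the text does not state how the factor is computed; the function name and the flat-list depth layout are also hypothetical.

```python
from statistics import median


def depth_consistency(nerf_depth, sdf_depth):
    """Scale-align NeRF depth to SDF pseudo-depth, then return (scale, loss)."""
    # Keep only pixels where both branches provide a positive depth.
    valid = [(a, b) for a, b in zip(nerf_depth, sdf_depth) if a > 0 and b > 0]
    if not valid:
        return 1.0, 0.0
    # Assumption: global scale taken as the ratio of medians.
    scale = median(b for _, b in valid) / median(a for a, _ in valid)
    loss = sum(abs(scale * a - b) for a, b in valid) / len(valid)
    return scale, loss
```

After alignment, only differences in shape remain, so the loss penalizes disagreement about relative geometry rather than about absolute scale.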

3.4 NeRF RGB, Smoothness, and SSIM Losses

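Standard NeRF supervision includes an RGB reconstruction loss $\mathcal{L}_{\text{rgb}}$, a smoothness loss $\mathcal{L}_{\text{smooth}}$, and an SSIM loss $\mathcal{L}_{\text{SSIM}}$, all computed on the rendered outputs.

The total NeRF loss $\mathcal{L}_{\text{NeRF}}$ combines these, and the overall objective merges the losses of Sections 3.1 to 3.4 with empirically chosen weights $\lambda_i$:

$\mathcal{L} = \lambda_1 \mathcal{L}_{\text{photo}} + \lambda_2 \mathcal{L}_{\text{plane}} + \lambda_3 \mathcal{L}_{\text{depth}} + \lambda_4 \mathcal{L}_{\text{NeRF}}$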

Eikonal loss, standard in many SDF pipelines, is not needed; smoothness is enforced by volumetric and planar consistency.

4. Training and Inference Pipeline

The training pipeline is as follows:

Input: Monocular RGB video with known/corrected poses.
Fragmentation: Video is divided into fragments, each bounded by roughly 0.3 m of camera translation or a rotation threshold, with a fixed number of frames per fragment.
Warm-up Stage: The first 20 epochs train on fragments individually (no GRU fusion), enforcing only the intra-fragment losses.
Main Stage: The subsequent 30 epochs incorporate GRU fusion and add the inter-fragment depth-consistency loss.
Optimization: Adam optimizer. The image encoder is initialized from scratch.
Inference: Only the voxel-SDF decoder is retained. A mesh is directly extracted via Marching Cubes, producing explicit, metrically accurate geometry.
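
The two-stage schedule can be written down as a small configuration sketch. The epoch counts come from the steps above; the loss names, the assignment of the photometric, planar, and NeRF terms to the warm-up stage, and the weight values passed in by the caller are assumptions for illustration only.

```python
WARMUP_EPOCHS = 20  # fragments trained individually, no GRU fusion
MAIN_EPOCHS = 30    # GRU fusion plus inter-fragment depth consistency


def stage_config(epoch):
    """Return which components are active at a given zero-based epoch."""
    if not 0 <= epoch < WARMUP_EPOCHS + MAIN_EPOCHS:
        raise ValueError("epoch outside the 50-epoch schedule")
    if epoch < WARMUP_EPOCHS:
        # Assumption: the intra-fragment terms are photometric, planar and NeRF.
        return {"use_gru": False, "losses": ("photo", "plane", "nerf")}
    return {"use_gru": True, "losses": ("photo", "plane", "nerf", "depth")}


def total_loss(loss_values, weights, epoch):
    """Weighted sum of the loss terms that are active at this epoch."""
    active = stage_config(epoch)["losses"]
    return sum(weights[name] * loss_values[name] for name in active)
```

Delaying the cross-fragment term until per-fragment predictions are reasonable is one plausible reading of why the warm-up stage exists: the depth-consistency loss compares two predictions, and both are unreliable at initialization.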

5. Quantitative and Qualitative Results

5.1 Depth and Mesh Reconstruction Performance

Evaluated on ScanNet (test split), MonoSelfRecon achieves:

Model/Setting	AbsRel	RMSE	$\delta < 1.25$	F-score (mesh)
MonoSelfRecon (self-sup)	0.143	0.333	79.2%	0.260
MonoSelfRecon (weak sup)	—	—	—	0.358
NeuralRecon (fully supervised)	0.065	—	94.8%	0.494
NeuralRecon (weak sup)	—	—	—	0.205
StructDepth (NYUv2 leaderboard)	0.219	—	70.9%	—
P2Net (NYUv2 leaderboard)	0.253	—	65.1%	—
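
For reference, the depth columns follow the usual monocular-depth definitions, sketched below for flat lists of predicted and ground-truth depths. The function name is hypothetical, and the sketch assumes all ground-truth and predicted depths are positive; any masking or depth capping used in the actual evaluation is not modeled.

```python
import math


def depth_metrics(pred, gt):
    """Return (AbsRel, RMSE, fraction of pixels with max ratio below 1.25)."""
    n = len(gt)
    # Mean absolute error relative to the ground-truth depth.
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    # Root of the mean squared error, in the units of the depth maps.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    # Threshold accuracy: share of pixels whose ratio error is below 1.25.
    delta = sum(max(p / g, g / p) < 1.25 for p, g in zip(pred, gt)) / n
    return abs_rel, rmse, delta
```

Lower is better for AbsRel and RMSE, while higher is better for the threshold accuracy, which is why the fully supervised row shows both the smallest AbsRel and the largest percentage.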

Completeness, recall, accuracy, and precision at 5 cm also confirm MonoSelfRecon’s strong mesh quality relative to prior self-supervised methods.
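
The mesh F-score at a 5 cm threshold can be sketched on point samples of the predicted and ground-truth surfaces. This is a brute-force illustration with a hypothetical function name; it assumes coordinates in meters and that points have already been sampled from both meshes.

```python
import math


def f_score(pred_points, gt_points, threshold=0.05):
    """Return (precision, recall, F-score) for two point sets."""
    def nearest(p, cloud):
        return min(math.dist(p, q) for q in cloud)

    # Precision: share of predicted points close to the ground-truth surface.
    precision = sum(nearest(p, gt_points) < threshold
                    for p in pred_points) / len(pred_points)
    # Recall: share of ground-truth points close to the predicted surface.
    recall = sum(nearest(q, pred_points) < threshold
                 for q in gt_points) / len(gt_points)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Because the F-score is the harmonic mean of the two, a mesh that is accurate but incomplete, or complete but noisy, scores poorly on it even when one of the individual numbers is high.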

5.2 Few-shot Generalization

When evaluated on 7Scenes with no new labels (ScanNet pretrain), MonoSelfRecon attains F-score 0.323, surpassing several fully supervised voxel-based baselines.

5.3 Qualitative Results

Recovered meshes demonstrate high completeness and fidelity for complex furniture and indoor geometry, surpassing prior self-supervised methods in rendering crisp furniture edges, walls, and floor boundaries.
Weak supervision using a coarse mask removes remaining artifacts, yielding mesh quality close to that of fully supervised methods.

6. Supervised and Weakly Supervised Extensions

MonoSelfRecon’s purely self-supervised consistency losses are directly compatible with external supervision where available:

A voxel-wise SDF regression loss (for example, $\ell_1$) may be added where ground-truth SDF is present.
Sparse or coarse occupancy masks (e.g., as in NeuralRecon) yield a “weakly supervised” variant boosting mesh F-score above that of many fully supervised methods.

This design enables seamless hybridization with existing supervised pipelines to further improve accuracy, but without any supervision MonoSelfRecon already approaches fully supervised voxel SDF models on indoor scenes (Li et al., 2024).

7. Ablations, Limitations, and Future Work

7.1 Ablation Findings

Ablation studies demonstrate that each architectural element (GRU fragment fusion, NeRF depth and color guidance, and attentional feature lifting) produces clear performance gains. For instance, omitting all three results in AbsRel 0.358 and F-score 0.171, while the full model attains AbsRel 0.143 and F-score 0.260.

7.2 Limitations

Restriction to indoor scenes with bounded depth due to fixed voxel-grid size and stride.
SDF is defined discretely on voxels; achieving continuous SDF necessitates further development, such as integrating a continuous MLP decoder or hybrid SDF-NeRF schemes.

7.3 Prospective Directions

Continuous SDF–NeRF Fusion: Developing continuous SDF predictions via MLPs, potentially overcoming voxel discretization.
Scaling to Outdoor Scenes: Employing multi-scale voxel grids, octrees, or adaptive resolutions to address larger and more varied environments.
Semantic and Normal Priors: Integrating learned priors may further stabilize and enhance self-supervision.

8. Relation to Other Self-Supervised Reconstruction Approaches

Prior art such as SelfRecon (Jiang et al., 2022) targets monocular reconstruction of clothed humans by fusing explicit template-free mesh evolution and implicit SDF, but requires foreground masks and normal predictions as supervision, and is focused on animatable human avatars. In contrast, MonoSelfRecon is oriented toward holistic indoor scene mesh reconstruction, is self-supervised from raw RGB with only pose supervision, and yields a generalizable architecture suitable for any voxel-SDF based approach (Li et al., 2024).

MonoSelfRecon establishes a new paradigm for self-supervised 3D reconstruction by explicitly coupling volumetric SDF representations with neural rendering and geometric priors, producing metrically accurate, generalizable mesh reconstructions from monocular RGB video.


References

Li et al. (2024). MonoSelfRecon: Purely Self-Supervised Explicit Generalizable 3D Reconstruction of Indoor Scenes from Monocular RGB Views.

Jiang et al. (2022). SelfRecon: Self Reconstruction Your Digital Avatar from Monocular Video.
