
MonoSelfRecon: Self-Supervised 3D Scene Reconstruction

Updated 14 March 2026
  • MonoSelfRecon is a self-supervised framework for explicit 3D indoor scene reconstruction from monocular RGB videos, using a voxel-based signed distance function and Neural Radiance Field guidance.
  • It leverages a 2D-to-3D encoder with attentional view fusion and GRU-based sequential fusion to aggregate multi-view features robustly across temporal fragments.
  • The framework establishes new benchmarks in depth prediction and mesh reconstruction, demonstrating high accuracy and generalizability on indoor scene datasets.

MonoSelfRecon is a framework for explicit 3D scene reconstruction from monocular RGB videos under purely self-supervised training, providing generalizable mesh outputs for indoor environments. It combines pure self-supervision, across-scene generalization, and explicit mesh generation, using a voxel-based signed distance function (SDF) as its core 3D representation, augmented by a generalizable Neural Radiance Field (NeRF) that guides training through novel self-supervised consistency losses. MonoSelfRecon is not limited to a specific neural architecture and establishes new performance benchmarks for self-supervised depth prediction and 3D mesh reconstruction on realistic scene datasets (Li et al., 2024).

1. System Overview and Motivation

MonoSelfRecon addresses limitations of previous monocular 3D reconstruction approaches that suffer from one or more of the following: requirement for full supervision, poor generalization to unseen environments, or reliance on implicit representations. Its innovation is to reconstruct explicit 3D meshes by fusing features from monocular RGB videos, without any ground truth depth or SDF labels.

The framework takes a sequence of RGB frames and known camera poses and processes them via three principal modules:

  • A 2D-to-3D encoder that extracts and fuses image features into a volumetric representation over a voxel grid by combining attentional view fusion and sequential fragment fusion using a GRU.
  • A voxel-SDF decoder that regresses multi-scale signed distance values for each voxel, enabling mesh extraction with Marching Cubes at inference.
  • A generalizable NeRF decoder, conditioned on image features, that renders color and depth to supervise SDF predictions and enforce cross-view consistency.

All training supervision is internally synthesized through loss functions operating on photometric, geometric, and structural consistency principles. At test time, only the voxel-SDF decoder is active for mesh extraction.

2. Architectural Components

2.1 2D-3D Encoder with Attentional View Fusion

The encoder processes $N$ RGB keyframes $\{I_i, T_i\}_{i=1}^N$ (images and poses) through a ResNet-18 backbone, outputting multi-scale 2D image features. These are spatially projected, via camera intrinsics and extrinsics, onto a local 3D voxel grid of size $L^3$ for each fragment. Attentional view fusion is then performed: features corresponding to a voxel from all $N$ views are concatenated and passed through a lightweight Vision Transformer (layernorm, multi-head attention, MLP), weighting each view's contribution and aggregating via pooling. The resulting fused feature $z(x) \in \mathbb{R}^C$ for voxel $x$ encodes the local appearance and geometry.
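As a rough illustration of the fusion step, the sketch below applies single-head dot-product attention across the views of each voxel and then mean-pools over the views in which the voxel is visible. It is a simplification under stated assumptions, not the paper's module: layernorm, multiple heads, and the MLP are omitted, and the names `fuse_views`, `w_q`, `w_k`, and `valid` are hypothetical.

```python
import numpy as np

def fuse_views(view_feats, valid, w_q, w_k):
    """Fuse per-view voxel features with single-head attention, then pool.

    view_feats: (V, N, C) features of V voxels as seen from N views.
    valid:      (V, N) bool, True where the voxel projects inside view n.
    w_q, w_k:   (C, C) hypothetical query/key projection matrices.
    """
    c = view_feats.shape[-1]
    q = view_feats @ w_q
    k = view_feats @ w_k
    scores = q @ np.swapaxes(k, 1, 2) / np.sqrt(c)        # (V, N, N)
    # Views that do not see the voxel must not be attended to.
    scores = np.where(valid[:, None, :], scores, -1e9)
    scores -= scores.max(axis=-1, keepdims=True)           # stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    attended = attn @ view_feats                           # (V, N, C)
    # Mean-pool over valid views only; voxels seen by no view get zeros.
    w = valid[..., None].astype(view_feats.dtype)
    return (attended * w).sum(axis=1) / np.maximum(w.sum(axis=1), 1.0)
```

Masking before the softmax keeps views in which a voxel is out of frame from contributing to its fused feature.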

2.2 GRU-based Sequential Fragment Fusion

To handle sequences longer than a single fragment, MonoSelfRecon divides the video into spatial/temporal fragments. Each is processed sequentially; at each pyramid level $\ell \in \{1,2,3\}$, fragment features $G_t^{(\ell)}$ are fused into global hidden-state volumes $H_t^{(\ell)}$ via a 3D sparse-convolutional GRU:

$$
\begin{aligned}
z_t &= \sigma\left(\mathrm{SC}\left([H_{t-1}, G_t]; W_z\right)\right), \\
r_t &= \sigma\left(\mathrm{SC}\left([H_{t-1}, G_t]; W_r\right)\right), \\
\tilde{H}_t &= \tanh\left(\mathrm{SC}\left([r_t \odot H_{t-1}, G_t]; W_h\right)\right), \\
H_t &= (1 - z_t) \odot H_{t-1} + z_t \odot \tilde{H}_t,
\end{aligned}
$$

where $\mathrm{SC}$ denotes a sparse 3D convolution. This mechanism permits incremental, coarse-to-fine volumetric aggregation as new video fragments are viewed.
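The update can be written out in a few lines of NumPy. This sketch assumes each sparse 3D convolution is replaced by a per-voxel linear map (a 1x1x1 kernel), so it shows only the gating arithmetic, not spatial context; `gru_fuse` and the weight shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_fuse(h_prev, g_t, w_z, w_r, w_h):
    """One GRU fusion step on (V, C) voxel features.

    h_prev: global hidden state H_{t-1}; g_t: fragment features G_t.
    w_z, w_r, w_h: (2C, C) matrices standing in for the sparse convolutions.
    """
    hg = np.concatenate([h_prev, g_t], axis=-1)
    z = sigmoid(hg @ w_z)                                   # update gate
    r = sigmoid(hg @ w_r)                                   # reset gate
    h_tilde = np.tanh(np.concatenate([r * h_prev, g_t], axis=-1) @ w_h)
    return (1.0 - z) * h_prev + z * h_tilde
```

With all weights at zero both gates equal 0.5 and the candidate state is zero, so the step simply halves the previous state, which is a convenient sanity check.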

2.3 Voxel-SDF Decoder

Each pyramid level's final 3D feature volume is mapped to voxel-wise signed distance values by $f^{(\ell)}_\theta$ (a sparse conv layer plus linear map):

$$
\mathrm{SDF}(x) = f^{(\ell)}_\theta\left(H_t^{(\ell)}[x]\right) \in \mathbb{R}.
$$

Three levels of voxel grids (e.g., $L=96$ at the finest) are used. After all fragments have been fused, Marching Cubes is run on the final SDF grid to extract the zero-level-set mesh.
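Marching Cubes places mesh vertices on grid edges where the SDF changes sign, using linear interpolation between the two edge values. The sketch below shows only that edge step, along the x axis, on a synthetic sphere SDF; it does not triangulate, and the grid size, voxel size, and sphere are made-up test values.

```python
import numpy as np

def zero_crossings_along_x(sdf, voxel_size):
    """Return surface points on x-aligned grid edges where the SDF changes sign."""
    a, b = sdf[:-1], sdf[1:]                     # SDF at both ends of each x-edge
    i, j, k = np.nonzero(a * b < 0)              # edges with a strict sign change
    t = a[i, j, k] / (a[i, j, k] - b[i, j, k])   # interpolation weight in (0, 1)
    return np.stack([i + t, j, k], axis=1) * voxel_size

# Synthetic example: a unit sphere sampled on a 32^3 grid with 0.1 m voxels.
voxel = 0.1
axis = np.arange(32) * voxel
gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
center = 1.55
sphere_sdf = np.sqrt((gx - center) ** 2 + (gy - center) ** 2 + (gz - center) ** 2) - 1.0
surface_pts = zero_crossings_along_x(sphere_sdf, voxel)
```

Because an SDF is 1-Lipschitz, each interpolated point lies within one voxel length of the true surface.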

2.4 Generalizable NeRF Decoder

Parallel to the voxel pathway, a generalizable NeRF decoder, closely following MPI-NeRF, conditions on per-plane 2D features from the encoder at multiple view-aligned disparities $d_j$. For each plane and pixel, features are extracted and passed through an MLP $g_\phi$ with standard positional encoding to yield color and density:

$$
(c_j(\mathbf{u}), \sigma_j(\mathbf{u})) = g_\phi\left(\hat{z}_j(\mathbf{u}), \gamma(d_j), \gamma(\mathbf{d})\right),
$$

where $\mathbf{d}$ is the view direction. Volume rendering produces predicted RGB $\hat{I}(\mathbf{u})$ and depth $\hat{D}(\mathbf{u})$. Global scene awareness is achieved by conditioning on image features rather than per-scene codes, facilitating generalization across scenes.
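The volume-rendering step can be sketched with standard NeRF-style alpha compositing over the planes. The code assumes the planes are ordered near to far, that plane depth is the reciprocal of disparity, and that the last plane is given a very large thickness; these conventions and the name `composite_planes` are assumptions rather than details taken from the source.

```python
import numpy as np

def composite_planes(colors, sigmas, disparities):
    """Alpha-composite per-plane color and density into RGB and depth.

    colors:      (J, P, 3) color of P pixels on J planes.
    sigmas:      (J, P) non-negative densities.
    disparities: (J,) plane disparities, sorted near to far (descending).
    """
    depths = 1.0 / disparities
    # Distance to the next plane; the last plane is treated as very thick.
    deltas = np.diff(depths, append=depths[-1] + 1e10)
    alpha = 1.0 - np.exp(-sigmas * deltas[:, None])
    # Transmittance: probability that the ray reaches plane j unoccluded.
    trans = np.cumprod(
        np.concatenate([np.ones_like(alpha[:1]), 1.0 - alpha[:-1]]), axis=0
    )
    weights = trans * alpha
    rgb = (weights[..., None] * colors).sum(axis=0)
    depth = (weights * depths[:, None]).sum(axis=0)
    return rgb, depth
```

If a single plane is fully opaque and all others are empty, the rendered depth equals that plane's depth and the rendered color equals its color.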

3. Self-Supervised Losses

MonoSelfRecon achieves supervision through four complementary loss terms, all computable from monocular RGB input with known poses:

3.1 SDF-Photometric Rendering Loss

Photometric consistency is enforced by linking SDF-predicted surface points across image pairs. A voxel center $x$ with predicted $\mathrm{SDF}(x)$ is backprojected to a 3D point, reprojected to both source and target frames, and the photometric error is computed:

$$
\mathcal{L}_{\mathrm{sdf}} = \sum_{v \neq v'} \sum_{x \in \mathcal{X}} \left| I_v(p(x)) - I_{v'}(p'(x)) \right|,
$$

where $p(x)$ and $p'(x)$ are the pixel projections of the point into views $v$ and $v'$, and the sum is averaged over valid correspondences.
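A minimal sketch of this loss for one image pair follows. It assumes the surface points are already given in world coordinates, a pinhole camera with shared intrinsics `K`, world-to-camera poses, and nearest-neighbour pixel sampling; a differentiable implementation would use bilinear sampling instead. The function names are illustrative.

```python
import numpy as np

def project(points_w, K, T_cw):
    """Project (M, 3) world points; returns pixel coords (M, 2) and depth (M,)."""
    p_c = points_w @ T_cw[:3, :3].T + T_cw[:3, 3]
    uv = p_c @ K.T
    z = p_c[:, 2]
    return uv[:, :2] / np.maximum(z, 1e-8)[:, None], z

def photometric_loss(points_w, img_a, img_b, K, T_a, T_b):
    """Mean absolute color difference of points visible in both views."""
    h, w = img_a.shape[:2]
    samples, masks = [], []
    for img, T in ((img_a, T_a), (img_b, T_b)):
        uv, z = project(points_w, K, T)
        px = np.rint(uv).astype(int)
        ok = (z > 0) & (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
        px = np.clip(px, 0, [w - 1, h - 1])      # keep indexing safe; masked below
        samples.append(img[px[:, 1], px[:, 0]].astype(float))
        masks.append(ok)
    valid = masks[0] & masks[1]
    if not valid.any():
        return 0.0
    return float(np.abs(samples[0] - samples[1])[valid].mean())
```

Points that fall behind either camera or outside either image are excluded, which corresponds to averaging over valid correspondences only.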

3.2 Planar-Consistency Loss

Indoors, surfaces are often planar. Superpixels $\{\mathrm{SP}_m\}$ are identified per frame, and the corresponding surface points $\{S_i\}$ are fit to planes $A_m$ via least squares. Co-planarity is enforced by:

$$
\mathcal{L}_{\mathrm{plane}} = \sum_{m} \sum_{i \in \mathrm{SP}_m} \left| A_m^\top S_i - 1 \right|^2.
$$
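The plane fit and residual can be sketched directly with a least-squares solve. The snippet assumes the superpixel labels and 3D surface points are already available; `plane_loss` and `planar_consistency` are illustrative names. Note that the parametrization $A^\top S = 1$ cannot represent a plane passing through the coordinate origin.

```python
import numpy as np

def plane_loss(points):
    """Fit A with A^T S_i = 1 by least squares; return (A, sum of squared residuals)."""
    A, *_ = np.linalg.lstsq(points, np.ones(len(points)), rcond=None)
    residual = points @ A - 1.0
    return A, float(np.sum(residual ** 2))

def planar_consistency(points, labels):
    """Sum the plane-fit residuals over all superpixels.

    points: (M, 3) surface points; labels: (M,) superpixel index per point.
    """
    return sum(plane_loss(points[labels == m])[1] for m in np.unique(labels))
```

For points that lie exactly on a plane such as $z = 2$, the fit returns $A = (0, 0, 0.5)$ and the residual vanishes up to floating-point error.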

3.3 Depth-Consistency (NeRF Guidance) Loss

The depth predicted by NeRF ($\hat{D}_{\mathrm{NeRF}}$) is scale-aligned with SDF-derived pseudo-depth ($D_{\mathrm{SDF}}$) by a global factor $\alpha$:

$$
\mathcal{L}_{\mathrm{depth}} = \sum_{p} \left| D_{\mathrm{SDF}}(p) - \alpha \hat{D}_{\mathrm{NeRF}}(p) \right|,
$$

which ties the NeRF and voxel-SDF branches to a common geometry.
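The text does not state how the global factor is obtained. The sketch below assumes the median of per-pixel depth ratios, a common choice for scale alignment in self-supervised depth estimation, and averages rather than sums the residual; both choices and the name `depth_consistency` are assumptions.

```python
import numpy as np

def depth_consistency(d_sdf, d_nerf):
    """Scale-align NeRF depth to SDF pseudo-depth and return (alpha, mean L1 error)."""
    valid = (d_sdf > 0) & (d_nerf > 0)           # ignore pixels without depth
    # Assumed estimator for the global scale: median ratio over valid pixels.
    alpha = float(np.median(d_sdf[valid] / d_nerf[valid]))
    loss = float(np.abs(d_sdf[valid] - alpha * d_nerf[valid]).mean())
    return alpha, loss
```

When the two depth maps differ only by a constant scale, the recovered factor equals that scale and the loss is zero.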

3.4 NeRF RGB, Smoothness, and SSIM Losses

Standard NeRF supervision includes:

$$
\mathcal{L}_{\mathrm{rgb}} = \lVert \hat{I} - I \rVert_1,\qquad \mathcal{L}_{\mathrm{smooth}} = \sum_j \lVert c_j - c_{j+1} \rVert_1, \qquad \mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(\hat{I}, I).
$$

The total NeRF loss combines these, and the overall objective merges all losses with empirically chosen weights:

$$
\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{sdf}} \mathcal{L}_{\mathrm{sdf}} + \lambda_{\mathrm{plane}} \mathcal{L}_{\mathrm{plane}} + \lambda_{\mathrm{depth}} \mathcal{L}_{\mathrm{depth}} + \lambda_{\mathrm{NeRF}} \mathcal{L}_{\mathrm{NeRF}}
$$

Eikonal loss, standard in many SDF pipelines, is not needed; smoothness is enforced by volumetric and planar consistency.
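The NeRF terms and the weighted total can be sketched as follows. Several simplifications are assumed: SSIM is computed from global image statistics rather than the usual sliding window, the three NeRF terms are summed with unit weights, and the weights passed to `total_loss` are placeholders, since the source gives no values.

```python
import numpy as np

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM from whole-image statistics (simplified; standard SSIM is windowed)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def nerf_loss(rendered, target, plane_colors):
    """RGB L1 + smoothness between adjacent planes + (1 - SSIM), unit weights assumed."""
    l_rgb = np.abs(rendered - target).mean()
    l_smooth = np.abs(np.diff(plane_colors, axis=0)).mean()
    l_ssim = 1.0 - global_ssim(rendered, target)
    return float(l_rgb + l_smooth + l_ssim)

def total_loss(terms, weights):
    """Weighted sum over the named loss terms; the weights are hypothetical."""
    return sum(weights[name] * terms[name] for name in weights)
```

A perfect rendering with identical colors on every plane gives a NeRF loss of zero, so any nonzero value can be attributed to one of the three terms.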

4. Training and Inference Pipeline

The training pipeline is as follows:

  • Input: Monocular RGB video with known/corrected poses.
  • Fragmentation: Video is divided into fragments covering $\sim 0.3$ m or $15^\circ$ each; $N=9$ frames per fragment.
  • Warm-up Stage: First 20 epochs train on fragments individually (no GRU fusion), enforcing only intra-fragment losses ($\mathcal{L}_{\mathrm{sdf}}$, $\mathcal{L}_{\mathrm{plane}}$, $\mathcal{L}_{\mathrm{NeRF}}$).
  • Main Stage: Subsequent 30 epochs incorporate GRU fusion and add inter-fragment depth consistency ($\mathcal{L}_{\mathrm{depth}}$).
  • Optimization: Adam optimizer, learning rate $1 \times 10^{-4}$. Image encoder is initialized from scratch.
  • Inference: Only the voxel-SDF decoder is retained. A mesh is directly extracted via Marching Cubes, producing explicit, metrically accurate geometry.
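The two-stage schedule above can be expressed as a small configuration helper. Epochs are assumed to be zero-indexed, and the function name and the returned dictionary layout are illustrative rather than part of any released code.

```python
WARMUP_EPOCHS = 20   # fragments trained individually, no GRU fusion
MAIN_EPOCHS = 30     # GRU fusion plus inter-fragment depth consistency

def stage_config(epoch):
    """Return which components are active at a zero-indexed epoch."""
    if not 0 <= epoch < WARMUP_EPOCHS + MAIN_EPOCHS:
        raise ValueError(f"epoch {epoch} is outside the 50-epoch schedule")
    warmup = epoch < WARMUP_EPOCHS
    losses = ["sdf", "plane", "nerf"]
    if not warmup:
        losses.append("depth")
    return {"use_gru": not warmup, "losses": losses, "learning_rate": 1e-4}
```

A training loop would call `stage_config(epoch)` once per epoch and enable GRU fusion and the depth term only when the returned flags say so.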

5. Quantitative and Qualitative Results

5.1 Depth and Mesh Reconstruction Performance

Evaluated on ScanNet (test split), MonoSelfRecon achieves:

| Model/Setting | AbsRel | RMSE | $\delta<1.25$ | F-score (mesh) |
|---|---|---|---|---|
| MonoSelfRecon (self-sup) | 0.143 | 0.333 | 79.2% | 0.260 |
| MonoSelfRecon (weak sup) | - | - | - | 0.358 |
| NeuralRecon (fully supervised) | 0.065 | - | 94.8% | 0.494 |
| NeuralRecon (weak sup) | - | - | - | 0.205 |
| StructDepth (NYUv2 leaderboard) | 0.219 | - | 70.9% | - |
| P2Net (NYUv2 leaderboard) | 0.253 | - | 65.1% | - |

Completeness, recall, accuracy, and precision at 5 cm also confirm MonoSelfRecon’s strong mesh quality relative to prior self-supervised methods.
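For reference, the depth metrics and the mesh F-score are commonly computed as in the sketch below. It assumes the usual definitions (absolute relative error, root-mean-square error, threshold accuracy at 1.25, and an F-score from point-to-point distances at a 5 cm threshold) and uses brute-force nearest neighbours, which only suits small point sets; the evaluation protocol of the paper may differ in detail.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Return (AbsRel, RMSE, fraction of pixels with max ratio < 1.25)."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    delta = np.mean(np.maximum(p / g, g / p) < 1.25)
    return float(abs_rel), float(rmse), float(delta)

def mesh_fscore(pred_pts, gt_pts, thresh=0.05):
    """F-score between two point sets sampled from meshes (distances in metres)."""
    d = np.linalg.norm(pred_pts[:, None] - gt_pts[None], axis=-1)   # (P, G)
    precision = np.mean(d.min(axis=1) < thresh)   # predicted points near ground truth
    recall = np.mean(d.min(axis=0) < thresh)      # ground-truth points that are covered
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))
```

Precision penalizes spurious geometry and recall penalizes missing geometry, so the F-score drops if either the accuracy or the completeness of the mesh is poor.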

5.2 Few-shot Generalization

When evaluated on 7Scenes with no new labels (ScanNet pretrain), MonoSelfRecon attains F-score 0.323, surpassing several fully supervised voxel-based baselines.

5.3 Qualitative Results

  • Recovered meshes demonstrate high completeness and fidelity for complex furniture and indoor geometry, surpassing prior self-supervised methods in rendering crisp furniture edges, walls, and floor boundaries.
  • Weak supervision using a coarse mask removes remaining artifacts, yielding mesh quality close to that of fully supervised methods.

6. Supervised and Weakly Supervised Extensions

MonoSelfRecon’s purely self-supervised consistency losses are directly compatible with external supervision where available:

  • Voxel SDF $L_1$ loss may be added where ground-truth SDF is present.
  • Sparse or coarse occupancy masks (e.g., as in NeuralRecon) yield a “weakly supervised” variant boosting mesh F-score above that of many fully supervised methods.

This design enables seamless hybridization with existing supervised pipelines to further improve accuracy, but without any supervision MonoSelfRecon already approaches fully supervised voxel SDF models on indoor scenes (Li et al., 2024).

7. Ablations, Limitations, and Future Work

7.1 Ablation Findings

Ablation studies demonstrate that each architectural element (GRU fragment fusion, NeRF depth and color guidance, and attentional feature lifting) produces clear performance gains. For instance, omitting all three results in AbsRel 0.358 and F-score 0.171, while the full model attains AbsRel 0.143 and F-score 0.260.

7.2 Limitations

  • Restriction to indoor scenes with bounded depth due to fixed voxel-grid size and stride.
  • SDF is defined discretely on voxels; achieving continuous SDF necessitates further development, such as integrating a continuous MLP decoder or hybrid SDF-NeRF schemes.

7.3 Prospective Directions

  • Continuous SDF–NeRF Fusion: Developing continuous SDF predictions via MLPs, potentially overcoming voxel discretization.
  • Scaling to Outdoor Scenes: Employing multi-scale voxel grids, octrees, or adaptive resolutions to address larger and more varied environments.
  • Semantic and Normal Priors: Integrating learned priors may further stabilize and enhance self-supervision.

8. Relation to Other Self-Supervised Reconstruction Approaches

Prior art such as SelfRecon (Jiang et al., 2022) targets monocular reconstruction of clothed humans by fusing explicit template-free mesh evolution and implicit SDF, but requires foreground masks and normal predictions as supervision, and is focused on animatable human avatars. In contrast, MonoSelfRecon is oriented toward holistic indoor scene mesh reconstruction, is self-supervised from raw RGB with only pose supervision, and yields a generalizable architecture suitable for any voxel-SDF based approach (Li et al., 2024).

MonoSelfRecon establishes a new paradigm for self-supervised 3D reconstruction by explicitly coupling volumetric SDF representations with neural rendering and geometric priors, producing metrically accurate, generalizable mesh reconstructions from monocular RGB video.
