
REST3D: Dual Approaches in 3D Scene Analysis

Updated 31 May 2026
  • REST3D is a dual-framework approach that advances 3D scene analysis by integrating semi-supervised 3D referring expression segmentation and physically stable 3D scene reconstruction.
  • In the segmentation paradigm, REST3D employs a dual teacher-student network, pseudo-labeling via TSCS, and dynamic weighting (QDW) to significantly boost mask annotation efficiency and mIoU.
  • For reconstruction, the pipeline recovers physically consistent 3D scenes from a single RGB image by combining vision-language models, canonicalization, and physics-constrained optimization, producing scenes suitable for VR applications.

REST3D refers to two distinct frameworks in 3D scene analysis and understanding: (1) a semi-supervised baseline for 3D Referring Expression Segmentation, also known as 3DResT, and (2) a pipeline for reconstructing physically stable 3D scenes from a single RGB image. Each targets a specific limitation in its domain, annotation efficiency in the first case and physical plausibility in the second, and introduces new algorithmic components to address it (Chen et al., 17 Apr 2025, Ma et al., 28 May 2026).

1. REST3D for Semi-Supervised 3D Referring Expression Segmentation

1.1. Problem Formulation

The 3D Referring Expression Segmentation (3D-RES) task takes a 3D point cloud $V \in \mathbb{R}^{n \times 3}$ and a natural language referring expression $T$, and produces a segmentation mask $Y \in \{0,1\}^n$ marking the points that belong to the object the text refers to. A major barrier to scaling 3D-RES is the cost of annotating a mask for every language expression. REST3D mitigates this through semi-supervised learning: it uses a labeled set $\mathcal{D}_l = \{(V_i^l, T_i^l), Y_i^l\}_{i=1}^{N_l}$ and a large unlabeled set $\mathcal{D}_u = \{(V_j^u, T_j^u)\}_{j=1}^{N_u}$ with $N_l \ll N_u$, and optimizes the model parameters $\theta$ via

$$\min_\theta \; \mathcal{L}(\theta; \mathcal{D}_l, \mathcal{D}_u)$$

(Chen et al., 17 Apr 2025).

1.2. Framework Architecture

The architecture is a dual-network teacher-student layout in which both networks share the same modules: a point-based 3D encoder, a Bi-LSTM or Transformer language encoder, a cross-attention fusion module between visual and language features, and an upsampling mask decoder. Each training iteration proceeds as follows:

  1. Sampling mini-batches from Dl\mathcal{D}_l and Du\mathcal{D}_u.
  2. Teacher network (parameters $\theta_t$) processes the weakly augmented unlabeled batch, producing pseudo-labels $\hat{Y}^u$.
  3. Student network (parameters $\theta_s$) consumes strongly augmented batches, yielding outputs for both labeled and unlabeled inputs.
  4. Supervised and unsupervised losses are computed, after which $\theta_s$ is updated by gradient descent and $\theta_t$ is updated as an exponential moving average (EMA) of $\theta_s$ with decay rate $\alpha \in [0, 1)$:

$$\theta_t \leftarrow \alpha\,\theta_t + (1 - \alpha)\,\theta_s$$

  5. Teacher-Student Consistency-Based Sampling (TSCS) is run periodically (Chen et al., 17 Apr 2025).
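The EMA step in the loop above can be illustrated with a minimal NumPy sketch. The parameter dictionaries, the function name `ema_update`, and the decay value are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def ema_update(teacher, student, alpha=0.99):
    """Return teacher parameters moved toward the student by an EMA step.

    teacher, student: dicts mapping parameter names to arrays of equal shape.
    alpha: decay rate (hypothetical value); larger means a slower teacher.
    """
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


# Toy parameters standing in for real network weights.
teacher_params = {"w": np.zeros(3)}
student_params = {"w": np.ones(3)}

# One EMA step: each teacher entry moves 10% of the way toward the student.
teacher_params = ema_update(teacher_params, student_params, alpha=0.9)
```

Because the teacher is never updated by gradients, it changes more smoothly than the student, which is the usual motivation for using it as the source of pseudo-labels.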

1.3. Teacher-Student Consistency-Based Sampling (TSCS) and Quality-Driven Dynamic Weighting (QDW)

TSCS identifies candidate high-quality pseudo-labels by measuring the agreement between the teacher and student mask predictions, $\hat{Y}_t$ and $\hat{Y}_s$, as their IoU: $\mathrm{IoU}(\hat{Y}_t, \hat{Y}_s) = |\hat{Y}_t \cap \hat{Y}_s| / |\hat{Y}_t \cup \hat{Y}_s|$. Samples whose IoU reaches a threshold $\tau$ are treated as genuine labels and incorporated into $\mathcal{D}_l$. QDW further assigns a weight $w_j$ to each unlabeled sample, so that all pseudo-labels, even low-quality ones, contribute to the loss with appropriate strength (Chen et al., 17 Apr 2025).
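A rough sketch of the selection and weighting logic follows. The IoU computation is standard, but the threshold value and the choice to use the IoU itself as the per-sample weight are assumptions made for illustration; the source text does not give the exact QDW formula.

```python
import numpy as np


def mask_iou(a, b):
    """IoU between two binary point masks of equal length."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as full agreement
    return float(np.logical_and(a, b).sum() / union)


def select_and_weight(teacher_masks, student_masks, tau=0.8):
    """Return (promoted flags, per-sample weights) for unlabeled samples.

    tau is a hypothetical threshold. Using the IoU as the weight is a
    placeholder for QDW, not the paper's formula.
    """
    ious = np.array([mask_iou(t, s) for t, s in zip(teacher_masks, student_masks)])
    promoted = ious >= tau   # TSCS-style promotion to the labeled pool
    weights = ious.copy()    # stand-in for quality-driven weights
    return promoted, weights


teacher_masks = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]])
student_masks = np.array([[1, 1, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0]])
promoted, weights = select_and_weight(teacher_masks, student_masks)
```

In this toy batch only the first sample, where teacher and student agree exactly, is promoted; the third still contributes with a reduced weight rather than being discarded.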

1.4. Loss Functions

The total loss combines a supervised term and an unsupervised term. The supervised loss combines binary cross-entropy (BCE) and Dice losses on the labeled samples; the unsupervised loss applies the same BCE and Dice combination to each unlabeled sample, weighted by that sample's QDW weight, with an optional feature attention loss (Chen et al., 17 Apr 2025).
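As a minimal sketch of how a BCE plus Dice mask loss and its per-sample weighting could be computed, consider the NumPy code below. The equal mixing of BCE and Dice, the averaging over the batch, and all function names are assumptions; the optional feature attention term is omitted.

```python
import numpy as np


def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy over the points of one mask."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)))


def dice_loss(prob, target, eps=1e-7):
    """One minus the soft Dice coefficient of predicted and target masks."""
    inter = np.sum(prob * target)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(prob) + np.sum(target) + eps))


def mask_loss(prob, target):
    """BCE + Dice with equal weight (an assumption)."""
    return bce_loss(prob, target) + dice_loss(prob, target)


def unsupervised_loss(probs, pseudo_labels, weights):
    """Weighted mask loss averaged over unlabeled samples (averaging is an assumption)."""
    per_sample = [w * mask_loss(p, y) for p, y, w in zip(probs, pseudo_labels, weights)]
    return float(np.mean(per_sample))


target = np.array([1.0, 0.0, 1.0, 0.0])
good_pred = np.array([0.9, 0.1, 0.8, 0.2])
bad_pred = np.array([0.2, 0.8, 0.3, 0.9])

loss_good = mask_loss(good_pred, target)
loss_bad = mask_loss(bad_pred, target)
loss_ignored = unsupervised_loss([bad_pred], [target], [0.0])
```

A weight of zero removes a sample's contribution entirely, while intermediate weights let a doubtful pseudo-label influence training without dominating it.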

1.5. Empirical Results

In experiments on ScanRefer (800 ScanNet scenes, 51,583 expressions), REST3D achieved 25.41% mIoU using only 1% labeled data, an improvement of 8.34 mIoU points over the fully supervised method. Ablations attributed +1.61 mIoU to TSCS alone and +0.15 to QDW, with their combination producing the headline result. REST3D is reported to generalize to other referring tasks and to accommodate RGB-D input or vision-language pretraining. Limitations include the need for careful threshold and EMA tuning and some additional bookkeeping for labeled/unlabeled splits (Chen et al., 17 Apr 2025).

2. REST3D for Physically Stable 3D Scene Reconstruction from a Single Image

2.1. Problem Overview

Given a single indoor RGB image $I$, the objective is to recover a set of 3D object meshes $\{M_i\}$ and their 6-DoF poses $\{P_i\}$ that are both visually consistent with $I$ and physically stable (i.e., no floating, penetration, or instability in simulation). Prior works such as FACTORED3D, Gen3DSR, and SAM3D reconstruct plausible 3D geometry but neglect scene-level physical consistency, resulting in objects that float, intersect, or topple in physics simulation. Conversely, scene-generation models with strong priors, such as DigitalCousins or SAGE, can produce scenes that are misaligned with the input image (Ma et al., 28 May 2026).

2.2. Agentic Physical Scene Understanding

REST3D introduces a VLM-driven analysis pipeline that explicitly constructs a hierarchical "scene tree" $\mathcal{T}$ mapping objects and their support relations (ground, wall, ceiling, ground–wall). Steps:

  1. Open-vocabulary detection: Google Gemini 3 Flash VLM enumerates salient objects with attributes.
  2. Agentic segmentation: A "segmentation agent" employs SAM 3 for object masks, verified or refined via a "verifier agent" (VLM pass).
  3. Scene-tree induction: For each object, the VLM assigns a parent support node and relation ("on," "hanging," "attached"), forming the scene tree $\mathcal{T}$ such that every object is assigned to exactly one canonical support. For vertical support, this yields a constraint expressed in terms of the parent surface normal and a signed offset (Ma et al., 28 May 2026).
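One way to picture the scene tree described in the steps above is as a small tree data structure. The sketch below is illustrative only: the class name `SceneNode`, the tuple format of the assignments, and the example objects are hypothetical, and in the actual pipeline the assignments would come from the VLM.

```python
from dataclasses import dataclass, field

# Root support nodes named in the text.
SUPPORT_ROOTS = ("ground", "wall", "ceiling", "ground-wall")


@dataclass
class SceneNode:
    name: str
    relation: str = ""  # "on", "hanging", or "attached"; empty for roots
    children: list = field(default_factory=list)


def build_scene_tree(assignments):
    """Build nodes from (object, parent, relation) tuples.

    Raises ValueError if an object is given more than one support,
    mirroring the one-support-per-object rule.
    """
    nodes = {root: SceneNode(root) for root in SUPPORT_ROOTS}
    for obj, _parent, relation in assignments:
        if obj in nodes:
            raise ValueError(f"{obj!r} already has a support assignment")
        nodes[obj] = SceneNode(obj, relation)
    for obj, parent, _relation in assignments:
        nodes[parent].children.append(nodes[obj])
    return nodes


def descendants(node):
    """Names of all objects supported, directly or indirectly, by node."""
    names = []
    for child in node.children:
        names.append(child.name)
        names.extend(descendants(child))
    return names


# Hypothetical VLM output for a simple room.
assignments = [
    ("table", "ground", "on"),
    ("lamp", "table", "on"),
    ("clock", "wall", "attached"),
]
tree = build_scene_tree(assignments)
ground_group = descendants(tree["ground"])
```

Grouping objects by their support root in this way gives natural units for later per-group processing.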

2.3. Scene Initialization and Canonicalization

Each object mesh, together with an initial pose, is reconstructed using an image-to-3D backbone (SAM 3D). The alignment pipeline consists of:

  • Gravity alignment: The upright direction is estimated via plane fitting, and objects are rotated so that they are consistent with it.
  • Support constraint enforcement: For each "on" edge in the scene tree, the object's vertical coordinate is adjusted so that its mesh bottom aligns with the parent's mesh top, which minimizes the canonicalization loss.

After this canonicalization, the initial scene respects coarse gravity and support, although collisions and local errors may remain (Ma et al., 28 May 2026).
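The two canonicalization steps can be sketched geometrically as follows. This is a simplified illustration assuming the world vertical is the +z axis and that meshes are given as vertex arrays; the function names are hypothetical and the actual pipeline may differ.

```python
import numpy as np


def rotation_to_z(up):
    """Rotation matrix taking the estimated up direction onto +z (Rodrigues form)."""
    a = np.asarray(up, dtype=float)
    a = a / np.linalg.norm(a)
    b = np.array([0.0, 0.0, 1.0])
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.isclose(c, -1.0):
        # Up points along -z: a half turn about the x axis flips it.
        return np.diag([1.0, -1.0, -1.0])
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)


def snap_to_support(vertices, parent_top_z):
    """Translate a mesh along z so its lowest vertex rests at parent_top_z."""
    offset = parent_top_z - vertices[:, 2].min()
    return vertices + np.array([0.0, 0.0, offset])


# Toy example: the estimated up direction is the y axis.
estimated_up = np.array([0.0, 1.0, 0.0])
R = rotation_to_z(estimated_up)
aligned_up = R @ estimated_up

# A floating object whose lowest point sits at z = 1.3 above a table top at z = 1.0.
obj_vertices = np.array([[0.0, 0.0, 1.3], [0.1, 0.0, 1.5], [0.0, 0.1, 1.4]])
snapped = snap_to_support(obj_vertices, parent_top_z=1.0)
```

Snapping only fixes the vertical gap for each "on" relation; it does nothing about sideways overlap, which is why collisions can remain after this stage.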

2.4. Physics-Constrained Optimization

The central refinement involves searching over pose perturbations for all objects so that post-physics simulation configuration is both stable and image-consistent. REST3D utilizes the Cross-Entropy Method (CEM) optimizer to minimize a composite energy function, summing:

  • Stability drift $E_{\mathrm{drift}}$: object displacement and rotation after the simulation settles.
  • Velocity $E_{\mathrm{vel}}$: early-step velocities of each object.
  • Penetration $E_{\mathrm{pen}}$: number of inter-object convex-hull intersections before and after simulation.
  • Layout fidelity $E_{\mathrm{layout}}$: pose change from the canonicalized initialization.
  • Combined:

$$E = \lambda_{\mathrm{drift}} E_{\mathrm{drift}} + \lambda_{\mathrm{vel}} E_{\mathrm{vel}} + \lambda_{\mathrm{pen}} E_{\mathrm{pen}} + \lambda_{\mathrm{layout}} E_{\mathrm{layout}}$$

where the $\lambda$ coefficients are set to fixed values in the experiments (Ma et al., 28 May 2026).

To scale to many objects, optimization is performed locally within scene-tree groups, followed by a global refinement of group poses.
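To make the optimizer concrete, here is a generic Cross-Entropy Method loop applied to a toy one-dimensional energy. The real pipeline evaluates its energy by running a physics simulation; the closed-form energy, the weights, and every name below are stand-in assumptions chosen so the example runs on its own.

```python
import numpy as np


def cem_minimize(energy, mean, std, iters=30, pop=64, elite_frac=0.2, seed=0):
    """Minimize energy(x) with a basic Gaussian Cross-Entropy Method."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    std = np.asarray(std, dtype=float)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        # Sample candidate perturbed poses around the current estimate.
        samples = rng.normal(mean, std, size=(pop, mean.size))
        scores = np.array([energy(s) for s in samples])
        # Refit the sampling distribution to the lowest-energy candidates.
        elite = samples[np.argsort(scores)[:n_elite]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean


TABLE_TOP_Z = 1.0   # hypothetical support height
INIT_Z = 0.9        # hypothetical initial height, slightly inside the table


def toy_energy(pose):
    """Toy energy: a penetration penalty plus a layout-fidelity penalty."""
    z = pose[0]
    penetration = max(0.0, TABLE_TOP_Z - z) ** 2
    layout = (z - INIT_Z) ** 2
    return 10.0 * penetration + 1.0 * layout


best_pose = cem_minimize(toy_energy, mean=[INIT_Z], std=[0.2])
```

The result settles between the initial height and the table top, closer to the table top because the penetration term carries the larger weight. This shows the trade-off that a weighted sum of physical and layout terms encodes.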

2.5. Experimental Results and Benchmarks

Evaluated on Replica (synthetic), ScanNet++ (real), and an Internet+stylized set, REST3D achieves:

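| Dataset | Collision Rate | Stability Rate | Drift (m) | LinVel (m/s) | AngVel (rad/s) | CD | F-Score | B-IoU |
|---|---|---|---|---|---|---|---|---|
| Replica | 0.0% | 95.8% | 0.094 | 0.152 | 0.557 | 0.007 | 0.919 | 0.37 |
| ScanNet++ | 5.9% | 93.6% | 0.080 | 0.159 | 1.039 | 0.019 | 0.807 | 0.20 |
| Custom | 1.2% | 95.5% | 0.017 | 0.140 | 0.468 | - | - | - |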

Notably, the prior SAM3D† baseline had 66.7% collision and only 8.3% stability on Replica, signifying REST3D's advancement in simulation fidelity (Ma et al., 28 May 2026).

Ablation studies confirm the necessity of canonicalization for collision reduction and the importance of all energy terms for balanced physical and geometric fidelity.

2.6. VR Applications and Limitations

REST3D scenes are directly imported into a real-time VR system, enabling physically correct interactions with reconstructed objects via Meta Quest Pro and Isaac Gym. Demonstrations show robust static behavior and natural hand-object manipulation.

Limitations are primarily related to occasional detection misses by the VLM (e.g., wall-mounted shelves), lack of articulated or non-rigid object support, and that wall-attached objects may drift to the floor because walls are not explicitly optimized as rigid supports in the CEM (Ma et al., 28 May 2026).

3. Comparative Insights and Outlook

Both REST3D paradigms address critical bottlenecks at the intersection of data efficiency and physical validity in 3D scene understanding. REST3D for segmentation transforms semi-supervised 3D language grounding by activating pseudo-labels with dynamic trust, while REST3D for reconstruction offers a physics-aware single-image pipeline that efficiently marries image fidelity to simulation readiness.

Potential future extensions for both systems include multi-modal RGB-D reasoning, 3D visual grounding, 3D visual question answering, and more extensive use of foundation vision-language models or simulated agents, provided architectural constraints and dataset annotations are managed (Chen et al., 17 Apr 2025, Ma et al., 28 May 2026).
