
Continuous 3D Perception Model with Persistent State (2501.12387v1)

Published 21 Jan 2025 in cs.CV

Abstract: We present a unified framework capable of solving a broad range of 3D tasks. Our approach features a stateful recurrent model that continuously updates its state representation with each new observation. Given a stream of images, this evolving state can be used to generate metric-scale pointmaps (per-pixel 3D points) for each new input in an online fashion. These pointmaps reside within a common coordinate system, and can be accumulated into a coherent, dense scene reconstruction that updates as new images arrive. Our model, called CUT3R (Continuous Updating Transformer for 3D Reconstruction), captures rich priors of real-world scenes: not only can it predict accurate pointmaps from image observations, but it can also infer unseen regions of the scene by probing at virtual, unobserved views. Our method is simple yet highly flexible, naturally accepting varying lengths of images that may be either video streams or unordered photo collections, containing both static and dynamic content. We evaluate our method on various 3D/4D tasks and demonstrate competitive or state-of-the-art performance in each. Project Page: https://cut3r.github.io/

Summary

  • The paper introduces CUT3R, a continuous updating transformer that maintains and refines a persistent state for online 3D reconstruction.
  • It leverages bidirectional interactions between visual tokens and state tokens through interconnected transformer decoders to predict metric-scale pointmaps and camera poses.
  • The model achieves state-of-the-art performance with real-time processing (~17 FPS), robustly handling dynamic scenes and sparse image collections.

This paper introduces CUT3R (Continuous Updating Transformer for 3D Reconstruction), a unified framework for online 3D perception that processes image streams (videos or photo collections) to generate dense 3D reconstructions and estimate camera parameters without prior camera information. The core idea is a stateful recurrent model that maintains and continuously updates a persistent internal state, representing the accumulated understanding of the 3D scene.

Methodology: CUT3R

  1. Input & Output: Takes a stream of RGB images as input. For each new image, it outputs:
    • A metric-scale pointmap (per-pixel 3D points) in the current camera's coordinate frame (P_self).
    • A metric-scale pointmap in a common world coordinate frame (P_world, defined by the first image).
    • The 6-DoF camera pose (T_world<-self) transforming the current camera frame to the world frame.
  2. Persistent State: The model maintains a state represented as a set of learnable tokens (s). This state is initialized once and updated recurrently.
  3. State-Input Interaction: For each incoming image I:
    • The image is encoded into visual tokens F using a ViT encoder (Encoder_i).
    • A learnable "pose token" z_in is prepended to F.
    • F and s interact bidirectionally using two interconnected Transformer decoders (a minimal attention sketch follows this list). This performs both:
      • State Update: Integrates information from the current image F into the state tokens s.
      • State Readout: Retrieves context from the past observations stored in s to enrich the image tokens F and the output pose token z_out.
    • The updated state s_new replaces s for the next step.
    • Prediction heads operate on the enriched tokens F and z_out to produce P_self, P_world, and T_world<-self. Predicting both P_self and P_world along with T_world<-self simplifies training, allowing direct supervision on datasets with only partial annotations (e.g., only depth or only pose).
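
The bidirectional interaction can be pictured as paired cross-attention between the image/pose tokens and the state tokens. Below is a minimal sketch of one such interconnected decoder block, assuming standard multi-head cross-attention in PyTorch; the actual CUT3R blocks (layer counts, feed-forward layers, normalization) may differ.

import torch.nn as nn

class InterconnectedDecoderBlock(nn.Module):
    # Hypothetical sketch: image/pose tokens read from the state while the
    # state tokens are updated from the image; the real CUT3R layers may differ.
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.readout_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.update_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_f = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, f, s):
        # f: [B, 1+N, dim] pose token + visual tokens, s: [B, K, dim] state tokens
        f_read, _ = self.readout_attn(self.norm_f(f), s, s)  # state readout
        s_upd, _ = self.update_attn(self.norm_s(s), f, f)    # state update
        return f + f_read, s + s_upd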

# Pseudocode for one step of online processing
def process_image(I, s_old):
    F = Encoder_i(I)                      # encode image into visual tokens
    z_in = learnable_pose_token()         # learnable pose token prepended to F

    # Simultaneous state update and state readout
    (z_out, F_enriched), s_new = Decoders([z_in, F], s_old)

    # Prediction heads
    P_self = Head_p(F_enriched)           # pointmap in the current camera frame
    P_world = Head_pw(F_enriched, z_out)  # pointmap in the world frame
    T_world_self = Head_T(z_out)          # 6-DoF pose, camera -> world

    return P_self, P_world, T_world_self, s_new
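
A possible streaming loop built on the step above (a sketch using the pseudocode's names, not an official API): the state is carried across frames while the world-frame pointmaps and poses are accumulated into a growing reconstruction.

def process_stream(images, s_init):
    # Online reconstruction: carry the state forward and accumulate
    # world-frame pointmaps and camera poses frame by frame.
    s = s_init
    reconstruction, trajectory = [], []
    for I in images:
        P_self, P_world, T_world_self, s = process_image(I, s)
        reconstruction.append(P_world)   # dense points in the shared world frame
        trajectory.append(T_world_self)  # camera pose for this frame
    return reconstruction, trajectory, s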

  4. Inferring Unseen Regions: The model can infer geometry for unobserved viewpoints.
    • A virtual camera view is provided as a query, represented by a raymap R (encoding ray origins and directions per pixel).
    • The raymap is encoded into tokens F_r using a separate, lightweight encoder (Encoder_r).
    • F_r interacts with the current state s via the same decoders, but only performs state readout (the state s is not updated).
    • Prediction heads are used on the resulting F_r_enriched and z_out to predict the corresponding pointmap (P_world) and color (C) for the queried view. This leverages the scene priors captured in the state s.
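
For intuition, a raymap for a virtual view can be built from camera intrinsics and a camera-to-world pose. The helper below is a hypothetical sketch (K, T_world_cam, and the [origin, direction] layout are assumptions, not the paper's exact encoding).

import numpy as np

def make_raymap(K, T_world_cam, H, W):
    # Hypothetical helper: per-pixel ray origins and directions for a virtual view.
    # K: 3x3 intrinsics, T_world_cam: 4x4 camera-to-world pose.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # homogeneous pixel coords [H, W, 3]
    dirs_cam = pix @ np.linalg.inv(K).T                     # back-project to camera-frame rays
    dirs_world = dirs_cam @ T_world_cam[:3, :3].T           # rotate rays into the world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(T_world_cam[:3, 3], dirs_world.shape).copy()
    return np.concatenate([origins, dirs_world], axis=-1)   # [H, W, 6] raymap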

Training

  1. Objective: Trained end-to-end using:
    • A confidence-aware L2 regression loss (L_conf) for both predicted pointmaps (P_self, P_world), comparing against ground-truth points. The loss incorporates a predicted confidence score (c) per point: $L_{conf} \propto c \cdot \| \hat{x}/\hat{s} - x/s \|_2 - \alpha \log c$. Metric scale is enforced by setting the predicted scale $\hat{s}$ equal to the ground-truth scale $s$ when available (a loss sketch follows this list).
    • An L2 loss (L_pose) on the predicted pose (parameterized as quaternion and translation) compared to ground truth.
    • An MSE loss (L_rgb) on predicted color when querying with raymaps.
  2. Strategy:
    • Datasets: Trained on a large, diverse collection of 32 datasets (e.g., CO3Dv2, ARKitScenes, ScanNet++, TartanAir, Waymo, MegaDepth, DynamicStereo) covering static/dynamic, indoor/outdoor, real/synthetic scenes, including datasets with only partial annotations (e.g., pose-only like RealEstate10K, single-view depth like Synscapes).
    • Curriculum Learning: Employed a multi-stage training strategy:
      1. Initial training on 4-view sequences from static datasets at 224x224 resolution.
      2. Incorporate dynamic scene datasets and partially annotated datasets.
      3. Increase resolution (max side 512px, varied aspect ratios).
      4. Freeze the encoder, train decoders/heads on longer sequences (4-64 views) to improve long-context reasoning.
    • Architecture: ViT-Large encoder (initialized from DUSt3R), ViT-Base decoders, 768 state tokens (dim 768), lightweight raymap encoder.
    • Hardware: Trained on 8xA100 GPUs (80GB).
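
A minimal sketch of such a confidence-aware loss in PyTorch (the normalization details and the alpha weight are assumptions; the paper's exact formulation may differ):

import torch

def conf_loss(pred_pts, gt_pts, conf, pred_scale, gt_scale, alpha=0.2, metric=True):
    # Confidence-weighted L2 regression on scale-normalized pointmaps.
    # With metric ground truth available, the predicted scale is pinned to the
    # ground-truth scale so outputs stay in metric units.
    s_hat = gt_scale if metric else pred_scale
    err = torch.linalg.norm(pred_pts / s_hat - gt_pts / gt_scale, dim=-1)  # per-point L2 error
    return (conf * err - alpha * torch.log(conf)).mean()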

Applications and Evaluation

  • Tasks: Evaluated on monocular depth estimation, video depth estimation (consistency), camera pose estimation, and 3D reconstruction from sparse views.
  • Performance: Achieves state-of-the-art or competitive results, particularly strong among online methods that don't require offline optimization or global alignment post-processing.
    • Outperforms online baseline Spann3R significantly.
    • Competitive with or surpasses optimization-based methods like DUSt3R-GA and MonST3R-GA in some settings, while being much faster (e.g., ~17 FPS vs <1 FPS).
    • Handles dynamic scenes effectively, unlike methods assuming static scenes.
    • Demonstrates effective reconstruction from sparse image collections (2-5 frames).
  • State Update: An analysis ("revisiting" experiment) shows that processing images again using the final state (which has seen all images) improves reconstruction accuracy, confirming the state effectively accumulates and refines scene information over time.
  • Unseen Region Inference: Qualitative results show the ability to generate plausible metric-scale geometry for queried unseen views, including structures not directly visible in the input images, demonstrating captured 3D priors.

Implementation Considerations

  • Online Processing: The model operates sequentially, processing each frame as it arrives and updating the reconstruction, suitable for real-time applications like robotics or AR.
  • Flexibility: Naturally handles varying numbers of input images, from single images to long videos or unordered photo collections. It does not require known camera intrinsics or extrinsics.
  • Metric Scale: Directly outputs pointmaps and poses in metric units (meters), simplifying integration with real-world systems.
  • Computational Cost: Uses ViT-Large/Base architectures. Inference speed reported around 17 FPS on an A100 GPU for 512px inputs, significantly faster than offline optimization methods but potentially demanding for resource-constrained devices. Training requires significant resources (8xA100s).
  • Code Availability: A project page is provided (https://cut3r.github.io/), suggesting code might be available.

Limitations

  • Drift: Like many online methods, it may suffer from accumulated drift over very long sequences without global bundle adjustment or loop closure mechanisms.
  • Generation Quality: Inferring unseen regions is done via deterministic regression, which can lead to blurry or overly smooth results compared to generative approaches, especially for large viewpoint changes.
  • Training Time: Training complex recurrent models on large diverse datasets is computationally intensive.