- The paper presents a novel streaming 3D reconstruction framework whose explicit spatial pointer memory scales naturally with the explored scene.
- The method employs ViT-based encoders and decoders with a 3D hierarchical position embedding, achieving state-of-the-art or highly competitive results on benchmarks such as 7-Scenes and NYU-v2.
- Its online memory fusion mechanism maintains interpretability and efficiency, offering practical benefits for robotics, AR/VR, and autonomous navigation applications.
Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory
Point3R introduces a novel online framework for dense 3D scene reconstruction from image sequences, addressing the limitations of prior memory-based and global-attention paradigms. The core innovation is an explicit spatial pointer memory, which directly associates memory units with 3D positions in the global coordinate system, enabling efficient, scalable, and interpretable streaming reconstruction.
Methodological Contributions
Point3R departs from implicit, fixed-capacity memory mechanisms by maintaining a dynamic set of 3D pointers, each linked to a spatial feature and a specific 3D location. This design ensures that memory capacity naturally scales with the explored scene, mitigating information loss and redundancy inherent in previous approaches such as Spann3R and CUT3R.
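To make the design concrete, here is a minimal sketch of how such a pointer memory could be organized; the `Pointer` and `PointerMemory` names are illustrative, not the paper's actual classes.

```python
from dataclasses import dataclass, field

import torch


@dataclass
class Pointer:
    """One memory unit: a feature tied to an explicit 3D position in the global frame."""
    position: torch.Tensor  # (3,) xyz in the global coordinate system
    feature: torch.Tensor   # (C,) spatial feature associated with that location


@dataclass
class PointerMemory:
    """A dynamic set of pointers that grows as new parts of the scene are observed."""
    pointers: list[Pointer] = field(default_factory=list)

    def add(self, position: torch.Tensor, feature: torch.Tensor) -> None:
        self.pointers.append(Pointer(position, feature))

    def positions(self) -> torch.Tensor:
        # (M, 3) stacked positions, e.g. for distance-based fusion or spatial queries.
        return torch.stack([p.position for p in self.pointers])

    def features(self) -> torch.Tensor:
        # (M, C) stacked features, used as keys/values when image tokens attend to memory.
        return torch.stack([p.feature for p in self.pointers])
```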
The architecture comprises:
- ViT-based Image Encoder: Each input frame is encoded into image tokens.
- Pointer-Image Interaction Decoders: Two intertwined ViT-based decoders facilitate explicit interaction between image tokens and spatial pointer memory, leveraging a learnable pose token to bridge local and global coordinate systems.
- 3D Hierarchical Position Embedding: An extension of RoPE, this embedding injects continuous 3D spatial priors into the attention mechanism, enhancing the spatial awareness of memory querying and fusion (a per-axis rotary sketch follows this list).
- Memory Encoder and Fusion Mechanism: New pointers are generated from the current frame and integrated into memory via a distance-based fusion strategy, ensuring spatial uniformity and efficiency. The fusion threshold adapts dynamically to the spatial extent of the scene.
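The position-embedding description above suggests a rotary scheme applied per spatial axis. Below is a minimal sketch of a 3D rotary embedding under that reading; it captures only the per-axis rotary idea, not the hierarchical, multi-scale formulation in the paper, and `rope_1d`/`rope_3d` are assumed names.

```python
import torch


def rope_1d(x: torch.Tensor, coords: torch.Tensor, base: float = 100.0) -> torch.Tensor:
    """Rotary embedding along one continuous coordinate.

    x:      (N, D) features with D even
    coords: (N,) continuous positions along a single axis (e.g. metric x, y, or z)
    """
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)   # (half,) frequency bands
    angles = coords[:, None] * freqs[None, :]                     # (N, half) rotation angles
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


def rope_3d(x: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
    """Split channels into three groups and rotate each by one spatial axis.

    x:   (N, D) query/key features with D divisible by 6
    xyz: (N, 3) continuous 3D positions of the corresponding tokens or pointers
    """
    d = x.shape[-1] // 3
    parts = [rope_1d(x[..., i * d:(i + 1) * d], xyz[:, i]) for i in range(3)]
    return torch.cat(parts, dim=-1)
```

In this reading, both image tokens and memory pointers would receive the embedding before attention, so that relative 3D offsets influence how strongly a token attends to each pointer.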
The explicit pointer memory is updated online as new frames are processed, supporting both static and dynamic scenes, as well as unordered or sparsely sampled image collections.
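A minimal sketch of the distance-based fusion step follows, under the assumption that nearby pointers are merged by simple averaging while distant ones are appended; the paper's actual merge rule and the exact form of the adaptive threshold may differ, and `adaptive_threshold`/`fuse_pointers` are illustrative names.

```python
import torch


def adaptive_threshold(mem_xyz: torch.Tensor, alpha: float = 0.01) -> float:
    """Illustrative adaptive radius: a fixed fraction of the current scene extent."""
    extent = (mem_xyz.max(dim=0).values - mem_xyz.min(dim=0).values).norm()
    return float(alpha * extent)


def fuse_pointers(mem_xyz, mem_feat, new_xyz, new_feat, tau):
    """Merge pointers from the current frame into memory with a distance threshold tau.

    mem_xyz: (M, 3), mem_feat: (M, C)  existing pointer positions and features
    new_xyz: (N, 3), new_feat: (N, C)  pointers generated from the current frame
    A new pointer within tau of an existing one is averaged into it; otherwise it
    is appended, so memory keeps roughly uniform spatial coverage as it grows.
    """
    if mem_xyz.numel() == 0:
        return new_xyz.clone(), new_feat.clone()
    dists = torch.cdist(new_xyz, mem_xyz)          # (N, M) pairwise distances
    nearest, idx = dists.min(dim=1)
    mem_xyz, mem_feat = mem_xyz.clone(), mem_feat.clone()
    for i in range(new_xyz.shape[0]):
        if nearest[i] < tau:                       # close to an existing pointer: fuse
            j = idx[i]
            mem_xyz[j] = 0.5 * (mem_xyz[j] + new_xyz[i])
            mem_feat[j] = 0.5 * (mem_feat[j] + new_feat[i])
        else:                                      # unexplored region: allocate a new pointer
            mem_xyz = torch.cat([mem_xyz, new_xyz[i:i + 1]], dim=0)
            mem_feat = torch.cat([mem_feat, new_feat[i:i + 1]], dim=0)
    return mem_xyz, mem_feat
```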
Experimental Results
Point3R is evaluated on a comprehensive suite of tasks: dense 3D reconstruction, monocular and video depth estimation, and camera pose estimation. The method is benchmarked against both optimization-based (e.g., DUSt3R-GA, MASt3R-GA) and online memory-based (e.g., Spann3R, CUT3R) baselines.
3D Reconstruction: On the 7-Scenes and NRGBD datasets, Point3R achieves state-of-the-art or highly competitive results. For example, on 7-Scenes it attains a mean accuracy of 0.124 and a mean completion of 0.139 (lower is better), outperforming CUT3R and Spann3R on most metrics. The method demonstrates robustness to sparse inputs and minimal frame overlap, highlighting the efficacy of explicit spatial memory.
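For reference, accuracy and completion are Chamfer-style point-distance metrics; the sketch below shows only their core definition (evaluation protocols typically also align and subsample the point clouds, which is omitted here).

```python
import torch


def accuracy_completion(pred_xyz: torch.Tensor, gt_xyz: torch.Tensor):
    """Chamfer-style reconstruction metrics (lower is better).

    accuracy:   mean distance from each predicted point to its nearest GT point
    completion: mean distance from each GT point to its nearest predicted point
    pred_xyz: (P, 3) predicted point cloud, gt_xyz: (G, 3) ground-truth point cloud
    """
    d = torch.cdist(pred_xyz, gt_xyz)              # (P, G) pairwise distances
    accuracy = d.min(dim=1).values.mean()
    completion = d.min(dim=0).values.mean()
    return accuracy.item(), completion.item()
```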
Monocular and Video Depth Estimation: On NYU-v2, Sintel, Bonn, and KITTI, Point3R consistently matches or surpasses prior methods. Notably, on NYU-v2 it achieves an Abs Rel of 0.079 and a δ < 1.25 of 92.0%, outperforming all baselines. In video depth estimation, Point3R excels in both scale-invariant and metric-scale settings, particularly on dynamic datasets.
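Abs Rel and δ < 1.25 are the standard monocular-depth metrics; a minimal sketch of their computation over valid pixels is below (in the scale-invariant protocol, predictions are usually median-aligned to ground truth first, which this sketch omits).

```python
import torch


def depth_metrics(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6):
    """Abs Rel and the delta < 1.25 inlier ratio over valid (gt > 0) pixels."""
    valid = gt > eps
    p, g = pred[valid], gt[valid]
    abs_rel = ((p - g).abs() / g).mean()           # mean relative error
    ratio = torch.maximum(p / g, g / p)
    delta_1 = (ratio < 1.25).float().mean()        # fraction of "close enough" pixels
    return abs_rel.item(), delta_1.item()
```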
Camera Pose Estimation: While Point3R performs comparably to other online methods, a performance gap remains relative to optimization-based approaches, especially in static scenes. This is attributed to the growing spatial extent of pointer memory, which can introduce interference in pose estimation for long sequences.
Implementation and Training
Point3R is implemented with ViT-Large encoders and ViT-Base decoders, initialized from DUSt3R pre-trained weights. The memory encoder is lightweight, and the overall model is trained on 8 H100 GPUs for 7 days, reflecting a low computational cost relative to the scale and diversity of the training data (14 datasets spanning static/dynamic, indoor/outdoor, real/synthetic scenes).
The training strategy involves three stages, progressively increasing input sequence length and resolution. The memory fusion mechanism is disabled in early training to stabilize learning, then enabled for efficiency in later stages.
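A staged schedule of this kind might be expressed as follows; the frame counts and resolutions are purely hypothetical placeholders, not the paper's actual settings, and only the pattern (longer sequences, higher resolution, fusion enabled late) mirrors the description above.

```python
# Hypothetical three-stage schedule; the values below are illustrative placeholders.
stages = [
    {"name": "stage 1", "frames": 4,  "resolution": 224, "memory_fusion": False},
    {"name": "stage 2", "frames": 8,  "resolution": 512, "memory_fusion": False},
    {"name": "stage 3", "frames": 16, "resolution": 512, "memory_fusion": True},
]

for cfg in stages:
    fusion = "on" if cfg["memory_fusion"] else "off"
    print(f"{cfg['name']}: frames={cfg['frames']}, res={cfg['resolution']}, fusion={fusion}")
```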
Ablation and Analysis
Ablation studies confirm the importance of both the memory fusion mechanism and the 3D hierarchical position embedding. Removing the fusion mechanism increases memory size and runtime, while omitting the position embedding degrades reconstruction accuracy and normal consistency.
The explicit pointer memory provides interpretability and adaptability, as each memory unit corresponds to a real 3D location. The fusion mechanism effectively controls memory growth and computational cost, with minimal impact on reconstruction quality.
Implications and Future Directions
Point3R's explicit spatial pointer memory paradigm offers several practical advantages:
- Scalability: Memory grows with scene exploration, supporting large-scale and long-horizon reconstruction without fixed-capacity bottlenecks.
- Efficiency: Online processing and memory fusion maintain tractable runtime and memory usage.
- Generalization: The method is agnostic to scene dynamics and input ordering, making it suitable for embodied agents and real-world robotics.
- Interpretability: Explicit spatial association of memory units facilitates debugging and adaptation to downstream tasks.
The primary limitation is the potential for pointer proliferation in very large or complex scenes, which can affect pose estimation. Future work may focus on more sophisticated pointer management, hierarchical memory organization, or improved pointer-image interaction to further enhance scalability and accuracy.
Broader Impact
Point3R advances the state of streaming 3D reconstruction, with direct applications in robotics, AR/VR, autonomous navigation, and digital twin creation. Its explicit, interpretable memory design aligns with the need for transparent and adaptable AI systems in safety-critical and interactive environments. The framework's low training cost and strong empirical performance position it as a practical foundation for future research in online 3D perception and scene understanding.