Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

Published 7 May 2026 in cs.CV | (2605.05749v1)

Abstract: Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.


Authors (4)

Summary

The paper introduces a ray-aware pointer memory that integrates ray geometry with dense feature extraction, enabling robust incremental 3D reconstruction.
It achieves state-of-the-art accuracy and reduced memory footprint by using a stochastic retain-or-replace update mechanism to handle redundant spatial pointers.
Empirical results on multiple benchmarks demonstrate enhanced pose estimation and depth prediction, with significant improvements over prior online methods.

Ray-Aware Pointer Memory for Streaming 3D Reconstruction

Methodological Innovations

This work introduces a pointer-based streaming 3D reconstruction framework that explicitly incorporates ray geometry into the long-term scene memory. Each pointer stores a 3D position and learned features together with the ray direction of the observation and a timestamp. This design enables geometry-aware matching and makes incremental scene integration more robust to viewpoint variation and appearance aliasing.
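
As a rough illustration of the pointer record described above, the following sketch stores the four fields in a small Python dataclass. The class name `Pointer`, the helper `make_pointer`, the field names, and the convention that the ray points from the camera centre to the 3D point are all assumptions made for this example; the paper's actual data layout is not given here.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Pointer:
    """Hypothetical memory pointer: position, viewing ray, feature, timestamp."""
    position: np.ndarray  # (3,) world-frame 3D position
    ray_dir: np.ndarray   # (3,) viewing-ray direction, normalised below
    feature: np.ndarray   # (D,) learned feature embedding
    timestamp: int        # index of the frame that produced this pointer

    def __post_init__(self):
        self.position = np.asarray(self.position, dtype=np.float64)
        self.feature = np.asarray(self.feature, dtype=np.float64)
        ray = np.asarray(self.ray_dir, dtype=np.float64)
        norm = np.linalg.norm(ray)
        if norm == 0.0:
            raise ValueError("ray direction must be non-zero")
        self.ray_dir = ray / norm  # store a unit vector


def make_pointer(point, camera_center, feature, timestamp):
    """Build a pointer whose ray runs from the camera centre to the point.

    The camera-to-point convention is an assumption of this sketch.
    """
    point = np.asarray(point, dtype=np.float64)
    ray = point - np.asarray(camera_center, dtype=np.float64)
    return Pointer(point, ray, feature, timestamp)
```

Storing the ray as a unit vector keeps later angular comparisons down to a single dot product.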

When a new RGB frame arrives, dense features are extracted and candidate 3D points are predicted in a unified world frame. The novelty lies in a pointer–image interaction module coupled with a memory encoder that aggregates these candidates into a persistent pointer memory, enabling joint prediction of camera pose, depth, and point clouds.

Figure 1: The streaming pipeline uses dense frame encoding, a 3D pointer memory, and ray-aware interactions to incrementally refine reconstruction in long causal sequences.

Unlike prior approaches (e.g., Point3R), memory management avoids fusion or averaging strategies, which degrade local geometric structure. Instead, a retain-or-replace mechanism stochastically decides, for each spatially redundant pointer–observation pair, whether the new or the existing pointer is kept. Explicitly encoding the viewing direction lets the system distinguish viewpoint redundancy, novel structure, and loop revisits, and trigger geometric pose refinement when a loop closure is detected.
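
The contrast between fusion and retain-or-replace can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the replacement probability `p_replace=0.5` are assumptions, since the summary does not state how the stochastic choice is parameterised.

```python
import numpy as np


def retain_or_replace(old_feature, new_feature, rng, p_replace=0.5):
    """Keep exactly one of the two pointers; never blend them.

    p_replace is a hypothetical parameter (probability of keeping the new one).
    Returns the surviving feature and a flag telling whether it was replaced.
    """
    if rng.random() < p_replace:
        return new_feature, True
    return old_feature, False


def merge(old_feature, new_feature):
    """Fusion baseline, shown only for contrast: averages the pair."""
    return 0.5 * (old_feature + new_feature)


rng = np.random.default_rng(0)
old = np.array([1.0, 0.0])
new = np.array([0.0, 1.0])
kept, replaced = retain_or_replace(old, new, rng)
blended = merge(old, new)  # equals neither input
```

Whatever the coin flip returns, the surviving pointer is one of the original observations, whereas the merged vector matches neither. In both cases the redundant pair occupies a single memory slot.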

Unified Geo-Visual Reasoning

The pointer matching process employs a tunable metric combining Euclidean distance in position and cosine angular distance in ray direction. This supports three operational regimes for pointer association:

Local Redundancy: Small positional and angular distances signal redundant measurements from similar viewpoints; these are updated conservatively to avoid oversmoothing.
Loop Revisiting: Small positional distance but large angular separation identifies a revisit from a new direction, which can trigger pose refinement for drift correction.
Novel Geometry: Large positional separation indicates previously unseen structure, which is added to the scene representation as new pointers.
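
The three regimes above can be sketched as a nearest-pointer test followed by an angular test. This is a simplified illustration under stated assumptions: the function name `classify`, the two separate thresholds `tau_pos` and `tau_ang` (and their default values), and the use of the single nearest pointer are all hypothetical. The paper describes a tunable metric combining both distances, whose exact form is not given in this summary.

```python
import numpy as np


def classify(obs_pos, obs_ray, mem_pos, mem_ray, tau_pos=0.05, tau_ang=0.5):
    """Return (regime, index of the matched memory pointer or None).

    mem_pos: (N, 3) pointer positions; mem_ray: (N, 3) unit ray directions.
    tau_pos and tau_ang are hypothetical thresholds chosen for this sketch.
    """
    obs_pos = np.asarray(obs_pos, dtype=np.float64)
    obs_ray = np.asarray(obs_ray, dtype=np.float64)
    obs_ray = obs_ray / np.linalg.norm(obs_ray)
    if len(mem_pos) == 0:
        return "novel", None

    # Euclidean distance from the observation to every stored pointer.
    d_pos = np.linalg.norm(mem_pos - obs_pos, axis=1)
    j = int(np.argmin(d_pos))
    if d_pos[j] > tau_pos:
        return "novel", None  # far from everything in memory

    # Cosine distance between unit rays: 0 = same direction, 2 = opposite.
    d_ang = 1.0 - float(np.dot(mem_ray[j], obs_ray))
    if d_ang <= tau_ang:
        return "redundant", j  # same place, similar viewpoint
    return "loop", j  # same place, different viewpoint


mem_pos = np.array([[0.0, 0.0, 0.0]])
mem_ray = np.array([[0.0, 0.0, 1.0]])
same_view = classify([0.01, 0.0, 0.0], [0.0, 0.0, 1.0], mem_pos, mem_ray)
other_view = classify([0.01, 0.0, 0.0], [0.0, 0.0, -1.0], mem_pos, mem_ray)
far_away = classify([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], mem_pos, mem_ray)
```

In a full system, a "redundant" result would feed the retain-or-replace decision, a "loop" result would count toward triggering pose refinement, and a "novel" result would append a new pointer.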

The stochastic retain-or-replace update policy avoids systematic bias in pointer retention, increases geometric diversity, and maintains bounded memory growth. This contrasts with fixed-capacity or fusion methods, which frequently merge novel geometry spuriously or inflate memory with redundant information.

Figure 2: Comparison of the merge (fusion) update and the retain-or-replace pointer update; the latter achieves higher accuracy with fewer memory pointers.

Empirical Evaluation

Experiments span multiple indoor and outdoor benchmarks (7-Scenes, NRGBD, NYU-v2, Sintel, Bonn, KITTI, ScanNet, TUM-dynamic), addressing dense 3D reconstruction, monocular depth, and online camera pose estimation. The method is benchmarked against both optimization-heavy (global alignment) and online memory-based baselines, including DUSt3R, MASt3R, Spann3R, CUT3R, and Point3R.

Key findings:

3D Reconstruction: The approach achieves the best accuracy and completeness among online baselines on both 7-Scenes and NRGBD, with mean Acc/Comp errors noticeably lower than all of them (Point3R: 0.085 vs. ours: 0.035 on 7-Scenes; see Table 1 in the paper).
Memory Efficiency: The selective adaptive update reduces memory footprint by 20–40% against merging strategies, especially in large or highly repetitive environments.
Pose Estimation: The method is competitive with global optimization pipelines, while remaining strictly causal, and outperforms prior memory-based online methods in ATE, RPE trans, and RPE rot metrics.
Depth Estimation: Zero-shot generalization to both indoor and outdoor datasets is demonstrated, with superior or comparable absolute relative error and inlier rates.

Figure 3: Qualitative reconstructions on NRGBD and 7-Scenes, showing fine geometric details and robustness to viewpoint variation.
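
For readers unfamiliar with the depth metrics named above, the sketch below computes absolute relative error and an inlier rate in their conventional forms. The function name `depth_metrics`, the inlier threshold of 1.25, and the masking of non-positive depths are assumptions based on common practice; the paper's exact evaluation protocol (for example, any scale alignment) is not described in this summary.

```python
import numpy as np


def depth_metrics(pred, gt, delta=1.25):
    """Return (abs_rel, inlier_rate) over pixels with valid depth.

    abs_rel     = mean(|pred - gt| / gt)
    inlier_rate = fraction of pixels with max(pred/gt, gt/pred) < delta
    delta=1.25 is the conventional threshold, assumed here.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = (gt > 0) & (pred > 0)  # skip missing or invalid depths
    pred, gt = pred[valid], gt[valid]

    abs_rel = float(np.mean(np.abs(pred - gt) / gt))
    ratio = np.maximum(pred / gt, gt / pred)
    inlier_rate = float(np.mean(ratio < delta))
    return abs_rel, inlier_rate


# Third pixel is off by a factor of two; the fourth has no ground truth.
abs_rel, inlier_rate = depth_metrics([1.0, 2.0, 4.0, 3.0], [1.0, 2.0, 2.0, 0.0])
```

Lower absolute relative error and a higher inlier rate both indicate better depth predictions.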

Ablation studies show that deterministic always-retain or always-replace policies perform worse than the stochastic scheme, and that feature merging (as in Point3R) degrades both geometric accuracy and memory efficiency. The approach achieves a significantly better trade-off, preserving geometric discriminability and supporting robust loop closure handling.

Figure 4: Reserved memory comparison for merged (Point3R) and retain-or-replace methods; the latter yields greater stability and consistently lower GPU memory consumption.

Theoretical Implications

By integrating viewing direction, this framework moves from purely appearance- and position-driven association to a more complete geometric context, reducing ambiguity in pointer matching under severe occlusions, lighting changes, or texture repetition. The memory update mechanism avoids fusion-induced drift and provides a probabilistic guarantee on scene coverage and redundancy.

Loop closure is handled within a fully causal pipeline, a significant advance over post-hoc optimization and over methods incapable of geometric loop reasoning. Pose refinement is triggered adaptively when a revisit is detected from ray-direction discrepancy, enforcing global consistency during long-horizon mapping.

Practical Impact and Future Work

The memory and compute savings are crucial for practical deployment in robotics, AR, and real-time digital twin systems. The bounded, redundancy-free memory cache supports streaming on resource-constrained platforms, while the competitive geometric accuracy and pose stability enable adoption in demanding SLAM and scene understanding pipelines.

Limitations include sensitivity to outlier pose estimates during incremental integration and reliance on a globally accurate initial camera pose. The update policy could potentially be enhanced using information-theoretic ranking or task-specific labeling for further content-aware pointer selection. Generalization to unconstrained outdoor mapping, more complex loop closure environments, and integration with NeRF-style implicit representations represent promising future directions.

Conclusion

Ray-aware pointer memory with adaptive updates delivers a principled, efficient, and robust streaming 3D reconstruction architecture. By unifying position- and ray-direction-aware reasoning with a selective stochastic update mechanism, it achieves substantial gains in memory efficiency, geometric accuracy, and pose consistency over prior online methods while remaining competitive with global alignment methods, paving the way for scalable and drift-resistant incremental scene reconstruction (2605.05749).
