RemixFusion: Mixed Representation for RGB-D Reconstruction
- RemixFusion is a residual-based mixed representation framework that combines a coarse explicit TSDF grid with a neural implicit module to capture high-frequency details in large-scale scenes.
- It employs residual bundle adjustment and adaptive gradient amplification to refine camera pose estimation and overcome common smoothing challenges in neural reconstructions.
- The local moving volume strategy ensures scalability and real-time performance by dynamically processing only active regions, thereby reducing memory demands.
RemixFusion denotes a residual-based mixed representation framework for large-scale online RGB-D reconstruction, designed to reconcile the advantages of explicit and implicit scene representations for both mapping and camera pose estimation. It integrates a coarse explicit Truncated Signed Distance Function (TSDF) grid with a neural implicit module that models residuals, enabling both detailed reconstruction and efficient computation. RemixFusion further introduces residual-based bundle adjustment for joint pose optimization, adaptive gradient amplification to improve optimization convergence, and a local moving volume strategy that maintains scalability for large environments (Lan et al., 23 Jul 2025).
1. Motivation and Overview
Online dense RGB-D reconstruction has advanced with the introduction of neural implicit representations (such as neural radiance fields and hash-based encoding), which encode 3D geometry compactly and with high completeness. However, existing neural-based methods often suffer from excessive smoothing and high computational cost when reconstructing fine geometric details in large-scale scenes. In contrast, explicit TSDF grids afford efficient fusion and robust camera tracking, but at prohibitive memory costs as scene size scales.
RemixFusion is devised to combine these complementary strengths. It maintains a coarse, memory-efficient explicit TSDF grid to capture base scene structure while deploying a small implicit neural module, tasked solely with learning high-frequency residuals that the grid fails to represent. This division of labor provides an explicit backbone for robust mapping and tracking, while leveraging neural flexibility and memory efficiency for detail enhancement.
2. Residual-based Mixed Scene Representation
The core scene representation in RemixFusion follows a hierarchical, additive formulation:

$$F(\mathbf{x}) = F_{G}(\mathbf{x}) + \Delta F_{\theta}(\mathbf{x}),$$

where $F_{G}$ is the explicit coarse TSDF-based geometry (realized as a low-resolution grid $G$), and $\Delta F_{\theta}$ is a residual field learned by a neural implicit module parameterized by $\theta$.
For a given 3D scene query point $\mathbf{x}$, the output property (e.g., TSDF value or RGB color) is evaluated as:

$$F(\mathbf{x}) = \mathrm{TriLerp}(\mathbf{x}, G) + f_{\theta}\big(h(\mathbf{x}), \gamma(\mathbf{x})\big),$$

where $\mathrm{TriLerp}(\cdot, G)$ denotes standard trilinear interpolation over the grid $G$, and $f_{\theta}$ is the prediction from the neural decoder applied to a hash encoding $h(\mathbf{x})$ and positional encoding $\gamma(\mathbf{x})$ of $\mathbf{x}$.
This structure allows high-fidelity geometry to be represented without forcing the neural network to encode coarse spatial structure, which is handled efficiently by the explicit grid. As a result, reconstruction detail is greatly enhanced while memory and computational demands remain bounded.
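As a concrete illustration, the following PyTorch sketch implements such an additive query. It is a minimal sketch under stated assumptions, not the paper's implementation: the class name `ResidualMixedField`, the grid resolution, and the plain frequency encoding (standing in for the paper's hash-plus-positional encoding) are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualMixedField(nn.Module):
    """Coarse explicit TSDF grid plus a neural residual decoder (sketch)."""

    def __init__(self, grid_res=64, enc_freqs=4, hidden=64):
        super().__init__()
        # Low-resolution explicit TSDF grid, shape (1, 1, D, H, W), init to free space.
        self.tsdf_grid = nn.Parameter(torch.ones(1, 1, grid_res, grid_res, grid_res))
        self.enc_freqs = enc_freqs
        in_dim = 3 + 3 * 2 * enc_freqs  # xyz plus sin/cos frequency features
        self.decoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # residual TSDF value
        )

    def encode(self, x):
        # Simple frequency encoding; stands in for the paper's hash + positional encoding.
        feats = [x]
        for k in range(self.enc_freqs):
            feats.append(torch.sin((2 ** k) * torch.pi * x))
            feats.append(torch.cos((2 ** k) * torch.pi * x))
        return torch.cat(feats, dim=-1)

    def forward(self, x):
        # x: (N, 3) query points in normalized grid coordinates [-1, 1]^3.
        pts = x.view(1, -1, 1, 1, 3)
        # grid_sample on a 5D input performs trilinear interpolation.
        coarse = F.grid_sample(self.tsdf_grid, pts, mode='bilinear',
                               align_corners=True).view(-1, 1)
        residual = self.decoder(self.encode(x))
        return coarse + residual  # F(x) = F_G(x) + ΔF_θ(x)
```

Because the explicit term is a single trilinear lookup per query, the decoder only has to fit the small residual signal, which is the division of labor described above.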
3. Residual-based Camera Pose Estimation
Camera tracking and joint pose optimization in RemixFusion are also based on a residual philosophy. The method first estimates an initial pose $\hat{T}_i$ for each frame using a fast front-end, operating within a local moving volume. Subsequently, instead of optimizing per-frame poses $\{T_i\}$ directly, RemixFusion refines them via residual bundle adjustment (RBA):

$$T_i = \Delta T_{\phi}(t_i)\,\hat{T}_i,$$

where $\Delta T_{\phi}$ is a small MLP predicting residual pose corrections, and $t_i$ is a normalized frame index. This strategy focuses the optimization on pose increments rather than absolute values, improving both local convergence and global consistency when jointly aligning multiple views.
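A minimal PyTorch sketch of this construction follows. The parameterization (a 6-DoF twist mapped through the matrix exponential) and the names `ResidualPoseMLP` and `se3_exp` are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def se3_exp(xi):
    # xi: (..., 6) twist = (omega, v); returns (..., 4, 4) SE(3) matrices
    # via the matrix exponential of the 4x4 twist matrix [[omega^, v], [0, 0]].
    omega, v = xi[..., :3], xi[..., 3:]
    T = torch.zeros(*xi.shape[:-1], 4, 4, dtype=xi.dtype, device=xi.device)
    T[..., 0, 1], T[..., 0, 2] = -omega[..., 2], omega[..., 1]
    T[..., 1, 0], T[..., 1, 2] = omega[..., 2], -omega[..., 0]
    T[..., 2, 0], T[..., 2, 1] = -omega[..., 1], omega[..., 0]
    T[..., :3, 3] = v
    return torch.matrix_exp(T)

class ResidualPoseMLP(nn.Module):
    """Maps a normalized frame index to a residual pose correction (sketch)."""

    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 6))
        # Zero-init the output so the initial correction is the identity.
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, t, T_init):
        # t: (N, 1) normalized frame indices in [0, 1]; T_init: (N, 4, 4).
        return se3_exp(self.net(t)) @ T_init  # T_i = ΔT_φ(t_i) · T_i^init
```

Zero-initializing the output layer makes the predicted correction the identity at the start, so optimization begins exactly at the front-end's pose estimates; this is what makes the adjustment residual rather than absolute.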
4. Adaptive Gradient Amplification in Bundle Adjustment
To further enhance joint pose refinement, RemixFusion incorporates adaptive gradient amplification (GA) into the optimization procedure. Near the TSDF zero-crossing, where surfaces occur, gradients may be too shallow or even discontinuous, stalling convergence and causing suboptimal alignment. RemixFusion addresses this by amplifying the gradient magnitude near these critical regions:

$$\tilde{g} = \mathrm{clamp}(g,\, -\lambda\tau,\, \lambda\tau), \qquad \lambda > 1.$$

Here, $\lambda$ amplifies the clamping threshold $\tau$, preserving stronger gradients and thus enabling more decisive corrections during optimization. This mechanism yields better exploration of the pose space and convergence to better minima.
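One way to realize such amplification in PyTorch is a custom autograd function that acts as the identity in the forward pass and enlarges the gradient clamping threshold near the zero-crossing in the backward pass. This is a hedged sketch: the constants `band`, `tau`, and `lam`, and the exact clamping rule, are illustrative assumptions rather than the paper's definition.

```python
import torch

class AmplifiedGrad(torch.autograd.Function):
    """Identity forward; enlarged gradient clamp near the TSDF zero-crossing."""

    @staticmethod
    def forward(ctx, sdf, band=0.05, tau=1.0, lam=2.0):
        ctx.save_for_backward(sdf)
        ctx.band, ctx.tau, ctx.lam = band, tau, lam
        return sdf.clone()

    @staticmethod
    def backward(ctx, grad_out):
        (sdf,) = ctx.saved_tensors
        near = sdf.abs() < ctx.band  # samples close to the surface
        # Near the surface the clamp threshold is enlarged by lam (> 1),
        # preserving stronger gradients exactly where they tend to vanish.
        thresh = torch.where(near,
                             torch.full_like(grad_out, ctx.lam * ctx.tau),
                             torch.full_like(grad_out, ctx.tau))
        grad = torch.clamp(grad_out, min=-thresh, max=thresh)
        return grad, None, None, None
```

In use, `AmplifiedGrad.apply` would wrap the predicted TSDF samples before the tracking loss is computed, so only the backward-pass behavior changes.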
5. Moving Volume and Divide-and-Conquer Scalability
To manage very large environments, RemixFusion applies a local moving volume strategy. The explicit TSDF grid is made to "move" with the camera, shifting or duplicating as necessary when the camera travels beyond a certain threshold from the anchor point. This sliding window preserves spatial locality and overlap, ensuring both continuity in mapping and bounded local memory requirements.
Combined with the neural residual module, this divide-and-conquer decomposition supports efficient online reconstruction, as only the currently relevant volume is processed and stored at high fidelity while the rest of the scene is coarsely maintained or dynamically offloaded.
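The bookkeeping this strategy implies can be sketched as follows (NumPy; the class `MovingVolume`, its thresholds, and the reset-to-free-space policy are hypothetical simplifications, and the full system would stream vacated voxels into the coarse global map rather than discard them).

```python
import numpy as np

class MovingVolume:
    """Active TSDF block that re-anchors itself as the camera moves (sketch)."""

    def __init__(self, res=128, voxel_size=0.02, shift_thresh=0.5):
        self.res, self.voxel_size = res, voxel_size
        self.shift_thresh = shift_thresh      # meters of travel before re-anchoring
        self.anchor = np.zeros(3)             # world position of the volume center
        self.tsdf = np.ones((res, res, res), np.float32)  # truncated free space

    def maybe_shift(self, cam_pos):
        offset = cam_pos - self.anchor
        if np.linalg.norm(offset) < self.shift_thresh:
            return False
        shift_vox = np.round(offset / self.voxel_size).astype(int)
        # Slide the grid so the camera is re-centered; the overlap is preserved.
        self.tsdf = np.roll(self.tsdf, tuple(-shift_vox), axis=(0, 1, 2))
        # Reset the vacated slabs (the full system would first stream them out
        # to the coarse global map instead of discarding them).
        for ax, s in enumerate(shift_vox):
            if s == 0:
                continue
            idx = [slice(None)] * 3
            idx[ax] = slice(-s, None) if s > 0 else slice(None, -s)
            self.tsdf[tuple(idx)] = 1.0
        self.anchor = self.anchor + shift_vox * self.voxel_size
        return True
```

Rolling the array preserves the voxels shared between the old and new volume positions, which is what keeps mapping continuous across shifts.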
6. Experimental Evaluation
RemixFusion was evaluated against both state-of-the-art explicit and implicit methods on challenging datasets such as BS3D and uHumans2. Results indicate that the mixed residual-based representation improved mapping quality in terms of accuracy (measured in centimeters), completeness, F1-score, and depth error (D-L1). Tracking performance, assessed by the Absolute Trajectory Error (ATE RMSE), also improved due to the RBA and adaptive GA components. Importantly, RemixFusion sustained real-time frame rates (over 25 FPS in a lightweight setting) with a smaller GPU memory footprint than other neural or 3D Gaussian splatting techniques. These results support the claim that RemixFusion's architecture is robust and scalable for high-quality, large-scale online reconstruction.
7. Mathematical Formulations and Technical Summary
Key mathematical descriptions underpin RemixFusion’s methodology:
- Scene Representation: $F(\mathbf{x}) = F_{G}(\mathbf{x}) + \Delta F_{\theta}(\mathbf{x})$
- Query Mapping: $F(\mathbf{x}) = \mathrm{TriLerp}(\mathbf{x}, G) + f_{\theta}\big(h(\mathbf{x}), \gamma(\mathbf{x})\big)$
- Pose Refinement: $T_i = \Delta T_{\phi}(t_i)\,\hat{T}_i$
- Gradient Amplification: $\tilde{g} = \mathrm{clamp}(g,\, -\lambda\tau,\, \lambda\tau)$ with $\lambda > 1$
The collective framework allows RemixFusion to robustly handle high-frequency geometric details, scalable mapping, and pose optimization in large-scale online settings, bridging the performance–efficiency gap left by prior methods. Extensive experimental evidence demonstrates its superior performance in accuracy, efficiency, and robustness for both mapping and tracking in real-time 3D scene reconstruction tasks (Lan et al., 23 Jul 2025).