
SVDC: Consistent Direct Time-of-Flight Video Depth Completion with Frequency Selective Fusion

Published 3 Mar 2025 in cs.CV (arXiv:2503.01257v1)

Abstract: Lightweight direct Time-of-Flight (dToF) sensors are ideal for 3D sensing on mobile devices. However, due to the manufacturing constraints of compact devices and the inherent physical principles of imaging, dToF depth maps are sparse and noisy. In this paper, we propose a novel video depth completion method, called SVDC, by fusing the sparse dToF data with the corresponding RGB guidance. Our method employs a multi-frame fusion scheme to mitigate the spatial ambiguity resulting from the sparse dToF imaging. Misalignment between consecutive frames during multi-frame fusion could cause blending between object edges and the background, which results in a loss of detail. To address this, we introduce an adaptive frequency selective fusion (AFSF) module, which automatically selects convolution kernel sizes to fuse multi-frame features. Our AFSF utilizes a channel-spatial enhancement attention (CSEA) module to enhance features and generates an attention map as fusion weights. The AFSF ensures edge detail recovery while suppressing high-frequency noise in smooth regions. To further enhance temporal consistency, we propose a cross-window consistency loss to ensure consistent predictions across different windows, effectively reducing flickering. Our proposed SVDC achieves optimal accuracy and consistency on the TartanAir and Dynamic Replica datasets. Code is available at https://github.com/Lan1eve/SVDC.

Summary

Video Depth Completion via Frequency Selective Fusion for Mobile Devices

This paper addresses depth completion for mobile devices equipped with lightweight direct Time-of-Flight (dToF) sensors. Because dToF depth maps are inherently sparse and noisy, the authors introduce SVDC, a video depth completion method that fuses sparse dToF measurements with RGB guidance to improve both the accuracy and the temporal consistency of the predicted depth video.
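To make the input fusion concrete, the sketch below shows one plausible way to encode a sparse depth map alongside its RGB frame. The module names, channel counts, and the use of a validity mask are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch (PyTorch) of fusing a sparse dToF depth map with RGB guidance.
# All names and sizes here are assumptions for illustration.
import torch
import torch.nn as nn

class SparseDepthRGBFusion(nn.Module):
    def __init__(self, feat_ch: int = 32):
        super().__init__()
        self.rgb_enc = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
        # The sparse depth is paired with a validity mask so the network can
        # distinguish "no measurement" from "depth = 0".
        self.depth_enc = nn.Sequential(
            nn.Conv2d(2, feat_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1)

    def forward(self, rgb, sparse_depth):
        mask = (sparse_depth > 0).float()  # 1 where the dToF sensor returned a value
        d = self.depth_enc(torch.cat([sparse_depth, mask], dim=1))
        r = self.rgb_enc(rgb)
        return self.fuse(torch.cat([r, d], dim=1))  # fused per-frame feature map
```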

SVDC uses a multi-frame fusion scheme to resolve the spatial ambiguity caused by the sparsity of dToF depth maps. Its key component is the Adaptive Frequency Selective Fusion (AFSF) module, which selects convolution kernel sizes according to local frequency content when fusing multi-frame features. AFSF works in tandem with the Channel-Spatial Enhancement Attention (CSEA) module, which enhances features and produces the attention map used as fusion weights. Together these components recover edge detail while suppressing high-frequency noise in smooth regions, so SVDC preserves edges without sacrificing overall depth-map quality.
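The PyTorch sketch below shows one way such a frequency-selective mechanism could look: parallel convolutions with different kernel sizes (small kernels favor high-frequency edges, large kernels smooth low-frequency regions), blended per pixel by learned attention weights. This is a hedged reading of AFSF, and the `ChannelSpatialAttention` block is a generic stand-in for CSEA, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Generic stand-in for CSEA: channel attention followed by spatial attention."""
    def __init__(self, ch):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(
            nn.Conv2d(ch, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)   # reweight channels
        return x * self.spatial(x)  # reweight spatial locations

class FrequencySelectiveFusion(nn.Module):
    def __init__(self, ch, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        # One conv branch per kernel size; small kernels keep edges sharp,
        # large kernels average out noise in smooth regions.
        self.branches = nn.ModuleList(
            nn.Conv2d(2 * ch, ch, k, padding=k // 2) for k in kernel_sizes)
        self.attn = ChannelSpatialAttention(2 * ch)
        # Per-pixel weights over the kernel branches (softmax-normalized).
        self.select = nn.Conv2d(2 * ch, len(kernel_sizes), 3, padding=1)

    def forward(self, feat_cur, feat_prev_aligned):
        x = torch.cat([feat_cur, feat_prev_aligned], dim=1)
        x = self.attn(x)
        w = torch.softmax(self.select(x), dim=1)              # B x K x H x W
        outs = torch.stack([b(x) for b in self.branches], 1)  # B x K x C x H x W
        return (w.unsqueeze(2) * outs).sum(dim=1)             # weighted fusion
```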

The proposed cross-window consistency loss is another key contribution: it enforces temporal consistency across video frames and mitigates flickering, a common artifact of existing video depth estimation methods. The window-based design fuses frame features within each window while keeping predictions consistent both within and between windows, which is crucial for stable video depth in dynamic scenes.
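A minimal sketch of such a loss follows, assuming two sliding windows that overlap on a few frames; the overlap handling and weighting in SVDC itself may differ.

```python
import torch
import torch.nn.functional as F

def cross_window_consistency_loss(pred_window_a, pred_window_b, overlap):
    """Penalize disagreement on frames shared by two sliding windows.

    pred_window_a, pred_window_b: (B, T, H, W) depth predictions.
    overlap: number of frames shared between the tail of window A
             and the head of window B.
    """
    shared_a = pred_window_a[:, -overlap:]  # last `overlap` frames of window A
    shared_b = pred_window_b[:, :overlap]   # first `overlap` frames of window B
    return F.l1_loss(shared_a, shared_b)
```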

SVDC achieves state-of-the-art accuracy and temporal consistency on the TartanAir and Dynamic Replica benchmarks. Evaluation uses RMSE for spatial accuracy and OPW, an optical-flow-based warping error, for temporal stability. On both metrics, SVDC outperforms existing per-frame processing methods, which the authors attribute to its multi-frame aggregation strategy and frequency-selective fusion process.
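For reference, the sketch below implements the two metrics under common assumptions: standard RMSE over valid pixels, and an OPW-style error that warps the previous frame's prediction into the current frame with optical flow before comparing. The exact masking and normalization in the paper's evaluation may differ.

```python
import torch
import torch.nn.functional as F

def rmse(pred, gt, valid):
    """Root mean squared error over pixels where `valid` (bool mask) is True."""
    diff = (pred - gt)[valid]
    return torch.sqrt((diff ** 2).mean())

def opw_error(pred_t, pred_tm1, flow, valid):
    """OPW-style warping error (an assumed formulation).

    pred_t, pred_tm1: (B, 1, H, W) depth predictions at frames t and t-1.
    flow: (B, 2, H, W) backward optical flow in pixels, ordered (dx, dy).
    valid: (B, 1, H, W) bool mask of pixels where the flow is reliable.
    """
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(flow)  # (2, H, W) pixel grid
    grid = base.unsqueeze(0) + flow                       # target sample coords
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    grid[:, 0] = 2 * grid[:, 0] / (w - 1) - 1
    grid[:, 1] = 2 * grid[:, 1] / (h - 1) - 1
    warped = F.grid_sample(pred_tm1, grid.permute(0, 2, 3, 1), align_corners=True)
    return (warped - pred_t).abs()[valid].mean()
```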

From a practical standpoint, the implications of SVDC are wide-reaching for real-world applications in augmented reality (AR) and virtual reality (VR), where mobile devices are often used for interactive 3D environment mapping. The theoretical advancements provide insights into the frequency-dependent fusion of information in the context of depth imaging, which could inform future developments in low-power, high-efficiency depth sensing for portable devices.

Further exploration into integrating more sophisticated optical flow networks and adaptive feature fusion could yield even better performance, especially in scenarios involving significant motion or intricate textures. In conclusion, this research represents an important step towards robust, real-time depth completion for mobile-based 3D applications, setting a foundation for ongoing advancements in mobile computational photography and real-time spatial mapping.
