- The paper introduces a recurrent temporal fusion mechanism that reduces computational complexity from O(T) to O(1), enhancing long-term data integration.
- It implements Efficient Deformable Aggregation as a single CUDA operator to decrease memory usage and boost inference speed.
- Integration of explicit camera parameter encoding and dense depth supervision improves generalization and achieves state-of-the-art performance on the nuScenes benchmark.
Sparse4Dv2: Advancements in Recurrent Temporal Fusion for Multi-View Perception
The Sparse4Dv2 paper presents enhancements to the Sparse4D algorithm, focusing on efficient temporal fusion for multi-view perception. The work targets both the computational efficiency and the accuracy of the perception module, a critical component of autonomous driving systems.
Technical Innovations
Sparse4Dv2 introduces several pivotal modifications to the original Sparse4D framework:
- Recurrent Temporal Fusion: Instead of the multi-frame sampling used in the original Sparse4D, Sparse4Dv2 propagates instance features recurrently from frame to frame. This reduces the per-frame cost of temporal fusion from O(T), where T is the number of historical frames re-sampled, to O(1), improving inference speed and memory utilization while allowing effectively unbounded temporal horizons.
- Efficient Deformable Aggregation (EDA): By optimizing the deformable aggregation operation into a single CUDA operator, Sparse4Dv2 significantly decreases memory usage and improves inference speed. This efficiency makes it particularly suited for high-resolution applications.
- Camera Parameter Encoding: Sparse4Dv2 explicitly integrates camera parameters into the network, which enhances generalization and accuracy in orientation estimation.
- Dense Depth Supervision: Dense depth estimation, supervised with projected LiDAR point clouds as an auxiliary task during training, improves convergence without adding inference cost.
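The contrast between the two temporal fusion strategies can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class names, the simple exponential blend, and the list-based features are all placeholders for the actual instance-feature propagation in Sparse4Dv2.

```python
class RecurrentFusion:
    """Recurrent fusion sketch: one cached state, O(1) cost per frame
    regardless of how many past frames have been absorbed."""

    def __init__(self, decay=0.6):
        self.decay = decay
        self.state = None  # fused features carrying all past information

    def step(self, frame_features):
        if self.state is None:
            self.state = list(frame_features)
        else:
            # Blend the cached temporal state with the current frame once,
            # instead of re-sampling every historical frame.
            self.state = [self.decay * s + (1 - self.decay) * f
                          for s, f in zip(self.state, frame_features)]
        return self.state


def multi_frame_fusion(history):
    """Sampling-based baseline sketch: each step touches all T frames,
    so per-frame cost grows linearly with the temporal window."""
    T = len(history)
    return [sum(frame[i] for frame in history) / T
            for i in range(len(history[0]))]
```

The key point is structural: the recurrent variant never revisits old frames, so extending the temporal horizon is free at inference time.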
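The operation that EDA fuses into a single CUDA kernel can be sketched in plain NumPy: bilinearly sample features at predicted keypoints and immediately combine them with predicted weights. This is an assumed simplification (single-channel maps, one query); the benefit of the fused kernel is avoiding the large intermediate tensor of sampled features.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinear interpolation on a 2D feature map of shape (H, W)."""
    h, w = feat.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    return (feat[y0, x0] * (1 - dx) * (1 - dy) +
            feat[y0, x1] * dx * (1 - dy) +
            feat[y1, x0] * (1 - dx) * dy +
            feat[y1, x1] * dx * dy)

def deformable_aggregate(feature_maps, points, weights):
    """Weighted sum of bilinear samples across views/scales in one pass.

    Doing the sampling and weighting together (as the single CUDA operator
    does) avoids materializing the per-point sampled-feature tensor.
    """
    out = 0.0
    for feat, (x, y), w in zip(feature_maps, points, weights):
        out += w * bilinear_sample(feat, x, y)
    return out
```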
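Explicit camera parameter encoding can be as simple as mapping the flattened intrinsic matrix through a small learned network and injecting the result into the features. The sketch below is purely illustrative: the two-layer MLP, its width, and the random weights stand in for whatever encoder the network actually learns.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((9, 16)) * 0.1   # stand-in for learned weights
W2 = rng.standard_normal((16, 16)) * 0.1

def encode_camera(intrinsics):
    """Map a 3x3 intrinsic matrix to a 16-d embedding via a tiny MLP,
    making camera geometry an explicit network input."""
    h = np.maximum(intrinsics.reshape(-1) @ W1, 0.0)  # ReLU
    return h @ W2
```

Conditioning on intrinsics in this way is what lets the network generalize across cameras with different focal lengths, which the paper credits for better orientation estimates.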
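Dense depth supervision amounts to a masked regression loss against LiDAR points projected into the image. A minimal sketch, assuming the convention that pixels without a LiDAR return hold depth 0 (the exact loss and masking in the paper may differ):

```python
import numpy as np

def dense_depth_loss(pred_depth, lidar_depth):
    """Masked L1 loss: supervise only pixels with a projected LiDAR return.

    `lidar_depth` is assumed to hold 0 where no point projects, so those
    pixels are excluded from the loss.
    """
    mask = lidar_depth > 0
    if not mask.any():
        return 0.0
    return float(np.abs(pred_depth[mask] - lidar_depth[mask]).mean())
```

Because this branch exists only to shape the training signal, it can be dropped at inference time with no runtime cost.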
Numerical Results
Sparse4Dv2 achieves state-of-the-art performance on the nuScenes 3D detection benchmark, outperforming existing algorithms such as SOLOFusion and StreamPETR. Notably, it improves the nuScenes Detection Score (NDS) while maintaining competitive inference speeds across varying input resolutions.
Implications and Future Directions
The modifications in Sparse4Dv2 suggest significant implications for sparse algorithms in multi-view perception:
- Computational Efficiency: The reduced complexity positions Sparse4Dv2 as a viable option for real-time applications in autonomous driving, particularly in long-range and high-resolution scenarios.
- Integration Potential: The decoupled architecture allows seamless integration with other graph-based models, paving the way for end-to-end autonomous driving solutions.
- Generalization and Robustness: The enhancements in camera parameter handling and depth supervision might improve robustness across different environments, suggesting broader applicability in diverse driving conditions.
Future research could extend Sparse4Dv2 to other perception tasks, such as HD map construction and trajectory prediction. Its optimized framework offers a strong baseline for future work on sparse temporal fusion, solidifying its relevance in autonomous perception systems.