Object-Sequenced LSTM for 3D Detection
- Object-Sequenced LSTM is a deep learning architecture that leverages sequential LiDAR data to achieve robust temporal 3D object detection in autonomous driving.
- It integrates a sparse 3D convolutional LSTM module with a U-Net backbone, efficiently fusing per-frame features, memory states, and temporal consistency.
- The method outperforms traditional approaches by improving mAP by up to 7.5% while ensuring real-time performance with low computational overhead.
An Object-Sequenced LSTM is a deep neural architecture designed for temporal 3D object detection in LiDAR point clouds, particularly in autonomous driving contexts. It robustly leverages sequential LiDAR data by integrating features, memory states, and temporal consistency, employing a sparse 3D convolutional LSTM module with a U-Net backbone. This approach outperforms traditional frame-wise and early-fusion methods, improving multi-frame 3D object detection metrics while maintaining computational efficiency (Huang et al., 2020).
1. End-to-End Architecture and Data Flow
The Object-Sequenced LSTM employs a pipeline comprising six stages at each temporal step:
- Input:
- Raw LiDAR sweep of 3D points at the current frame.
- Memory states from the previous frame: hidden features and cell features for “high-score” points.
- Per-frame SparseConv U-Net Backbone:
- Voxelizes the input points on a fixed 3D grid.
- Encoder: 6 blocks of sparse 3D convolutions with max-pooling, progressively increasing the channel dimension.
- Decoder: symmetric structure with upsampling and skip-connections.
- De-voxelizes to obtain per-point features for every input point.
- Joint Voxelization for LSTM Fusion:
- Applies ego-motion compensation to transform the previous frame's memory points into the current frame.
- Merges the current per-point features with the previous hidden and cell memory features; voxelizes them together.
- Concatenates the backbone and memory features within each occupied voxel.
- 3D Sparse-Conv LSTM Module:
- Gate computations use a lightweight U-Net ($1$ encoder block, $1$ bottleneck, $1$ decoder block).
- Updates memory and hidden state via 3D convolutional LSTM gates.
- De-voxelizes the updated hidden and cell states back to per-point features.
- Per-Point 3D Bounding Box Proposal Head:
- Input: the per-point hidden features from the LSTM module.
- Three sparse conv layers per attribute (center, size, rotation, objectness).
- De-voxelizes predictions to all points; uses integrated box-corner regression and dynamic objectness classification.
- Graph-Convolution Smoothing and NMS:
- Builds a k-NN graph among detected box centers.
- Propagates/re-weights box predictions via learned edge weights.
- Farthest-point sampling yields the top 512 proposals.
- 3D non-maximum suppression produces the final detections.
- Memory Sub-selection:
- Filters the top-scoring points by objectness, yielding the hidden and cell memory features passed to the next frame.
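The per-timestep data flow above can be sketched as follows. This is a minimal NumPy stand-in, not the authors' implementation: `backbone`, `fuse_with_memory`, `objectness`, and `step` are illustrative names, dense arrays replace sparse voxel grids, and the fusion and proposal logic is reduced to toy operations that preserve only the shape of the pipeline (per-frame features, memory fusion, scoring, top-k memory sub-selection).

```python
import numpy as np

def backbone(points, out_ch=8, seed=0):
    """Stand-in for the per-frame SparseConv U-Net: maps (P, 3) raw
    points to (P, out_ch) per-point features via a fixed projection."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((3, out_ch))
    return np.tanh(points @ W)

def fuse_with_memory(x, h_prev):
    """Placeholder for joint voxelization + sparse-conv LSTM fusion:
    pools the memory points and mixes them into every current point."""
    if len(h_prev) == 0:          # first frame: no memory yet
        return np.tanh(x)
    return np.tanh(x + h_prev.mean(axis=0, keepdims=True))

def objectness(h):
    """Toy proposal head: one sigmoid score per point."""
    return 1 / (1 + np.exp(-h.mean(axis=1)))

def step(points, h_prev, top_k=4):
    """One temporal step: backbone -> fusion -> scores -> memory sub-selection."""
    x = backbone(points)
    h = fuse_with_memory(x, h_prev)
    scores = objectness(h)
    keep = np.argsort(scores)[::-1][:top_k]   # keep high-score points only
    return scores, h[keep]
```

Run over a sequence, the sub-selected memory from each call is fed as `h_prev` into the next, which is the mechanism that keeps per-frame cost constant regardless of sequence length.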
2. Sparse-Conv LSTM Formulation
LSTM cell operations are reformulated with sparse 3D U-Net convolutions, substituting standard fully connected gates with spatially aware gated updates:

$$
\begin{aligned}
i_t &= \sigma(W_i \ast [x_t, h_{t-1}] + b_i) \\
f_t &= \sigma(W_f \ast [x_t, h_{t-1}] + b_f) \\
\tilde{c}_t &= \tanh(W_c \ast [x_t, h_{t-1}] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_o \ast [x_t, h_{t-1}] + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here, "$\ast$" denotes 3D sparse convolution, and all gate and cell outputs are spatial feature volumes, not vectors.
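A dense stand-in for these gate equations can be written directly in NumPy. Here the sparse 3D convolution $\ast$ is approximated by a per-voxel linear map (the equivalent of a $1 \times 1 \times 1$ convolution); `ConvLSTMCell` is an illustrative name, not the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

class ConvLSTMCell:
    """Dense stand-in for the sparse-conv LSTM cell: the '*' in the
    equations above becomes a per-voxel linear map, so all gates remain
    spatial feature maps (one row per point/voxel), not vectors."""
    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        k = in_ch + hid_ch
        # one weight/bias pair per gate: input, forget, candidate, output
        self.W = {g: rng.standard_normal((k, hid_ch)) * 0.1 for g in "ifco"}
        self.b = {g: np.zeros(hid_ch) for g in "ifco"}

    def __call__(self, x, h_prev, c_prev):
        z = np.concatenate([x, h_prev], axis=-1)        # [x_t, h_{t-1}]
        i = sigmoid(z @ self.W["i"] + self.b["i"])       # input gate
        f = sigmoid(z @ self.W["f"] + self.b["f"])       # forget gate
        c_tilde = np.tanh(z @ self.W["c"] + self.b["c"]) # candidate cell
        c = f * c_prev + i * c_tilde                     # cell update
        o = sigmoid(z @ self.W["o"] + self.b["o"])       # output gate
        h = o * np.tanh(c)                               # hidden state
        return h, c
```

In the actual module, each `W @` would be a small sparse 3D U-Net over the occupied voxels, which is what gives the gates spatial context.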
Definitions and Typical Values:
| Variable | Definition | Typical value |
|---|---|---|
| Raw LiDAR points | input sweep at frame $t$ | — |
| $x_t$ | per-point backbone features | — |
| $h_t$, $c_t$ | hidden/cell memory features | — |
| $N$ | sequence length | $4$ (range $1$–$7$) |
The spatially structured LSTM is realized by one $128$-channel encoder, one $128$-channel bottleneck, one $256$-channel decoder (U-Net configuration).
3. 3D Proposal Regression, Losses, and Training Objective
Each candidate point regresses:
- Center: a 3D offset to the box center
- Size: box dimensions (length, width, height)
- Rotation: a rotation matrix
- Objectness score: a scalar confidence in $[0, 1]$
Box corners are computed in a differentiable manner for corner-based regression. The total loss combines a corner-regression term and an objectness-classification term over all sequence steps, where the regression term penalizes the distance between predicted and ground-truth box corners.
The objectness classification loss uses dynamic ground-truth assignment: a point's label is $1$ if it lies inside a ground-truth box and $0$ otherwise.
The per-frame training objective is the weighted sum of the corner-regression and objectness losses, accumulated across frames $t = 1$ to $N$ in the temporal window.
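The two loss terms can be sketched under simplifying assumptions: axis-aligned boxes instead of rotated ones, a single ground-truth box, mean L1 corner distance, and binary cross-entropy for objectness. All function names are illustrative, not the paper's API.

```python
import numpy as np

def box_corners(center, size):
    """8 corners of an axis-aligned 3D box (rotation omitted for brevity)."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)])
    return center + 0.5 * signs * size            # (8, 3)

def corner_loss(pred_center, pred_size, gt_center, gt_size):
    """Mean L1 distance between predicted and ground-truth corners."""
    pc = box_corners(pred_center, pred_size)
    gc = box_corners(gt_center, gt_size)
    return np.abs(pc - gc).mean()

def objectness_loss(scores, points, gt_center, gt_size, eps=1e-7):
    """BCE with dynamic labels: 1 iff the point lies inside the gt box."""
    inside = np.all(np.abs(points - gt_center) <= 0.5 * gt_size, axis=1)
    y = inside.astype(float)
    s = np.clip(scores, eps, 1 - eps)
    return -(y * np.log(s) + (1 - y) * np.log(1 - s)).mean()
```

The corner parameterization couples center, size, and (in the full method) rotation errors into one geometric quantity, which is why it is preferred over regressing each attribute with a separate loss.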
4. Empirical Performance and Comparison
Waymo Open Dataset, 3D detection mAP:
| Method | mAP (%) |
|---|---|
| Single-frame (SparseConv U-Net) | 56.1 |
| + Kalman Filter tracking baseline | 56.8 |
| 4-frame early-fusion (concatenation) | 62.4 |
| 4-frame LSTM fusion (Object-Sequenced) | 63.6 |
| StarNet (point-based) | 53.7 |
| PointPillars | 57.2 |
| MVF (multi-view fusion) | 62.9 |
The Object-Sequenced LSTM improves mAP by $7.5$ points over the single-frame method (63.6 vs. 56.1) and by $1.2$ points over 4-frame early concatenation (63.6 vs. 62.4). Compared to other state-of-the-art approaches in the literature, the LSTM model achieves the highest reported mAP, demonstrating the advantage of learned temporal aggregation in sparse 3D domains (Huang et al., 2020).
Runtime/Memory Efficiency:
- Single-frame backbone: $19$ ms/frame (Titan V GPU).
- LSTM: ~$2$ ms overhead from 3 extra sparse conv blocks.
- Full pipeline: 100 ms per frame (supports $10$ Hz operation).
- Memory: hidden/cell states are sub-sampled to a fixed budget of high-score points per frame, bounding the number of occupied voxels.
5. Implementation Decisions and Ablation Insights
Key hyperparameter and architectural ablation findings:
- Sequence Length ($N$):
- Accuracy improves monotonically from $N = 1$ (no memory, no temporal fusion) up to the recommended $N = 4$, then saturates for longer windows. This suggests a moderate temporal horizon is optimal.
- LSTM Module Structure:
- Replacing fully connected gates with a 3D sparse U-Net in the LSTM gates improves mAP at the recommended sequence length, highlighting the importance of spatial context within the LSTM gating mechanism.
- Memory Sub-sampling: Sub-sampling the memory to a fixed budget of high-score points per frame retains object history and keeps memory usage constant, a notable advantage over denser fusion schemes.
- Graph-conv Smoothing: Applies neighborhood-based refinement, reducing jitter and local false positives.
- Early Fusion Concatenation: While denser (more compute/memory), early concatenation lags LSTM-based fusion by $1.2$ mAP points and does not leverage temporally consistent memory.
- LSTM U-Net Depth: Deeper U-Net variants trade capacity for latency; the one-encoder, one-bottleneck, one-decoder configuration ($128$–$256$ channels) provides a robust capacity-latency balance.
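The graph-smoothing and sampling steps from the pipeline can be illustrated with uniform edge weights standing in for the learned ones; `knn_smooth` and `farthest_point_sampling` are illustrative stand-ins, not the authors' code.

```python
import numpy as np

def knn_smooth(centers, scores, k=2):
    """Average each score with its k nearest neighbors' scores
    (uniform weights in place of learned edge weights)."""
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]        # skip self (distance 0)
    return (scores + scores[nn].sum(axis=1)) / (k + 1)

def farthest_point_sampling(points, m):
    """Greedily pick m points, each farthest from those already chosen."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)
```

Smoothing suppresses isolated high scores (local false positives), while FPS spreads the surviving proposals over space before NMS prunes overlapping boxes.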
6. Significance in 3D Spatiotemporal Detection
The Object-Sequenced LSTM architecture is the first reported method to apply LSTM mechanisms—augmented with 3D sparse convolutions—for temporal 3D object detection in large LiDAR point clouds. By integrating geometric, sequential, and memory signals within a unified end-to-end pipeline, it sets a new state-of-the-art on major benchmarks while remaining within stringent real-time and computational constraints. The framework’s explicit modeling of temporally evolving objectness mitigates the limitations of both per-frame and naive multi-frame fusion approaches (Huang et al., 2020).
A plausible implication is that this approach generalizes to other sparse sequential 3D domains where spatial and temporal context are both critical.
7. Limitations and Directions for Future Research
Empirical saturation occurs at moderate temporal window lengths (around $N = 4$), indicating diminishing returns for longer memory, possibly due to information loss from sub-sampling or temporal drift. While memory sub-sampling is effective, more adaptive point-selection strategies may further enhance long-range temporal modeling. Exploration of more expressive graph-convolution heads or hybrid attention modules presents a potential avenue for improving robustness to local ambiguities.
The demonstrated efficiency and performance of Object-Sequenced LSTM motivate extensions to multi-modal sensor fusion (e.g., radar-LiDAR), domain adaptation, and lifelong temporal learning in robotics and autonomous navigation (Huang et al., 2020).