Object-Sequenced LSTM for 3D Detection
- Object-Sequenced LSTM is a deep learning architecture that leverages sequential LiDAR data to achieve robust temporal 3D object detection in autonomous driving.
- It integrates a sparse 3D convolutional LSTM module with a U-Net backbone, efficiently fusing per-frame features, memory states, and temporal consistency.
- The method outperforms traditional approaches by improving mAP by up to 7.5% while ensuring real-time performance with low computational overhead.
An Object-Sequenced LSTM is a deep neural architecture designed for temporal 3D object detection in LiDAR point clouds, particularly in autonomous driving contexts. It robustly leverages sequential LiDAR data by integrating features, memory states, and temporal consistency, employing a sparse 3D convolutional LSTM module with a U-Net backbone. This approach outperforms traditional frame-wise and early-fusion methods, improving multi-frame 3D object detection metrics while maintaining computational efficiency (Huang et al., 2020).
1. End-to-End Architecture and Data Flow
The Object-Sequenced LSTM employs a pipeline comprising six stages at each temporal step:
- Input:
- Raw LiDAR sweep of 3D points at the current frame.
- Memory states from the previous frame: hidden features and cell features for “high-score” points.
- Per-frame SparseConv U-Net Backbone:
- Voxelizes the input points on a fixed 3D grid.
- Encoder: 6 blocks of sparse 3D convolutions with max-pooling, progressively increasing the channel dimension.
- Decoder: symmetric structure with upsampling and skip-connections.
- De-voxelizes to obtain per-point features for every input point.
- Joint Voxelization for LSTM Fusion:
- Applies ego-motion compensation to transform the previous frame's memory points into the current frame.
- Merges the current per-point features with the previous hidden and cell memory features; voxelizes them together.
- Concatenates the backbone and memory features within each occupied voxel.
- 3D Sparse-Conv LSTM Module:
- Gate computations use a lightweight U-Net ($1$ encoder block, $1$ bottleneck, $1$ decoder block).
- Updates memory and hidden state via 3D convolutional LSTM gates.
- De-voxelizes the updated hidden and cell states back to per-point features.
- Per-Point 3D Bounding Box Proposal Head:
- Input: the per-point hidden features from the LSTM module.
- Three sparse conv layers per attribute (center, size, rotation, objectness).
- De-voxelizes predictions to all points; uses integrated box-corner regression and dynamic objectness classification.
- Graph-Convolution Smoothing and NMS:
- Builds a k-NN graph among detected box centers.
- Propagates/re-weights box predictions via learned edge weights.
- Farthest-point sampling yields the top 512 proposals.
- 3D non-maximum suppression produces the final detections.
- Memory Sub-selection:
- Filters the top-scoring points by objectness, yielding the hidden and cell memory features passed to the next frame.
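The per-timestep data flow above can be sketched as follows. This is a minimal NumPy stand-in, not the authors' implementation: `backbone`, `fuse_with_memory`, `objectness`, and `step` are illustrative names, dense arrays replace sparse voxel grids, and the fusion and proposal logic is reduced to toy operations that preserve only the shape of the pipeline (per-frame features, memory fusion, scoring, top-k memory sub-selection).

```python
import numpy as np

def backbone(points, out_ch=8, seed=0):
    """Stand-in for the per-frame SparseConv U-Net: maps (P, 3) raw
    points to (P, out_ch) per-point features via a fixed projection."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((3, out_ch))
    return np.tanh(points @ W)

def fuse_with_memory(x, h_prev):
    """Placeholder for joint voxelization + sparse-conv LSTM fusion:
    pools the memory points and mixes them into every current point."""
    if len(h_prev) == 0:          # first frame: no memory yet
        return np.tanh(x)
    return np.tanh(x + h_prev.mean(axis=0, keepdims=True))

def objectness(h):
    """Toy proposal head: one sigmoid score per point."""
    return 1 / (1 + np.exp(-h.mean(axis=1)))

def step(points, h_prev, top_k=4):
    """One temporal step: backbone -> fusion -> scores -> memory sub-selection."""
    x = backbone(points)
    h = fuse_with_memory(x, h_prev)
    scores = objectness(h)
    keep = np.argsort(scores)[::-1][:top_k]   # keep high-score points only
    return scores, h[keep]
```

Run over a sequence, the sub-selected memory from each call is fed as `h_prev` into the next, which is the mechanism that keeps per-frame cost constant regardless of sequence length.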
2. Sparse-Conv LSTM Formulation
LSTM cell operations are reformulated with sparse 3D U-Net convolutions, substituting standard fully connected gates with spatially aware gated updates:

$$
\begin{aligned}
i_t &= \sigma(W_i \ast [x_t, h_{t-1}] + b_i) \\
f_t &= \sigma(W_f \ast [x_t, h_{t-1}] + b_f) \\
\tilde{c}_t &= \tanh(W_c \ast [x_t, h_{t-1}] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_o \ast [x_t, h_{t-1}] + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here, "$\ast$" denotes 3D sparse convolution, and all gate and cell outputs are spatial feature volumes, not vectors.
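A dense stand-in for these gate equations can be written directly in NumPy. Here the sparse 3D convolution $\ast$ is approximated by a per-voxel linear map (the equivalent of a $1 \times 1 \times 1$ convolution); `ConvLSTMCell` is an illustrative name, not the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

class ConvLSTMCell:
    """Dense stand-in for the sparse-conv LSTM cell: the '*' in the
    equations above becomes a per-voxel linear map, so all gates remain
    spatial feature maps (one row per point/voxel), not vectors."""
    def __init__(self, in_ch, hid_ch, seed=0):
        rng = np.random.default_rng(seed)
        k = in_ch + hid_ch
        # one weight/bias pair per gate: input, forget, candidate, output
        self.W = {g: rng.standard_normal((k, hid_ch)) * 0.1 for g in "ifco"}
        self.b = {g: np.zeros(hid_ch) for g in "ifco"}

    def __call__(self, x, h_prev, c_prev):
        z = np.concatenate([x, h_prev], axis=-1)        # [x_t, h_{t-1}]
        i = sigmoid(z @ self.W["i"] + self.b["i"])       # input gate
        f = sigmoid(z @ self.W["f"] + self.b["f"])       # forget gate
        c_tilde = np.tanh(z @ self.W["c"] + self.b["c"]) # candidate cell
        c = f * c_prev + i * c_tilde                     # cell update
        o = sigmoid(z @ self.W["o"] + self.b["o"])       # output gate
        h = o * np.tanh(c)                               # hidden state
        return h, c
```

In the actual module, each `W @` would be a small sparse 3D U-Net over the occupied voxels, which is what gives the gates spatial context.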
Definitions and Typical Values:
| Variable | Definition | Typical value |
|---|---|---|
| Raw LiDAR points | input sweep at frame $t$ | — |
| $x_t$ | per-point backbone features | — |
| $h_t$, $c_t$ | hidden/cell memory features | — |
| $N$ | sequence length | $4$ (range $1$–$7$) |
The spatially structured LSTM is realized by one $128$-channel encoder, one $128$-channel bottleneck, one $256$-channel decoder (U-Net configuration).
3. 3D Proposal Regression, Losses, and Training Objective
Each candidate point regresses:
- Center: a 3D offset to the box center
- Size: box dimensions (length, width, height)
- Rotation: a rotation matrix
- Objectness score: a scalar confidence in $[0, 1]$
Box corners are computed in a differentiable manner for corner-based regression. The total loss combines a corner-regression term and an objectness-classification term over all sequence steps, where the regression term penalizes the distance between predicted and ground-truth box corners.
The objectness classification loss uses dynamic ground-truth assignment: a point's label is $1$ if it lies inside a ground-truth box and $0$ otherwise.
The per-frame training objective is the weighted sum of the corner-regression and objectness losses, accumulated across frames $t = 1$ to $N$ in the temporal window.
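The two loss terms can be sketched under simplifying assumptions: axis-aligned boxes instead of rotated ones, a single ground-truth box, mean L1 corner distance, and binary cross-entropy for objectness. All function names are illustrative, not the paper's API.

```python
import numpy as np

def box_corners(center, size):
    """8 corners of an axis-aligned 3D box (rotation omitted for brevity)."""
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)])
    return center + 0.5 * signs * size            # (8, 3)

def corner_loss(pred_center, pred_size, gt_center, gt_size):
    """Mean L1 distance between predicted and ground-truth corners."""
    pc = box_corners(pred_center, pred_size)
    gc = box_corners(gt_center, gt_size)
    return np.abs(pc - gc).mean()

def objectness_loss(scores, points, gt_center, gt_size, eps=1e-7):
    """BCE with dynamic labels: 1 iff the point lies inside the gt box."""
    inside = np.all(np.abs(points - gt_center) <= 0.5 * gt_size, axis=1)
    y = inside.astype(float)
    s = np.clip(scores, eps, 1 - eps)
    return -(y * np.log(s) + (1 - y) * np.log(1 - s)).mean()
```

The corner parameterization couples center, size, and (in the full method) rotation errors into one geometric quantity, which is why it is preferred over regressing each attribute with a separate loss.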
4. Empirical Performance and Comparison
Waymo Open Dataset, 3D detection mAP:
| Method | mAP (%) |
|---|---|
| Single-frame (SparseConv U-Net) | 56.1 |
| + Kalman Filter tracking baseline | 56.8 |
| 4-frame early-fusion (concatenation) | 62.4 |
| 4-frame LSTM fusion (Object-Sequenced) | 63.6 |
| StarNet (point-based) | 53.7 |
| PointPillars | 57.2 |
| MVF (multi-view fusion) | 62.9 |
The Object-Sequenced LSTM improves mAP by $7.5$ points over the single-frame method (63.6 vs. 56.1) and by $1.2$ points over 4-frame early concatenation (63.6 vs. 62.4). Compared to other state-of-the-art approaches in the literature, the LSTM model achieves the highest reported mAP, demonstrating the advantage of learned temporal aggregation in sparse 3D domains (Huang et al., 2020).
Runtime/Memory Efficiency:
- Single-frame backbone: $19$ ms/frame (Titan V GPU).
- LSTM: ~$2$ ms overhead from 3 extra sparse conv blocks.
- Full pipeline: 100 ms per frame (supports $10$ Hz operation).
- Memory: hidden/cell states are sub-sampled to a fixed budget of high-score points per frame, bounding the number of occupied voxels.
5. Implementation Decisions and Ablation Insights
Key hyperparameter and architectural ablation findings:
- Sequence Length ($N$):
- Accuracy improves monotonically from $N = 1$ (no memory, no temporal fusion) up to the recommended $N = 4$, then saturates for longer windows. This suggests a moderate temporal horizon is optimal.
- LSTM Module Structure:
- Replacing fully connected gates with a 3D sparse U-Net in the LSTM gates improves mAP at the recommended sequence length, highlighting the importance of spatial context within the LSTM gating mechanism.
- Memory Sub-sampling: Sub-sampling the memory to a fixed budget of high-score points per frame retains object history and keeps memory usage constant, a notable advantage over denser fusion schemes.
- Graph-conv Smoothing: Applies neighborhood-based refinement, reducing jitter and local false positives.
- Early Fusion Concatenation: While denser (more compute/memory), early concatenation lags LSTM-based fusion by $1.2$ mAP points and does not leverage temporally consistent memory.
- LSTM U-Net Depth: Deeper U-Net variants trade capacity for latency; the one-encoder, one-bottleneck, one-decoder configuration ($128$–$256$ channels) provides a robust capacity-latency balance.
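The graph-smoothing and sampling steps from the pipeline can be illustrated with uniform edge weights standing in for the learned ones; `knn_smooth` and `farthest_point_sampling` are illustrative stand-ins, not the authors' code.

```python
import numpy as np

def knn_smooth(centers, scores, k=2):
    """Average each score with its k nearest neighbors' scores
    (uniform weights in place of learned edge weights)."""
    d = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]        # skip self (distance 0)
    return (scores + scores[nn].sum(axis=1)) / (k + 1)

def farthest_point_sampling(points, m):
    """Greedily pick m points, each farthest from those already chosen."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)
```

Smoothing suppresses isolated high scores (local false positives), while FPS spreads the surviving proposals over space before NMS prunes overlapping boxes.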
6. Significance in 3D Spatiotemporal Detection
The Object-Sequenced LSTM architecture is the first reported method to apply LSTM mechanisms—augmented with 3D sparse convolutions—for temporal 3D object detection in large LiDAR point clouds. By integrating geometric, sequential, and memory signals within a unified end-to-end pipeline, it sets a new state-of-the-art on major benchmarks while remaining within stringent real-time and computational constraints. The framework’s explicit modeling of temporally evolving objectness mitigates the limitations of both per-frame and naive multi-frame fusion approaches (Huang et al., 2020).
A plausible implication is that this approach generalizes to other sparse sequential 3D domains where spatial and temporal context are both critical.
7. Limitations and Directions for Future Research
Empirical saturation occurs at moderate temporal window lengths (around $N = 4$), indicating diminishing returns for longer memory, possibly due to information loss from sub-sampling or temporal drift. While memory sub-sampling is effective, more adaptive point-selection strategies may further enhance long-range temporal modeling. Exploration of more expressive graph-convolution heads or hybrid attention modules presents a potential avenue for improving robustness to local ambiguities.
The demonstrated efficiency and performance of Object-Sequenced LSTM motivate extensions to multi-modal sensor fusion (e.g., radar-LiDAR), domain adaptation, and lifelong temporal learning in robotics and autonomous navigation (Huang et al., 2020).