
Object-Sequenced LSTM for 3D Detection

Updated 16 February 2026
  • Object-Sequenced LSTM is a deep learning architecture that leverages sequential LiDAR data to achieve robust temporal 3D object detection in autonomous driving.
  • It integrates a sparse 3D convolutional LSTM module with a U-Net backbone, efficiently fusing per-frame features, memory states, and temporal consistency.
  • The method outperforms traditional approaches by improving mAP by up to 7.5% while ensuring real-time performance with low computational overhead.

An Object-Sequenced LSTM is a deep neural architecture designed for temporal 3D object detection in LiDAR point clouds, particularly in autonomous driving contexts. It robustly leverages sequential LiDAR data by integrating features, memory states, and temporal consistency, employing a sparse 3D convolutional LSTM module with a U-Net backbone. This approach outperforms traditional frame-wise and early-fusion methods, improving multi-frame 3D object detection metrics while maintaining computational efficiency (Huang et al., 2020).

1. End-to-End Architecture and Data Flow

The Object-Sequenced LSTM employs a pipeline comprising seven stages at each temporal step $t$ (a minimal code sketch follows the list):

  1. Input:
    • Raw LiDAR sweep $P_t \in \mathbb{R}^{N\times 3}$ with $N \approx 1.8\times 10^5$ points.
    • Memory states from the previous frame: hidden features $H_{t-1}$ and cell features $C_{t-1}$ for $N' \approx 3\times 10^4$ “high-score” points.
  2. Per-frame SparseConv U-Net Backbone:
    • Voxelizes $P_t$ with a grid size of $0.2~\text{m}^3$.
    • Encoder: 6 blocks of $3\times 3\times 3$ sparse 3D convs and max-pooling, with dimension progression $[64, 96, 128, 160, 192, 224, 256]$.
    • Decoder: symmetric structure with upsampling and skip-connections.
    • De-voxelizes to obtain per-point features $X_t \in \mathbb{R}^{N \times F}$, $F=256$.
  3. Joint Voxelization for LSTM Fusion:
    • Applies ego-motion compensation to $H_{t-1}, C_{t-1}$.
    • Merges $P_t$, $H_{t-1}$, $C_{t-1}$ and voxelizes them jointly.
    • Concatenates the backbone and memory features within each occupied voxel: $[X_{t,\text{voxel}} \oplus H_{t-1,\text{voxel}} \oplus C_{t-1,\text{voxel}}]$.
  4. 3D Sparse-Conv LSTM Module:
    • Gate computations use a lightweight U-Net ($1$ encoder block, $1$ bottleneck, $1$ decoder block).
    • Updates memory $C_t$ and hidden state $H_t$ via 3D convolutional LSTM gates.
    • De-voxelizes $H_t, C_t$ to point sets of $N'' \approx 3\times 10^4$ points with $F'=256$ features.
  5. Per-Point 3D Bounding Box Proposal Head:
    • Input: $H_t \in \mathbb{R}^{N''\times F'}$.
    • Three $3\times 3\times 3$ sparse conv layers per attribute (center, size, rotation, objectness).
    • De-voxelizes predictions to all points; uses integrated box-corner regression and dynamic objectness classification.
  6. Graph-Convolution Smoothing and NMS:
    • Builds a $K$-NN graph ($K=16$) among detected centers.
    • Propagates/re-weights box predictions via learned edge weights.
    • Farthest-point sampling yields the top $M$ ($\sim$512) proposals.
    • 3D non-maximum suppression with $\text{IoU} \ge 0.7$ produces final detections $D_t$.
  7. Memory Sub-selection:
    • Filters the top $N' \approx 3\times 10^4$ points from $H_t$ by objectness score, yielding the $H_t$ and $C_t$ passed to the next frame.
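The following is a minimal sketch of this per-frame control flow, not the authors' implementation. It uses dense PyTorch tensors and toy stand-in modules; the real system runs sparse 3D convolutions over voxels, and names such as `backbone_unet`, `sparse_conv_lstm`, and `detect_sequence` are hypothetical placeholders. Graph-convolution smoothing and NMS (stage 6) are elided for brevity.

```python
import torch

F = 256           # backbone feature width
N_MEM = 30_000    # points kept as recurrent memory (N' in the text)

def backbone_unet(points):
    """Stage 2 stand-in: SparseConv U-Net producing per-point features X_t.
    A real implementation would voxelize at 0.2 m and run sparse 3D convs."""
    return torch.randn(points.shape[0], F)

def warp_to_current_frame(pts, T_prev_to_curr):
    """Stage 3: ego-motion compensation of memory point coordinates."""
    homo = torch.cat([pts, torch.ones(pts.shape[0], 1)], dim=1)   # N x 4
    return (homo @ T_prev_to_curr.T)[:, :3]

def sparse_conv_lstm(x_feats, h_prev, c_prev):
    """Stage 4 stand-in: gated recurrent update (see Section 2 for real gates)."""
    x = x_feats[: h_prev.shape[0]]            # toy alignment of point sets
    c_t = torch.sigmoid(x) * c_prev + torch.tanh(x)
    h_t = torch.tanh(c_t)
    return h_t, c_t

def proposal_head(h_t):
    """Stage 5 stand-in: per-point box parameters plus objectness score."""
    boxes = torch.randn(h_t.shape[0], 7)      # (dx, dy, dz, dl, dw, dh, yaw)
    scores = torch.rand(h_t.shape[0])
    return boxes, scores

def detect_sequence(sweeps, poses):
    """Run the staged loop over a LiDAR sequence (stage 6 omitted)."""
    h = c = mem_xyz = None
    detections = []
    for points, pose in zip(sweeps, poses):
        x = backbone_unet(points)                                  # stage 2
        if h is None:                                              # first frame
            h = torch.zeros(N_MEM, F)
            c = torch.zeros(N_MEM, F)
            mem_xyz = points[:N_MEM]
        else:
            mem_xyz = warp_to_current_frame(mem_xyz, pose)         # stage 3
        h, c = sparse_conv_lstm(x, h, c)                           # stage 4
        boxes, scores = proposal_head(h)                           # stage 5
        detections.append((boxes, scores))
        keep = torch.topk(scores, min(N_MEM, scores.shape[0])).indices  # stage 7
        h, c, mem_xyz = h[keep], c[keep], mem_xyz[keep]
    return detections

if __name__ == "__main__":
    sweeps = [torch.randn(180_000, 3) for _ in range(4)]   # M = 4 frames
    poses = [torch.eye(4) for _ in range(4)]               # identity ego-motion (toy)
    dets = detect_sequence(sweeps, poses)
    print(len(dets), dets[0][0].shape)
```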

2. Sparse-Conv LSTM Formulation

LSTM cell operations are reformulated with sparse 3D U-Net convolutions, substituting standard fully connected gates with spatially aware gated updates:

$$\begin{align*}
i_t &= \sigma(W_i \ast [x_t, h_{t-1}] + b_i) \\
f_t &= \sigma(W_f \ast [x_t, h_{t-1}] + b_f) \\
\tilde{c}_t &= \tanh(W_c \ast [x_t, h_{t-1}] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma(W_o \ast [x_t, h_{t-1}] + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{align*}$$

Here, $\ast$ denotes 3D sparse convolution, and all gate and cell outputs are spatial feature volumes, not vectors.
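A minimal PyTorch sketch of these gate equations, assuming dense `nn.Conv3d` on a voxel grid as a stand-in for the paper's sparse 3D U-Net convolutions (a real implementation would use a sparse-conv library such as MinkowskiEngine or torchsparse; the class name, channel counts, and grid size here are illustrative):

```python
import torch
import torch.nn as nn

class ConvLSTM3DCell(nn.Module):
    """3D convolutional LSTM cell: the four gates above, with dense Conv3d
    standing in for sparse 3D U-Net convolutions."""

    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        # One conv produces all four gate pre-activations (i, f, c~, o) at once.
        self.gates = nn.Conv3d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x_t, h_prev, c_prev):
        z = self.gates(torch.cat([x_t, h_prev], dim=1))  # W * [x_t, h_{t-1}] + b
        i, f, g, o = torch.chunk(z, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)                                # candidate cell state
        c_t = f * c_prev + i * g                         # c_t = f⊙c_{t-1} + i⊙c~_t
        h_t = o * torch.tanh(c_t)                        # h_t = o⊙tanh(c_t)
        return h_t, c_t

# Toy usage on an 8x8x8 voxel grid with 256-channel features.
cell = ConvLSTM3DCell(in_ch=256, hid_ch=256)
x = torch.randn(1, 256, 8, 8, 8)
h = c = torch.zeros(1, 256, 8, 8, 8)
h, c = cell(x, h, c)
print(h.shape)  # torch.Size([1, 256, 8, 8, 8])
```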

Definitions and Typical Dimensions:

Variable      Definition             Dimension
$P_t$         Raw LiDAR points       $N \times 3$  ($N \sim 1.8\times 10^5$)
$X_t$         Per-point features     $N \times 256$
$h_t, c_t$    Hidden/cell features   $N' \times 256$  ($N' \sim 3\times 10^4$)
$M$           Sequence length        typically $4$  (range $1$–$7$)

The spatially structured LSTM is realized by one $128$-channel encoder, one $128$-channel bottleneck, one $256$-channel decoder (U-Net configuration).

3. 3D Proposal Regression, Losses, and Training Objective

Each candidate point regresses:

  • Center: $\Delta x, \Delta y, \Delta z$
  • Size: $\Delta l, \Delta w, \Delta h$
  • Rotation: $R$ (a $3\times 3$ matrix)
  • Objectness score: $p_\text{obj}$

Box corners are computed in a differentiable manner for corner-based regression. The total loss function combines regression and classification over all sequence steps:

$$L_\mathrm{reg} = \sum_i \text{smooth}_{L1}\big(\hat{C}_i - C^{gt}_i\big)$$

where $\hat{C}_i,~C^{gt}_i \in \mathbb{R}^{24}$ are the predicted and ground-truth box corners (8 corners $\times$ 3 coordinates).

The objectness classification loss uses dynamic ground-truth assignment ($y_i = 1$ if $\mathrm{IoU}(\text{pred}_i, \text{gt}) > 0.7$, $0$ otherwise):

$$L_\mathrm{cls} = -\sum_{i} \left[y_i \log p_i + (1-y_i) \log (1-p_i)\right]$$

The training objective per frame $t$ is

$$L_\text{total}(t) = \lambda_\text{cls} L_\text{cls}(t) + \lambda_\text{reg} L_\text{reg}(t)$$

with $\lambda_\text{cls} = 1.0$ and $\lambda_\text{reg} = 1.0$, summed across $t=1$ to $M$ in the temporal window.
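A hedged sketch of this per-frame objective under the definitions above: smooth-L1 on the 24-dimensional corner residual plus binary cross-entropy on dynamically assigned objectness labels. Corner extraction and IoU matching are assumed to happen upstream; the function name and argument layout are illustrative, not the authors' API.

```python
import torch
import torch.nn.functional as F

def detection_loss(pred_corners, gt_corners, obj_logits, iou_with_gt,
                   lambda_cls: float = 1.0, lambda_reg: float = 1.0):
    """Per-frame loss L_total(t) = lambda_cls * L_cls + lambda_reg * L_reg.

    pred_corners, gt_corners: (P, 24) flattened 8x3 box corners per proposal.
    obj_logits:               (P,) objectness logits before the sigmoid.
    iou_with_gt:              (P,) IoU of each proposal with its matched GT box.
    """
    # L_reg: smooth-L1 over predicted vs. ground-truth corner coordinates.
    l_reg = F.smooth_l1_loss(pred_corners, gt_corners, reduction="sum")

    # Dynamic assignment: y_i = 1 iff IoU(pred_i, gt) > 0.7.
    y = (iou_with_gt > 0.7).float()
    # L_cls: binary cross-entropy over objectness scores.
    l_cls = F.binary_cross_entropy_with_logits(obj_logits, y, reduction="sum")

    return lambda_cls * l_cls + lambda_reg * l_reg

# Toy usage with 100 random proposals; in training this is summed over t = 1..M.
P = 100
loss = detection_loss(torch.randn(P, 24), torch.randn(P, 24),
                      torch.randn(P), torch.rand(P))
print(float(loss))
```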

4. Empirical Performance and Comparison

Waymo Open Dataset, mAP@0.7:

Method                                     mAP@0.7 (%)
Single-frame (SparseConv U-Net)            56.1
+ Kalman filter tracking baseline          56.8
4-frame early fusion (concatenation)       62.4
4-frame LSTM fusion (Object-Sequenced)     63.6
StarNet (point-based)                      53.7
PointPillars                               57.2
MVF (multi-view fusion)                    62.9

The Object-Sequenced LSTM improves mAP by $+7.5\%$ over the single-frame method, and by $+1.2\%$ over 4-frame early concatenation. Compared to other state-of-the-art approaches in the literature, the LSTM model achieves the highest reported mAP, demonstrating the advantage of learned temporal aggregation in sparse 3D domains (Huang et al., 2020).

Runtime/Memory Efficiency:

  • Single-frame backbone: $19$ ms/frame (Titan V GPU).
  • LSTM: $\sim 2$ ms overhead from the 3 extra sparse conv blocks.
  • Full pipeline: $<100$ ms per frame (supports $10$ Hz operation).
  • Memory: sub-sampled $H_t, C_t$ ($\approx 30$k points), yielding $\approx 30$k voxels per frame.

5. Implementation Decisions and Ablation Insights

Key hyperparameter and architectural ablation findings:

  • Sequence Length ($M$):
    • $M=1$ (LSTM, no memory): $58.7\%$ mAP.
    • $M=2$: $59.7\%$.
    • $M=4$ (recommended): $63.6\%$.
    • $M=7$: $63.3\%$ (saturation). This suggests a moderate temporal horizon is optimal.
  • LSTM Module Structure:
    • Replacing fully connected gates with a 3D sparse U-Net in the LSTM gates improves performance by $+2.6\%$ mAP at $M=1$, highlighting the importance of spatial context within the LSTM gating mechanism.
  • Memory Sub-sampling: Sub-sampling $H_t$ to $\approx 30$k points per frame retains object history and maintains constant memory usage, a notable advantage over denser fusion schemes (see the sketch after this list).
  • Graph-conv Smoothing: Applies neighborhood-based refinement, reducing jitter and local false positives.
  • Early-Fusion Concatenation: While denser (more compute and memory), it lags LSTM-based fusion by $1.2\%$ and does not leverage a temporally consistent memory.
  • LSTM U-Net Depth: Deeper U-Net variants trade capacity for latency; the one-encoder, one-bottleneck, one-decoder configuration ($128$–$256$ channels) provides a robust capacity-latency balance.
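A small sketch of the memory sub-sampling step referenced above: keep only the top $\approx 30$k points by objectness so the recurrent state has constant size from frame to frame. Tensor names and the helper function are illustrative assumptions.

```python
import torch

def subselect_memory(h_t, c_t, xyz, obj_scores, n_keep: int = 30_000):
    """Keep the n_keep highest-objectness points as next-frame LSTM memory.

    h_t, c_t:   (N, 256) hidden / cell features for the current frame.
    xyz:        (N, 3) point coordinates (needed for ego-motion warping later).
    obj_scores: (N,) per-point objectness from the proposal head.
    """
    k = min(n_keep, obj_scores.shape[0])
    idx = torch.topk(obj_scores, k).indices
    return h_t[idx], c_t[idx], xyz[idx]

# Toy usage: 100k candidate points reduced to a 30k-point memory.
N = 100_000
h, c, p = torch.randn(N, 256), torch.randn(N, 256), torch.randn(N, 3)
scores = torch.rand(N)
h2, c2, p2 = subselect_memory(h, c, p, scores)
print(h2.shape)  # torch.Size([30000, 256])
```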

6. Significance in 3D Spatiotemporal Detection

The Object-Sequenced LSTM architecture is the first reported method to apply LSTM mechanisms—augmented with 3D sparse convolutions—for temporal 3D object detection in large LiDAR point clouds. By integrating geometric, sequential, and memory signals within a unified end-to-end pipeline, it sets a new state-of-the-art on major benchmarks while remaining within stringent real-time and computational constraints. The framework’s explicit modeling of temporally evolving objectness mitigates the limitations of both per-frame and naive multi-frame fusion approaches (Huang et al., 2020).

A plausible implication is that this approach generalizes to other sparse sequential 3D domains where spatial and temporal context are both critical.

7. Limitations and Directions for Future Research

Empirical saturation occurs at moderate temporal window lengths ($M \geq 4$), indicating diminishing returns for longer memory, possibly due to information loss from sub-sampling or temporal drift. While memory sub-sampling is effective, more adaptive point selection strategies may further enhance long-range temporal modeling. Exploration of more expressive graph convolution heads or hybrid attention modules presents a potential avenue for improving robustness to local ambiguities.

The demonstrated efficiency and performance of Object-Sequenced LSTM motivate extensions to multi-modal sensor fusion (e.g., radar-LiDAR), domain adaptation, and lifelong temporal learning in robotics and autonomous navigation (Huang et al., 2020).
