TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses (2306.05888v2)

Published 9 Jun 2023 in cs.CV

Abstract: 3D multi-object tracking (MOT) is vital for many applications including autonomous driving vehicles and service robots. With the commonly used tracking-by-detection paradigm, 3D MOT has made important progress in recent years. However, these methods only use the detection boxes of the current frame to obtain trajectory-box association results, which makes it impossible for the tracker to recover objects missed by the detector. In this paper, we present TrajectoryFormer, a novel point-cloud-based 3D MOT framework. To recover objects missed by the detector, we generate multiple trajectory hypotheses with hybrid candidate boxes, including temporally predicted boxes and current-frame detection boxes, for trajectory-box association. The predicted boxes can propagate an object's history trajectory information to the current frame, and thus the network can tolerate short-term missed detections of the tracked objects. We combine long-term object motion features and short-term object appearance features to create per-hypothesis feature embeddings, which reduces the computational overhead for spatial-temporal encoding. Additionally, we introduce a Global-Local Interaction Module to conduct information interaction among all hypotheses and model their spatial relations, leading to accurate estimation of hypotheses. Our TrajectoryFormer achieves state-of-the-art performance on the Waymo 3D MOT benchmarks. Code is available at https://github.com/poodarchu/EFG .

Summary

- The paper introduces a hybrid candidate box generation method that combines temporal predictions with current detections to robustly track objects.
- It leverages per-hypothesis feature encoding and transformer-based interactions to fuse long-term motion and short-term appearance cues.
- Evaluations on the Waymo dataset show significant improvements in MOTA and recall, enhancing object recovery in challenging dynamic environments.

Overview

The paper introduces TrajectoryFormer, a transformer-based framework for point-cloud 3D multi-object tracking (MOT). It targets a key weakness of the tracking-by-detection paradigm: because association uses only the current frame's detection boxes, objects missed by the detector cannot be recovered. TrajectoryFormer generates predictive trajectory hypotheses to stay robust to such detection errors, and it reports state-of-the-art performance on the Waymo 3D MOT benchmarks.

Key Contributions

The primary innovation in TrajectoryFormer is the use of multiple trajectory hypotheses per tracked object, built from both temporally predicted boxes and current-frame detection boxes. Because the predicted boxes carry an object's historical trajectory into the current frame, the tracker can keep associating objects that are temporarily occluded or missed by the detector.

Hybrid Candidate Box Generation: The framework introduces a novel multi-hypothesis generation process, which combines:
- Temporal Predictions: A dedicated motion prediction network propagates historical trajectory information to the current frame. The network is small and efficient, so boxes predicted from several past frames can all be considered as candidates.
- Current-frame Detection Boxes: Detections in the current frame that are matched to a trajectory by minimal spatial-temporal distance provide the base candidates for tracking.
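The hybrid candidate set described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: constant-velocity extrapolation stands in for the learned motion prediction network, boxes are reduced to 2D centers, and the names `Box`, `predicted_candidates`, `nearby_detections`, `hybrid_candidates`, `num_past`, and `max_dist` are all hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    source: str  # "predicted" or "detected"


def predicted_candidates(history, num_past):
    """Predict current-frame boxes from each of the last `num_past` frames.

    ASSUMPTION: constant-velocity extrapolation stands in for the paper's
    learned motion prediction network. `history` holds (x, y) box centers,
    oldest first, ending at the previous frame.
    """
    boxes = []
    for k in range(1, num_past + 1):
        if len(history) < k + 1:
            break  # not enough history to estimate a velocity at frame t-k
        (px, py), (qx, qy) = history[-k - 1], history[-k]
        vx, vy = qx - px, qy - py  # per-frame velocity observed at frame t-k
        boxes.append(Box(qx + k * vx, qy + k * vy, "predicted"))
    return boxes


def nearby_detections(last_center, detections, max_dist):
    """Keep current-frame detections within `max_dist` of the last known center.

    ASSUMPTION: a plain Euclidean distance threshold is used for matching.
    """
    lx, ly = last_center
    return [Box(dx, dy, "detected") for dx, dy in detections
            if hypot(dx - lx, dy - ly) <= max_dist]


def hybrid_candidates(history, detections, num_past=2, max_dist=2.0):
    """Union of temporally predicted boxes and nearby current-frame detections."""
    return (predicted_candidates(history, num_past)
            + nearby_detections(history[-1], detections, max_dist))


# The detector missed the object (only a far-away detection exists), yet the
# trajectory still receives candidates derived from its own history.
history = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = hybrid_candidates(history, detections=[(10.0, 10.0)])
```

In this toy case every candidate comes from prediction, which is the situation where a detection-only tracker would have nothing to associate with.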

Feature Encoding and Interaction:
- Per-hypothesis Feature Encoding: The architecture processes both long-term motion features and short-term appearance features. A PointNet-like encoder is utilized for motion features, while attention-based methods efficiently capture appearance features from the point clouds.
- Global-Local Interaction Module: This novel feature interaction module leverages transformers to model interactions both among hypotheses within a trajectory and across a scene. Such dual interactions enrich each hypothesis's context representation, leading to robust trajectory association.
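The two levels of interaction can be illustrated with a toy sketch. It assumes single-head scaled dot-product self-attention with identity projections (no learned weights), applied first within each trajectory and then across the whole scene; the ordering, the absence of learned layers, and the names `self_attention` and `global_local_interaction` are assumptions for illustration, not details taken from the paper.

```python
from math import exp, sqrt


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    `tokens` is a list of equal-length feature vectors. Each output vector is
    a softmax-weighted average of all input vectors.
    """
    if not tokens:
        return []
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for numerical stability
        weights = [exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out


def global_local_interaction(features):
    """features[t][h] is the embedding of hypothesis h of trajectory t."""
    # Local step: hypotheses attend only to siblings of the same trajectory.
    local = [self_attention(hyps) for hyps in features]
    # Global step: every hypothesis attends to all hypotheses in the scene.
    flat = [h for hyps in local for h in hyps]
    mixed = self_attention(flat)
    # Restore the per-trajectory grouping.
    out, start = [], 0
    for hyps in local:
        out.append(mixed[start:start + len(hyps)])
        start += len(hyps)
    return out


# Two trajectories: the first has two hypotheses, the second has one.
features = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
enriched = global_local_interaction(features)
```

The output keeps the same grouping as the input, so a scoring head could rank the hypotheses of each trajectory afterwards.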

Numerical Results and Performance

TrajectoryFormer was evaluated on the Waymo Open Dataset, where it improves over existing methods such as CenterPoint, SimpleTrack, and ImmortalTracker. Gains are reported in MOTA, false positives (FP), and misses (Miss), indicating that the method refines detection results and raises recall by compensating for missed detections.

- Waymo Validation Set: TrajectoryFormer shows a substantial boost in MOTA, attributed to better box quality along each trajectory and to the recovery of objects the detector missed.
- Waymo Testing Set: Results on the test set are consistent with the validation results across all object categories.
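For readers unfamiliar with the metric, MOTA in the standard CLEAR MOT formulation combines misses, false positives, and identity switches into one score relative to the number of ground-truth objects. The helper below is a small sketch of that formula; the function name `mota` and the counts in the usage line are made up for illustration and are not results from the paper.

```python
def mota(misses, false_positives, id_switches, num_gt):
    """CLEAR MOT accuracy: 1 - (misses + FP + ID switches) / ground-truth count.

    All counts are summed over every frame of the evaluated sequences.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (misses + false_positives + id_switches) / num_gt


# Illustrative counts only: fewer misses raise MOTA when the other terms are fixed.
baseline = mota(misses=20, false_positives=5, id_switches=1, num_gt=100)
improved = mota(misses=10, false_positives=5, id_switches=1, num_gt=100)
```

Since misses enter the numerator directly, recovering objects the detector missed raises MOTA as long as it does not add a comparable number of false positives.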

Implications and Speculations

TrajectoryFormer's introduction of predictive trajectory hypotheses is a significant step forward in 3D MOT, proving beneficial in dynamic scenarios encountered by autonomous vehicles and service robots. By improving object recall and reducing erroneous associations prevalent in traditional tracking methodologies, the framework allows autonomous systems to function more reliably under challenging conditions.

The integration of transformers for modeling spatial-temporal interactions across multiple candidate hypotheses paves the way for enhanced 3D perception modules in complex environments. Future research could further explore different prediction and association strategies, or adapt these methodologies to broader perceptual tasks tied to spatio-temporal understanding, such as multi-modal sensor fusion or long-term trajectory forecasting in complex dynamic scenes.
