TrajectoryNAS: Lidar Trajectory Prediction
- TrajectoryNAS is a NAS framework designed for Lidar-based object detection, tracking, and multi-step trajectory forecasting in autonomous driving.
- It uses multi-objective simulated annealing to optimize prediction accuracy and computational latency, as validated on benchmarks like nuScenes.
- Integrating end-to-end multi-task training and latency-aware design, TrajectoryNAS demonstrates significant improvements over conventional baselines.
TrajectoryNAS is a neural architecture search (NAS) method tailored for trajectory prediction from 3D Lidar point-cloud data, specifically designed for autonomous driving scenarios. It automates the end-to-end design of models that perform object detection, tracking, and multi-step trajectory forecasting in a unified manner. TrajectoryNAS leverages a multi-objective search strategy to optimize both prediction quality and computational latency. The framework demonstrates improved accuracy and efficiency over previous end-to-end baselines, as shown through extensive empirical evaluation on large-scale benchmarks such as the nuScenes dataset (Sharifi et al., 2024).
1. Formulation of Lidar-Based Trajectory Prediction
The fundamental task is to predict the future trajectories of dynamic agents (e.g., cars, pedestrians) detected in Lidar point-cloud sequences. Given a temporally ordered set of point-cloud frames $\{P_t\}_{t=1}^{T}$, where each $P_t$ is a set of 3D points, the system detects the agents present in the scene. Each agent $i$ at time $t$ is represented by a state vector $s_t^i$ including its spatial location, bounding box dimensions, orientation, and speed. The goal is to predict future states $\hat{s}_{t+1}^i, \dots, \hat{s}_{t+H}^i$ over a horizon of $H$ steps that closely approximate the true sequence $s_{t+1}^i, \dots, s_{t+H}^i$. Standard evaluation metrics are Average Displacement Error (ADE) and Final Displacement Error (FDE).
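Both metrics follow directly from the predicted and ground-truth waypoints: ADE averages the per-step Euclidean error over the horizon, while FDE keeps only the error at the final step. A minimal NumPy sketch (the array shapes and 2D ground-plane waypoints are illustrative assumptions, not the paper's exact evaluation code):

```python
import numpy as np

def ade_fde(pred, true):
    """Average and Final Displacement Error.

    pred, true: arrays of shape (H, 2) -- H future (x, y) waypoints
    for one agent. ADE averages the Euclidean error over all horizon
    steps; FDE takes the error at the final step only.
    """
    dists = np.linalg.norm(pred - true, axis=-1)  # per-step errors, shape (H,)
    return dists.mean(), dists[-1]

# Toy example: a straight ground-truth track, prediction offset by 1 m in x.
true = np.stack([np.arange(5, dtype=float), np.zeros(5)], axis=1)
pred = true + np.array([1.0, 0.0])
ade, fde = ade_fde(pred, true)
print(ade, fde)  # constant 1 m offset: ADE = 1.0, FDE = 1.0
```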
2. Architecture Search Space and Objective Function
TrajectoryNAS constructs models on top of a 3D backbone (VoxelNet followed by Sparse Feature Pyramid Network, FPN). The search space includes:
- Region Proposal Network (RPN) layers, drawn from sparse-3D-convolution, point-convolution, self-attention, MLP.
- Five prediction heads: Velocity, Rotation, Dimension, Regression, and Height. Each head is a small 2D CNN, where both the layer types and the channel widths are subject to search.
Architectural choices are encoded as discrete parameters (operation selection per layer) and width parameters (channels per layer). Because detection, tracking, and forecasting are searched jointly, the compounded search space is combinatorially large.
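One way to make this encoding concrete is to represent a candidate as an operation choice per RPN layer plus a channel width per head layer; the number of candidates is then the product of the per-parameter option counts. The layer counts and width options below are illustrative assumptions, not the paper's exact configuration:

```python
import random

# Operation choices from the RPN search space described above, and a
# hypothetical set of channel widths for the five prediction heads.
RPN_OPS = ["sparse_conv3d", "point_conv", "self_attention", "mlp"]
HEADS = ["velocity", "rotation", "dimension", "regression", "height"]
WIDTHS = [32, 64, 128, 256]

def sample_architecture(n_rpn_layers=4, layers_per_head=2, rng=random):
    """Draw one random candidate from this toy search space."""
    return {
        "rpn": [rng.choice(RPN_OPS) for _ in range(n_rpn_layers)],
        "heads": {h: [rng.choice(WIDTHS) for _ in range(layers_per_head)]
                  for h in HEADS},
    }

arch = sample_architecture(rng=random.Random(0))
# Toy space size: 4 ops for each of 4 RPN layers, 4 widths for each of
# the 5 * 2 head layers -> 4**4 * 4**10 = 4**14 candidates.
size = len(RPN_OPS) ** 4 * len(WIDTHS) ** (2 * len(HEADS))
print(size)  # 268435456
```

Even this small toy space already exceeds 10^8 candidates, which is why exhaustive enumeration is off the table and an annealing-based search is used instead.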
The multi-objective search seeks architecture parameters $\alpha$ that minimize an energy function balancing prediction quality and runtime latency. TrajectoryNAS employs a multiplicative objective of the general form

$$E(\alpha) = \mathrm{mAP}(\alpha)^{-w_1} \cdot \mathrm{FDE}(\alpha)^{w_2} \cdot \mathrm{Latency}(\alpha)^{w_3},$$

where mAP denotes mean Average Precision for future locations, FDE is the final displacement error, Latency is the measured inference time, and $w_1, w_2, w_3$ are user-defined weights controlling the emphasis on each metric.
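The multiplicative form can be sketched in a few lines; the weight values and the direction of each exponent (accuracy terms inverted so that lower energy is always better) are illustrative assumptions rather than the paper's exact settings:

```python
def energy(mAP, fde, latency_ms, w1=1.0, w2=0.5, w3=0.5):
    """Hypothetical multiplicative energy; lower is better.

    mAP enters inversely (higher accuracy lowers the energy), while
    FDE and latency enter directly. The exponents w1..w3 play the
    role of the user-defined weights in the text.
    """
    return mAP ** -w1 * fde ** w2 * latency_ms ** w3

# A slightly less accurate but notably faster candidate can score a
# lower (better) energy than a baseline-like operating point:
e_base = energy(mAP=0.30, fde=1.0, latency_ms=24.0)
e_fast = energy(mAP=0.29, fde=1.0, latency_ms=18.0)
print(e_fast < e_base)  # True
```

Because the terms multiply rather than add, a large latency saving can outweigh a small accuracy loss, which is exactly the trade-off behavior described in Section 5.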
3. Search Strategy: Multi-Objective Simulated Annealing
TrajectoryNAS employs Multi-Objective Simulated Annealing (MOSA) as its search algorithm. At each iteration $k$, given a current architecture $\alpha_k$ and temperature $T_k$, a neighboring architecture $\alpha'$ is sampled by mutating either the operation of a layer or its channel width. The acceptance probability for the candidate is

$$P(\text{accept}) = \min\!\left(1,\ \exp\!\left(-\frac{E(\alpha') - E(\alpha_k)}{T_k}\right)\right).$$

The temperature is decreased geometrically via $T_{k+1} = \gamma T_k$ with $0 < \gamma < 1$, annealing from an initial $T_{\max}$ down to $T_{\min}$ to balance exploration and exploitation over the search iterations. This approach enables efficient navigation of the expansive, discrete search space.
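The acceptance rule and geometric cooling schedule above fit in a short loop. The sketch below is a generic single-chain simulated-annealing skeleton under the stated schedule, demonstrated on a toy 1-D problem; the specific temperatures, cooling factor, and neighbor function are illustrative assumptions:

```python
import math
import random

def mosa(init, neighbor, energy, t_max=1.0, t_min=1e-3, gamma=0.95,
         rng=random.Random(0)):
    """Minimal simulated-annealing loop.

    Worse neighbors are accepted with probability exp(-(E' - E) / T);
    the temperature cools geometrically (T <- gamma * T) from t_max
    to t_min, and the best state ever visited is returned.
    """
    x, e, t = init, energy(init), t_max
    best, best_e = x, e
    while t > t_min:
        cand = neighbor(x, rng)
        e_cand = energy(cand)
        if e_cand < e or rng.random() < math.exp(-(e_cand - e) / t):
            x, e = cand, e_cand
            if e < best_e:
                best, best_e = x, e
        t *= gamma
    return best, best_e

# Toy search: minimize (x - 3)^2 over integers via +/-1 mutations.
best, best_e = mosa(
    init=10,
    neighbor=lambda x, rng: x + rng.choice([-1, 1]),
    energy=lambda x: (x - 3) ** 2,
)
print(best, best_e)  # converges to the minimum at x = 3
```

In TrajectoryNAS the state would be an architecture encoding, `neighbor` a mutation of one operation or channel width, and `energy` the trained candidate's measured objective.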
4. End-to-End Multi-Task Training
For each candidate architecture, training is conducted end-to-end on a lightweight subset of nuScenes, with direct multi-task supervision. The composite training loss is

$$\mathcal{L} = \lambda_{\text{det}} \mathcal{L}_{\text{det}} + \lambda_{\text{track}} \mathcal{L}_{\text{track}} + \lambda_{\text{fc}} \mathcal{L}_{\text{fc}},$$

where:
- $\mathcal{L}_{\text{det}}$ is the standard object detection loss
- $\mathcal{L}_{\text{track}}$ enforces temporal ID consistency
- $\mathcal{L}_{\text{fc}}$ is the trajectory prediction loss
By assigning the weights $\lambda_{\text{det}}, \lambda_{\text{track}}, \lambda_{\text{fc}}$, the system balances detection, tracking, and forecasting efficacy throughout the search phase.
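The composite loss is a plain weighted sum of the per-task terms. A minimal sketch with scalar stand-ins (in real training each value would be a tensor, and the weight values here are illustrative, not the paper's):

```python
def multitask_loss(losses, weights):
    """Weighted sum of per-task losses, keyed by task name."""
    return sum(weights[k] * losses[k] for k in losses)

loss = multitask_loss(
    losses={"det": 0.8, "track": 0.5, "fc": 1.2},
    weights={"det": 1.0, "track": 0.5, "fc": 2.0},
)
print(loss)  # 1.0*0.8 + 0.5*0.5 + 2.0*1.2 = 3.45
```

Raising one weight steers every searched candidate toward that sub-task, which is how the search phase keeps detection, tracking, and forecasting in balance.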
5. Latency-Aware Optimization and Measurement
To ensure practical deployability, TrajectoryNAS incorporates real-world latency as a first-class metric. Unlike prior work relying on indirect measures (e.g., FLOPs), latency is taken as the observed average inference time over thousands of samples on an NVIDIA RTX A4000 GPU. The energy objective's multiplicative form ensures that efficient architectures with marginally lower prediction quality may be favored if they achieve substantially better latency.
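Averaging measured wall-clock time over many runs, after discarding warm-up iterations, is the standard way to obtain such a latency figure. A minimal sketch (the warm-up and run counts are illustrative; the paper averages over thousands of samples, and on a GPU one would also synchronize the device before reading the clock):

```python
import time

def mean_latency_ms(fn, n_warmup=10, n_runs=100):
    """Average wall-clock time of fn() in milliseconds.

    Warm-up iterations are discarded so that one-time costs
    (allocation, kernel/JIT compilation, caches) do not skew
    the average of the timed runs.
    """
    for _ in range(n_warmup):
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs * 1e3

# Dummy workload standing in for a model forward pass.
lat = mean_latency_ms(lambda: sum(range(10_000)))
print(f"{lat:.4f} ms")
```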
6. Experimental Validation and Comparative Performance
Empirical evaluation on the nuScenes benchmark demonstrates significant advances over prior work. TrajectoryNAS attains a gain in mAP together with a speedup in inference time relative to the end-to-end FutureDet baseline:
- Cars (K=5 predictions):
- FutureDet: mAP = , latency = 24 ms
- TrajectoryNAS: mAP = , latency = 22 ms
- Pedestrians (K=5 predictions):
- FutureDet: mAP = , latency = 24 ms
- TrajectoryNAS: mAP = , latency = 22 ms
Ablative experiments show that replacing MOSA with Random Search or simple Local Search increases the best achieved energy minimum from $0.113$ (MOSA) to approximately $0.19$ and $0.186$, respectively. Adjusting weights in the energy function demonstrates that increasing the weight on FDE prioritizes reducing worst-case displacement error at the expense of minor latency increases.
7. Architectural Insights, Limitations, and Extension Opportunities
Analysis of architectures discovered by TrajectoryNAS reveals critical design patterns:
- Shallow 3D-to-2D heads in the RPN effectively reduce latency with negligible detection loss.
- Cross-attention layers within the forecasting heads yield mAP gains over conventional 2D convolutions, particularly for non-linear trajectories.
- Rotation and Dimension prediction heads tend to benefit from increased channel width, highlighting the importance of precise, high-capacity estimation for accurate placement.
Current limitations include the exclusion of map-based features (e.g., HD maps), which are effective in structured road environments. The search process itself is computationally expensive for the 300-layer space, but improvements such as weight-sharing or a learned performance predictor are proposed to accelerate it. Integrating multi-sensor fusion (camera + lidar + radar) is recognized as a potential extension, increasing representational richness at the cost of a larger search space.
In summary, TrajectoryNAS establishes a unified, latency-aware neural architecture search framework for Lidar-based trajectory prediction, achieving robust accuracy and efficient deployment for autonomous driving systems (Sharifi et al., 2024).