NuPrompt Benchmark for 3D Tracking
- NuPrompt is a benchmark that integrates natural language prompts with multi-modal spatiotemporal data for tracking multiple objects in autonomous driving scenes.
- It enriches the nuScenes dataset by annotating free-form, object-centric prompts across multi-view images and LiDAR sweeps, enabling precise trajectory prediction.
- The evaluation employs metrics like AMOTA, MOTP, recall, and FAF, demonstrating significant gains through cross-modal fusion and temporal reasoning techniques.
NuPrompt is a large-scale benchmark designed for language-grounded, multi-object, multi-view, and multi-frame 3D tracking within autonomous-driving scenes. Constructed atop the nuScenes dataset, NuPrompt integrates object-centric natural-language prompts with spatiotemporal 3D data, providing a foundation for prompt-based trajectory prediction and grounding tasks in computer vision and robotics. NuPrompt has become pivotal for research on models that jointly leverage vision, LiDAR, and free-form language to identify and track objects described by attribute-rich or behavioral prompts in dynamic driving environments (Wu et al., 2023).
1. Benchmark Construction and Dataset Statistics
NuPrompt extends the nuScenes dataset (1,000 scenes; public train/val/test splits) with free-form, object-centric natural-language annotations. Its main characteristics are:
- Annotation protocol: 850 scenes (20 s each; ~40–50 frames sampled at 2 Hz) are annotated with pairs of prompts and tracklet groups, resulting in 34,149 synchronized frames and 40,147–40,776 prompts, depending on the specific release. Every frame includes six camera views (front, front-left, front-right, back-left, back-right, back), synchronized with LiDAR sweeps.
- Prompt structure: Each prompt refers to anywhere from one to dozens of objects (mean ≈ 5.3–7.4, depending on the generation pipeline), spanning cars (~40% of references), pedestrians (~15%), and a mixture of barriers, cones, bicycles, trucks, and other classes. The attribute vocabulary covers color, location, motion, and object class, each manually linked to tracklet properties.
- Annotation workflow: Attributes are first assigned to tracklets, logical selection (AND/OR/NOT) constructs subset groupings, and GPT-3.5 generates natural-language prompts, which are filtered for plausibility. Prompts typically describe behavioral context (e.g., “all red cars turning right except those stopped at lights”).
- Scene diversity: Collected across Boston and Singapore, with varying weather and lighting. The annotations enable granular, behavior-sensitive, object-level retrieval across challenging urban scenes.
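The logical-selection step of the annotation workflow can be illustrated with a minimal Python sketch. It is not the authors' tooling: the `Tracklet` record, the attribute strings, and the `select` helper are hypothetical names introduced here only to show how AND/OR/NOT constraints over tracklet attributes could yield the object group that a prompt later describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracklet:
    """Hypothetical tracklet record; field names are illustrative only."""
    track_id: str
    attributes: frozenset


def select(tracklets, all_of=(), any_of=(), none_of=()):
    """Return ids of tracklets satisfying AND / OR / NOT attribute constraints."""
    selected = set()
    for t in tracklets:
        has_all = all(a in t.attributes for a in all_of)                 # AND
        has_any = not any_of or any(a in t.attributes for a in any_of)   # OR
        has_none = not any(a in t.attributes for a in none_of)           # NOT
        if has_all and has_any and has_none:
            selected.add(t.track_id)
    return selected


# Toy scene with made-up attributes (not taken from the dataset).
tracklets = [
    Tracklet("t1", frozenset({"car", "red", "turning_right"})),
    Tracklet("t2", frozenset({"car", "red", "stopped"})),
    Tracklet("t3", frozenset({"pedestrian", "walking"})),
    Tracklet("t4", frozenset({"car", "black", "turning_right"})),
]

# "Red cars, except those that are stopped."
group = select(tracklets, all_of=("car", "red"), none_of=("stopped",))
```

In this toy setup `group` contains only `t1`; a language model would then be asked to phrase the selected group as a natural-language prompt.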
2. Formal Definition of Prompt-Guided Tracking Task
NuPrompt tasks models with localizing, identifying, and temporally tracking all objects described by a specific language command across multiple camera views and frames, abstracted as:
- Inputs:
- Language prompt tokenized into embeddings (typically via a frozen RoBERTa encoder),
- Sequence of multi-view image frames, each mapped to a BEV-like feature space,
- Optionally, LiDAR point clouds (processed into tokens for multimodal fusion).
- Outputs:
- The set of 3D trajectories matching the prompt, with each trajectory encoding per-frame 3D box properties (location, dimensions, heading).
- Objective function: The training loss combines multi-object detection/classification, motion regression, past/future trajectory prediction, and prompt-label prediction via cross-modal attention (Wu et al., 2023).
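To make the task interface concrete, the inputs and outputs listed above might be represented as follows. This is a sketch under assumptions: the class names (`Box3D`, `Trajectory`, `PromptTrackingSample`), their fields, and the helper `referred_ids_per_frame` are hypothetical and do not reflect the benchmark's actual data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Box3D:
    center: tuple    # (x, y, z) location
    size: tuple      # (width, length, height)
    heading: float   # yaw angle in radians


@dataclass
class Trajectory:
    track_id: int
    boxes: List[Optional[Box3D]]  # one entry per frame; None when absent


@dataclass
class PromptTrackingSample:
    prompt: str                        # free-form language command
    frames: Sequence[dict]             # per frame: camera name -> image handle
    lidar: Optional[Sequence] = None   # optional per-frame point clouds


def referred_ids_per_frame(trajectories, num_frames):
    """For each frame, list the track ids that have a box in that frame."""
    return [
        sorted(t.track_id for t in trajectories if t.boxes[f] is not None)
        for f in range(num_frames)
    ]


# Two toy trajectories over two frames; all values are made up.
box = Box3D(center=(0.0, 0.0, 0.0), size=(1.9, 4.5, 1.6), heading=0.0)
trajs = [Trajectory(1, [box, None]), Trajectory(2, [box, box])]
visible = referred_ids_per_frame(trajs, num_frames=2)
```

A model for this task would consume a `PromptTrackingSample` and return a list of `Trajectory` objects restricted to the objects the prompt refers to.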
3. Evaluation Metrics
Prompt-based tracking on NuPrompt is evaluated using class-agnostic CLEAR-MOT metrics, adapted to measure language-guided tracking performance:
- Multiple Object Tracking Accuracy (MOTA): $\mathrm{MOTA} = 1 - \frac{\mathrm{FP} + \mathrm{FN} + \mathrm{IDS}}{\mathrm{GT}}$
- FP: false positives; FN: false negatives; IDS: identity switches; GT: ground-truth referenced objects.
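- Average MOTA (AMOTA): $\mathrm{AMOTA} = \frac{1}{n-1} \sum_{r \in \{\frac{1}{n-1}, \frac{2}{n-1}, \ldots, 1\}} \mathrm{MOTA}_r$, where $\mathrm{MOTA}_r$ is MOTA evaluated at the operating point with recall $r$.
- Multiple Object Tracking Precision (MOTP) and Average MOTP (AMOTP): Measure the mean localization error over matched prediction/ground-truth pairs (lower is better).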
- Recall: Proportion of referred objects correctly tracked.
- Temporal Identity Discontinuity (TID): The mean break/fragmentation in tracked trajectories (lower is better).
- False Alarm Frequency (FAF): Average count of non-grounded predictions per frame (lower is better).
Metrics are computed over multiple detection-score thresholds, reflecting performance at various operating points (Wu et al., 2023, Yu et al., 25 Dec 2025).
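The arithmetic behind these metrics can be sketched in a few lines of Python. This is a simplified illustration, not the official evaluation code: the function names are hypothetical, MOTA follows the standard CLEAR-MOT form 1 − (FP + FN + IDS)/GT, and `amota` is assumed here to be a plain average of per-threshold MOTA values clipped at zero, whereas official tooling may apply additional recall normalization.

```python
def mota(fp, fn, ids, gt):
    """CLEAR-MOT accuracy: 1 - (FP + FN + IDS) / GT."""
    if gt <= 0:
        raise ValueError("gt must be positive")
    return 1.0 - (fp + fn + ids) / gt


def amota(mota_per_threshold):
    """Simplified AMOTA: mean of per-threshold MOTA, each clipped at zero."""
    if not mota_per_threshold:
        raise ValueError("need at least one operating point")
    clipped = [max(0.0, m) for m in mota_per_threshold]
    return sum(clipped) / len(clipped)


def recall(tp, fn):
    """Fraction of referred ground-truth objects that were matched."""
    return tp / (tp + fn)


def faf(false_positives, num_frames):
    """Average number of false alarms per frame."""
    return false_positives / num_frames


# Toy counts, chosen only to exercise the functions.
example_mota = mota(fp=10, fn=20, ids=5, gt=100)
example_amota = amota([0.2, 0.4, 0.6])
```

Because the ground truth contains only the objects referred to by the prompt, a tracker that follows every object in the scene accumulates false positives and is penalized in MOTA and FAF even when its boxes are accurate.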
4. Baseline and State-of-the-Art Model Architectures
Initial NuPrompt experiments employ PromptTrack—an end-to-end Transformer-based tracker leveraging prompt embeddings for object selection and set-prediction:
- Fusion mechanism: Element-wise multiplication of visual features by prompt embeddings, followed by positional encoding and cross-attention.
- Set prediction: A pool of track queries and detection queries is decoded with multi-layer transformers to output candidate tracks and boxes.
- Temporal reasoning: Features from a window of past frames and predicted future motion over a short horizon improve tracking stability.
- Prompt reasoning: Queries are filtered via cross-modal attention with prompt tokens, with a tunable threshold selecting the final referred objects.
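The fusion and prompt-reasoning steps described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not PromptTrack's implementation: the feature vectors are short lists rather than tensors, the per-query logits are assumed to come from some upstream cross-modal attention module, and the default threshold of 0.5 is a placeholder rather than the value used in the paper.

```python
import math


def fuse(visual, prompt):
    """Element-wise product of a visual feature vector and a prompt embedding."""
    if len(visual) != len(prompt):
        raise ValueError("feature dimensions must match")
    return [v * p for v, p in zip(visual, prompt)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def select_referred(query_logits, threshold=0.5):
    """Keep indices of queries whose prompt-match probability exceeds the threshold."""
    return [i for i, z in enumerate(query_logits) if sigmoid(z) > threshold]


# Toy values for illustration only.
fused = fuse([1.0, 2.0, 3.0], [0.5, 0.0, 2.0])
kept = select_referred([2.0, -1.0, 0.1], threshold=0.5)
```

Raising the threshold trades recall for fewer false alarms: with the same logits, a threshold of 0.6 keeps only the first query.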
TrackTeller extends the architecture with unified multimodal (camera + LiDAR) UniScene fusion, language-grounded decoding, and explicit short-term and future trajectory modeling (Yu et al., 25 Dec 2025). Models maintain temporal banks of embeddings over a window of frames, process BEV feature maps at a fixed resolution, and train with balanced loss weights using Adam and early stopping.
5. Empirical Benchmark Results
Benchmarking results reveal substantial gains over heuristic and prompt-naive tracking baselines:
| Model/Method | AMOTA ↑ | AMOTP ↓ | Recall ↑ | TID ↓ | FAF ↓ |
|---|---|---|---|---|---|
| DQTrack (DETR3D/PETR) | 0.87/0.31 | 1.93/1.90 | 8.67/9.90 | 10.38/10.01 | 658.2/747.5 |
| PF-Track (DETR3D/PETR) | 1.73/1.23 | 1.78/1.77 | 17.29/22.30 | 8.85/6.15 | 764.7/911.4 |
| PromptTrack (DETR3D/PETR) | 2.14/1.10 | 1.74/1.81 | 19.85/17.43 | 5.77/7.69 | 705.5/473.9 |
| PromptTrack-3D (PETR) | 7.08 | 1.55 | 37.69 | 5.03 | 723.1 |
| TrackTeller (PETR) | 17.16 | 1.42 | 41.24 | 3.85 | 187.7 |
TrackTeller is reported to achieve a 70% relative AMOTA gain and a 3.15–3.4× reduction in FAF over the previous state of the art, with improved localization and reduced trajectory fragmentation (Yu et al., 25 Dec 2025).
PromptTrack ablation studies show that prompt reasoning is the most critical component (removing it causes the largest AMOTA drop), followed by short-term past and future temporal modeling. The ablations also identify an optimal setting for the prompt selection threshold (Wu et al., 2023).
6. Design Insights, Limitations, and Future Directions
NuPrompt enables the study of interactive grounding, temporal reasoning, and multimodal fusion for prompt-based autonomous driving tasks. Key findings include:
- Language-grounded tracking is substantially improved by explicit cross-modal fusion, history-aware temporal modeling, and future-propagation mechanisms, which stabilize track continuity (TID) and suppress false alarms (FAF).
- PromptTrack and TrackTeller demonstrate that multimodal fusion (camera + LiDAR), coupled with Transformers and set-prediction, yields improved recall and robustness relative to pure vision baselines.
- Handling ambiguous or multi-clause prompts, longer-term scene history, and integrating planning or scene-graph prediction remain open challenges.
- The dataset’s scale and flexible annotation pipeline open avenues for text-to-scene synthesis, reinforcement learning with natural language, and interactive perception for autonomous agents.
A plausible implication is that NuPrompt’s benchmark structure—object-centric, multi-object, cross-view, multi-modal tracking conditioned on language—will inform future designs of perception and reasoning systems operating under flexible human instruction in dynamic environments.
7. Impact and Research Applications
NuPrompt is fostering research into joint visual-linguistic trajectory prediction, attribute-aware 3D perception, and real-time interactive scene understanding in autonomous systems. It has catalyzed new approaches to prompt-conditioned tracking and attribute-based retrieval, and is referenced as foundational in studies exploring temporal multimodal grounding, as in TrackTeller (Yu et al., 25 Dec 2025), and prompt reasoning architectures, as in PromptTrack (Wu et al., 2023). The benchmark, with its scale, diversity, and richly annotated prompt-object associations, provides a rigorous basis for evaluating models under realistic, human-relevant language conditions.