DONUT: A Decoder-Only Model for Trajectory Prediction (2506.06854v1)

Published 7 Jun 2025 in cs.CV

Abstract: Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Different from existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner, and ensures that the model is always provided with up-to-date information, enhancing the performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an 'overprediction' strategy that gives the network the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future, and further improves the performance. With experiments, we demonstrate that our decoder-only approach outperforms the encoder-decoder baseline, and achieves new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark.

Summary

The paper introduces DONUT, a decoder-only model that handles historical and future trajectories with a single network to improve long-term prediction accuracy in autonomous driving.
The methodology replaces traditional encoder-decoder architectures with an autoregressive decoder and an 'overprediction' strategy, achieving superior results on the Argoverse 2 benchmark.
The findings advocate for further exploration of unified architectures in motion forecasting, potentially leading to more reliable and consistent trajectory prediction systems.

Summary of DONUT: A Decoder-Only Model for Trajectory Prediction

The paper presents DONUT, a Decoder-Only Network for Unrolling Trajectories, aimed at trajectory prediction for autonomous driving. Unlike traditional encoder-decoder models, it uses a single autoregressive decoder both to encode historical trajectories and to predict future ones. The model therefore predicts iteratively, with each step conditioned on up-to-date information, which the authors report improves performance.

Technical Approach and Novelty

DONUT processes historical and future trajectories with one unified model, in contrast to the separate encoder and decoder found in prior motion forecasting models. The motivation is a hypothesized weakness of encoder-decoder models: embeddings of the history can become outdated as the prediction advances, which hurts long-term predictions. DONUT aims to avoid this by encoding and unrolling trajectories within a single autoregressive model.
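The iterative unrolling described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `unroll`, `constant_velocity_step`, and the chunk size are hypothetical names and values chosen for illustration, and a constant-velocity extrapolation stands in for the learned decoder. The point is only the control flow, in which each predicted chunk is appended to the context before the next step.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
StepFn = Callable[[List[Point]], List[Point]]


def constant_velocity_step(context: List[Point], chunk: int = 2) -> List[Point]:
    """Hypothetical stand-in for the learned decoder (an assumption, not DONUT):
    extrapolate the last observed displacement for `chunk` positions."""
    (x0, y0), (x1, y1) = context[-2], context[-1]
    dx, dy = x1 - x0, y1 - y0
    return [(x1 + dx * k, y1 + dy * k) for k in range(1, chunk + 1)]


def unroll(history: List[Point], predict_next: StepFn, num_steps: int) -> List[Point]:
    """Autoregressive rollout: every predicted chunk is appended to the context,
    so the next call always sees the most recent positions."""
    context = list(history)  # copy so the caller's history is not mutated
    future: List[Point] = []
    for _ in range(num_steps):
        chunk = predict_next(context)
        context.extend(chunk)  # feed the prediction back in
        future.extend(chunk)
    return future


history = [(0.0, 0.0), (1.0, 0.5)]
future = unroll(history, constant_velocity_step, num_steps=3)
print(future)
```

Because the same function consumes both observed and predicted positions, there is no separately encoded history that could go stale during the rollout; that is the property the summary attributes to the decoder-only design.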

Furthermore, DONUT introduces an 'overprediction' strategy inspired by multi-token prediction in language modeling: the network is given the auxiliary task of predicting trajectories at longer temporal horizons. The aim is to help the model anticipate the future better, which the authors report further improves performance.
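One way to picture overprediction as an auxiliary task is a training loss with two terms: one for the next trajectory chunk and one for a chunk further into the future. The sketch below is an assumption-laden illustration, not the paper's loss: the mean L2 distance, the names `mean_l2` and `overprediction_loss`, and the weight `aux_weight=0.5` are all hypothetical choices.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def mean_l2(pred: List[Point], target: List[Point]) -> float:
    """Mean Euclidean distance between matched predicted and target positions."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and equally long")
    total = sum(
        ((px - tx) ** 2 + (py - ty) ** 2) ** 0.5
        for (px, py), (tx, ty) in zip(pred, target)
    )
    return total / len(pred)


def overprediction_loss(
    pred_next: List[Point],
    pred_over: List[Point],
    gt_next: List[Point],
    gt_over: List[Point],
    aux_weight: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Main loss on the next chunk plus a weighted auxiliary loss on the
    longer-horizon ('overpredicted') chunk."""
    return mean_l2(pred_next, gt_next) + aux_weight * mean_l2(pred_over, gt_over)


gt_next = [(1.0, 0.0), (2.0, 0.0)]
gt_over = [(3.0, 0.0), (4.0, 0.0)]
pred_next = [(1.0, 0.0), (2.0, 0.0)]  # perfect short-horizon prediction
pred_over = [(3.0, 3.0), (4.0, 4.0)]  # drifting long-horizon prediction
loss = overprediction_loss(pred_next, pred_over, gt_next, gt_over)
print(loss)
```

In this toy case the short-horizon prediction is exact, so the whole loss comes from the auxiliary term; the longer-horizon target still produces a training signal even when the immediate prediction is already correct.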

Experimental Evaluation

In the reported experiments, DONUT outperforms an encoder-decoder baseline and achieves state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark. Notably, DONUT shows substantial improvements in long-term prediction accuracy, which supports the hypothesis that an autoregressive model benefits from keeping its context up to date throughout the rollout.

Implications and Future Directions

The paper suggests that removing the split between encoder and decoder can yield tangible improvements in prediction consistency and accuracy. For autonomous driving, this could lead to more reliable motion forecasting systems that adapt more naturally to dynamic road conditions.

From a theoretical standpoint, the paper encourages further exploration of decoder-only models and of multi-token prediction strategies in motion forecasting. Future research could explore variants of decoder-only architectures or assess how DONUT scales across driving scenarios of varying complexity.

In conclusion, DONUT is a meaningful advance in trajectory forecasting, offering a new perspective on model architecture design that could improve predictive capability in autonomous systems.