Temporally Consistent Long-Term Memory for 3D Single Object Tracking

Published 15 Apr 2026 in cs.CV | (2604.13789v1)

Abstract: 3D Single Object Tracking (3D-SOT) aims to localize a target object across a sequence of LiDAR point clouds, given its 3D bounding box in the first frame. Recent methods have adopted a memory-based approach to utilize previously observed features of the target object, but remain limited to only a few recent frames. This work reveals that their temporal capacity is fundamentally constrained to short-term context due to severe temporal feature inconsistency and excessive memory overhead. To this end, we propose a robust long-term 3D-SOT framework, ChronoTrack, which preserves the temporal feature consistency while efficiently aggregating the diverse target features via long-term memory. Based on a compact set of learnable memory tokens, ChronoTrack leverages long-term information through two complementary objectives: a temporal consistency loss and a memory cycle consistency loss. The former enforces feature alignment across frames, alleviating temporal drift and improving the reliability of proposed long-term memory. In parallel, the latter encourages each token to encode diverse and discriminative target representations observed throughout the sequence via memory-point-memory cyclic walks. As a result, ChronoTrack achieves new state-of-the-art performance on multiple 3D-SOT benchmarks, demonstrating its effectiveness in long-term target modeling with compact memory while running at real-time speed of 42 FPS on a single RTX 4090 GPU. The code is available at https://github.com/ujaejoon/ChronoTrack


Authors (5)

Summary

The paper introduces ChronoTrack, a token-based memory mechanism that enforces temporal and cycle consistency to mitigate feature drift in 3D single object tracking.
It leverages a DGCNN backbone with distinct foreground and background memory strategies to maintain robust target representations amidst occlusion and appearance variations.
Empirical results show ChronoTrack outperforms prior methods on the KITTI, nuScenes, and Waymo datasets while keeping memory overhead constant as the temporal context grows.


Introduction and Motivation

ChronoTrack addresses key limitations in 3D-SOT by extending the effective temporal context used for target modeling in point cloud sequences. Existing methods leverage memory modules but are fundamentally restricted to short-term temporal windows, typically storing only a few frames of features. This constraint is rooted in temporal feature inconsistency, where the representation of target objects drifts over time due to occlusion, viewpoint change, and appearance variations. Furthermore, point-level memory structures scale memory and computational overhead linearly with temporal length, which is prohibitive for deployment in real-world settings, particularly on edge devices.

ChronoTrack introduces a token-based memory mechanism supplemented by two core objectives: a temporal consistency loss and a memory cycle consistency loss. These jointly enforce temporal feature alignment and promote semantic diversity within fixed-sized, learnable memory tokens, enabling scalable and effective long-term feature aggregation without incurring the high memory and compute costs of point-level methods.

Figure 1: (a) ChronoTrack maintains temporal consistency in target representations, contrasting with drift in previous methods. (b) Higher average cosine similarity across time frames enables reliable long-term context aggregation compared to MBPTrack.

ChronoTrack Architecture

ChronoTrack processes each LiDAR frame through a feature extraction backbone (DGCNN), producing point-wise features. These are fused with a compact set of foreground memory tokens, which iteratively assimilate target features from all previous frames, and a background memory with contextual features from the most recent frame. The composite features are used by a point- and box-level decoder for bounding box regression and targetness prediction.

Foreground token memory is recurrently updated with predicted target features at each timestep, while background memory is refreshed using only the latest frame to avoid introducing temporal inconsistency from rapidly changing backgrounds.

Figure 2: The ChronoTrack pipeline: current point features are refined via long-term foreground and short-term background memory, then used for bounding box and targetness prediction.
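The two memory paths described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the cross-attention read, the log-targetness bias, the `momentum` blend, and the function names `update_foreground_tokens` and `refresh_background_memory` are all assumptions chosen for illustration. The point it shows is structural: the foreground memory keeps a fixed shape of K tokens no matter how many frames are absorbed, while the background memory is simply overwritten each frame.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def update_foreground_tokens(tokens, point_feats, targetness, momentum=0.9):
    """Fold one frame of features into a fixed-size token memory.

    tokens:      (K, C) current foreground memory tokens
    point_feats: (N, C) point-wise features of the current frame
    targetness:  (N,)   predicted probability that each point is on the target
    The attention-based read and momentum blend are illustrative assumptions.
    """
    channels = tokens.shape[1]
    logits = tokens @ point_feats.T / np.sqrt(channels)       # (K, N)
    # Bias attention toward points predicted to belong to the target.
    logits = logits + np.log(targetness + 1e-8)[None, :]
    attn = softmax(logits, axis=1)                            # each token attends over points
    read = attn @ point_feats                                 # (K, C) per-token summary
    return momentum * tokens + (1.0 - momentum) * read

def refresh_background_memory(point_feats, targetness, threshold=0.5):
    """Background memory holds only the latest frame's non-target features."""
    return point_feats[targetness < threshold]

# Hypothetical sizes: 4 tokens, 8 channels, 32 points per frame.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
for _ in range(5):
    feats = rng.normal(size=(32, 8))
    scores = rng.uniform(size=32)
    tokens = update_foreground_tokens(tokens, feats, scores)
    background = refresh_background_memory(feats, scores)     # replaced, not accumulated
print(tokens.shape)  # stays (4, 8) regardless of sequence length
```

A learned update (for example a transformer layer) would replace the fixed momentum blend in practice; the sketch only fixes the shapes and the data flow.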

Temporal Consistency and Cycle Consistency Losses

ChronoTrack’s training procedure features two novel losses:

Temporal Consistency Loss ($\mathcal{L}_{\mathrm{TC}}$): Ground-truth foreground points across frames are registered into a canonical coordinate system, and spatially corresponding points are identified. The model is penalized when the features of temporally aligned points differ, which counteracts temporal drift and keeps the feature space aligned for long-term aggregation.
Memory Cycle Consistency Loss ($\mathcal{L}_{\mathrm{MCC}}$): To encourage each memory token to encode a semantically distinct part of the target, ChronoTrack optimizes cyclic walks in feature space (tokens $\rightarrow$ points $\rightarrow$ tokens). This loss maximizes the two-step return probability and path affinity over foreground regions, directly promoting token specialization and diversity.

Figure 3: (a) Temporal consistency loss enforces alignment between paired points across time. (b) The memory cycle consistency loss drives token specialization through two-step cyclic walks.
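A minimal sketch of the temporal consistency idea follows, under stated assumptions: boxes are given as a center and a yaw angle about the z-axis, correspondences are found by nearest neighbor in the canonical box frame within a hypothetical `max_dist`, and the penalty is one minus cosine similarity. The paper's exact matching rule and distance function are not given in this summary, so these choices are illustrative only.

```python
import numpy as np

def to_canonical(points, center, yaw):
    """Map world-frame points (N, 3) into the box frame (origin at center, yaw removed)."""
    c, s = np.cos(yaw), np.sin(yaw)
    # R rotates box axes into world axes; row-vector form of R^T (p - center).
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return (points - center) @ R

def temporal_consistency_loss(pts_a, feats_a, box_a, pts_b, feats_b, box_b, max_dist=0.1):
    """Penalize feature dissimilarity between spatially matched foreground points.

    pts_*:   (N, 3) foreground points of each frame, world coordinates
    feats_*: (N, C) point-wise features
    box_*:   (center, yaw) ground-truth box of each frame
    max_dist is an assumed matching radius in the canonical frame.
    """
    canon_a = to_canonical(pts_a, *box_a)
    canon_b = to_canonical(pts_b, *box_b)
    dists = np.linalg.norm(canon_a[:, None, :] - canon_b[None, :, :], axis=-1)  # (Na, Nb)
    nearest = dists.argmin(axis=1)
    valid = dists[np.arange(len(canon_a)), nearest] < max_dist
    if not valid.any():
        return 0.0
    fa = feats_a[valid]
    fb = feats_b[nearest[valid]]
    fa = fa / (np.linalg.norm(fa, axis=1, keepdims=True) + 1e-8)
    fb = fb / (np.linalg.norm(fb, axis=1, keepdims=True) + 1e-8)
    cosine = (fa * fb).sum(axis=1)
    return float((1.0 - cosine).mean())
```

With this formulation, a rigid object observed under two different poses yields a loss near zero only if the backbone assigns the same feature to the same physical location on the object in both frames.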

Empirical Results

ChronoTrack achieves state-of-the-art performance on the KITTI, nuScenes, and Waymo Open Dataset benchmarks, consistently outperforming both template-based and memory-based prior art. On KITTI, ChronoTrack attains a Mean Success of 71.8 and a Mean Precision of 90.1, surpassing MBPTrack and MemDisst. On nuScenes, ChronoTrack yields a Mean Success/Precision of 59.7/72.7, outperforming recent strong baselines by substantial margins.

Qualitative analysis highlights ChronoTrack’s robustness against occlusion and distractors, and its efficacy in leveraging long temporal contexts. Generalization tests demonstrate strong domain transfer without the need for fine-tuning.

Figure 4: Qualitative predictions on KITTI, revealing ChronoTrack’s robustness under sparse and ambiguous scenarios.

Analysis and Ablations

Token Diversity and Feature Specialization

The memory cycle consistency loss drives specialization such that each token reliably attends to a unique semantic target part, which is critical for modeling deformable objects and handling appearance variability.

Figure 5: Token assignment visualization. With the MCC loss, token assignments diversify to reflect distinct parts of the object.
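The cyclic-walk idea behind the MCC loss can be sketched as follows. This is an assumption-laden illustration, not the paper's formulation: transitions are softmaxes over cosine similarity with a hypothetical temperature `tau`, only the two-step return probability is modeled, and the path affinity term mentioned in the text is omitted.

```python
import numpy as np

def _softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_cycle_consistency_loss(tokens, fg_feats, tau=0.1):
    """Two-step walk tokens -> points -> tokens over foreground features.

    tokens:   (K, C) memory tokens
    fg_feats: (N, C) features of foreground points
    Returns the mean negative log return probability and the (K, K) round-trip matrix.
    """
    t = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    p = fg_feats / (np.linalg.norm(fg_feats, axis=1, keepdims=True) + 1e-8)
    sim = t @ p.T / tau                         # (K, N) scaled cosine similarity
    token_to_point = _softmax(sim, axis=1)      # each token's distribution over points
    point_to_token = _softmax(sim.T, axis=1)    # each point's distribution over tokens
    round_trip = token_to_point @ point_to_token  # (K, K), rows sum to 1
    return_prob = np.diag(round_trip)           # probability a walk returns to its start token
    loss = float(-np.log(return_prob + 1e-8).mean())
    return loss, round_trip
```

The loss is small only when each token's walk returns to itself, which requires that the points it attends to prefer that token over all others. If all K tokens collapse to the same vector, every return probability becomes 1/K and the loss rises to about log K, so collapsed memories are penalized.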

Background Memory Design

Extending temporal context in the background memory degrades performance due to the non-stationary nature of the background, justifying ChronoTrack’s choice to restrict background memory to the most recent frame.

Figure 6: Increasing temporal capacity of background memory decreases performance, confirming the importance of keeping background context short-term.

Scalability

ChronoTrack’s token-based design decouples long-term aggregation from memory and compute cost. Memory overhead remains constant as the temporal window grows, in contrast with the linear scaling seen in point-level approaches.

Figure 7: Memory footprint comparison as a function of foreground memory length: ChronoTrack’s overhead remains flat while MBPTrack’s grows linearly.
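The scaling contrast can be made concrete with back-of-the-envelope arithmetic. The sizes below (1024 points per frame, 128 channels, 16 tokens, 32-bit floats) are hypothetical values chosen for illustration and are not taken from the paper; only the shape of the two curves matters.

```python
def point_memory_bytes(num_frames, points_per_frame, channels, bytes_per_value=4):
    """Point-level memory stores every point feature of every remembered frame."""
    return num_frames * points_per_frame * channels * bytes_per_value

def token_memory_bytes(num_tokens, channels, bytes_per_value=4):
    """Token memory has a fixed size that does not depend on the number of frames."""
    return num_tokens * channels * bytes_per_value

# Hypothetical configuration, for illustration only.
POINTS_PER_FRAME, CHANNELS, NUM_TOKENS = 1024, 128, 16

for frames in (2, 4, 8, 16, 32):
    point_kib = point_memory_bytes(frames, POINTS_PER_FRAME, CHANNELS) / 1024
    token_kib = token_memory_bytes(NUM_TOKENS, CHANNELS) / 1024
    print(f"frames={frames:2d}  point-level={point_kib:8.0f} KiB  token-based={token_kib:4.0f} KiB")
```

Doubling the temporal window doubles the point-level footprint, while the token footprint is unchanged because the window length never enters its formula.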

Implications and Future Directions

ChronoTrack demonstrates that long-term temporal reasoning in 3D-SOT can be realized efficiently by enforcing temporal feature consistency and leveraging compact, diverse memory tokens. This design addresses the core scalability and drift issues in memory-based 3D-SOT and opens several avenues:

Dynamic and Open-Set Tracking: The explicit modeling of appearance diversity provides a natural precursor for handling dynamic and open-world categories with continual learning of new semantic parts.
Generalization: The framework demonstrates excellent zero-shot transfer, indicating promise for meta-learning and data-efficient adaptation in robotics and autonomous perception.
Memory Architecture Exploration: Further refinement of token management and the exploration of non-linear token update schemes may unlock richer memory utilization, enhancing temporal abstraction in 3D tracking and beyond.

Conclusion

ChronoTrack effectively overcomes the short-term limitation of previous 3D-SOT methods by introducing temporally consistent, token-based long-term memory and associated objective functions to maintain semantic diversity and temporal alignment. The approach yields substantial improvements across multiple benchmarks with favorable computational characteristics, establishing a scalable, reliable paradigm for long-term single object tracking in LiDAR data (2604.13789).
