Dense Video Object Captioning from Disjoint Supervision
In the paper "Dense Video Object Captioning from Disjoint Supervision," Zhou et al. introduce a new task and model, termed Dense Video Object Captioning (Dense VOC): detecting, tracking, and captioning the trajectories of all objects in a video. The proposed model is trained end-to-end and offers a unified approach to spatial and temporal video understanding, pairing each object trajectory with a detailed natural-language description.
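To make the task output concrete, the sketch below shows one plausible way to represent a single Dense VOC prediction: a trajectory carrying per-frame bounding boxes plus one caption. The field names and schema are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trajectory:
    """One Dense VOC output: a tracked object and its caption.

    Field names are illustrative; the paper does not prescribe a schema.
    """
    track_id: int
    # Per-frame boxes as (x1, y1, x2, y2); None where the object is absent.
    boxes: list[Optional[tuple[float, float, float, float]]] = field(default_factory=list)
    caption: str = ""  # a single natural-language description of the whole trajectory


# A toy 3-frame video with one tracked, captioned object.
example = Trajectory(
    track_id=0,
    boxes=[(10, 20, 60, 90), (12, 21, 63, 92), None],  # object leaves the frame at t=2
    caption="a brown dog running across the yard",
)
```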
Model Architecture and Training
The model consists of distinct modules for object detection, tracking, and captioning. First, class-agnostic region proposals are generated for each frame. These proposals are then linked across time by an association-based tracking module to form coherent object trajectories. A language decoder subsequently generates a caption for each trajectory.
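A minimal sketch of how the three stages might compose is given below. The detector, tracker, and decoder are placeholders with toy logic (the function names and the naive index-based association do not come from the paper's implementation); the point is only the proposal, then association, then caption-per-trajectory flow.

```python
import numpy as np


def propose_regions(frame: np.ndarray) -> list[np.ndarray]:
    """Class-agnostic region proposals for one frame (placeholder)."""
    # A real detector would return scored boxes; here we fake two proposals.
    return [np.array([10, 10, 50, 50]), np.array([60, 40, 120, 100])]


def associate(proposals_per_frame: list[list[np.ndarray]]) -> dict[int, list[tuple[int, np.ndarray]]]:
    """Link per-frame proposals into trajectories (placeholder).

    Returns track_id -> list of (frame_index, box). A real tracker would
    score associations between frames; this toy version reuses proposal index
    as identity.
    """
    tracks: dict[int, list[tuple[int, np.ndarray]]] = {}
    for t, proposals in enumerate(proposals_per_frame):
        for i, box in enumerate(proposals):
            tracks.setdefault(i, []).append((t, box))
    return tracks


def caption_trajectory(frames: list[np.ndarray], track: list[tuple[int, np.ndarray]]) -> str:
    """Decode a caption from a trajectory's features (placeholder)."""
    return f"an object visible in {len(track)} frames"


def dense_voc(frames: list[np.ndarray]) -> dict[int, tuple[list[tuple[int, np.ndarray]], str]]:
    proposals = [propose_regions(f) for f in frames]      # 1. per-frame proposals
    tracks = associate(proposals)                         # 2. link into trajectories
    return {tid: (trk, caption_trajectory(frames, trk))   # 3. one caption per trajectory
            for tid, trk in tracks.items()}


if __name__ == "__main__":
    video = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(4)]
    for tid, (trk, cap) in dense_voc(video).items():
        print(tid, cap)
```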
One of the principal contributions of this work is the ability to train the model on a mixture of disjoint tasks drawn from multiple large-scale datasets, each supervising a different component of the model. This training strategy yields strong zero-shot performance: the model generalizes to the full Dense VOC task even though no single dataset supervises it end-to-end. Subsequent fine-tuning further improves performance, surpassing strong baselines built on image-based captioning models.
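The sketch below illustrates the disjoint-supervision idea under stated assumptions: each training batch comes from one source dataset, and only the losses that dataset can supervise are applied. The dataset names and loss routing are hypothetical placeholders, not the authors' exact training mixture.

```python
import random

# Hypothetical mapping from data source to the losses it can supervise.
# (The actual mixture and loss terms in the paper may differ.)
LOSSES_BY_DATASET = {
    "detection_only":  ["proposal_loss"],                   # boxes, no captions or tracks
    "object_captions": ["proposal_loss", "caption_loss"],   # image-level dense captions
    "video_captions":  ["caption_loss"],                    # video-level text, no boxes
    "tracking_only":   ["proposal_loss", "association_loss"],
}


def compute_losses(batch, active_losses):
    """Placeholder: return a dict of scalar losses restricted to active_losses."""
    return {name: 0.0 for name in active_losses}


def train_step(dataset_name, batch):
    active = LOSSES_BY_DATASET[dataset_name]
    losses = compute_losses(batch, active)
    total = sum(losses.values())  # only the supervised components receive gradients
    return total


for step in range(100):
    dataset_name = random.choice(list(LOSSES_BY_DATASET))  # sample one source per step
    batch = None                                           # placeholder batch
    train_step(dataset_name, batch)
```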
Repurposing Existing Datasets
Given the lack of datasets annotated specifically for dense video object captioning, the authors repurpose existing video grounding datasets such as VidSTG and VLN. Their approach demonstrates that Dense VOC subsumes traditional grounding: a Dense VOC model can perform grounding directly by selecting the trajectory that maximizes the likelihood of generating the query sentence. In empirical evaluations, the proposed model outperforms existing state-of-the-art models in spatial grounding on both VidSTG and VLN.
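The sketch below shows how a captioning model can be reused for grounding in this way: score the query sentence against each candidate trajectory and return the highest-scoring one. The scoring function here is a stand-in; in the paper this likelihood comes from the trained language decoder.

```python
def query_log_likelihood(trajectory_features: str, query: str) -> float:
    """Placeholder for the decoder's log-probability of generating `query`
    conditioned on one trajectory's visual features."""
    # A real implementation would sum per-token log-probs from the language decoder.
    return -float(len(set(query)) % (len(trajectory_features) + 1))  # fake, deterministic score


def ground(query: str, trajectories: dict[int, str]) -> int:
    """Return the track_id whose trajectory best explains the query sentence."""
    return max(trajectories, key=lambda tid: query_log_likelihood(trajectories[tid], query))


# Stand-ins for per-trajectory features extracted by the model.
tracks = {0: "features_track_0", 1: "feat_1", 2: "long_features_track_2"}
best = ground("a man riding a bicycle", tracks)
print("grounded trajectory:", best)
```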
Key Results and Implications
The authors present compelling results, particularly the model's ability to produce temporally coherent captions, significantly reducing caption switches compared to per-frame captioning approaches. Additionally, training with disjoint supervision yields consistent improvements in temporal consistency, a critical attribute for video captioning.
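The notion of caption switches can be made concrete with a small helper that counts how often the caption assigned to one track changes between consecutive frames. This is one plausible reading of the consistency measure, not necessarily the paper's exact metric.

```python
def caption_switches(per_frame_captions: list[str]) -> int:
    """Count how many times a track's caption changes between consecutive frames."""
    return sum(1 for prev, cur in zip(per_frame_captions, per_frame_captions[1:]) if prev != cur)


# A per-frame captioner may flip descriptions; a trajectory-level caption never does.
per_frame = ["a dog running", "a dog running", "a brown dog", "a dog running"]
trajectory_level = ["a dog running"] * 4
print(caption_switches(per_frame))         # 2
print(caption_switches(trajectory_level))  # 0
```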
The implications of this work are multifaceted. Practically, the model advances the capabilities of automatic video annotation systems, which can be leveraged in applications such as video surveillance, content indexing, and retrieval. Theoretically, the approach paves the way for more holistic models in video understanding that integrate both visual dynamics and language tasks within a unified framework.
Speculations on Future Research
The insights derived from this research signal potential pathways for future work. One intriguing direction would be to explore models that incorporate finer-grained temporal segmentation, allowing distinct actions within a trajectory to be captioned separately. Furthermore, future developments could address the challenge of efficiently handling longer video sequences, thereby expanding the model's applicability to a wider range of video lengths and complexities.
Overall, this paper makes a substantial contribution to the field of video understanding and captioning, offering a robust methodology and comprehensive evaluation framework that underscores the efficacy of leveraging disjoint supervision for complex video-language tasks.