MAViC: Multimodal Active Learning for Video Captioning (2212.11109v1)

Published 11 Dec 2022 in cs.CV, cs.CL, and cs.LG

Abstract: A large number of annotated video-caption pairs are required for training video captioning models, resulting in high annotation costs. Active learning can be instrumental in reducing these annotation requirements. However, active learning for video captioning is challenging because multiple semantically similar captions are valid for a video, resulting in high entropy outputs even for less-informative samples. Moreover, video captioning algorithms are multimodal in nature, with a visual encoder and a language decoder. Further, the sequential and combinatorial nature of the output makes the problem even more challenging. In this paper, we introduce MAViC, which leverages our proposed Multimodal Semantics Aware Sequential Entropy (M-SASE) based acquisition function to address the challenges of active learning approaches for video captioning. Our approach integrates semantic similarity and uncertainty of both visual and language dimensions in the acquisition function. Our detailed experiments empirically demonstrate the efficacy of M-SASE for active learning for video captioning, improving on the baselines by a large margin.
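
The abstract only describes the method at a high level. As a rough illustration of the general idea, the sketch below shows a minimal, hypothetical acquisition score that combines entropy from the visual and language sides and discounts it by the semantic similarity among sampled captions. All function names, tensor shapes, and the mixing weights alpha and beta are assumptions made for illustration, not the paper's actual M-SASE formulation.

```python
# Illustrative sketch only: a semantics-aware, entropy-based acquisition
# score for ranking unlabeled videos. Function names, tensor shapes, and
# the weights alpha/beta are hypothetical, not the paper's M-SASE definition.
import torch
import torch.nn.functional as F


def mean_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean entropy over a batch of categorical distributions.

    logits: (N, C) -- e.g. per-token logits from the language decoder,
    or per-frame prediction logits from the visual encoder (assumed).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1).mean()


def semantic_redundancy(caption_embeddings: torch.Tensor) -> torch.Tensor:
    """Average pairwise cosine similarity among K >= 2 sampled captions.

    If sampled captions are highly similar, high decoder entropy likely
    reflects valid paraphrases rather than genuine model uncertainty.
    caption_embeddings: (K, D) sentence embeddings of the sampled captions.
    """
    normed = F.normalize(caption_embeddings, dim=-1)
    sims = normed @ normed.T                    # (K, K) cosine similarities
    k = sims.shape[0]
    off_diag_sum = sims.sum() - sims.diagonal().sum()
    return off_diag_sum / (k * (k - 1))


def acquisition_score(visual_logits: torch.Tensor,
                      language_logits: torch.Tensor,
                      caption_embeddings: torch.Tensor,
                      alpha: float = 0.5,
                      beta: float = 1.0) -> torch.Tensor:
    """Score one unlabeled video: reward uncertainty on both modalities,
    discounted by semantic redundancy among its candidate captions."""
    visual_u = mean_entropy(visual_logits)       # visual-side uncertainty
    language_u = mean_entropy(language_logits)   # language-side uncertainty
    redundancy = semantic_redundancy(caption_embeddings)
    uncertainty = alpha * visual_u + (1.0 - alpha) * language_u
    return uncertainty - beta * redundancy
```

In an active-learning loop built on this kind of score, the highest-scoring unlabeled videos would be sent for annotation in each round; the actual selection criterion used in the paper may differ.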

Citations (1)
