MICap: A Unified Model for Identity-aware Movie Descriptions (arXiv:2405.11483v1)
Abstract: Characters are central to any storyline, and identifying and naming them in descriptions is necessary for story understanding. While previous work has largely ignored identity, generating captions with an anonymized "someone", recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task: given a caption with blanks, the goal is to predict person id labels. Producing captions with ids then requires a two-stage approach: first predict captions with "someone", then fill in the identities. In this work, we present a new single-stage approach that can seamlessly switch between id-aware caption generation and FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with both the FITB and full-caption generation objectives, while the encoder can use or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric that captures subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on the Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy and a 1-2% gain on classic captioning metrics.
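The abstract does not spell out how iSPICE scores identity tuples, but the idea of restricting SPICE-style tuple matching to tuples that mention a person id can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tuple format, the `P1`/`P2` id tokens, and the F1 formulation are assumptions for the example.

```python
from typing import Set, Tuple

SemTuple = Tuple[str, ...]  # e.g. (subject, predicate) or (subject, predicate, object)

def identity_tuples(tuples: Set[SemTuple], id_prefix: str = "P") -> Set[SemTuple]:
    """Keep only the tuples that mention at least one person-id token (e.g. 'P1')."""
    return {t for t in tuples if any(tok.startswith(id_prefix) for tok in t)}

def tuple_f1(candidate: Set[SemTuple], reference: Set[SemTuple]) -> float:
    """SPICE-style F1 over exact tuple matches."""
    if not candidate or not reference:
        return 0.0
    matched = len(candidate & reference)
    precision = matched / len(candidate)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Hypothetical tuples parsed from the scene graphs of a reference and a candidate caption.
# The candidate gets the action right but attributes "smile" to the wrong person.
ref = {("P1", "open", "door"), ("P2", "smile")}
cand = {("P1", "open", "door"), ("P1", "smile")}

score = tuple_f1(identity_tuples(cand), identity_tuples(ref))
print(score)  # 0.5: one of two identity tuples matches exactly
```

Because every tuple retained by `identity_tuples` contains a person id, swapping an id breaks the match even when the rest of the tuple is correct, which is exactly the kind of subtle identity error the abstract says classic captioning metrics fail to capture.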
- SPICE: Semantic Propositional Image Caption Evaluation. In European Conference on Computer Vision (ECCV), 2016.
- LSMDC v2 Challenge presentation. In 3rd Workshop on Closing the Loop Between Vision and Language, 2019.
- Face, Body, Voice: Video Person-Clustering with Multiple Modalities. In International Conference on Computer Vision Workshops (ICCVW), 2021.
- Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering. In Winter Conference on Applications of Computer Vision (WACV), 2021.
- CLAIR: Evaluating Image Captions with Large Language Models. In Empirical Methods in Natural Language Processing (EMNLP), 2023.
- Towards bridging event captioner and sentence localizer for weakly supervised dense event captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- Sketch, ground, and refine: Top-down dense video captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- RetinaFace: Single-Shot Multi-Level Face Localisation in the Wild. In Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In European Chapter of the Association for Computational Linguistics (EACL), 2014.
- Long-term recurrent convolutional networks for visual recognition and description. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- AutoAD: Movie Description in Context. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023a.
- AutoAD II: The Sequel-Who, When, and What in Movie Audio Description. In International Conference on Computer Vision (ICCV), 2023b.
- CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Empirical Methods in Natural Language Processing (EMNLP), 2021.
- Image Retrieval using Scene Graphs. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Grounded Video Situation Recognition. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Dense-captioning events in videos. In International Conference on Computer Vision (ICCV), 2017.
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In International Conference on Machine Learning (ICML), 2023.
- Jointly localizing and describing events for dense video captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Workshop on Text Summarization Branches Out (WAS), 2004.
- Swinbert: End-to-end transformers with sparse attention for video captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- UniVL: A Unified Video and Language Pre-training Model for Multimodal Understanding and Generation. arXiv preprint arXiv:2002.06353, 2020.
- ClipCap: CLIP Prefix for Image Captioning. arXiv preprint arXiv:2111.09734, 2021.
- Streamlined dense video captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script. In British Machine Vision Conference (BMVC), 2017.
- BLEU: a method for automatic evaluation of machine translation. In Association for Computational Linguistics (ACL), 2002.
- Adversarial inference for multi-sentence video description. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- Identity-aware multi-sentence video description. In European Conference on Computer Vision (ECCV), 2020.
- Towards video captioning with naming: a novel dataset and a multi-modal approach. In International Conference on Image Analysis and Processing (ICIAP), 2017.
- M-VAD names: a dataset for video captioning with naming. Multimedia Tools and Applications (MTAP), 78:14007–14027, 2019.
- Language Models are Unsupervised Multitask Learners. OpenAI Technical Report, 2019.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML). PMLR, 2021.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research (JMLR), 21:1–67, 2020.
- Watch, listen and tell: Multi-modal weakly supervised dense event captioning. In International Conference on Computer Vision (ICCV), 2019.
- Coherent multi-sentence video description with variable level of detail. In German Conference on Pattern Recognition (GCPR), 2014.
- A Dataset for Movie Description. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Movie description. International Journal of Computer Vision (IJCV), 123:94–120, 2017.
- Visual Semantic Role Labeling for Video Understanding. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- FaceNet: A Unified Embedding for Face Recognition and Clustering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Fourth Workshop on Vision and Language, 2015.
- End-to-end generative pretraining for multimodal video captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- Weakly supervised dense video captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Dense procedure captioning in narrated instructional videos. In Association for Computational Linguistics (ACL), 2019.
- Beyond caption to narrative: Video captioning with multiple sentences. In International Conference on Image Processing (ICIP), 2016.
- MAD: A Scalable Dataset for Language Grounding in Videos From Movie Audio Descriptions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- "Knock! Knock! Who is it?" Probabilistic Person Identification in TV series. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- Video Face Clustering with Unknown Number of Clusters. In International Conference on Computer Vision (ICCV), 2019.
- Attention is All You Need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- CIDEr: Consensus-based image description evaluation. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Sequence to Sequence - Video to Text. In International Conference on Computer Vision (ICCV), 2015a.
- Translating Videos to Natural Language Using Deep Recurrent Neural Networks. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015b.
- Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- Event-centric hierarchical representation for dense video captioning. IEEE Transactions on Circuits and Systems for Video Technology, 31(5):1890–1900, 2020.
- End-to-end Dense Video Captioning with Parallel Decoding. In International Conference on Computer Vision (ICCV), 2021.
- Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- Video paragraph captioning using hierarchical recurrent neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- End-to-end concept word detection for video captioning, retrieval, and question answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Character Grounding and Re-Identification in Story of Videos and Text Descriptions. In European Conference on Computer Vision (ECCV), 2020.
- BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations (ICLR), 2020.
- End-to-end dense video captioning with masked transformer. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Authors: Haran Raajesh, Naveen Reddy Desanur, Zeeshan Khan, Makarand Tapaswi