
MICap: A Unified Model for Identity-aware Movie Descriptions (2405.11483v1)

Published 19 May 2024 in cs.CV

Abstract: Characters are an important aspect of any storyline and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with someone, then fill in identities. In this work, we present a new single stage approach that can seamlessly switch between id-aware caption generation or FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives, while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy, and a 1-2% bump in classic captioning metrics.


Summary

  • The paper introduces a unified single-stage model that integrates identity-aware captioning and fill-in-the-blanks tasks, achieving a 4.2% FITB accuracy gain and up to 1.8% METEOR improvement.
  • It employs a Transformer-based encoder-decoder that fuses multimodal features, including semantic, action, and facial data processed with ArcFace, to generate coherent captions.
  • The unified approach enhances narrative comprehension in video descriptions and holds promise for improved accessibility services and advanced multimedia analysis.

Identity-aware Captioning for Movies: A Unified Approach

The paper "MICap: A Unified Model for Identity-aware Movie Descriptions" addresses a critical issue in video captioning—identifying and accurately describing the characters involved in a storyline. Previous work predominately focused on generating captions using placeholders for character names, presenting challenges for understanding narratives in a series of connected videos. This research proposes an innovative single-stage approach with their model, Movie-Identity Captioner (MICap), which unifies identity-aware caption generation and the fill-in-the-blanks (FITB) task, enhancing accuracy and coherence in narrative description.

Methodology and Approach

MICap improves upon prior methods by adopting a single-stage encoder-decoder framework that seamlessly integrates identity-aware captioning and FITB. The model employs a shared auto-regressive decoder trained jointly on the full-caption generation and FITB objectives, while the Transformer-based encoder can either use or disregard a caption with blanks as input, allowing the model to switch between the two tasks at inference time.
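To make the single-stage design concrete, the following is a minimal PyTorch sketch, under assumed module sizes and a simplified task-switching convention, of how one shared auto-regressive decoder can serve both objectives while the encoder optionally consumes a caption with blanks. It is an illustration of the idea rather than the authors' implementation.

```python
# Minimal sketch: one shared auto-regressive decoder for both full-caption
# generation and FITB. Feature dimensions, vocabulary size, and the convention
# of appending blanked-caption tokens to the encoder input are illustrative
# assumptions, not the authors' exact design.
import torch
import torch.nn as nn

class UnifiedCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8, num_layers=4, feat_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.video_proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, tgt_tokens, blanked_caption=None):
        # video_feats: (B, T, feat_dim); blanked_caption: (B, L) token ids containing
        # [BLANK] placeholders, provided only in FITB mode and omitted otherwise.
        memory = self.video_proj(video_feats)
        if blanked_caption is not None:
            memory = torch.cat([memory, self.token_emb(blanked_caption)], dim=1)
        memory = self.encoder(memory)
        # The same decoder serves both objectives; only the target sequence differs
        # (person-id labels for FITB vs. the full identity-aware caption).
        L = tgt_tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        out = self.decoder(self.token_emb(tgt_tokens), memory, tgt_mask=causal)
        return self.lm_head(out)

# Toy usage: captioning mode (no blanked caption) on random features.
model = UnifiedCaptioner()
logits = model(torch.randn(2, 8, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```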

The input representation is enriched with multimodal features, including semantic, action, and facial cues extracted from the video clips. These diverse inputs are projected into a unified latent space that feeds the Transformer encoder. By incorporating face embeddings (e.g., from ArcFace) and attention mechanisms that aggregate spatial and identity-specific information across clips, MICap keeps track of character identities throughout a set of clips.
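A hedged sketch of this feature construction, with assumed backbone dimensions (the paper's exact feature extractors and sizes are not reproduced here), might project each modality into a shared width and concatenate the results into one encoder token sequence:

```python
# Illustrative feature fusion: per-clip semantic and action features plus per-face
# ArcFace embeddings are each projected into a shared d_model space and concatenated
# into a single token sequence for the Transformer encoder. The dimensions below
# (512 semantic, 1024 action, 512 face) are assumptions for this sketch.
import torch
import torch.nn as nn

d_model = 512
proj_sem  = nn.Linear(512, d_model)    # semantic feature per clip
proj_act  = nn.Linear(1024, d_model)   # action feature per clip
proj_face = nn.Linear(512, d_model)    # ArcFace embedding per detected face

def build_encoder_tokens(sem, act, faces):
    # sem: (N_clips, 512), act: (N_clips, 1024), faces: (N_faces, 512)
    tokens = torch.cat([proj_sem(sem), proj_act(act), proj_face(faces)], dim=0)
    return tokens.unsqueeze(0)          # (1, 2*N_clips + N_faces, d_model)

tokens = build_encoder_tokens(torch.randn(5, 512), torch.randn(5, 1024), torch.randn(7, 512))
print(tokens.shape)  # torch.Size([1, 17, 512])
```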

Results and Metrics

Evaluation on the Large-Scale Movie Description Challenge (LSMDC) dataset demonstrates the efficacy of MICap: FITB accuracy improves by 4.2%, and classic captioning metrics such as METEOR improve by up to 1.8%. The paper also introduces iSPICE, an identity-sensitive extension of the SPICE metric that scores identity tuples derived from intermediate scene graphs, providing a more nuanced measure of whether a caption names the right people.
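The core idea behind such an identity-focused score can be sketched as matching identity tuples between candidate and reference captions. The snippet below is a simplified illustration, assuming scene-graph parsing has already produced the tuples; it is not the actual iSPICE implementation.

```python
# Illustrative sketch (not the authors' implementation) of an identity-sensitive
# caption score: compare (person-id, relation, argument) tuples extracted from
# scene graphs of the candidate and reference captions, and report an F1 over
# exact tuple matches. Scene-graph parsing is assumed to happen upstream.
from collections import Counter

def identity_tuple_f1(candidate_tuples, reference_tuples):
    cand, ref = Counter(candidate_tuples), Counter(reference_tuples)
    matched = sum((cand & ref).values())          # multiset intersection size
    if matched == 0:
        return 0.0
    precision = matched / sum(cand.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

cand = [("P1", "walk", "street"), ("P2", "look", "P1")]
ref  = [("P1", "walk", "street"), ("P2", "smile", None)]
print(identity_tuple_f1(cand, ref))  # 0.5
```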

Implications and Future Directions

The MICap model has strong implications for automated video description systems, particularly accessibility-focused applications such as audio description for visually impaired viewers. The ability to maintain character identity across scenes without relying solely on external databases broadens the applicability of the method to a wider range of video content, including independent films and personal video compilations.

Theoretically, integrating identity information into the captioning pipeline enhances narrative comprehension, an important step towards AI systems capable of understanding complex stories. However, the model currently operates on small groups of clips, so scaling MICap to entire films remains an open direction for future work. Additionally, hybrid models that incorporate LLMs and external knowledge could further improve narrative detail and accuracy.

Conclusion

Overall, MICap advances identity-aware captioning by resolving the two-stage bottleneck of previous methods with a unified, efficient approach. The improvements in performance metrics attest to MICap's potential to set a new standard for video description tasks and to enrich the AI tools available for narrative-focused video processing.
