
Connecting Vision and Language with Localized Narratives (1912.03098v4)

Published 6 Dec 2019 in cs.CV

Abstract: We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.

Localized Narratives: A New Form of Multimodal Image Annotations

The paper introduces Localized Narratives, a novel approach to multimodal image annotation that seeks to provide a more intricate connection between language and visual data. Unlike prior grounded captioning datasets, such as Flickr30k Entities, which sparsely connect nouns in captions to image bounding boxes, Localized Narratives aim for dense grounding by linking each word of a spoken description to a specific region of the image through synchronized annotations. To this end, the research integrates four modalities: images, spoken descriptions, textual transcriptions, and mouse traces.
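As a rough illustration of what one such annotation bundles together, the sketch below groups the four modalities around per-word trace segments. The field names are assumptions made for illustration, not the released annotation schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TracePoint:
    x: float  # normalized image coordinate in [0, 1]
    y: float  # normalized image coordinate in [0, 1]
    t: float  # seconds since the recording started


@dataclass
class TimedWord:
    word: str                # one token of the transcribed caption
    start: float             # start time of the spoken word (seconds)
    end: float               # end time of the spoken word (seconds)
    trace: List[TracePoint]  # mouse-trace segment covering [start, end]


@dataclass
class LocalizedNarrative:
    image_id: str
    caption: str                  # full manual transcription
    timed_words: List[TimedWord]  # dense grounding: one trace segment per word
```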

The core methodology involves annotators describing an image vocally while simultaneously hovering the mouse over the corresponding regions. This grounds every word, including verbs and prepositions, capturing relations that go beyond simply identifying objects. Once the vocal description is complete, annotators provide a text transcription of their speech. Because this manual transcription carries no timestamps, the authors synchronize it with the mouse trace through a sequence-to-sequence alignment against an automatically recognized transcript, yielding precise, temporally aligned captions.
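Once each word carries a time interval from this alignment, grounding reduces to selecting the trace points recorded inside that interval. A minimal sketch, reusing the `TracePoint` structure assumed above:

```python
from typing import List


def trace_segment_for_word(start: float, end: float,
                           trace: List[TracePoint]) -> List[TracePoint]:
    """Return the mouse-trace points recorded while a word was being spoken.

    `trace` is the full, time-ordered pointer trajectory; the [start, end]
    interval is assumed to come from aligning the manual transcription
    against speech-recognition timestamps.
    """
    return [p for p in trace if start <= p.t <= end]
```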

From an efficiency standpoint, the Localized Narratives protocol is advantageous: the time required for narration and transcription is notably lower than the annotation cost of traditional grounded captioning datasets, which often require labor-intensive bounding box annotations. This efficiency could be further enhanced by advances in automatic speech recognition, potentially eliminating the need for manual transcription.

The paper details the collection of Localized Narratives at scale, covering the whole COCO, ADE20K, and Flickr30k datasets as well as 671k images of Open Images, resulting in a substantial corpus made publicly available. The paper's analysis shows the richness of this data, noting the comprehensive grounding of various word types and the substantially longer captions compared to earlier datasets. It also highlights the diversity in both language and visual representation captured by the annotations.

Localized Narratives have significant implications across multiple applications. The dense grounding enables more sophisticated image captioning and retrieval tasks, thereby supporting developments in areas like fine-grained user control in captioning and assistive technologies for visually impaired individuals. Additionally, the annotations can enhance training for models requiring visual grounding, offering opportunities to improve attention mechanisms within machine learning systems.
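One way to make such attention supervision concrete is to rasterize a word's trace segment into a soft spatial map over a feature grid. The function below is an illustrative sketch of that idea, not the paper's training recipe:

```python
import numpy as np


def trace_to_heatmap(points, h=14, w=14):
    """Rasterize one word's trace segment, given as (x, y) pairs normalized
    to [0, 1], into a soft spatial map over an h-by-w feature grid.

    Such a map could serve as an attention supervision target or as a
    control input for a captioning model.
    """
    heat = np.zeros((h, w), dtype=np.float32)
    for x, y in points:
        col = min(int(x * w), w - 1)
        row = min(int(y * h), h - 1)
        heat[row, col] += 1.0
    if heat.sum() > 0:
        heat /= heat.sum()  # normalize to a distribution over grid cells
    return heat
```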

The future work anticipated by the paper includes using this dataset to provide additional attention supervision, augment self-supervised systems, and validate spatial attention models, potentially refining the performance of image captioning models. Moreover, Localized Narratives can be leveraged in diverse areas such as image generation, retrieval, speech recognition, and environment navigation, where precise multimodal annotations are critical.

In conclusion, Localized Narratives represent an advancement in multimodal data annotation, offering a structured way to bridge the gap between visual and linguistic data more effectively and efficiently than previous datasets, and opening pathways for both theoretical exploration and practical application in AI advancements.

Authors (5)
  1. Jordi Pont-Tuset (38 papers)
  2. Jasper Uijlings (20 papers)
  3. Soravit Changpinyo (24 papers)
  4. Radu Soricut (54 papers)
  5. Vittorio Ferrari (83 papers)
Citations (217)