Detailed Localized Image and Video Captioning: Describe Anything Model
The paper introduces the Describe Anything Model (DAM), which addresses the challenge of generating detailed localized captions for images and videos. Despite rapid advances in vision-language models (VLMs), most still fall short on accurate regional descriptions because they cannot process global context and region-specific detail concurrently. DAM excels in this domain through two principal innovations, a focal prompt and a localized vision backbone, which preserve fine-grained detail within user-specified regions while efficiently capturing global context.
DAM employs a focal prompt strategy that ensures high token density for the region encoding, retaining both the specified region's intricate details and the context surrounding it. This lets DAM sidestep a common pitfall of image-captioning models: regional features derived from a global image representation tend to lose specificity, especially for small objects. The localized vision backbone complements the focal prompt by injecting precise localization inputs through spatially aligned mask embeddings and gated cross-attention, refining feature relevance and enhancing contextual understanding without increasing the token sequence length fed to the LLM.
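A minimal sketch of the focal prompt idea described above: given an image and a binary region mask, the full image is paired with a zoomed-in crop around the region so that small objects retain high token density after encoding. The padding ratio and helper names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from PIL import Image

def build_focal_prompt(image: Image.Image, mask: np.ndarray, context_ratio: float = 0.5):
    """Return (full image, full mask, focal crop, focal mask).

    mask: HxW boolean array marking the user-specified region (assumed non-empty).
    context_ratio: extra context kept around the region's bounding box,
                   as a fraction of the box size (an assumed value).
    """
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1

    # Expand the bounding box so the focal crop keeps surrounding context.
    h, w = y1 - y0, x1 - x0
    pad_y, pad_x = int(h * context_ratio), int(w * context_ratio)
    Y0, Y1 = max(0, y0 - pad_y), min(mask.shape[0], y1 + pad_y)
    X0, X1 = max(0, x0 - pad_x), min(mask.shape[1], x1 + pad_x)

    focal_image = image.crop((X0, Y0, X1, Y1))
    focal_mask = mask[Y0:Y1, X0:X1]

    # Both views, global and focal, each paired with its mask, are what the
    # vision backbone would encode.
    return image, mask, focal_image, focal_mask
```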
The paper highlights DAM's ability to generate detailed descriptions of complex scenes, something previous models struggled with because they either lost detail or included irrelevant information in their captions. DAM leverages contextual cues through gated cross-attention, which regulates how focal and global visual tokens interact so that captions remain both detailed and accurate.
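The gated cross-attention mechanism can be sketched as follows (PyTorch): focal tokens attend to global-context tokens, and a learnable gate controls how much context is mixed in. The dimensions and the zero-initialized gate (so the block starts as an identity mapping) are assumptions for illustration, not the exact DAM configuration.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # tanh(0) = 0, so the block initially passes focal features through unchanged.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, focal_tokens: torch.Tensor, global_tokens: torch.Tensor) -> torch.Tensor:
        # focal_tokens: (B, N_focal, dim); global_tokens: (B, N_global, dim)
        context, _ = self.attn(self.norm(focal_tokens), global_tokens, global_tokens)
        # Residual mix of global context into the focal stream, scaled by the gate.
        return focal_tokens + torch.tanh(self.gate) * context

# Example: 256 focal tokens attending to 256 global tokens of width 1024.
block = GatedCrossAttention(dim=1024)
focal = torch.randn(2, 256, 1024)
glob = torch.randn(2, 256, 1024)
out = block(focal, glob)  # shape (2, 256, 1024); the token count is unchanged.
```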
To mitigate the scarcity of high-quality training data for this task, the authors introduce a semi-supervised learning (SSL)-based data pipeline (DLC-SDP). The pipeline has two stages: it first draws on human-annotated segmentation datasets, expanding their short class keywords into detailed captions with a strong VLM, and then applies a self-training scheme to unlabeled web images to increase data diversity and scale, a technique inspired by successful SSL applications in image classification. The resulting data covers many more object categories and improves DAM's performance across benchmarks.
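A minimal structural sketch of the two-stage pipeline described above. The VLM, segmenter, and quality filter are passed in as plain callables because the paper's exact models and thresholds are not reproduced here; every helper name below is a hypothetical placeholder for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class RegionSample:
    image_id: str
    mask_id: str
    caption: str

def stage1_expand_labels(
    seg_dataset: Iterable[tuple[str, str, str]],   # (image_id, mask_id, class keyword)
    describe: Callable[[str, str, str], str],      # VLM: (image, mask, keyword) -> detailed caption
) -> list[RegionSample]:
    """Stage 1: expand short segmentation class labels into detailed localized captions."""
    return [RegionSample(img, m, describe(img, m, kw)) for img, m, kw in seg_dataset]

def stage2_self_train(
    web_images: Iterable[str],
    segment: Callable[[str], list[str]],           # proposes region masks on unlabeled images
    describe: Callable[[str, str], str],           # current model: (image, mask) -> caption
    keep: Callable[[str], bool],                   # quality filter on generated captions
) -> list[RegionSample]:
    """Stage 2: self-training on unlabeled web images, keeping high-quality generated captions."""
    samples = []
    for img in web_images:
        for mask in segment(img):
            caption = describe(img, mask)
            if keep(caption):
                samples.append(RegionSample(img, mask, caption))
    return samples
```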
Furthermore, standard captioning benchmarks evaluate outputs against reference captions, an approach ill-suited to detailed localized captioning because references frequently omit details that a model may correctly describe. The paper therefore introduces DLC-Bench, a benchmark that dispenses with reference captions and instead relies on an LLM judge. By checking a model's output against positive and negative attributes predefined for each region, DLC-Bench measures how exhaustively a model captures regional details and how well it avoids factual errors, rewarding richer and more accurate descriptions.
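A minimal sketch of the reference-free scoring idea: an LLM judge checks a generated description against predefined positive attributes (details that should be mentioned) and negative attributes (details that would be factually wrong for the region). The judge callable, the yes/no question format, and the equal weighting are illustrative assumptions rather than the benchmark's exact rubric.

```python
from typing import Callable

def score_description(
    description: str,
    positives: list[str],
    negatives: list[str],
    judge: Callable[[str, str], bool],  # (description, question) -> True if the judge answers "yes"
) -> float:
    # Reward positive attributes that the description covers correctly.
    hits = sum(judge(description, f"Does the description correctly mention: {p}?") for p in positives)
    # Penalize negative attributes the description wrongly asserts (hallucinations).
    errors = sum(judge(description, f"Does the description incorrectly claim: {n}?") for n in negatives)
    return (hits - errors) / max(len(positives) + len(negatives), 1)
```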
In evaluation, DAM achieves state-of-the-art results across seven multi-granular regional captioning benchmarks spanning keyword-level, phrase-level, and detailed image and video captioning. Its performance on DLC-Bench in particular shows a robust ability to generate highly informative, contextually grounded descriptions, outperforming both general-purpose VLMs such as GPT-4o and o1 and prior region-specific captioning models. This underscores DAM's balance of detail retention and contextual processing, both crucial for practical image and video understanding.
In essence, the Describe Anything Model is a significant step in advancing the capabilities of vision-language models, enabling nuanced, context-aware localized descriptions. Future work might extend the approach by refining the data pipeline to handle unstructured, in-the-wild data and by further exploiting semi-supervised learning across varied domains. The implications for tasks that demand detailed scene inspection or interactive visual question answering are substantial, paving the way for refined applications in autonomous systems and multimedia content analysis.