
Describe Anything Model (DAM)

Updated 16 September 2025
  • Describe Anything Model is a region-aware vision–language system that fuses global views and focal crops to generate detailed, context-rich captions.
  • Its dual-input architecture with focal prompting and gated cross-attention adapters effectively integrates local details with global context for superior captioning.
  • A semi-supervised pipeline, combining supervised expansion and self-training on web data, enables scalable region-level captioning with state-of-the-art performance.

The Describe Anything Model (DAM) refers to a class of region-aware vision–language systems, with a reference implementation introduced specifically for detailed localized captioning (DLC) across images and videos. DAM is designed to generate comprehensive, localized textual descriptions for arbitrary image regions, objects, or parts, preserving both high-fidelity local detail and global context. Its architecture and data pipeline address the limitations of both global captioning models and prior region-level description methods, making it possible to produce multi-granular, context-aware, and detail-rich captions for image and video regions without requiring aligned region–caption supervision.

1. Model Architecture and Focal Prompting

DAM utilizes a dual-input architecture centered on the notion of a "focal prompt." For each specified region in an image, DAM processes two complementary views:

  • A global view: The entire image, masked to indicate the foreground region of interest.
  • A focal crop: An enlarged crop around the target region, where the bounding box is expanded by a factor α and a minimum size constraint is enforced (e.g., at least 48 pixels per side).

Given an input image $I$, a binary mask $M$, and the tight bounding box $B$ of $M$, the focal crop is produced as:

$$B' = \text{ExpandBox}(B, \alpha), \qquad I' = I|_{B'}, \qquad M' = M|_{B'}$$

where $\cdot|_{B'}$ denotes restriction (cropping) to $B'$. These two views form the focal prompt, ensuring dense, high-resolution coverage of the target with preserved contextual cues.
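The focal-crop construction can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the reference implementation: the default expansion factor `alpha`, the clamping to image bounds, and the helper names are choices made for the example; only the 48-pixel minimum side comes from the description above.

```python
import numpy as np

def expand_box(box, alpha=3.0, min_side=48, img_h=None, img_w=None):
    """Expand a tight bounding box (x0, y0, x1, y1) by a factor alpha around
    its center, enforce a minimum side length, and clamp to the image bounds
    when they are provided."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w = max((x1 - x0) * alpha, min_side)
    h = max((y1 - y0) * alpha, min_side)
    nx0, ny0, nx1, ny1 = cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0
    if img_w is not None:
        nx0, nx1 = max(0.0, nx0), min(float(img_w), nx1)
    if img_h is not None:
        ny0, ny1 = max(0.0, ny0), min(float(img_h), ny1)
    return int(nx0), int(ny0), int(nx1), int(ny1)

def focal_prompt(image, mask, alpha=3.0):
    """Build the two views of a focal prompt from an HxWx3 image and an HxW
    boolean mask: (global image, global mask) and (focal crop, cropped mask)."""
    ys, xs = np.nonzero(mask)
    box = (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)  # tight box B
    h, w = mask.shape
    x0, y0, x1, y1 = expand_box(box, alpha=alpha, img_h=h, img_w=w)
    crop = image[y0:y1, x0:x1]      # I' = I restricted to B'
    crop_mask = mask[y0:y1, x0:x1]  # M' = M restricted to B'
    return (image, mask), (crop, crop_mask)
```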

The global image $I$ and mask $M$ are embedded via dedicated patch embedding layers $E_I$ and $E_M$, with positional encodings $P$, yielding:

$$x = E_I(I) + E_M(M) + P$$

A global encoder $f_G$ then produces global features $z = f_G(x)$. Similarly, the crop and mask $I', M'$ are embedded and passed through a regional encoder $f_R$ (which also receives $z$ as input), yielding:

$$x' = E_I(I') + E_M(M') + P, \qquad z' = f_R(x', z)$$

Critically, $f_R$ incorporates gated cross-attention adapters that allow region features to access global context. For any layer $l$:

$$h'^{(l)} = h^{(l)} + \tanh(\gamma^{(l)}) \cdot \text{CrossAttn}(h^{(l)}, z)$$

$$h_\text{Adapter}^{(l)} = h'^{(l)} + \tanh(\beta^{(l)}) \cdot \text{FFN}(h'^{(l)})$$

Here $h^{(l)}$ denotes the local features, and $\gamma^{(l)}, \beta^{(l)}$ are gating parameters initialized at zero to preserve the base VLM's behavior at initialization.
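A minimal PyTorch sketch of one such adapter layer is given below. The hidden size, head count, and FFN width are illustrative assumptions; only the tanh-gated residual structure and the zero-initialized gates follow the equations above.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionAdapter(nn.Module):
    """h' = h + tanh(gamma) * CrossAttn(h, z);  out = h' + tanh(beta) * FFN(h')."""

    def __init__(self, dim=1024, num_heads=16, ffn_mult=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),
            nn.GELU(),
            nn.Linear(ffn_mult * dim, dim),
        )
        # Zero-initialized gates: tanh(0) = 0, so the adapter is an identity
        # mapping at initialization and the base VLM's behavior is preserved.
        self.gamma = nn.Parameter(torch.zeros(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, h, z):
        # h: (B, N_local, dim) regional tokens; z: (B, N_global, dim) global features.
        attn_out, _ = self.cross_attn(query=h, key=z, value=z)
        h = h + torch.tanh(self.gamma) * attn_out
        h = h + torch.tanh(self.beta) * self.ffn(h)
        return h
```

Because the gates start closed, the adapters initially act as identities, and training gradually opens them to inject global context into the regional stream.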

The final fused feature $z'$ is passed, together with a textual prompt $t$, to an LLM for caption generation:

$$T = \text{LLM}(t, z')$$

2. Data Pipeline for Detailed Localized Captioning

DAM achieves robust performance through a two-stage semi-supervised learning pipeline termed DLC-SDP:

  • Stage 1 (Supervised expansion): DAM is first trained on segmentation datasets that provide high-quality masks and short class/part keywords. Off-the-shelf vision–language models are prompted to expand these keywords into elaborated, detail-rich region descriptions for each mask, and training encourages DAM to generate such captions at inference time without seeing the original keywords.
  • Stage 2 (Self-training on web data): DAM is used to generate captions for candidate regions extracted from unlabeled web images via open-vocabulary segmentation models (e.g., SAM, OWL-ViT). CLIP-based filtering discards triplets with low image–text relevance (sketched below). An LLM then enriches the surviving pseudo-captions across multiple detail levels, providing abundant, diverse supervision.

This semi-supervised approach enables region-level captioning to scale without requiring large-scale region–caption annotated data.
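The CLIP-based relevance filtering in Stage 2 can be sketched with an off-the-shelf CLIP checkpoint. The checkpoint name and similarity threshold below are illustrative assumptions, not the pipeline's actual settings.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the actual pipeline may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def filter_region_captions(region_crops, captions, threshold=0.25):
    """Keep (region crop, pseudo-caption) pairs whose CLIP image-text cosine
    similarity exceeds `threshold`; drop the rest. `region_crops` are PIL images."""
    kept = []
    for crop, caption in zip(region_crops, captions):
        inputs = processor(text=[caption], images=crop, return_tensors="pt",
                           padding=True, truncation=True)
        with torch.no_grad():
            out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        similarity = (img * txt).sum().item()  # cosine similarity in [-1, 1]
        if similarity >= threshold:
            kept.append((crop, caption))
    return kept
```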

3. Benchmarking and Evaluation Protocols

DAM introduces DLC-Bench, a reference-free evaluation framework for detailed localized captioning. Instead of relying on incomplete ground-truth captions, DLC-Bench uses an LLM judge to score model outputs against discrete, manually curated positive and negative attributes or questions about the region (e.g., "Does the caption mention the control panel?" or "Does it avoid hallucinating irrelevant parts?"). Scores are the proportion of positive and negative attributes correctly addressed:

  • Positive Score: $(\text{number of correct positives}) / (\text{total positives})$
  • Negative Score: $(\text{number of correct negatives}) / (\text{total negatives})$
  • Overall: the average of the positive and negative scores.

This strategy ensures that detailed, contextually appropriate descriptions are rewarded, and hallucinations or omissions are penalized.
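The scoring arithmetic itself is straightforward; the per-attribute verdicts come from the LLM judge and are simply counted here.

```python
def dlc_bench_score(positive_hits, total_positives, negative_hits, total_negatives):
    """Aggregate DLC-Bench scores from LLM-judge verdicts.

    positive_hits: positive attributes the caption correctly mentions.
    negative_hits: negative checks passed (i.e., hallucinations avoided).
    """
    positive_score = positive_hits / total_positives
    negative_score = negative_hits / total_negatives
    overall = (positive_score + negative_score) / 2.0
    return positive_score, negative_score, overall

# Example: 7 of 10 positive attributes mentioned, 18 of 20 negative checks passed.
print(dlc_bench_score(7, 10, 18, 20))  # (0.7, 0.9, 0.8)
```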

4. Performance and Comparative Results

DAM achieves state-of-the-art performance on seven region-level captioning benchmarks, including keyword-level (LVIS, PACO), phrase-level (Flickr30k Entities), and detailed multi-sentence tasks (Ref-L4, DLC-Bench, video datasets). Notable gains include:

  • Relative improvements of over 33% on short-caption metrics and roughly 13% on longer, multi-sentence narratives.
  • Superior or competitive results relative to both general-purpose VLMs (e.g., GPT-4o) and region-sensitive baselines, including on video benchmarks such as HC-STVG and VideoRefer-Bench.

These improvements stem from DAM’s explicit local–global fusion and dense regional representation, which prior architectures, relying solely on masked full images or hard crops, preserved only inconsistently.

5. Extensions and Domain-Specific Adaptations

DAM’s architecture lends itself to domain adaptation:

  • Medical Imaging (MedDAM): Adapts the core DAM model via expert prompts, dual-input pre-processing, and specialized QA-driven region evaluation (MedDLC-score), enabling region-level factual captioning on CXR, CT, and dermatology datasets. MedDAM demonstrably outperforms generalist VLMs and other models in attribute-level clinical factuality.
  • Text-rich VQA (DAM-QA): Extends DAM with a sliding-window patching and answer-aggregation mechanism. For text-centric VQA tasks (e.g., DocVQA), each image is split into overlapping patches, each patch is described or queried individually, and the answers are aggregated via area-weighted voting (sketched below). DAM-QA achieves an improvement of more than 7 ANLS points on DocVQA and matches or surpasses other region-aware models with a fraction of the parameters.
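The area-weighted voting step can be illustrated as follows; the answer normalization and the choice to weight by patch area in pixels are assumptions made for the sketch.

```python
from collections import defaultdict

def aggregate_answers(patch_answers):
    """Area-weighted voting over per-patch answers.

    patch_answers: list of (answer_text, patch_area) pairs, one per sliding-window
    patch (plus the full image, if it is also queried). Returns the answer whose
    accumulated patch area is largest.
    """
    votes = defaultdict(float)
    for answer, area in patch_answers:
        key = answer.strip().lower()  # simple normalization (an assumption)
        votes[key] += float(area)
    return max(votes.items(), key=lambda kv: kv[1])[0]

# Example: three overlapping patches answer the same question about a document.
answers = [("invoice", 512 * 512), ("Invoice", 384 * 384), ("receipt", 256 * 256)]
print(aggregate_answers(answers))  # -> "invoice"
```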

6. Implications and Future Research

The Describe Anything Model sets a new paradigm for localized vision–language tasks, overcoming the inherent tradeoff between local detail and context via focal prompts, dual-view encoding, and cross-modal context fusion. Its semi-supervised, scalable data pipeline and reference-free evaluation protocol address longstanding challenges in region-level vision–language research.

Anticipated research directions include:

  • Advanced multi-scale or region-hierarchical integration for even finer detail-context tradeoff.
  • Extension to more complex modalities (e.g., medical 3D imaging, complex video regions).
  • Further refinement of patch aggregation and voting for downstream tasks (e.g., document and infographic VQA).
  • Integration of action or pose reasoning with compositional vision–language output.

DAM’s technical and methodological contributions suggest broad utility across domains requiring detailed, context-aware region reasoning, including medical diagnostics, robotic perception, and dense scene understanding.
