
Grounding-Bench Benchmark for LMMs

Updated 5 January 2026
  • Grounding-Bench is a benchmark for evaluating multimodal models by testing integrated free-form visual conversations and spatial region grounding.
  • It requires each noun phrase in generated dialogs to map to an image region with IoU > 0.5, ensuring both semantic and spatial accuracy.
  • The benchmark supports multi-turn dialog, compositional reasoning, and precise spatial predictions to drive advances in unified LMM architectures.

Grounding-Bench Benchmark

Grounding-Bench is a benchmark for evaluating the integrated capabilities of large multimodal models (LMMs) to perform both open-ended visual chat and precise region-level grounding within images. Unlike classic grounding datasets, Grounding-Bench specifically tests models on simultaneous conversational fluency and spatial correspondence, requiring each mentioned noun phrase in the generated dialogue to be mapped to a valid image region. It supports multi-turn dialogs, compositional reasoning, and evaluates both natural language quality and spatial accuracy in a unified protocol (Zhang et al., 2023).

1. Motivation and Objectives

Grounding-Bench was designed to address the limitations of established grounding datasets, which focus almost exclusively on single-turn, short referring expressions (e.g., RefCOCO/+/g, Flickr30K Entities). Prior LMMs have exhibited a marked drop in conversational fluency when required to ground their outputs, highlighting the need for benchmarks that jointly assess dialog capability and region-level prediction. Grounding-Bench fills this gap by requiring LMMs to:

  • Engage in detailed, free-form visual conversations that mention multiple objects per image.
  • Ground each noun phrase in the response to a distinct bounding box or segmentation mask with intersection over union (IoU) > 0.5 against a pre-defined ground-truth region.
  • Maintain high conversational fluency and correctness while producing explicit spatial predictions.
  • Be evaluated on both chat realism (via GPT-4 simulation) and grounding accuracy (IoU-based).

2. Dataset Composition and Annotation Scheme

Training and Test Data

  • GVC (Grounded Visual Chat): 150,000 image–dialog pairs from COCO. Each instance includes multi-turn or single-turn GPT-4 generated visual chat with every noun phrase linked to a COCO ground-truth object.
  • GVC-R (Referring Visual Chat): For visual prompts (marks, clicks, boxes), placeholder tokens in questions and corresponding segmentation masks are provided.
  • Pretraining Data: Standard grounding sets for feature alignment, e.g., RefCOCO/+/g, Visual Genome, COCO2017, Flickr30K Entities. Pseudo-masks via SAM are used for segmentation regions.

Grounding-Bench Test Set:

  • 1,000 images sampled from MSCOCO 2014 val, annotated with complex, detailed captions (~5–10 sentences per image; ≈7,000 entities total).
  • Each mention paired with a bounding box and mask.
  • No separate validation subset.

Annotation Format

Responses mark each grounded phrase as:

<g_s> phrase <g_e> <seg>

  • <g_s>/<g_e>: Span delimiters for the grounded content.
  • <seg>: Placeholder for box/mask prediction; hidden features fed to the grounding head.

Entities are represented as normalized box coordinates [x_0, y_0, x_1, y_1] or as segmentation masks.
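
For concreteness, the sketch below shows how a response in this format could be parsed into (phrase, region) pairs. The tag strings follow the template above, but the regex and helper names are illustrative and not part of any released benchmark tooling; the example also assumes each <seg> token has already been resolved to one predicted box in normalized [x0, y0, x1, y1] coordinates.

```python
import re

# Illustrative parser for responses using the <g_s> ... <g_e> <seg> template.
GROUNDED_SPAN = re.compile(r"<g_s>\s*(.*?)\s*<g_e>\s*<seg>")

def parse_grounded_response(text, predicted_boxes):
    """Pair each grounded phrase with the box predicted for its <seg> token."""
    phrases = GROUNDED_SPAN.findall(text)
    if len(phrases) != len(predicted_boxes):
        raise ValueError("one predicted box is expected per <seg> token")
    return list(zip(phrases, predicted_boxes))

response = "The image shows <g_s> a brown dog <g_e> <seg> chasing <g_s> a red ball <g_e> <seg>."
boxes = [[0.10, 0.40, 0.45, 0.90], [0.55, 0.60, 0.70, 0.75]]
print(parse_grounded_response(response, boxes))
# [('a brown dog', [0.1, 0.4, 0.45, 0.9]), ('a red ball', [0.55, 0.6, 0.7, 0.75])]
```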

3. Evaluation Protocol and Metrics

Grounding Correctness

  • Spatial Match: a prediction counts as correct when IoU > 0.5 with the ground-truth region:

\mathrm{IoU}(B_{pred}, B_{gt}) = \frac{|B_{pred} \cap B_{gt}|}{|B_{pred} \cup B_{gt}|}

  • Semantic Alignment: GPT-4 judges phrase–box matches for referential correctness beyond spatial metrics.

Let N_{gt} be the total number of ground-truth mentions, N_{pred} the total number of predicted mentions, TP_{gt} the number of ground-truth mentions recovered by the prediction, and TP_{pred} the number of predicted mentions aligned to ground truth. The core metrics are:

R = \frac{TP_{gt}}{N_{gt}}, \quad P = \frac{TP_{pred}}{N_{pred}}, \quad F_1 = \frac{2PR}{P+R}
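
A minimal sketch of the spatial part of this computation follows: predicted boxes are matched to ground-truth boxes greedily under the IoU > 0.5 threshold, and recall, precision, and F1 are derived from the matched counts. The greedy one-to-one matching (under which TP_{gt} = TP_{pred}) is an assumption for illustration; the benchmark additionally relies on GPT-4 to judge whether a matched phrase refers to the correct entity.

```python
def iou(a, b):
    """IoU of two boxes in [x0, y0, x1, y1] format."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def grounding_scores(pred_boxes, gt_boxes, thresh=0.5):
    """Recall, precision, F1 under greedy one-to-one IoU matching (illustrative)."""
    matched_gt, tp = set(), 0
    for p in pred_boxes:
        best_j, best_iou = None, thresh
        for j, g in enumerate(gt_boxes):
            score = iou(p, g)
            if j not in matched_gt and score > best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            matched_gt.add(best_j)
            tp += 1  # under one-to-one matching, TP_gt == TP_pred == tp
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f1
```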

Chat Quality

  • Chat quality is evaluated by GPT-4 with grounding tags removed, across detailed description, conversation, and compositional reasoning; the three scores are averaged into the "All" score.

4. Baseline Models and Comparative Results

Architectures Benchmarked:

  • Pure Visual Chat: LLaVA (0.3B)
  • Grounding Modules Combined with LMM: BuboGPT (2B), Shikra (0.3B), MiniGPT-v2 (1B), CogVLM-Grounding (10B), CogVLM-Chat (10B)
  • GPT-4 Vision + Set-of-Mark (SoM) external prompts
  • LLaVA-Grounding (0.35B): End-to-end integration, CLIP encoder with grounding head (OpenSeeD), segmentation support, and Semantic-SAM prompt encoding.

Representative Results:

Model              Recall   Precision   F1     All (Chat)
BuboGPT (2B)        26.2      37.2      30.7     78.2
Shikra (0.3B)       21.1      39.8      27.6     75.5
MiniGPT-v2 (1B)     20.6      33.6      25.6     45.8
CogVLM-G (10B)      22.3      56.3      32.0     34.9
GPT-4V + SoM          –       55.1      63.2     93.3
LLaVA-G (0.35B)     28.6      52.7      37.1     79.3

Key findings: LLaVA-Grounding attains the highest F1 among the open-source LMMs while maintaining competitive chat fluency, whereas GPT-4V + SoM achieves stronger grounding by relying on external specialist models.

Classic Referring Expression Comprehension (REC) and Segmentation Results:

  • [email protected] (REC): 89.16 / 81.68 / 84.82 (RefCOCO/+/g)
  • cIoU (RES): 79.68 / 72.92 / 74.39
  • mIoU: 77.13 / 68.79 / 71.54

5. Technical Innovations and Ablation Analysis

  • Integration of visual prompts (marks, clicks, boxes) via Semantic-SAM embeddings. On COCO, object classification accuracy with clicks/boxes: 70.8%/71.5%. Marks yield 77.9% top-1 accuracy on Flickr30K.
  • Grounding-model detachment ablation: preventing gradient flow from the grounding head into the LLM improves chat scores but degrades grounding accuracy, confirming the value of joint training (see the sketch after this list).
  • Query count in the OpenSeeD head: 50 learned region queries give the best REC/RES results.
  • Hallucination reduction: Grounding tokens enable user intervention and correction during dialog.
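
The detachment ablation can be sketched as follows: the LLM hidden states at <seg> token positions are passed to the grounding head either directly (joint training) or through a stop-gradient, so the grounding loss no longer updates the LLM. The module and dimension names below are hypothetical placeholders, not the actual LLaVA-Grounding components (e.g., the OpenSeeD-based head).

```python
import torch
import torch.nn as nn

class ToyGroundingHead(nn.Module):
    """Hypothetical stand-in for a grounding head; predicts one box per <seg> token."""
    def __init__(self, hidden_dim=4096):
        super().__init__()
        self.box_regressor = nn.Linear(hidden_dim, 4)  # normalized [x0, y0, x1, y1]

    def forward(self, seg_hidden_states, detach_from_llm=False):
        # detach_from_llm=True reproduces the ablation: grounding gradients
        # no longer flow back into the LLM hidden states.
        if detach_from_llm:
            seg_hidden_states = seg_hidden_states.detach()
        return self.box_regressor(seg_hidden_states).sigmoid()

# Usage sketch: hidden states gathered at the positions of <seg> tokens.
hidden = torch.randn(2, 4096, requires_grad=True)      # two <seg> tokens in the response
head = ToyGroundingHead()
boxes_joint = head(hidden)                              # joint training: gradients reach the LLM
boxes_ablate = head(hidden, detach_from_llm=True)       # ablation: LLM untouched by grounding loss
```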

6. Limitations and Future Directions

Current Grounding-Bench data covers a closed vocabulary (COCO classes). Expanding to fully open-vocabulary grounding and more complex multi-agent, tool-using dialog remains an open direction. Other unresolved challenges include:

  • Improving semantic and spatial alignment for ambiguous and complex visual scenes.
  • Enabling multiclass and multiregion simultaneous predictions in free-form dialogs.
  • Generalizing region/entity mapping protocols for arbitrary visual domains and language taxonomies.

7. Contextual Impact and Recommendations

Grounding-Bench establishes a rigorous criterion for evaluating LMMs where both verbal and spatial competence are essential. It drives LMM research toward unified architectures capable of integrated chat and grounding, and highlights the need for models that do not trade off dialog fluency for spatial precision. Its joint evaluation protocol combining GPT-4 simulated semantic judgments and standard spatial metrics provides holistic insight into model behavior and failure modes. Extension to open-vocabulary categories and richer interaction protocols is anticipated to further advance the state of multimodal understanding (Zhang et al., 2023).

References

Zhang, H., et al. (2023). LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models. arXiv:2312.02949.