Davidsonian Scene Graph (DSG) Score
- DSG Score is a formal evaluation metric that decomposes text prompts into a DAG of atomic propositions and applies logical entailment.
- It employs a three-step pipeline that extracts tuples, generates yes/no questions, and enforces dependency consistency to avoid hallucinations.
- The metric integrates coverage and consistency scores, validated through the DSG-1k benchmark, to reliably assess text-to-image alignment.
The Davidsonian Scene Graph (DSG) Score is a formal, empirically grounded evaluation metric for the fine-grained assessment of text-to-image alignment in generative models. The DSG framework decomposes textual prompts into a Directed Acyclic Graph (DAG) of atomic propositions, organizes these into question-answer pairs, enforces logical entailment via the dependency structure, and aggregates visual question answering (VQA) outputs into an interpretable coverage score. This methodology addresses reliability challenges in prior Question Generation/Answering (QG/A) approaches by ensuring atomicity, uniqueness, coverage, and dependency consistency. The open-source DSG-1k benchmark provides high-resolution human and model-based evaluations across diverse categories (Cho et al., 2023).
1. Formal Definition of the Davidsonian Scene Graph
At its core, the Davidsonian Scene Graph is a DAG that encapsulates the full decompositional semantics of a text prompt by mapping it to a set of atomic propositions (the nodes) and entailment dependencies (the edges).
- Atomic Propositions (nodes) fall into four well-defined types:
  - Entities: Single-item tuples, e.g., (“motorcycle”)
  - Attributes: Pairwise tuples of the form (attribute, entity), e.g., (“blue”, “motorcycle”)
  - Relationships: Triplet tuples (relation, subject, object), e.g., (“next to”, “motorcycle”, “door”)
  - Globals: Single-item scene-wide properties, e.g., (“bright lighting”)
- Dependency Edges (directed) encode entailment: a child node's truth can be evaluated only if the parent node is true. For example, the truth of (“blue”, “motorcycle”) is conditioned on the existence of (“motorcycle”). Mathematically, for a prompt $p$:
  - The set of atomic tuples is $T = \{t_1, \ldots, t_n\}$, each $t_i$ of arity 1, 2, or 3.
  - The set of dependencies is $E \subseteq T \times T$, such that if $(t_i, t_j) \in E$, then $t_j$ entails $t_i$; a concrete sketch of such a graph is given below.
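For concreteness, a minimal sketch of the graph for a prompt such as "a blue motorcycle next to a door in bright lighting" can be written with plain Python dictionaries; the node IDs and edges below are illustrative only, not output of the actual DSG pipeline.

```python
# Illustrative DSG for "a blue motorcycle next to a door in bright lighting".
# Node IDs and dependency edges are hypothetical, shown for exposition only.
nodes = {
    1: ("motorcycle",),                    # entity       (arity 1)
    2: ("door",),                          # entity       (arity 1)
    3: ("blue", "motorcycle"),             # attribute    (arity 2)
    4: ("next to", "motorcycle", "door"),  # relationship (arity 3)
    5: ("bright lighting",),               # global       (arity 1)
}
# Directed edges (parent, child): a child is evaluated only if its parent holds.
edges = [(1, 3), (1, 4), (2, 4)]
```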
2. Question Generation Pipeline
The QG pipeline extracts and realizes the semantic structure of the prompt in three sequential LLM stages:
- Tuple Extraction: The prompt is parsed to produce a complete, non-hallucinated set of atomic tuples $T = \{t_i\}$, with instructions to strictly avoid fact invention.
- Question Realization: Each tuple $t_i$ is mapped to a unique, atomic yes/no question $q_i$ reflecting the truth of $t_i$. For example, ("blue", "motorcycle") yields “Is the motorcycle blue?” Each question targets only a single atomic fact.
- Dependency Extraction: For every tuple $t_i$, the set of parent IDs is determined, capturing the logical entailment structure.
This three-step process guarantees atomicity (one fact per question), uniqueness (no duplications), hallucination-freeness (constrained to prompt content), and comprehensive semantic coverage (all ground-truth tuples enumerated). The entire method is codified by:
```python
def generate_DSG(prompt, LLM):
    id2tuple    = LLM(prompt, "Extract minimal semantic tuples...")
    id2parents  = LLM(prompt, id2tuple, "For each tuple, list parent tuple IDs...")
    id2question = LLM(prompt, id2tuple, "Rewrite each tuple as a natural-language yes/no question...")
    return id2tuple, id2parents, id2question
```
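For the motorcycle prompt sketched above, the three returned mappings might look roughly as follows; the exact wording and IDs depend on the LLM and prompt templates used, so this is only a hypothetical illustration.

```python
# Hypothetical outputs of generate_DSG for
# "a blue motorcycle next to a door in bright lighting".
id2tuple = {
    1: ("motorcycle",),
    2: ("door",),
    3: ("blue", "motorcycle"),
    4: ("next to", "motorcycle", "door"),
    5: ("bright lighting",),
}
id2parents = {1: [], 2: [], 3: [1], 4: [1, 2], 5: []}
id2question = {
    1: "Is there a motorcycle?",
    2: "Is there a door?",
    3: "Is the motorcycle blue?",
    4: "Is the motorcycle next to the door?",
    5: "Is the lighting bright?",
}
```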
3. Visual Question Answering and Consistency Enforcement
Given the set of DSG questions $\{q_i\}$ and a rendered image $x$ (from any text-to-image model), each pair $(x, q_i)$ is answered via a pretrained VQA model (e.g., PaLI, mPLUG-large, Instruct-BLIP), yielding $a_i \in \{\text{Yes}, \text{No}\}$.
To maintain logical strictness, dependency consistency is enforced: for any child question $q_j$ whose parent question $q_i$ receives a "No," $q_j$ is forcibly answered "No," preempting invalid queries (e.g., asking "Is the motorcycle blue?" when the motorcycle is absent).
Enforcement algorithm:
```python
for j, parents in id2parents.items():
    if any(id2score[p] == 0 for p in parents):
        id2score[j] = 0
```
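A minimal, self-contained run of this enforcement step on made-up VQA scores (here, the motorcycle in node 1 was not detected) behaves as follows:

```python
# Hypothetical pre-enforcement VQA scores: the motorcycle (id 1) is absent.
id2parents = {1: [], 2: [], 3: [1], 4: [1, 2], 5: []}
id2score = {1: 0, 2: 1, 3: 1, 4: 1, 5: 1}

for j, parents in id2parents.items():
    if any(id2score[p] == 0 for p in parents):
        id2score[j] = 0

print(id2score)  # {1: 0, 2: 1, 3: 0, 4: 0, 5: 1} -- children of the absent motorcycle are zeroed
```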
4. Mathematical Formulation of the DSG Score
The DSG Score is mathematically defined as a composite of coverage and consistency components:
- Coverage Score: Measures agreement of the VQA answers with the ground truth (every question derived from the prompt should be answered "Yes") over all questions $q_i$:

  $$\text{Coverage} = \frac{1}{|T|} \sum_{i=1}^{|T|} \mathbb{1}\left[a_i = \text{Yes}\right]$$

- Consistency Score: Quantifies dependency violations:

  $$\text{Consistency} = 1 - \frac{\left|\{(t_i, t_j) \in E : a_i = \text{No} \wedge a_j = \text{Yes}\}\right|}{|E|}$$

  i.e., one minus the fraction of entailment violations.

- Composite DSG Score:

  $$\text{DSG} = \alpha \cdot \text{Coverage} + (1 - \alpha) \cdot \text{Consistency}$$

  with $\alpha \in [0, 1]$ balancing strict coverage against dependency consistency. Empirically, $\alpha = 1$ was used, since dependency consistency is already hard-enforced during answering (a hypothetical worked example follows).
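As a hypothetical worked example (the answers are invented, not benchmark data): suppose the five motorcycle questions above receive enforced answers $a = (\text{No}, \text{Yes}, \text{No}, \text{No}, \text{Yes})$ over the edge set $E = \{(1,3), (1,4), (2,4)\}$. Then

$$\text{Coverage} = \tfrac{1}{5}(0 + 1 + 0 + 0 + 1) = 0.4, \qquad \text{Consistency} = 1 - \tfrac{0}{3} = 1, \qquad \text{DSG} = 1 \cdot 0.4 + 0 \cdot 1 = 0.4 \;\; (\alpha = 1).$$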
Algorithmic summary:
```python
def DSG_score(prompt, gen_image, LLM, VQA, alpha=1.0):
    id2tuple, id2parents, id2question = generate_DSG(prompt, LLM)
    id2score = {}
    for i, q in id2question.items():
        ans = VQA(gen_image, q)                # "Yes" or "No"
        id2score[i] = 1.0 if ans == "Yes" else 0.0
    # Hard dependency enforcement: zero out children of failed parents.
    for j, parents in id2parents.items():
        if any(id2score[p] == 0.0 for p in parents):
            id2score[j] = 0.0
    coverage = sum(id2score.values()) / len(id2score)
    # Edge set E derived from the parent map (parent i -> child j).
    E = [(i, j) for j, parents in id2parents.items() for i in parents]
    violations = sum(1 for (i, j) in E if id2score[i] == 0 and id2score[j] == 1)
    # After hard enforcement, violations is 0 by construction, hence alpha=1 in practice.
    consistency = 1 - violations / len(E) if E else 1.0
    return alpha * coverage + (1 - alpha) * consistency
```
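The function expects callables for the LLM and VQA stages. A minimal usage sketch with stand-in stubs (not real model wrappers; production use would wrap an actual LLM and a VQA model such as PaLI or mPLUG-large behind the same signatures) could look like this:

```python
# Stand-in stubs for illustration only.
def stub_LLM(prompt, *args):
    instruction = args[-1]
    if instruction.startswith("Extract"):
        return {1: ("motorcycle",), 2: ("blue", "motorcycle")}
    if instruction.startswith("For each"):
        return {1: [], 2: [1]}
    return {1: "Is there a motorcycle?", 2: "Is the motorcycle blue?"}

def stub_VQA(image, question):
    return "Yes"  # a real VQA model would inspect the generated image

score = DSG_score("a blue motorcycle", gen_image=None, LLM=stub_LLM, VQA=stub_VQA, alpha=1.0)
print(score)  # 1.0 -- every question answered "Yes" by the stub
```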
5. DSG-1k Benchmark: Structure and Evaluation
DSG-1k is an open-sourced benchmark assembled to guarantee broad, challenging semantic coverage and to rigorously validate the reliability of DSG scoring. It consists of 1,060 human-written prompts sampled uniformly from ten diverse sources: TIFA-160, Stanford Paragraphs, Localized Narratives, CountBench, VRD relations, DiffusionDB, Midjourney, PoseScript, the commonsense-defying Whoops! examples, and DrawText.
Distribution of question types:
| Category | Approximate Count | Examples |
|---|---|---|
| Entity questions | ~3,400 | (“motorcycle”) |
| Attribute questions | ~2,250 | (“blue”, “motorcycle”) |
| Relation questions | ~2,080 | (“next to”, ...) |
| Global questions | ~530 | (“bright lighting”) |
Evaluation on DSG-1k comprised both VQA model outputs and human answers for each question over three state-of-the-art generators (Stable Diffusion v2.1, Imagen*, MUSE*), measuring:
- Per-question VQA vs. human matching accuracy (73.8% for PaLI)
- Per-item Spearman correlation with human 1–5 Likert ratings
- Detailed error analysis by semantic category (counting and text rendering remain the hardest)
These results confirm that DSG-1k prompts and the corresponding DSG questions are atomic, unique, and hallucination-free, and that the DSG Score provides a reliable, fine-grained measurement of text-to-image alignment.
6. Reliability and Significance of DSG Scoring
The DSG framework addresses prior QG/A shortcomings by eliminating hallucinations, duplications, and omissions at tuple extraction; enforcing atomicity and uniqueness of questions; and maintaining logical entailment consistency. Integrated human and model-based assessments on DSG-1k empirically validate both coverage and reliability across a wide semantic range. This suggests that the DSG Score can serve as a standard for evaluating prompt-image faithfulness in generative modeling pipelines when fine-grained metric rigor is required (Cho et al., 2023).