
Davidsonian Scene Graph (DSG) Score

Updated 4 December 2025
  • DSG Score is a formal evaluation metric that decomposes text prompts into a DAG of atomic propositions and applies logical entailment.
  • It employs a three-step pipeline that extracts tuples, generates yes/no questions, and enforces dependency consistency to avoid hallucinations.
  • The metric integrates coverage and consistency scores, validated through the DSG-1k benchmark, to reliably assess text-to-image alignment.

The Davidsonian Scene Graph (DSG) Score is a formal, empirically grounded evaluation metric for the fine-grained assessment of text-to-image alignment in generative models. The DSG framework decomposes textual prompts into a Directed Acyclic Graph (DAG) of atomic propositions, organizes these into question-answer pairs, enforces logical entailment via dependency structure, and aggregates visual question answering (VQA) outputs into an interpretable coverage score. This methodology addresses reliability challenges in prior Question Generation/Answering (QG/A) approaches by ensuring atomicity, uniqueness, coverage, and dependency consistency. The open-source DSG-1k benchmark provides high-resolution human and model-based evaluations across diverse categories (Cho et al., 2023).

1. Formal Definition of the Davidsonian Scene Graph

At its core, the Davidsonian Scene Graph is a DAG that encapsulates the full decompositional semantics of a text prompt by mapping it to a set of atomic propositions (the nodes) and entailment dependencies (the edges).

  • Atomic Propositions (nodes) fall into four well-defined types:
    • Entities: Single-item tuples, e.g., (“motorcycle”)
    • Attributes: Pairwise tuples of the form (attribute, entity), e.g., (“blue”, “motorcycle”)
    • Relationships: Triplet tuples (relation, subject, object), e.g., (“next to”, “motorcycle”, “door”)
    • Globals: Single-item scene-wide properties, e.g., (“bright lighting”)
  • Dependency Edges (directed) encode entailment: a child node can be evaluated only if its parent node is true. For example, the truth of (“blue”, “motorcycle”) is conditioned on the existence of (“motorcycle”). Formally, for a prompt P:
    • The set of atomic tuples is T = {t_1, …, t_M}, where each t_i has arity 1, 2, or 3.
    • The set of dependencies is E ⊆ {(t_i → t_j)}, such that if (t_i → t_j) ∈ E, then truth(t_j) entails truth(t_i).
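
For concreteness, the DSG for a short prompt such as “a blue motorcycle next to a door” can be written down directly as tuples and edges. The following is a minimal Python sketch; the node IDs and variable names are illustrative, not part of any published API:

```python
# Illustrative DSG for the prompt "a blue motorcycle next to a door".
# Node IDs and variable names are hypothetical, chosen for this sketch.
tuples = {
    1: ("motorcycle",),                    # entity
    2: ("door",),                          # entity
    3: ("blue", "motorcycle"),             # attribute of node 1
    4: ("next to", "motorcycle", "door"),  # relationship between nodes 1 and 2
}
# Directed edges parent -> child: a child is evaluated only if its parent holds.
edges = {(1, 3), (1, 4), (2, 4)}
```

Every tuple has arity 1, 2, or 3, and every edge connects two valid node IDs, matching the formal definition above.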

2. Question Generation Pipeline

The QG pipeline extracts and realizes the semantic structure of the prompt in three sequential LLM stages:

  1. Tuple Extraction: The prompt P is parsed to produce a complete, non-hallucinated set of atomic tuples {(i, t_i)}, with instructions to strictly avoid fact invention.
  2. Question Realization: Each tuple (i, t_i) is mapped to a unique, atomic yes/no question q_i reflecting the truth of t_i. For example, ("blue", "motorcycle") yields “Is the motorcycle blue?” Each question targets only a single atomic fact.
  3. Dependency Extraction: For every tuple i, the set of parent IDs pa(i) is determined, capturing the logical entailment structure.

This three-step process guarantees atomicity (one fact per question), uniqueness (no duplications), hallucination-freeness (constrained to prompt content), and comprehensive semantic coverage (all ground-truth tuples enumerated). The entire method is codified by:

def generate_DSG(prompt, LLM):
    # Three sequential LLM calls: tuples, dependencies, questions.
    id2tuple    = LLM(prompt, "Extract minimal semantic tuples...")
    id2parents  = LLM(prompt, id2tuple, "For each tuple, list parent tuple IDs...")
    id2question = LLM(prompt, id2tuple, "Rewrite each tuple as a natural-language yes/no question...")
    return id2tuple, id2parents, id2question
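
For a prompt like “a blue motorcycle”, the three dictionaries returned by this pipeline would take roughly the following shape (a hand-written illustration of the data structures, not actual LLM output):

```python
# Hand-written illustration of the generate_DSG outputs for "a blue motorcycle";
# these are not real LLM responses, only the expected data shapes.
id2tuple    = {1: ("motorcycle",), 2: ("blue", "motorcycle")}
id2parents  = {1: [], 2: [1]}  # the attribute depends on its entity
id2question = {1: "Is there a motorcycle?", 2: "Is the motorcycle blue?"}
```

The three dictionaries share the same key set, so each tuple ID lines up with exactly one question and one parent list.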

3. Visual Question Answering and Consistency Enforcement

Given the set of DSG questions {q_i} and a rendered image I (from any text-to-image model), each (I, q_i) pair is answered via a pretrained VQA model (e.g., PaLI, mPLUG-large, Instruct-BLIP), yielding ã_i ∈ {Yes, No}.

To maintain logical strictness, dependency consistency is enforced: for any child question q_j whose parent q_i receives a "No," q_j is forcibly answered "No," preempting invalid queries (e.g., asking "Is the motorcycle blue?" when the motorcycle is absent).

Enforcement algorithm:

for j, parents in id2parents.items():
    if any(id2score[p] == 0 for p in parents):
        id2score[j] = 0
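
A single pass like the one above handles one level of dependencies; in a deeper DAG, a failed ancestor should also zero out all of its descendants. One way to guarantee this regardless of iteration order, sketched here under the assumption that id2parents encodes a DAG (the function name is chosen for this example), is to iterate to a fixed point:

```python
def enforce_consistency(id2parents, id2score):
    """Zero out every node with a failed ancestor, iterating to a fixed
    point so multi-level dependency chains are handled regardless of
    dict iteration order. (Sketch for illustration.)"""
    changed = True
    while changed:
        changed = False
        for j, parents in id2parents.items():
            if id2score[j] != 0.0 and any(id2score[p] == 0.0 for p in parents):
                id2score[j] = 0.0
                changed = True
    return id2score
```

On a three-node chain 1 → 2 → 3 whose root is answered "No," both descendants are forced to 0 even if the VQA model answered them "Yes."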

4. Mathematical Formulation of the DSG Score

The DSG Score is mathematically defined as a composite of coverage and consistency components:

  • Coverage Score: Measures agreement of VQA answers ã_q with ground-truth answers a_q over all questions Q_P:

\mathrm{Coverage}(P, I) = \frac{1}{|Q_P|} \sum_{q \in Q_P} \mathbf{1}\{\mathrm{VQA}(I, q) = a_q\}

  • Consistency Score: Quantifies dependency violations:

\mathrm{Consistency}(I) = 1 - \frac{\#\{(i \to j) \in E : \tilde{a}_i = \text{No},\ \tilde{a}_j = \text{Yes}\}}{|E|}

i.e., one minus the fraction of entailment violations.

  • Composite DSG Score:

\mathrm{DSGScore}(P, I) = \alpha \cdot \mathrm{Coverage}(P, I) + (1 - \alpha) \cdot \mathrm{Consistency}(I)

with α ∈ [0, 1] balancing strict coverage against dependency consistency. Empirically, α = 1 was used because dependency consistency is already enforced as a hard constraint during answering.
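
A worked toy example may help (the numbers and the choice of α = 0.5 are illustrative, not from the paper): with 5 questions of which 4 match the ground truth, and 3 dependency edges of which 1 is violated,

```python
# Toy numbers for illustration only; alpha = 0.5 is hypothetical,
# chosen so that both terms visibly contribute to the score.
matches, num_questions = 4, 5
violations, num_edges = 1, 3
coverage = matches / num_questions            # 0.8
consistency = 1 - violations / num_edges      # 2/3
alpha = 0.5
dsg_score = alpha * coverage + (1 - alpha) * consistency  # 0.4 + 1/3 ≈ 0.733
```

With α = 1, as used empirically, the score reduces to the coverage term alone.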

Algorithmic summary:

def DSG_score(prompt, gen_image, LLM, VQA, alpha=1.0):
    id2tuple, id2parents, id2question = generate_DSG(prompt, LLM)
    id2score = {}
    for i, q in id2question.items():
        ans = VQA(gen_image, q)  # "Yes" or "No"
        id2score[i] = 1.0 if ans == "Yes" else 0.0
    # Keep the raw (pre-enforcement) answers: consistency is measured
    # on these, since enforcement removes all violations by construction.
    raw_score = dict(id2score)
    # Hard enforcement: a child fails whenever any parent fails.
    for j, parents in id2parents.items():
        if any(id2score[p] == 0.0 for p in parents):
            id2score[j] = 0.0
    coverage = sum(id2score.values()) / len(id2score)
    # The edge set E is implied by the parent lists.
    edges = [(p, j) for j, parents in id2parents.items() for p in parents]
    violations = sum(
        1 for (p, j) in edges
        if raw_score[p] == 0.0 and raw_score[j] == 1.0
    )
    consistency = 1.0 - violations / len(edges) if edges else 1.0
    return alpha * coverage + (1 - alpha) * consistency

5. DSG-1k Benchmark: Structure and Evaluation

DSG-1k is an open-sourced benchmark assembled to guarantee broad and challenging semantic coverage and to rigorously validate DSG scoring reliability. It consists of 1,060 human-written prompts, sampled uniformly from ten diverse sources, including TIFA-160, Stanford Paragraphs, Localized Narratives, CountBench, VRD relations, DiffusionDB, Midjourney, PoseScript, Whoops! commonsense-defying examples, and DrawText.

Distribution of question types:

Category            | Approximate Count | Examples
Entity questions    | ~3,400            | (“motorcycle”)
Attribute questions | ~2,250            | (“blue”, “motorcycle”)
Relation questions  | ~2,080            | (“next to”, ...)
Global questions    | ~530              | (“bright lighting”)

Evaluation on DSG-1k comprised both VQA model outputs and human answers for each question over three state-of-the-art generators (Stable Diffusion v2.1, Imagen*, MUSE*), measuring:

  • Per-question VQA vs. human matching accuracy (73.8% for PaLI)
  • Per-item Spearman correlation with human 1–5 Likert ratings (ρ = 0.56)
  • Detailed error analysis by semantic category (counting, text-rendering remain hardest)

These results confirm that DSG-1k prompts and corresponding DSG questions are atomic, unique, and hallucination-free, and that the DSGScore provides reliable fine-grained text-to-image alignment measurement.

6. Reliability and Significance of DSG Scoring

The DSG framework addresses prior QG/A shortcomings by eliminating hallucinations, duplications, and omissions at tuple extraction; enforcing atomicity and uniqueness of questions; and maintaining logical entailment consistency. Integrated human and model-based assessments in DSG-1k empirically validate both coverage and reliability across a wide semantic range. This suggests DSGScore can serve as a standard for evaluating prompt-image faithfulness in generative modeling pipelines when fine-grained metric rigor is required (Cho et al., 2023).
