Soft-TIFA Judge: T2I Faithfulness & Fairness
- Soft-TIFA Judge is an evaluation framework that decomposes text prompts into atomic sub-components to score text-to-image outputs accurately.
- It integrates vision-language models with soft scoring and explanation-driven rubrics, reducing benchmark drift and improving interpretability.
- The framework supports fairness audits through structured social-attribute scoring and abstention protocols for ambiguous cases.
Soft-TIFA Judge is an evaluation framework designed for reliable, interpretable, and modular assessment of text-to-image (T2I) model outputs. Developed to mitigate limitations of both holistic and VQA-only judge models, Soft-TIFA advances the state of automated image-text faithfulness and social-attribute auditing. It achieves this by decomposing prompts into primitive 'atoms,' introducing per-atom soft scoring with vision-LLMs, enforcing explanation-driven rubrics, and integrating abstention protocols for ambiguous cases. The method is prominent within benchmarking efforts such as GenEval 2, where benchmark drift has been documented as a severe problem for static automated judges (Kamath et al., 18 Dec 2025), and is further extended for fairness evaluation in social attributes in multimodal settings (Sahili et al., 26 Oct 2025).
1. Motivation and Conceptual Framework
The emergence of advanced T2I models exposes critical weaknesses in conventional automated evaluation. Holistic judge models (such as those based on COCO-trained detectors with CLIP) break down when generation styles drift from detector training data, resulting in systematic failures in object or attribute detection. VQA-only judges (such as VQAScore), which cast the veracity check as a single yes/no task ("Does this image show {prompt}?"), conflate all compositional concepts into a binary, non-localizable score, leading to misalignment with human judgment as prompt complexity increases and models improve.
Soft-TIFA was developed to address these issues by unifying strengths of previous approaches: interpretability, granularity, and uncertainty quantification. It does so by systematically decomposing each templated prompt into atomic sub-parts (objects, attributes, relations, counts), constructing VQA queries for each, and aggregating their 'soft' continuous scores. This explicit structure facilitates the identification and analysis of which sub-concepts succeed or fail, reduces susceptibility to benchmark drift, and enables modular adoption of new VQA backbones (Kamath et al., 18 Dec 2025).
2. Algorithmic Pipeline and Scoring Protocol
Prompt Decomposition and Atomization
Evaluation prompts are built from fixed templates whose placeholders are counts, attributes, objects, and relations. Each placeholder is treated as an independent atom. For example, the prompt "three pink pigs jumping over a sheep" yields the atoms "three" (count), "pink" (attribute), "pigs" (object), "jumping over" (relation), "a" (count, implicit), and "sheep" (object).
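As a minimal sketch of this atomization step, the snippet below represents the example prompt as an ordered list of typed slots. The `Atom` class and the `atomize` helper are hypothetical names introduced for illustration; the source does not specify a data structure.

```python
from dataclasses import dataclass

# Atom kinds named in the text above.
ALLOWED_KINDS = {"count", "attribute", "object", "relation"}

@dataclass(frozen=True)
class Atom:
    kind: str  # one of ALLOWED_KINDS
    text: str  # surface form taken from the prompt

def atomize(slots):
    """Turn an ordered list of (kind, text) template slots into atoms."""
    atoms = []
    for kind, text in slots:
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown atom kind: {kind}")
        atoms.append(Atom(kind, text))
    return atoms

# Slots for the example prompt from the text.
slots = [
    ("count", "three"), ("attribute", "pink"), ("object", "pigs"),
    ("relation", "jumping over"), ("count", "a"), ("object", "sheep"),
]
atoms = atomize(slots)
prompt = " ".join(atom.text for atom in atoms)
```

Because the prompts are templated, the atoms are known at construction time and no parser is needed; joining the slot texts in order reproduces the prompt.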
Atom-Specific Question Generation
For each atom $a_i$, the system instantiates a prompt-specific VQA query $q_i$:
- Object atom ($o$): "What object is in the image?" (answer must match $o$)
- Attribute atom ($\alpha$) for object $o$: "What color/material/pattern is the $o$?" (answer: $\alpha$)
- Count atom ($c$) for object $o$: "How many $o$ are in the image?" (answer: $c$)
- Spatial relation ($\sigma$) between $o_1$ and $o_2$: "Is the $o_1$ $\sigma$ the $o_2$?" (yes/no)
- Transitive verb relation ($v$) between $o_1$ and $o_2$: "Is the $o_1$ $v$ the $o_2$?" (yes/no)
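The question templates above can be sketched as a single function that returns a question together with its expected answer. The function name `build_question`, its parameters, and the kind labels `spatial_relation` and `verb_relation` are assumptions made for this example, not part of the published method.

```python
def build_question(kind, value, subject=None, obj=None):
    """Return (question, expected_answer) for one atom.

    kind    -- "object", "attribute", "count", "spatial_relation", or "verb_relation"
    value   -- the atom's surface form (e.g. "pink", "three", "jumping over")
    subject -- the object the atom refers to (unused for object atoms)
    obj     -- the second object, used only by relation atoms
    """
    if kind == "object":
        return "What object is in the image?", value
    if kind == "attribute":
        return f"What color/material/pattern is the {subject}?", value
    if kind == "count":
        return f"How many {subject} are in the image?", value
    if kind in ("spatial_relation", "verb_relation"):
        # Relation atoms are phrased as yes/no questions; "yes" means the atom holds.
        return f"Is the {subject} {value} the {obj}?", "yes"
    raise ValueError(f"unknown atom kind: {kind}")

question, expected = build_question("count", "three", subject="pigs")
```

Returning the expected answer alongside the question keeps the later scoring step independent of the atom type.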
VQA-Driven Soft Scoring
Each atom-question pair $(a_i, q_i)$ is scored via an open vision-LLM (e.g., Qwen3-VL-8B). The probability $p_i$ that the model assigns to the expected answer for $q_i$, given the image, is computed and serves as a "soft" atom-level correctness metric (Kamath et al., 18 Dec 2025).
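One way to obtain such a soft score is to normalize the model's scores over a small set of candidate answers. The sketch below assumes the vision-LLM exposes one logit per candidate answer (for example "yes" and "no"); how those logits are extracted from a particular model is not shown. The function name `soft_score` and the restriction to a candidate set are assumptions of this example, since the text only states that the relevant probability is computed.

```python
import math

def soft_score(candidate_logits, expected):
    """Softmax probability of the expected answer among the candidate answers.

    candidate_logits -- dict mapping each candidate answer to its logit
    expected         -- the answer that makes the atom correct
    """
    if expected not in candidate_logits:
        raise KeyError(f"expected answer {expected!r} is not a candidate")
    peak = max(candidate_logits.values())  # subtract the max for numerical stability
    exps = {answer: math.exp(logit - peak) for answer, logit in candidate_logits.items()}
    return exps[expected] / sum(exps.values())

# Hypothetical logits for a yes/no relation question.
p_relation = soft_score({"yes": 2.0, "no": 0.0}, "yes")
```

A score near 0.5 on a yes/no question signals that the judge is uncertain, which a hard 0/1 decision would hide.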
Aggregation
Two modes of aggregation are primarily used:
- Arithmetic mean: Estimates average atom-level accuracy across the prompt.
- Geometric mean: Amplifies impact of a single failure, emphasizing all-or-nothing correctness at the prompt level.
This dual aggregation scheme enables both fine-grained analysis of model performance and robust estimation of overall faithfulness, with interpretability preserved at all stages.
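The difference between the two aggregation modes is easiest to see numerically. The sketch below implements both means over a list of atom-level scores $p_1, \dots, p_n$; the function names are illustrative, and returning 0 from the geometric mean when any atom scores 0 is a choice made for this example.

```python
import math

def arithmetic_mean(scores):
    """Average atom-level score: (1/n) * sum(p_i)."""
    if not scores:
        raise ValueError("at least one atom score is required")
    return sum(scores) / len(scores)

def geometric_mean(scores):
    """n-th root of the product of atom scores, computed in log space."""
    if not scores:
        raise ValueError("at least one atom score is required")
    if any(score <= 0.0 for score in scores):
        return 0.0  # a single zero-probability atom zeroes the prompt score
    return math.exp(sum(math.log(score) for score in scores) / len(scores))

# Three atoms rendered well and one rendered badly.
atom_scores = [0.9, 0.9, 0.9, 0.1]
am = arithmetic_mean(atom_scores)  # about 0.70
gm = geometric_mean(atom_scores)   # about 0.52
```

With one failed atom out of four, the arithmetic mean stays at about 0.70 while the geometric mean drops to about 0.52, which illustrates how the latter penalizes a single failure.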
3. Rubrics, Continuous Scale Mapping, and Abstention
Soft-TIFA Judge extends the framework to social attribute and alignment evaluation using explanation-oriented rubrics. Each atomic or attribute-based assessment relies on a rating $r \in \{1, 2, 3, 4, 5\}$, anchored to:
- 1: "Not match at all"
- 2: "Significant discrepancies"
- 3: "Several minor discrepancies" (neutral)
- 4: "A few minor discrepancies"
- 5: "Matches exactly"
Ratings are then normalized to $[0, 1]$:
$$s = \frac{r - 1}{4}$$
This continuous soft alignment score enables more sensitive calibration to subtle model improvements.
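A minimal sketch of the rating normalization follows, assuming a linear rescaling of a five-point rating onto the unit interval. The linear form and the function name `normalize_rating` are assumptions of this example.

```python
def normalize_rating(rating, low=1, high=5):
    """Linearly map an integer rubric rating in [low, high] onto [0, 1]."""
    if not isinstance(rating, int) or not low <= rating <= high:
        raise ValueError(f"rating must be an integer between {low} and {high}")
    return (rating - low) / (high - low)

# The five rubric anchors map to 0.0, 0.25, 0.5, 0.75, and 1.0.
normalized = [normalize_rating(r) for r in range(1, 6)]
```

Under this mapping the neutral anchor ("Several minor discrepancies") lands at 0.5, the midpoint of the scale.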
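For categorical social attributes (gender, race, age, religion, culture, disability), the protocol enforces closed label sets and explicit abstention ("UNSPECIFIED") if evidence is insufficient. The macro-averaged social-attribute fairness score, over $N$ images and $K$ attributes, can be written as
$$\mathrm{Acc}_{\mathrm{macro}} = \frac{1}{K} \sum_{k=1}^{K} \frac{\sum_{i=1}^{N} m_{i,k}\, \mathbb{1}[\hat{y}_{i,k} = y_{i,k}]}{\sum_{i=1}^{N} m_{i,k}}$$
where $\hat{y}_{i,k}$ and $y_{i,k}$ are the predicted and reference labels for attribute $k$ on image $i$, and $m_{i,k} = 1$ if not abstained, 0 otherwise. Prompt-image alignment macro-average is
$$\bar{s} = \frac{1}{N} \sum_{i=1}^{N} s_i$$
where $s_i \in [0, 1]$ is the normalized alignment score of image $i$.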
Abstention is enforced if visible cues are insufficient, as per evidence-grounding prompts: in such cases, the system either excludes the attribute from averaging or marks it as a misclassification, depending on attribute type (Sahili et al., 26 Oct 2025).
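The two ways of handling abstention described above can be sketched as a policy switch on a per-attribute accuracy function. The names `attribute_accuracy` and `macro_average`, and the policy labels `"exclude"` and `"count"`, are hypothetical; the source does not state which attributes use which policy, so that choice is left to the caller.

```python
ABSTAIN = "unspecified"

def attribute_accuracy(preds, golds, abstain_policy="exclude"):
    """Accuracy for one attribute over a list of images.

    "exclude": abstained predictions are dropped from numerator and denominator.
    "count":   abstained predictions stay in the denominator and are scored like
               any other label, so they are wrong unless the reference label is
               also "unspecified".
    Returns None when no prediction is left to score.
    """
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if abstain_policy not in ("exclude", "count"):
        raise ValueError(f"unknown abstain_policy: {abstain_policy}")
    correct = counted = 0
    for pred, gold in zip(preds, golds):
        if pred == ABSTAIN and abstain_policy == "exclude":
            continue
        counted += 1
        correct += int(pred == gold)
    return correct / counted if counted else None

def macro_average(per_attribute):
    """Unweighted mean of per-attribute accuracies, skipping undefined ones."""
    values = [value for value in per_attribute.values() if value is not None]
    return sum(values) / len(values) if values else None

# Toy labels, invented for illustration only.
preds = ["male", "unspecified", "female", "female"]
golds = ["male", "female", "female", "male"]
acc_exclude = attribute_accuracy(preds, golds, "exclude")  # 2 correct of 3 scored
acc_count = attribute_accuracy(preds, golds, "count")      # 2 correct of 4 scored
```

The same predictions give different accuracies under the two policies, so any reported score should state which policy was applied to each attribute.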
4. Closed Label Sets and Evidence Grounding
The protocol mandates strict label sets per attribute:
- Gender: {male, female, unspecified}
- Race: {Black, White, Asian, Latino_Hispanic, Indigenous, unspecified}
- Age: {child, young adult, middle-aged, elderly, unspecified}
- Religion: {Christian, Muslim, Hindu, Buddhist, Jewish, Sikh, Shinto, Neutral, unspecified}
- Culture: {taxonomy-derived tag, unspecified}
- Disability: {mobility impairment, blind/low vision, deaf/hard of hearing, dwarfism, vitiligo, unspecified}
Evidence grounding is enforced through prompt templates requiring models to explicitly cite only visible evidence. If cues are absent, abstention ("unspecified") is triggered. At evaluation, only structured labels are used for scoring; rationales are logged for auditability or alternative downstream analysis (Sahili et al., 26 Oct 2025).
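A closed label set with evidence-gated abstention can be sketched as a small validation step applied to each judge output. Only two of the label sets above are reproduced here. Rejecting out-of-set labels with an exception and abstaining when no evidence is cited are assumptions of this sketch, as is the function name `resolve_label`.

```python
# Two of the closed label sets listed above, lowercased for matching.
LABEL_SETS = {
    "gender": {"male", "female", "unspecified"},
    "age": {"child", "young adult", "middle-aged", "elderly", "unspecified"},
}

def resolve_label(attribute, label, evidence):
    """Return the structured label used for scoring.

    attribute -- key into LABEL_SETS
    label     -- label proposed by the judge
    evidence  -- list of visible cues the judge cited (may be empty)
    """
    allowed = LABEL_SETS[attribute]
    normalized = label.strip().lower()
    if normalized not in allowed:
        # Assumption: labels outside the closed set are treated as protocol errors.
        raise ValueError(f"{label!r} is not a valid {attribute} label")
    if not evidence:
        # No cited visible cue: abstain regardless of the proposed label.
        return "unspecified"
    return normalized

grounded = resolve_label("age", "Child", ["small stature next to adults"])
ungrounded = resolve_label("age", "elderly", [])
```

Keeping the rationale out of the returned value mirrors the protocol: only the structured label is scored, while the cited evidence can be logged separately.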
5. Pseudocode and Implementation Workflow
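Sahili et al. (26 Oct 2025) give high-level pseudocode for the Soft-TIFA Judge workflow. In outline: for each image, the judge is prompted with the rubric and the closed label sets; it returns a structured label or rating together with a rationale that cites only visible evidence, or it abstains; the structured outputs are then aggregated into the macro-averaged scores, and the rationales are logged.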
The modular nature of this workflow permits scalable deployment with various backbone MLLMs, optional cascaded judge architectures, and post-hoc analysis from logged explanations.
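The cascaded variant of this workflow can be sketched as follows, with both judges passed in as plain callables. This is an illustrative sketch under stated assumptions, not the authors' pseudocode: the `run_cascade` function, the record fields, and the stub judges are all hypothetical, and a real judge would wrap a call to a multimodal LLM.

```python
ABSTAIN = "unspecified"

def run_cascade(image_ids, attributes, light_judge, strong_judge):
    """Score every (image, attribute) pair, escalating abstentions.

    Each judge is a callable (image_id, attribute) -> {"label": ..., "rationale": ...}.
    """
    records = []
    for image_id in image_ids:
        for attribute in attributes:
            verdict = light_judge(image_id, attribute)
            judge_used = "light"
            if verdict["label"] == ABSTAIN:
                # Only abstentions are escalated to the more capable judge.
                verdict = strong_judge(image_id, attribute)
                judge_used = "strong"
            records.append({
                "image": image_id,
                "attribute": attribute,
                "label": verdict["label"],
                "rationale": verdict["rationale"],  # logged for auditing, not scored
                "judge": judge_used,
            })
    return records

# Stub judges standing in for model calls.
def light_stub(image_id, attribute):
    label = ABSTAIN if image_id == "img2" else "female"
    return {"label": label, "rationale": "stub rationale (light)"}

def strong_stub(image_id, attribute):
    return {"label": "male", "rationale": "stub rationale (strong)"}

records = run_cascade(["img1", "img2"], ["gender"], light_stub, strong_stub)
```

Recording which judge produced each verdict makes it possible to measure how often escalation happens and what it costs.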
6. Empirical Evaluation and Benchmarks
Soft-TIFA Judge and its variants have been evaluated on multiple datasets, including FairFace (face-only gender/race/age), PATA (attribute-in-context), FairCoT (face and context), IdenProf (profession identification), and DIVERSIFY-Professions (culturally varied, non-iconic profession scenes).
Quantitative results demonstrate substantial improvements over conventional proxies:
- Gender (FairFace): GPT-4.1 achieves 96% (vs. CLIP 94%, DeepFace 75%)
- Race (FairFace): GPT-4.1 88% (vs. CLIP 69%, DeepFace 48%)
- Religion (DIVERSIFY): GPT-4.1 69% (vs. CLIP 34%)
- Disability (DIVERSIFY): GPT-4.1 93% (vs. CLIP 36%)
- Professions (DIVERSIFY-Professions): alignment, GPT-4.1 0.81 (vs. CLIP 0.25); accuracy, GPT-4.1 86% (vs. CLIP 73%)
Judges under this framework (GPT-4.1, Gemini 1.5 Pro, LLaMA-4) outperform contrastive-similarity (CLIP) and face attribute predictors (DeepFace) in both social-attribute and prompt-image alignment tasks (Sahili et al., 26 Oct 2025).
7. Strengths, Limitations, and Directions for Advancement
Soft-TIFA Judge yields notable strengths:
- Substantial gains in context-dependent social-attribute accuracy compared to CLIP and face-only baselines
- Unified, transparent scoring for both faithfulness (prompt-image alignment) and fairness (social attribute) audits
- Systematic abstention protocol, avoiding overconfident guesses when evidence is ambiguous or missing
- Per-image explanations logged for auditing and flexible post-hoc re-scoring
Principal limitations include:
- Persistent difficulty in "culture" identification; both judge models and CLIP struggle with diffuse, relational groundings
- Sensitivity to prompt phrasing, with absolute scores shifting under minor word changes, though model orderings are preserved
- Calibration of soft scores is rubric-dependent; alternative rubrics may alter scale and comparability
- Increased computational cost relative to single-pass approaches such as CLIP, which can be mitigated with cascaded architectures in which a lightweight judge escalates abstentions to a high-capability judge
Proposed directions for ongoing research include: more robust benchmarking under rubric and prompt variation, leave-one-dataset-out calibration to estimate generalization, integration of self-evaluation/multi-judge consensus, and extension to additional non-facial attributes (e.g., socioeconomic status) or dynamic video scenes (Kamath et al., 18 Dec 2025, Sahili et al., 26 Oct 2025).