Compositional Consistency Score
- Compositional consistency score is a metric that quantifies a model’s ability to maintain coherent, logical, and structurally sound outputs under controlled compositional changes.
- It employs methodologies such as hierarchical decomposition, programmatic perturbation, and self-consistency trees to isolate and assess compositional phenomena.
- Empirical studies across vision, language, and simulation tasks demonstrate its diagnostic power in revealing model weaknesses invisible to traditional accuracy metrics.
A compositional consistency score is a rigorous, task-matched metric designed to quantify a model’s ability to maintain coherent, structurally sound, and logically consistent outputs under compositional manipulations. Across modalities—vision, language, and simulation-based inference—the term encompasses several related but technically distinct metrics, all addressing whether a system’s performance or internal representations respect the compositional structure of inputs or reasoning chains.
1. Formal Definitions Across Modalities
In vision and scene reasoning, the compositional consistency score often refers specifically to the SCS Similarity Index Measure (SCSSIM), which evaluates scene composition structure (SCS) preservation. For visual question answering (VQA) and language tasks, compositional consistency is operationalized as the proportion of question-sets or reasoning chains for which a model’s correctness persists across systematically constructed compositional variants. In simulation-based inference, a compositional score denotes an aggregation of individual score functions, with theoretical guarantees on error accumulation under composition.
Image Structure/SCSSIM (Haque et al., 7 Aug 2025): a similarity computed between the normalized cumulative gain curves $g_1$ and $g_2$ derived from hierarchical partitions of the two images; identical partition structure yields identical gain curves and a score of $1$.
Compositional Generalization in VQA (Li et al., 2024):

$$C = \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} \prod_{q \in \tau} a(q),$$

where $\tau$ is a triplet of questions across three composition levels and $a(q)$ is a 0/1 indicator of the model's accuracy on question $q$.
Tree-based Consistency for LLMs (Hong et al., 14 Jun 2025):

$$C_k = \frac{1}{|\mathcal{P}_k|} \sum_{p \in \mathcal{P}_k} \mathrm{sim}\big(x,\, p(x)\big),$$

where $p$ is a $k$-step path in a self-consistency tree $T$ applied to input $x$, $\mathrm{sim}(\cdot,\cdot)$ a similarity measure, and $C_k$ the aggregated consistency at depth $k$.
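A round-trip of the kind measured over such tree paths can be sketched as follows; the transform/inverse pair and the string-similarity function here are toy stand-ins, not the paper's actual operations:

```python
# Sketch of round-trip self-consistency scoring: apply a pair of nominally
# inverse operations `depth` times and compare the result to the original.
import difflib

def round_trip_consistency(text, transform, inverse, depth=2):
    """Similarity between `text` and the result of `depth` round trips
    through a transform/inverse operation pair."""
    out = text
    for _ in range(depth):
        out = inverse(transform(out))
    return difflib.SequenceMatcher(None, text, out).ratio()

# Toy "reversible" pair: uppercasing then lowercasing. The pair is lossy on
# mixed-case input, so the round trip exposes the inconsistency as a score < 1.
score = round_trip_consistency("Hello World", str.upper, str.lower)
print(round(score, 2))
```

A genuinely reversible pair (e.g. translate to French and back, refactor code and revert) would be scored the same way, with a task-appropriate similarity in place of the character-level ratio.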
Contrast Set Consistency (Bitton et al., 2021):

$$\mathrm{CCS} = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \mathbb{1}\big[\,\text{the model answers every instance in } s \text{ correctly}\,\big],$$

expressing the fraction of original-plus-perturbed question sets on which the model answers every instance correctly.
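Both the triplet-based and contrast-set scores reduce to the same group-wise all-correct computation; the data layout below is an illustrative assumption, not any dataset's release format:

```python
# Group-wise compositional consistency: a group counts only if every
# question in it (a triplet, or an original plus its perturbations)
# is answered correctly.

def group_consistency(groups, predictions, answers):
    """Fraction of question groups answered entirely correctly.

    groups      : iterable of question-id tuples
    predictions : dict mapping question id -> model answer
    answers     : dict mapping question id -> gold answer
    """
    groups = list(groups)
    hits = sum(all(predictions[q] == answers[q] for q in g) for g in groups)
    return hits / len(groups)

# Two groups; the second fails on one member, so consistency is 0.5
preds = {"q1": "red", "q2": "yes", "q3": "cube", "q4": "no"}
gold  = {"q1": "red", "q2": "yes", "q3": "cube", "q4": "yes"}
print(group_consistency([("q1", "q2"), ("q3", "q4")], preds, gold))  # 0.5
```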
2. Methodological Foundations
Compositional consistency scores are typically derived from controlled perturbations of input data or tasks that isolate specific compositional phenomena:
- Hierarchical Decomposition: In SCSSIM, images are decomposed via greedy, variance-maximizing splits into a binary tree of cuboid regions, abstracting away from object-level semantics and focusing on spatial partition structure (Haque et al., 7 Aug 2025).
- Programmatic Perturbation: In VQA, compositional contrast sets are generated by minimally perturbing functional programs or scene graphs, resulting in input-output sets where only compositional elements differ (Bitton et al., 2021, Gandhi et al., 2022).
- Multi-level Generalization: GQA-CCG and similar datasets provide triplets or DAGs of questions, explicitly compositional in their logical structure, to assess whether the model’s accuracy propagates through different granularity levels (Li et al., 2024, Gandhi et al., 2022).
- Reversible Transformation Trees: ConsistencyChecker constructs self-consistency trees using pairs of inverse operations, measuring whether repeated application and inversion of such operations yields stable function or meaning (Hong et al., 14 Jun 2025).
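The hierarchical-decomposition idea can be sketched in 1-D (the cited work splits 2-D images into cuboid regions; the 1-D reduction and all function names here are illustrative, not the reference implementation):

```python
# Greedy, variance-maximizing binary partitioning: repeatedly split the
# segment whose best split removes the most sum-of-squared-error (SSE),
# recording normalized cumulative gains -- the "gain curve" that
# structure-preservation scores compare between images.

def sse(values):
    """Sum of squared errors of a segment around its mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(segment):
    """Return (gain, index) of the split maximizing SSE reduction."""
    total = sse(segment)
    best_gain, best_idx = 0.0, None
    for i in range(1, len(segment)):
        gain = total - sse(segment[:i]) - sse(segment[i:])
        if gain > best_gain:
            best_gain, best_idx = gain, i
    return best_gain, best_idx

def gain_curve(values, n_cuts):
    """Apply the globally best split n_cuts times; return cumulative
    SSE-reduction gains normalized by the total SSE."""
    total0 = sse(values)
    segments, curve = [tuple(values)], []
    for _ in range(n_cuts):
        splits = [(best_split(s), s) for s in segments]
        (gain, idx), seg = max(splits, key=lambda t: t[0][0])
        if idx is None:  # no remaining split reduces SSE
            break
        segments.remove(seg)
        segments += [seg[:idx], seg[idx:]]
        curve.append((curve[-1] if curve else 0.0) + gain)
    return [c / total0 for c in curve] if total0 else curve

# One cut fully separates the two flat regions, so the curve saturates at 1.
print(gain_curve([0, 0, 0, 10, 10, 10], 2))  # [1.0]
```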
3. Mathematical Properties and Theoretical Guarantees
Compositional consistency scores exhibit key properties critical for robust benchmarking:
- Range and Bound: All variants produce values in $[0, 1]$, with perfect compositional generalization yielding $1$.
- Monotonicity: In SCSSIM, consistency decays monotonically under compositional distortions (rotation, scaling, cropping), while remaining invariant under non-structural changes (noise, blur). In contrast sets, failure on any perturbed variant zeros out the whole set’s consistency (Haque et al., 7 Aug 2025, Bitton et al., 2021).
- Tight Upper Bound: In multilevel compositions, the overall consistency is tightly bounded above by the minimum accuracy over all subcomponents (Li et al., 2024).
- Asymptotic Consistency: For simulation-based inference, the compositional score’s mean squared error vanishes as the number of observations grows and individual errors decrease, under natural regularity assumptions (Touron et al., 17 Oct 2025).
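The tight upper bound follows in one line from the product form of the consistency definition. Writing $a_\ell(\tau) \in \{0, 1\}$ for correctness at composition level $\ell$ (notation illustrative), for any fixed level $\ell^*$:

$$C = \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} \prod_{\ell} a_\ell(\tau) \;\le\; \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} a_{\ell^*}(\tau) = \mathrm{Acc}_{\ell^*},$$

hence $C \le \min_{\ell} \mathrm{Acc}_\ell$, with equality exactly when every instance wrong at some level is also wrong at the worst-performing level.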
4. Practical Usage and Experimental Insights
Empirical results highlight the diagnostic power of compositional consistency scores:
| Domain | Metric/Score | Key Result | Paper |
|---|---|---|---|
| Image SCS | SCSSIM | Noisy/blurred images: SCSSIM ≈ 1; rotated: 0.09, while other metrics (LPIPS, SSIM, CLIP) misorder cases | (Haque et al., 7 Aug 2025) |
| VQA (GQA-CCG) | Consistency ($C$) | MAC+MLO: 34.10% consistency, strong baseline CFR: 49.27%; state-of-the-art VLMs seldom exceed 45% | (Li et al., 2024) |
| LLM MT/code | ConsistencyChecker | GPT-4o-mini: 98.0% (MT), 76.5% (code); Qwen-2.5-32B: 96.4% (MT), 85.1% (code) | (Hong et al., 14 Jun 2025) |
| VQA contrast sets | CCS | LXMERT: 83.9% original, 67.2% perturbed, 46.1% full consistency; retraining with perturbations closes the gap | (Bitton et al., 2021) |
In all cases, compositional consistency scores reveal failures that are not captured by overall accuracy or conventional similarity metrics, particularly on out-of-distribution compositional variations.
5. Critical Comparisons and Limitations
Compositional consistency targets a more stringent regime than standard metrics:
- Robustness to Nuisance Variation: SCSSIM and tree-based LLM consistency are robust to superficial changes but sensitive to meaningful compositional structure, distinguishing them from pixel-level or embedding-based metrics (Haque et al., 7 Aug 2025, Hong et al., 14 Jun 2025).
- Diagnostic Precision: Right-for-the-Wrong-Reasons and Internal Consistency metrics in video QA explicitly expose spurious reasoning and logical contradictions (Gandhi et al., 2022).
- Coverage Constraints: Many frameworks rely on fixed compositional templates or strictly paired DAGs/triplets, limiting their extensibility to other forms of composition or unenumerated structure (Li et al., 2024, Bitton et al., 2021).
- Human Alignment Limitation: Some compositional scores operate reference-free (e.g., ConsistencyChecker), while in other domains, consistency may not align perfectly with human preference or intuition, as in the case of CLIP or FID remaining insensitive to structural flips (Haque et al., 7 Aug 2025).
6. Implementation and Best Practices
Effective use of compositional consistency scoring requires careful protocol adherence:
- Parameter Choices: For SCSSIM, 64 partitioning cuts combined with a suitably chosen decay parameter yielded robust discrimination (Haque et al., 7 Aug 2025).
- Scalability: Complexity scales linearly with both data size and structural complexity in most frameworks (SCSSIM, ConsistencyChecker).
- Implementation Guidance: Employ integral images for fast sum-of-squared-error computation, maintain priority queues for greedy partitioning, and use automated or dynamic benchmark generation to avoid information leakage (Haque et al., 7 Aug 2025, Hong et al., 14 Jun 2025).
- Usage Recommendations: For models targeting compositional robustness, combinatorial or contrast-set augmentation of training data can directly improve compositional consistency without degrading in-distribution accuracy (Bitton et al., 2021).
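The integral-image trick for fast sum-of-squared-error computation mentioned above can be sketched as follows (the array layout and helper names are assumptions for illustration):

```python
# Constant-time region SSE via integral images: with prefix sums of x and
# x^2, any rectangle's SSE is  sum(x^2) - (sum x)^2 / n  in O(1) lookups.
import numpy as np

def build_integrals(img):
    """Zero-padded 2-D prefix sums of pixel values and squared values."""
    s  = np.pad(img.astype(np.float64), ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    s2 = np.pad(img.astype(np.float64) ** 2, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    return s, s2

def region_sse(s, s2, r0, c0, r1, c1):
    """SSE of img[r0:r1, c0:c1] from the two integral images."""
    def box(t):  # inclusion-exclusion over the padded prefix-sum table
        return t[r1, c1] - t[r0, c1] - t[r1, c0] + t[r0, c0]
    n = (r1 - r0) * (c1 - c0)
    sx, sx2 = box(s), box(s2)
    return sx2 - sx * sx / n

img = np.array([[1.0, 2.0], [3.0, 4.0]])
s, s2 = build_integrals(img)
print(region_sse(s, s2, 0, 0, 2, 2))  # 5.0, i.e. 4 * var([1, 2, 3, 4])
```

Building the two tables once makes the greedy partitioner's inner loop independent of region size, which is what gives the near-linear overall scaling noted above.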
7. Future Directions and Extensions
Current compositional consistency frameworks invite further refinement and generalization:
- Extension to Arbitrary Composition Levels: Moving beyond triplets or shallow DAGs to n-level or tree-based evaluations would improve granularity (Li et al., 2024, Hong et al., 14 Jun 2025).
- Flexible Partial Consistency Metrics: Introducing soft penalties or weighting by composition depth may better capture nuanced failures and successes.
- Domain Generalization: Porting methodology to novel domains—scientific texts, equation parsing, graph reasoning—would expand diagnostic reach.
- Integration with Human Judgments: Blending compositional consistency with human-in-the-loop or adversarial assessment could bridge the gap between structural robustness and practical applicability.
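As one illustration of the flexible-partial-consistency direction, a depth-weighted soft score might look like the following (the weights and data layout are hypothetical, not taken from any cited paper):

```python
# Soft compositional consistency: instead of all-or-nothing credit per
# group, weight correctness at each composition level by its depth, so
# failures on deeper compositions are penalized more than shallow ones.

def soft_consistency(level_correct, weights):
    """Weighted fraction of correct levels, averaged over instances.

    level_correct : list of per-instance 0/1 tuples, ordered from the
                    shallowest to the deepest composition level
    weights       : per-level weights, e.g. increasing with depth
    """
    total = sum(weights)
    scores = [sum(w * c for w, c in zip(weights, inst)) / total
              for inst in level_correct]
    return sum(scores) / len(scores)

# First instance fails only the deepest level; strict consistency would
# score this pair 0.5, the soft variant scores it higher.
print(soft_consistency([(1, 1, 0), (1, 1, 1)], [1, 2, 3]))  # 0.75
```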
Compositional consistency scores fundamentally shift evaluation from surface-level similarity toward principled measurement of systemic, structural robustness, offering indispensable tools for benchmarking, diagnosing, and improving models in vision, language, and probabilistic inference frameworks.