Control Quality Score (CQS) Overview

Updated 2 July 2026

Control Quality Score (CQS) is a family of formalized metrics designed to evaluate and steer generative models by quantifying trade-offs across axes like completeness, conciseness, and faithfulness.
It encompasses domain-specific instantiations for summarization, source code review, and multimodal generation, each integrating specialized loss functions or rule-based filters.
Practical implementations of CQS improve model controllability and benchmarking accuracy, achieving strong alignment with human evaluations across diverse applications.

Control Quality Score (CQS) refers to a family of formalized metrics and control objectives designed to evaluate and steer generative models with respect to multi-dimensional quality criteria. CQS mechanisms appear in natural language generation, source code analysis, and multimodal tasks, each providing either an actionable objective for optimization or an interpretable scalar for benchmarking quality across competing demands, such as completeness, conciseness, faithfulness, code robustness, or consistency in identity preservation. The term encompasses three notable instantiations: (1) the control-oriented ranking objective for summary generation (Liu et al., 19 Apr 2026), (2) the Code Quality Score framework for large codebase review (Wong et al., 1 Aug 2025), and (3) the Consistency Quality Score for evaluating identity-consistent text-to-image generation (Kim et al., 29 Dec 2025). While sharing a unifying motivation—quantification and control of trade-offs across competing quality axes—each CQS implementation is domain-adapted in design and computation.

1. Control Quality Score for Summarization

The Control Quality Score (CQS) in summarization is a training objective introduced to align LLM output distributions with fine-grained, model-based evaluation scores across mutually competing criteria, most notably completeness, conciseness, and faithfulness (Liu et al., 19 Apr 2026). Its central goal is to allow inference-time control through prompt-based steering, permitting users to prioritize or balance these quality axes while preserving overall summary quality.

CQS surpasses traditional maximum likelihood or aggregate-score-based fine-tuning by incorporating three specialized loss terms—margin ranking (L_MR), max-score loss (L_MS), and control-oriented loss (L_CO)—that collectively calibrate a model's generation ordering, reward the best achievable candidate, and directly enforce desired trade-offs. The scheme scales more robustly than RL/PPO-based preference optimization and does not require a reference policy. Empirical results show that models fine-tuned with CQS demonstrate enhanced Spearman correlation between model scores and human-validated quality rankings, stronger axis-specific controllability, and competitive overall summary quality. Human agreement on controlled outputs approaches 0.85 accuracy (pairwise judgments) and Spearman ρ ≈ 0.72 (Liu et al., 19 Apr 2026).

2. Mathematical Formulation and Training Algorithm (Summarization)

Given a source document $X$, a control prompt $Z$ ("prioritize completeness", "prioritize conciseness", or "balance"), and a candidate summary $\tilde{Y} = (y_1, ..., y_N)$, the model computes the length-normalized log-likelihood of the summary:

$s_\theta(X, Z; \tilde{Y}) = \frac{1}{N} \sum_{i=1}^N \log p_\theta(y_i \mid y_{<i}, X, Z)$
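
As a minimal sketch of this score, the snippet below averages per-token log-probabilities. The function name and the probability values are hypothetical and chosen only for illustration; in practice the probabilities would come from the model conditioned on $X$ and $Z$.

```python
import math

def length_normalized_log_likelihood(token_probs):
    """Mean log-probability of the summary tokens, i.e. s_theta(X, Z; Y)."""
    if not token_probs:
        raise ValueError("candidate summary must contain at least one token")
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical values of p_theta(y_i | y_<i, X, Z) for two candidates.
short_candidate = [0.5, 0.5]
long_candidate = [0.5, 0.5, 0.5, 0.5]

score_short = length_normalized_log_likelihood(short_candidate)
score_long = length_normalized_log_likelihood(long_candidate)
```

Because of the $1/N$ factor, the two candidates receive the same score even though the longer one has a lower total log-likelihood, so length alone is not penalized.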

A pool of $K$ candidate summaries $\{\tilde{Y}_k\}$ is generated, each receiving FineSurE-based scores $S_{com}(\tilde{Y}_k)$ (completeness), $S_{con}(\tilde{Y}_k)$ (conciseness), and the combined score $S_{sum}(\tilde{Y}_k) = S_{com}(\tilde{Y}_k) + S_{con}(\tilde{Y}_k)$.

The total training objective is:

$L_{total} = L_{MR} + \gamma \cdot L_{MS} + \beta \cdot L_{CO}$

Margin Ranking Loss ($L_{MR}$): Encourages generation orderings to match quality orderings under scenario-specific penalties.
Max-Score Loss ($L_{MS}$): Drives the model to surpass the best candidate in the pool.
Control-Oriented Loss ($L_{CO}$): Directly enforces the target trade-off ratio between completeness and conciseness specified by the prompt.

A scenario-dependent penalty adjusts the candidate scores used for ranking, so the model distinguishes between trade-offs even for identical aggregate scores. Training proceeds via LoRA adapter-based parameter updates, pooling candidate references from past and online samples; at inference, the required control prompt is applied (Liu et al., 19 Apr 2026).
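
The weighted sum $L_{total}$ can be illustrated with a short sketch. The exact forms of the three terms are not given here, so the pairwise hinge form of the margin ranking term below is an assumption (a common choice for ranking losses), and $L_{MS}$ and $L_{CO}$ are passed in as precomputed numbers. All function names, scores, and weights are hypothetical.

```python
def margin_ranking_loss(model_scores, quality_scores, margin=0.01):
    """Pairwise hinge loss (assumed form): a higher-quality candidate should
    receive a higher model score, by a margin that grows with the rank gap."""
    # Candidate indices ordered from highest to lowest quality.
    order = sorted(range(len(quality_scores)),
                   key=lambda k: quality_scores[k], reverse=True)
    loss = 0.0
    for rank_i, i in enumerate(order):
        for rank_j in range(rank_i + 1, len(order)):
            j = order[rank_j]
            if quality_scores[i] == quality_scores[j]:
                continue  # tied quality: no ordering to enforce
            gap = margin * (rank_j - rank_i)
            loss += max(0.0, model_scores[j] - model_scores[i] + gap)
    return loss

def total_loss(l_mr, l_ms, l_co, gamma, beta):
    """L_total = L_MR + gamma * L_MS + beta * L_CO."""
    return l_mr + gamma * l_ms + beta * l_co

# Hypothetical pool of three candidates with quality scores S_sum.
quality = [1.2, 1.8, 1.5]
well_ordered = [-0.9, -0.5, -0.7]   # model scores agree with quality ranking
mis_ordered = [-0.5, -0.9, -0.7]    # model prefers the worst candidate

l_mr_good = margin_ranking_loss(well_ordered, quality)
l_mr_bad = margin_ranking_loss(mis_ordered, quality)
combined = total_loss(l_mr_bad, l_ms=0.2, l_co=0.1, gamma=0.5, beta=2.0)
```

In this toy pool the ranking term vanishes when the model's ordering matches the quality ordering and becomes positive when it does not, which is the calibration behaviour attributed to $L_{MR}$.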

3. Code Quality Score for Large-Scale Code Review

The Code Quality Score (CQS) system is an end-to-end pipeline for automated issue detection in code changes at scale, based on dual-stage LLM evaluation combined with hand-crafted rules (Wong et al., 1 Aug 2025). Given a code diff, CQS comprises:

f₁ (issue collector): Generates candidate issues via a supervised fine-tuned and DPO-optimized LLaMA3 model.
f₂ (issue validator): Applies a second LLaMA3 LLM-judge, fine-tuned on code review critiques, to validate each candidate issue.
Hand-crafted filter set: Applies tag- and language-specific code rules, such as minimum thresholds or context verification.

The final output is the set of candidate issues produced by f₁ that are accepted by f₂ and pass every applicable filter.

CQS thereby integrates learning-based and domain rule-based filtering, achieving strong industry-scale precision (78.2%) at the expense of recall (1.2%), and is backed by empirical developer helpfulness rates (60%) in production (Wong et al., 1 Aug 2025).
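
The collect, validate, and filter stages can be sketched as a simple composition. This is only an illustration of the control flow: the `Issue` type, the function names, and the toy stand-ins for the two LLM stages and the rules are all hypothetical and do not reflect the production system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Issue:
    tag: str
    message: str

def run_pipeline(diff, collect, validate, filters):
    """Keep candidate issues that the validator accepts and every rule passes."""
    candidates = collect(diff)                                 # stage f1
    validated = [c for c in candidates if validate(diff, c)]   # stage f2
    return [c for c in validated
            if all(rule(diff, c) for rule in filters)]         # rule filters

# Toy stand-ins (hypothetical) for the two LLM stages and the rule set.
def toy_collector(diff):
    return [Issue("naming", "variable name is unclear"),
            Issue("bug", "possible None dereference"),
            Issue("style", "line too long")]

def toy_validator(diff, issue):
    return issue.tag != "style"      # the judge rejects the style nit

def min_diff_length(diff, issue):
    return len(diff) >= 10           # example of a minimum-threshold rule

def bug_needs_context(diff, issue):
    return issue.tag != "bug" or "None" in diff   # example of context verification

diff_text = "x = get()\nprint(x.value)"
kept = run_pipeline(diff_text, toy_collector, toy_validator,
                    [min_diff_length, bug_needs_context])
```

Each stage can only remove candidates, never add them, which is consistent with a design that favours precision over recall.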

4. Consistency Quality Score for Identity-Consistent Generation

The Consistency Quality Score (CQS) in multimodal generation is a unified, training-free metric for evaluating both per-image prompt alignment and inter-image identity consistency (Kim et al., 29 Dec 2025). It targets the challenge of generating image sequences that maintain character or object coherence while respecting per-image descriptions. CQS combines (i) per-image VQA-based alignment scores and (ii) DreamSim-driven identity similarity scores into a single scalar via a harmonic mean, augmented with explicit imbalance penalty terms.

Formally, for a sequence of generated images, the following quantities are computed:

Combined-prompt alignment: Alignment to the combined prompt via VQA.
Second alignment score: Alignment to a prompt other than the combined prompt, also via VQA.
Identity similarity: Pairwise mean identity similarity (DreamSim, post-transformation and rescaling).

An alignment gap is used for penalty/reward computation, producing adjusted identity scores. The per-image CQS score is the harmonic mean of the alignment score and the adjusted identity score, and the overall CQS combines the per-image scores across the sequence.

This structure enforces that high scores are contingent on joint strength and balance along both metrics, discouraging one-sided optimization. In the ASemConsist benchmark, superior CQS values reflected improved model balancing over baselines (Kim et al., 29 Dec 2025).
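
The effect of the harmonic mean on imbalanced scores can be seen in a small sketch. It omits the penalty and reward adjustment, takes alignment and identity scores as given numbers in [0, 1], and assumes the sequence-level score is a plain average of per-image scores; the function names and values are hypothetical.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two scores; zero if either score is non-positive."""
    if a <= 0.0 or b <= 0.0:
        return 0.0
    return 2.0 * a * b / (a + b)

def sequence_score(alignment_scores, identity_scores):
    """Average of per-image harmonic means (the averaging step is an assumption)."""
    per_image = [harmonic_mean(a, s)
                 for a, s in zip(alignment_scores, identity_scores)]
    return sum(per_image) / len(per_image)

# Two images with the same arithmetic mean (0.5) of alignment and identity.
balanced = harmonic_mean(0.5, 0.5)    # both axes moderate
lopsided = harmonic_mean(0.9, 0.1)    # strong alignment, weak identity

overall = sequence_score([0.5, 0.9], [0.5, 0.1])
```

Here the balanced image scores 0.5 while the lopsided one scores 0.18, although both have an arithmetic mean of 0.5. This is the sense in which the harmonic mean discourages one-sided optimization.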

5. Comparative Insights Across Domains

Variant	Domain	Controlling/Scoring Axes
Summarization CQS	Text generation	Completeness, conciseness, faithfulness
Code Quality Score	Source code	Code style, correctness, best practices
Consistency Quality Score	Multimodal (image)	Prompt alignment, identity consistency

Each CQS instantiation explicitly models or penalizes trade-offs, combines its quality axes through loss terms, a harmonic mean, or rules, and is designed to produce interpretable, actionable output. In the summarization and multimodal settings, CQS is sensitive to imbalances by design and offers direct axis-specific control; in source code, precision is prioritized through a hybrid learned and rule-based framework.

6. Limitations and Considerations

Across domains, all CQS mechanisms inherit certain limitations:

Dependency on Sub-Models: For text, image, or code, the reliability of the CQS hinges on the accuracy and calibration of included scorers (e.g., FineSurE, LLM-judge, DreamSim, VQA), which may propagate bias or domain-mismatch.
Hyperparameter Sensitivity: Margin widths, penalty/reward strengths, and filter thresholds introduce degrees of freedom requiring careful cross-validation or calibration.
Task-Specific Applicability: While designed for generality across model types, each CQS variant is tailored to distinct output modalities and quality trade-offs, potentially requiring adaptation—especially where semantic axes are ill-defined.

A plausible implication is that future CQS methodologies would benefit from advances in robust sub-model evaluation, principled hyperparameter selection, and broader task generalization, so that they remain interpretable and actionable across evolving domains.