Learning to Control Summaries with Score Ranking

Published 19 Apr 2026 in cs.CL | (2604.17197v1)

Abstract: Recent advances in summarization research focus on improving summary quality across multiple criteria, such as completeness, conciseness, and faithfulness, by jointly optimizing these dimensions. However, these efforts largely overlook the challenge of controlling summary generation with respect to individual criteria, especially in the presence of their inherent trade-offs. For example, enhancing conciseness can compromise completeness, and vice versa. In this work, we address this gap by proposing a loss function that aligns model outputs with fine-grained, model-based evaluation scores (e.g., from FineSurE), enabling both improvement in summary quality and dimension-specific control. Our approach improves the overall quality of summaries while maintaining the ability to selectively prioritize one criterion over others. Experiments on three pretrained models (LLaMA, Qwen, and Mistral) demonstrate that our method achieves performance comparable to state-of-the-art summarizers, while uniquely offering strong controllability over individual quality dimensions.

Authors (3)

Summary

The paper introduces a multi-dimensional optimization framework that adjusts the trade-off between completeness and conciseness in summary generation.
It employs a composite loss function—combining margin ranking, maximum scoring, and control-oriented losses—to align model likelihoods with quality scores.
Empirical results across various domains validate robust controllability and high ranking alignment, confirming the framework's effectiveness.

Score-Ranked, Controllable Summarization via Fine-Grained Multi-Dimensional Optimization

Motivation and Research Problem

The automatic generation of concise yet complete summaries from text is a central problem in sequence transduction with broad applications. Traditional neural summarization approaches have been optimized against proxies such as token-level likelihoods or $n$-gram-based metrics, which are known to align poorly with human judgments of semantic fidelity, faithfulness, and balance. Recent advances in model-based evaluation, specifically fine-grained metrics such as completeness and conciseness from LLM-based evaluators (e.g., FineSurE), afford more nuanced assessment but have yet to be fully leveraged to control generation along specific summary dimensions.

This work directly addresses the challenge of multidimensional control in summarization: adjusting the trade-off between completeness and conciseness at generation time, given that improving one tends to degrade the other. The authors define a learning framework in which a single model, steered by prompts, produces summaries that prioritize completeness, conciseness, or a balance of the two.

Figure 1: Summaries of the abstract prioritizing completeness, conciseness, and balance.

Methodology and Learning Objective

The technical framework reframes summary generation as a problem of aligning conditional log-likelihoods with black-box model-based scores for multiple quality dimensions. Model control occurs via a prompt variable $Z$ so that $p_\theta(Y|X,Z)$ is modulated according to the desired summary trait.
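
For the autoregressive backbones considered here, the conditional log-likelihood being aligned can be written token by token. The block below is a standard factorization, not a formula quoted from the paper; the symbols $y_t$ and $y_{<t}$ are introduced here for illustration, and whether the paper applies length normalization is not stated in this summary.

```latex
\log p_\theta(Y \mid X, Z) = \sum_{t=1}^{|Y|} \log p_\theta\left(y_t \mid y_{<t}, X, Z\right)
```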

The central loss is composed of three parts:

Margin Ranking Loss (MR): Enforces consistency between the model’s likelihood ranking and the ranking by model-based scores (e.g., summaries rated as higher quality by FineSurE should have higher log-probabilities). Pairwise margin-penalties encourage robust separation.
Maximum Scoring Loss (MS): Drives the generator to produce summaries whose model-based score matches or exceeds the best among a set of diverse candidates.
Control-Oriented Loss (CO): Explicitly leverages the ratio of completeness to conciseness (or other metric pairs) in the objective. During training, the ratio is shaped by the control prompt to increase, decrease, or equilibrate, depending on whether completeness, conciseness, or balance is the instructed priority.
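
The margin ranking idea can be sketched as follows. This is a minimal illustration that assumes a common pairwise formulation with a rank-scaled margin; the function name `margin_ranking_loss`, the `margin` value, and the rank-scaled margin itself are assumptions, not the paper's exact definition. The MS and CO terms are not sketched because this summary does not give their functional form.

```python
from typing import Sequence


def margin_ranking_loss(
    log_probs: Sequence[float],
    scores: Sequence[float],
    margin: float = 0.01,  # hypothetical value, not from the paper
) -> float:
    """Pairwise hinge loss: a higher-scored candidate should be more likely.

    log_probs[i] is the model log-likelihood of candidate i;
    scores[i] is its evaluator score (e.g., from a FineSurE-style scorer).
    """
    if len(log_probs) != len(scores):
        raise ValueError("log_probs and scores must have the same length")

    # Sort candidate indices by evaluator score, best first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [log_probs[i] for i in order]

    loss = 0.0
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            # Candidate i outranks candidate j, so its log-probability should
            # exceed candidate j's by at least margin * (rank gap).
            loss += max(0.0, ranked[j] - ranked[i] + margin * (j - i))
    return loss


# Likelihoods already agree with the score ranking, so no penalty is incurred.
aligned = margin_ranking_loss([-1.0, -2.0, -3.0], [0.9, 0.5, 0.1], margin=0.1)
# Likelihoods are in the reverse order of the scores, so every pair is penalized.
reversed_order = margin_ranking_loss([-3.0, -2.0, -1.0], [0.9, 0.5, 0.1], margin=0.1)
```

In an actual training loop the same computation would be written over differentiable tensors so that gradients flow into the log-likelihoods; the plain-Python version here only shows the arithmetic.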

The overall loss, $\mathcal{L}_{\text{Total}}$, is a weighted sum of these objectives, with hyperparameters tuned via cross-validation. Training operates in a LoRA fine-tuning regime on open-source backbones (LLaMA, Qwen, Mistral), incorporating candidate diversity via reference summary sampling and nucleus-sampled predictions.
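
Written out, the weighted sum would take a form like the one below. The weight symbols $\lambda_{\text{MR}}$, $\lambda_{\text{MS}}$, $\lambda_{\text{CO}}$ and the per-component names are assumed notation for illustration; this summary does not give the paper's symbols or the tuned values.

```latex
\mathcal{L}_{\text{Total}}
  = \lambda_{\text{MR}}\,\mathcal{L}_{\text{MR}}
  + \lambda_{\text{MS}}\,\mathcal{L}_{\text{MS}}
  + \lambda_{\text{CO}}\,\mathcal{L}_{\text{CO}}
```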

Figure 2: Model architecture and loss functions. Losses computed over $K$ candidate summaries generated according to specified prompt/control type.

Experimental Results

Quality–Control Trade-off

Empirical validation is conducted on a suite of in-domain (WikiHow, CNN/DM, DialogSum) and out-of-domain (OpoSum, MeQSum) datasets. The model is evaluated with the harmonic mean of FineSurE completeness and conciseness ($\text{HM}(\tilde{Y})$), the control ratio ($\text{R}(\tilde{Y})$), and the correlation between log-likelihoods and model-based evaluation scores.
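
The two per-summary metrics can be sketched as below. This assumes completeness and conciseness are scores in [0, 1] and that the control ratio is completeness divided by conciseness, as the control-oriented loss description suggests. The function names and the `eps` guard against division by zero are illustrative additions, not from the paper.

```python
def harmonic_mean(completeness: float, conciseness: float) -> float:
    """HM of the two scores; defined as 0.0 when both scores are zero."""
    total = completeness + conciseness
    if total == 0:
        return 0.0
    return 2.0 * completeness * conciseness / total


def control_ratio(completeness: float, conciseness: float, eps: float = 1e-8) -> float:
    """Completeness-to-conciseness ratio; eps (an assumption) avoids dividing by zero."""
    return completeness / (conciseness + eps)


# A summary that favors completeness over conciseness.
hm = harmonic_mean(0.8, 0.4)   # about 0.533
ratio = control_ratio(0.8, 0.4)  # about 2.0, i.e., completeness-leaning
```

Under this reading, a ratio above 1 indicates a completeness-leaning summary, a ratio below 1 a conciseness-leaning one, and a ratio near 1 a balanced one.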

Key findings include:

Strong control: Fine-tuned models (denoted with *) shift $\text{R}(\tilde{Y})$ in the direction prescribed by the prompt (completeness, conciseness, or balance) while maintaining high quality, unlike both baselines and prior prompt-optimized models (e.g., SummLLaMA).
Ranking alignment: The Spearman correlation between model likelihood and evaluation rank improves substantially for fine-tuned models (e.g., Qwen*: 0.22, LLaMA*: 0.15), indicating that the likelihood ranking of candidates tracks quality scores more closely after fine-tuning.
Multi-domain robustness: Out-of-domain evaluations indicate consistent behavior. Commercial LLMs occasionally hold an edge in faithfulness, but fine-tuned open models match or exceed them in controllability and harmonic-mean quality.
Figure 3: Model performance across control settings. Each point: mean $\text{HM}(\tilde{Y})$ and $\text{R}(\tilde{Y})$ on test cases, partitioned by control priority and model.
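
The ranking-alignment measurement above can be illustrated with a small, dependency-free Spearman correlation between candidate log-likelihoods and evaluator scores. This is a sketch of the standard statistic (Pearson correlation of average ranks), not the paper's evaluation code; the function names are illustrative.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / (vx * vy) ** 0.5


# Log-likelihoods that order three candidates exactly as the evaluator does.
rho = spearman([-1.0, -2.0, -3.0], [0.9, 0.5, 0.1])
```

A value near 1 means the model's likelihood ordering of candidates agrees with the evaluator's ordering; values near 0 mean the two orderings are largely unrelated.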

Qualitative Analysis and Distributions

Distributions over quality–control axes visualize the trade-offs and controllability:

Figure 4: Distributions over the quality–control axes, visualizing density and bias per control setting.

Contour plots and score distributions show that, under each control setting, the fine-tuned models' outputs cluster in regions of high summary quality and of the intended control ratio, which prior baselines do not achieve.

Figure 5: Distribution of Spearman correlations between model likelihood and model-based scores for models sorted by median.

Ablation and Human Alignment

Ablation experiments confirm that all objective components (MR, MS, CO) are necessary to achieve both high ranking alignment and controllability. Human studies further indicate that the model’s controlled outputs are robustly aligned with annotator judgments, achieving average annotation alignment accuracy of 0.85 and Spearman correlation of 0.72.

Domain Generalization

Scatter plots across domains (DialogSum, CNN/DM, WikiHow) and across control settings demonstrate the approach's consistency, with fine-tuned open models approaching the quality and control attainable by black-box commercial APIs. Out-of-domain performance also remains strong, reinforcing the modularity and generality of the score-alignment method.

Figure 6: Scatter plot of model performance on out-of-domain data (MeQSum; OpoSum), showing cross-domain generalization under different controls.

Implications and Future Directions

The work formalizes controllable, multidimensional summary generation via scalable objectives that move beyond both likelihood-based training and RL-fitted reward models. Unlike RLHF, which requires policy-reference pairs (as in PPO, GRPO), this approach leverages only the black-box scorer and diverse candidate sets, yielding substantial resource savings and facilitating straightforward application to a broad class of controllable generation tasks.

Potential axes for future research include:

Integration of additional fine-grained evaluator signals (e.g., style, coherence, relevance, domain-specific preferences).
Extension to scenarios with more than two simultaneously controlled dimensions.
Application to high-latency, high-cost evaluators (e.g., human-in-the-loop, multi-model committees).
Investigation of interaction between scorer quality, candidate diversity, and controllable range in emergent models.

Conclusion

This paper establishes a practical, generalizable method for controlling summary generation across multiple quality dimensions by aligning likelihoods with black-box, model-based evaluators. Evidence is provided for control efficacy, improved summary ranking, and high-quality generation both in- and out-of-domain. The framework’s modularity, efficiency over RL approaches, and empirical strength indicate a robust foundation for future research on fine-grained controllable text generation.
