Attentive Feedback Summarizer (AFS)
- The paper introduces AFS as a novel summarization family that uses real-time user feedback to iteratively refine output quality and content relevance.
- AFS employs diverse architectures—including ILP optimization, transformer modules, and online learning—to dynamically adapt both extractive and abstractive summaries.
- Empirical results show that AFS outperforms non-personalized baselines across metrics like ROUGE, MRR, and HAREScore in text, vision-language, and interactive reading tasks.
The Attentive Feedback Summarizer (AFS) designates a family of summarization methodologies that incorporate user feedback in real time, personalizing extractive or abstractive outputs according to explicit ratings, interactive refinements, or implicit behavioral signals. AFS instantiations span concept-based extractive systems, multimodal transformer modules for vision-language retrieval, iterative LLM finetuning using human textual feedback, and incremental preference-driven document filtering during reading. These approaches leverage iterative feedback loops—soliciting accept/reject actions, importance scores, or improvement instructions—to dynamically adapt summarization, maximizing coverage of user-relevant content within budget or quality constraints.
1. Architectures and Feedback Modalities
AFS spans a spectrum of architectures:
- In personalized extractive summarization, such as Adaptive Summaries, documents are segmented into sentences comprising surface concepts (unigrams, bigrams, named entities, or whole sentences). Candidate concepts are presented to users, who assign binary actions (accept/reject), weighted importance, and confidence scores. These ratings update integer weights for concept inclusion, driving a global sentence selection via integer linear programming (ILP) under a summary-length constraint (Ghodratnama et al., 2020).
- In multimodal retrieval, AFS is realized as a compact transformer-based module: cross-attention is performed between a tokenized query and patch/token-level embeddings from top-ranked images and their synthetic captions. The model fuses visual and textual relevance at fine granularity, extracting a refined feedback embedding for use in query vector updates (Khaertdinov et al., 21 Nov 2025).
- The ILF paradigm reframes AFS as an iterative refinement loop: human users supply free-form feedback on initial LLM outputs, and LMs are conditioned on the original context, initial summary, and feedback to propose candidate refinements. A reward model ranks candidates; top-scoring refinements are used to further fine-tune the summarizer (Scheurer et al., 2023).
- In HARE-style frameworks, minimally invasive feedback (binary swipes, dwell/gaze signals) is captured at sentence granularity during active reading. Per-sentence relevance is updated dynamically, resulting in real-time adaptation that preserves reading flow (Bohn et al., 2021).
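The concept-rating loop underlying the extractive variant can be sketched as follows. This is a minimal illustration with hypothetical names and an additive weight update; Adaptive Summaries itself re-solves a full ILP per round rather than using this update rule.

```python
# Hypothetical sketch of one round of concept-level AFS feedback.
# Each feedback entry maps concept -> (action, importance, confidence),
# where action is +1 (accept) or -1 (reject).

def update_weights(weights, feedback, lr=1.0):
    """Apply one round of user feedback to per-concept weights."""
    for concept, (action, importance, confidence) in feedback.items():
        # Unseen concepts start from a small initialization weight.
        prior = weights.get(concept, 0.1)
        weights[concept] = prior + lr * action * importance * confidence
    return weights

weights = {}
round1 = {"climate": (+1, 0.9, 0.8),   # accepted, important, fairly confident
          "sports": (-1, 0.7, 1.0)}    # rejected with full confidence
weights = update_weights(weights, round1)
```

Subsequent rounds reuse the same weights, so accepted concepts accumulate influence over sentence selection while rejected ones are suppressed.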
2. Core Mathematical and Algorithmic Models
An AFS typically operationalizes feedback-driven optimization:
- Concept-based extractive AFS employs a binary selection vector. Each concept $i$ is assigned a user weight $w_i$, confidence $c_i$, and action $a_i \in \{-1, +1\}$ (reject/accept). The ILP maximizes the total confidence-weighted, user-rated concept score subject to a summary-length budget $L$:
$$\max_{\mathbf{x}, \mathbf{y}} \;\sum_i a_i\, c_i\, w_i\, x_i \quad \text{s.t.} \quad \sum_j \ell_j\, y_j \le L,$$
where $x_i$ indicates inclusion of concept $i$, $y_j$ inclusion of sentence $j$ of length $\ell_j$, with consistency constraints linking concepts to the sentences containing them. Concepts not yet encountered retain their initialization weights (e.g., a small $\epsilon$), updated after each feedback round (Ghodratnama et al., 2020).
- Transformer-based AFS for vision-language retrieval operates over query and relevance sequences. Cross-attention and self-attention blocks output an updated CLS embedding; a linear projector maps it to a feedback vector $f^{+}$, which is combined with the original query $q$ and a negative feedback vector $f^{-}$ derived from cross-attention weights:
$$q' = q + \alpha\, f^{+} - \beta\, f^{-}.$$
The parameters $\alpha$ and $\beta$ balance signal contributions; the negative component counters query drift (Khaertdinov et al., 21 Nov 2025).
- ILF-based AFS models minimize the KL divergence between the "ideal" feedback-incorporated summary distribution $p^{*}(\cdot \mid x)$ and the LM policy $\pi_\theta(\cdot \mid x)$:
$$\min_\theta \; D_{\mathrm{KL}}\!\left(p^{*}(\cdot \mid x)\,\Vert\, \pi_\theta(\cdot \mid x)\right).$$
Candidate refinements are scored by a reward model; only the highest-scoring refinement is added to the supervised dataset for LM training. Iteration increases alignment with human preferences (Scheurer et al., 2023).
- HARE-type AFS models compute the relevance of a sentence with embedding $e_s$ as $\mathrm{rel}(s) = \sum_k w_k \,\mathrm{sim}(e_s, c_k)$ for user interest centers $c_k$ and weights $w_k$. Skipping and showing decisions update coverage adaptively, with real-time feedback updating either logistic-regression weights or concept-center scores (Bohn et al., 2021).
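The ILP objective above can be approximated with a greedy budgeted-coverage heuristic. This is a sketch under assumed data structures (sentences as (word-count, concept-set) pairs), not the solver-based implementation the cited systems use; it scores each candidate by uncovered-concept gain per word.

```python
def select_sentences(sentences, scores, budget):
    """Greedily pick sentences maximizing covered concept score per word,
    without double-counting concepts (an approximation of the ILP).

    sentences: list of (word_count, set_of_concepts)
    scores: concept -> feedback-derived score (e.g., a*c*w)
    budget: maximum total words in the summary
    """
    covered, chosen, used = set(), [], 0
    remaining = list(range(len(sentences)))
    while True:
        best, best_gain = None, 0.0
        for i in remaining:
            words, concepts = sentences[i]
            if used + words > budget:
                continue  # would exceed the length budget
            gain = sum(scores[c] for c in concepts - covered)
            if gain / words > best_gain:
                best, best_gain = i, gain / words
        if best is None:
            break  # nothing fits or nothing adds score
        words, concepts = sentences[best]
        chosen.append(best)
        covered |= concepts
        used += words
        remaining.remove(best)
    return chosen
```

Greedy selection for budgeted coverage carries a standard (1 - 1/e)-style approximation flavor, which is why it is a common fallback when an ILP solver is unavailable.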
3. Low-Latency and Interactive Design Principles
Responsiveness is essential for usability:
- AFS implementations often restrict user queries to 3–5 items per iteration (via heuristic candidate selection) and cache initial sentence/embedding rankings to limit recomputation.
- ILP-based AFS leverages commercial solvers (Gurobi, CPLEX), achieving sub-second (<1 s) solve times on documents of 50–100 sentences.
- Transformer feedback modules are lightweight (2–25M parameters), yielding <0.3 s per inference per query-image batch on modern accelerators (Ghodratnama et al., 2020, Khaertdinov et al., 21 Nov 2025).
- HARE-style AFS relies on simple heuristics (skip neighbor/similar sentences), with per-update times ∼1 ms; online learning preference models perform single-step feature updates (Bohn et al., 2021).
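The single-step preference updates mentioned above can be illustrated with one logistic-regression SGD step. The feature encoding here is hypothetical; the actual HARE feature set is not specified in this summary.

```python
import math

def sgd_step(w, features, label, lr=0.1):
    """One logistic-regression SGD step on a (features, relevance-label)
    pair, as a HARE-style per-sentence preference update."""
    z = sum(wi * xi for wi, xi in zip(w, features))
    p = 1.0 / (1.0 + math.exp(-z))          # predicted relevance
    return [wi + lr * (label - p) * xi      # gradient of log-likelihood
            for wi, xi in zip(w, features)]

w = [0.0, 0.0]
w = sgd_step(w, [1.0, 0.5], 1)  # user kept (swiped to show) the sentence
```

Because each update touches only one feature vector, the cost is a handful of multiply-adds, consistent with the ~1 ms per-update figure.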
4. Training Protocols and Evaluation Metrics
AFS systems deploy simulated or real user feedback for training and benchmark against reference baselines:
- Extractive systems use adaptive dictionaries or gold-standard summaries as simulated oracle feedback. ROUGE-1, ROUGE-2, and ROUGE-L F1-scores are reported, with AFS outperforming LEAD-3, HSSAS, and BanditSum on CNN/DailyMail and DUC-2002 (Ghodratnama et al., 2020).
- Transformer AFS modules apply cosine similarity loss to align output feedback vectors with true image/caption embeddings. Performance is quantified via mean reciprocal rank (MRR@5) on retrieval tasks, demonstrating 3–5% improvements for small VLMs and 1–3% for large ones compared to classical relevance feedback (Khaertdinov et al., 21 Nov 2025).
- ILF training collects multiple candidate refinements per (context, initial summary, feedback) triplet; the candidate that best incorporates the feedback is used for LM fine-tuning. Human win rates (side-by-side ranking against human references) indicate ILF outperforms supervised finetuning on human summaries (31.3% vs 28.9%), with best-of-N selection models achieving approximately human-level performance (Scheurer et al., 2023).
- HARE-style AFS evaluates summaries with HAREScore, an unsupervised metric aggregating user interest model coverage over shown sentences. Automated and human experiments confirm simple adaptive heuristics achieve measurable gains in personal relevance without disrupting reading fluency (Bohn et al., 2021).
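MRR@5, used for the retrieval evaluations above, is the mean over queries of the reciprocal rank of the first relevant item within the top 5. A minimal reference implementation (illustrative names):

```python
def mrr_at_5(ranked_lists, relevant):
    """Mean reciprocal rank truncated at depth 5.

    ranked_lists: per-query ranked item lists
    relevant: per-query relevant item (zero credit if absent from top 5)
    """
    total = 0.0
    for ranking, rel in zip(ranked_lists, relevant):
        for rank, item in enumerate(ranking[:5], start=1):
            if item == rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```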
5. Empirical Results and Comparative Performance
Table: Selected AFS Models and Empirical Metrics
| System | Task/Modality | Key Metric(s) | Experimental Results |
|---|---|---|---|
| Adaptive Summaries | Extractive text | ROUGE-1/2/L F1 | CNN/DM: R-1=42.9, R-2=20.1, R-L=38.2 |
| AFS Transformer | VLM retrieval | MRR@5 | Flickr30k: 0.801, COCO: 0.428 |
| ILF (LM feedback) | Abstractive text | Win rate % | ILF: 31.3%; ILF+best-of-N: 50.8% |
| HARE (Hone As Read) | Doc-level filtering | HAREScore | Coverage-opt: ~83.1 vs baseline ~82.15 |
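The ILF best-of-N numbers in the table come down to picking the reward-model argmax over candidate refinements. A toy sketch with a stand-in reward (real systems use a learned reward model, not this heuristic):

```python
def best_of_n(candidates, reward_model):
    """Select the highest-reward refinement (ILF-style best-of-N)."""
    return max(candidates, key=reward_model)

# Stand-in reward: prefer summaries that mention a key concept, penalize length.
best = best_of_n(
    ["long summary about cats and more", "cats summary"],
    lambda s: ("cats" in s) - 0.01 * len(s),
)
```

Only the selected refinement feeds back into fine-tuning, so the reward model's ranking quality directly bounds the gains of each iteration.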
These findings indicate AFS approaches consistently exceed vanilla or non-personalized baselines in multiple modalities and settings, particularly in iterative, feedback-driven refinement scenarios. AFS modules are robust to query drift (unlike generative relevance feedback), support multi-turn interactions, and transfer across backbone architectures (Khaertdinov et al., 21 Nov 2025).
6. Strengths, Limitations, and Open Challenges
Strengths of AFS include fine-grained personalization, active human-in-the-loop interpretability, elimination of required gold-standard references during production, and responsiveness in interactive contexts. These systems admit flexible feedback types (binary, weighted, natural language), and are extensible to multimodal and hierarchical settings.
Limitations arise from user simulation fidelity, potential cognitive load in concept selection, scalability of global optimization (e.g., ILP) on large corpora, and residual challenges in negative feedback incorporation and query drift management. HARE-style AFS highlights limitations in unsupervised “no-reference” metrics and the risk of over-filtering or loss of coherence in aggressive personalization (Bohn et al., 2021).
Potential extensions include semantic clustering for concept selection, integration of richer implicit feedback (gaze, dwell-time), multimodal document summarization bridging text, tables, and figures, and principled active learning for feedback solicitation (Ghodratnama et al., 2020, Khaertdinov et al., 21 Nov 2025).
7. Domain Extensions and Interdisciplinary Impact
AFS systems generalize across extractive, abstractive, and retrieval domains, admitting rapid adaptation to personalized summarization, visual search, and live document filtering in reading workflows. Empirical and theoretical advances in transformer architectures, imitation learning from language feedback, and concept-based coverage optimization inform broader applications in information retrieval, conversational agents, and interactive educational tools. A plausible implication is that multimodal AFS variants will play a growing role in bridging semantic gaps in VLM-based cross-domain search and summary systems.