- The paper presents a modular QFS pipeline that integrates query decomposition and QA guidance to improve factual consistency in summarization for less-resourced languages.
- It employs two query decomposition strategies (LLM-driven and NER-based) and fine-tunes Slovene QA models on translated datasets to support both QA guidance and evaluation.
- Empirical results demonstrate that prompt augmentation yields concise, factually aligned summaries with higher performance on QAGS, QuestEval, and RQUGE metrics.
QFS-Composer: A Query-Focused Summarization Pipeline for Less-Resourced Languages
Query-focused summarization (QFS) tasks LLMs with constructing summaries that are directly responsive to a user-specified query, requiring strong factual alignment and selective paraphrase. While QFS in high-resource languages is well-studied, performance, available models, and evaluation strategies for less-resourced languages (LRLs) like Slovene remain limited. This paper presents QFS-Composer, a modular pipeline targeting QFS for LRLs, with an emphasis on factual consistency, effective supervision, and practical evaluation in domains without extensive annotated data and language resources.
Methodology
The QFS-Composer framework has a compositional architecture that decouples query decomposition, question answering and question generation (QA/QG), and summarization into modular components.
Query Decomposition is approached via two strategies:
- LLM-driven (decomp_aug): Few-shot, prompt-based decomposition using a compact LLM. This allows nuanced query breakdown but can produce question types that are out of domain for the downstream Slovene QA model.
- NER-based (ner_aug): Named entities are extracted from the query and used as seeds for question generation. This keeps the questions compatible with the QA model but restricts the strategy to entity-centric queries.
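The NER-based strategy can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the capitalized-word regex is a toy stand-in for a real Slovene NER model, the question templates are hypothetical, and the function names `extract_entities` and `seed_questions` do not come from the paper, which uses a question generation model seeded with the entities.

```python
import re

# Toy stand-in for an NER model: runs of capitalized words.
# A real pipeline would call a trained Slovene NER tagger instead.
_ENTITY = re.compile(r"\b[A-ZČŠŽĆĐ][\w-]*(?:\s+[A-ZČŠŽĆĐ][\w-]*)*")

# Hypothetical templates; the paper generates questions with a QG model.
_TEMPLATES = (
    "Who or what is {entity}?",
    "What does the text say about {entity}?",
)


def extract_entities(query: str) -> list[str]:
    """Return capitalized spans in order of appearance, without duplicates."""
    entities: list[str] = []
    for match in _ENTITY.finditer(query):
        if match.start() == 0:
            continue  # skip the sentence-initial word (a known limitation)
        span = match.group(0)
        if span not in entities:
            entities.append(span)
    return entities


def seed_questions(entities: list[str]) -> list[str]:
    """Turn each entity into entity-centric sub-questions."""
    return [t.format(entity=e) for e in entities for t in _TEMPLATES]


if __name__ == "__main__":
    query = "Which agreement did Ana Novak sign with Krka in Ljubljana?"
    for question in seed_questions(extract_entities(query)):
        print(question)
```

The sketch also makes the stated restriction visible: a query with no named entities yields no sub-questions, so the augmentation has nothing to add.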
Question Answering is realized by fine-tuning a Slovene Gemma-family LLM (GaMS-9B-Instruct) on a translated SQuAD v2 dataset [squad_slo]. Because GPU memory constraints cap the QA input length, source documents are split into chunks, and the chunks are ranked with BERTScore to select the passages passed to the QA model.
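A chunk-and-rank step of this kind might look like the sketch below. It is not the authors' implementation: the chunk size, stride, and function names are assumptions, and the scorer is a simple Jaccard token overlap swapped in for BERTScore so that the example runs without model downloads. Any function with the same `(question, chunk) -> float` shape, including a BERTScore wrapper, could be passed as `score`.

```python
from typing import Callable


def chunk_words(text: str, size: int = 50, stride: int = 25) -> list[str]:
    """Split text into overlapping word windows of at most `size` words."""
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)]
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a lightweight placeholder for BERTScore."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def select_passages(
    question: str,
    document: str,
    k: int = 2,
    size: int = 50,
    stride: int = 25,
    score: Callable[[str, str], float] = jaccard,
) -> list[str]:
    """Return the k chunks that score highest against the question."""
    chunks = chunk_words(document, size, stride)
    ranked = sorted(chunks, key=lambda c: score(question, c), reverse=True)
    return ranked[:k]
```

Overlapping windows (a stride smaller than the chunk size) reduce the chance that an answer span is cut at a chunk boundary, at the cost of scoring more chunks.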
Summary generation uses four LLMs (GPT-4.1-mini, Gemma-2-9B-It, Llama-3.1-8B-Instruct, GaMS-9B-Instruct), each prompted with a composite prompt comprising the original user query, the decomposed QA pairs, and the source text. The augmented prompt is intended to improve factual correctness and relevance to the query.
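The composite prompt can be pictured as a simple template. The sketch below is an assumption about its shape, not the paper's actual prompt: the section labels, the instruction wording, and the name `build_prompt` are all hypothetical. Passing an empty list of QA pairs corresponds to the unaugmented setting.

```python
def build_prompt(
    query: str,
    qa_pairs: list[tuple[str, str]],
    source_text: str,
) -> str:
    """Assemble query, optional QA pairs, and source text into one prompt."""
    lines = [f"Query: {query}"]
    if qa_pairs:
        # QA pairs come from the decomposition and QA steps.
        lines.append("Supporting question-answer pairs:")
        for question, answer in qa_pairs:
            lines.append(f"- Q: {question} A: {answer}")
    lines.append("Source text:")
    lines.append(source_text)
    # Hypothetical instruction wording.
    lines.append("Write a concise summary of the source text that answers the query.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(
        "What did the council decide?",
        [("Who chaired the council?", "Ana Novak")],
        "The council, chaired by Ana Novak, approved the budget.",
    ))
```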
Evaluation adapts three metrics for Slovene and QFS:
- QAGS: QA-based factual consistency metric that compares the answers obtained from the source with the answers obtained from the summary for the same questions.
- QuestEval: A QG/QA-based metric considering both precision and recall for summary coverage, with a weighting mechanism and answerability confidence.
- RQUGE: Reference-free QA-based metric, also ported to Slovene.
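The answer comparison at the core of a QAGS-style score can be sketched with the standard SQuAD-style exact match (EM) and token-level F1. This is a minimal illustration under assumptions: the normalization rules and function names are not taken from the paper, the question generation and QA steps that produce the answers are omitted, and the BERTScore variant is left out.

```python
import string
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    answer = "".join(ch for ch in answer.lower() if ch not in string.punctuation)
    return " ".join(answer.split())


def exact_match(a: str, b: str) -> float:
    return float(normalize(a) == normalize(b))


def token_f1(a: str, b: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    tokens_a, tokens_b = normalize(a).split(), normalize(b).split()
    if not tokens_a or not tokens_b:
        return float(tokens_a == tokens_b)  # both empty counts as a match
    common = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if common == 0:
        return 0.0
    precision = common / len(tokens_a)
    recall = common / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def qags_scores(from_source: list[str], from_summary: list[str]) -> dict[str, float]:
    """Average EM and F1 over answers to the same questions."""
    if not from_source or len(from_source) != len(from_summary):
        raise ValueError("need two non-empty answer lists of equal length")
    pairs = list(zip(from_source, from_summary))
    return {
        "em": sum(exact_match(a, b) for a, b in pairs) / len(pairs),
        "f1": sum(token_f1(a, b) for a, b in pairs) / len(pairs),
    }
```

Because both EM and token F1 depend on surface overlap, a correct but paraphrased answer is penalized, which is one source of the evaluation noise from paraphrastic answers noted later in this summary.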
The paper reports detailed adaptations of QA/QG models, reference-free evaluation protocols, and a translation of the MOCHA dataset for Slovene to support these methods.
Experimental Setup and Empirical Results
The evaluation dataset comprises 21 real-world Slovene news articles. Comparative experiments are conducted across four LLMs and three prompt augmentation regimes (no_aug, decomp_aug, ner_aug).
Key findings:
- Across models, prompt augmentation—particularly ner_aug—consistently improves factual alignment (QAGS F1, EM, BERTScore) relative to plain queries, with Llama-3.1-8B-Instruct (ner_aug) achieving the highest QAGS scores. The increase in scores is not attributable to longer outputs; in fact, summaries generated with augmented prompts are systematically shorter yet more information-dense.
- QuestEval generally shows only marginal improvements with augmentation, apart from GaMS-9B-Instruct (ner_aug), which yields the best performance. The dependence of QuestEval on strict answer match and coverage leads to overall lower scores.
- Qualitative assessment reveals augmented-prompt summaries are more concise and relevant, with better handling of entity-specific questions; however, limitations arise from the QA model’s context window, errors in question decomposition, and model hallucinations under challenging source/query combinations.
Theoretical and Practical Implications
The results indicate decompositional pipelines with QA guidance can enhance the factual accuracy and query-responsiveness of abstractive summaries for LRLs, even when forced to rely on limited data resources and hardware. The modular design supports flexible improvements to decomposition, QA, QG, and summary generation submodules as better Slovene models become available.
The pipeline demonstrates that entity-based query decomposition, despite its simplicity, robustly improves alignment when the generated questions resemble the question types on which the underlying QA/QG models were trained. However, prompt engineering and model alignment for decomposition require careful curation, because errors in upstream submodules cascade to downstream ones.
The Slovene QFS ecosystem benefits from the assets and tools released: SQuAD-based QA/QG models, translated evaluation datasets, and adapted reference-free metrics. These resources serve as baselines and testbeds for further QFS research in Slovene and other LRLs. Challenges persist around QA model context limitations, evaluation noise from paraphrastic answers, and a lack of large-scale human-annotated datasets for direct system tuning and assessment.
Limitations and Future Directions
The most pressing technical bottleneck is the restricted context window for QA inference, leading to chunking strategies that may miss answer spans—particularly in longer, less-structured documents. Addressing this will require enhanced passage ranking, longer-context LLMs, or dynamic memory architectures tailored for LRLs. Further, the reliance on reference-free metrics—while necessary for LRLs—limits comparability and interpretability of results; human evaluation and the acquisition of parallel annotated corpora remain critical objectives.
Improved question decomposition for queries beyond entity-based scopes (e.g., for relational or abstract queries) and robust QA/QG alignment to support deeper query semantics are essential next steps. Community efforts to build language resources, such as parallel QFS datasets and shared benchmarks, will further advance the field.
Conclusion
QFS-Composer provides a novel, extensible framework for query-focused abstractive summarization in less-resourced languages, integrating query decomposition, QA-guided supervision, and adapted evaluation metrics. Empirical findings demonstrate measurable gains in factuality and relevance over baseline LLMs, achieved through modular augmentation and Slovene-optimized supervision. The work establishes new assets for Slovene QFS and elucidates fundamental challenges for robust, factual summarization in LRL settings, charting a clear path for subsequent research in compositional NLP pipelines and evaluation for low-resource languages.
Reference: "QFS-Composer: Query-focused summarization pipeline for less resourced languages" (2604.10687)