Context Inconsistency Hallucinations
- Context Inconsistency Hallucinations arise when varying contextual cues cause an LLM's output to diverge from its input, introducing unsupported or extraneous content.
- They are characterized by prompt brittleness, lead bias, and performance fluctuations (e.g., up to 8% ROUGE variation) when paraphrasing or altering input formulations.
- Mitigation strategies such as controlled context repetition and multi-stage prompting effectively enhance factual consistency and reduce hallucinations in summarization tasks.
Zero-shot LLM summarisation refers to the use of LLMs to generate summaries of content—textual or multimodal—without any task-specific fine-tuning or supervised training on summarization datasets for the given target domain or modality. Instead, LLMs rely solely on prompt-based instructions and their pretraining signal to generalize to unseen summarization tasks, domains, or languages. This paradigm is distinguished by its training-free deployment, backed by the finding that instruction-tuned LLMs can, under well-formulated prompts, rival or exceed the performance of supervised and fine-tuned summarization systems across benchmark and specialized datasets, sometimes even outperforming reference summaries as evaluated by human judges (Pu et al., 2023, Zhang et al., 2023).
1. Foundational Principles and Problem Formulation
Formal definition: Given an input document $D$ — ranging from a news article, biomedical abstract, or source code function to a video transcript — a zero-shot LLM summarizer produces a summary $S = \mathrm{LLM}(\mathrm{prompt}(D))$ via a single prompt, without gradient-based updates on summarization data (Pu et al., 2023). In cross-lingual settings, the task (CLS) is to produce $S$ in a language different from that of $D$, integrating translation and summarization zero-shot (Wang et al., 2023). Prompting is the primary interface:
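A minimal sketch of this prompt-only formulation follows; the `call_llm` helper is a hypothetical stand-in for any instruction-tuned chat-model endpoint, and the prompt wording is illustrative rather than taken from the cited papers.

```python
# Minimal zero-shot summarisation interface: a single prompt, no fine-tuning.
# `call_llm` is a hypothetical placeholder for an instruction-tuned LLM endpoint.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an instruction-tuned LLM and return its text reply."""
    raise NotImplementedError("Wire this to the LLM client of your choice.")

def zero_shot_summary(document: str, target_language: str = "English") -> str:
    # Prompt-only problem formulation: S = LLM(prompt(D)), no gradient updates.
    prompt = (
        f"Summarize the following document in {target_language} "
        "in at most 3 sentences, using only information stated in the document.\n\n"
        f"Document:\n{document}\n\nSummary:"
    )
    return call_llm(prompt)
```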
The zero-shot regime assumes no in-domain exemplars or demonstration-based in-context learning; however, stratified prompt variants, chain-of-thought or meta-generation steps, and minor domain adaptation within the prompt itself are allowed.
2. Prompt Engineering and Zero-Shot Summarization Pipelines
Prompt engineering is the core driver of success in zero-shot LLM summarisation (Aly et al., 7 Jul 2025, Manuvinakurike et al., 2023, Jaaouine et al., 30 Nov 2025). Common prompt types include:
- Canonical direct prompts: “Summarize the following document...”
- Instruction and role-play prompts: e.g., “Act as a text summarizer and provide a concise summary of up to 3 sentences...”
- Structured prompts: enforce sections, length, extractiveness, or lay style.
- Meta-generation (multi-stage) pipelines: breaking summarization into sub-steps such as question-answering, critique-improve cycles, and summary refinement (Goldsack et al., 9 Jan 2025, Li et al., 26 Oct 2024).
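As a concrete illustration of these prompt families and of a meta-generation pipeline, a hedged sketch reusing the hypothetical `call_llm` helper from the earlier example; every template's wording is an assumption, not the exact prompt of any cited paper.

```python
# Illustrative prompt templates for the families listed above (wording is an assumption).
DIRECT = "Summarize the following document:\n\n{doc}"
ROLE_PLAY = ("Act as a text summarizer and provide a concise summary "
             "of up to 3 sentences:\n\n{doc}")
STRUCTURED = ("Summarize the document below in plain language for a lay reader. "
              "Use exactly two sections: 'Background' and 'Key findings'. "
              "Keep the total under 120 words.\n\n{doc}")

def meta_generation_summary(doc: str) -> str:
    """Sketch of a multi-stage (meta-generation) pipeline: QA -> draft -> critique -> refine.
    `call_llm` is the hypothetical LLM call defined in the earlier sketch."""
    answers = call_llm(
        "Answer briefly: What is this document about? What was done? "
        f"What was found?\n\n{doc}"
    )
    draft = call_llm(f"Using these notes:\n{answers}\n\nWrite a 3-sentence summary of:\n{doc}")
    critique = call_llm(
        f"List factual or clarity problems in this summary of the document.\n"
        f"Document:\n{doc}\n\nSummary:\n{draft}"
    )
    return call_llm(f"Revise the summary to fix these issues:\n{critique}\n\nSummary:\n{draft}")
```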
Best practices identified:
- Instruction tuning is necessary for optimal zero-shot performance; instruction-tuned models consistently outperform vanilla LLMs, regardless of parameter count (Zhang et al., 2023, Manuvinakurike et al., 2023).
- Explicit length and style specification in prompts improves compliance and relevance, especially for scientific, legal, or highly technical domains (Retkowski et al., 31 Dec 2024).
- Chunking and multi-stage summarization are critical for long-form inputs; for example, document partitioning and iterative summarization are required for ArXiv-scale scientific papers due to context window constraints (Aly et al., 7 Jul 2025). A chunked pipeline is sketched after this list.
- Prompt sensitivity is considerable: slight syntactic or role variations yield ROUGE-L swings of up to 8% (Manuvinakurike et al., 2023), driving the trend towards prompt set selection or prompt ensembling.
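The chunked, multi-stage approach for long inputs noted above might look like the following sketch; the chunk size, prompt wording, and the `call_llm` helper are assumptions.

```python
# Sketch of chunked, multi-stage summarisation for inputs exceeding the context window.
# `call_llm` is the hypothetical LLM call from the earlier sketch; chunk size is illustrative.
def summarize_long_document(doc: str, max_chunk_chars: int = 8000) -> str:
    chunks = [doc[i:i + max_chunk_chars] for i in range(0, len(doc), max_chunk_chars)]
    partial_summaries = [
        call_llm(f"Summarize this section of a longer paper in 3-4 sentences:\n\n{chunk}")
        for chunk in chunks
    ]
    merged = "\n".join(partial_summaries)
    return call_llm(
        "Combine these section summaries into one coherent summary of the whole paper, "
        f"avoiding repetition:\n\n{merged}"
    )
```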
In zero-shot video summarization, contextual prompt templates balance global and local coherence, with boundary (first/last scene) prompts relying on global context only, and intermediate scene prompts incorporating adjacent scene descriptions for narrative flow and redundancy control (Wu et al., 20 Oct 2025).
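A hedged sketch of such a contextual prompt template is given below; the scene-description inputs and the rating instruction are illustrative, not the cited paper's exact template.

```python
# Sketch of the contextual-prompt idea for scene-level video summarisation:
# boundary scenes see only the global synopsis, intermediate scenes also see neighbours.
def scene_prompt(global_synopsis: str, scene_descs: list[str], i: int) -> str:
    is_boundary = i == 0 or i == len(scene_descs) - 1
    context = f"Video synopsis: {global_synopsis}\n"
    if not is_boundary:
        context += (f"Previous scene: {scene_descs[i - 1]}\n"
                    f"Next scene: {scene_descs[i + 1]}\n")
    return (context +
            f"Current scene: {scene_descs[i]}\n"
            "Rate how important this scene is to a summary of the video (0-10), "
            "avoiding redundancy with the neighbouring scenes.")
```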
3. Evaluation Metrics, Benchmarks, and Empirical Results
Zero-shot summarisation is evaluated across canonical and task-specific metrics:
- ROUGE-N / ROUGE-L: n-gram overlap and longest common subsequence measures of surface-level lexical similarity.
- BERTScore: embedding-based semantic similarity (Aly et al., 7 Jul 2025, Jaaouine et al., 30 Nov 2025, Ramprasad et al., 5 Feb 2024).
- CUI-F1 (UMLS concept overlap): for clinical summarisation (Kruse et al., 30 Jan 2025).
- Human judgments: head-to-head preference, faithfulness, coherence, relevance, factuality, and extractiveness (Pu et al., 2023, Zhang et al., 2023, Goldsack et al., 9 Jan 2025).
- Statistical validation: bootstrap CIs, Wilcoxon signed-rank tests, Bonferroni-Holm correction (Jaaouine et al., 30 Nov 2025).
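A minimal sketch of how these automatic metrics and the paired significance test are typically computed, assuming the `rouge-score`, `bert-score`, and `scipy` packages are installed; the toy reference and system outputs are illustrative.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score
from scipy.stats import wilcoxon

refs = ["the study found the drug reduced symptoms."]
sys_a = ["the drug reduced symptoms in the study."]
sys_b = ["the study was about a drug."]

# Per-document ROUGE-1 / ROUGE-L F-measures.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rougeL_a = [scorer.score(r, h)["rougeL"].fmeasure for r, h in zip(refs, sys_a)]
rougeL_b = [scorer.score(r, h)["rougeL"].fmeasure for r, h in zip(refs, sys_b)]

# Embedding-based semantic similarity (downloads a pretrained model on first use).
_, _, f1_a = bert_score(sys_a, refs, lang="en")

# Paired significance test over per-document scores; in practice this needs
# many documents, so the call is shown but not executed on this toy pair:
# stat, p = wilcoxon(rougeL_a, rougeL_b)
```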
Key empirical findings:
- Instruction-tuned LLMs (GPT-3.5, GPT-4, Llama-2) regularly surpass fine-tuned baselines on news, dialogue, and cross-lingual tasks (Pu et al., 2023, Zhang et al., 2023, Manuvinakurike et al., 2023), with human judges preferring LLM outputs up to 65% of the time.
- Zero-shot models can close the gap with, and sometimes outperform, supervised references in factual consistency, coverage, and fluency (Pu et al., 2023).
- Performance on domain-specialized tasks (biomedical, legal, scientific) is more variable, with increased intrinsic error in underrepresented domains and lower correlation of ROUGE with human factuality (Ramprasad et al., 5 Feb 2024).
- For low-resource cross-lingual summarisation, meta-generation pipelines using self-improvement and refinement stages (e.g., SITR) yield large gains (+50–100% sum-ROUGE) versus baseline or few-shot LLMs (Li et al., 26 Oct 2024).
- Video summarisation via rubric-guided, pseudo-labeled prompting achieves F1 scores of 57.58 (SumMe) and 63.05 (TVSum), outperforming prior zero-shot and unsupervised baselines (Wu et al., 20 Oct 2025).
A subset of headline results:

| Task/Dataset | LLM/Method | Headline metric | Reference |
|---|---|---|---|
| News (CNN/DM) | Llama-2-13B-chat | ROUGE-1 = 37.87 | (Aly et al., 7 Jul 2025) |
| Zero-shot video (SumMe) | Rubric-Guided LLM | F1 = 57.58 | (Wu et al., 20 Oct 2025) |
| Cross-lingual (WikiLingua) | GPT-3.5 + SITR | sum-ROUGE = 49.02 | (Li et al., 26 Oct 2024) |
| Biomedical lay summarisation | DBRX, QA stage | Human preference ↑ | (Goldsack et al., 9 Jan 2025) |
4. Robustness, Hallucination, and Factuality in Zero-Shot Summarisation
Hallucination (factually unsupported or extrinsic content) and robustness are central areas of investigation:
- Prompt-based context repetition and random addition (repeating source sentences in the prompt) substantially reduce hallucinations and improve lexical and semantic alignment (RA-K2 boost: +0.135 mean over baseline, p<0.001) (Jaaouine et al., 30 Nov 2025); a sketch of the random-addition idea follows this list.
- Brittleness to paraphrasing: Minimal, meaning-preserving paraphrases of critical input sentences (relevance paraphrasing) cause large summarization quality drops (up to –50% ROUGE-2), exposing reliance on shallow heuristics (lead bias, specific surface forms) (Askari et al., 6 Jun 2024).
- Lead and position bias persist: LLMs default to early-sentence content, and position bias remains high on XSum even for state-of-the-art LLMs (Chhabra et al., 3 Jan 2024).
- Low extractiveness yields more hallucinations in high-representation domains: News summarization with higher pretraining corpus representation shows higher extrinsic error (correlated with pretraining exposure), while specialized domains display more copying and less unsupported content (Ramprasad et al., 5 Feb 2024).
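A minimal sketch of the random-addition idea from the first bullet above; the naive sentence splitter, the value of K, and the prompt wording are illustrative assumptions, not the exact recipe of the cited work.

```python
import random

# Sketch of prompt-based context repetition / random addition: K source sentences
# are repeated after the document so the model re-attends to them during generation.
def random_addition_prompt(doc: str, k: int = 2, seed: int = 0) -> str:
    sentences = [s.strip() for s in doc.split(".") if s.strip()]
    random.seed(seed)
    repeated = random.sample(sentences, k=min(k, len(sentences)))
    return (
        f"Summarize the following document faithfully.\n\nDocument:\n{doc}\n\n"
        "Pay particular attention to these sentences from the document:\n- "
        + "\n- ".join(repeated) + "\n\nSummary:"
    )
```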
Statistical findings further indicate that standard factuality metrics lose discriminative power in niche domains (Spearman ρ ≈ 0.09–0.30), necessitating domain-specific evaluation and model adjustment.
5. Specialized Variants and Task Extensions
Zero-shot LLM summarisation generalizes beyond simple document summarization:
- Cross-lingual summarisation (CLS): Chain-of-thought prompting (summarize-then-translate, or vice versa) and iterative refinement via follow-up prompts enable LLMs to compete with supervised, fine-tuned encoder-decoders (e.g., mBART-50), except for poorly-represented low-resource languages unless meta-generation strategies are applied (Wang et al., 2023, Li et al., 26 Oct 2024).
- Lay summarisation: Two-stage QA→summary pipelines, inspired by journalistic editorial practice, yield lay summaries in biomedicine and NLP that are strongly preferred by both human and LLM judges once model scale exceeds ∼46B parameters (Goldsack et al., 9 Jan 2025).
- Video summarisation: Frameworks employing LLM-moderated rubrics constructed from small “seed” annotations, with contextualized prompt variants for segment scoring, outperform both prior zero-shot and most supervised baselines on TVSum and SumMe, and enable domain generalization with minimal adaptation (Wu et al., 20 Oct 2025).
- Length-controllable summarisation: Integrated length-approximation, target adjustment, sample filtering, and automated LLM revision strategies achieve >90% length compliance for word-, character-, and token-based summary targets, all zero-shot and without architecture modification (Retkowski et al., 31 Dec 2024); a compliance-and-revise loop is sketched after this list.
- Extractive multilingual summarisation: Neural label search across multiple translated label sets, combined with learned weighting, robustly improves zero-shot extractive performance in cross-lingual settings (Jia et al., 2022).
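A sketch of the length-compliance-and-revise loop referenced above, reusing the hypothetical `call_llm` helper from the earlier examples; the word-level target, tolerance band, and retry budget are assumptions, and the cited work additionally covers character- and token-level targets and candidate filtering.

```python
# Sketch of zero-shot length control via target specification plus automated revision.
def length_controlled_summary(doc: str, target_words: int = 50,
                              tolerance: float = 0.1, max_revisions: int = 3) -> str:
    summary = call_llm(f"Summarize the document in about {target_words} words:\n\n{doc}")
    for _ in range(max_revisions):
        n = len(summary.split())
        if abs(n - target_words) <= tolerance * target_words:
            break  # within the compliance band, stop revising
        direction = "shorten" if n > target_words else "expand"
        summary = call_llm(
            f"The summary below has {n} words but should have about {target_words}. "
            f"Please {direction} it without adding unsupported content.\n\n{summary}"
        )
    return summary
```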
6. Limitations, Open Challenges, and Directions
Zero-shot LLM summarisation remains subject to several limitations:
- Surface form sensitivity: Even minimally perturbed, semantically-equivalent inputs can yield divergent and often degraded summaries (Askari et al., 6 Jun 2024).
- Limited temporal and narrative coherence: In complex domains such as clinical summarisation, zero-shot LLMs omit, rearrange, or hallucinate temporally critical events; hybrid retrieval-augmented approaches and explicit temporal extraction are needed (Kruse et al., 30 Jan 2025).
- Domain drift and factuality: Specialized domains not represented in pretraining data experience higher rates of intrinsic errors and degraded consistency of automatic metrics (Ramprasad et al., 5 Feb 2024).
- Prompt brittleness: Performance can drop sharply (up to 8% ROUGE fluctuation) across minor prompt wording variants, especially for smaller LLMs (Manuvinakurike et al., 2023, Aly et al., 7 Jul 2025).
- Evaluation constraints: ROUGE and BERTScore only partially predict human utility, and reference quality constrains metric reliability (Zhang et al., 2023).
Recommendations emerging from recent studies:
- Emphasize instruction tuning and careful prompt engineering.
- Where fidelity is critical, adopt context repetition/random addition and metaprompt improvement steps.
- In low-resource cross-lingual or lay summarisation settings, leverage explicit multi-stage meta-generation.
- For robustness, explore paraphrase-invariant training and rank-based reranking for length or fact control (Retkowski et al., 31 Dec 2024, Askari et al., 6 Jun 2024).
- Human or LLM panel evaluation is needed, as automatic metrics are unreliable on abstractive, domain-shifting, or lay tasks (Goldsack et al., 9 Jan 2025).
7. Generalizability and Future Directions
Zero-shot LLM summarisation frameworks are inherently generalizable:
- Adaptation to new domains: Only a small set of seed annotations, from which pseudo-labels and rubrics are abstracted, is needed to adapt to video, medical, or even text summarization (Wu et al., 20 Oct 2025).
- Structured task decomposition: Multi-stage role-play and metaprompt pipelines (e.g., QA, critique, refine) are applicable across technical, lay, and cross-lingual summarisation (Goldsack et al., 9 Jan 2025, Li et al., 26 Oct 2024).
- Model-agnostic control: Length, style, and factuality can be externally enforced via prompt engineering, ranking, or revision (Retkowski et al., 31 Dec 2024).
- Automated LLM evaluators can serve as practical proxies for human preference, especially when references are poor or unavailable (Goldsack et al., 9 Jan 2025).
Open directions include scaling paraphrase-invariance objectives, integrating multimodal input for video and scientific summarisation, and developing domain-aware factuality metrics and robust meta-evaluation frameworks (Askari et al., 6 Jun 2024, Wu et al., 20 Oct 2025).
In summary, zero-shot LLM summarisation, powered by prompt engineering and meta-task pipelines, has redefined the state of the art in both general and specialized summarisation tasks. Despite remarkable progress, the paradigm faces ongoing challenges in robustness, fidelity, and domain adaptation that define the current research frontier (Pu et al., 2023, Jaaouine et al., 30 Nov 2025, Goldsack et al., 9 Jan 2025).