LLM-Based Bug Report Summarization
- The paper demonstrates that instruction fine-tuning with LoRA enables LLMs to transform unstructured bug reports into structured summaries with improved CTQRS and field detection.
- The methodology integrates textual descriptions with code artifacts using a progressive, hierarchical summarization pipeline that overcomes transformer context limitations.
- The approach automates bug triaging by consolidating and highlighting essential details while flagging missing fields to streamline debugging workflows.
LLM-Based Abstractive Bug Report Summarization refers to neural methodologies that leverage large-scale, instruction-tuned or code-aware transformer models to generate concise, readable, and semantically coherent summaries or structured templates from unstructured, verbose, or heterogeneous bug reports, optionally integrating sources such as textual descriptions, commit diffs, and code snippets. This paradigm aims to address a significant engineering bottleneck, the triage and comprehension of low-quality bug reports, by automatically extracting the most salient details, reducing ambiguity, and ensuring completeness for downstream maintenance and debugging workflows (Acharya et al., 26 Apr 2025, Karim et al., 29 Nov 2025).
1. Problem Setting and Motivations
Bug reports in large software projects typically exhibit varied linguistic quality, ambiguous descriptions, missing critical fields (e.g., Steps to Reproduce (S2R), Expected Behavior (EB), Actual Behavior (AB)), and fragmentation between natural language and associated technical artifacts. These factors prolong the triage cycle and undermine the efficiency of developer teams. The central research objective is to transform such unstructured, user-generated content into structured, actionable bug reports or abstractive summaries, with accurate mapping of content to a predefined schema and identification of missing information. Core research questions include: Can open-source, instruction-fine-tuned LLMs outperform proprietary few-shot systems? Do these models generalize to unseen projects or diverse ecosystems? How is missing-field detection accuracy impacted by LLM architecture and tuning regime (Acharya et al., 26 Apr 2025)?
Concurrently, bug report summarization must address code-context integration: standard approaches often ignore code, leading to loss of defect context, especially for issues directly tied to specific diffs, patches, or stack traces. Abstractive methods aim to combine the strengths of free-form language understanding and direct program artifact reasoning (Karim et al., 29 Nov 2025).
2. Dataset Construction and Preprocessing Methodologies
Two major data-centric methodologies are prominent. For structured bug report generation, high-quality Bugzilla bug reports from Mozilla projects serve as source data; reports are extracted via the Bugzilla API and filtered by presence of explicit S2R/EB/AB/Additional Info, exclusion of stack traces/code snippets to reduce noise, and retention of only those reports with high Crowdsourced Test Report Quality Score (CTQRS > 14/17). Manual validation ensures schema compliance; resulting high-quality structured reports are then paraphrased into free-form unstructured variants using Llama 3 with naturalization prompts, with filtering by SBERT similarity (≥ 0.85) and cosine similarity (≥ 0.80) to construct near-parallel (unstructured, structured) pairs (Acharya et al., 26 Apr 2025).
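A minimal sketch of the similarity-based pair filtering, assuming the sentence-transformers library; the checkpoint name, helper name, and example pair are illustrative, and the 0.85 cutoff mirrors the SBERT threshold stated above.

```python
# Sketch: filter (structured, unstructured) bug report pairs by embedding similarity.
# Assumes the sentence-transformers library; the checkpoint is a placeholder and
# this is not the authors' code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder SBERT checkpoint

def keep_pair(structured: str, unstructured: str, threshold: float = 0.85) -> bool:
    """Keep a paraphrased pair only if the two texts remain semantically close."""
    emb = model.encode([structured, unstructured], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

candidate_pairs = [
    ("Steps to Reproduce: open the settings page and toggle sync twice.",
     "When I flip the sync switch a couple of times in settings, things break."),
]
filtered = [pair for pair in candidate_pairs if keep_pair(*pair)]
```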
For text-plus-code summarization, pipeline architectures start from Defects4J (395 real Java defects) and other benchmarks (SDS, ADS, Fang et al.), aligning tokenized bug report texts and preprocessed code artifacts as parallel input-source tuples per bug instance (Karim et al., 29 Nov 2025).
| Dataset | Domains | Paired Content |
|---|---|---|
| Bugzilla (Mozilla) | Bug reports | Unstructured/structured |
| Defects4J | Java defects | Bug report + Code |
| SDS, ADS, Fang et al. | Mixed | Bug report + Summary |
Each dataset partition follows standard splits (80/10/10) and, where applicable, 4-fold cross-validation.
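A minimal sketch of one paired instance and the 80/10/10 partitioning; the field names, seed, and helper are illustrative assumptions, not the benchmarks' actual loaders.

```python
# Sketch: one paired (bug report, code, reference summary) instance and a
# standard 80/10/10 split; names and the seed are illustrative assumptions.
import random
from dataclasses import dataclass

@dataclass
class BugInstance:
    report_text: str        # tokenized/cleaned bug report description
    code_artifact: str      # associated diff, patch, or snippet
    reference_summary: str  # gold summary used for evaluation

def split_80_10_10(instances, seed=13):
    """Shuffle once, then return (train, validation, test) partitions."""
    items = list(instances)
    random.Random(seed).shuffle(items)
    n = len(items)
    a, b = int(0.8 * n), int(0.9 * n)
    return items[:a], items[a:b], items[b:]
```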
3. Model Architectures and Fine-Tuning Strategies
Instruction fine-tuning with open-source LLMs is a dominant paradigm. Evaluated models include Qwen 2.5-7B-Instruct, Mistral-7B-Instruct, Llama 3.2-3B-Instruct, and, as an external baseline, ChatGPT-4o (3-shot). Models are parameter-efficiently fine-tuned using Low-Rank Adaptation (LoRA) with rank 16, injected into the attention and feed-forward projection modules, using the Unsloth and TRL SFTTrainer frameworks. Training typically runs for 3 epochs with a batch size of 8, using one learning rate for Qwen and Mistral and a separate rate for Llama 3.2 (Acharya et al., 26 Apr 2025).
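A minimal configuration sketch of this tuning setup using Hugging Face peft and TRL; the lora_alpha, dropout, target-module list, learning rate, and dataset file are assumptions for illustration, and exact argument names vary across TRL versions.

```python
# Sketch: LoRA instruction fine-tuning with peft + TRL's SFTTrainer.
# Values marked "assumed" are not taken from the paper.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora = LoraConfig(
    r=16,                          # rank reported in the paper
    lora_alpha=32,                 # assumed
    lora_dropout=0.05,             # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical targets
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    peft_config=lora,
    train_dataset=load_dataset("json", data_files="bug_report_pairs.json")["train"],
    args=SFTConfig(
        num_train_epochs=3,
        per_device_train_batch_size=8,
        learning_rate=2e-4,        # assumed; per-model rates not reproduced here
        output_dir="bugreport-lora",
    ),
)
trainer.train()
```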
The prompt strategy uses an Alpaca-style template instructing the LLM, as a "senior software engineer", to generate exhaustive structured bug reports, requiring explicit notification of missing sections. Structured output is delivered in JSON with clearly labeled fields and a “Missing Fields” key if any sections are absent.
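A minimal sketch of such a template and the expected output shape; the wording, field labels, and example values are illustrative and not the paper's verbatim prompt.

```python
# Sketch: an Alpaca-style instruction prompt and an illustrative JSON response
# for a report that lacks reproduction steps. Not the authors' exact template.
PROMPT_TEMPLATE = """### Instruction:
You are a senior software engineer. Rewrite the bug report below into a
structured report with the sections: Steps to Reproduce, Expected Behavior,
Actual Behavior, Additional Information. If a section cannot be recovered
from the input, list it under "Missing Fields". Respond in JSON.

### Input:
{bug_report}

### Response:
"""

EXAMPLE_OUTPUT = {
    "Steps to Reproduce": None,
    "Expected Behavior": "The panel should show previously saved bookmarks.",
    "Actual Behavior": "The bookmarks panel stays empty after restart.",
    "Additional Information": "Observed on a nightly build under Linux.",
    "Missing Fields": ["Steps to Reproduce"],
}
```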
For text-plus-code summarization, a hierarchical pipeline incrementally processes long code snippets via chunking: given a code artifact of length n and a maximum model context length L, the code is split into ⌈n/L⌉ contiguous blocks, each summarized by the LLM, aggregated into an intermediate code summary via a second LLM pass, and then integrated with the bug report text to generate the final abstractive summary (Karim et al., 29 Nov 2025).
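A minimal sketch of the chunk-summarize-aggregate idea under the notation above; `summarize` stands in for any LLM call and is a hypothetical helper, not the authors' implementation.

```python
# Sketch: progressive/hierarchical code summarization followed by fusion with
# the bug report text. `summarize` is a placeholder for an LLM invocation.
from typing import Callable, List

def hierarchical_code_summary(code: str, max_len: int,
                              summarize: Callable[[str], str]) -> str:
    """Two-pass summary: per-chunk summaries first, then one aggregate pass."""
    chunks: List[str] = [code[i:i + max_len] for i in range(0, len(code), max_len)]
    chunk_summaries = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(chunk_summaries))

def summarize_bug(report_text: str, code: str, max_len: int,
                  summarize: Callable[[str], str]) -> str:
    """Fuse the bug report text with the condensed code summary in a final pass."""
    code_summary = hierarchical_code_summary(code, max_len, summarize)
    return summarize(f"Bug report:\n{report_text}\n\nCode summary:\n{code_summary}")
```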
| Model | Training Regime | Key Features |
|---|---|---|
| Qwen 2.5 | LoRA + instruction tuning | Grouped-Query Attention (GQA) |
| Mistral-7B | LoRA + instruction tuning | - |
| Llama 3.2-3B | LoRA + instruction tuning | Conservative missing-field detection |
| ChatGPT-4o | 3-shot prompting | Proprietary baseline |
| Various (Karim et al.) | Progressive code integration | Chunked code summarization |
4. Evaluation Metrics and Experimental Results
Evaluation of LLM-based summarization and structured report generation employs CTQRS, ROUGE, METEOR, SBERT cosine similarity, and field-level F1 scores:
- CTQRS (Crowdsourced Test Report Quality Score) measures completeness, conciseness, atomicity, understandability, and reproducibility via 13 dependency-parsing rules, with a maximum score of 17.
- ROUGE-1 and METEOR provide n-gram and semantic overlap.
- SBERT Cosine Similarity measures embedding-level correspondence.
- F1-score is computed per field and for missing-section detection, with F1 = 2 · Precision · Recall / (Precision + Recall); a minimal computation sketch follows this list.
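A minimal sketch of how the automatic metrics could be computed, assuming the rouge_score and bert_score packages; CTQRS is rule-based and METEOR is omitted, and the helper names are assumptions.

```python
# Sketch: ROUGE-1, BERTScore F1, and missing-field detection F1 for one example.
# Assumes the rouge_score and bert_score packages; not the papers' evaluation code.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def summary_metrics(candidate: str, reference: str) -> dict:
    """Overlap- and embedding-based scores for a generated summary."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    rouge1_f = scorer.score(reference, candidate)["rouge1"].fmeasure
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return {"rouge1_f": rouge1_f, "bertscore_f1": f1.item()}

def missing_field_f1(true_missing: set, predicted_missing: set) -> float:
    """F1 = 2PR / (P + R) over the section names flagged as missing."""
    tp = len(true_missing & predicted_missing)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_missing)
    recall = tp / len(true_missing)
    return 2 * precision * recall / (precision + recall)
```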
Reported results for structured bug report generation:
| Model | CTQRS | SBERT | ROUGE-1 | F1 (S2R) |
|---|---|---|---|---|
| Qwen 2.5 | 77% | 0.82 | 0.64 | 76% |
| Mistral 7B | 71% | 70% | - | - |
| Llama 3.2 | 63% | <70% | - | - |
| ChatGPT-4o | 75% | - | - | - |
For cross-project generalization (Eclipse, GCC, Apache): Qwen 2.5 achieves up to 70% CTQRS, evidencing generalization beyond Mozilla (Acharya et al., 26 Apr 2025).
For progressive code-integration summarization, across Defects4J and the other benchmarks:
| Method | ROUGE-1 (Fang) | BERTScore F1 (Defects4J, Code-incl.) |
|---|---|---|
| BugSum | 0.2591 | - |
| DeepSum | 0.1760 | - |
| Mistral, zero/few-shot | 0.2786 | 0.6877 |
| CodeLlama, zero/few-shot | 0.2641 | 0.6767 |
| GPT-3.5 Turbo, few-shot incl. code | - | 0.9003 |
Progressive code integration outperforms extractive baselines by 7.5%-58.2% in ROUGE-1 and yields consistent BERTScore F1 gains (1–5 points) when code information is included (Karim et al., 29 Nov 2025).
5. Technical Analysis, Insights, and Error Modes
Qwen 2.5’s superior CTQRS is attributed to its Grouped-Query Attention (GQA) mechanism, which enhances its ability to capture long-range cross-field dependencies crucial for structured transformations (Acharya et al., 26 Apr 2025). Llama 3.2 demonstrates conservative missing-field (MF) detection but tends to under-generate lengthy S2Rs, resulting in higher accuracy in EB/AB omission flagging yet lower recall for procedural detail. Instruction fine-tuning offers 5–14% CTQRS uplift over few-shot regimes, affirming its efficacy.
Progressive code-integration, by chunking code and hierarchically aggregating summaries, enables semantic condensation of code artifacts while sidestepping transformer context window limitations. Empirical ablation demonstrates that code-context, especially patch code, materially increases BERTScore F1 in summarization, with bug-report→code ordering outperforming code→bug-report. Fine-tuning LLMs with LoRA brings smaller incremental gains compared to optimal prompt engineering for the abstractive task on some datasets, which suggests that pre-trained LLMs with strong in-context learning abilities can robustly generalize with well-designed prompts (Karim et al., 29 Nov 2025).
Documented error patterns include hallucinated S2R content when original reports lack detail, misattribution or swapping of EB versus AB fields, and over-detailed procedural sections diminishing n-gram overlap despite improved informativeness. Fine-tuned models insert explicit “Missing: Steps to Reproduce” or analogous warnings in the output JSON, improving triage clarity (Acharya et al., 26 Apr 2025).
6. Applications and Implications for Software Engineering
Empirical findings confirm that open-source, instruction-fine-tuned LLMs can match or surpass leading proprietary models (e.g., ChatGPT-4o) in structured bug report generation, with robust generalization across divergent software projects and repositories. Integration workflows enable automatic reformulation of unstructured reporter input, immediate highlighting of missing or incomplete sections, and substantial reduction in human triaging overhead, which directly accelerates the bug-fixing lifecycle. LLMs trained on large project corpora can act as drop-in solutions for nascent repositories, obviating the need for extensive project-specific annotation (Acharya et al., 26 Apr 2025). Progressive code-integration pipelines extend the summarization scope to the joint synthesis of linguistic and technical evidence, enabling richer, more context-aware summaries for defect comprehension (Karim et al., 29 Nov 2025).
7. Future Research Directions
Open challenges and prospective advances include:
- Multi-modal enrichment: Incorporating stack traces, logs, or screenshots into report synthesis to further enhance bug localizability and actionability.
- Advanced parameter-efficient tuning: Investigation of techniques such as QLoRA to balance compute efficiency with model capacity.
- Broader platform coverage: Extending approaches to platforms beyond Bugzilla, including GitHub Issues and Jira, with attention to ecosystem-specific linguistic and structural norms.
- Interactive systems: Real-time reporter assistants that provide missing-field prompts or semi-automated bug report filling.
- Human-centered evaluation: Systematic studies on developer-perceived clarity, actionability, and satisfaction with LLM-generated summaries or reports.
- Continuous learning: Deploying active learning feedback loops where developer or reporter corrections iteratively refine LLM weights for sustained accuracy improvements.
These directions collectively suggest the evolution of LLM-based bug report summarization toward highly adaptive, contextually aware, and multi-modal intelligent assistants for automated software maintenance (Acharya et al., 26 Apr 2025, Karim et al., 29 Nov 2025).