Automatic Issue-Type Classification
- Automatic issue-type classification is a process that uses machine learning and statistical methods to label software issues with categories such as bug, enhancement, and question.
- Techniques range from bag-of-n-grams and TF–IDF features to transformer-based models such as BERT and GPT-3.5-turbo for extracting and interpreting textual features.
- Practical implementations integrate with software workflows, automating issue triage, improving developer-task matching, and streamlining project management.
Automatic issue-type classification refers to the use of statistical and machine learning techniques to assign semantic or functional categories—such as bug, feature/enhancement, question, or required skill domain—to software issue tracker entries based solely on their textual content or associated metadata. This process addresses the workflow bottleneck introduced by high issue throughput in large software repositories and facilitates improved triage, developer assignment, and project management.
1. Taxonomies of Issue Types
Contemporary approaches employ two principal taxonomies for issue classification:
- Intent-based categories: These are canonical in mainstream GitHub workflows and include bug report, enhancement (feature request), and question. These are typically mutually exclusive and represent the default labeling taxonomy for classification tasks, as operationalized in "Ticket Tagger" and related studies (Kallis et al., 2021, Trautsch et al., 2022, Aracena et al., 2024).
- API-domain/skill-based categories: Used in tools such as "GiveMeLabeledIssues," this schema comprises 31 project- and skill-relevant domains (e.g., UI, DB, Test, Networking, Security) and is designed to support developer-task matching beyond the intent dimension, serving as a proxy for the APIs and skills required to resolve the issue. These labels are not mutually exclusive, yielding a multi-label classification setting (Vargovich et al., 2023).
2. Datasets and Preprocessing Protocols
Public datasets typically draw from closed and labeled GitHub issues, with selection and preprocessing adapted to task goals:
- Balanced vs. Unbalanced Sampling: Balanced datasets contain equal numbers of issues per target class to minimize model bias, e.g., 10,000 bugs, enhancements, and questions each (12,112 repositories; Feb 2018) (Kallis et al., 2021). Unbalanced datasets reflect real-world submission frequencies (e.g., 16,355 bugs, 14,228 enhancements, 3,458 questions) (Kallis et al., 2021).
- Text Normalization:
- Title and body concatenation.
- Basic tokenization (white-space, punctuation splitting).
- For skill/domain labeling, additional normalization includes lowercasing, removal of URLs, in-line code, numbers, punctuation, and application of stemming and stop-word removal (Vargovich et al., 2023).
- In LLM pipelines, further cleaning replaces user handles, URLs, and HTML with sentinel tokens (Aracena et al., 2024).
- Sampling Restrictions: only closed issues carrying a single relevant label are retained (to avoid ambiguous cases and post-facto re-labelings).
- API Linking for Skill Domain: Only issues resolved via pull requests (PRs) modifying ≥1 source file are considered when skill-based labeling is required (Vargovich et al., 2023). API usage is extracted via static parsing of import/include/using statements.
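The normalization steps above can be sketched as a small pipeline. The stop-word set and the suffix-stripping rule here are illustrative stand-ins for the full stop-word lists and stemmers (e.g., Porter) the cited pipelines use:

```python
import re

# Illustrative stop-word list -- a stand-in for a full NLTK-style list.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "at", "when"}

def normalize_issue(title: str, body: str) -> list[str]:
    """Sketch of the preprocessing protocol: concatenate title and body,
    lowercase, strip inline code / URLs / numbers / punctuation,
    then tokenize and drop stop words."""
    text = f"{title} {body}".lower()
    text = re.sub(r"`[^`]*`", " ", text)       # inline code spans
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # numbers and punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Naive suffix stripping as a stand-in for a real stemmer.
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

tokens = normalize_issue(
    "Crash when saving",
    "App crashes at https://example.com `save()` 3 times",
)
```

The order of operations matters: code spans and URLs are removed before the punctuation filter, otherwise URL fragments would survive as spurious tokens.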
3. Feature Engineering Methods
Approaches span a spectrum from low-level n-gram encodings to contextual representation learning:
- Bag-of-n-grams: Classic approaches use character-level (3–6-gram) representations as in fastText, where each word is split into overlapping n-grams and the resulting embeddings are averaged into a fixed-size vector (Kallis et al., 2021).
- TF–IDF over Processed Corpora: For API-domain labeling, the TF–IDF vectorization (with preprocessing) is standard; token frequency is counterbalanced by inverse document frequency (Vargovich et al., 2023).
- Contextual Embeddings: Transformer models (BERT, seBERT, GPT-3.5-turbo) tokenize text into WordPieces, with document-level semantics pooled from the special [CLS] token (Trautsch et al., 2022, Aracena et al., 2024).
- Exclusions: Metadata such as timestamps and user roles are not included in most intent-type classification models, though future work proposes their integration (Kallis et al., 2021).
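The fastText-style subword features from the first bullet can be made concrete. A word is padded with boundary markers and decomposed into overlapping character n-grams; fastText then hashes each n-gram into a fixed number of embedding buckets (we use Python's built-in `hash` as a stand-in for fastText's FNV hashing):

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """fastText-style subword features: pad the word with boundary
    markers, then emit all character n-grams of length n_min..n_max."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]

def bucket_ids(word: str, buckets: int = 2_000_000) -> set[int]:
    """Hash n-grams into embedding buckets; the bucket embeddings are
    averaged into a fixed-size issue vector by the model."""
    return {hash(g) % buckets for g in char_ngrams(word)}

# "bug" padded to "<bug>" yields six n-grams of lengths 3-5.
print(char_ngrams("bug"))
```

The boundary markers let the model distinguish prefixes and suffixes (`<bu` vs. a word-internal `bu`), which is what makes subword features robust to typos and rare identifiers in issue text.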
4. Classification Models and Training Procedures
Automated issue-type labeling has been implemented using a range of discriminative models:
| Approach | Feature Set | Model Family | Output Type |
|---|---|---|---|
| fastText (Ticket Tagger) | char-n-grams | Linear classifier (hierarchical softmax) | Single label |
| J48 (C4.5) | TF–IDF tokens | Decision Tree | Single label |
| Random Forest (GiveMeLabeledIssues) | TF–IDF tokens + API usage | RF (multi-label) | Multi-label |
| BERT / seBERT | WordPiece tokens | Transformer | Single/Multi |
| GPT-3.5-turbo | Prompted text | LLM + fine-tuning | Single label |
- Ticket Tagger: fastText with hierarchical softmax; word-level n-grams are disabled for efficiency; feature pruning by frequency (minCount=14) (Kallis et al., 2021). Training is via negative log-likelihood loss over classes.
- GiveMeLabeledIssues: Multi-label Random Forest with optimized hyperparameters (entropy criterion, max_depth=50, n=50 estimators) per project. Alternative BERT-based models are also implemented via fast-bert, with a sigmoid-activated dense head for multi-label output (Vargovich et al., 2023).
- seBERT: Transformer encoder (BERTLARGE, 24 layers, 1024 hidden units), pre-trained from scratch on SE-related corpora, then fine-tuned for issue classification with a softmax layer over three labels and multi-class cross-entropy loss (Trautsch et al., 2022).
- GPT-3.5-turbo: Instruction-prompted, few-shot fine-tuned LLM with temperature=0.0 and max_tokens=1 to yield deterministic, single-token classifications based on a prompt encompassing both title and body. Fine-tuned using 300 labeled issues per repository, with epochs tailored to dataset and convergence (Aracena et al., 2024).
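The GPT-3.5-turbo configuration above can be sketched as a chat-completion request payload. The system-prompt wording and label set here are illustrative, not the exact prompt from the cited study; the `temperature` and `max_tokens` settings are those reported:

```python
LABELS = ["bug", "enhancement", "question"]

def build_request(title: str, body: str) -> dict:
    """Assemble an OpenAI chat-completion payload for single-token
    issue classification (hypothetical prompt wording)."""
    return {
        "model": "gpt-3.5-turbo",
        "temperature": 0.0,  # deterministic decoding
        "max_tokens": 1,     # force a single-token label
        "messages": [
            {"role": "system",
             "content": "Classify the GitHub issue as one of: "
                        + ", ".join(LABELS) + ". Answer with one word."},
            {"role": "user",
             "content": f"Title: {title}\n\nBody: {body}"},
        ],
    }

req = build_request("App crashes on save", "Steps to reproduce: ...")
```

Capping the completion at one token only works if every label tokenizes to a single token in the model's vocabulary, which is one reason short canonical labels are used.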
5. Evaluation Metrics and Benchmark Results
Standardized metrics are used to quantify model effectiveness:
- Precision ($P_c$), Recall ($R_c$), and F1-score ($F1_c$) for each class $c$:

  $$P_c = \frac{TP_c}{TP_c + FP_c}, \qquad R_c = \frac{TP_c}{TP_c + FN_c}, \qquad F1_c = \frac{2\,P_c\,R_c}{P_c + R_c}$$
- Macro/micro-averaged F1: macro-averaging takes the unweighted mean of per-class F1, while micro-averaging pools true/false positives and negatives over all classes; Hamming loss is used for multi-label skill-domain settings (Vargovich et al., 2023).
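These metrics can be computed from scratch to make the definitions explicit; this sketch implements per-class F1, the macro and micro averages, and Hamming loss for the multi-label case:

```python
def f1_report(y_true: list[str], y_pred: list[str], labels: list[str]) -> dict:
    """Per-class F1 plus macro- and micro-averaged F1."""
    per_class = {}
    tp_all = fp_all = fn_all = 0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        per_class[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        tp_all += tp; fp_all += fp; fn_all += fn
    macro = sum(per_class.values()) / len(labels)
    mp = tp_all / (tp_all + fp_all) if tp_all + fp_all else 0.0
    mr = tp_all / (tp_all + fn_all) if tp_all + fn_all else 0.0
    micro = 2 * mp * mr / (mp + mr) if mp + mr else 0.0
    return {"per_class": per_class, "macro_f1": macro, "micro_f1": micro}

def hamming_loss(y_true: list[set], y_pred: list[set], labels: list[str]) -> float:
    """Fraction of (issue, label) decisions that are wrong -- the
    multi-label metric used in the skill-domain setting."""
    wrong = sum((c in t) != (c in p)
                for t, p in zip(y_true, y_pred) for c in labels)
    return wrong / (len(y_true) * len(labels))
```

Micro-averaging weights each issue equally, so it tracks performance on frequent classes; macro-averaging weights each class equally, which is why the rare "question" class drags macro scores down on unbalanced data.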
Performance highlights from major approaches include:
| Model | Data/setting | F1-score (macro/micro) | Notable Per-Class Results |
|---|---|---|---|
| fastText (TT) | Balanced | 0.83 | Bug/Enhancement/Question: 0.83/0.82/0.83 |
| fastText (TT) | Unbalanced | — | Bug/Enhancement/Question F1: 0.75/0.74/0.48; “question” is notably harder |
| Random Forest (GMLI) | Per-project | 0.817 (macro) | Precision up to 0.872 (Audacity). |
| seBERT | Large-scale | 0.857 (micro) | Outperforms fastText by +4.1%; biggest gain for “question” class (+12.8%) (Trautsch et al., 2022). |
| GPT-3.5-turbo | Repo-level | ~0.83 (macro) | Repo-specific F1 to 0.87 (tensorflow), precision up to 0.93, recall up to 0.95 (Aracena et al., 2024) |
In all evaluated studies, transformer models (seBERT, GPT-based) outperform classical baselines (fastText, Random Forest) both in overall and per-class metrics, particularly in recall and F1 for the “question” label (Trautsch et al., 2022, Aracena et al., 2024).
6. Deployment and Integration Workflows
Deployed systems range from cloud microservices to custom bots:
- Ticket Tagger: Packaged as a Node.js GitHub App; on “issue opened” events, concatenates title/body and predicts label using a pre-trained fastText model, updating labels via the GitHub REST API. The system is optimized for rapid inference on low-cost instances (e.g., AWS t2.nano) with a lightweight (~5 MB) footprint (Kallis et al., 2021).
- GiveMeLabeledIssues: Packaged as a Django REST API, supporting nightly (cron) batch classification and SQLite caching for open issues. Model retraining and new project onboarding are automated via project registration scripts and TF–IDF+RandomForest/BERT pipelines (Vargovich et al., 2023).
- LLM/GPT approaches: Fine-tuned models are deployed behind prompt-based APIs; project adaptation involves labeled data collection, prompt engineering, fine-tuning per project, and direct RESTful inference (Aracena et al., 2024).
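The Ticket Tagger-style event flow can be sketched end to end: receive an "issue opened" webhook payload, classify the concatenated title and body, and prepare the GitHub REST call that applies the label. The `classify` stub here is a hypothetical placeholder for the trained fastText model:

```python
def classify(text: str) -> str:
    """Placeholder for the trained model -- a keyword heuristic
    standing in for fastText inference."""
    return "bug" if "crash" in text.lower() else "enhancement"

def handle_issue_opened(payload: dict) -> tuple[str, dict]:
    """Turn a GitHub 'issues opened' webhook payload into the REST call
    that applies the predicted label:
    POST /repos/{owner}/{repo}/issues/{number}/labels"""
    issue = payload["issue"]
    label = classify(f"{issue['title']} {issue.get('body') or ''}")
    repo = payload["repository"]["full_name"]
    url = f"https://api.github.com/repos/{repo}/issues/{issue['number']}/labels"
    return url, {"labels": [label]}

url, body = handle_issue_opened({
    "issue": {"number": 7, "title": "Crash on save", "body": None},
    "repository": {"full_name": "octocat/hello-world"},
})
```

Separating payload handling from model inference keeps the webhook handler fast, which matters for the low-cost instances the deployed system targets.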
7. Limitations, Threats to Validity, and Future Directions
Recognized concerns across studies include:
- Label space restriction: Most intent-type classifiers are limited to bug/enhancement/question; project-specific or fine-grained taxonomies require retraining and/or feature schema redesign (Kallis et al., 2021).
- Data distribution shift: Balanced training does not mirror production skew; model performance for rare classes (especially “question”) is lower and confusion is higher (Kallis et al., 2021, Trautsch et al., 2022).
- Generalizability and external validity: Datasets are typically sampled from specific timeframes or repositories, possibly limiting cross-project or temporal applicability (Kallis et al., 2021, Vargovich et al., 2023).
- Feature exclusion: Current best-performing models generally do not leverage non-textual (metadata, code structure) features, although such signals may improve future models (Kallis et al., 2021, Vargovich et al., 2023).
- Sequence length constraints: Models like seBERT are capped at 128 tokens; long-form issues may be incompletely encoded unless longer-span models (e.g., Big Bird) are employed (Trautsch et al., 2022).
- Few-shot adaptability: GPT-3.5-turbo enables effective project-specific classifiers with only 100–300 examples per class, suggesting robustness in low-data regimes, but no cross-repo generalization evaluation is reported (Aracena et al., 2024).
This suggests that further gains may come from integrating metadata, dynamically adjusting class thresholds, or using adaptive sequence windows. Model selection for a given deployment should match the project's label schema, available data volume, and required prediction granularity.
8. Extensions: Skill-Oriented and Multi-Label Classification
Recent advances have extended issue classification to skill and API domain assignment:
- API-Domain Labeling: GiveMeLabeledIssues classifies issues with respect to the APIs affected in their resolving PRs, resulting in a 31-category multi-label vector where each label reflects a skill required (e.g., “DB,” “Test”) (Vargovich et al., 2023). This responds to a need for better developer-task matching and onboarding.
- Hybrid Feature Sets: Integration of code-level (API usage) evidence with text-based features is crucial; code usage is parsed directly from PRs using language-specific patterns.
- Model Selection: Multi-label Random Forests and optionally BERT for sequence representation are the principal classifiers; macro-averaged F1 and Hamming Loss are the primary evaluation metrics.
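The static API-usage extraction described above can be sketched with a regex-based import scanner. The module-to-domain mapping here is illustrative; the real tool derives its 31 domains from each project's own API surface and supports multiple languages' import/include/using forms:

```python
import re

# Hypothetical mapping from imported modules to API-domain labels.
DOMAIN_MAP = {
    "sqlite3": "DB", "unittest": "Test", "socket": "Networking",
    "ssl": "Security", "tkinter": "UI",
}

# Matches Python-style `import x` and `from x import y` statements.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_][\w.]*)", re.MULTILINE)

def api_domains(source: str) -> set[str]:
    """Static sketch of API-usage extraction: scan a changed file's
    import statements and map top-level modules to domain labels."""
    modules = {m.split(".")[0] for m in IMPORT_RE.findall(source)}
    return {DOMAIN_MAP[m] for m in modules if m in DOMAIN_MAP}

domains = api_domains("import sqlite3\nfrom unittest import mock\nimport os\n")
```

Each issue's multi-label target is then the union of domains over all source files touched by its resolving pull request, which is why the labels are not mutually exclusive.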
The practical implication is that tools supporting both intent and skill-domain classification are increasingly relevant for open source communities seeking to optimize both maintenance and contributor engagement pipelines.