SQuaD: Software Quality Dataset Overview
- SQuaD is a comprehensive dataset combining expert-annotated quality-in-use reviews and a multi-dimensional corpus of project metrics for robust software evaluation.
- The quality-in-use component provides sentence-level annotations from user reviews, supporting supervised learning and fine-grained topic and sentiment classification.
- The project corpus aggregates static and process metrics from 450 open-source projects, enabling advanced analyses in defect prediction and technical debt assessment.
The Software Quality Dataset (SQuaD) refers to two distinct resources in contemporary software engineering research: (1) a sentence-level, expert-annotated benchmark for software quality-in-use (QinU) assessment built from user reviews, and (2) a large multi-dimensional corpus of static and process metrics extracted from 450 large-scale open-source software projects. Both datasets are empirical assets supporting automated software quality evaluation, quality-in-use mining, technical debt analysis, defect prediction, and software analytics at scale (Atoum et al., 2015; Robredo et al., 14 Nov 2025).
1. Scope and Structure of SQuaD Datasets
1.1 SQuaD Benchmark for Software Quality-in-Use
The SQuaD pilot dataset for software quality-in-use comprises user reviews collected from Amazon.com and Cnet.com, which are manually split into “atomic” sentences and annotated by domain experts. It is structured to support supervised learning, evaluation, and benchmarking of models for classifying sentence-level software quality topics (Atoum et al., 2015). The composition is as follows:
| Component | Value | Description |
|---|---|---|
| Reviews collected | 867 | 10 reviews for each star level (1–5) per domain |
| Raw sentences post-splitting | 3,013 | Automatic + manual sentence splitting |
| Gold-standard annotated sentences | 2,036 | After reconciliation/elimination |
| Software domains | Multiple | Desktop, productivity, finance, utilities, office suites |
| Topics covered | 3 | Effectiveness, Efficiency, Freedom from Risk (per ISO/IEC 25010) |
1.2 SQuaD: Multi-Dimensional Software Quality Dataset
The SQuaD corpus for quantitative software quality analysis consists of:
- 450 mature open-source software projects across Apache, Mozilla, FFmpeg, and Linux kernel ecosystems
- 63,586 project release/tag snapshots, spanning ~9 years on average per project
- Over 700 unique, static and process metrics extracted from the integration of nine state-of-the-art static analysis tools (SATs), version control systems, issue trackers, and vulnerability databases (Robredo et al., 14 Nov 2025)
| Axis | Value/Range | Description |
|---|---|---|
| Projects | 450 | Non-forked, active, >50 stars, >3 contributors |
| Release snapshots | 63,586 | Every tagged release from first to latest |
| Lines of code | >7×10⁸ | Aggregated across all releases |
| Metric granularity | Method, class, file, project | Cyclomatic complexity, code smells, technical debt, etc. |
2. Annotation and Metric Extraction Methodologies
2.1 Quality-in-Use Dataset Annotation Protocol
Annotation in the QinU SQuaD follows a rigorously defined five-step process:
- Balanced review selection and sentence atomic splitting.
- Determination of QinU relevance, assignment to one primary topic: Effectiveness, Efficiency, or Risk.
- Feature keyword identification (e.g., “fast”, “crash”), surface opinion extraction, and modifier assignment.
- Polarity labeling: +1 (positive), 0 (neutral), –1 (negative), with modifier tokens.
- Data storage: Each record contains sentence, topic, features, polarity, modifiers, and review/star context.
Majority voting governs topic/feature/polarity reconciliation (“no match eliminate” for disagreement). Each annotator’s segment is retained if at least two of three annotators agree on the label at each hierarchical decision point.
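The majority-vote reconciliation rule can be sketched in a few lines of Python (the function name and label strings here are illustrative, not part of the released dataset):

```python
from collections import Counter

def reconcile(labels, min_agree=2):
    """Keep a label only if at least `min_agree` of the annotators
    chose it; otherwise eliminate the item ("no match eliminate")."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None

# Two of three annotators agree -> label retained:
print(reconcile(["Efficiency", "Efficiency", "Risk"]))    # Efficiency
# Three-way disagreement -> sentence eliminated:
print(reconcile(["Efficiency", "Risk", "Effectiveness"]))  # None
```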
Polarity Assignment
Polarity is positive if at least one positive opinion word occurs without a stronger negative; negative polarity if the converse holds; mixed signals default to neutral.
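A simplified sketch of this decision rule, ignoring the "stronger negative" weighting for brevity (cue lists and values are hypothetical):

```python
def assign_polarity(positive_cues, negative_cues):
    """Simplified polarity rule: positive if only positive opinion
    words are present, negative if only negative ones, and neutral
    when cues are absent or mixed (mixed signals default to neutral)."""
    if positive_cues and not negative_cues:
        return +1
    if negative_cues and not positive_cues:
        return -1
    return 0

print(assign_polarity(["fast"], []))        # 1
print(assign_polarity([], ["crashes"]))     # -1
print(assign_polarity(["nice"], ["slow"]))  # 0 (mixed)
```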
Cohen’s Kappa Agreement (κ, all annotator pairs, topic labels)
- Ann1 vs. Ann2: κ = 0.46 (“moderate”)
- Ann1 vs. Ann3: κ = 0.58 (“moderate”)
- Ann2 vs. Ann3: κ = 0.69 (“substantial”)
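For reference, pairwise Cohen’s κ can be computed directly from two annotators’ label sequences; the toy labels below are invented for illustration:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b) and a
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_chance = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

ann1 = ["eff", "eff", "risk", "risk"]
ann2 = ["eff", "eff", "risk", "eff"]
print(round(cohens_kappa(ann1, ann2), 2))  # 0.5
```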
2.2 Static and Process Metric Extraction for Project Corpus
All releases are processed with a unified pipeline comprising:
- Full Git and issue tracking mining (2,622,413 commits; 628,178 issues).
- Nine static analysis tools, including SonarQube (192 metrics: code smells, technical debt, rule violations), CodeScene (temporal code health, hotspots), PMD, Understand, CK (object-oriented metrics), JaSoMe, RefactoringMiner (Java refactorings), RefactoringMiner++ (C++), and PyRef (Python).
- CVE/CWE extraction by regex from issues, then cross-enriched with NIST and MITRE APIs; traced to releases and source artifacts when possible.
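A minimal version of the CVE-identifier extraction step might look like this (the pipeline’s exact regex is not published; the pattern below follows the standard CVE-YYYY-NNNN… identifier format):

```python
import re

# CVE IDs: "CVE-", a 4-digit year, and a sequence number of 4+ digits.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b")

def extract_cves(issue_text):
    """Return the unique CVE identifiers mentioned in an issue."""
    return sorted(set(CVE_RE.findall(issue_text)))

issue = "Upgrade log4j: affected by CVE-2021-44228 and CVE-2021-45046."
print(extract_cves(issue))  # ['CVE-2021-44228', 'CVE-2021-45046']
```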
Sample metrics:
- Cyclomatic Complexity (per method): $M = E - N + 2P$ (E: CFG edges, N: nodes, P: connected components)
- Weighted Methods per Class (WMC): $\mathrm{WMC} = \sum_{i=1}^{n} \mathrm{CC}_i$, summing cyclomatic complexity over a class's $n$ methods
- Technical Debt Ratio (TDR): $\mathrm{TDR} = \dfrac{\text{remediation cost}}{\text{development cost}} \times 100\%$
- Churn, commit frequency, developer activity, and mean time between commits, computed over each release.
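As an illustration of the process-metric definitions, churn and mean time between commits (MTBC) can be derived from per-commit diff stats and timestamps; the tuple layout below is hypothetical (the released corpus stores these in the `process_metrics` collection):

```python
from datetime import datetime, timedelta
from statistics import mean

def churn_and_mtbc(commits):
    """commits: list of (timestamp, lines_added, lines_deleted) for
    one release window. Churn = total lines touched; MTBC = mean gap
    between consecutive commits, in seconds."""
    churn = sum(added + deleted for _, added, deleted in commits)
    times = sorted(ts for ts, _, _ in commits)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return churn, (mean(gaps) if gaps else 0.0)

t0 = datetime(2024, 1, 1)
commits = [(t0, 10, 2), (t0 + timedelta(hours=1), 5, 5), (t0 + timedelta(hours=3), 0, 8)]
print(churn_and_mtbc(commits))  # (30, 5400.0)
```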
3. Data Access, Format, and Example Usage
3.1 Software Quality-in-Use SQuaD
Available as CSV (also JSON/XML) with schema:
| Field | Type | Description |
|---|---|---|
| sentence_id | int | Sentence identifier |
| review_id | string | Unique review source |
| sentence_text | string | Atomic sentence |
| qinu_topic | enum | {effectiveness, efficiency, risk} |
| feature_terms | list(string) | Feature triggering label |
| opinion_word | string | Principal opinion cue |
| polarity | int | {–1, 0, +1} |
| polarity_modifiers | list(string) | Polarity modifiers |
| star_rating | int | 1–5 |
Sample annotated records:
```csv
sentence_id, review_id, sentence_text, qinu_topic, feature_terms, opinion_word, polarity
1024, "CNET_2013_4", "OpenOffice is fast", Efficiency, ["fast"], "fast", +1
2057, "AMZ_DevTools_3", "The color schemes are absolutely atrocious!", Effectiveness, ["color","schemes"], "atrocious", -1
3110, "AMZ_Finance_5", "It crashes too often especially when opening MS Office files.", Risk, ["crash"], "crashes", -1
```
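Records in this layout can be loaded with the standard library's `csv` module; the inline sample below is a reduced, invented stand-in for the distributed file:

```python
import csv
import io

# Invented two-row sample in the dataset's CSV layout:
sample = (
    "sentence_id,review_id,sentence_text,qinu_topic,polarity\n"
    "1024,CNET_2013_4,OpenOffice is fast,Efficiency,1\n"
    "3110,AMZ_Finance_5,It crashes too often.,Risk,-1\n"
)

rows = list(csv.DictReader(io.StringIO(sample)))
negatives = [r["sentence_text"] for r in rows if int(r["polarity"]) < 0]
print(negatives)  # ['It crashes too often.']
```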
3.2 Software Project Corpus SQuaD
Distributed as MongoDB BSON (Zstandard-compressed) and mirrored CSV tables, with core schema:
| Collection/Table | Key Fields |
|---|---|
| projects_data | project_id, repo_url, language, creation_date |
| commits | project_id, commit_hash, author, date, diff_stats |
| issues | project_id, issue_id, labels, tracker_type, creation/closure dates |
| release_data | project_id, release_tag, commit_hash, release_date |
| TOOL_<SAT_NAME> | project_id, release_tag, per-entity metric names and values (file/class/method granularity) |
| process_metrics | project_id, release_tag, churn, freq, DA, MTBC |
| cve_data/cwe_data | cve_id/cwe_id, description, severity, references |
| PRJ_ITS_VLN_LINKAGE | project_id, issue_id, cve_id, cwe_id |
Example code for querying CK cyclomatic complexity:
```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["SQuaD"]

cursor = db["TOOL_CK"].find(
    {"project_id": "projX", "release_tag": "v2.3.1", "metric_name": "CC"},
    {"_id": 0, "entity_name": 1, "metric_value": 1},
)
for doc in cursor:
    print(f"Method {doc['entity_name']}: CC={doc['metric_value']}")
```
4. Analytical Applications and Benchmark Results
4.1 SQuaD Quality-in-Use Dataset
Enables supervised sentence/topic classifiers, aspect-based sentiment analysis, and benchmarking of automated tools against a gold-standard set. Example baseline using logistic regression with TF–IDF features:
- QinU topic classification accuracy: ≈75%
- Macro-F₁: 0.72 (all three topics)
- Polarity classification: 82% accuracy (F₁⁺ = 0.79, F₁⁻ = 0.83)
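A baseline of this shape can be sketched with scikit-learn; the toy sentences and labels below are invented stand-ins for the annotated corpus, and the scores above come from the full dataset, not this snippet:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented stand-ins for annotated sentences (two per QinU topic):
texts = [
    "OpenOffice is fast", "It starts quickly",
    "The export feature works as expected", "Formatting is preserved correctly",
    "It crashes too often", "Lost my data after the update",
]
topics = [
    "efficiency", "efficiency",
    "effectiveness", "effectiveness",
    "risk", "risk",
]

# TF-IDF features feeding a logistic-regression topic classifier:
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, topics)
print(clf.predict(["the app crashes on startup"]))
```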
4.2 SQuaD Project Corpus Applications
Supports cross-sectional and longitudinal studies of maintainability, technical debt, code health, and quality trends, using both static and process metrics. Notable applications:
- Augmenting defect prediction models with process metrics (churn, DA) yields AUC improvements of 5–8 points over static code metrics alone.
- Quantitative analysis of refactoring actions demonstrates that Extract Method operations reduce average cyclomatic complexity by up to 15%.
Emerging uses include transformer-based sequence prediction leveraging chronologically ordered release snapshots and cross-ecosystem quality modeling.
5. Limitations and Prospective Extensions
5.1 Pilot SQuaD for Quality-in-Use
Current release omits the ISO/IEC 25010 dimensions of Satisfaction and Context Coverage. Future expansions aim to include additional review domains (mobile, web), more granular parsing, multi-lingual annotations, and scalable annotation via crowdsourcing and active learning–assisted tools. Maintaining high inter-annotator agreement at scale remains a principal challenge.
5.2 Project Corpus SQuaD
Limitations include the variable accuracy of the static analysis tools (sensitivity to language and code style), substantial resource requirements (>64 GB RAM for full import and querying), and imperfect granularity in vulnerability-to-code tracing. Future plans include additional static analyzers (e.g., CodeQL, Semgrep), automated continuous updates via CI pipelines, MySQL and further relational data exports, and finer-grained vulnerability mapping.
6. Availability and Licensing
- The pilot SQuaD software quality-in-use dataset is publicly hosted (http://www.meta-net.eu/), typically under a CC BY-SA license (Atoum et al., 2015).
- The SQuaD project metric corpus is available via ZENODO (DOI: 10.5281/zenodo.17566690) (Robredo et al., 14 Nov 2025).
- Both are maintained for reproducibility, benchmarking, and extension by empirical software engineering researchers.
7. Significance and Impact
SQuaD, in both its annotated review and mining-driven project-scale incarnations, establishes a comprehensive empirical basis for research on software quality assessment, software analytics, and automated defect prediction. By combining expert judgment, multi-level static and process features, and meticulous curation, SQuaD facilitates quantitative, comparative, and reproducible studies of software maintainability, evolution, and quality-in-use that were previously infeasible at this scale and granularity.