SQuaD: Software Quality Dataset Overview

Updated 21 November 2025
  • SQuaD is a comprehensive dataset combining expert-annotated quality-in-use reviews and a multi-dimensional corpus of project metrics for robust software evaluation.
  • The quality-in-use component provides sentence-level annotations from user reviews, supporting supervised learning and fine-grained topic and sentiment classification.
  • The project corpus aggregates static and process metrics from 450 open-source projects, enabling advanced analyses in defect prediction and technical debt assessment.

The Software Quality Dataset (SQuaD) refers to two distinct, influential resources in contemporary software engineering research: (1) a sentence-level, expert-annotated benchmark for software quality-in-use (QinU) assessment built from user reviews, and (2) a large, multi-dimensional corpus of static and process metrics extracted from 450 large-scale open-source software projects. Both datasets are important empirical assets supporting automated software quality evaluation, quality-in-use mining, technical debt analysis, defect prediction, and software analytics at scale (Atoum et al., 2015; Robredo et al., 14 Nov 2025).

1. Scope and Structure of SQuaD Datasets

1.1 SQuaD Benchmark for Software Quality-in-Use

The SQuaD pilot dataset for software quality-in-use comprises user reviews collected from Amazon.com and Cnet.com, split (automatically, then manually) into “atomic” sentences and annotated by domain experts. It is structured to support supervised learning, evaluation, and benchmarking of models for classifying sentence-level software quality topics (Atoum et al., 2015). The composition is as follows:

| Component | Value | Description |
|---|---|---|
| Reviews collected | 867 | 10 reviews per star level (1–5) per domain |
| Raw sentences post-splitting | 3,013 | Automatic plus manual sentence splitting |
| Gold-standard annotated sentences | 2,036 | After reconciliation/elimination |
| Software domains | Multiple | Desktop, productivity, finance, utilities, office suites |
| Topics covered | 3 | Effectiveness, Efficiency, Freedom from Risk (per ISO/IEC 25010) |

1.2 SQuaD: Multi-Dimensional Software Quality Dataset

The SQuaD corpus for quantitative software quality analysis consists of:

  • 450 mature open-source software projects across Apache, Mozilla, FFmpeg, and Linux kernel ecosystems
  • 63,586 project release/tag snapshots, spanning ~9 years on average per project
  • Over 700 unique, static and process metrics extracted from the integration of nine state-of-the-art static analysis tools (SATs), version control systems, issue trackers, and vulnerability databases (Robredo et al., 14 Nov 2025)

| Axis | Value/Range | Description |
|---|---|---|
| Projects | 450 | Non-forked, active, >50 stars, >3 contributors |
| Release snapshots | 63,586 | Every tagged release from first to latest |
| Lines of code | >7×10⁸ | Aggregated across all releases |
| Metric granularity | Method, class, file, project | Cyclomatic complexity, code smells, technical debt, etc. |

2. Annotation and Metric Extraction Methodologies

2.1 Quality-in-Use Dataset Annotation Protocol

Annotation in the QinU SQuaD follows a rigorously defined five-step process:

  1. Balanced review selection and sentence atomic splitting.
  2. Determination of QinU relevance and assignment to one primary topic: Effectiveness, Efficiency, or Freedom from Risk.
  3. Feature keyword identification (e.g., “fast”, “crash”), surface opinion extraction, and modifier assignment.
  4. Polarity labeling: +1 (positive), 0 (neutral), –1 (negative), with modifier tokens.
  5. Data storage: Each record contains sentence, topic, features, polarity, modifiers, and review/star context.

Majority voting governs reconciliation of topic, feature, and polarity labels; sentences with no majority match are eliminated (“no match, eliminate”). A segment is retained only if at least two of the three annotators agree on the label at each hierarchical decision point.
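
A minimal sketch of this reconciliation rule (the helper and labels are illustrative, not from the dataset’s tooling):

```python
from collections import Counter

def reconcile(labels):
    """Majority-vote reconciliation: keep a label only if at least two of
    the three annotators agree; otherwise eliminate ("no match")."""
    (label, votes), = Counter(labels).most_common(1)
    return label if votes >= 2 else None  # None = sentence eliminated

# Topic labels from three annotators for one sentence
print(reconcile(["efficiency", "efficiency", "risk"]))     # efficiency
print(reconcile(["efficiency", "risk", "effectiveness"]))  # None (eliminated)
```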

Polarity Assignment

Polarity is positive if at least one positive opinion word occurs without a stronger negative; negative polarity if the converse holds; mixed signals default to neutral.
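
This rule can be sketched as follows, assuming each opinion word carries a signed strength score (the lexicon values here are hypothetical):

```python
def assign_polarity(opinion_scores):
    """SQuaD polarity rule: positive if some positive opinion word occurs
    without a stronger negative, negative in the converse case, neutral
    when the signals balance out or no opinion word is present."""
    pos = max((s for s in opinion_scores if s > 0), default=0)
    neg = min((s for s in opinion_scores if s < 0), default=0)
    if pos > abs(neg):
        return +1
    if abs(neg) > pos:
        return -1
    return 0

# Hypothetical lexicon strengths: "fast" = +0.8, "atrocious" = -0.9
print(assign_polarity([0.8]))        # +1
print(assign_polarity([0.8, -0.9]))  # -1 (the negative cue is stronger)
print(assign_polarity([]))           # 0 (neutral)
```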

Cohen’s Kappa Agreement (κ, all annotator pairs, topic labels)

$\kappa = \frac{P_o - P_e}{1 - P_e}$

($P_o$: observed agreement; $P_e$: expected chance agreement. A computation sketch follows the pairwise results below.)

  • Ann1 vs. Ann2: κ = 0.46 (“moderate”)
  • Ann1 vs. Ann3: κ = 0.58 (“moderate”)
  • Ann2 vs. Ann3: κ = 0.69 (“substantial”)
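
Kappa can be computed directly from the formula, given two annotators’ topic labels over the same sentences (a self-contained sketch; the label vectors are illustrative):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement P_o corrected for chance agreement P_e."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n    # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n ** 2  # expected chance agreement
    return (p_o - p_e) / (1 - p_e)

ann1 = ["eff", "eff", "risk", "eff", "risk", "eff"]
ann2 = ["eff", "risk", "risk", "eff", "eff", "eff"]
print(round(cohens_kappa(ann1, ann2), 2))  # 0.25 on this toy example
```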

2.2 Static and Process Metric Extraction for Project Corpus

All releases are processed with a unified pipeline comprising:

  • Full Git and issue tracking mining (2,622,413 commits; 628,178 issues).
  • Nine static analysis tools: SonarQube (192 metrics: code smells, technical debt, rule violations), CodeScene (temporal code health, hotspots), PMD, Understand, CK (object-oriented metrics), JaSoMe, RefactoringMiner (Java refactorings), RefactoringMiner++ (C++), and PyRef (Python).
  • CVE/CWE identifiers extracted from issue text via regular expressions, enriched through the NIST and MITRE APIs, and traced to releases and source artifacts where possible (sketched below).
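
A minimal sketch of the regex step (the pattern follows the standard CVE identifier format; the issue text is illustrative):

```python
import re

# Standard CVE identifier format: CVE-<year>-<sequence number, 4+ digits>
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)

issue_text = (
    "Crash in the TLS handshake, likely the same root cause as "
    "CVE-2021-44228; see also cve-2014-0160 for a related report."
)
cve_ids = sorted({m.upper() for m in CVE_PATTERN.findall(issue_text)})
print(cve_ids)  # ['CVE-2014-0160', 'CVE-2021-44228']
```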

Sample metrics (a worked sketch in code follows this list):

  • Cyclomatic Complexity (per method $m$):

$\mathrm{CC}(m) = E - N + 2P$

($E$: CFG edges, $N$: nodes, $P$: connected components)

  • Weighted Methods per Class (WMC):

$\mathrm{WMC}(C) = \sum_{i=1}^{n} \mathrm{CC}(m_i)$

  • Technical Debt Ratio (TDR):

$\mathrm{TDR} = 100 \times \frac{\text{Remediation Effort}}{\text{Development Time} + \text{Remediation Effort}}$

  • Churn, commit frequency, developer activity, and mean time between commits, computed over each release.
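
The worked sketch promised above translates the three formulas into code; the input values are illustrative, since SQuaD ships these metrics precomputed:

```python
def cyclomatic_complexity(edges, nodes, components):
    """CC(m) = E - N + 2P over a method's control-flow graph."""
    return edges - nodes + 2 * components

def weighted_methods_per_class(method_ccs):
    """WMC(C): sum of the cyclomatic complexities of a class's methods."""
    return sum(method_ccs)

def technical_debt_ratio(remediation_effort, development_time):
    """TDR = 100 * remediation / (development + remediation), same time unit."""
    return 100 * remediation_effort / (development_time + remediation_effort)

# Illustrative values: a CFG with 9 edges, 8 nodes, 1 connected component
print(cyclomatic_complexity(9, 8, 1))         # 3
print(weighted_methods_per_class([3, 1, 5]))  # 9
print(technical_debt_ratio(40, 760))          # 5.0 (%)
```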

3. Data Access, Format, and Example Usage

3.1 Software Quality-in-Use SQuaD

Available as CSV (also JSON/XML) with schema:

| Field | Type | Description |
|---|---|---|
| sentence_id | int | Sentence identifier |
| review_id | string | Unique review source |
| sentence_text | string | Atomic sentence |
| qinu_topic | enum | {effectiveness, efficiency, risk} |
| feature_terms | list(string) | Feature terms triggering the label |
| opinion_word | string | Principal opinion cue |
| polarity | int | {–1, 0, +1} |
| polarity_modifiers | list(string) | Polarity modifiers |
| star_rating | int | 1–5 |

Sample annotated records:

```csv
sentence_id, review_id, sentence_text, qinu_topic, feature_terms, opinion_word, polarity
1024, "CNET_2013_4", "OpenOffice is fast", Efficiency, ["fast"], "fast", +1
2057, "AMZ_DevTools_3", "The color schemes are absolutely atrocious!", Effectiveness, ["color","schemes"], "atrocious", -1
3110, "AMZ_Finance_5", "It crashes too often especially when opening MS Office files.", Risk, ["crash"], "crashes", -1
```
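
The CSV loads directly with pandas; the column names follow the schema table above (the file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("squad_qinu.csv")  # placeholder file name; schema as above

# Sentence counts per quality-in-use topic
print(df["qinu_topic"].value_counts())

# Negative efficiency sentences from low-star reviews
mask = (
    (df["qinu_topic"] == "efficiency")
    & (df["polarity"] == -1)
    & (df["star_rating"] <= 2)
)
print(df.loc[mask, ["sentence_text", "opinion_word"]].head())
```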

3.2 Software Project Corpus SQuaD

Distributed as MongoDB BSON (Zstandard-compressed) and mirrored CSV tables, with core schema:

| Collection/Table | Key Fields |
|---|---|
| projects_data | project_id, repo_url, language, creation_date |
| commits | project_id, commit_hash, author, date, diff_stats |
| issues | project_id, issue_id, labels, tracker_type, creation/closure dates |
| release_data | project_id, release_tag, commit_hash, release_date |
| TOOL_<SAT_NAME> | project_id, release_tag, [file |
| process_metrics | project_id, release_tag, churn, freq, DA, MTBC |
| cve_data/cwe_data | cve_id/cwe_id, description, severity, references |
| PRJ_ITS_VLN_LINKAGE | project_id, issue_id, cve_id, cwe_id |

Example code for querying CK cyclomatic complexity:

```python
from pymongo import MongoClient

# Connect to a local SQuaD import and select the CK metrics collection
client = MongoClient("mongodb://localhost:27017")
db = client["SQuaD"]

# Per-method cyclomatic complexity for one release of one project
cursor = db["TOOL_CK"].find(
    {"project_id": "projX", "release_tag": "v2.3.1", "metric_name": "CC"},
    {"_id": 0, "entity_name": 1, "metric_value": 1},
)
for doc in cursor:
    print(f"Method {doc['entity_name']}: CC={doc['metric_value']}")
```

4. Analytical Applications and Benchmark Results

4.1 SQuaD Quality-in-Use Dataset

Enables supervised sentence/topic classifiers, aspect-based sentiment analysis, and benchmarking of automated tools against a gold-standard set. An example baseline uses logistic regression with TF–IDF features (a reproduction sketch follows these results):

  • QinU topic classification accuracy: ≈75%
  • Macro-F₁: 0.72 (all three topics)
  • Polarity classification: 82% accuracy (F₁⁺ = 0.79, F₁⁻ = 0.83)
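
Such a baseline can be reproduced in a few lines with scikit-learn; this sketch is not the original configuration (the file name, split, and hyperparameters are assumptions):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("squad_qinu.csv")  # placeholder name, schema as in Section 3.1
X_train, X_test, y_train, y_test = train_test_split(
    df["sentence_text"], df["qinu_topic"],
    test_size=0.2, random_state=42, stratify=df["qinu_topic"],
)

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # unigram + bigram TF-IDF
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # per-topic P/R/F1
```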

4.2 SQuaD Project Corpus Applications

Supports cross-sectional and longitudinal studies of maintainability, technical debt, code health, and quality trends, using both static and process metrics. Notable applications (a setup sketch follows this list):

  • Augmenting defect prediction models with process metrics (churn, DA) yields AUC improvements of 5–8 points over static code metrics alone.
  • Quantitative analysis of refactoring actions demonstrates that Extract Method operations reduce average cyclomatic complexity by up to 15%.
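
The static-only versus static-plus-process comparison can be set up as below; the feature and label columns are illustrative, and the cited AUC gains are the published results, not this snippet’s output:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("release_features.csv")  # hypothetical join of SQuaD tables
static_cols = ["wmc", "cc_mean", "code_smells", "tdr"]         # illustrative
process_cols = ["churn", "commit_freq", "developer_activity"]  # illustrative
y = df["is_defective"]                                         # illustrative label

for name, cols in [("static only", static_cols),
                   ("static + process", static_cols + process_cols)]:
    auc = cross_val_score(RandomForestClassifier(random_state=0),
                          df[cols], y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```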

Emerging uses include transformer-based sequence prediction leveraging chronologically ordered release snapshots and cross-ecosystem quality modeling.

5. Limitations and Prospective Extensions

5.1 Pilot SQuaD for Quality-in-Use

The current release omits the ISO/IEC 25010 dimensions of Satisfaction and Context Coverage. Future expansions aim to include additional review domains (mobile, web), more granular parsing, multilingual annotations, and scalable annotation via crowdsourcing and active learning–assisted tools. Maintaining high inter-annotator agreement at scale remains a principal challenge.

5.2 Project Corpus SQuaD

Limitations include the variable accuracy of static analysis tools across languages and coding styles, substantial resource requirements (>64 GB of RAM for a full import and querying), and imperfect granularity in vulnerability-to-code tracing. Future plans include additional static analysis tools (e.g., CodeQL, Semgrep), automated continuous updates via CI pipelines, MySQL and further relational exports, and finer-grained vulnerability mapping.

6. Availability and Licensing

  • The pilot SQuaD software quality-in-use dataset is publicly hosted (http://www.meta-net.eu/), typically under a CC BY-SA license (Atoum et al., 2015).
  • The SQuaD project metric corpus is available via ZENODO (DOI: 10.5281/zenodo.17566690) (Robredo et al., 14 Nov 2025).
  • Both are maintained for reproducibility, benchmarking, and extension by empirical software engineering researchers.

7. Significance and Impact

SQuaD, in both its annotated review and mining-driven project-scale incarnations, establishes a comprehensive empirical basis for research on software quality assessment, software analytics, and automated defect prediction. By combining expert judgment, multi-level static and process features, and meticulous curation, SQuaD facilitates quantitative, comparative, and reproducible studies of software maintainability, evolution, and quality-in-use that were previously infeasible at this scale and granularity.
