Industrial CQA Datasets
- Industrial CQA datasets are curated collections of real or simulated data used to train and evaluate AI systems for quality assurance tasks like defect detection, process monitoring, and query answering in industrial settings.
- These datasets span diverse modalities including images, time series, multi-modal sensor data, knowledge graphs, and text, reflecting the varied applications in manufacturing, process control, and infrastructure monitoring.
- Benchmarking using these datasets relies on task-specific metrics covering supervised vision, anomaly detection, time-series analysis, knowledge reasoning, and causal discovery to ensure reproducible evaluation and comparison of AI models.
Industrial Computerized Quality Assurance (CQA) datasets refer to collections of curated, real-world or simulated data designed to benchmark, train, and evaluate automated systems for detecting defects, classifying products, monitoring processes, answering quality-related queries, or modeling knowledge and causal relationships within industrial settings. These datasets form the empirical foundation for modern AI-driven quality assurance solutions in manufacturing, process industries, and infrastructure monitoring, with applications spanning visual inspection, predictive maintenance, knowledge graph reasoning, process control, and conversational interaction.
1. Types and Taxonomy of Industrial CQA Datasets
Industrial CQA datasets are diverse—spanning multiple data modalities, application scenarios, and levels of annotation granularity. The primary categories include:
- Surface Defect and Visual Inspection Datasets: Image-based datasets for detecting and classifying visible defects in manufactured products (e.g., NEU-CLS/DET, DAGM, KolektorSDD, PCB Defect Dataset, Hollow Cylindrical Defect Dataset, VISION Datasets, CXR-AD, MVTec AD).
- Multi-modal and Process-Centric Datasets: Sensor-rich time-series and synchronized vision datasets for capturing process and assembly line data (e.g., Analog and Multi-modal Manufacturing Datasets from the Future Factories Platform) (2401.15544, 2502.05020).
- Knowledge Graph and Complex Query Answering Datasets: Synthetic or real KGs for evaluating logical reasoning, property prediction, and multi-hop query answering over industrial knowledge (e.g., -CQA) (2307.13701).
- Causal Discovery and Process Topology Datasets: Collections with ground-truth causal structures capturing continuous industrial processes (e.g., CIPCaD-Bench: Tennessee Eastman and Ultra-Processed Food datasets) (2208.01529).
- Community and Conversational QA Datasets: Datasets for evaluating and training industrial CQA systems in knowledge-sharing and question-answering contexts (e.g., StackOverflow, ProCQA, FailureSensorIQ, S2M, ComRAG datasets) (2103.03583, 2506.03278, 2312.16511, 2506.21098).
- Software Model Engineering Datasets: Traces and evolution logs for model completion or change prediction tasks in industrial software systems (e.g., RAMC with industrial SysML history) (2406.17651).
A summary of representative datasets and their orientations is given below:
| Dataset/Class | Primary Modality | Typical Application |
|---|---|---|
| NEU-CLS, DAGM, KolektorSDD, VISION, CXR-AD | Image (RGB/X-ray) | Surface/internal defect detection |
| Future Factories Platform (V1, V2) | Multi-modal (sensors + vision) | Process QA, anomaly and safety detection |
| CIPCaD-Bench, UF | Time series + ground-truth causal graph | Causal inference, process control |
| -CQA | Knowledge graph | Complex query answering |
| FailureSensorIQ, ProCQA, S2M, ComRAG | Text, MCQA, logs | Community QA, sensor/failure-mode reasoning |
| Industrial Machine Tool Defect Dataset | Image + segmentation masks | Prognostics, wear, segmentation |
| RAMC (model evolution) | Change logs, model graphs | Software model completion |
2. Data Collection Methods and Annotation Strategies
Industrial CQA datasets are typically created using one or more of the following approaches:
- Direct Acquisition in Industrial Settings: Data collected from real facilities, e.g., sensor streams, production line imagery, annotated defect samples (see Future Factories Platform datasets, Industrial Machine Tool Defect Dataset) (2401.15544, 2502.05020, 2103.13003).
- Controlled Defect Injection: Researchers insert known faults (missing parts, incorrect assemblies) into the process to ensure labeled examples of anomalies for supervised learning.
- Annotation by Domain Experts: Defect masks, cycle states, or question/answer correctness are labeled by experienced engineers (critical in ambiguous or noisy cases). For instance, pixel-level defect masking in CXR-AD and VISION Datasets (2505.03412, 2306.07890).
- Synthetic and Simulated Data: For datasets with limited real-world availability (e.g., DAGM, Tennessee Eastman), simulation generates ground-truth data under controlled conditions.
- Schema-based or Standard-based Construction: MCQA datasets such as FailureSensorIQ leverage ISO standards and failure mode tables, ensuring domain compliance and relevance (2506.03278).
- Algorithmic Data Generation: For knowledge/reasoning benchmarks (e.g., -CQA), automated enumeration of logical query graphs, isomorphism filtering, and CSP-based answer computation ensure theoretical coverage (2307.13701).
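To make the last point concrete, here is a minimal sketch of conjunctive query answering over a toy industrial knowledge graph, with brute-force enumeration standing in for proper constraint propagation; the graph, relation names, and query are invented for illustration and do not reflect the cited benchmark's actual construction.

```python
# Minimal sketch of answer computation for a conjunctive query over a
# toy knowledge graph. All entities and relations are illustrative.
from itertools import product

# Toy KG: a set of (head, relation, tail) triples.
TRIPLES = {
    ("line_1", "produces", "part_a"),
    ("line_1", "produces", "part_b"),
    ("line_2", "produces", "part_b"),
    ("part_a", "inspected_by", "station_x"),
    ("part_b", "inspected_by", "station_y"),
}

def answer_conjunctive_query(atoms, variables):
    """Enumerate all variable bindings that satisfy every query atom.

    atoms: list of (head, relation, tail) where head/tail may be
           variable names (strings starting with '?') or constants.
    variables: ordered list of variable names to bind.
    """
    entities = {h for h, _, _ in TRIPLES} | {t for _, _, t in TRIPLES}
    answers = []
    for binding in product(entities, repeat=len(variables)):
        env = dict(zip(variables, binding))
        ground = lambda term: env.get(term, term)
        if all((ground(h), r, ground(t)) in TRIPLES for h, r, t in atoms):
            answers.append(env)
    return answers

# Two-hop query: which stations inspect parts produced on line_1?
query = [("line_1", "produces", "?y"), ("?y", "inspected_by", "?z")]
print(answer_conjunctive_query(query, ["?y", "?z"]))
# -> bindings such as {'?y': 'part_a', '?z': 'station_x'}, ...
```

Real benchmark generators additionally enumerate query shapes and filter isomorphic query graphs before computing answers, steps this sketch omits.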
Annotation strategies range from image-level labels to pixel-precise segmentation masks, sensor-state and binary-anomaly columns, logical answer tuples, and cycle-state encodings. Quality control often involves multiple rounds of validation, especially in datasets where small differences are technically significant. Illustrative annotation records at several of these granularities are sketched below.
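As a hedged illustration of these granularities, the records below show what an image-level label, a pixel-mask annotation, and a standard-derived MCQA item might look like; every field and file name is hypothetical rather than taken from a published dataset schema.

```python
# Hypothetical annotation records at three granularities; field and
# file names are illustrative, not a published dataset schema.
image_level = {
    "image": "steel_coil_0412.png",
    "label": "scratch",  # single image-level class label
}

pixel_level = {
    "image": "casting_0091.png",
    "mask": "casting_0091_mask.png",  # per-pixel defect segmentation mask
    "defect_types": ["porosity"],
}

mcqa_item = {
    "question": "Which signal most directly indicates bearing wear?",
    "choices": ["vibration spectrum", "ambient humidity",
                "badge reader", "door switch"],
    "answer": 0,  # index of the correct choice
    "provenance": "failure-mode table derived from an ISO standard",
}

# A simple consistency check of the kind a validation round might run.
assert mcqa_item["answer"] in range(len(mcqa_item["choices"]))
```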
3. Benchmarking, Metrics, and Evaluation Protocols
The evaluation of models using industrial CQA datasets relies on task-appropriate metrics; a combined reference sketch follows the list:
- Supervised Visual Inspection: Metrics include accuracy, precision, recall, F1-score, ROC-AUC, mean Average Precision (mAP), and mean Intersection-over-Union (mIoU), as applied, for example, to the VISION Datasets.
- Unsupervised/Anomaly Detection: AUROC (per-pixel and per-image) is standard (e.g., ONENIP, PatchCore, AdaCLIP on CXR-AD) (2505.03412).
- Time-Series and Multi-Modal Tasks: Root cause localization, sensor imputation accuracy, predictive maintenance error (e.g., RMSE for future defect progression, classification accuracy for anomaly cycles).
- Knowledge and Reasoning Benchmarks: Ranking-based metrics (Mean Reciprocal Rank, NDCG@K, HIT@K), marginal vs. joint evaluation (cf. -CQA), validity checks for answers under open-world assumption (2307.13701, 2103.03583).
- Community QA and MCQA: Accuracy, uncertainty-adapted accuracy, set size, and robustness under perturbations are used for datasets like FailureSensorIQ.
- Causal Discovery: Counts of true/false positives and negatives, Structural Hamming Distance (SHD), precision/recall, F1, and error analysis with respect to ground-truth graphs (2208.01529).
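The sketch below gives hedged, plain implementations of representative members of these metric families (binary precision/recall/F1, image-level AUROC, RMSE, MRR and Hits@K, and SHD); actual benchmarks ship their own official evaluation scripts, and the SHD convention shown (one count per mismatched node pair) is only one common variant.

```python
# Hedged reference implementations of several metric families above;
# for published benchmarks, prefer the official evaluation scripts.
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_fscore_support

# --- Supervised inspection: precision / recall / F1 on binary labels ---
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")

# --- Anomaly detection: image-level AUROC from continuous scores ---
scores = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.7])
auroc = roc_auc_score(y_true, scores)

# --- Time-series QA: RMSE for, e.g., defect-progression forecasts ---
def rmse(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

# --- KG reasoning / retrieval: MRR and Hits@K from gold-answer ranks ---
def mrr(ranks):
    """Mean reciprocal rank; ranks are 1-indexed positions of gold answers."""
    return float(np.mean([1.0 / r for r in ranks]))

def hits_at_k(ranks, k):
    """Fraction of queries whose gold answer appears in the top k."""
    return float(np.mean([r <= k for r in ranks]))

# --- Causal discovery: SHD on directed adjacency matrices ---
def shd(a_true, a_pred):
    """One count per node pair whose edge configuration differs, so a
    reversed edge costs 1, not 2 (one common convention)."""
    n = a_true.shape[0]
    return sum(
        (a_true[i, j], a_true[j, i]) != (a_pred[i, j], a_pred[j, i])
        for i in range(n) for j in range(i + 1, n)
    )

print(f"P={prec:.2f} R={rec:.2f} F1={f1:.2f} AUROC={auroc:.2f}")
print(mrr([1, 3, 2, 10]), hits_at_k([1, 3, 2, 10], 3))
A_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
A_pred = np.array([[0, 0, 0], [1, 0, 1], [0, 0, 0]])
print(shd(A_true, A_pred))  # 1: the 0->1 edge is reversed
```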
These protocols facilitate reproducible benchmarking and enable fair comparison among algorithms under varying noise, complexity, and industrial fidelity.
4. Key Applications and Impact
Industrial CQA datasets underpin a range of applications central to modern manufacturing and infrastructure reliability:
- Automated Visual Inspection: Enabled by instance segmentation, defect classification, and anomaly detection models trained on datasets such as VISION, NEU-CLS, and CXR-AD.
- Predictive Maintenance & Condition Monitoring: Time-series/multi-modal datasets support sequence models that forecast failures, impute missing sensor data, or detect process drifts.
- Worker Safety Analysis: Dataset features such as annotated PPE compliance in vision streams (Future Factories Platform V2, VISION Datasets) facilitate automated monitoring of occupational safety (2502.05020).
- Root Cause Analysis & Process Control: Datasets containing process context, cycle states, and causal graph ground-truth allow supervised or causal learning for faster fault localization (CIPCaD-Bench).
- Knowledge-driven Query Answering: CQA datasets for MCQA and knowledge graphs support the deployment of AI assistants and industrial RAG systems that integrate historical Q&A with static documentation (e.g., ComRAG); a minimal retrieval sketch follows this list.
- Software Maintenance and Model Completion: Software evolution datasets support AI-powered model completion, facilitating efficient engineering in domains such as railway control systems (2406.17651).
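To illustrate how historical Q&A and static documentation can be retrieved together, the following sketch implements a minimal TF-IDF retrieval step over a mixed corpus; it is not the ComRAG architecture, and all corpus strings, the `retrieve` helper, and the prompt format are invented for illustration.

```python
# Minimal retrieval-augmented QA sketch over a mixed corpus of
# historical Q&A and static documentation. Illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Q: Spindle vibration spikes after tool change. "
    "A: Re-check the tool holder taper.",                       # historical Q&A
    "Manual 4.2: Coolant pressure must stay within 3-5 bar.",   # static doc
    "Q: Coolant alarm at startup. "
    "A: Bleed the line and verify pump priming.",               # historical Q&A
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)

def retrieve(query, k=2):
    """Return the k corpus entries most similar to the query."""
    q_vec = vectorizer.transform([query])
    sims = cosine_similarity(q_vec, doc_matrix).ravel()
    top = sims.argsort()[::-1][:k]
    return [(corpus[i], float(sims[i])) for i in top]

context = retrieve("coolant pressure alarm")
prompt = "Answer using the context:\n" + "\n".join(c for c, _ in context)
print(prompt)  # this prompt would then be passed to an LLM
```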
The impact is visible in improvements in inspection accuracy, reduction of false positives/escapes in QA pipelines, acceleration of feature engineering (e.g., with LLMFeatureSelector in FailureSensorIQ), and enhanced explainability and traceability.
5. Technical Challenges and Open Issues
Despite the breadth of available datasets, several technical and methodological challenges persist:
- Limited Coverage of Internal and Non-Visible Defects: Surface-focused datasets prevail; CXR-AD (2505.03412) addresses internal defect benchmarking under X-ray but further expansion is needed for ultrasonic, thermal, and more complex modalities.
- Noise, Low Contrast, and Complexity: Industrial imagery can exhibit low SNR, complex backgrounds, and ambiguous defect boundaries, especially in X-ray imaging and cluttered process environments.
- Multi-scale, Multi-type Anomalies: Many real-world settings exhibit highly imbalanced defect size distributions and morphologies (in CXR-AD, 87% of defects are small and show very low local contrast).
- Insufficient Multi-Modal and Longitudinal Datasets: While future factories datasets (2401.15544, 2502.05020) offer modular sensor-vision streams, large-scale labeled, synchronized, and scenario-rich examples remain rare.
- Complex Knowledge Reasoning & Joint Answering: Operator tree-based CQA benchmarks underrepresent cycles/multigraphs and multi-variable queries, impeding the development of robust, generalizing knowledge reasoning models (-CQA directly addresses this).
- Brittleness and Calibration of AI Models: Evaluation of LLM-based approaches (FailureSensorIQ, ComRAG, RAMC) shows brittleness to input perturbations and highlights knowledge/explainability gaps in current AI deployments.
- Data Scarcity and Annotation Cost: Especially in high-mix, low-volume industries, collecting enough defect cases for statistically robust training and benchmarking remains a limiting factor.
6. Benchmarking Practices and Dataset Accessibility
Best practices in CQA dataset design are exemplified by:
- PRISMA-driven systematic reviews to survey and collate benchmarks for defect detection research (2406.07694).
- Careful train/test/validation splits constructed via similarity graphs, cycle-based or content-based hashing, and deduplication (VISION Datasets) to prevent data leakage and overfitting; a minimal hash-based split sketch follows this list.
- Community involvement through open-source licensing and challenge competitions, facilitating both academic progress and industrial feedback loops (e.g., VISION, FailureSensorIQ).
- Annotations, evaluation scripts, and code released alongside datasets, ensuring transparency and reproducibility (see links in dataset documentation).
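As an illustration of leakage-aware splitting, the sketch below groups items by content hash and assigns whole groups to a single split, so duplicates can never straddle train and test; the md5-based grouping only catches byte-identical files, and the VISION Datasets describe their own similarity-graph procedure.

```python
# Sketch of a leakage-aware split via content hashing: duplicate
# items share a hash, and every hash group lands in exactly one split.
import hashlib
from collections import defaultdict

def content_hash(path):
    """Hash raw file bytes; catches exact duplicates only."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def grouped_split(paths, train_frac=0.8):
    groups = defaultdict(list)
    for p in paths:
        groups[content_hash(p)].append(p)   # duplicates collapse here
    train, test = [], []
    keys = sorted(groups)                   # deterministic assignment order
    cut = int(len(keys) * train_frac)
    for i, k in enumerate(keys):
        (train if i < cut else test).extend(groups[k])
    return train, test
```

Near-duplicate detection (perceptual hashing, embedding similarity) would replace `content_hash` in a more realistic pipeline.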
A trend toward larger, more realistic, multi-faceted datasets, with comprehensive annotation, close alignment with industrial scenarios, and transparent metrics, has made industrial CQA benchmarking a robust and evolving field.
Industrial CQA datasets are central artifacts enabling controlled development, rigorous benchmarking, and effective deployment of AI-driven quality assurance systems in modern production. Their progression mirrors advances in computing, sensing, and annotation capability, and addresses the growing demand for transparency, explainability, and reliability in industrial automation. Datasets reviewed here—ranging from classic steel surface benchmarks, through multi-modal smart factory streams, to challenging internal defect and knowledge reasoning corpora—provide the empirical and algorithmic grounding necessary for the next generation of automated industrial quality assurance solutions.