AutoDFBench 1.0: Forensic Benchmarking Framework
- AutoDFBench 1.0 is an open-source, modular benchmarking framework standardizing the evaluation of digital forensic tools using extensive ground-truth data from NIST CFTT suites.
- It employs a three-layered architecture with REST APIs, CSV ingestion, and MySQL storage, enabling isolated, reproducible assessments across five forensic domains.
- Empirical validation demonstrates near-perfect scoring across string search, file carving, and registry recovery, confirming its precision and extensibility for industry use.
AutoDFBench 1.0 is an open-source, modular benchmarking framework designed for rigorous and reproducible evaluation of both conventional and AI-generated digital forensic tools and scripts. Developed atop the NIST Computer Forensics Tool Testing (CFTT) suites, AutoDFBench 1.0 introduces the first unified, automated, and extensible standard to benchmark and validate digital forensic (DF) technologies across the five core CFTT domains: string search, deleted file recovery, file carving, Windows registry recovery, and SQLite data recovery. By incorporating test case ground truth data for 63 core cases and nearly 11,000 unique test scenarios, AutoDFBench enables transparent, statistically precise, and reproducible tool assessments for tool vendors, researchers, practitioners, and standardisation bodies (Wickramasekara et al., 18 Dec 2025).
1. Architectural Principles and Modularity
AutoDFBench 1.0 implements a modular, three-layered architecture facilitating the isolation, extensibility, and automation of benchmarking across diverse forensic domains:
- Score Calculation Layer: Suite-specific modules contain scoring logic to compute true positives (TP), false positives (FP), false negatives (FN), precision, recall, and F1 scores per sub-test. These are aggregated to yield per-suite scores and further averaged into the overarching AutoDFBench Score.
- API Layer: Exposes five REST endpoints (one per CFTT domain) under /api/v1/.../evaluate, each responsible for input normalization, standardized output structures, and live evaluation.
- CSV Input Layer: Incorporates a csv_eval.py utility that supports batch-oriented, offline, or CI-style result ingestion by accepting tuples of (test_case, input_CSV, output_report).
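The batch-oriented CSV path can be pictured with a small sketch. This is not the actual csv_eval.py: the function names, the use of in-memory CSV text instead of file paths, and the dictionary standing in for the MySQL ground-truth store are all assumptions made for illustration. It only shows the general idea of comparing a tool's CSV rows against ground-truth rows for each (test_case, input_CSV, output_report) tuple.

```python
import csv
import io


def load_rows(csv_text: str) -> set[tuple[str, ...]]:
    """Parse CSV text into a set of row tuples, skipping the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # discard the header
    return {tuple(cell.strip() for cell in row) for row in reader if row}


def evaluate_csv(tool_csv: str, truth_csv: str) -> dict[str, int]:
    """Count matches by exact row equality (an assumed matching rule)."""
    found, truth = load_rows(tool_csv), load_rows(truth_csv)
    return {
        "true_positives": len(found & truth),
        "false_positives": len(found - truth),
        "false_negatives": len(truth - found),
    }


def run_batch(jobs, ground_truth: dict[str, str]) -> dict[str, dict[str, int]]:
    """Evaluate (test_case, input_csv, output_report) tuples.

    ground_truth maps a test case ID to its CSV text; the real framework
    reads ground truth from its database instead (hypothetical stand-in).
    """
    return {
        report: evaluate_csv(input_csv, ground_truth[test_case])
        for test_case, input_csv, report in jobs
    }


# Hypothetical data for one test case.
truth = {"DFR-01": "name,size\na.txt,10\nb.txt,20\n"}
jobs = [("DFR-01", "name,size\na.txt,10\nc.txt,30\n", "dfr01_report")]
reports = run_batch(jobs, truth)
```

Exact row equality is the simplest possible matching rule; a real suite module would likely normalize fields such as timestamps or paths before comparing.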
A central MySQL database underpins the architecture, organizing per-suite configuration parameters, detailed ground-truth records, and evaluation results. Modules are completely isolated, facilitating independent suite addition or replacement. Extension requires only module directory creation, ground-truth ingestion, scoring logic implementation, addition of API/CSV handlers, and configuration updates—ensuring cross-suite decoupling (Wickramasekara et al., 18 Dec 2025).
2. Ground Truth Structure and Test Dataset Composition
AutoDFBench 1.0 integrates five primary CFTT test suites, covering 63 principal test cases and 10,968 unique test variations. The design and structure of ground-truth data are tailored to the complexities of each forensic domain:
| Suite | Core Cases | Test Variations | Ground Truth Key Fields |
|---|---|---|---|
| String Search | 10 | 1,844 | Line-match IDs, keywords, file status |
| Deleted File Recovery | 14 | 8,147 | File blocks, file name, size, MAC timestamps |
| File Carving | 7 | 108 | Source image path, format, carve_type |
| Windows Registry Recovery | 15 | 49 | PATH, TYPE, VALUE, MTIME (regipy CSV export) |
| SQLite Data Recovery | 4 | 820 | Page size, journal mode, page count, schema, file hash |
For deleted file recovery, AutoDFBench 1.0 compensates for incomplete NIST-provided ground truth in certain cases by computing block allocations from partition start sectors (PS) and sector counts. All ground-truth records are loaded into a unified schema so evaluation modules can access required attributes efficiently (Wickramasekara et al., 18 Dec 2025).
3. RESTful API and Data Flow
The framework offers five primary POST endpoints, each corresponding to a test suite:
- /api/v1/string-search/evaluate
- /api/v1/deleted-file-recovery/evaluate
- /api/v1/file-carving/evaluate
- /api/v1/windows-registry/evaluate
- /api/v1/sqlite-recovery/evaluate
API requests use JSON (or multipart encoding for file uploads), specifying test_case_id, optional tool_name, and suite-specific inputs (e.g., file attributes for deletion recovery, data rows for SQLite evaluation). Responses are standardized, always including:
{
  "test_case_id": "DFR-01",
  "tool_name": "MyDFRTool",
  "true_positives": 42,
  "false_positives": 0,
  "false_negatives": 0,
  "precision": 1.0,
  "recall": 1.0,
  "f1_score": 1.0,
  "autdfbench_score": 1.0
}
By default, no authentication is required; APIs are intended for open, local, or containerized deployment (Wickramasekara et al., 18 Dec 2025).
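As a rough illustration of the request and response flow, the sketch below builds a POST request for one of the endpoints and sanity-checks a response body. The host and port, the helper names, and the `files` input field are assumptions for illustration; only the endpoint path and the response keys come from the description above. The request is constructed but not sent, so the snippet runs without a server.

```python
import json
import urllib.request

# Assumption: placeholder address for a local or containerized deployment.
BASE_URL = "http://localhost:8000"


def build_evaluate_request(suite: str, test_case_id: str,
                           tool_name: str, inputs: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one suite endpoint."""
    payload = {"test_case_id": test_case_id, "tool_name": tool_name, **inputs}
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/{suite}/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_consistent(body: dict, tol: float = 1e-6) -> bool:
    """Recompute precision, recall, and F1 from the counts and compare.

    Zero denominators are mapped to 0.0, which is an assumed convention.
    """
    tp, fp, fn = (body["true_positives"], body["false_positives"],
                  body["false_negatives"])
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return (abs(p - body["precision"]) <= tol
            and abs(r - body["recall"]) <= tol
            and abs(f - body["f1_score"]) <= tol)


# "files" is a hypothetical suite-specific input field.
req = build_evaluate_request(
    "deleted-file-recovery", "DFR-01", "MyDFRTool",
    {"files": [{"name": "report.docx", "size": 20480}]},
)

# Response body copied from the example above.
sample = {"test_case_id": "DFR-01", "tool_name": "MyDFRTool",
          "true_positives": 42, "false_positives": 0, "false_negatives": 0,
          "precision": 1.0, "recall": 1.0, "f1_score": 1.0}
ok = is_consistent(sample)
```

With a running instance, the request could be sent with `urllib.request.urlopen(req)` and the decoded JSON body passed to `is_consistent`.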
4. Evaluation Metrics and Scoring Methodology
At the core of AutoDFBench 1.0 is standard confusion-matrix-based evaluation. The main metrics are defined as:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Each test case yields an individual F1 score. The arithmetic mean of F1s across all sub-test cases in a suite forms its suite score. The comprehensive metric, termed the AutoDFBench Score, is the mean across all test cases in all five suites:
$$\text{AutoDFBench Score} = \frac{1}{N} \sum_{i=1}^{N} F_{1,i}$$
where $N$ is the total number of core test cases in version 1.0 (Wickramasekara et al., 18 Dec 2025). Metrics are stored and retrievable via the central test_results table, supporting tool-level, suite-level, and overall benchmarking.
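The scoring scheme described here can be sketched in a few lines of Python. The class and function names are illustrative, not taken from the framework, and the treatment of zero denominators (returning 0.0) is an assumption. The overall score is computed as a mean over all test cases rather than a mean of suite means, following the description above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts for one test case."""
    tp: int
    fp: int
    fn: int


def precision(c: Counts) -> float:
    # Assumed convention: 0.0 when the tool reported nothing.
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Counts) -> float:
    # Assumed convention: 0.0 when the ground truth is empty.
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f1(c: Counts) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def suite_score(cases: list[Counts]) -> float:
    """Arithmetic mean of F1 over the test cases of one suite."""
    return mean(f1(c) for c in cases)


def overall_score(suites: dict[str, list[Counts]]) -> float:
    """Mean F1 over every test case in every suite."""
    return mean(f1(c) for cases in suites.values() for c in cases)


# Hypothetical counts for two suites.
suites = {
    "string-search": [Counts(42, 0, 0), Counts(8, 2, 2)],
    "file-carving": [Counts(5, 0, 5)],
}
string_search = suite_score(suites["string-search"])
overall = overall_score(suites)
```

In this example the three per-case F1 values are 1.0, 0.8, and about 0.667, so the string-search suite score is about 0.9 and the overall score is about 0.822.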
5. Empirical Validation and Benchmarking Outcomes
AutoDFBench 1.0 has been validated using the official NIST CFTT datasets and prominent reference tools (The Sleuth Kit, regipy, DB Browser for SQLite, Scalpel). Notable outcomes include:
- String search: Perfect F1 = 1.0 for all 1,844 sub-cases across 10 main test cases.
- Deleted file recovery: F1 = 1.0 for all evaluated cases with complete ground truth; cases with incomplete ground truth were excluded.
- File carving: All evaluated cases F1 = 1.0; one case skipped due to unavailable ground truth.
- Windows registry recovery: regipy achieved F1 = 1.0; python-registry was near-perfect except where specific parsing differed from ground truth, validating scoring logic.
- SQLite data recovery: F1 = 1.0 for all cases.
These results empirically confirm the accuracy of ground-truth ingestion, evaluation logic, and output metrics, and illustrate improved automation, reproducibility, and numeric precision compared to manual CFTT benchmarking (Wickramasekara et al., 18 Dec 2025).
6. Use Cases, Extensibility, and Future Directions
The framework is designed for integration across tool vendor CI/CD pipelines (tracking AutoDFBench Scores per release via the API), offline bulk evaluation (using csv_eval.py for GUI tool outputs), AI-generated code benchmarking (via API input schema adaptation), and practitioner proficiency testing. Extensibility procedures are fully modular: new forensic domains (e.g., mobile file systems, memory forensics, cloud acquisitions) require only definition of new ground-truth, implementation of scoring logic, and endpoint registration; no changes in other modules are necessary.
A typical extension comprises four steps: ingesting the new test suite's ground truth, writing domain-specific scoring logic, exposing a matching API endpoint, and updating configuration and validation strategies. This pattern keeps AutoDFBench 1.0 adaptable to evolving DF technology while offering transparent and reproducible standards across domains (Wickramasekara et al., 18 Dec 2025).
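One way to picture this decoupling is a registry in which each suite contributes its own scoring function. The sketch below is a hypothetical pattern, not the framework's actual mechanism: `register_suite`, `SCORERS`, and the set-based scoring rule are all assumptions for illustration.

```python
from typing import Callable

# A scorer takes the tool's findings and the ground truth and returns counts.
Scorer = Callable[[set[str], set[str]], dict[str, int]]

# Hypothetical registry mapping suite names to their scoring functions.
SCORERS: dict[str, Scorer] = {}


def register_suite(name: str) -> Callable[[Scorer], Scorer]:
    """Decorator that adds a suite's scorer without touching other suites."""
    def decorator(fn: Scorer) -> Scorer:
        if name in SCORERS:
            raise ValueError(f"suite already registered: {name}")
        SCORERS[name] = fn
        return fn
    return decorator


@register_suite("memory-forensics")
def score_memory_forensics(found: set[str], truth: set[str]) -> dict[str, int]:
    """Illustrative scorer: exact set comparison of artifact identifiers."""
    return {
        "true_positives": len(found & truth),
        "false_positives": len(found - truth),
        "false_negatives": len(truth - found),
    }


def evaluate(suite: str, found: set[str], truth: set[str]) -> dict[str, int]:
    """Dispatch to the scorer registered for the requested suite."""
    return SCORERS[suite](found, truth)


# Hypothetical artifact identifiers.
result = evaluate("memory-forensics", {"pid:4", "pid:88"}, {"pid:4", "pid:512"})
```

Because each scorer is looked up by name, adding a domain means adding one function and one registration; existing suites are not modified.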