Redbench: Real-World Database Benchmarking
Redbench is a database benchmarking suite designed to capture and reproduce the query workload characteristics of real-world production analytical systems, with a particular focus on the demands of instance-optimized and learned components. It addresses shortcomings of prior benchmarks by faithfully modeling workload repetition, temporal patterns, and distribution shifts as observed in operational enterprise data warehouses, giving both academic research and industrial engineering a more realistic standard for evaluation and development.
1. Real-World Motivated Benchmark Design
Redbench constructs its workloads by sampling and aligning queries based on production usage traces. Drawing from the Redset dataset (covering three months of activity from 200 Amazon Redshift clusters), Redbench carefully samples workloads to cover the spectrum of query repetition, diversity, and complexity actually observed in deployment. Users are divided into ten buckets by query repetition rate (from 0–100% in 10% increments), and in each bucket, three users are chosen for variety: the ones with the lowest, median, and highest variability (as measured by join and table scan diversity).
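The bucketing and selection step can be pictured with a short sketch. The following Python snippet is a minimal illustration under assumed per-user statistics (`repetition_rate`, `variability`); the field names and data structures are hypothetical and not taken from the Redbench implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class UserStats:
    user_id: str
    repetition_rate: float   # fraction of queries seen before (by normalized hash)
    variability: float       # join diversity + table-scan diversity (assumed metric)

def select_users(users: list[UserStats]) -> list[UserStats]:
    """Bucket users by repetition rate (10% steps) and keep the lowest-,
    median-, and highest-variability user from each bucket."""
    buckets: dict[int, list[UserStats]] = defaultdict(list)
    for u in users:
        # Map a 0-100% repetition rate onto buckets 0..9.
        buckets[min(int(u.repetition_rate * 10), 9)].append(u)

    selected = []
    for bucket_users in buckets.values():
        ranked = sorted(bucket_users, key=lambda u: u.variability)
        median = ranked[len(ranked) // 2]
        # Deduplicate in case a bucket holds fewer than three distinct users.
        selected.extend({ranked[0].user_id: ranked[0],
                         median.user_id: median,
                         ranked[-1].user_id: ranked[-1]}.values())
    return selected
```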
For each selected user, Redbench captures a burst period (typically the busiest week), resulting in up to 1,000 production-style SELECT queries per workload. Overall, the suite contains 30 such workloads. This approach ensures Redbench represents both routine, repetitive business reporting and highly dynamic, ad-hoc analytic exploration. Furthermore, the query arrival timeline and structural features are preserved to enable longitudinal and adaptive evaluation.
Benchmarks from TPC-H, TPC-DS, IMDb/JOB, and CEB serve as “support” templates: each Redbench query is mapped to the support instance that best matches its join count, scanset, and repetition pattern in the original trace. Mapping uses normalized join count, attempts to maintain scanset overlap, and strictly follows repetition semantics (identical queries always map to the same support instance).
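The mapping step can likewise be sketched. The snippet below is an illustrative approximation: the scoring function, the `by_scanset` bookkeeping, and the dictionary-based query representation are assumptions, while the constraints it encodes (normalized join counts, stable mapping of repeated queries, scanset consistency) follow the description above.

```python
def map_workload(trace_queries, support_queries, max_trace_joins, max_support_joins):
    """trace_queries: dicts with 'hash', 'num_joins', 'scanset' (frozenset of tables).
    support_queries: dicts with 'id', 'num_joins', 'scanset'."""
    by_hash = {}      # identical trace queries always reuse the same support instance
    by_scanset = {}   # repeated trace scansets reuse the same support scanset
    mapped = []
    for q in trace_queries:
        if q["hash"] in by_hash:
            mapped.append(by_hash[q["hash"]])
            continue
        # Normalize the trace join count into the support benchmark's range.
        target = q["num_joins"] / max(max_trace_joins, 1) * max_support_joins
        preferred = by_scanset.get(q["scanset"])  # keep scanset repetitiveness
        def score(s):
            scanset_penalty = 0 if preferred is None or s["scanset"] == preferred else 1
            return (scanset_penalty, abs(s["num_joins"] - target))
        best = min(support_queries, key=score)
        by_hash[q["hash"]] = best["id"]
        by_scanset.setdefault(q["scanset"], best["scanset"])
        mapped.append(best["id"])
    return mapped
```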
2. Workload Properties and Coverage
Redbench uniquely preserves several properties critical for realistic system evaluation:
- Query Repetition Rate: The proportion of queries seen before (by normalized hash) in a workload, ranging from nearly zero (highly diverse, exploratory analytics) to near one (repetitive, operational BI). This rate is bucketed and preserved per user (see the sketch below).
- Distribution Shifts: Temporal changes in query composition, whether due to evolving data access patterns, business needs, or application updates. Redbench mirrors real-world drift by preserving the sequence and relational structure of queries drawn from Redset.
- Table Scan Repetitiveness: The persistence with which certain tables are accessed or joined, enabling validation of caching, indexing, and clustering strategies tuned to actual hot paths.
- Complexity Trajectories: The temporal ordering and complexity progression of queries within a workload, allowing evaluation of both static and adaptive optimization components.
The user selection and mapping pipelines ensure that users with the lowest, median, and highest diversity (sum of join diversity and scanset diversity) within a repetition bucket are chosen, maximizing the representativeness of workload dynamics.
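As a concrete illustration of the repetition-rate property, the following sketch computes it from normalized query hashes; the `normalize` function is only a placeholder for whatever canonicalization the trace actually uses.

```python
import hashlib

def normalize(sql: str) -> str:
    # Placeholder normalization: collapse whitespace and lower-case the text.
    return " ".join(sql.lower().split())

def repetition_rate(queries: list[str]) -> float:
    """Fraction of queries whose normalized hash was already seen earlier."""
    seen, repeats = set(), 0
    for sql in queries:
        h = hashlib.sha256(normalize(sql).encode()).hexdigest()
        if h in seen:
            repeats += 1
        seen.add(h)
    return repeats / len(queries) if queries else 0.0

# Example: the second query normalizes to the same text as the first,
# so 1 of 3 queries is a repeat -> repetition rate of about 0.33.
print(repetition_rate(["SELECT a FROM t", "select a   from t", "SELECT b FROM t"]))
```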
3. Evaluation of Instance-Optimized and Learned Components
Redbench is expressly designed for evaluating and advancing instance-optimized and machine learning–driven database system components, which adapt their strategies to actual observed workload patterns. Principal targets include:
- Learned Cardinality and Cost Models: Redbench’s alignment with real distribution shifts and table access patterns exposes such models to realistic concept drift, testing resilience to both recurring queries and evolving user behavior.
- Semantic and Materialized Caching: Accurate query repetition and scanset correlations enable meaningful evaluation of caching policies, as their payoff is highly sensitive to real recurrence frequencies and access locality (see the sketch below).
- Adaptive Indexing and Clustering: By reproducing fluctuating hot sets and significant temporal locality, Redbench validates approaches that specialize storage and access structures for workload-tailored optimization.
- Resource Allocation and Scaling: The inclusion of bursty and shifting access patterns facilitates the development and testing of allocation heuristics or models aiming to optimize resource use in dynamic, multi-tenant environments.
Several features make Redbench particularly suitable for these purposes: mapping preserves not only repetition but also normalized join counts and scanset semantics, and different support benchmarks can be used, provided they are sufficiently expressive.
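As one example of such an evaluation, the sketch below replays a workload (represented as normalized query hashes) through a simple LRU result cache and reports the hit rate; the cache policy and capacity are illustrative choices, not part of Redbench itself.

```python
from collections import OrderedDict

def replay_hit_rate(query_hashes: list[str], capacity: int = 128) -> float:
    """Replay a workload through an LRU result cache and return the
    fraction of queries answered from the cache."""
    cache = OrderedDict()  # hash -> placeholder for a cached result
    hits = 0
    for h in query_hashes:
        if h in cache:
            hits += 1
            cache.move_to_end(h)           # refresh LRU position on a hit
        else:
            cache[h] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least-recently-used entry
    return hits / len(query_hashes) if query_hashes else 0.0
```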
4. Modeling and Preserving Distribution Shifts
Redbench explicitly focuses on distribution shifts, representing how production workloads evolve over time—something mostly missing or synthetically approximated in TPC-H, TPC-DS, CAB, and DSB. By using real user traces from Redset, Redbench naturally introduces time-varying patterns in query structure, scanset, and repetition, closely mimicking changes faced by deployed systems.
Formally, if a workload is a query sequence q_1, q_2, ..., q_n with normalized hashes h(q_1), ..., h(q_n), Redbench captures and maintains changes in the empirical distribution over these hashes across time, i.e. the shift in the probability of each query pattern from one window P_t to the next P_{t+1}. The mapping process normalizes join counts from Redset into the available support benchmark's spectrum, preserving complexity gradients.
This ensures that instance-optimized or learned components can be systematically evaluated for both their baseline performance and their robustness to real-world, sometimes abrupt, workload drift.
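Under this formalization, drift between adjacent workload windows can be quantified with a few lines of code. The window size and the use of total variation distance in the sketch below are illustrative assumptions, not prescribed by Redbench.

```python
from collections import Counter

def total_variation(window_a: list[str], window_b: list[str]) -> float:
    """0.0 means identical query-pattern distributions, 1.0 means disjoint."""
    pa, pb = Counter(window_a), Counter(window_b)
    na, nb = sum(pa.values()), sum(pb.values())
    patterns = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[h] / na - pb[h] / nb) for h in patterns)

def drift_profile(hashes: list[str], window: int = 100) -> list[float]:
    """Drift between consecutive windows of `window` queries each."""
    windows = [hashes[i:i + window] for i in range(0, len(hashes), window)]
    return [total_variation(a, b) for a, b in zip(windows, windows[1:])]
```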
5. Comparison to Previous Benchmarks
Redbench sets a new standard for realism compared to prior benchmarks:
| Feature | TPC-H/DS | DSB | CAB | Redbench (this work) |
|---|---|---|---|---|
| Source of workload | Synthetic | Synthetic | Semi-real | Production (Redset) |
| Query repetition | Fixed | Fixed | Random/Fixed | Empirical (bucketed) |
| Workload drift | None/Synthetic | Synthetic | Synthetic | Real |
| Instance-optimized eval | No | Limited | Some | Yes |
While prior synthetic benchmarks add random or prescribed repetition and drift, Redbench directly mirrors empirically observed phenomena, enabling evaluations that reflect actual production operational challenges.
6. Implications for Research and Industry
Redbench bridges a longstanding gap between academic research, which frequently relies on static or artificial benchmarks, and industry deployment, where real workloads drive system behavior. By offering open, reproducible, and versatile workloads based on traced production use, Redbench enables:
- Robust validation of research proposals for learned, adaptive, and instance-specialized system techniques under conditions they will face in practice.
- Benchmark-agnostic adoption, so that systems designed for TPC-DS, IMDb/JOB, or other standard benchmarks can plug in Redbench workloads with minimal adaptation.
- Community-wide reproducibility through the public release of all workloads and mapping methods, supporting sustained comparison and competition.
- Future extensibility, with possible support for time-stamped replay and integration of update workloads, further increasing fidelity to evolving enterprise scenarios.
Redbench is positioned as a foundational infrastructure for next-generation research and system engineering on realistic adaptive database components, facilitating progress in both learned and traditional (heuristic or rule-based) workload-aware optimization.