
Drift-Bench: Benchmarking System Adaptation

Updated 5 February 2026
  • Drift-Bench is a collection of frameworks and tools designed to assess system robustness, adaptability, and stability across varying drift scenarios in machine learning, databases, HPC, and plasma simulation.
  • It incorporates methodologies like multi-turn interaction for LLM agents, synthetic drift injection for evolving data and query workloads, and continuous performance regression tracking in scientific software.
  • Empirical evaluations across diverse domains highlight trade-offs in performance, safety, and efficiency, providing actionable insights for developing more resilient adaptive systems.

Drift-Bench denotes a family of benchmarking frameworks, methodologies, and diagnostic tools designed to evaluate robustness, adaptability, and stability of systems subject to various forms of "drift" — ranging from machine learning agents' pragmatic breakdowns in dialogue to data, workload, and execution regressions in databases and high-performance computing (HPC) environments. It encompasses diagnostic protocols for LLM-agent cooperativity (Bao et al., 2 Feb 2026), frameworks for benchmarking under evolving data and query workloads (Liu et al., 12 Oct 2025), continuous performance regression detection in scientific software (Alt et al., 2024), and, in earlier contexts, drift-kinetic code evaluations for plasma simulation (Huang et al., 2016). This article reviews four central axes: diagnostic drift in LLM-agent pragmatics; data and workload drift in database benchmarking; continuous benchmarking for HPC software; and domain-specific drift-kinetic code benchmarks.

1. Diagnostic Evaluation of LLM Agent Breakdown: Drift-Bench as Agent Pragmatic Benchmark

The work "Drift-Bench: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction" defines Drift-Bench as a diagnostic framework targeting systematic evaluation of LLM agent robustness under realistic user instruction flaws, with multi-turn clarification as the supporting interaction protocol (Bao et al., 2 Feb 2026). Drift-Bench addresses a core limitation in prior benchmarking approaches which typically presume well-specified, cooperative instructions or restrict clarifications to single-turn, text-only settings.

Taxonomy of Cooperative Breakdown: Drift-Bench formalizes the input fault set $\mathcal{F}$ as

$$\mathcal{F} = \mathcal{F}_{\mathrm{intention}} \cup \mathcal{F}_{\mathrm{premise}} \cup \mathcal{F}_{\mathrm{parameter}} \cup \mathcal{F}_{\mathrm{expression}}$$

with each class rooted in classical communication theory (Gricean maxims, Austin's felicity, Watzlawick's interactional frames):

  • $\mathcal{F}_{\mathrm{intention}}$: invalid/implicit/shifting goals (Relevance violations)
  • $\mathcal{F}_{\mathrm{premise}}$: false presuppositions/infeasible preconditions (Quality/felicity)
  • $\mathcal{F}_{\mathrm{parameter}}$: missing/corrupted arguments (Quantity)
  • $\mathcal{F}_{\mathrm{expression}}$: linguistic ambiguity/vagueness (Manner)

Flawed tasks remain solvable only via targeted clarification, modeling realistic task ambiguity, indirectness, and underspecification in human–agent interaction.

Persona-Driven User Simulation and RISE Protocol: User inputs are simulated with five decision-making personas (Rational, Intuitive, Dependent, Avoidant, Spontaneous), modeled from the General Decision-Making Style instrument. Persona traits guide verbosity, clarification-seeking frequency, and responsiveness, preserving realistic dialogue memory and model-specific interaction patterns. The RISE protocol provides four orthogonal evaluation axes:

  • Robustness: quantified as performance degradation under faults, $\mathrm{PD} = 1 - \frac{\mathrm{Score}_{\mathrm{perturbed}}}{\mathrm{Score}_{\mathrm{clean}}}$
  • Intelligence: clarification gain, $G = \frac{1}{|T|}\sum_{t\in T}\bigl(M_{\mathrm{clar}}(t) - M_{\mathrm{noclar}}(t)\bigr)$
  • Safety: safe action rate (SAR) on high-risk tool tasks, $\mathrm{SAR} = \frac{1}{|T_{\mathrm{risk}}|}\sum_{t\in T_{\mathrm{risk}}}\mathbf{1}(t_{\mathrm{clar}} < t_{\mathrm{risk}})$
  • Efficiency: average interaction rounds, $\mathrm{AIR} = \frac{\sum_{t:\mathrm{succ}} \mathrm{rounds}(t)}{|\{t:\mathrm{succ}\}|}$
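The four RISE metrics can be sketched directly from their definitions. A minimal illustration follows; the task-record dictionaries and field names are assumptions for the example, not part of the Drift-Bench implementation:

```python
# Illustrative computation of the four RISE metrics (Bao et al., 2026).
# Task-record structure and field names are hypothetical.

def performance_degradation(score_clean: float, score_perturbed: float) -> float:
    """Robustness: PD = 1 - Score_perturbed / Score_clean."""
    return 1.0 - score_perturbed / score_clean

def clarification_gain(tasks) -> float:
    """Intelligence: mean per-task score gain with vs. without clarification."""
    return sum(t["m_clar"] - t["m_noclar"] for t in tasks) / len(tasks)

def safe_action_rate(risk_tasks) -> float:
    """Safety: fraction of high-risk tasks where clarification precedes the risky action."""
    return sum(1 for t in risk_tasks if t["t_clar"] < t["t_risk"]) / len(risk_tasks)

def average_interaction_rounds(tasks) -> float:
    """Efficiency: mean interaction rounds over successfully completed tasks."""
    succ = [t for t in tasks if t["success"]]
    return sum(t["rounds"] for t in succ) / len(succ)
```

Note that PD is positive when faults hurt performance, while a positive $G$ indicates clarification helps, matching the "clarification gain" interpretation above.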

Execution Environments: Drift-Bench applies to state-oriented (white-box, e.g., OS, DB) and service-oriented (black-box, API) agents, with complementary task sources and execution risk structures.

Results and Implications: White-box clarification can self-heal instruction faults, achieving net clarification gains (up to +23.6 pp for parameter faults), increased AIR (from 4.84 to 5.78), and nuanced SAR transitions (e.g., 59% for Intention faults, but ~29% for Premise/Parameter). In contrast, clarification in black-box tool-use environments can reduce performance ("clarification-induced syntactic collapse", "abandonment catalyst"). Execution bias analysis reveals a tendency for agents to act rather than safely defer, exposing agent safety liabilities.

Drift-Bench thus bridges research on multi-turn clarification, agent safety, and grounded execution evaluation, with implications for adaptive agent policy learning and communication-style-diverse simulation (Bao et al., 2 Feb 2026).

2. Data and Workload Drift: DriftBench Framework for Database Benchmarking

The framework "DriftBench: Defining and Generating Data and Query Workload Drift for Benchmarking" introduces a systematic taxonomy, formalization, and synthetic generation suite for data and query workload drift in database systems (Liu et al., 12 Oct 2025). Unlike static benchmarks, this DriftBench variant enables explicit injection and control of temporal and structural drift, targeting components such as caching, cardinality estimation, and query optimization.

Drift Formalization and Taxonomy:

  • Data drift is split into:
    • Cardinality drift: varying or updating the number of records.
    • Distributional drift: shifting distributions (e.g., mean, skew, variance); injecting outliers.
  • Workload drift divides into:
    • Parametric drift: shifting query predicate distributions or selectivity.
    • Structural drift: modifying query structure (adding/removing predicates/joins) or target payloads.

Mathematically, for tables $\mathcal{D}_1, \mathcal{D}_2$ over a schema $\mathcal{S}$, data drift occurs when the cardinality difference exceeds a threshold $\alpha$ or the divergence between empirical distributions (measured by total variation, KL divergence, EMD, etc.) exceeds a threshold $\epsilon$. Workload drift is classified as parametric or structural based on differences in query templates and selectivities.
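The drift criterion above can be sketched concretely. The snippet below implements the cardinality and total-variation checks; the threshold defaults and helper names are illustrative assumptions, not DriftBench's actual API:

```python
# Hedged sketch of the data-drift criterion: flag drift when the relative
# cardinality change exceeds alpha, or the total-variation distance between
# empirical column distributions exceeds epsilon. Defaults are assumptions.
from collections import Counter

def total_variation(sample1, sample2) -> float:
    """TV distance between the empirical distributions of two discrete samples."""
    p, q = Counter(sample1), Counter(sample2)
    n1, n2 = len(sample1), len(sample2)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[v] / n1 - q[v] / n2) for v in support)

def has_data_drift(col1, col2, alpha=0.2, epsilon=0.1) -> bool:
    """Apply both tests to one column of tables D1 and D2."""
    card_drift = abs(len(col1) - len(col2)) / max(len(col1), 1) > alpha
    dist_drift = total_variation(col1, col2) > epsilon
    return card_drift or dist_drift
```

The same skeleton extends to KL divergence or EMD by swapping the distance function.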

| Category       | Subtype                 | Example operation                              |
|----------------|-------------------------|------------------------------------------------|
| Data drift     | Varying cardinality     | scale records (1M → 2M census rows)            |
| Data drift     | Updating cardinality    | delete 10% of rows                             |
| Data drift     | Shifting distribution   | skew the "age" column                          |
| Data drift     | Injecting outliers      | add extreme "age" values                       |
| Workload drift | Changing predicate dist.| shift filter distributions (uniform → skewed)  |
| Workload drift | Varying selectivity     | expand predicate range                         |
| Workload drift | Modifying structure     | remove a predicate                             |
| Workload drift | Changing payload        | increase projected columns                     |
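The workload-drift subtypes in the table can be illustrated with a toy query generator. Everything here (the template, column names, and drift operations) is a hypothetical sketch, not the DriftBench template engine:

```python
# Toy illustration of parametric vs. structural workload drift on a single
# SQL template. Template text and column names are assumptions.
import random

TEMPLATE = "SELECT {payload} FROM census WHERE age BETWEEN {lo} AND {hi}"

def parametric_drift(rng: random.Random, widen: int = 10) -> str:
    """Parametric drift: same template, shifted/widened predicate range."""
    lo = rng.randint(20, 40)
    return TEMPLATE.format(payload="COUNT(*)", lo=lo, hi=lo + 5 + widen)

def structural_drift(rng: random.Random) -> str:
    """Structural drift: add a predicate and grow the projected payload."""
    base = TEMPLATE.format(payload="age, income", lo=30, hi=40)
    return base + " AND income > {}".format(rng.randint(10_000, 50_000))

rng = random.Random(0)
print(parametric_drift(rng))
print(structural_drift(rng))
```

Parametric drift leaves the query shape intact while moving its selectivity; structural drift changes the shape itself, which is what stresses optimizer plan caches.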

Framework Architecture: The implementation encompasses schema extraction, distribution simulation (row scaling, column skew, outlier injection), and workload generation via template instantiation and parametric/structural drift. Drift configurations are declarative (YAML), and the workflow is modular, supporting extension for new sources, drift operations, templates, and arrival patterns.
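Since the paper states that drift configurations are declarative YAML, a configuration might look roughly like the following. All key names and values are hypothetical, invented for illustration only:

```yaml
# Hypothetical drift configuration; key names are illustrative,
# not the actual DriftBench schema.
data_drift:
  - table: census
    type: distributional
    operation: skew
    column: age
    factor: 2.0
workload_drift:
  - template: range_scan
    type: parametric
    operation: vary_selectivity
    range_multiplier: 1.5
arrival_pattern:
  kind: linear
  duration_s: 600
```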

Evaluation and Insights: Case studies on UCI census data with rule-based (PostgreSQL), data-driven (Naru), and mixed (MSCN) estimators under controlled drift injections demonstrate that pure data-driven models exhibit high sensitivity to distributional drift, whereas rule-based systems are robust to distributional change but can be misled by outliers. Extensibility enables benchmarking under realistic, non-stationary patterns, revealing failure modes and adaptation needs in fundamental DBMS internals (Liu et al., 12 Oct 2025).

3. Continuous Benchmarking for HPC: Drift-Bench as Regression and Performance Drift Tracker

Drift-Bench also denotes a continuous benchmarking infrastructure for scientific and HPC codebases, integrating benchmarking deeply into the software development and deployment lifecycle (Alt et al., 2024). It combines CI/CD triggers, hardware-proximate runners, time-series metric ingestion, and regression alerting.

Architecture Overview:

  • CI/CD Integration: Git repository hooks trigger benchmarking runs (custom runners or agents).
  • Benchmark Cluster: Heterogeneous nodes represent hardware diversity; Slurm orchestrates precise runs.
  • Data Collection: Application logs and hardware counters (e.g., LIKWID, Nsight) are parsed, and key metrics written into InfluxDB under structured tags/fields (host, git_rev, test_name, compiler, solver, etc.).
  • Visualization/Alerting: Grafana dashboards and roofline plots (Plotly) render metric evolution. Drifts are flagged by thresholds, statistical tests, or regression to golden baselines.
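The ingestion step can be sketched with InfluxDB's line protocol, which encodes one sample as `measurement,tags fields timestamp`. The measurement, tag, and field names below mirror the tags listed above but are otherwise illustrative assumptions:

```python
# Minimal sketch: format one benchmark sample as an InfluxDB line-protocol
# record. Measurement/tag/field names are illustrative assumptions.
import time

def to_line_protocol(measurement: str, tags: dict, fields: dict, ts_ns=None) -> str:
    """Build 'measurement,tag=val,... field=val,... timestamp' (ns precision)."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    ts = ts_ns if ts_ns is not None else time.time_ns()
    return f"{measurement},{tag_str} {field_str} {ts}"

line = to_line_protocol(
    "benchmark",
    {"host": "node01", "git_rev": "abc123", "test_name": "lbm_sweep"},
    {"time_to_solution": 12.5, "mlups": 843.0},
    ts_ns=1700000000000000000,
)
print(line)
```

In a real deployment this string would be posted via an InfluxDB client; this sketch omits the escaping rules line protocol requires for spaces and commas in values.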

Performance and Regression Detection Metrics:

  • Time-to-solution ($T$), throughput (MLUP/s), FLOP/s, operational intensity ($I$) with roofline position, vectorization ratio ($V$), and memory bandwidth efficiency ($E_{\mathrm{mem}}$).
  • Regression is detected via a fixed threshold (e.g., ±10% deviation), a rolling-window anomaly test ($|T_{\mathrm{new}} - \mu| > k\sigma$), or direct comparison against a golden baseline.
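Both detection rules are simple to state in code. A minimal sketch, with the threshold and $k$ defaults chosen to match the figures above:

```python
# Sketch of the two regression checks: a fixed relative threshold against a
# baseline, and a rolling-window z-score test |T_new - mu| > k * sigma.
import statistics

def exceeds_threshold(t_new: float, baseline: float, rel: float = 0.10) -> bool:
    """Fixed-threshold check: relative deviation from the golden baseline."""
    return abs(t_new - baseline) / baseline > rel

def rolling_anomaly(history: list, t_new: float, k: float = 3.0) -> bool:
    """Rolling-window check: flag if t_new lies more than k sigma from the mean."""
    mu = statistics.fmean(history)
    sigma = statistics.pstdev(history)
    return abs(t_new - mu) > k * sigma
```

In practice the window length and $k$ trade false alarms against missed regressions, and noisy benchmarks often need per-metric tuning.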

Use Cases: In FE2TI, a 30% performance regression traced to library changes was flagged; in waLBerla, synchronization time spikes were automatically detected. This continuous Drift-Bench enables transient or persistent performance drifts to be caught promptly, supporting software reliability across evolving hardware/software stacks (Alt et al., 2024).

4. Drift-Bench in Neoclassical Transport Simulation: Drift-Kinetic Code Benchmarks

In plasma simulation, Drift-Bench refers to a comprehensive methodology for benchmarking local drift-kinetic models for neoclassical transport in complex stellarator geometries (Huang et al., 2016).

Core Models: Four model classes are benchmarked:

  • Global (5D Hamiltonian, full drift kinetics)
  • ZOW (zero-orbit-width, 4D, retains tangential magnetic drift)
  • ZMD (zero-magnetic-drift, omits tangential drift)
  • DKES-like (monoenergetic, incompressible $E \times B$)

Findings and Recommendations:

  • Including the tangential magnetic drift (ZOW, Global) is essential for correctly suppressing $1/\nu$ peaks in radial particle flux, especially at low collisionality and near $E_r = 0$.
  • ZOW delivers near-global accuracy for neoclassical fluxes, bootstrap current, and parallel flows at only ~20% of the computational cost of global models, making it the preferred local proxy for Drift-Bench evaluations. The DKES-like model, while efficient for electrons in certain regimes, fails to capture orbit effects critical for ion transport at low collisionality (Huang et al., 2016).

5. Relation to Other Drift-Oriented Benchmarks and Conceptual Foundations

Drift-Bench instantiations are unified by the focus on measuring system responses to non-stationarity: flawed agent inputs, evolving datasets and queries, temporal and structural execution regressions, or physical system parameter drifts. The guiding principle is to move beyond static or "clean" benchmarks, instead probing resilience, adaptation, and failure modes under realistic, dynamically changing conditions.

This suggests the practical value of Drift-Bench extends across fields: from diagnosis and improvement of conversational intelligence and agent robustness (Bao et al., 2 Feb 2026), to development of more resilient database systems and continual performance tracking in HPC environments (Liu et al., 12 Oct 2025, Alt et al., 2024), and to ensuring the physical fidelity of computational plasma physics (Huang et al., 2016).

Future directions highlighted in the literature include incorporation of broader tool families and multi-modality (Bao et al., 2 Feb 2026), dynamic environment inference, real human-in-the-loop evaluations, reinforcement learning of agent clarifiers, automated persona generation, and deployment in real-time production infrastructure for both software and data analytics.


References:

(Bao et al., 2 Feb 2026): Drift-Bench: Diagnosing Cooperative Breakdowns in LLM Agents under Input Faults via Multi-Turn Interaction
(Liu et al., 12 Oct 2025): DriftBench: Defining and Generating Data and Query Workload Drift for Benchmarking
(Alt et al., 2024): A Continuous Benchmarking Infrastructure for High-Performance Computing Applications
(Huang et al., 2016): Benchmark of the Local Drift-kinetic Models for Neoclassical Transport Simulation in Helical Plasmas
