ARFBench: TSQA for Anomaly Reasoning

Updated 4 July 2026

ARFBench is a time series question-answering benchmark that tests whether multimodal models can reason about anomalies in real production observability data.
It leverages Datadog incident telemetry to generate multiple-choice questions across detection, characterization, and cross-series analysis tiers.
It reports accuracy and macro-F1, and the accompanying study covers a hybrid TSFM+VLM model and the complementary strengths of models and human experts.

ARFBench, short for Anomaly Reasoning Framework Benchmark, is a multiple-choice Time Series Question Answering (TSQA) benchmark for evaluating whether multimodal foundation models can understand and reason about anomalies in real production observability time series arising from software incidents. In this setting, the input comprises one or two time series, a natural-language caption describing what the series measures, additional contextual information used during dataset creation, and a natural-language question; the required output is a natural-language choice selected from a fixed set of multiple-choice answers. The benchmark is designed around a central question: whether a model can inspect complex, noisy, multivariate incident telemetry and correctly answer high-level questions about anomaly presence, timing, magnitude, type, correlation, and temporal precedence. ARFBench was introduced in "ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response" (Xie et al., 23 Apr 2026).

1. Definition, scope, and motivation

ARFBench is situated in time series question-answering, where natural-language questions are posed to infer and reason about properties of time series. In ARFBench, this task is specialized to real production observability time series associated with software incidents, rather than synthetic or simplified signals (Xie et al., 23 Apr 2026). The benchmark targets multimodal foundation models, including large language models (LLMs), vision-language models (VLMs), and time series foundation models (TSFMs), and is explicitly concerned with anomaly reasoning rather than anomaly detection alone.

The motivating use case is software incident response. Incident handling is presented as intrinsically question-driven: engineers ask whether a spike is abnormal, whether one component fails before another, or whether one metric plausibly causes a change in another. Observability telemetry is correspondingly treated as complex and contextual. Metrics such as CPU, Kafka lag, error rates, container I/O, and latency are multivariate, highly nonstationary, and domain-specific; whether a pattern counts as anomalous depends not only on signal shape but also on semantic meaning, cross-series relations, and temporal context.

A central premise of ARFBench is that existing TSQA benchmarks are inadequate for this operational setting. Prior benchmarks are described as often synthetic or simulated, univariate or only simply multivariate, lacking contextual information, not grounded in expert annotations, and not requiring reasoning across multiple time series. ARFBench is intended to fill that gap through real observability telemetry from Datadog production incidents, expert-supported labels and incident timelines, and tasks that extend beyond detection to compositional anomaly reasoning, including cross-metric reasoning and leading/lagging analysis. This suggests that the benchmark is as much about semantic and relational interpretation as about pattern recognition.

2. Dataset composition and construction pipeline

ARFBench is built from internal Datadog production incidents and contains 63 incidents, 142 distinct time series, 750 questions, and 5.38 million data points (Xie et al., 23 Apr 2026). The question set is divided across three difficulty tiers: 111 Tier I questions, 306 Tier II questions, and 333 Tier III questions. The multivariate structure is substantial: the minimum number of variates is 1, the median is 10.5, and the maximum is 2,283; the median length per variate is 367 points, and the maximum is 40,969 points. These statistics are significant because they create representation bottlenecks for models that depend on textual serialization or rasterized plots.

The telemetry is drawn from observability metrics, including infrastructure metrics, application usage metrics, database metrics, networking metrics, and security metrics. Time series are categorized via GPT-4.1 into the observability subdomains Application usage, Infrastructure, Networking, Database, and Security. The anomaly patterns include level shifts, transient spikes, changes in seasonality, changes in variance, changes in trend, and a final group comprising flatlines, outages, and missing data. The incidents therefore reflect operational phenomena such as failing pods, overloaded queues, Kafka consumer lag, replica crashes, and resource exhaustion.

The primary raw source is Datadog’s software incident timelines. These contain Slack discussions from incident start through mitigation and resolution, embedded metric widgets shared during triage, and natural-language expert reasoning about hypotheses, observations, and root-cause exploration. The timelines provide evidence for which time series were actually used and how they were interpreted, including what counted as anomalous and when anomalies started or ended.

The question-answer creation pipeline has three stages. First, during data curation and cleaning, metric time series are extracted from incidents together with their Datadog metric queries, including metric names, filters, aggregations, and group-bys. An LLM is then used to summarize and sanitize the metric query into a generalized time series caption, and channel names such as datacenter, pod, cluster, or topic are preserved as tags for multivariate reasoning. Second, question templates and an oracle model are used. Single-series templates are applied to each time series, while up to 10 random pairs of time series per incident are used for paired-series templates. An “oracle” VLM, given the question, the rendered time-series plot, and the full incident timeline text, generates answer options and a putative correct answer. Third, filtering and human verification are applied. A separate LLM checks temporal consistency, overlap constraints for cross-series questions, and magnitude sensibility. Human authors then manually review every QA pair, correct labels using incident evidence, remove any remaining sensitive content, and downselect variable-option categories to exactly 5 options consisting of one correct answer and four randomly sampled distractors.
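Two mechanical steps of this pipeline, sampling up to 10 series pairs per incident and downselecting to 5 answer options, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of unordered pairs, the seeding, and the final shuffle of options are all assumptions.

```python
import itertools
import random


def sample_series_pairs(series_ids, max_pairs=10, seed=0):
    """Pick up to max_pairs random pairs of series from one incident.

    Assumption: pairs are unordered and sampled without replacement.
    """
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(series_ids, 2))
    if len(all_pairs) <= max_pairs:
        return all_pairs
    return rng.sample(all_pairs, max_pairs)


def downselect_options(correct, candidates, n_options=5, seed=0):
    """Keep the correct answer plus randomly sampled distractors.

    Assumption: the kept options are shuffled so the correct answer
    does not always appear in the same position.
    """
    rng = random.Random(seed)
    distractors = [c for c in candidates if c != correct]
    k = min(n_options - 1, len(distractors))
    chosen = rng.sample(distractors, k) + [correct]
    rng.shuffle(chosen)
    return chosen


# Hypothetical incident with six metric series (15 possible pairs).
pairs = sample_series_pairs(["cpu", "lag", "errors", "latency", "io", "mem"])

# Hypothetical Start Time question with seven candidate options.
options = downselect_options(
    "10:05",
    ["09:50", "09:55", "10:00", "10:05", "10:10", "10:15", "No Anomaly"],
)
```

With six series there are 15 candidate pairs, so the sampler returns exactly 10; the option list always contains the correct answer and four distinct distractors.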

This construction procedure is notable because the benchmark labels are not solely model-generated. The oracle VLM has access to richer context than evaluation models, but its outputs are later manually verified and sanitized. A plausible implication is that ARFBench is designed to preserve realistic incident semantics while controlling annotation quality and confidentiality.

3. Question taxonomy and task formulation

ARFBench defines 8 categories grouped into 3 difficulty tiers (Xie et al., 23 Apr 2026). The tier structure encodes increasingly demanding forms of anomaly reasoning.

Tier I – Detection contains Presence, a binary anomaly-detection question asking whether the time series exhibits an anomaly in the given time range, with answers “Yes” or “No”.

Tier II – Single-series anomaly properties contains five categories. Identification asks which channels among three given tags are anomalous, with answer choices consisting of combinations of one to three specific channels plus “No Anomaly”. Start Time asks for the start time of the anomaly, if one exists, using four timestamps plus “Before the earliest timestamp” and “No Anomaly”. End Time asks for the end time, using four timestamps plus “Not resolved” and “No Anomaly”. Magnitude asks how much the anomaly deviates from expected behavior. Conceptually, if the counterfactual mean satisfies $\mu \neq 0$, magnitude uses the maximum ratio $\max_t \frac{|x_t|}{\mu}$; if $\mu = 0$, it uses the maximum absolute value $\max_t |x_t|$. The options are approximately log-spaced numeric values plus “No Anomaly”. Categorization asks for the anomaly type, with options level shift, transient spike, change in seasonality, change in variance, change in trend, and no anomaly.
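The Magnitude rule can be illustrated with a short sketch. The function names and the example option values are hypothetical, and matching a computed value to the nearest option in log space is an assumption motivated by the log-spaced choices, not a documented step of the benchmark.

```python
import math


def anomaly_magnitude(values, counterfactual_mean):
    """Max ratio |x_t| / mu when mu != 0, otherwise max |x_t|."""
    if not values:
        raise ValueError("values must be non-empty")
    peak = max(abs(v) for v in values)
    if counterfactual_mean != 0:
        return peak / counterfactual_mean
    return peak


def nearest_log_option(value, options):
    """Pick the option closest to value in log space (assumes positive numbers)."""
    return min(options, key=lambda o: abs(math.log(o) - math.log(value)))


# A series that normally sits near 12 and spikes to 48.
ratio = anomaly_magnitude([10, 12, 48, 11], 12.0)

# A series whose expected level is zero, so the absolute peak is used.
absolute = anomaly_magnitude([0, 0, 7, 0], 0)

# Hypothetical log-spaced answer options.
choice = nearest_log_option(absolute, [1.5, 4.0, 10.0, 30.0])
```

Here `ratio` is 4.0 and `absolute` is 7, which maps to the option 10.0 because 7 is closer to 10 than to 4 on a logarithmic scale.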

Tier III – Cross-series reasoning contains Correlation and Leading/Lagging Indicator. Correlation asks whether the anomaly in one time series correlates with the anomaly in another, with choices encoding the logical cases of no anomalies, anomaly only in series 1, anomaly only in series 2, anomalies in both but not correlated, and anomalies in both and correlated. The indicator task asks whether the anomaly in time-series 1 is a leading or lagging indicator of the anomaly in time-series 2, with options leading indicator, lagging indicator, perfectly correlated, not correlated, and no anomaly in one or both series.
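The answer spaces of the two Tier III categories can be written as small decision functions. This is only a sketch of the option logic: the option strings paraphrase the text above, and deriving leading or lagging status from anomaly onset times, with equal onsets mapped to "perfectly correlated", is an assumption made for illustration.

```python
def correlation_choice(anomaly_in_1, anomaly_in_2, correlated):
    """Map per-series anomaly flags to one of the five Correlation options."""
    if not anomaly_in_1 and not anomaly_in_2:
        return "No anomalies"
    if anomaly_in_1 and not anomaly_in_2:
        return "Anomaly only in series 1"
    if anomaly_in_2 and not anomaly_in_1:
        return "Anomaly only in series 2"
    if correlated:
        return "Anomalies in both, correlated"
    return "Anomalies in both, not correlated"


def indicator_choice(onset_1, onset_2, correlated):
    """Map anomaly onsets (None when absent) to one of the five Indicator options.

    Assumption: series 1 leads when its anomaly starts first, and equal
    onsets correspond to the "perfectly correlated" option.
    """
    if onset_1 is None or onset_2 is None:
        return "No anomaly in one or both series"
    if not correlated:
        return "Not correlated"
    if onset_1 < onset_2:
        return "Leading indicator"
    if onset_1 > onset_2:
        return "Lagging indicator"
    return "Perfectly correlated"
```

For example, `indicator_choice(100, 130, True)` returns "Leading indicator", while `correlation_choice(True, False, False)` returns "Anomaly only in series 1". Both functions presuppose that each series has already been checked for an anomaly.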

The design principle is hierarchical: Tier II questions depend on correctly detecting anomalies, while Tier III depends on correctly localizing and characterizing anomalies in each series. This dependency structure makes the benchmark more than a flat set of classification tasks; it probes progressively richer anomaly semantics.

At evaluation time, each question $i$ supplies the model with time-series image(s), a sanitized caption, channel or tag names, question text, and a list of answer options; the model must output a choice $\hat{y}_i$ that is exactly one of the provided options. Tier I and Tier II use one plot, while Tier III uses three images: a stacked plot with both series sharing a time axis and separate plots for each individual series. LLM-only baselines receive text encodings of the time series, including truncated or downsampled numeric sequences and textual descriptions of the time axis and tags. TSFM-based models receive raw numeric arrays $X$ of shape $(V, T)$ together with metadata such as timestamps and tags.

How the time series are rendered depends on model family. VLMs receive PNG plots downsampled to a maximum of 1500 px on each side. In multivariate plots, channels are colored, but channel names are often omitted from the image for clarity and instead supplied in text. LLMs receive discretized numeric sequences and, for long or high-dimensional series, temporal subsampling and variate subset selection such as the top 50 by mean. TSFM-based systems ingest raw arrays directly. These choices expose one of the benchmark’s core methodological tensions: observability time series are too large and too multivariate to fit naturally into either text context windows or high-resolution images without information loss.
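The text encoding used for LLM-only baselines can be approximated with the sketch below. It is an assumption-laden illustration: the uniform-stride subsampling, the `max_points` limit, the rounding precision, and the one-line-per-tag output format are hypothetical; only the idea of keeping the top variates by mean and subsampling in time comes from the description above.

```python
def select_top_variates(series_by_tag, k=50):
    """Keep the k variates with the largest mean value."""
    def mean(xs):
        return sum(xs) / len(xs)

    ranked = sorted(series_by_tag, key=lambda tag: mean(series_by_tag[tag]), reverse=True)
    return {tag: series_by_tag[tag] for tag in ranked[:k]}


def subsample(values, max_points):
    """Keep at most max_points values using a uniform stride."""
    if len(values) <= max_points:
        return list(values)
    stride = -(-len(values) // max_points)  # ceiling division
    return list(values[::stride])


def serialize(series_by_tag, k=50, max_points=100, decimals=2):
    """Render the selected variates as one text line per tag."""
    kept = select_top_variates(series_by_tag, k)
    lines = []
    for tag, values in kept.items():
        points = subsample(values, max_points)
        lines.append(f"{tag}: " + " ".join(f"{v:.{decimals}f}" for v in points))
    return "\n".join(lines)


# Hypothetical three-channel metric grouped by pod.
data = {"pod-a": [1.0] * 10, "pod-b": [5.0] * 10, "pod-c": [3.0] * 10}
text = serialize(data, k=1, max_points=3, decimals=1)
```

With `k=1` only `pod-b` survives, and its ten points are reduced to three, which makes the information loss discussed above concrete: nine of every ten values from the dropped channels never reach the model.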

4. Evaluation methodology and benchmarked performance

ARFBench reports two primary metrics: accuracy and multiclass macro-F1 (Xie et al., 23 Apr 2026). Accuracy is defined for $N$ questions as

$\text{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i).$

Macro-F1 is computed after mapping question-specific answer options to fixed semantic answer classes within each category. Examples include {No Anomaly, Smallest, Small, Medium, Large} for Magnitude and {No Anomaly, Earliest, Early, Medium, Late} for Start Time. For each class $c$, one computes precision $P_c$, recall $R_c$, and

$\text{F1}_c = \frac{2 P_c R_c}{P_c + R_c},$

then averages across classes within a category and finally across categories. The use of macro-F1 is intended to prevent trivial gains from class imbalance. The paper notes that the per-category Frequent Choice baseline attains 45.1% accuracy but only 17.3% macro-F1, demonstrating the importance of semantic balance.
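A small sketch shows why a frequent-choice strategy can score well on accuracy yet poorly on macro-F1. The function names are hypothetical, and treating an undefined precision, recall, or F1 as zero is an assumed convention rather than one stated in the source.

```python
from collections import defaultdict


def accuracy(preds, labels):
    """Fraction of questions answered correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def macro_f1(preds, labels, classes):
    """Unweighted mean of per-class F1 (undefined values count as 0)."""
    scores = []
    for c in classes:
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


def benchmark_macro_f1(records, classes_by_category):
    """Average macro-F1 within each category, then across categories.

    records is an iterable of (category, prediction, label) triples.
    """
    grouped = defaultdict(lambda: ([], []))
    for category, pred, label in records:
        grouped[category][0].append(pred)
        grouped[category][1].append(label)
    per_category = [
        macro_f1(preds, labels, classes_by_category[category])
        for category, (preds, labels) in grouped.items()
    ]
    return sum(per_category) / len(per_category)


# Always answering the most frequent class on an imbalanced Presence set.
labels = ["Yes", "Yes", "Yes", "No"]
preds = ["Yes", "Yes", "Yes", "Yes"]
acc = accuracy(preds, labels)
f1 = macro_f1(preds, labels, ["Yes", "No"])
```

In this toy case accuracy is 0.75 while macro-F1 is 3/7 (about 0.43), because the never-predicted "No" class contributes an F1 of zero to the average.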

The baselines are Random Choice, with 24.5% accuracy and 22.5% macro-F1, and Per-category Frequent Choice, with 45.1% accuracy and 17.3% macro-F1. Human performance is measured through a user study with 4 Datadog researchers—2 observability domain experts and 2 non-domain experts—each answering a 25% random subset of 188 questions after a calibration stage of 16 training questions requiring at least 90% accuracy to proceed. Non-domain experts achieve 69.7% accuracy and 60.7% macro-F1, while domain experts achieve 72.7% accuracy and 64.6% macro-F1.

Among text-only LLMs, Qwen3 32B (LLM) obtains 47.9% accuracy and 36.1% macro-F1, and GPT-5 (text) obtains 56.4% accuracy and 43.9% macro-F1. The benchmark reports that LLMs consistently underperform their VLM counterparts. This is consistent with the task formulation: reading shape, alignment, and multivariate visual structure from plots appears materially easier than inferring them from serialized numeric text.

The VLM results show a broad range. Qwen3-VL 8B reaches 45.3% accuracy and 34.7% F1; Claude Sonnet 4.5, 47.2% accuracy and 37.9% F1; GPT-4o, 47.2% accuracy and 42.4% F1; GPT-4.1, 47.9% accuracy and 44.0% F1; Qwen3-VL 32B (few-shot), 52.8% accuracy and 45.1% F1; Claude Opus 4.6, 54.8% accuracy and 46.7% F1; Gemini 3 Pro, 58.1% accuracy and 49.6% F1; GPT-5.4 (VLM), 61.3% accuracy and 51.4% macro-F1; and GPT-5 (VLM), 62.7% accuracy with 95% CI [59.2, 66.13] and 51.9% macro-F1 with CI [47.21, 55.38].

Per-tier and per-category patterns are also reported. For GPT-5 (VLM), Tier I reaches 82.0% accuracy and 66.9% F1, Tier II 55.9% accuracy and 51.2% F1, and Tier III 62.5% accuracy and 47.5% F1. Selected category-specific numbers include Magnitude at 65.8% accuracy and 59.1% F1, Categorization at 59.6% accuracy and 57.0% F1, Correlation at 63.5% accuracy and 49.0% F1, and Indicator at 61.3% accuracy and 45.9% F1. The benchmark notes that models are generally strong on binary presence detection but weaker on categories that require estimating baseline behavior, distinguishing anomaly morphology, or reasoning about cross-series temporal relationships.

The paper also provides an error analysis. Model errors are manually categorized as incorrect perception at approximately 48%, limited context usage at approximately 42%, and instruction-following errors at approximately 10%. Incorrect perception includes missing subtle level shifts, mis-localizing start or end times, and failing to detect missing-data gaps. Limited context usage includes ignoring the semantic content of the caption or failing to use multivariate structure. Instruction-following errors are especially prominent in Correlation and Indicator tasks with closely related answer options. Models are also reported to exhibit domain-knowledge gaps, such as mis-inferring causality or severity even when they perceive the plotted pattern correctly.

5. Specialized hybrid modeling: Toto-1.0-QA-Experimental

A major contribution associated with ARFBench is the development of a hybrid TSFM + VLM prototype named Toto-1.0-QA-Experimental 32B (Xie et al., 23 Apr 2026). The motivation is architectural. ARFBench time series are long and highly multivariate, and VLMs that operate on images are constrained by image resolution, plotting-induced information loss, and difficulty recovering fine-grained timestamps or subtle variance changes. Time-series LLMs can ingest raw sequences but, in the reported baselines, do not transfer well to this observability domain.

The hybrid model combines Toto, described as a leading observability forecasting TSFM pretrained on Datadog-like metrics, with Qwen3-VL 32B, an open-source VLM. The TSFM path takes multivariate time series $X \in \mathbb{R}^{V \times T}$ and produces per-timestep embeddings $h_{v,t}$ for each variate $v$. A Variate Embedding MLP then aggregates across time per variate, conceptually

$e_v = \mathrm{MLP}\big(h_{v,1}, \dots, h_{v,T}\big),$

to compress long sequences into fixed-size per-channel representations. Projection layers map these TSFM embeddings into the text-token embedding space of the Qwen3-VL LLM,

$\tilde{e}_v = \mathrm{Proj}(e_v),$

and the resulting embeddings are interleaved as “time-series tokens” in the text stream. The Qwen3-VL vision tower remains frozen; the TSFM-VLM variant does not rely on time-series images. Trainable components include LoRA adapters on the text decoder, the variate embedding MLP, the projection layers, and the TSFM backbone itself, which is unfrozen.
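The data flow of the time-series path (per-timestep embeddings, one vector per variate, projection into the text embedding space, interleaving with text tokens) can be traced with a toy sketch. Everything concrete here is an assumption: mean pooling stands in for the learned Variate Embedding MLP, a single weight matrix stands in for the projection layers, and the `<ts>` placeholder token is hypothetical.

```python
def pool_over_time(embeddings):
    """embeddings[v][t] is a d-dim list; return one d-dim vector per variate.

    Mean pooling is a stand-in for the learned per-variate aggregation.
    """
    pooled = []
    for per_step in embeddings:
        n_steps = len(per_step)
        dim = len(per_step[0])
        pooled.append([sum(step[j] for step in per_step) / n_steps for j in range(dim)])
    return pooled


def project(vectors, weight):
    """Apply a linear map; weight has shape (d_text, d_ts)."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight] for vec in vectors]


def interleave(text_tokens, ts_tokens, placeholder="<ts>"):
    """Replace each placeholder in the text stream with the next time-series token."""
    remaining = iter(ts_tokens)
    return [next(remaining) if tok == placeholder else tok for tok in text_tokens]


# Two variates, three timesteps, embedding size 2.
embeddings = [
    [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]],
]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # maps size 2 to size 3

ts_tokens = project(pool_over_time(embeddings), weight)
stream = interleave(["cpu", "<ts>", "mem", "<ts>"], ts_tokens)
```

The resulting stream alternates tag text with one fixed-size vector per variate, so sequence length grows with the number of channels rather than with the number of timesteps.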

The post-training pipeline uses the same synthetic and real TSQA data for three model families: post-trained Qwen3-VL 32B, a TSFM-LLM variant, and the TSFM-VLM hybrid. The synthetic training corpus contains 12,000 examples generated from Gaussian noise with seasonality and drift, plus injected anomalies including level shifts, transient spikes, changes in variance, changes in seasonality, and flatline anomalies. An LLM is used to generate captions, channel names, and synthetic reasoning traces. The real training data consist of 207 manually labeled examples from incidents spanning 2025-04-01 to 2025-04-07, disjoint from the ARFBench test set, and are augmented to 395 examples using Tier III negative augmentation by pairing time series from different incidents.
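A synthetic series of the kind described, noise plus seasonality and drift with an injected anomaly, might be generated as in the sketch below. The sinusoidal seasonality, the parameter values, and the function names are illustrative assumptions; the source does not specify the generator.

```python
import math
import random


def synth_series(n=200, period=24, drift=0.01, noise=0.5, seed=0):
    """Sinusoidal seasonality plus linear drift plus Gaussian noise."""
    rng = random.Random(seed)
    return [
        math.sin(2 * math.pi * t / period) + drift * t + rng.gauss(0, noise)
        for t in range(n)
    ]


def inject_level_shift(xs, start, delta):
    """Add a constant offset from index start onward."""
    return [x + delta if t >= start else x for t, x in enumerate(xs)]


def inject_spike(xs, at, height, width=1):
    """Add a short transient of the given height and width."""
    return [x + height if at <= t < at + width else x for t, x in enumerate(xs)]


def inject_flatline(xs, start, end):
    """Hold the value at index start constant over [start, end)."""
    hold = xs[start]
    return [hold if start <= t < end else x for t, x in enumerate(xs)]


base = synth_series()
shifted = inject_level_shift(base, start=100, delta=3.0)
spiked = inject_spike(base, at=120, height=10.0)
flat = inject_flatline(base, start=50, end=60)
```

Because each injector records where the anomaly begins, the same parameters can also supply ground-truth labels for Start Time, Categorization, and Presence questions.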

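Training proceeds in three stages. Stage 1 is supervised fine-tuning on synthetic QA plus reasoning traces. Stage 2 is supervised fine-tuning on real QA pairs plus synthetic reasoning traces. After Stage 2, the TSFM-VLM reaches 48.4% accuracy and 37.8% F1, still below base Qwen3-VL 32B few-shot, but with formatting issues largely corrected and TS tokens utilized. Stage 3 casts ARFBench as a Reinforcement Learning with Verifiable Rewards (RLVR) task using DAPO. The reward function is pure outcome reward:

$R_i = \mathbb{1}(\hat{y}_i = y),$

where $\hat{y}_i$ is the answer extracted from the $i$-th sampled output $o_i$ for question $q$ and $y$ is the correct option, and the per-output advantage for group size $G$ is

$\hat{A}_i = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}.$

The DAPO objective is given as

$\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\Big( r_{i,t}(\theta)\, \hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1 - \varepsilon_{\text{low}},\, 1 + \varepsilon_{\text{high}}\big)\, \hat{A}_i \Big) \right],$

with token-level importance ratio

$r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}.$

No format or length rewards are used; only correctness is rewarded. RLVR uses group size $G$, global batch sizes of approximately 16–40 depending on model, and temperatures of approximately 0.8–1.3.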
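The correctness-only reward, the group-normalized advantage, and the token-level clipped surrogate can be sketched in plain Python. This follows the general DAPO recipe rather than the authors' implementation: the clip bounds 0.2 and 0.28, the use of population standard deviation, and the small stabilizing constant are assumptions.

```python
import statistics


def outcome_reward(pred, label):
    """1.0 for a correct final answer, 0.0 otherwise."""
    return 1.0 if pred == label else 0.0


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled outputs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with asymmetric clip bounds."""
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def dapo_surrogate(token_ratios, advantages):
    """Token-level average over every token of every output in the group."""
    total_tokens = sum(len(ratios) for ratios in token_ratios)
    total = sum(
        clipped_term(ratio, adv)
        for ratios, adv in zip(token_ratios, advantages)
        for ratio in ratios
    )
    return total / total_tokens


# Four sampled answers to one question whose correct option is "B".
rewards = [outcome_reward(p, "B") for p in ["B", "A", "C", "B"]]
advantages = group_advantages(rewards)
```

With binary rewards, a group in which every sample is correct or every sample is wrong has zero variance, so all advantages are approximately zero and the group contributes no learning signal.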

Stage 3 produces large gains: +15.4 percentage points accuracy and +11 points F1 over the Stage 2 TSFM-VLM checkpoint. The final Toto-1.0-QA-Experimental model attains 63.9% accuracy with 95% CI [60.40, 67.07] and 48.9% macro-F1 with CI [44.13, 52.27]. Tier-wise, it reaches 84.7% accuracy and 66.3% F1 on Tier I, 55.6% accuracy and 48.4% F1 on Tier II, and 64.6% accuracy and 43.5% F1 on Tier III. Compared with GPT-5 (VLM), the hybrid model slightly exceeds overall accuracy and attains the highest reported Tier III accuracy, though GPT-5 retains an overall macro-F1 advantage. Relative to other post-trained models, Qwen3-VL 32B (post-trained VLM only) obtains 56.9% accuracy and 46.6% F1, while Toto-1.0-Qwen3 32B (TSFM-LLM) obtains 48.8% accuracy and 33.9% F1. This suggests that adding a TSFM to an LLM alone is not sufficient; the benefit emerges from jointly training TSFM and VLM components with RLVR.

6. Human complementarity, limitations, and prospective uses

ARFBench explicitly studies the relationship between model and expert performance (Xie et al., 23 Apr 2026). Domain experts outperform standalone models overall, with 72.7% accuracy and 64.6% macro-F1 versus approximately 63–64% accuracy and 49–52% macro-F1 for the best models. Yet the error sets are reported to be substantially disjoint. On the 25% subset used for the user study, among the 23 questions that both experts answered incorrectly, GPT-5 answered 11 correctly and the TSFM-VLM answered 8 correctly. Conversely, among the 58 questions GPT-5 got wrong, at least one expert answered 46 correctly, and among the 59 questions the TSFM-VLM got wrong, at least one expert answered 44 correctly. The paper interprets this as evidence that models and experts exhibit complementary strengths: models may succeed on fine timing details, while experts may rely on domain knowledge and richer causal intuitions.

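This complementarity motivates the Model–Expert Oracle, a best-of-2 selector over a model answer $\hat{y}_i^{\text{model}}$ and an expert answer $\hat{y}_i^{\text{expert}}$:

$\mathbb{1}(\hat{y}_i^{\text{oracle}} = y_i) = \max\big\{\mathbb{1}(\hat{y}_i^{\text{model}} = y_i),\ \mathbb{1}(\hat{y}_i^{\text{expert}} = y_i)\big\}.$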

Using GPT-5 + domain expert, this oracle reaches 87.2% accuracy and 82.8% macro-F1, with 96.4% accuracy and 89.0% F1 on Tier I, 80.3% accuracy and 77.1% F1 on Tier II, and 90.5% accuracy and 86.3% F1 on Tier III. The benchmark describes this as establishing a new “superhuman” frontier. A plausible implication is that practical systems for incident response may benefit more from carefully designed human-model collaboration than from full autonomy.
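The best-of-2 oracle is an upper bound computed with access to the ground truth, which a short sketch makes explicit. The function names and the toy answers are hypothetical.

```python
def oracle_correct(model_pred, expert_pred, label):
    """The oracle is right whenever the model or the expert is right."""
    return model_pred == label or expert_pred == label


def oracle_accuracy(model_preds, expert_preds, labels):
    """Accuracy of the best-of-2 selector over paired answers."""
    hits = sum(
        oracle_correct(m, e, y)
        for m, e, y in zip(model_preds, expert_preds, labels)
    )
    return hits / len(labels)


# Toy example in which the model and the expert fail on different questions.
labels = ["A", "B", "C", "D"]
model = ["A", "B", "X", "X"]
expert = ["A", "X", "C", "X"]

model_acc = oracle_accuracy(model, model, labels)
expert_acc = oracle_accuracy(expert, expert, labels)
combined = oracle_accuracy(model, expert, labels)
```

Each answerer alone scores 0.5 here, while the oracle scores 0.75. The gain comes entirely from the questions where the two error sets do not overlap, and a deployed system would need some way to decide whose answer to trust without seeing the label.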

The benchmark also states several limitations. It is entirely based on Datadog’s internal telemetry, algorithms, and incident practices, so domain transfer is not guaranteed. It covers single-turn anomaly-centric questions, not mitigation or remediation decisions, multi-hop causal narratives, or multi-turn interactive agent behavior. Annotation bias remains possible because labels are grounded in an oracle VLM and human verification, and anomaly boundaries—especially start and end times—are inherently subjective, though the multiple-choice design reduces ambiguity. Synthetic training data use relatively simple anomaly injection patterns and may not capture the full richness of real observability anomalies. Generalization beyond software incidents to domains such as ECGs, finance, climate, or IoT remains open. The training dynamics of the TSFM-VLM procedure, including the best mixture of synthetic and real data and the optimal TS/token fusion design, are also presented as unresolved.

ARFBench’s intended uses include benchmarking TSQA capabilities across LLMs, VLMs, TSFMs, and hybrids; encouraging multimodal foundation models for observability; and supporting a public leaderboard. The potential applications listed in the paper include automated incident triage, anomaly explanation, on-call assistant tools, root-cause analysis support, and interactive exploratory analysis over complex telemetry. The publicly released resources include the dataset on Hugging Face, a leaderboard hosted as a Hugging Face Space, and a GitHub repository with data loaders, evaluation scripts for accuracy and macro-F1, example prompts for several model families, and plotting utilities. Taken together, these features position ARFBench as a benchmark for real incident-driven TSQA in which visual perception, temporal localization, multivariate reasoning, and domain semantics are evaluated jointly rather than in isolation.


References (1)

ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response (2026)
