ATSADBench: Aerospace TSAD Benchmark
- ATSADBench is a domain-driven benchmark that evaluates LLM performance in detecting time-series anomalies in aerospace telemetry.
- It categorizes anomalies into fixed-value, constant deviation, and time-varying deviation types across univariate and multivariate sensor contexts to mirror real-world aerospace scenarios.
- The benchmark introduces new window-level metrics (AA, AL, AC) that better quantify operational detection performance and highlight LLM limitations in handling complex sensor interdependencies.
ATSADBench (Aerospace Time Series Anomaly Detection Benchmark) is a domain-driven benchmark specifically developed to evaluate LLMs in the context of time series anomaly detection (TSAD) for aerospace software. It addresses the challenge of detecting operationally significant pattern anomalies in complex telemetry streams, bridging the gap between generic TSAD benchmarks and the stringent reliability requirements of control systems deployed in aerospace. ATSADBench provides a rigorous framework for investigating both the capabilities and limitations of modern LLMs through a diverse suite of tasks, novel evaluation metrics, and explicit focus on domain knowledge injection (Liu et al., 18 Jan 2026).
1. Benchmark Structure and Task Taxonomy
ATSADBench comprises nine tasks constructed by crossing three anomaly types with three signal configurations: univariate, multivariate out-of-loop, and multivariate in-loop. The anomaly taxonomy reflects common manifestations in aerospace telemetry:
- Fixed-Value Anomaly (FVA): Signal freezes at the value at anomaly onset.
- Constant Deviation Anomaly (CDA): Signal acquires a fixed bias relative to nominal operation.
- Time-Varying Deviation Anomaly (TVDA): Signal drifts according to a time-dependent deviation function.
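The three anomaly types can be sketched as simple signal transformations. The function below is a minimal illustration, assuming a linear drift for TVDA; the parameter names (`bias`, `drift_rate`) are illustrative choices, not taken from the benchmark.

```python
import numpy as np

def inject_anomaly(signal, onset, kind, bias=0.5, drift_rate=0.01):
    """Inject one of the three ATSADBench-style anomaly types after `onset`.

    kind: 'fva' freezes the signal, 'cda' adds a constant bias,
    'tvda' adds a time-dependent (here: linear) drift.
    """
    out = signal.copy().astype(float)
    t = np.arange(len(out) - onset)        # time since anomaly onset
    if kind == "fva":                      # Fixed-Value: freeze at onset value
        out[onset:] = out[onset]
    elif kind == "cda":                    # Constant Deviation: fixed bias
        out[onset:] += bias
    elif kind == "tvda":                   # Time-Varying Deviation: growing drift
        out[onset:] += drift_rate * t
    return out
```

Under this sketch, FVA and CDA are step-like departures from nominal behavior, while TVDA grows with time since onset, which is why it is typically the hardest to threshold.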
Dimensionality and context are parameterized as follows:
- Univariate (U): Single-sensor time series.
- Multivariate (M): Multiple, possibly correlated, sensor channels within an attitude-control loop.
- In-Loop (IL): The faulty sensor feeds the closed-loop attitude controller, so anomaly effects propagate through the control loop to other channels.
- Out-of-Loop (OL): The fault is isolated to a redundant or inactive sensor and does not feed back into control.
The result is the task matrix summarized below:
| Signal Dimensions | Task Name | Context | Anomaly Type |
|---|---|---|---|
| Univariate | U-FVA | single sensor | Fixed-Value |
| Univariate | U-CDA | single sensor | Constant Deviation |
| Univariate | U-TVDA | single sensor | Time-Varying Deviation |
| Multivariate | M-OL-FVA | out-of-loop | Fixed-Value |
| Multivariate | M-OL-CDA | out-of-loop | Constant Deviation |
| Multivariate | M-OL-TVDA | out-of-loop | Time-Varying Deviation |
| Multivariate | M-IL-FVA | in-loop | Fixed-Value |
| Multivariate | M-IL-CDA | in-loop | Constant Deviation |
| Multivariate | M-IL-TVDA | in-loop | Time-Varying Deviation |
Each task consists of 8,000 telemetry points (1,500 s normal + 500 s anomalous, sampled at 2 Hz), summing to 72,000 training-only points and 36,000 test points (label-balanced) (Liu et al., 18 Jan 2026).
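The nine-task matrix can be reproduced programmatically; this sketch simply crosses the three anomaly types with the three signal configurations used above (univariate, multivariate out-of-loop, multivariate in-loop).

```python
from itertools import product

anomalies = ["FVA", "CDA", "TVDA"]
configs = [("U", None), ("M", "OL"), ("M", "IL")]  # (dimensionality, loop context)

tasks = [
    f"{dim}-{anom}" if ctx is None else f"{dim}-{ctx}-{anom}"
    for (dim, ctx), anom in product(configs, anomalies)
]
print(tasks)  # nine task names matching the matrix above
```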
2. Evaluation Paradigms
ATSADBench operationalizes LLM-based TSAD under two paradigms adapted from SIGLLM:
- Direct Paradigm: Sliding windows traverse the series; the LLM is prompted to label anomalous points within each window, and a window-level alarm is raised when the number of flagged points exceeds a preset threshold.
- Prediction-Based Paradigm ("Pred"): For each window, the LLM autoregressively forecasts the next points; a window-level alarm is raised when the prediction error exceeds a threshold.
Window sizes and strides are paradigm- and modality-dependent (e.g., Direct: U—500-point windows, 100-point stride; M—10-point windows, 10-point stride). This dual-paradigm framework probes both direct label inference and predictive modeling competence in zero-shot LLMs (Liu et al., 18 Jan 2026).
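A minimal sketch of the prediction-based paradigm follows, with a stand-in `predict` callable in place of the LLM. The window, stride, horizon, and error-threshold values are illustrative defaults, not the benchmark's exact settings.

```python
import numpy as np

def pred_paradigm_alarms(series, predict, window=500, stride=100,
                         horizon=10, err_threshold=0.1):
    """Window-level alarms via the prediction-based ("Pred") paradigm.

    `predict(history, horizon)` stands in for the LLM: it maps a history
    window to `horizon` forecast points.
    """
    alarms = []
    for start in range(0, len(series) - window - horizon + 1, stride):
        history = series[start:start + window]
        actual = series[start + window:start + window + horizon]
        forecast = predict(history, horizon)
        err = np.mean(np.abs(forecast - actual))   # mean absolute forecast error
        alarms.append(bool(err > err_threshold))   # alarm if error exceeds threshold
    return alarms
```

For example, a naive last-value persistence predictor, `lambda h, k: np.full(k, h[-1])`, already raises an alarm on any window whose continuation jumps away from recent history.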
3. Metrics Tailored to Operational Realities
Conventional point-wise TSAD metrics (F1, AUC) inadequately reflect the priorities of safety-critical system operators. ATSADBench introduces three window-level, operator-aligned metrics:
- Alarm Accuracy (AA): Window-level detection correctness, where TP/TN/FP/FN are alarm decisions per window.
- Alarm Latency (AL): Mean window delay to the first alarm after true anomaly onset; lower is better.
- Alarm Contiguity (AC): Proportion of each anomaly segment with sustained alarm coverage; higher values indicate contiguous rather than sporadic alarms.
Collectively, AA quantifies overall correctness, AL timeliness of detection, and AC the coherence or credibility of alarms per anomaly episode (Liu et al., 18 Jan 2026).
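The three metrics can be reconstructed from per-window alarm decisions and labels. The function below is a sketch under the simplifying assumption of a single anomaly segment; the exact formulas may differ from the paper's definitions.

```python
def window_metrics(alarms, labels):
    """Window-level AA, AL, AC from per-window booleans (one anomaly segment).

    alarms: alarm decision per window; labels: ground-truth anomaly per window.
    """
    tp = sum(a and l for a, l in zip(alarms, labels))
    tn = sum((not a) and (not l) for a, l in zip(alarms, labels))
    aa = (tp + tn) / len(labels)                      # Alarm Accuracy

    onset = labels.index(True)                        # first anomalous window
    first = next((i for i in range(onset, len(alarms)) if alarms[i]), None)
    al = None if first is None else first - onset     # Alarm Latency, in windows

    anomalous = [a for a, l in zip(alarms, labels) if l]
    ac = sum(anomalous) / len(anomalous)              # Alarm Contiguity
    return aa, al, ac
```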
4. Empirical Results and Comparative Analysis
Open-source LLMs (DeepSeek-V3, Qwen3) were evaluated in both Direct and Pred paradigms, benchmarked against unsupervised SOTA baselines (Sub-Adjacent, TFMAE, GCAD).
Key findings include:
- Univariate superiority: LLMs attain AA ≈ 0.53 (Direct), ≈ 0.71 (Pred), with AL decreasing from approximately 4.2 (Direct) to 1.2 (Pred) windows and AC rising from 0.37 (Direct) to 0.55 (Pred).
- Multivariate limitations: On both M-OL and M-IL tasks, AA falls to 0.49–0.56 (Direct) / 0.41–0.57 (Pred); AC collapses to 0.05–0.09, near random; AL stays between 1 and 2.5 windows. This suggests a failure to generalize structured inter-sensor relationships.
- Paradigm trade-off: Prediction-based detection yields higher AA and lower AL on univariate tasks but fails to improve AC on multivariate modalities; the Direct paradigm is the more stable of the two.
- Enhancement strategies: Few-shot learning (Direct paradigm) increases F1 by 11.3% (DeepSeek-V3) and 28% (Qwen3), and AC by 40% and 60%, respectively; under the Pred paradigm it often degrades performance. Retrieval-augmented generation (RAG) produces only marginal changes (±5%) across all main metrics and in some cases increases false alarms.
A plausible implication is that few-shot negative exemplars help LLMs under Direct classification but are less effective, or even confusing, for predictive extrapolation; likewise, RAG-based knowledge injection does not robustly teach physical dependencies (Liu et al., 18 Jan 2026).
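For concreteness, a few-shot Direct-paradigm prompt might be assembled as below. The template wording, number serialization, and the helper name `build_direct_prompt` are hypothetical, not the prompts actually used by ATSADBench.

```python
def build_direct_prompt(window, exemplars=()):
    """Assemble a Direct-paradigm prompt, optionally with few-shot exemplars.

    exemplars: (series, anomalous_indices) pairs used as in-context examples.
    """
    parts = ["You are monitoring aerospace telemetry. "
             "List the indices of anomalous points in the window, or 'none'."]
    for i, (series, idxs) in enumerate(exemplars, 1):
        parts.append(f"Example {i}: {', '.join(f'{v:.3f}' for v in series)}")
        parts.append(f"Anomalous indices: {', '.join(map(str, idxs)) or 'none'}")
    parts.append("Window: " + ", ".join(f"{v:.3f}" for v in window))
    parts.append("Anomalous indices:")
    return "\n".join(parts)
```

In this framing, the exemplars serve as the negative (anomalous) demonstrations that the few-shot results above credit with the Direct-paradigm gains.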
5. Limitations and Guidance for Future Development
Major conclusions and recommendations drawn from ATSADBench include:
- LLM Zero-Shot Capability Boundaries: Off-the-shelf LLMs are competent for simple, univariate TSAD but are insufficient for multivariate telemetry, largely due to an absence of encoded causal/physical relations and inability to reason about control loop structure.
- Metric Alignment: The window-level metrics (AA, AL, AC) are more reflective of practical utility for operators than legacy point-level statistics, making them favorable for both development and deployment benchmarking.
- Domain Knowledge Injection: Few-shot prompting using negative exemplars within Direct classification yields measurable gains; pure RAG is not a substitute for modeling cross-variable dependencies. This suggests current generic RAG techniques may be suboptimal for time-series domains where anomalies emerge from physical interaction patterns.
- Research Roadmap: Prominent areas for advancement include crafting structured prompts or fine-tuning regimens that internalize causal dependencies, pre-training on domain-specific telemetry to build model priors over sensor interactions, expansion of ATSADBench to real on-orbit datasets, and cross-application to other safety-critical control domains.
6. Significance and Prospective Impact
ATSADBench establishes the first systematic, LLM-centric testbed for TSAD in aerospace, with tasks explicitly engineered for operational relevance, comprehensive modality coverage, and rigorous metric alignment. Its design exposes both the current strengths of open-source LLMs (univariate, zero-shot anomaly detection) and their crucial deficits (multivariate, inter-sensor reasoning). As such, ATSADBench not only offers a standardized methodology for benchmarking LLM-based TSAD but also delineates clear directions for evolving LLMs toward real-world, safety-critical deployments in aerospace and adjacent sectors (Liu et al., 18 Jan 2026).