ATSADBench: Aerospace TSAD Benchmark
- ATSADBench is a domain-driven benchmark that evaluates LLM performance in detecting time-series anomalies in aerospace telemetry.
- It categorizes anomalies into fixed-value, constant deviation, and time-varying deviation types across univariate and multivariate sensor contexts to mirror real-world aerospace scenarios.
- The benchmark introduces new window-level metrics (AA, AL, AC) that better quantify operational detection performance and highlight LLM limitations in handling complex sensor interdependencies.
ATSADBench (Aerospace Time Series Anomaly Detection Benchmark) is a domain-driven benchmark specifically developed to evaluate LLMs in the context of time series anomaly detection (TSAD) for aerospace software. It addresses the challenge of detecting operationally significant pattern anomalies in complex telemetry streams, bridging the gap between generic TSAD benchmarks and the stringent reliability requirements of control systems deployed in aerospace. ATSADBench provides a rigorous framework for investigating both the capabilities and limitations of modern LLMs through a diverse suite of tasks, novel evaluation metrics, and explicit focus on domain knowledge injection (Liu et al., 18 Jan 2026).
1. Benchmark Structure and Task Taxonomy
ATSADBench comprises nine tasks constructed by crossing three anomaly types with three signal configurations: univariate, multivariate out-of-loop, and multivariate in-loop. The anomaly taxonomy reflects common manifestations in aerospace telemetry:
- Fixed-Value Anomaly (FVA): Signal freezes at the value at anomaly onset.
- Constant Deviation Anomaly (CDA): Signal acquires a fixed bias relative to nominal operation.
- Time-Varying Deviation Anomaly (TVDA): Signal drifts according to a time-dependent deviation function.
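The three anomaly types can be sketched as simple signal transformations. The function below is a minimal illustration, assuming a linear drift for TVDA; the parameter names (`bias`, `drift_rate`) are illustrative choices, not taken from the benchmark.

```python
import numpy as np

def inject_anomaly(signal, onset, kind, bias=0.5, drift_rate=0.01):
    """Inject one of the three ATSADBench-style anomaly types after `onset`.

    kind: 'fva' freezes the signal, 'cda' adds a constant bias,
    'tvda' adds a time-dependent (here: linear) drift.
    """
    out = signal.copy().astype(float)
    t = np.arange(len(out) - onset)        # time since anomaly onset
    if kind == "fva":                      # Fixed-Value: freeze at onset value
        out[onset:] = out[onset]
    elif kind == "cda":                    # Constant Deviation: fixed bias
        out[onset:] += bias
    elif kind == "tvda":                   # Time-Varying Deviation: growing drift
        out[onset:] += drift_rate * t
    return out
```

Under this sketch, FVA and CDA are step-like departures from nominal behavior, while TVDA grows with time since onset, which is why it is typically the hardest to threshold.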
Dimensionality and context are parameterized as follows:
- Univariate (U): Single-sensor time series.
- Multivariate (M): Multiple, possibly correlated, sensor channels within an attitude-control loop.
- In-Loop (IL): The faulty sensor feeds the closed-loop attitude controller, so anomaly effects propagate through the control loop to other channels.
- Out-of-Loop (OL): The fault is isolated to a redundant or inactive sensor and does not feed back into control.
The result is the task matrix summarized below:
| Signal Dimensions | Task Name | Context | Anomaly Type |
|---|---|---|---|
| Univariate | U-FVA | single sensor | Fixed-Value |
| Univariate | U-CDA | single sensor | Constant Deviation |
| Univariate | U-TVDA | single sensor | Time-Varying Deviation |
| Multivariate | M-OL-FVA | out-of-loop | Fixed-Value |
| Multivariate | M-OL-CDA | out-of-loop | Constant Deviation |
| Multivariate | M-OL-TVDA | out-of-loop | Time-Varying Deviation |
| Multivariate | M-IL-FVA | in-loop | Fixed-Value |
| Multivariate | M-IL-CDA | in-loop | Constant Deviation |
| Multivariate | M-IL-TVDA | in-loop | Time-Varying Deviation |
Each task consists of 8,000 telemetry points (1,500 s normal + 500 s anomalous, sampled at 2 Hz), summing to 72,000 training-only points and 36,000 test points (label-balanced) (Liu et al., 18 Jan 2026).
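The nine-task matrix can be reproduced programmatically; this sketch simply crosses the three anomaly types with the three signal configurations used above (univariate, multivariate out-of-loop, multivariate in-loop).

```python
from itertools import product

anomalies = ["FVA", "CDA", "TVDA"]
configs = [("U", None), ("M", "OL"), ("M", "IL")]  # (dimensionality, loop context)

tasks = [
    f"{dim}-{anom}" if ctx is None else f"{dim}-{ctx}-{anom}"
    for (dim, ctx), anom in product(configs, anomalies)
]
print(tasks)  # nine task names matching the matrix above
```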
2. Evaluation Paradigms
ATSADBench operationalizes LLM-based TSAD under two paradigms adapted from SIGLLM:
- Direct Paradigm: Sliding windows traverse the series; the LLM is prompted to label anomalous points within each window, and a window-level alarm is raised when the number of flagged points exceeds a preset threshold.
- Prediction-Based Paradigm ("Pred"): For each window, the LLM autoregressively forecasts the next points; a window-level alarm is raised when the prediction error exceeds a threshold.
Window sizes and strides are paradigm- and modality-dependent (e.g., Direct: U—500-point windows, 100-point stride; M—10-point windows, 10-point stride). This dual-paradigm framework probes both direct label inference and predictive modeling competence in zero-shot LLMs (Liu et al., 18 Jan 2026).
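A minimal sketch of the prediction-based paradigm follows, with a stand-in `predict` callable in place of the LLM. The window, stride, horizon, and error-threshold values are illustrative defaults, not the benchmark's exact settings.

```python
import numpy as np

def pred_paradigm_alarms(series, predict, window=500, stride=100,
                         horizon=10, err_threshold=0.1):
    """Window-level alarms via the prediction-based ("Pred") paradigm.

    `predict(history, horizon)` stands in for the LLM: it maps a history
    window to `horizon` forecast points.
    """
    alarms = []
    for start in range(0, len(series) - window - horizon + 1, stride):
        history = series[start:start + window]
        actual = series[start + window:start + window + horizon]
        forecast = predict(history, horizon)
        err = np.mean(np.abs(forecast - actual))   # mean absolute forecast error
        alarms.append(bool(err > err_threshold))   # alarm if error exceeds threshold
    return alarms
```

For example, a naive last-value persistence predictor, `lambda h, k: np.full(k, h[-1])`, already raises an alarm on any window whose continuation jumps away from recent history.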
3. Metrics Tailored to Operational Realities
Conventional point-wise TSAD metrics (F1, AUC) inadequately reflect the priorities of safety-critical system operators. ATSADBench introduces three window-level, operator-aligned metrics:
- Alarm Accuracy (AA): Window-level detection correctness, where TP/TN/FP/FN are alarm decisions per window.
- Alarm Latency (AL): Mean window delay to the first alarm after true anomaly onset; lower is better.
- Alarm Contiguity (AC): Proportion of each anomaly segment with sustained alarm coverage; higher values indicate contiguous rather than sporadic alarms.
Collectively, AA quantifies overall correctness, AL timeliness of detection, and AC the coherence or credibility of alarms per anomaly episode (Liu et al., 18 Jan 2026).
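The three metrics can be reconstructed from per-window alarm decisions and labels. The function below is a sketch under the simplifying assumption of a single anomaly segment; the exact formulas may differ from the paper's definitions.

```python
def window_metrics(alarms, labels):
    """Window-level AA, AL, AC from per-window booleans (one anomaly segment).

    alarms: alarm decision per window; labels: ground-truth anomaly per window.
    """
    tp = sum(a and l for a, l in zip(alarms, labels))
    tn = sum((not a) and (not l) for a, l in zip(alarms, labels))
    aa = (tp + tn) / len(labels)                      # Alarm Accuracy

    onset = labels.index(True)                        # first anomalous window
    first = next((i for i in range(onset, len(alarms)) if alarms[i]), None)
    al = None if first is None else first - onset     # Alarm Latency, in windows

    anomalous = [a for a, l in zip(alarms, labels) if l]
    ac = sum(anomalous) / len(anomalous)              # Alarm Contiguity
    return aa, al, ac
```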
4. Empirical Results and Comparative Analysis
Open-source LLMs (DeepSeek-V3, Qwen3) were evaluated in both Direct and Pred paradigms, benchmarked against unsupervised SOTA baselines (Sub-Adjacent, TFMAE, GCAD).
Key findings include:
- Univariate superiority: LLMs attain AA ≈ 0.53 (Direct), ≈ 0.71 (Pred), with AL decreasing from approximately 4.2 (Direct) to 1.2 (Pred) windows and AC rising from 0.37 (Direct) to 0.55 (Pred).
- Multivariate limitations: On both M-OL and M-IL tasks, AA falls to 0.49–0.56 (Direct) / 0.41–0.57 (Pred); AC collapses to 0.05–0.09, near random; AL stays between 1 and 2.5 windows. This suggests a failure to generalize structured inter-sensor relationships.
- Paradigm trade-off: Prediction-based detection yields higher AA and lower AL on univariate tasks but fails to improve AC on multivariate modalities; the Direct paradigm is the more stable of the two.
- Enhancement strategies: Few-shot learning (Direct paradigm) increases F1 by 11.3% (DeepSeek-V3) and 28% (Qwen3), and AC by 40% and 60%, respectively; under the Pred paradigm it often degrades performance. Retrieval-augmented generation (RAG) produces only marginal changes (±5%) across all main metrics and in some cases increases false alarms.
A plausible implication is that few-shot negative exemplars help LLMs under Direct classification but are less effective, or even confusing, for predictive extrapolation; likewise, RAG-based knowledge injection does not robustly teach physical dependencies (Liu et al., 18 Jan 2026).
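For concreteness, a few-shot Direct-paradigm prompt might be assembled as below. The template wording, number serialization, and the helper name `build_direct_prompt` are hypothetical, not the prompts actually used by ATSADBench.

```python
def build_direct_prompt(window, exemplars=()):
    """Assemble a Direct-paradigm prompt, optionally with few-shot exemplars.

    exemplars: (series, anomalous_indices) pairs used as in-context examples.
    """
    parts = ["You are monitoring aerospace telemetry. "
             "List the indices of anomalous points in the window, or 'none'."]
    for i, (series, idxs) in enumerate(exemplars, 1):
        parts.append(f"Example {i}: {', '.join(f'{v:.3f}' for v in series)}")
        parts.append(f"Anomalous indices: {', '.join(map(str, idxs)) or 'none'}")
    parts.append("Window: " + ", ".join(f"{v:.3f}" for v in window))
    parts.append("Anomalous indices:")
    return "\n".join(parts)
```

In this framing, the exemplars serve as the negative (anomalous) demonstrations that the few-shot results above credit with the Direct-paradigm gains.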
5. Limitations and Guidance for Future Development
Major conclusions and recommendations drawn from ATSADBench include:
- LLM Zero-Shot Capability Boundaries: Off-the-shelf LLMs are competent for simple, univariate TSAD but are insufficient for multivariate telemetry, largely due to an absence of encoded causal/physical relations and inability to reason about control loop structure.
- Metric Alignment: The window-level metrics (AA, AL, AC) are more reflective of practical utility for operators than legacy point-level statistics, making them favorable for both development and deployment benchmarking.
- Domain Knowledge Injection: Few-shot prompting using negative exemplars within Direct classification yields measurable gains; pure RAG is not a substitute for modeling cross-variable dependencies. This suggests current generic RAG techniques may be suboptimal for time-series domains where anomalies emerge from physical interaction patterns.
- Research Roadmap: Prominent areas for advancement include crafting structured prompts or fine-tuning regimens that internalize causal dependencies, pre-training on domain-specific telemetry to build model priors over sensor interactions, expansion of ATSADBench to real on-orbit datasets, and cross-application to other safety-critical control domains.
6. Significance and Prospective Impact
ATSADBench establishes the first systematic, LLM-centric testbed for TSAD in aerospace, with tasks explicitly engineered for operational relevance, comprehensive modality coverage, and rigorous metric alignment. Its design exposes both the current strengths of open-source LLMs (univariate, zero-shot anomaly detection) and their crucial deficits (multivariate, inter-sensor reasoning). As such, ATSADBench not only offers a standardized methodology for benchmarking LLM-based TSAD but also delineates clear directions for evolving LLMs toward real-world, safety-critical deployments in aerospace and adjacent sectors (Liu et al., 18 Jan 2026).