Chain-of-Anomaly-Thoughts (CoAT) Overview

Updated 4 July 2026

Chain-of-Anomaly-Thoughts (CoAT) is an anomaly-specific reasoning framework that transforms raw evidence into a structured sequence of analytical steps for clear abnormality detection.
Its methodology incorporates expert-driven data correction, supervised CoT fine-tuning, and reinforcement learning alignment to boost detection accuracy and reduce hallucinations.
CoAT is applied in both log and video anomaly detection, offering interpretable diagnostic traces that outperform traditional models while addressing normality bias.

Chain-of-Anomaly-Thoughts (CoAT) denotes an anomaly-oriented adaptation of chain-based reasoning in which a model is prompted or trained to transform anomalous evidence into an explicit sequence of intermediate analytical steps before issuing a final anomaly judgment. In current arXiv usage, the label has been applied to both log anomaly detection and surveillance video anomaly reasoning, with a shared emphasis on transparent explanation, intermediate evidence analysis, and a final normal-versus-abnormal or crime-focused decision (Xu et al., 18 Sep 2025, Domingos et al., 23 Dec 2025). The term also sits adjacent to broader work on chain-based reasoning, including the unrelated acronym “CoAT” for “Chain-of-Associated-Thoughts,” which introduces search and dynamic memory rather than anomaly-specific bias (Pan et al., 4 Feb 2025).

1. Terminology, scope, and distinguishing features

Across the anomaly-detection papers using this label, CoAT is not a single standardized architecture but a family resemblance. In the log setting, the input is a log template or log session, the reasoning stage analyzes key log evidence and infers operational meaning, and the output is a transparent explanation plus an anomaly label (Xu et al., 18 Sep 2025). In the surveillance setting, CoAT is described as a multi-agent reasoning framework that counteracts the normality bias of large vision-language models (LVLMs) by combining structured exploration with a final anomaly-focused classification layer (Domingos et al., 23 Dec 2025). In both cases, the central move is to reframe anomaly detection as a reasoning task rather than only a classification task.

Usage of “CoAT”	Scope	Core mechanism
Chain-of-Anomaly-Thoughts	Log anomaly detection	Step-by-step diagnostic explanation plus anomaly label
Chain-of-Anomaly-Thoughts	Video anomaly detection/classification	Multi-agent exploration plus anomaly-focused classification
Chain-of-Associated-Thoughts	General LLM reasoning	MCTS plus dynamic associative memory

Ordinary chain-of-thought is treated as insufficiently specialized for anomaly work in these papers, but for different reasons. In log anomaly detection, prompt-only LLM methods are described as unreliable and vulnerable to factual inaccuracies, while traditional deep models are said to lack interpretability and generalization (Xu et al., 18 Sep 2025). In surveillance, the problem is framed as a normality prior: standard CoT can deepen reasoning without introducing any inductive bias toward anomaly or crime, thereby steering the model toward benign interpretations (Domingos et al., 23 Dec 2025). A plausible synthesis is that CoAT, in anomaly contexts, refers less to “thinking longer” than to structuring the intermediate reasoning so that anomaly-relevant evidence remains operationally salient.

2. Log-oriented CoAT: RationAnomaly and rationalized anomaly detection

The most explicit log-domain CoAT instantiation is RationAnomaly, which presents log anomaly detection as a reasoning task and implements a three-part pipeline: an expert-driven data-correction stage followed by two training stages, Chain-of-Thought supervised fine-tuning (CoT-SFT) and Reinforcement Learning Alignment (RLA) (Xu et al., 18 Sep 2025). The system is evaluated on both session-level logs and template-level logs from BGL and Spirit.

A foundational component is expert-driven correction of benchmark supervision. The authors report manually reviewing 3,046 unique log templates from BGL and Spirit. Five industry experts independently reviewed every template; disagreements were resolved through consensus discussions, and a senior expert gave final validation on disputed cases. This process corrected 225 mislabeled templates, or 7.4% of all templates, with inter-annotator agreement $\kappa = 0.94$. The error distribution was strongly asymmetric: 98.2% of the errors were false negatives, meaning logs that were actually anomalous but labeled normal, including examples such as “PANIC: segmentation violation...”, “Connection refused...”, and “divide-by-zero...” (Xu et al., 18 Sep 2025). In this formulation, supervision quality is treated as a prerequisite for trustworthy anomaly reasoning.

The first training stage, CoT-SFT, is intended to teach the model to reason like an expert. GPT-4o is used as a teacher to generate step-by-step rationales for each training log template, producing triplets

$(\text{log}, \text{CoT-analysis}, \text{label}).$

The CoT analysis identifies important parameters, interprets their meaning, infers whether the log is normal or abnormal, and produces a final conclusion. To avoid catastrophic forgetting and keep training efficient, the model is adapted with LoRA. The supervised objective is described in standard autoregressive form as

$\mathcal{L}_{\text{SFT}} = - \sum_t \log p_\theta(y_t \mid x, y_{<t}),$

where $x$ is the log input and $y$ is the generated reasoning-plus-answer sequence (Xu et al., 18 Sep 2025).
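As a minimal illustration of the objective above, the following sketch computes the token-level negative log-likelihood for a single target sequence. The per-token probabilities are made-up numbers and `sft_loss` is a hypothetical helper introduced here for illustration; it is not code from the paper.

```python
import math

def sft_loss(token_probs):
    """Negative log-likelihood: -sum_t log p(y_t | x, y_<t)."""
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities the model assigns to each target token of a
# reasoning-plus-answer sequence (illustrative values only).
probs = [0.9, 0.8, 0.95, 0.7]
loss = sft_loss(probs)
print(round(loss, 4))
```

The loss is zero only when every target token receives probability 1, and it grows as the model assigns less mass to the teacher's rationale or label tokens.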

The second stage, Reinforcement Learning Alignment, uses GRPO to optimize a multi-faceted reward: $R_{\text{total}} = R_{\text{format}} + R_{\text{answer}} + R_{\text{think}}.$ The format reward is binary and is granted only when the output strictly follows the required structure and contains both $\langle think \rangle$ and $\langle answer \rangle$. The answer reward is explicitly asymmetric, rewarding correct anomaly predictions more than correct normal predictions and penalizing missed anomalies more heavily than false alarms. The thinking reward is the anti-hallucination component: factual grounding is measured with BLEU and ROUGE, coherence is assessed with a perplexity model, and optimal brevity is encouraged by aligning output length to the distilled CoT target length. Conceptually, the paper writes

$R_{\text{think}} = \lambda_1 R_{\text{ground}} + \lambda_2 R_{\text{coh}} + \lambda_3 R_{\text{brev}}.$

This reward shaping is presented as the main mechanism for reducing hallucinations while preserving useful reasoning structure (Xu et al., 18 Sep 2025).
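The reward structure can be sketched roughly as follows. This is an illustrative reading, not the paper's implementation: the literal `<think>`/`<answer>` tag syntax, the numeric reward values, the default weights, and the choice to give no answer or thinking reward to malformed outputs are all assumptions. The grounding, coherence, and brevity scores are passed in as precomputed numbers rather than derived from BLEU, ROUGE, or a perplexity model.

```python
import re

# Assumed output format: <think>...</think><answer>normal|abnormal</answer>
FORMAT = re.compile(
    r"<think>(.+?)</think>\s*<answer>(normal|abnormal)</answer>", re.DOTALL
)

# Hypothetical asymmetric answer rewards, keyed by (gold, predicted).
ANSWER_REWARD = {
    ("abnormal", "abnormal"): 2.0,   # correctly caught anomaly: largest reward
    ("normal", "normal"): 1.0,       # correct normal: smaller reward
    ("normal", "abnormal"): -1.0,    # false alarm: mild penalty
    ("abnormal", "normal"): -2.0,    # missed anomaly: heaviest penalty
}

def total_reward(output, gold, ground, coh, brev, lambdas=(0.4, 0.3, 0.3)):
    """R_total = R_format + R_answer + R_think under the assumptions above."""
    match = FORMAT.fullmatch(output.strip())
    if match is None:
        return 0.0  # assumption: malformed outputs earn nothing
    r_format = 1.0
    r_answer = ANSWER_REWARD[(gold, match.group(2))]
    l1, l2, l3 = lambdas
    r_think = l1 * ground + l2 * coh + l3 * brev
    return r_format + r_answer + r_think

hit = "<think>PANIC indicates a kernel crash.</think><answer>abnormal</answer>"
miss = "<think>Looks routine.</think><answer>normal</answer>"
print(total_reward(hit, "abnormal", 0.5, 0.5, 0.5))
print(total_reward(miss, "abnormal", 0.5, 0.5, 0.5))
```

With these illustrative values, a missed anomaly scores lower than a false alarm, which is the asymmetry the answer reward is meant to encode.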

3. Empirical profile of CoAT in log anomaly detection

The RationAnomaly evaluation uses chronological splits to mimic real deployment, a template-level training set of 2,000 sampled logs with an anomaly rate of 15%, session-level data constructed with a fixed 100-log window, and test sets of 8,000 entries or sessions per dataset. To avoid leakage, logs used for RationAnomaly training were excluded from session-level tests (Xu et al., 18 Sep 2025).
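The session construction and chronological split described above can be sketched as follows. The helper names, the 80/20 split fraction, the decision to keep a trailing partial window, and the rule that a session is anomalous if any of its logs is anomalous are assumptions made for illustration; the source does not specify them.

```python
def fixed_windows(logs, window=100):
    """Group chronologically ordered logs into consecutive fixed-size sessions."""
    return [logs[i:i + window] for i in range(0, len(logs), window)]

def session_label(log_labels):
    """Assumption: a session is anomalous (1) if any log in it is anomalous."""
    return int(any(log_labels))

def chronological_split(items, train_fraction=0.8):
    """Earlier items go to training, later items to testing (no shuffling)."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

logs = list(range(250))               # stand-in for time-ordered log lines
sessions = fixed_windows(logs)
train, test = chronological_split(sessions, train_fraction=0.5)
print([len(s) for s in sessions], len(train), len(test))
```

Splitting by position rather than at random keeps every test session later in time than every training session, which is what makes the split mimic deployment.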

The reported baselines span both deep learning and LLM-based approaches: DeepLog, LogAnomaly, LogRobust, LogPrompt, and zero-shot Llama 2 7B. RationAnomaly is reported to outperform all baselines on all reported settings, achieving F1 scores of 0.909 on BGL session-level, 0.887 on BGL template-level, 0.958 on Spirit session-level, and 0.862 on Spirit template-level (Xu et al., 18 Sep 2025). On Spirit session-level, the reported precision and recall are both 0.959. These results support the paper’s claim that step-by-step anomaly reasoning can be simultaneously more interpretable and more accurate than prompt-only or classical alternatives in the tested settings.

The ablation study attributes the gains to multiple components rather than a single architectural change. Removing CoT-SFT hurts performance most among the training stages; removing RLA also causes a notable drop; and disabling Asymmetric Compensation (AC) or Thinking Evaluation (TE) degrades results. On Spirit, the reported F1 values are 0.862 for the full model, 0.817 without AC, 0.822 without TE, and 0.798 without RLA (Xu et al., 18 Sep 2025). Within the paper’s framing, these ablations support the role of structured reasoning induction and reward shaping as separate contributors.

The output format is itself part of the method. RationAnomaly produces a transparent, step-by-step explanation that includes domain knowledge interpretation, extraction of core evidence, rational deduction, and a final normal or anomalous decision. The case study explicitly shows the system identifying domain-specific meaning, extracting phrases such as “missing node,” and deducing a hardware detection failure. The model is therefore positioned not merely as a classifier, but as a generator of an auditable diagnostic trace (Xu et al., 18 Sep 2025).

4. Video CoAT: anomaly-focused reasoning under normality bias

In surveillance video, CoAT is defined as a multi-agent reasoning framework for anomaly detection and anomaly classification with large vision-language models (LVLMs). Its stated motivation is that LVLMs are biased toward normality and therefore tend to describe ordinary scene content, soften suspicious behavior, and avoid crime-oriented interpretations. The paper argues that standard CoT can be harmful in this domain because it adds reasoning depth without adding an anomaly or crime prior (Domingos et al., 23 Dec 2025).

The framework uses three thought agents. The Witness is an LVLM that answers questions about the video. The Detective is an LLM that generates follow-up questions. The Supervisor is an LLM that manages the reasoning process, tells the Detective what topic to pursue, and maintains the state of reasoning (Domingos et al., 23 Dec 2025). This reasoning state is stored in a State-of-Thoughts (SoT) graph whose nodes are question-answer pairs and whose edges are reasoning operations. The permitted operations are proceed, refine, split, and stop.
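One way to picture the State-of-Thoughts graph is as a small data structure whose nodes hold question-answer pairs and whose edges carry the operation that produced the child node. The class and field names below are hypothetical, and how the paper represents the stop operation internally is not stated, so this sketch simply lists it among the allowed edge labels.

```python
from dataclasses import dataclass, field
from enum import Enum

class Op(Enum):
    PROCEED = "proceed"
    REFINE = "refine"
    SPLIT = "split"
    STOP = "stop"

@dataclass
class QANode:
    question: str
    answer: str

@dataclass
class StateOfThoughts:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (parent_index, child_index, Op)

    def add(self, node, parent=None, op=None):
        """Append a QA node; link it to its parent with a reasoning operation."""
        if parent is not None and op is None:
            raise ValueError("an edge needs a reasoning operation")
        self.nodes.append(node)
        index = len(self.nodes) - 1
        if parent is not None:
            self.edges.append((parent, index, op))
        return index

sot = StateOfThoughts()
root = sot.add(QANode("Where does the scene take place?", "A parking lot."))
child = sot.add(QANode("Who is present?", "Two people near a car."),
                parent=root, op=Op.PROCEED)
print(len(sot.nodes), sot.edges[0][2].value)
```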

CoAT proceeds in two stages. The first is deliberately non-criminal, structured exploration. It is organized into four themed layers: scenario understanding, entity extraction, social context, and event understanding. The stated purpose is to build broad scene understanding without prematurely forcing an anomaly conclusion. The second stage is the anomaly classification layer with criminal bias. At that point, the Witness is prompted with predefined, optimized questions designed to induce anomaly-specific or criminal-biased reasoning, and the Supervisor produces the final classification from both the exploratory context and the targeted anomaly-focused answers (Domingos et al., 23 Dec 2025). The paper uses the phrase “inductive criminal bias” for this explicit prior toward suspicious actions, unusual interactions, harmful intent, and crime-related cues.
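The two-stage flow can be sketched with stub agents standing in for the Witness, Detective, and Supervisor. Everything here is a simplification under stated assumptions: the agents are plain Python callables with canned behavior, the Supervisor's proceed/refine/split/stop control is replaced by a fixed number of questions per layer, and the function and parameter names are hypothetical.

```python
LAYERS = ("scenario understanding", "entity extraction",
          "social context", "event understanding")

def run_two_stage(witness, detective, supervisor, anomaly_questions,
                  layers=LAYERS, questions_per_layer=1):
    """Stage 1: neutral exploration by layer. Stage 2: anomaly-focused questions."""
    exploration = []
    for topic in layers:
        for _ in range(questions_per_layer):
            question = detective(topic, exploration)
            exploration.append((question, witness(question)))
    targeted = [(q, witness(q)) for q in anomaly_questions]
    return supervisor(exploration, targeted)

# Stub agents with canned behavior (real agents would be an LVLM and LLMs).
def stub_witness(question):
    return "A person smashes a car window and takes a bag."

def stub_detective(topic, history):
    return f"What can be said about {topic}?"

def stub_supervisor(exploration, targeted):
    evidence = " ".join(answer.lower() for _, answer in targeted)
    return "Abnormal" if "smashes" in evidence else "Normal"

verdict = run_two_stage(stub_witness, stub_detective, stub_supervisor,
                        ["Is anyone damaging or taking property?"])
print(verdict)  # prints "Abnormal"
```

Keeping the anomaly-focused questions out of the first stage mirrors the stated design goal of building scene understanding before any crime-oriented prior is applied.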

The evaluation uses Qwen2.5 (7B) as the baseline LLM and Qwen2.5-VL (7B) as the baseline LVLM. The datasets are UCF-Crime, which contains 13 criminal classes and supports both anomaly detection (Normal vs Abnormal) and anomaly classification across the 13 categories, and BetterUCF, a higher-quality dataset built from curated UCF-Crime videos plus additional videos from ItemFix. The metric is F1-score (%) for both anomaly detection and anomaly classification (Domingos et al., 23 Dec 2025).

The most prominent reported variant is L4, corresponding to event understanding followed by the criminal classification layer. On UCF-Crime, the baseline scores are 40.28 for anomaly detection and 48.33 for anomaly classification, while CoAT L4 reaches 52.08 and 42.68. On BetterUCF, the baseline scores are 78.88 for anomaly detection and 48.11 for anomaly classification, while CoAT L4 reaches 87.77 and 51.89 (Domingos et al., 23 Dec 2025). The paper highlights two improvements in particular: an 11.8 percentage point F1 gain on UCF-Crime anomaly detection and a 3.78 percentage point anomaly classification gain on BetterUCF. It also suggests that event understanding is the most directly useful reasoning layer for anomaly detection because it is closest to action and causality cues. By contrast, the Joint variant is hypothesized to accumulate too much normality-biased context, diluting the final criminal-biased stage (Domingos et al., 23 Dec 2025).
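The two highlighted gains follow directly from the reported scores, as this short arithmetic check shows. Only the numbers quoted above are used; the dictionary layout is illustrative.

```python
# (baseline F1, CoAT L4 F1) in percent, as quoted in the text above
highlighted = {
    "UCF-Crime anomaly detection": (40.28, 52.08),
    "BetterUCF anomaly classification": (48.11, 51.89),
}

# Absolute gain in percentage points, rounded to two decimals
gains = {name: round(coat - base, 2) for name, (base, coat) in highlighted.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```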

5. Relation to broader chain-based reasoning research

Although anomaly-oriented CoAT is domain-specific, it sits within a wider body of work on search, deliberation, and stability in chain-based reasoning. The unrelated “Chain-of-Associated-Thoughts” framework provides one important comparison point. That framework combines an optimized Monte Carlo Tree Search with dynamic associative memory, inserting an explicit Association stage between expansion and evaluation. Each node contains generated reasoning content $\mathcal{G}(n)$ and associated memory $\mathcal{AM}(n)$, and later reasoning conditions on both prior generated content and accumulated associated memories: $\mathcal{G}(n_{i+1}) = \mathrm{LLM}\big(Q \mid \mathcal{G}(n_{1:i}), \mathcal{AM}(n_{1:i})\big).$ Selection is based on UCT, and node value combines generated-content and association scores: $V(n_i) = F_g\big(Q, \mathcal{G}(n_i)\big) + \beta \, F_a\big(\mathcal{G}(n_i), \mathcal{AM}(n_i)\big).$ This framework is not anomaly-specific, but it formalizes search-based revisitation and memory-informed refinement that are structurally relevant to any CoAT-style system seeking to revisit earlier inferences or incorporate new evidence during reasoning (Pan et al., 4 Feb 2025).
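For orientation, the selection and evaluation steps can be sketched with the textbook UCT rule and a node value that adds a weighted association score to a generated-content score. This is the standard UCT formula, not necessarily the optimized variant used in the framework, and the exploration constant `c`, the weight `beta`, and the dictionary-based node layout are assumptions.

```python
import math

def uct(value_sum, visits, parent_visits, c=1.414):
    """Textbook UCT: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    return value_sum / visits + c * math.sqrt(math.log(parent_visits) / visits)

def node_value(gen_score, assoc_score, beta=0.1):
    """Assumed form: generated-content score plus weighted association score."""
    return gen_score + beta * assoc_score

def select(children, parent_visits):
    """Pick the child with the highest UCT score."""
    return max(children,
               key=lambda ch: uct(ch["value_sum"], ch["visits"], parent_visits))

children = [
    {"name": "a", "value_sum": 3.0, "visits": 5},
    {"name": "b", "value_sum": 4.5, "visits": 5},
]
print(select(children, parent_visits=10)["name"])  # prints "b"
```

With equal visit counts the exploration bonus is identical for both children, so the child with the higher mean value is selected.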

A second relevant line concerns the evaluation of intermediate thoughts under noisy feedback. “Generating Chain-of-Thoughts with a Pairwise-Comparison Approach to Searching for the Most Promising Intermediate Thought” proposes comparison-based Tree-of-Thoughts (C-ToT), replacing noisy point-wise scoring with pairwise comparisons between candidate thoughts. The method uses iterative binary preference judgments, majority voting in the standard mode, and a dueling-bandit-inspired knockout procedure in the robust variant. On AQuA, the reported accuracy rises to 61.4% for C-ToT (Stand.) and 63.0% for C-ToT (Duel.), compared with 57.1% for SToT; on Game of 24, the reported scores are 40.0% and 41.0% versus 34.3%; and on Sudoku the method reaches top or tied-top performance across the reported settings (Zhang et al., 2024). This suggests that when intermediate thought evaluation is noisy, comparative selection may be preferable to absolute scoring.
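The idea of selecting a thought through comparisons rather than absolute scores can be sketched as a single-elimination tournament in which each pairing is decided by a majority over repeated judgments. This is a simplified illustration, not the C-ToT algorithm itself: the bracket layout, the number of votes, and the comparator interface are assumptions.

```python
def majority_prefers(a, b, compare, votes=3):
    """Return True if `compare(a, b)` favors `a` in a majority of repeated calls."""
    wins = sum(1 for _ in range(votes) if compare(a, b))
    return wins * 2 > votes

def knockout(candidates, compare, votes=3):
    """Single-elimination tournament over candidate thoughts."""
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one candidate")
    while len(pool) > 1:
        survivors = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            survivors.append(a if majority_prefers(a, b, compare, votes) else b)
        if len(pool) % 2 == 1:
            survivors.append(pool[-1])  # odd candidate out advances unchallenged
        pool = survivors
    return pool[0]

# Stand-in comparator: candidates are numbers and larger is better. A real
# comparator would ask a model which of two intermediate thoughts is more
# promising, and could return inconsistent answers across calls.
best = knockout([3, 9, 4, 7, 1], compare=lambda a, b: a > b)
print(best)  # prints 9
```

Repeating each comparison and taking the majority is what gives the procedure some tolerance to a noisy judge; with a perfectly consistent comparator, a single vote per pairing would suffice.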

The learning-theoretic analysis of CoT provides a more general framework for understanding when anomaly-oriented chains should help and when they should hurt. CoT is modeled as the interaction between an answer map and a multi-step chain rule, and the resulting reasoning risk admits a canonical decomposition into two terms: Trajectory-Mismatch Risk (TMR), which captures the cost of error accumulation along mismatched reasoning trajectories, and Oracle-Trajectory Risk (OTR), which captures the benefit of transforming the problem into oracle-guided subquestions (Zhang et al., 20 May 2026). The paper shows that without stability of the loss, the answer map, and the chain rule, TMR can be arbitrarily large even when OTR is zero; under stability, TMR is sharply upper-bounded, with bounded, linear, and exponential error-growth regimes depending on the amplification factor. The paper explicitly notes the relevance of this framework to a CoAT-style anomaly-oriented method, where OTR would measure whether anomaly subquestions land in a domain where the model is competent, and TMR would measure whether noisy intermediate anomaly steps send the reasoning path off track (Zhang et al., 20 May 2026).
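The three growth regimes can be illustrated with a toy recursion in which each reasoning step multiplies the inherited error by an amplification factor and adds a fixed per-step error. This recursion is an assumption chosen for illustration and is not taken from the paper; the names `a`, `eps`, and `accumulated_error` are hypothetical.

```python
def accumulated_error(a, eps, steps):
    """Toy model: e_{t+1} = a * e_t + eps, starting from e_0 = 0."""
    error = 0.0
    for _ in range(steps):
        error = a * error + eps
    return error

eps, steps = 0.1, 10
for a in (0.5, 1.0, 2.0):
    print(a, round(accumulated_error(a, eps, steps), 4))
```

In this toy model an amplification factor below 1 keeps the error under eps / (1 - a), a factor of exactly 1 gives error eps * steps, and a factor above 1 gives error that grows geometrically with the number of steps.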

6. Reliability, pathology, and adversarial manipulation of anomaly reasoning traces

The promise of CoAT depends not only on predictive accuracy but also on whether the visible reasoning trace is faithful, monitorable, and resistant to manipulation. “Diagnosing Pathological Chain-of-Thought in Reasoning Models” argues that chain-of-thought may fail in three distinct ways: post-hoc rationalization, encoded reasoning, and internalized reasoning. A healthy CoT should be causally necessary for the answer, semantically transparent enough that meaning survives paraphrasing, and substantive enough that the actual content matters. The paper proposes three lightweight, task-agnostic diagnostics based on interventions on the reasoning trace: Necessity, Paraphrasability, and Substantivity (Liu et al., 14 Feb 2026). These metrics are intended to detect, respectively, whether removing CoT leaves the answer unchanged, whether paraphrasing breaks a hidden codebook-like representation, and whether replacing the CoT with irrelevant filler has little effect because the real computation has been internalized. For any CoAT system that exposes diagnostic traces to operators, these pathologies define a concrete failure taxonomy for the notion of “transparent explanation.”
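The three interventions can be sketched as answer-change rates over a set of examples. This is one possible reading of the diagnostics, not the paper's metric definitions: the function names, the filler text, the rate-based scoring, and the stub model are all assumptions made for illustration.

```python
def intervention_rates(examples, answer_fn, paraphrase_fn,
                       filler="lorem ipsum dolor sit amet"):
    """Fraction of examples whose answer changes under each CoT intervention."""
    if not examples:
        raise ValueError("need at least one example")
    removed = paraphrased = filled = 0
    for question, cot in examples:
        base = answer_fn(question, cot)
        removed += answer_fn(question, "") != base                     # drop the CoT
        paraphrased += answer_fn(question, paraphrase_fn(cot)) != base  # reword it
        filled += answer_fn(question, filler) != base                  # swap in filler
    n = len(examples)
    return {
        "necessity": removed / n,                   # high: CoT is needed
        "paraphrase_stability": 1 - paraphrased / n,  # high: meaning survives rewording
        "substantivity": filled / n,                # high: content matters
    }

# Stub model: flags a log as abnormal only if its reasoning mentions a panic.
def stub_answer(question, cot):
    return "abnormal" if "panic" in cot.lower() else "normal"

def stub_paraphrase(cot):
    return cot.replace("PANIC", "kernel panic")

examples = [
    ("Is this log anomalous?", "PANIC: segmentation violation"),
    ("Is this log anomalous?", "heartbeat ok"),
]
rates = intervention_rates(examples, stub_answer, stub_paraphrase)
print(rates)
```

A trace that scores low on necessity or substantivity would be a candidate for post-hoc rationalization or internalized reasoning, while low paraphrase stability would point toward encoded reasoning.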

A separate line of work treats visible reasoning as an attack surface. “Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor” studies trigger-conditioned manipulation of observable CoT in open-weight ecosystems, particularly via lightweight adapter attacks. The proposed pipeline combines Multiple Reverse Tree Search (MRTS), which synthesizes output-aligned CoTs from prompt-output pairs using an embedding-distance criterion, with Two-stage Backdoor Hijacking (TSBH), which first establishes a trigger-conditioned mismatch between benign-looking CoT and malicious outputs, and then fine-tunes on MRTS-generated CoTs aligned with those outputs (Chang et al., 10 Apr 2026). The attack is evaluated with CoT Hijacking Rate (CHR) and Attack Success Rate (ASR), under both trigger-present and trigger-absent conditions, and is reported to achieve very high triggered success with low off-trigger activation across DeepSeek-7B, Qwen2.5-7B, and Llama3.2-3B. The broader implication is that an explicit reasoning trace is not automatically trustworthy simply because it is legible.

Within anomaly-oriented applications, these reliability findings cut in two directions. On one hand, RationAnomaly reports transparent, step-by-step analytical outputs and explicitly uses BLEU, ROUGE, perplexity, and length alignment to ground and regularize those outputs (Xu et al., 18 Sep 2025). On the other hand, the pathology and hijacking literature shows that monitorability cannot be inferred from surface plausibility alone (Liu et al., 14 Feb 2026, Chang et al., 10 Apr 2026). A plausible implication is that future CoAT evaluations will need to assess not only anomaly-detection metrics such as F1, precision, and recall, but also whether the reasoning trace is causally load-bearing, semantically faithful, and robust to adversarial shaping.