Kunlun Anomaly Troubleshooter (KAT)

Updated 15 November 2025
  • KAT is an integrated anomaly diagnosis and causal reasoning system that targets performance degradations in GPU-centric, large model distributed inference environments.
  • It leverages nanosecond-resolved trace collection and statistical analysis (TTAD) to pinpoint latency outliers with high precision, reducing false alarms.
  • KAT employs a domain-adapted large language model for structured causal explanations, significantly cutting diagnosis times and narrowing fault scope.

Kunlun Anomaly Troubleshooter (KAT) is an integrated anomaly diagnosis and causal reasoning system specifically designed for large model distributed inference (LMDI) settings. KAT addresses the complexity of debugging inference performance degradations, latency outliers, and resource contention in GPU-centric distributed environments employing frameworks such as DeepSpeed and vLLM. The system leverages the inherent synchronicity of LMDI workers to perform nanosecond-resolved, kernel-level anomaly detection, combining hardware/software trace analytics with a domain-adapted LLM that provides structured, operations-aware causal explanations.

1. System Architecture and High-Resolution Trace Collection

KAT is architected as two coupled subsystems: Outpost and Analyzer. Outpost deploys on each cluster node, combining a trace collector, nanosecond-grade timestamp normalization, and an anomaly detection engine. It hooks into CUDA/CUPTI, Python runtime (via LD_PRELOAD), and NCCL collectives to intercept every kernel launch, Python function call, and inter-GPU communication primitive. Each worker records per-thread event tuples (PID, TID, function name/args, nanosecond start/end, GPU id), flushed by a daemon to Kafka for downstream batch processing. Timestamp alignment is achieved by periodically calibrating CPU RDTSC to a global NTP/PTP reference, and aligning CUDA clocks with CPU clocks using a ping-pong protocol, providing ≈10 ns synchronization accuracy.
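
The source does not spell out the ping-pong exchange in detail; the following is a minimal, hypothetical sketch of the standard NTP-style offset estimation such a protocol typically reduces to, where read_remote_clock_ns is a placeholder for sampling the GPU or reference-node clock:

import time

def estimate_clock_offset_ns(read_remote_clock_ns, rounds=64):
    """Ping-pong offset estimation: bracket each remote clock sample between
    two local reads and keep the round with the smallest round-trip time,
    since it bounds the offset error most tightly."""
    best_rtt, best_offset = None, None
    for _ in range(rounds):
        t0 = time.monotonic_ns()           # local clock before the "ping"
        t_remote = read_remote_clock_ns()  # remote/device clock (the "pong")
        t1 = time.monotonic_ns()           # local clock after the reply
        rtt = t1 - t0
        # Assume the remote sample was taken at the midpoint of the round trip.
        offset = t_remote - (t0 + rtt // 2)
        if best_rtt is None or rtt < best_rtt:
            best_rtt, best_offset = rtt, offset
    return best_offset  # add to local timestamps to map them into remote time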

Trace events are parsed into hierarchical execution trees per worker. The system parses both Python-level (e.g., model.forward) and kernel-level activity (e.g., cuLaunchKernel), with interleaved NCCL collectives. All data are mapped to a normalized trace schema:

{ trace_id, PID, TID, role_id, func_name, start_ts (ns), dur (ns), GPU_id, str_id }

This enables event-level joins and facilitates stage-aware cross-GPU analysis.
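
For concreteness, a minimal Python rendering of this record might look as follows (field names are taken from the schema above; the types and comments are assumptions, not the published schema definition):

from dataclasses import dataclass

@dataclass
class TraceEvent:
    trace_id: str
    PID: int
    TID: int
    role_id: int     # worker/parallel role within the inference stage (assumed int)
    func_name: str   # Python function or CUDA kernel name
    start_ts: int    # nanoseconds on the calibrated global timeline
    dur: int         # duration in nanoseconds
    GPU_id: int
    str_id: int      # purpose not specified in the source; kept as in the schema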

2. Parallel Worker Synchronicity and Consistency Metrics

KAT exploits the lock-step nature of LMDI: at each stage, all GPUs execute an isomorphic sequence of kernel launches with strongly correlated timing. Structural consistency between two workers' trace trees (TT_w, TT_w′) is established via a SimHash-based Hamming distance; temporal consistency is modeled by the distribution of per-function durations. Timestamps are skew-corrected by measuring each worker's clock offset ΔC_(w,ref) against a common reference.

For each stage of inference, aligned trace events from N workers are compared. Under nominal conditions, the duration vectors for any function f at layer ℓ, {d_w(ℓ, f) : w = 1..N}, form a tight Gaussian-like cluster, making statistical outliers immediately identifiable.
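
A minimal sketch of how the SimHash/Hamming-distance structural check mentioned above could be realized; tokenizing a trace tree as the pre-order sequence of its function names is an assumption, not necessarily KAT's exact fingerprinting:

import hashlib

def simhash64(tokens):
    """64-bit SimHash over a token sequence (e.g., function names from a
    pre-order traversal of a worker's trace tree)."""
    bits = [0] * 64
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(64):
            bits[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, b in enumerate(bits) if b > 0)

def structural_distance(tokens_w, tokens_w_prime):
    """Hamming distance between two trace trees' SimHash fingerprints;
    small distances indicate structurally consistent workers."""
    return bin(simhash64(tokens_w) ^ simhash64(tokens_w_prime)).count("1")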

3. Kernel-Level Anomaly Detection Methodology

KAT’s anomaly detection algorithm, Trace Tree Anomaly Detection (TTAD), operates on the synchronized trace trees. For each group of aligned events G = {d_w} (same function/layer across workers), KAT computes:

  • Mean: μ_G = (1/|G|) Σ_{d∈G} d
  • Standard deviation: σ_G = sqrt((1/|G|) Σ_{d∈G} (d − μ_G)²)
  • Anomaly score: S = |d − μ_G| / σ_G

An event is flagged anomalous if S ≥ λ (significance coefficient, λ typically set to 3). The algorithm recurses through the trace tree, constructing a set of anomalous events A; see the following sketch:

import statistics

def TTAD(trace_trees, lambda_value):
    """Trace Tree Anomaly Detection: flag events whose duration deviates
    from the cross-worker mean by >= lambda_value standard deviations."""
    anomalies = []
    # Start from the root event of every worker's trace tree.
    events = [tt.root for tt in trace_trees]
    while events:
        checked = set()
        for e in events:
            if e.name in checked:
                continue
            checked.add(e.name)
            # Align events with the same function name across workers.
            group = [w for w in events if w.name == e.name]
            durations = [w.dur for w in group]
            mu = statistics.fmean(durations)
            sigma = statistics.pstdev(durations)
            if sigma == 0:
                continue  # identical durations across workers: nothing to flag
            for w in group:
                if abs(w.dur - mu) >= lambda_value * sigma:
                    anomalies.append(w)
        # Descend one level: the children of all events at the current level.
        events = [child for e in events for child in e.children]
    return anomalies
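
A toy invocation of the sketch (the node and tree classes, worker count, and durations are illustrative only):

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    dur: float
    children: list = field(default_factory=list)

@dataclass
class TraceTree:
    root: Node

# 16 workers execute the same stage; the last worker's gemm kernel (and hence
# its forward pass) is roughly 10x slower than its peers.
trees = [TraceTree(Node("forward", 100.0, [Node("gemm_kernel", 3.2)]))
         for _ in range(15)]
trees.append(TraceTree(Node("forward", 130.0, [Node("gemm_kernel", 32.0)])))

print([(a.name, a.dur) for a in TTAD(trees, lambda_value=3)])
# -> [('forward', 130.0), ('gemm_kernel', 32.0)]
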
Precision, recall, and F1 are computed as in standard practice, with KAT achieving 0.884 precision and 0.936 recall (F1=0.901), and a false positive rate of 0.27% when benchmarked against expert annotations in Alibaba Cloud production (Liu et al., 8 Nov 2025).
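
For reference, these metrics follow the standard definitions over expert-labeled event sets; a generic helper (not KAT-specific code) is:

def precision_recall_f1(flagged, labeled):
    """flagged, labeled: sets of event identifiers (predicted vs. ground truth)."""
    tp = len(flagged & labeled)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(labeled) if labeled else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1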

4. Integration with Domain-Adapted LLM for Causal Reasoning

Detected anomalies are exported as compact JSON (function name, duration, GPU, stage), plus second-level hardware counters (e.g., GPU utilization, memory bandwidth, temperature). Analyzer, running as a separate LLM inference service, ingests these reports via gRPC. Analyzer’s LLM is obtained by domain-adaptive pre-training (DAPT) of Qwen-14B on 12 technical corpora, including CUDA/NCCL documentation and cloud system logs (100 K masked-LM steps), and subsequently fine-tuned via supervised chain-of-thought (CoT) data (26 expert hard-case exemplars with input/think/output annotations).
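
A hedged sketch of the kind of compact report Outpost could hand to Analyzer; the field names below are inferred from the description (function name, duration, GPU, stage, plus counters) and are not the published wire schema:

import json

def build_anomaly_report(anomalies, hw_counters, task_config):
    """Serialize flagged events plus coarse hardware counters into a compact
    JSON payload for the Analyzer service. Field names are illustrative."""
    return json.dumps({
        "task_config": task_config,            # model, batch_size, framework, ...
        "anomalous_events": [
            {"func_name": a["func_name"],
             "dur_ns": a["dur_ns"],
             "gpu_id": a["gpu_id"],
             "stage": a["stage"]}
            for a in anomalies
        ],
        "metrics": hw_counters,                # GPU_util, mem_bw, temperature, ...
    })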

The prompt format provides explicit context:

Task Configurations: model, batch_size, framework...
Anomalous Events: [ {thread_role: Kernel, name: 'gemm_kernel', dur: 3.2 ms, gpu: 2}, ... ]
Metrics: GPU_util, mem_bw, temperature trends.

Causal reasoning is enforced by requiring CoT-style explanations and role-based grouping of incidents across hardware/software boundaries. The Analyzer’s outputs (root cause and suggested remediation) are assessed using ROUGE-L, VFR, ΔRCA-F1, entailment, and calibrated subjective ratings.
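
One plausible, non-authoritative way to render such a report into the three-part prompt layout shown above:

def format_prompt(report: dict) -> str:
    """Render an anomaly report dict into the Task/Events/Metrics prompt layout."""
    cfg = ", ".join(f"{k}={v}" for k, v in report["task_config"].items())
    events = ", ".join(
        f"{{name: {e['func_name']}, dur: {e['dur_ns']} ns, gpu: {e['gpu_id']}}}"
        for e in report["anomalous_events"]
    )
    metrics = ", ".join(f"{k}={v}" for k, v in report["metrics"].items())
    return (f"Task Configurations: {cfg}\n"
            f"Anomalous Events: [ {events} ]\n"
            f"Metrics: {metrics}")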

5. Empirical Evaluation and Baseline Comparison

Comprehensive validation in Alibaba Cloud includes 42 production anomaly cases, each spanning 0.9–5.7 M trace events and 200–300 threads. Ground-truth covers multiple fault classes: GPU compute degradation (including thermal throttling, P-state), bandwidth congestion (PCIe/NVLink), and storage/memory faults. Outpost’s precision exceeds LSTM-Autoencoder baselines by +26%, with 85% fewer false alarms. Analyzer outperforms GPT-4o (ΔRCA-F1: 0.479 vs 0.416) and Gemini-2.5-Pro (ΔRCA-F1: 0.479 vs 0.282). End-to-end analysis latency is <2 s for a 100k-event trace, compared to 5–10 minutes for manual inspection. Operator diagnosis time is reduced by ~95%, and the root-cause identification success rate increases by ~40%. KAT typically reduces diagnosis scope from hundreds of candidate functions to no more than three kernels or hardware components.

Component             Precision   Recall   F1      Latency
Outpost (detection)   0.884       0.936    0.901   < 2 s/100k events
Analyzer (ΔRCA-F1)    –           –        0.479   < 2 s/100k events

KAT consistently localizes root-cause anomalies to a handful of relevant kernels or devices, substantially narrowing the diagnostic window.

6. Design Implications and Deployment Impact

KAT demonstrates for the first time that automated, hierarchical trace-tree comparison, synchronized by cross-worker temporal alignment and validated by statistical deviation, can yield actionable kernel-level anomaly identifications in distributed LMDI. By fusing trace-based detection with domain-specific LLM suggestions, the system overcomes the ambiguity, labor intensiveness, and narrow recall of rule- or threshold-based diagnostics. The architecture natively supports rapid scaling to hundreds of GPUs, with minimal per-stage overhead.

The deployment of KAT enables:

  • Efficient, fine-grained root cause isolation and mitigation in heavily loaded production clusters.
  • Dramatic reduction in the mean-time-to-diagnose (MTTD) for inference incidents.
  • Integration of hardware-level (e.g., GPU, link) and software-level (Python, NCCL) telemetry into a unified reasoning pipeline for SRE and ML-ops teams.

The performance and response properties, together with the tight integration between statistical detection and LLM-driven explanatory inference, mark KAT as a new standard for anomaly troubleshooting in large model distributed inference (Liu et al., 8 Nov 2025).
