
LatencyPrism: Zero-Intrusion LLM Profiling

Updated 21 January 2026
  • LatencyPrism is a zero-intrusion, full-stack system that decomposes LLM inference latency into distinct stages for real-time anomaly detection and SLO enforcement.
  • It employs layered components—perception, comprehension, and adaptation—to capture OS, Python, and GPU metrics with minimal overhead in dynamic, multi-tenant settings.
  • The system achieves high detection accuracy using GBDT regression and residual analysis, triggering sub-millisecond alerts and proactive SLO corrections.

LatencyPrism is a zero-intrusion, full-stack latency sculpting and profiling system for service-level objective (SLO)-guaranteed LLM inference in distributed, heterogeneous AI serving environments. Designed to address both the complexity of modern LLM deployment stacks and the demands of real-time anomaly detection, LatencyPrism delivers non-intrusive, online latency decomposition with millisecond-level alerting and SLO enforcement, without requiring code modifications or interruptions to ongoing serving jobs. It is deployed at data center scale across thousands of XPUs, providing robust, production-grade detection and diagnosis of latency anomalies with negligible computational overhead (Yin et al., 14 Jan 2026).

1. Motivation and Design Requirements

Mission-critical LLM-based services are highly sensitive to latency tail events (e.g., generation stalls or token decode spikes), as even brief excursions beyond SLOs can degrade user experience and violate SLAs, regardless of favorable average performance. Traditional approaches—platform-specific profilers, static threshold monitors, hardware-bound tracing agents—fail to deliver (a) full-stack causal attribution, (b) online, non-intrusive operation, and (c) SLO-centric responsiveness required for large-scale, heterogeneous deployments. Additional challenges include:

  • Diverse environments involving Python/C++ user land, custom LLM backends, multiple XPU vendors (e.g., NVIDIA, AMD, proprietary accelerators).
  • Highly dynamic, multi-tenant workloads where naive baselining is invalidated by fluctuating input lengths, batch sizes, and output patterns.
  • The need to distinguish normal workload-driven variance from actionable performance regressions.

LatencyPrism is engineered to solve these pain points by sculpting the latency distribution (i.e., shaping tails, not just averages) and enabling fast, actionable SLO enforcement (Yin et al., 14 Jan 2026).

2. Architecture and Component Stack

LatencyPrism comprises three architectural layers, each mapping to orthogonal observability and control functions:

Layered Composition

| Layer | Core Component | Functionality |
| --- | --- | --- |
| Perception | Data collector | eBPF kprobes/uprobes for OS, ptrace for Python, CUPTI/ROCm for GPU, telemetry |
| Comprehension | Latency decomposer | Stage segmentation, cross-stack event alignment, workload-aware baselining |
| Adaptation | Anomaly detector & SLO controller | Online prediction, control charting, alerting, deep-dive trigger |

Perception Layer: Captures events independently of code or framework via kernel probes (eBPF kprobes/uprobes), ptrace-based hooks for Python call stack, and GPU-level monitoring via CUPTI/ROCm APIs. Timestamped meta-events (frame entry/exit, kernel launches, memory ops) are aggregated in shared-memory buffers for sub-millisecond coordination [(Yin et al., 14 Jan 2026), Sec. 2.1].

Comprehension Layer: Segments model execution into "batch cycles," discriminating between prefill and decode stages by recognizing anchor points in the Python stack or kernel invocation frequency signatures. Implements semantic alignment between user code, driver APIs, and hardware metrics to enable precise attribution of latency contributors. Baseline models are constructed on-the-fly by extracting workload features (e.g., batch size $B$, input length $L_\mathrm{in}$, output length $L_\mathrm{out}$) [(Yin et al., 14 Jan 2026), Sec. 2.2].

Adaptation Layer: Runs non-parametric, workload-aware prediction (GBDT regression) and dynamic residual-based control charting to distinguish legitimate load-driven latency variation from actionable anomalies. Switches between always-on "Sentinel" mode (<0.5% CPU overhead) and on-demand "deep-dive" (7% additional overhead) based on violation windows. Proactively triggers SLO-correction actions (e.g., load shedding, resource reprovisioning) within milliseconds [(Yin et al., 14 Jan 2026), Sec. 2.3].
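The Sentinel/deep-dive switching described above can be sketched as a small state machine. The class below is an illustrative assumption, not the paper's implementation: the class name, the `violation_window` threshold, and the mode strings are all ours.

```python
class ModeController:
    """Illustrative sketch of Sentinel vs. deep-dive mode switching.

    Sentinel mode stays always on (<0.5% CPU overhead per the paper);
    deep-dive tracing is enabled only while SLO violations persist.
    The violation-window length here is an assumed value.
    """

    SENTINEL, DEEP_DIVE = "sentinel", "deep_dive"

    def __init__(self, violation_window=3):
        self.violation_window = violation_window
        self.consecutive_violations = 0
        self.mode = self.SENTINEL

    def update(self, slo_violated):
        """Feed one monitoring interval; return the resulting mode."""
        if slo_violated:
            self.consecutive_violations += 1
        else:
            self.consecutive_violations = 0
        # Enter deep-dive after a sustained violation window; drop back
        # to Sentinel (unloading probes) once latency recovers.
        if self.consecutive_violations >= self.violation_window:
            self.mode = self.DEEP_DIVE
        elif self.consecutive_violations == 0:
            self.mode = self.SENTINEL
        return self.mode

ctl = ModeController()
modes = [ctl.update(v) for v in (True, True, True, False)]
```

With the assumed window of 3, two isolated violations leave the controller in Sentinel mode; only a sustained run of violations pays the deep-dive overhead, and recovery unwinds it immediately.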

3. Latency Decomposition and Attribution

LatencyPrism provides stage-wise decomposition of per-batch end-to-end latency at microsecond precision:

Let the decode iteration’s total latency be

$L_\mathrm{total} = L_\mathrm{CPU\_sched} + L_\mathrm{CPU\_exec} + L_\mathrm{XPU\_exec} + L_\mathrm{net}$

where:

  • $L_\mathrm{CPU\_sched}$ is measured from kernel scheduling events;
  • $L_\mathrm{CPU\_exec}$ from Python function spans;
  • $L_\mathrm{XPU\_exec} = \sum_k L_k$ over GPU kernels;
  • $L_\mathrm{net}$ from packet-level network timing.

Component ratios are computed as:

$\alpha_\mathrm{CPU} = \frac{L_\mathrm{CPU\_sched} + L_\mathrm{CPU\_exec}}{L_\mathrm{total}},\quad \alpha_\mathrm{XPU} = \frac{L_\mathrm{XPU\_exec}}{L_\mathrm{total}},\quad \alpha_\mathrm{net} = \frac{L_\mathrm{net}}{L_\mathrm{total}}$

This attribution enables pinpointing whether latency outliers originate in Python scheduling, GPU compute, or inter-node communication [(Yin et al., 14 Jan 2026), Sec. 3].
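The attribution above maps directly onto a small amount of arithmetic over per-stage timings. The helper below is a minimal sketch of that computation; the function and field names are ours, not from the paper.

```python
def decompose_latency(cpu_sched_us, cpu_exec_us, xpu_kernel_us, net_us):
    """Attribute one decode iteration's latency across the stack.

    cpu_sched_us:  time in kernel scheduling (from eBPF sched events)
    cpu_exec_us:   time in Python function spans
    xpu_kernel_us: per-kernel GPU durations (from CUPTI/ROCm)
    net_us:        packet-level network time
    All values in microseconds.
    """
    l_xpu = sum(xpu_kernel_us)  # L_XPU_exec = sum over GPU kernels
    l_total = cpu_sched_us + cpu_exec_us + l_xpu + net_us
    return {
        "L_total": l_total,
        "alpha_cpu": (cpu_sched_us + cpu_exec_us) / l_total,
        "alpha_xpu": l_xpu / l_total,
        "alpha_net": net_us / l_total,
    }

# Hypothetical decode step dominated by GPU compute.
ratios = decompose_latency(120.0, 380.0, [900.0, 450.0, 150.0], 500.0)
```

By construction the three ratios sum to one, so a spike in any single $\alpha$ component immediately localizes the dominant contributor.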

4. Online Anomaly Detection and SLO Enforcement

LatencyPrism employs a two-phase detection and enforcement methodology:

  • Baseline prediction: $\hat Y = f(B, L_\mathrm{real})$ (with $L_\mathrm{real} = L_\mathrm{in} + L_\mathrm{out}$) uses a GBDT regressor, incorporating physical features such as $W_{kv} = B \times L_\mathrm{real}$ to encode KV-cache memory load and uncover compute-communication tradeoffs.
  • Residual analysis: The scaled positive prediction error (PPE) is

$E_t = \max\left(0, \frac{Y_t - \hat Y_t}{Y_t + \epsilon}\right)$

which is moving-averaged ($\bar E_t$) and compared to a dynamic upper control limit ($UCL_\mathrm{dyn}$), computed as

$UCL_\mathrm{dyn} = \min(\mu_\mathrm{train} + 3\sigma_\mathrm{train},\ \theta_\mathrm{max})$

where $\theta_\mathrm{max}$ is empirically capped per environment [(Yin et al., 14 Jan 2026), Sec. 4.2].

If $\bar E_t > UCL_\mathrm{dyn}$, an anomaly alert is fired and deep-dive tracing is activated. The detection metrics achieved in practice include F1 ≈ 0.985, precision ≈ 0.97, recall ≈ 0.999, and false positive rate ≈ 0.6%, with a detection lag of ≈ 0.2 ms [(Yin et al., 14 Jan 2026), Sec. 4.3]. Sentinel mode ensures production viability with <0.5% CPU overhead and <0.1% latency increase under always-on operation [(Yin et al., 14 Jan 2026), Sec. 5.2].
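The PPE and control-chart logic lends itself to a compact sketch. The class below substitutes a caller-supplied `predict` callable for the paper's GBDT baseline, and the window size, UCL, and epsilon values are illustrative assumptions, not the production configuration.

```python
class ResidualDetector:
    """Sketch of residual-based control charting over a latency baseline.

    `predict` stands in for the paper's GBDT baseline model; `ucl`
    corresponds to UCL_dyn, assumed precomputed from training residuals.
    """

    def __init__(self, predict, ucl, window=4, epsilon=1e-6):
        self.predict = predict      # workload features -> expected latency
        self.ucl = ucl              # dynamic upper control limit UCL_dyn
        self.window = window        # moving-average window (assumed value)
        self.epsilon = epsilon
        self.errors = []

    def observe(self, features, latency):
        """Return True if the moving-average PPE exceeds UCL_dyn."""
        y_hat = self.predict(features)
        # Scaled positive prediction error: only slowdowns count.
        ppe = max(0.0, (latency - y_hat) / (latency + self.epsilon))
        self.errors.append(ppe)
        recent = self.errors[-self.window:]
        return sum(recent) / len(recent) > self.ucl

# Toy linear baseline standing in for the GBDT regressor.
detector = ResidualDetector(predict=lambda f: 0.01 * f["B"] * f["L_real"],
                            ucl=0.2)
normal = detector.observe({"B": 8, "L_real": 1024}, latency=82.0)
anomaly = detector.observe({"B": 8, "L_real": 1024}, latency=200.0)
```

Because the PPE is clipped at zero, faster-than-baseline batches never pull the moving average down below zero, so a genuine slowdown cannot be masked by preceding fast iterations.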

5. Implementation, Overhead, and Deployment

LatencyPrism requires no code modifications, process restarts, or framework changes. All tracing is attached dynamically at runtime: Python probes via ptrace, OS probes via eBPF, and GPU probes via LD_PRELOAD or driver-level hooks. Probes unload automatically after deep-dive capture, ensuring minimal impact on production workloads [(Yin et al., 14 Jan 2026), Sec. 5.1].

Deployment at scale shows:

  • End-to-end throughput loss <0.5%.
  • P50 latency inflation <0.1% under default mode.
  • Generalizes across LLM serving frameworks (e.g., SGLang, vLLM) and unseen hardware/workload combinations with <10% prediction error after 200–1000 samples [(Yin et al., 14 Jan 2026), Sec. 6].

Compared to vendor profilers (e.g., Nsight, rocprof, PyTorch Profiler) and CPU-only eBPF systems, LatencyPrism uniquely achieves full-stack, online, zero-restart, SLO-driven anomaly detection [(Yin et al., 14 Jan 2026), Sec. 6.4].

6. Root Cause Analysis and Case Studies

Root cause identification leverages a "suspicion score" combining normalized time proportion and resource utilization anomalies:

$\mathrm{Score} = |\Delta\beta| \left( |Z_\beta| + |Z_{\log\mu}| \right)$

where $\Delta\beta$ is the deviation in the operation's time share, $Z_\beta$ the z-score of that deviation, and $Z_{\log\mu}$ the z-score of the log-transformed resource utilization.
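The score is a single product, sketched below with hypothetical inputs (the paper does not publish the z-score values behind its case-study scores, so these numbers are illustrative only).

```python
def suspicion_score(delta_beta, z_beta, z_log_util):
    """Suspicion score for root-cause ranking.

    delta_beta: change in the operation's share of total time vs. baseline
    z_beta:     z-score of that time-share deviation
    z_log_util: z-score of the log-transformed resource utilization
    """
    return abs(delta_beta) * (abs(z_beta) + abs(z_log_util))

# Hypothetical op whose time share jumped and whose utilization sits far
# outside its historical distribution.
score = suspicion_score(delta_beta=0.29, z_beta=6.0, z_log_util=4.0)

# An op with no time-share change scores zero regardless of utilization,
# since |delta_beta| multiplies the whole expression.
quiet = suspicion_score(delta_beta=0.0, z_beta=0.5, z_log_util=9.0)
```

Multiplying by $|\Delta\beta|$ rather than adding it ensures that operations whose time share did not move are ranked at the bottom even when their utilization looks statistically unusual.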

Case studies include:

  • CPU contention: $\Delta\beta = +29\%$, $\Delta\mu_\mathrm{cpu} = +50\%$, Score = 2821.
  • GPU kernel queueing: Score = 1561.
  • NVLink AllReduce congestion: $\Delta\beta = +86\%$, Score = $2.9\times10^5$.
  • CPU frequency drop: negative $\Delta\beta$, $\Delta\mu = -63\%$, Score = 177.

Six-month production deployment resulted in detection of over 1,200 real-world anomalies (alerting within <5 ms) and halved mean time to resolution for operators [(Yin et al., 14 Jan 2026), Sec. 7].

7. Limitations and Prospects

LatencyPrism requires a warm-up period to build reliable baselines for new workload types or after hardware upgrades; persistent slowdowns that lack a "normal" reference may go undetected. Proposed future directions include calibration-free adaptive baselining, natural-language trace summarization using LLMs, and extension to non-mainstream XPU platforms [(Yin et al., 14 Jan 2026), Sec. 8.2]. A plausible implication is that the fundamental decoupling of observability logic from serving code positions LatencyPrism to support the evolving diversity of LLM deployment architectures.

Reference: (Yin et al., 14 Jan 2026)
