
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

Published 13 May 2026 in cs.LG | (2605.13784v1)

Abstract: Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-driven computational model centred on stateful sessions: a persistent KV cache advanced incrementally as new data arrives, so prefill is moved off the critical path and query latency becomes O(|q|), independent of accumulated context size. Building on this, Flash Queries reclaim idle GPU cycles between data arrivals to pre-evaluate registered questions and return cached answers before the user asks, a pattern that is structurally impossible in stateless engines because they discard intermediate state between requests. A multi-tenant continuous-batching scheduler with cell-budget admission and prefix-aware grouped prefill lets dozens of stateful sessions coexist on a single GPU while preserving full quadratic self-attention. On streaming market-data benchmarks the reference implementation achieves up to 5.9x speedup over conventional inference engines (vLLM, SGLang, TensorRT-LLM, llama.cpp), holding query latency constant as accumulated context grows.

Authors (1)

Summary

  • The paper introduces a novel stateful, session-based transformer inference architecture that decouples asynchronous data ingestion from query evaluation.
  • It demonstrates constant query latency (~43ms) regardless of context size, achieving 2.4×–5.9× speedup over open-weight engines and 21×–92× over cloud APIs.
  • The approach leverages persistent KV-cache, hierarchical context partitioning, and flash queries to enable scalable, low-latency streaming analytics.

Architectural Innovations for Stateful Streaming Inference

"Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers" (2605.13784) proposes an inference architecture tailored for streaming workloads, where new data arrives continually and user queries arrive sporadically. The core contribution is a stateful session model built on a persistent KV-cache, which turns transformer inference from a stateless, request-driven paradigm into a data-driven, session-oriented process. The result is query latency that depends on query length rather than on accumulated context length.

Request-Driven vs. Data-Driven Processing

Conventional inference frameworks (vLLM, SGLang, TensorRT-LLM, llama.cpp) optimize for high cache hit rates on request-driven traffic, where queries often share static prefixes. Streaming workloads behave differently: every new data point appends to the context, so prefix caches miss frequently and offer little benefit. Such engines therefore recompute attention over the entire context for each query, and per-query latency grows as $\mathcal{O}(n)$ with context length $n$.

The proposed data-driven model inverts the paradigm: context is incrementally processed into a persistent KV-cache as data arrives, decoupled from user queries. Queries, when issued, act as lightweight consumers of precomputed attention states, with runtime that scales with query length rather than total context. Per-query computation becomes $\mathcal{O}(|q|)$, which is effectively constant for a fixed query length, regardless of how much historical data was previously ingested.

(Figure 1)

Figure 1: Performance comparison across nine systems on streaming OHLCV data (155–925 samples). The proposed architecture maintains constant $\sim$43 ms query latency regardless of context size, while conventional engines show context-proportional increases.

This foundational distinction enables the speedup seen in streaming workloads—where each new data element is processed asynchronously, amortizing the prefill cost and offloading it from the latency-critical query path. The architecture additionally separates data and query planes, maximizing throughput and minimizing head-of-line blocking on GPU resources.
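The difference between the two models can be illustrated with a toy cost counter. This is a minimal sketch, not the paper's implementation: `ToyStatefulSession`, `stateless_query_tokens`, and the 20-token query length are hypothetical, and the metric is simply the number of tokens that need a forward pass on the query path (it ignores the attention reads over the cached context).

```python
from dataclasses import dataclass


def stateless_query_tokens(context_len: int, query_len: int) -> int:
    """Request-driven engine: context and query are both prefilled per request."""
    return context_len + query_len


@dataclass
class ToyStatefulSession:
    cached_tokens: int = 0   # tokens whose KV entries are already stored
    ingest_tokens: int = 0   # total prefill work done off the query path

    def ingest(self, n_new: int) -> None:
        """Advance the persistent cache when data arrives (no query involved)."""
        self.cached_tokens += n_new
        self.ingest_tokens += n_new

    def query_tokens(self, query_len: int) -> int:
        """Only the query tokens need a forward pass; the context is cached."""
        return query_len


session = ToyStatefulSession()
for context_len in (2_600, 14_800):
    # Ingest whatever arrived since the last step, then compare query costs.
    session.ingest(context_len - session.cached_tokens)
    print(context_len,
          stateless_query_tokens(context_len, 20),
          session.query_tokens(20))
```

In this sketch the stateless count grows with the context while the stateful count stays at the query length; the prefill work has not disappeared, it has moved into `ingest`.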

Hierarchical Context Partitioning and Persistent State

The architecture introduces a three-region context model:

  • Region 0 (Frozen): Static prompt/instructions, processed once at session initialization.
  • Region 1 (Sliding): Buffer for streaming data with bounded retention and FIFO eviction, supporting append-only or sliding window semantics.
  • Region 2 (Ephemeral): Query and response tokens, cleared between queries to allow repeated interaction against persistent data.
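The three regions can be sketched as a small data structure. This is an illustrative model only: the class and method names are hypothetical, tokens are plain integers, and eviction is done token by token, whereas a real engine would also have to manage KV entries and position metadata.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RegionedContext:
    frozen: tuple[int, ...]                                 # Region 0: set once at session start
    window: int                                             # max tokens retained in Region 1
    sliding: deque[int] = field(default_factory=deque)      # Region 1: streaming data
    ephemeral: list[int] = field(default_factory=list)      # Region 2: query/response tokens

    def append_data(self, tokens: list[int]) -> None:
        """Append streaming tokens, evicting the oldest ones first (FIFO)."""
        self.sliding.extend(tokens)
        while len(self.sliding) > self.window:
            self.sliding.popleft()

    def begin_query(self, tokens: list[int]) -> None:
        """Place query tokens in the ephemeral region."""
        self.ephemeral = list(tokens)

    def end_query(self) -> None:
        """Clear the ephemeral region so the next query sees only persistent data."""
        self.ephemeral.clear()

    def visible(self) -> list[int]:
        """Tokens the model attends to, in region order."""
        return list(self.frozen) + list(self.sliding) + self.ephemeral
```

For example, with `window=3`, appending four data tokens evicts the oldest one, and calling `end_query()` after a query leaves the frozen and sliding regions untouched.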

Persistent session management is indispensable. Each session maintains intermediate representations for its context, including precise buffer metadata and position tracking—essential for both incremental update and low-latency query serving. Memory consumption is proportional to the active tokens per session: for Llama-3 8B (32 layers, 4096 hidden, 32K tokens), per-session KV-cache demands approximate 17 GB in FP16.
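The 17 GB figure can be reproduced with a short calculation. The sketch below assumes that "32K tokens" means 32 × 1024, that FP16 uses 2 bytes per value, and that the key and value vectors each have the full hidden width of 4096; the function name is hypothetical. If a deployment used a narrower KV width (for example through grouped-query attention), the result would shrink proportionally.

```python
def kv_cache_bytes(layers: int, kv_width: int, tokens: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to store keys and values for every token in every layer."""
    # Factor 2: one key vector and one value vector per token per layer.
    return 2 * layers * kv_width * tokens * bytes_per_value


per_session = kv_cache_bytes(layers=32, kv_width=4096, tokens=32 * 1024)
print(f"{per_session / 1e9:.1f} GB")  # 17.2 GB
```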

This partitioned context model is a fundamental enabler of constant-latency inference; session state is always ready for instantaneous query evaluation, as only new deltas are processed upon arrival and the ephemeral region is reused for query/response execution.

Flash Queries: Pushing Latency Toward Zero

Flash Queries extend the core by precomputing answers to user-registered template queries during idle GPU cycles between data arrivals. Since many streaming workloads (e.g., financial dashboards) repeatedly pose similar questions as data evolves, this speculative execution model ensures frequent query types are served directly from a cache, requiring no new forward passes at inference time. Flash Query cache hits incur only network round-trip latency ($\sim$2 ms end-to-end), and the architecture includes a speculative exit for fast-answer queries when high-confidence logits are observed in the "ready" position.
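One way to picture the Flash Query cache is as a table of answers tagged with the data version they were computed against. The sketch below is an assumption-laden illustration, not the paper's design: `FlashQueryCache`, its methods, and the `evaluate` callback (standing in for a forward pass) are all hypothetical.

```python
from typing import Callable


class FlashQueryCache:
    def __init__(self, evaluate: Callable[[str, int], str]):
        self._evaluate = evaluate                        # hypothetical stand-in for a forward pass
        self._registered: list[str] = []
        self._answers: dict[str, tuple[int, str]] = {}   # query -> (data version, answer)
        self.version = 0                                 # bumped on every data arrival

    def register(self, query: str) -> None:
        if query not in self._registered:
            self._registered.append(query)

    def on_data(self) -> None:
        """New data arrived: every cached answer is now stale."""
        self.version += 1

    def precompute(self, budget: int) -> int:
        """During idle time, refresh up to `budget` stale registered queries."""
        done = 0
        for query in self._registered:
            if done >= budget:
                break
            cached = self._answers.get(query)
            if cached is None or cached[0] != self.version:
                self._answers[query] = (self.version, self._evaluate(query, self.version))
                done += 1
        return done

    def ask(self, query: str) -> tuple[str, bool]:
        """Return (answer, cache_hit); fall back to a live evaluation on a miss."""
        cached = self._answers.get(query)
        if cached is not None and cached[0] == self.version:
            return cached[1], True
        answer = self._evaluate(query, self.version)
        self._answers[query] = (self.version, answer)
        return answer, False
```

A hit returns without calling `evaluate`, which mirrors the claim that cache hits need no new forward pass; a query asked before the idle-time refresh has run simply takes the normal path.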

The maximum number of Flash Queries per data update is bounded by the ratio of the data arrival interval to forward pass cost; e.g., with $T_{\text{data}} = 1$ s and $T_f = 33$ ms, up to $\sim$25 queries can be precomputed per cycle. This arrangement transforms accelerators into always-productive platforms, reclaiming the otherwise-idle time between streaming data ingest events.
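The bound can be written as a small helper. In this sketch the function name and the `reserved_ms` parameter are hypothetical. The plain ratio 1000 ms / 33 ms allows 30 forward passes; the quoted figure of about 25 would follow if roughly 175 ms per cycle were set aside for ingestion and live queries, which is an assumption made here for illustration and not a number from the source.

```python
def flash_query_budget(t_data_ms: float, t_forward_ms: float,
                       reserved_ms: float = 0.0) -> int:
    """Whole forward passes that fit in the idle part of one data interval."""
    if t_forward_ms <= 0:
        raise ValueError("t_forward_ms must be positive")
    idle_ms = max(0.0, t_data_ms - reserved_ms)
    return int(idle_ms // t_forward_ms)


print(flash_query_budget(1000, 33))                   # 30
print(flash_query_budget(1000, 33, reserved_ms=175))  # 25
```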

Multi-Tenant Scheduling and Batching

A priority-scheduled GPU decoding pipeline lets dozens of active stateful sessions share a GPU with stateless API traffic. The scheduler enforces:

  • Cell-budget admission: Ensures unified KV-cache occupancy does not exceed hardware upper bounds, avoiding memory spikes during concurrent loads.
  • Adaptive chunked prefill: Dynamically sizes batching granularity per active session, balancing maximal throughput and minimal query latency; avoids starvation in heavily multi-tenant environments.
  • Prefix-aware grouped prefill: Identifies and consolidates requests with byte-identical prefixes to a single forward computation with metadata-only aliasing.
  • Speculative decoding with concurrency cap: Per-slot speculation is capped based on acceptance rates and active session count, delivering gains for repetitive workloads without throughput collapse under concurrent load.
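Two of the mechanisms above, cell-budget admission and prefix-aware grouping, can be sketched in a few lines. Everything here is hypothetical: a "cell" is taken to mean one token slot in the shared KV cache, and prefixes are compared over a fixed byte length, which is a simplification of whatever matching the actual scheduler performs.

```python
from collections import defaultdict


class CellBudget:
    """Admit sessions only while total KV-cache cells stay within a fixed budget."""

    def __init__(self, total_cells: int):
        self.total = total_cells
        self.used: dict[str, int] = {}   # session id -> cells currently held

    def try_admit(self, session_id: str, cells: int) -> bool:
        if sum(self.used.values()) + cells > self.total:
            return False                 # would exceed the budget: defer or reject
        self.used[session_id] = self.used.get(session_id, 0) + cells
        return True

    def release(self, session_id: str) -> None:
        self.used.pop(session_id, None)


def group_by_prefix(requests: dict[str, bytes], prefix_len: int) -> dict[bytes, list[str]]:
    """Group request ids whose first `prefix_len` bytes are identical."""
    groups: dict[bytes, list[str]] = defaultdict(list)
    for req_id, prompt in requests.items():
        groups[prompt[:prefix_len]].append(req_id)
    return dict(groups)
```

Each group returned by `group_by_prefix` would need only one prefill of the shared prefix, with the other members aliasing the result; the aliasing itself is not modeled here.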

The paper presents these mechanisms as going beyond the PagedAttention and RadixAttention schedulers in vLLM and SGLang, and as original to this work.

Empirical Evaluation and Numerical Results

Benchmarks on streaming financial OHLCV data demonstrate the architecture’s strengths. With the context expanding from 2.6K to 14.8K tokens:

  • Average query latency remains constant at $\sim$43 ms, independent of context size.
  • By contrast, vLLM, SGLang, TensorRT-LLM, and llama.cpp report 106ms, 139ms, 229ms, and 254ms latencies respectively, scaling linearly with context.
  • Cloud APIs (GPT-5.2, Claude Opus 4.5, etc.) register 926ms–3962ms latency per query, as full context must be transmitted per request.

Query accuracy is identical at 53.3% across the open-weight engines (Llama-3 8B-Instruct checkpoint), except for llama.cpp, which reaches 46.7%, probably because of implementation differences. Cloud API models, benefitting from larger model scales, report higher accuracies but are penalized by high and variable latency.

Latency variance analysis underlines further strengths: the proposed architecture’s standard deviation is just 15.6ms—substantially lower than all other engines and critical for real-time operations.

Key numerical claims:

  • $2.4\times$–$5.9\times$ speedup over best-in-class open-weight engines
  • $21\times$–$92\times$ speedup versus leading cloud APIs
  • Flash Queries and speculative exit reduce average latency to single-digit milliseconds for predictable workloads
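The speedup ranges can be cross-checked against the latencies reported above. This is a simple arithmetic sketch; the dictionary labels for the cloud endpoints are placeholders, and because the 43 ms baseline is itself rounded, the ratios only approximately match the quoted bounds.

```python
BASELINE_MS = 43.0   # reported average latency of the proposed architecture

reported_ms = {
    "vLLM": 106,
    "SGLang": 139,
    "TensorRT-LLM": 229,
    "llama.cpp": 254,
    "cloud API (fastest reported)": 926,
    "cloud API (slowest reported)": 3962,
}

# Speedup = competitor latency divided by the baseline latency.
speedups = {name: ms / BASELINE_MS for name, ms in reported_ms.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

The open-weight ratios fall between roughly 2.4 and 5.9, and the cloud ratios between roughly 21 and 92, consistent with the ranges quoted in the list.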

Practical and Theoretical Implications

This architectural shift has several notable implications:

  • Throughput/latency tradeoff frontier: By decoupling data from query planes and leveraging idle GPU capacity with Flash Queries, the architecture simultaneously minimizes query latency, maximizes hardware utilization, and enhances throughput—without modifying underlying model weights or kernels.
  • Scalability and multi-tenancy: Scheduling innovations allow tens of sessions (with unique, evolving contexts) to coexist per GPU, bounded only by on-device memory.
  • Applicability: Financial analysis, log/sensor monitoring, IoT analytics, and persistent-memory chat agents all directly benefit, provided contexts are append-only, queries are sporadic, and low latency is vital.
  • Limitation: Session state must fit in device memory, and attention scope is restricted to the sliding window; historical recall over fully evicted data is not supported natively, suggesting future integration of lossy summaries or long-term compression.

Relationship to Prior Work and Directions for AI Research

Prior work on subquadratic and sparse attention (Mamba (Gu et al., 2023), RWKV [peng2023rwkv], Longformer (Beltagy et al., 2020), BigBird (Zaheer et al., 2020)) achieves faster inference by reducing model expressivity; in contrast, this architecture retains full quadratic attention, ensuring transformer reasoning and retrieval capabilities are preserved within the active context window. While disaggregated prefill and prompt caching solutions (DistServe [zhong2024distserve], cloud APIs) offer incremental efficiency, they remain fundamentally request-driven and do not eliminate scaling with context size.

This work's session-based, regioned, background-ingestion architecture is orthogonal to hardware/kernel optimizations like FlashAttention [dao2022flashattention] and can be layered atop such implementations. In the context of large-scale deployments, the paradigm offers a blueprint for maximally leveraging expensive AI accelerators. The approach also complements, rather than replaces, recursive models (RLMs [zhang2025rlm]), which target arbitrarily long context handling at the cost of highly variable latency.

Future extensions: Integration of hierarchical memory—utilizing on-the-fly summaries for evicted context—could extend precise recall beyond window boundaries while retaining efficiency. Advancements in attention/prefix handling, probabilistic caching, and dynamic scheduling will further optimize multi-tenant deployment for both open-weight and proprietary models.

Conclusion

The session-oriented, data-driven architecture described here shifts the principal cost of streaming inference from queries to asynchronous ingestion, giving query latency that stays constant for a fixed query length as context accumulates. It changes the design space for inference in streaming, session-persistent contexts, with strong reported speedups and low latency variance. Flash Queries and speculative exit further narrow latency overhead, making transformers viable as real-time analytics engines. The approach neither sacrifices model capacity nor depends on pattern-dependent cache benefits, and later systems for real-time analytics and persistent-memory agents may build on this session model.
