- The paper introduces a novel stateful, session-based transformer inference architecture that decouples asynchronous data ingestion from query evaluation.
- It demonstrates constant query latency (~43ms) regardless of context size, achieving 2.4×–5.9× speedup over open-weight engines and 21×–92× over cloud APIs.
- The approach leverages persistent KV-cache, hierarchical context partitioning, and flash queries to enable scalable, low-latency streaming analytics.
Architectural Innovations for Stateful Streaming Inference
"Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers" (2605.13784) proposes an inference architecture tailored for streaming workloads, where new data streams in continually and user queries arrive sporadically. The core contribution is a stateful session model leveraging persistent KV-cache, which transforms transformer inference from a stateless, request-driven paradigm into a data-driven, session-oriented process. The result is true constant-time query latency, independent of accumulated context length.
Request-Driven vs. Data-Driven Processing
Conventional inference frameworks (vLLM, SGLang, TensorRT-LLM, llama.cpp) optimize for high cache hit rates on request-driven traffic, where queries often share static prefixes. Streaming workloads behave differently: the context changes with every new data point, so prefix caches miss frequently and provide little benefit. Such engines therefore recompute attention over the entire context for each query, and per-query latency grows as O(n) with context length n.
The proposed data-driven model inverts the paradigm: context is incrementally processed into a persistent KV-cache as data arrives, decoupled from user queries. Queries, when issued, act as lightweight consumers of precomputed attention states, with a runtime that scales with query length rather than total context. The computational cost per query becomes O(∣q∣), which is effectively constant for a fixed query length regardless of how much historical data was previously ingested.
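The contrast between the two paradigms can be illustrated with a toy cost model. This is a minimal sketch that only counts tokens needing a forward pass, not a real inference engine; the class `StreamingSession` and the function `request_driven_query` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class StreamingSession:
    """Toy stand-in for a stateful session; counts tokens instead of running a model."""
    cached_tokens: int = 0   # tokens whose KV entries are already computed
    ingest_work: int = 0     # tokens processed off the query path

    def ingest(self, n_new_tokens: int) -> None:
        # Data plane: only the newly arrived delta is processed.
        self.cached_tokens += n_new_tokens
        self.ingest_work += n_new_tokens

    def query(self, query_tokens: int) -> int:
        # Query plane: only the query tokens need a forward pass.
        return query_tokens


def request_driven_query(context_tokens: int, query_tokens: int) -> int:
    # Stateless baseline without a prefix-cache hit: the context is re-prefilled.
    return context_tokens + query_tokens


session = StreamingSession()
for _ in range(100):          # 100 data arrivals of 16 tokens each
    session.ingest(16)

on_query_path_stateful = session.query(32)
on_query_path_stateless = request_driven_query(session.cached_tokens, 32)
```

In this sketch the stateful query path handles 32 tokens while the stateless baseline handles 1632. The ingestion work has not disappeared; it has moved off the latency-critical path.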
Figure 1: Performance comparison across nine systems on streaming OHLCV data (155–925 samples). The proposed architecture maintains constant ∼43 ms query latency regardless of context size, while conventional engines show context-proportional increases.
This foundational distinction enables the speedup seen in streaming workloads—where each new data element is processed asynchronously, amortizing the prefill cost and offloading it from the latency-critical query path. The architecture additionally separates data and query planes, maximizing throughput and minimizing head-of-line blocking on GPU resources.
Hierarchical Context Partitioning and Persistent State
The architecture introduces a three-region context model:
- Region 0 (Frozen): Static prompt/instructions, processed once at session initialization.
- Region 1 (Sliding): Buffer for streaming data with bounded retention and FIFO eviction, supporting append-only or sliding window semantics.
- Region 2 (Ephemeral): Query and response tokens, cleared between queries to allow repeated interaction against persistent data.
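The three regions above can be sketched as a small data structure. This is an illustrative model that tracks token IDs only; the class `RegionedContext` and its method names are assumptions made here, and how a real implementation handles KV tensors and position indices after eviction is not covered.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RegionedContext:
    frozen: tuple          # Region 0: fixed at session initialization
    max_sliding: int       # retention bound for Region 1
    sliding: deque = field(default_factory=deque)   # Region 1: streaming data
    ephemeral: list = field(default_factory=list)   # Region 2: query/response

    def append_data(self, tokens):
        """Append streaming tokens; evict the oldest (FIFO) beyond the bound."""
        evicted = []
        for token in tokens:
            self.sliding.append(token)
            if len(self.sliding) > self.max_sliding:
                evicted.append(self.sliding.popleft())
        return evicted

    def begin_query(self, query_tokens):
        self.ephemeral = list(query_tokens)

    def end_query(self):
        # Clearing Region 2 leaves Regions 0 and 1 untouched for the next query.
        self.ephemeral.clear()

    def visible_tokens(self):
        return list(self.frozen) + list(self.sliding) + self.ephemeral


ctx = RegionedContext(frozen=(1, 2), max_sliding=3)
dropped = ctx.append_data([10, 11, 12, 13])   # the fourth append evicts token 10
ctx.begin_query([99])
during_query = ctx.visible_tokens()
ctx.end_query()
after_query = ctx.visible_tokens()
```

A plain `deque` with an explicit length check is used rather than `deque(maxlen=...)` so that evicted tokens can be reported to the caller.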
Persistent session management is indispensable. Each session maintains intermediate representations for its context, including precise buffer metadata and position tracking, both of which are essential for incremental updates and low-latency query serving. Memory consumption is proportional to the active tokens per session: for Llama-3 8B (32 layers, 4096 hidden, 32K tokens), the per-session KV-cache requires approximately 17 GB in FP16.
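The memory figure can be reproduced with a short calculation. This sketch assumes that keys and values are each stored at the full hidden width of 4096 per layer and that 32K means 32 × 1024 tokens; the function name `kv_cache_bytes` is introduced here for illustration.

```python
def kv_cache_bytes(layers: int, kv_width: int, tokens: int, bytes_per_value: int = 2) -> int:
    # Factor 2: one key vector and one value vector per token per layer.
    # bytes_per_value=2 corresponds to FP16.
    return 2 * layers * kv_width * tokens * bytes_per_value


total = kv_cache_bytes(layers=32, kv_width=4096, tokens=32 * 1024)
gigabytes = total / 1e9      # decimal GB
gibibytes = total / 2**30    # binary GiB
```

Under these assumptions the total is about 17.2 GB (exactly 16 GiB). Models that use grouped-query attention store narrower key and value vectors, which would lower the figure proportionally.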
This partitioned context model is a fundamental enabler of constant-latency inference; session state is always ready for instantaneous query evaluation, as only new deltas are processed upon arrival and the ephemeral region is reused for query/response execution.
Flash Queries: Pushing Latency Toward Zero
Flash Queries extend the core by precomputing answers to user-registered, template queries during idle GPU cycles between data arrivals. Since many streaming workloads (e.g., financial dashboards) repeatedly pose similar questions as data evolves, this speculative execution model ensures frequent query types are served directly from a cache, requiring no new forward passes at inference time. Flash Query cache hits incur only network round-trip latency (∼2ms end-to-end), and the architecture includes a speculative exit for fast-answer queries when high-confidence logits are observed in the "ready" position.
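A minimal sketch of the caching idea follows, assuming answers are keyed by the registered template and invalidated whenever new data arrives. The class `FlashQueryCache`, the `run_query` callback, and the `budget` parameter are hypothetical; the source does not specify the interface.

```python
from typing import Callable, Optional


class FlashQueryCache:
    def __init__(self, run_query: Callable[[str, int], str]):
        # run_query is a hypothetical stand-in for a model forward pass:
        # (template, data_version) -> answer.
        self._run_query = run_query
        self._templates = []
        self._answers = {}        # template -> (data_version, answer)
        self.data_version = 0

    def register(self, template: str) -> None:
        if template not in self._templates:
            self._templates.append(template)

    def on_data_arrival(self, budget: int) -> None:
        """New data makes old answers stale; recompute up to `budget` templates."""
        self.data_version += 1
        for template in self._templates[:budget]:
            answer = self._run_query(template, self.data_version)
            self._answers[template] = (self.data_version, answer)

    def lookup(self, template: str) -> Optional[str]:
        entry = self._answers.get(template)
        if entry is None or entry[0] != self.data_version:
            return None           # miss or stale: fall back to a regular query
        return entry[1]


def fake_model(template: str, version: int) -> str:
    return f"{template}@{version}"


cache = FlashQueryCache(fake_model)
cache.register("trend?")
cache.register("volatility?")
cache.on_data_arrival(budget=1)   # only the first template fits in this cycle
```

After this cycle, `cache.lookup("trend?")` is served from the cache while `cache.lookup("volatility?")` returns `None`, so the caller would run it as an ordinary query.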
The maximum number of Flash Queries per data update is bounded by the ratio of the data arrival interval to forward pass cost; e.g., with T_data = 1 s and T_f = 33 ms, up to ∼25 queries can be precomputed per cycle. This arrangement transforms accelerators into always-productive platforms, reclaiming the otherwise-idle time between streaming data ingest events.
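The bound can be written as a one-line calculation. The raw ratio of 1 s to 33 ms is about 30, somewhat above the ∼25 quoted; the gap is not explained here, so the `reserved_ms` parameter below is a hypothetical knob for time in each interval spent on other work, such as ingesting the new data. It is an assumption of this sketch, not something the source states.

```python
import math


def flash_query_budget(t_data_ms: float, t_forward_ms: float, reserved_ms: float = 0.0) -> int:
    """Upper bound on precomputed queries per data arrival interval."""
    usable_ms = max(0.0, t_data_ms - reserved_ms)
    return math.floor(usable_ms / t_forward_ms)


raw_bound = flash_query_budget(1000, 33)
# 175 ms is an illustrative value only, chosen to show the effect of reserved time.
bound_with_reserve = flash_query_budget(1000, 33, reserved_ms=175)
```

With no reserved time the bound is 30; reserving 175 ms per interval lowers it to 25.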
Multi-Tenant Scheduling and Batching
A priority-scheduled GPU decoding pipeline allows dozens of active stateful sessions to share a device with stateless API traffic. The scheduler enforces:
- Cell-budget admission: Ensures unified KV-cache occupancy does not exceed hardware upper bounds, avoiding memory spikes during concurrent loads.
- Adaptive chunked prefill: Dynamically sizes batching granularity per active session, balancing maximal throughput and minimal query latency; avoids starvation in heavily multi-tenant environments.
- Prefix-aware grouped prefill: Identifies and consolidates requests with byte-identical prefixes to a single forward computation with metadata-only aliasing.
- Speculative decoding with concurrency cap: Per-slot speculation is capped based on acceptance rates and active session count, delivering gains for repetitive workloads without throughput collapse under concurrent load.
The paper presents these mechanisms as original contributions that go beyond the scheduling built around PagedAttention in vLLM and RadixAttention in SGLang.
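Of the scheduler mechanisms listed, cell-budget admission is the simplest to sketch. The version below assumes a "cell" is one cached token slot in a unified KV-cache and that each request declares its worst-case cell need up front; the class `CellBudget` and its methods are hypothetical and not taken from the source.

```python
class CellBudget:
    """Tracks reserved KV-cache cells against a fixed hardware capacity."""

    def __init__(self, capacity_cells: int):
        self.capacity = capacity_cells
        self.reserved = {}   # request_id -> cells reserved

    @property
    def used(self) -> int:
        return sum(self.reserved.values())

    def try_admit(self, request_id: str, cells_needed: int) -> bool:
        if request_id in self.reserved:
            raise ValueError(f"{request_id} is already admitted")
        if self.used + cells_needed > self.capacity:
            return False     # caller queues or rejects the request
        self.reserved[request_id] = cells_needed
        return True

    def release(self, request_id: str) -> None:
        self.reserved.pop(request_id, None)


budget = CellBudget(capacity_cells=100)
first_admitted = budget.try_admit("session-a", 60)
second_admitted = budget.try_admit("session-b", 50)   # 60 + 50 exceeds 100
budget.release("session-a")
retry_admitted = budget.try_admit("session-b", 50)
```

Reserving the worst case at admission time trades some utilization for the guarantee that concurrent requests cannot push occupancy past the limit mid-flight.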
Empirical Evaluation and Numerical Results
Benchmarks on streaming financial OHLCV data demonstrate the architecture’s strengths. With the context expanding from 2.6K to 14.8K tokens:
- Average query latency remains constant at ∼43ms, independent of context size.
- By contrast, vLLM, SGLang, TensorRT-LLM, and llama.cpp report 106ms, 139ms, 229ms, and 254ms latencies respectively, scaling linearly with context.
- Cloud APIs (GPT-5.2, Claude Opus 4.5, etc.) register 926ms–3962ms latency per query, as full context must be transmitted per request.
Query accuracy is identical at 53.3% across the open-weight engines running the Llama-3 8B-Instruct checkpoint, with the exception of llama.cpp at 46.7%, probably due to implementation differences. Cloud API models, benefiting from larger model scales, report higher accuracies but are penalized by high and variable latency.
Latency variance analysis underlines further strengths: the proposed architecture’s standard deviation is just 15.6ms—substantially lower than all other engines and critical for real-time operations.
Key numerical claims:
- 2.4×–5.9× speedup over best-in-class open-weight engines
- 21×–92× speedup versus leading cloud APIs
- Flash Queries and speculative exit reduce average latency to single-digit milliseconds for predictable workloads
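The speedup ranges can be cross-checked against the latencies reported above. This is a simple arithmetic sketch using the rounded 43 ms figure, so the ratios may differ slightly from values derived from unrounded measurements.

```python
PROPOSED_MS = 43

# Latencies as reported in the evaluation, in milliseconds.
open_weight_ms = {"vLLM": 106, "SGLang": 139, "TensorRT-LLM": 229, "llama.cpp": 254}
cloud_ms_range = (926, 3962)


def speedup(baseline_ms: float, proposed_ms: float = PROPOSED_MS) -> float:
    return baseline_ms / proposed_ms


open_weight_speedups = {name: speedup(ms) for name, ms in open_weight_ms.items()}
cloud_speedups = tuple(speedup(ms) for ms in cloud_ms_range)

for name, ratio in open_weight_speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The open-weight ratios span roughly 2.5 to 5.9 and the cloud ratios roughly 21.5 to 92.1, in line with the quoted ranges once rounding of the 43 ms figure is taken into account.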
Practical and Theoretical Implications
This architectural shift has several notable implications:
- Throughput/latency tradeoff frontier: By decoupling data from query planes and leveraging idle GPU capacity with Flash Queries, the architecture simultaneously minimizes query latency, maximizes hardware utilization, and enhances throughput—without modifying underlying model weights or kernels.
- Scalability and multi-tenancy: Scheduling innovations allow tens of sessions (with unique, evolving contexts) to coexist per GPU, bounded only by on-device memory.
- Applicability: Financial analysis, log/sensor monitoring, IoT analytics, and persistent-memory chat agents all directly benefit, provided contexts are append-only, queries are sporadic, and low latency is vital.
- Limitation: Session state must fit in device memory, and attention scope is restricted to the sliding window; historical recall over fully evicted data is not supported natively, suggesting future integration of lossy summaries or long-term compression.
Relationship to Prior Work and Directions for AI Research
Prior work on subquadratic and sparse attention (Mamba (Gu et al., 2023), RWKV (Peng et al., 2023), Longformer (Beltagy et al., 2020), BigBird (Zaheer et al., 2020)) achieves faster inference by reducing model expressivity; in contrast, this architecture retains full quadratic attention, ensuring transformer reasoning and retrieval capabilities are preserved within the active context window. While disaggregated prefill and prompt caching solutions (DistServe (Zhong et al., 2024), cloud APIs) offer incremental efficiency, they remain fundamentally request-driven and do not eliminate scaling with context size.
This work's session-based, regioned, background-ingestion architecture is orthogonal to hardware/kernel optimizations like FlashAttention (Dao et al., 2022) and can be layered atop such implementations. In the context of large-scale deployments, the paradigm offers a blueprint for maximally leveraging expensive AI accelerators. The approach also complements, rather than replaces, recursive models (RLMs; Zhang et al., 2025), which target arbitrarily long context handling at the cost of highly variable latency.
Future extensions: Integration of hierarchical memory—utilizing on-the-fly summaries for evicted context—could extend precise recall beyond window boundaries while retaining efficiency. Advancements in attention/prefix handling, probabilistic caching, and dynamic scheduling will further optimize multi-tenant deployment for both open-weight and proprietary models.
Conclusion
The session-oriented, data-driven architecture described here shifts the principal cost of streaming inference from queries to asynchronous ingestion, yielding query latency that stays constant as context grows in streaming workloads. In the reported benchmarks this translates into a 2.4×–5.9× speedup over open-weight engines with low latency variance. Flash Queries and speculative exit further narrow latency overhead, making transformers viable as real-time analytics engines. The approach neither sacrifices model capacity nor depends on pattern-dependent cache hits, and systems built on this session model may shape future real-time analytics and persistent-memory agents.