
Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference (2509.09505v2)

Published 11 Sep 2025 in cs.AR

Abstract: LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. These agentic LLM inference tasks are fundamentally different from chatbot-focused inference -- they often have much larger context lengths to capture complex, prolonged inputs, such as entire webpage DOMs or complicated tool call trajectories. This, in turn, generates significant off-chip memory traffic for the underlying hardware at the inference stage and causes the workload to be constrained by two memory walls, namely the bandwidth and capacity memory walls, preventing the on-chip compute units from achieving high utilization. In this paper, we introduce PLENA, a hardware-software co-designed system that applies three core optimization pathways to tackle these challenges. PLENA includes an efficient hardware implementation of compute and memory units supporting an asymmetric quantization scheme. PLENA also features a novel flattened systolic array architecture that has native support for FlashAttention to tackle these memory walls in the scenario of inference serving for long-context LLMs. Additionally, PLENA is developed with a complete stack, including a custom ISA, a compiler, a cycle-emulated simulator, and an automated design space exploration flow. The simulated results show that PLENA achieves up to 8.5x higher utilization than existing accelerators, and delivers 2.24x higher throughput than the A100 GPU and 3.85x higher throughput than the TPU v6e, under the same multiplier count and memory settings. The full PLENA system will also be open-sourced.

Summary

  • The paper introduces PLENA, a co-designed hardware-software system that improves long-context large language model inference by overcoming memory bandwidth and capacity walls.
  • It employs a flattened systolic array and an asymmetric quantization scheme (MXINT4 weights, MXINT8 activations, and an MXINT4 KV cache) to enhance resource utilization, achieving up to 8.5× higher utilization than existing accelerators.
  • Native support for FlashAttention minimizes off-chip memory I/O, optimizing performance for agentic workloads that consume substantially more tokens.

Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference

The paper "Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference" introduces PLENA, a co-designed hardware-software system to address the computational challenges associated with long-context LLM inference. This section provides a comprehensive overview of the key methodologies and implications outlined in the research.

Introduction

To optimize the efficiency of LLMs designed for agentic tasks, PLENA incorporates three primary pathways: a flattened systolic array architecture, an asymmetric quantization scheme, and native support for FlashAttention. Each of these approaches is targeted at mitigating the so-called memory bandwidth and capacity walls that hinder contemporary hardware from fully utilizing computation units during LLM inference.

The paper highlights the distinctive demands of agentic LLM workloads, which require much larger context lengths than traditional chatbot workloads and consequently consume significantly more tokens (Figure 1).

Figure 1: Compared with standard chatbot workloads, the selected agentic web and code tasks generally consume over 100× more tokens.

System Architecture and Optimization Pathways

Flattened Systolic Array

PLENA implements a flattened systolic array to better accommodate the uneven matrix shapes found in long-context LLM inference tasks. This architecture improves the utilization of the GEMM units across both the prefilling and decoding stages of agentic inference. By shaping the array to handle large inner dimensions (K), PLENA achieves higher compute efficiency than conventional square-shaped systolic arrays (Figure 2).

Figure 2: PLENA achieves higher utilization than the standard square systolic array (same resources).
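
To make the shape argument concrete, the toy model below estimates how much of a weight-stationary systolic array stays busy when the inner dimension K is much larger than the output width N, as in decode-stage GEMMs. The array shapes, GEMM sizes, and padding-only utilization model are illustrative assumptions, not PLENA's published configuration or cost model.

```python
import math

def tiling_utilization(M, K, N, rows, cols):
    """Rough utilization of a weight-stationary rows x cols systolic array on an
    (M x K) @ (K x N) GEMM: K is assumed to map onto rows and N onto columns, so
    partially filled edge tiles leave PEs idle. Padding-only model, for illustration."""
    useful_macs = M * K * N
    padded_macs = M * (math.ceil(K / rows) * rows) * (math.ceil(N / cols) * cols)
    return useful_macs / padded_macs

# Same multiplier count (16,384 PEs), two array shapes (hypothetical sizes).
square = (128, 128)       # conventional square array
flattened = (1024, 16)    # tall, flattened array favouring a large inner dimension K

# Decode-stage-like GEMM: one token row, very large K, few output columns.
M, K, N = 1, 8192, 48
print("square array utilization:   ", round(tiling_utilization(M, K, N, *square), 3))
print("flattened array utilization:", round(tiling_utilization(M, K, N, *flattened), 3))
```

With the same multiplier budget, the flattened shape wastes far fewer multipliers on partially filled edge tiles when K dominates, which is the regime long-context agentic inference operates in.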

Asymmetric Quantization Scheme

PLENA's asymmetric quantization strategy allows different numerical formats and precisions to be applied to the weights, activations, and KV cache. This flexibility significantly alleviates the constraints imposed by the memory bandwidth and capacity walls. For example, an aggressive setting such as MXINT4 weights, MXINT8 activations, and an MXINT4 KV cache enables larger batch sizes and lower overall latency (Figure 3).

Figure 3: A typical setting of the MX data formats in this design. A scale is shared by a group of elements; the scale is quantized to a power of two, and the elements can be quantized to integer or minifloat formats.
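
The sketch below illustrates the MX-style group quantization described in Figure 3: each group of elements shares a power-of-two scale, and the elements themselves are rounded to a narrow integer. The group size of 32, the rounding rule, and the scale-selection heuristic are assumptions for illustration, not PLENA's exact specification.

```python
import numpy as np

def quantize_mx(x, group_size=32, elem_bits=4):
    """MX-style quantization: each group of `group_size` elements shares one
    power-of-two scale; elements are rounded to a signed `elem_bits`-bit integer.
    Group size and scale-selection rule are illustrative, not PLENA's exact spec."""
    qmax = 2 ** (elem_bits - 1) - 1                      # 7 for an INT4 element
    groups = x.reshape(-1, group_size)
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    max_abs = np.where(max_abs == 0, 1.0, max_abs)       # avoid log2(0)
    scale = 2.0 ** np.ceil(np.log2(max_abs / qmax))      # shared power-of-two scale
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_mx(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(8, 32).astype(np.float32)            # e.g. a slice of the KV cache
q, s = quantize_mx(x.reshape(-1), group_size=32, elem_bits=4)   # "KV cache in MXINT4"
x_hat = dequantize_mx(q, s).reshape(x.shape)
print("max reconstruction error:", float(np.abs(x - x_hat).max()))
```

Because the shared scale is a power of two, rescaling reduces to an exponent shift in hardware, and holding the KV cache as 4-bit elements directly relieves the capacity wall that limits achievable batch size.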

FlashAttention Support

PLENA supports FlashAttention natively, overcoming limitations in conventional architectures that require costly off-chip memory I/O for attention computation. This integration prevents excessive data movement, thereby improving inference performance, especially at longer context lengths where attention layers dominate the computational flow (Figure 4).

Figure 4: PLENA architecture overview. Execution is controlled by the decoder's system-pipeline controller, which derives control signals from decoded instructions and monitors memory dependencies. For example, if the current instruction needs to read from a Vector SRAM row that is still being updated by the vector or matrix unit, the controller inserts a stall to ensure correctness. Vector SRAM acts as the on-chip scratchpad, providing data to the matrix and vector units and accepting their results.
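
For reference, the NumPy sketch below shows the tiled, online-softmax formulation at the core of FlashAttention: attention is accumulated one KV tile at a time, so the full sequence-by-sequence score matrix never needs to round-trip through off-chip memory. This is a minimal software illustration of the algorithm, not PLENA's hardware datapath; the tile size and tensor shapes are arbitrary.

```python
import numpy as np

def flash_attention(q, k, v, block=64):
    """Single-head attention computed one KV tile at a time with an online
    softmax, so the full (Lq x Lk) score matrix is never materialised.
    Minimal reference sketch of the FlashAttention idea, not PLENA's datapath."""
    Lq, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((Lq, d))
    row_max = np.full(Lq, -np.inf)        # running max of scores per query row
    row_sum = np.zeros(Lq)                # running softmax denominator per row

    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]   # one KV tile
        s = (q @ kb.T) * scale                                    # partial scores
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)                    # rescale old stats
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against naive attention.
rng = np.random.default_rng(0)
q = rng.standard_normal((128, 16))
k = rng.standard_normal((200, 16))
v = rng.standard_normal((200, 16))
scores = q @ k.T / np.sqrt(16)
p = np.exp(scores - scores.max(axis=1, keepdims=True))
naive = (p / p.sum(axis=1, keepdims=True)) @ v
print("matches naive attention:", np.allclose(flash_attention(q, k, v), naive))
```

The only per-row state carried between tiles is a running maximum and a running softmax denominator, which is what makes the computation amenable to on-chip streaming.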

Implementation and Evaluation

The proposed PLENA system is evaluated through cycle-emulated simulation and compared against leading accelerators and commercial GPUs/TPUs. Under the same multiplier count and memory settings, PLENA achieves up to 8.5× higher utilization than existing accelerators, 2.24× higher throughput than the A100 GPU, and 3.85× higher throughput than the TPU v6e, confirming its effectiveness in long-context scenarios (Figure 5).

Figure 5: Asymmetric-precision datapath example. Vector SRAM stores FP4 values, whereas Matrix SRAM stores MX-INT4 values. Green paths denote the selective rotational quantization flow: a fast Walsh–Hadamard transform is applied, with its inverse used to map back.
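
The green rotational-quantization path in Figure 5 can be illustrated with a small NumPy experiment: rotating a tensor with a fast Walsh–Hadamard transform spreads outlier channels across all coefficients, low-bit quantization is applied in the rotated domain, and the self-inverse orthonormal transform maps the result back. The per-row symmetric INT4 fake quantizer, the outlier pattern, and the normalization below are illustrative assumptions rather than PLENA's exact datapath.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be a power
    of two), normalised so that applying it twice recovers the original tensor."""
    y = np.array(x, dtype=np.float64, copy=True)
    n = y.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = y[..., i:i + h].copy()
            b = y[..., i + h:i + 2 * h].copy()
            y[..., i:i + h] = a + b
            y[..., i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(n)

def fake_quant_int4(x):
    """Symmetric per-row INT4 fake quantization (quantize, then dequantize)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64)) * 0.1
w[:, 3] += 5.0                                  # an outlier channel that hurts plain INT4

plain_err = np.abs(fake_quant_int4(w) - w).mean()
# Rotate, quantize in the rotated domain, then rotate back (FWHT is self-inverse here).
rotated_err = np.abs(fwht(fake_quant_int4(fwht(w))) - w).mean()
print(f"plain INT4 error:   {plain_err:.4f}")
print(f"rotated INT4 error: {rotated_err:.4f}")
```

In this toy run the rotated-domain error comes out noticeably lower than plain INT4, which is why a Hadamard rotation, built from additions and subtractions only, is an attractive companion to aggressive quantization.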

Conclusion

The PLENA system effectively addresses the underutilization challenges posed by memory walls in contemporary hardware systems used for LLM inference. Through strategic hardware and software optimization pathways including flattened systolic arrays, asymmetric quantization, and native FlashAttention support, PLENA improves performance metrics significantly compared to existing solutions.

With these advances, PLENA paves the way for more efficient inference of agentic LLM workloads, promising substantial improvements in areas requiring extensive context comprehension and processing. Future developments could focus on further enhancing GEMM utilization within FlashAttention and extending PLENA's capabilities with multi-core architectures to better exploit parallelism.
