PLENA System: LLM Inference Accelerator
- PLENA is a hardware–software co-designed accelerator architecture that overcomes memory walls in long-context LLM inference by integrating a flattened systolic array, asymmetric quantization, and native FlashAttention support.
- It employs three integrated optimization pathways that maximize arithmetic unit utilization, reduce memory and bandwidth pressure, and execute attention operations fully on-chip via custom ISA instructions.
- The full-stack platform, from RTL hardware to automated design space exploration, demonstrates up to 8.5× higher arithmetic utilization than a conventional square systolic array and significantly higher throughput than GPU/TPU baselines.
The PLENA system is a hardware–software co-designed accelerator architecture specifically engineered to address memory bandwidth and capacity constraints—referred to as “memory walls”—in the context of serving long-context, agentic LLM inference. PLENA achieves substantial improvements in utilization and throughput compared to existing accelerator platforms, leveraging three integrated optimization pathways: a flattened systolic array compute fabric, support for asymmetric quantization across tensors, and native implementation of the FlashAttention algorithm via custom ISA instructions. The system encompasses the full stack from RTL hardware and custom ISA to compiler, cycle-accurate simulation, and automated design space exploration, and is scheduled for open-source release. Below, key architectural and methodological features are detailed.
1. Optimization Pathways for Long-Context LLM Inference
PLENA targets fundamental computation bottlenecks that arise in transformer inference over long input sequences. Long-context agentic LLM workloads have a large memory footprint due to sizable key-value (KV) caches and extensive attention maps, resulting in operational regimes where the outer matrix dimension (M, the number of output rows in an M×K by K×N GEMM) is small relative to the inner dimensions K and N, a scenario in which conventional square systolic compute arrays are highly underutilized.
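To make this underutilization concrete, the back-of-the-envelope sketch below (not taken from the paper; all array sizes, mappings, and GEMM shapes are illustrative assumptions) compares the fraction of busy processing elements on a single square array versus the same PE budget arranged as a row of short sub-arrays when M is tiny.

```python
# Back-of-the-envelope occupancy model (illustrative assumptions only):
# with an output-stationary mapping, a GEMM whose output has M rows keeps
# only min(M, array_height) rows of processing elements busy per pass.

def square_array_occupancy(M, dim=128):
    """Single dim x dim square array: rows of PEs sit idle whenever M < dim."""
    return min(M, dim) / dim

def flattened_array_occupancy(M, sub_dim=16):
    """Same PE budget arranged as a row of sub_dim-tall sub-arrays,
    each working on a different slice of the reduction dimension."""
    return min(M, sub_dim) / sub_dim

# Decode-phase style "fat" GEMM: very few output rows.
M = 4
print(f"square 128x128 array       : {square_array_occupancy(M):.1%}")
print(f"flattened row of 16x16 subs: {flattened_array_occupancy(M):.1%}")
```

Under these assumed sizes the flattened arrangement keeps roughly an order of magnitude more multipliers busy, which is the effect the flattened-array pathway is designed to exploit.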
The three core optimization pathways in PLENA are:
- Flattened Systolic Array Architecture: Rather than a conventional square array, PLENA arranges multiple small square systolic sub-arrays in a linear row, forming a “flattened” array. This configuration maximizes arithmetic unit utilization for “fat” GEMMs (M ≪ K, N), which dominate long-sequence transformer inference because memory constraints keep the outer dimension small. The array’s width and composition are configurable to match the empirical operational bottlenecks imposed by high-bandwidth memory (HBM) limitations.
- Asymmetric Quantization Scheme: PLENA differentially quantizes tensors to reduce memory and bandwidth pressure. Weights and KV cache are quantized to low-precision integer representations (e.g., MXINT4), exploiting their relative insensitivity to quantization-induced degradation. Activations (and other sensitive intermediates) retain higher precision (e.g., MXFP), stored in on-chip vector SRAM. This “mixed precision” lowers the overall memory footprint and off-chip traffic, effectively relaxing both bandwidth and capacity walls.
- Native FlashAttention Support via Custom ISA: Standard attention implementations incur substantial off-chip I/O for the intermediate attention matrix. PLENA introduces specialized instructions and data paths that execute the core GEMM–softmax–GEMM sequence of FlashAttention fully on-chip, eliminating intermediate off-chip memory transfers. This is achieved via a fused, pipelined dataflow native to the flattened systolic architecture, invoked through bespoke ISA primitives (a functional sketch of this fused dataflow appears after this list).
These optimizations are co-designed to address the “fat” GEMM shapes and batch-size limitations endemic to long-context LLM workloads.
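The sketch below illustrates what the fused GEMM–softmax–GEMM path computes: tiled attention with an online softmax, the algorithmic pattern of FlashAttention that PLENA keeps on-chip. It is a functional numpy model only; the tile size, shapes, and variable names are assumptions and do not reflect PLENA's ISA or datapath.

```python
import numpy as np

def flash_attention_tiled(Q, K, V, tile=64):
    """Tiled attention with an online (numerically stable) softmax.
    Shapes: Q (M, d), K (N, d), V (N, d). Tile size is an illustrative choice."""
    M, d = Q.shape
    N = K.shape[0]
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((M, d))
    row_max = np.full(M, -np.inf)   # running per-row max of attention scores
    row_sum = np.zeros(M)           # running softmax denominator

    for start in range(0, N, tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        S = (Q @ Kt.T) * scale                      # GEMM 1, one on-chip tile
        new_max = np.maximum(row_max, S.max(axis=1))
        P = np.exp(S - new_max[:, None])            # row-wise softmax numerator
        correction = np.exp(row_max - new_max)      # rescale previous partials
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vt    # GEMM 2, one on-chip tile
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against a dense reference computation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((s, 64)) for s in (8, 256, 256))
S_full = (Q @ K.T) / np.sqrt(64)
P_full = np.exp(S_full - S_full.max(axis=1, keepdims=True))
ref = (P_full / P_full.sum(axis=1, keepdims=True)) @ V
assert np.allclose(flash_attention_tiled(Q, K, V), ref)
```

Because each tile of S and P lives only inside the loop body, no intermediate attention matrix ever needs to be written back to external memory, which is the property the custom ISA path preserves in hardware.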
2. Hardware Architecture and Memory System
PLENA’s hardware system is organized into three distinct compute units, a hierarchical memory subsystem, and quantization-aware data flows:
| Component | Description | Purpose |
|---|---|---|
| Matrix Unit | Flattened systolic array for GEMM/attention-intensive compute | Matrix multiplications |
| Vector Unit | Handles elementwise ops, reductions, row-wise softmax (FlashAttention) | Per-layer transformations, reductions |
| Scalar Unit | Integer and floating-point scalar computation, address logic | Address generation, non-linear ops |
- On-chip SRAM:
- Matrix SRAM: Holds weights and KV cache, supports “transpose-on-read” for efficient, tiling-friendly data access in attention modules.
- Vector SRAM: Scratchpad for high-precision activations, minimizes latency and capacity-induced stalls.
- HBM Interface: The on-chip SRAMs are tightly coupled to external high-bandwidth memory; access patterns are optimized using variable-length transfers and strided access, minimizing bandwidth contention and balancing latency.
- Quantization in Hardware: The computational datapaths natively process different numeric formats, with configurable conversion and Hadamard-based rotation (and its inverse) applied selectively to minimize quantization error where it could impact model accuracy.
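As a rough functional illustration of this quantization-aware datapath, the sketch below quantizes a weight vector block-wise to signed 4-bit integers with a shared per-block scale (in the spirit of MXINT4-style microscaling) and shows an optional Hadamard rotation with its exact inverse. The block size, rotation width, and scale rule are illustrative assumptions, not PLENA's exact formats.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)   # orthonormal, so the rotation is exactly invertible

def quantize_block_int4(x, block=32):
    """Block-wise signed INT4 quantization with one shared scale per block,
    loosely mirroring a microscaling (MX-style) integer format (illustrative)."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0   # INT4 range: -8..7
    scale[scale == 0] = 1.0
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q * scale).reshape(-1)

# Weights/KV cache: rotate, then quantize to 4-bit ints; activations would
# stay in a higher-precision format and bypass this path.
rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
H = hadamard(32)
w_rot = (w.reshape(-1, 32) @ H.T).reshape(-1)            # rotation smooths outliers
q, s = quantize_block_int4(w_rot)
w_hat = (dequantize(q, s).reshape(-1, 32) @ H).reshape(-1)  # inverse rotation
print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```

The orthonormal rotation and its inverse bracket the low-precision storage, which mirrors the selectively applied Hadamard-based rotation described above.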
3. Flattened Systolic Array and GEMM Design
The matrix multiplications required for LLM inference become highly nonuniform under memory-constrained serving: for a GEMM C = A·B with A of shape M×K and B of shape K×N, the outer dimension M is small while the inner dimensions K and N are large.
PLENA optimizes this operation by:
- Breaking the reduction dimension K into segments of a tunable block length, and computing each segment with a row-aligned sub-array in the flattened array.
- Connecting outputs of these sub-arrays to an adder tree, fusing the partial results.
- Supporting on-chip data transposes, crucial for FlashAttention tiling.
This arrangement maintains near-100% multiplier utilization, especially as M decreases under long-context constraints, and provides a dataflow optimized for on-chip computation and minimal HBM latency.
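A functional model of this dataflow is sketched below: the reduction dimension is split into K-blocks, each block's partial product stands in for the work of one sub-array in the flattened row, and the partial results are fused by a pairwise adder tree. The block length and GEMM shape are illustrative assumptions.

```python
import numpy as np

def flattened_array_gemm(A, B, block_k=128):
    """Functional model of the flattened-array dataflow: split the reduction
    dimension K into blocks, compute each block's partial product (one block
    per sub-array), then fuse the partials with an adder tree."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    # Each sub-array produces a partial M x N result over one K-block.
    partials = [A[:, k:k + block_k] @ B[k:k + block_k, :]
                for k in range(0, K, block_k)]
    # Adder tree: pairwise reduction, log2(#sub-arrays) levels deep.
    while len(partials) > 1:
        partials = [partials[i] + partials[i + 1] if i + 1 < len(partials)
                    else partials[i]
                    for i in range(0, len(partials), 2)]
    return partials[0]

# "Fat" GEMM shape typical of long-context decode: tiny M, large K and N.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4096))
B = rng.standard_normal((4096, 512))
assert np.allclose(flattened_array_gemm(A, B), A @ B)
```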
4. Software Stack, Simulation, and Design Space Exploration
PLENA comprises a comprehensive hardware–software stack:
- Custom Instruction Set Architecture (ISA): Encodes matrix, vector, scalar, memory, and control primitives, including dedicated instructions for FlashAttention pipeline activation.
- Compiler: Accepts exported models (e.g., ONNX from PyTorch), applies graph-level pattern matching, operator fusion, and tile scheduling, and lowers the result to the PLENA ISA. The compiler also handles memory allocation and data-placement constraints to maximize data reuse.
- Cycle-Accurate Simulator: Implemented in Rust, simulates detailed timing, SRAM/HBM access, and pipeline behaviors for accurate throughput and latency validation, prior to RTL synthesis.
- Automated Design Space Exploration (DSE): Employs multi-objective Bayesian optimization across hardware parameters (e.g., block sizes, memory partitioning) and quantization settings, optimizing for perplexity, chip area, and inference latency. Hardware feasibility constraints prune invalid configurations, and the resulting trade-off surface is visualized via empirical attainment surfaces.
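The shape of such a DSE loop can be conveyed with a toy stand-in: the sketch below randomly samples a small hardware/quantization space and keeps the Pareto-optimal points over (latency, area, perplexity). The paper's flow uses multi-objective Bayesian optimization driven by the cycle-accurate simulator; the parameter ranges and cost models here are placeholders.

```python
import random

# Placeholder cost models, standing in for the cycle-accurate simulator and
# accuracy evaluation that the real DSE loop would call.
def latency_model(cfg):
    return 1e6 / (cfg["sub_arrays"] * cfg["block_k"])

def area_model(cfg):
    return cfg["sub_arrays"] * cfg["block_k"] * 0.01 + cfg["sram_kb"] * 0.002

def perplexity_model(cfg):
    return 6.0 + (0.8 if cfg["weight_bits"] == 4 else 0.1)

def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def random_search_dse(n_samples=200, seed=0):
    rng = random.Random(seed)
    pareto = []   # list of (objectives, config) pairs
    for _ in range(n_samples):
        cfg = {
            "sub_arrays":  rng.choice([16, 32, 64]),
            "block_k":     rng.choice([64, 128, 256]),
            "sram_kb":     rng.choice([512, 1024, 2048]),
            "weight_bits": rng.choice([4, 8]),
        }
        obj = (latency_model(cfg), area_model(cfg), perplexity_model(cfg))
        if any(dominates(p, obj) for p, _ in pareto):
            continue                                  # dominated: discard
        pareto = [(p, c) for p, c in pareto if not dominates(obj, p)]
        pareto.append((obj, cfg))
    return pareto

for obj, cfg in random_search_dse():
    print(obj, cfg)
```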
5. Performance Metrics and Comparative Evaluation
PLENA demonstrates significant empirical improvements compared to state-of-the-art accelerators, as reported in simulation:
- Arithmetic Utilization: The flattened systolic array achieves up to 8.5× higher utilization relative to traditional square arrays when operating on “fat” GEMMs common to long-context transformer workloads.
- Throughput: Under matched multiplier and memory settings,
- Throughput is 2.24× higher than the NVIDIA A100 GPU.
- Throughput is 3.85× higher than the Google TPU v6e.
- Compute Density and Area: Analysis tables in the paper report attainable compute density (TOPS per unit area), demonstrating favorable energy efficiency and die-area allocation (specific values as reported).
- System Block Diagrams: Figures in the paper illustrate the control pipeline, memory layout, and the logic of the flattened compute array.
6. Open-Source Dissemination and Research Impact
The complete PLENA hardware and software stack, including:
- RTL hardware code,
- Custom ISA specification,
- Compiler and optimization toolchain,
- Cycle-accurate simulator,
- Automated design space exploration framework
is scheduled for open release upon publication. The open-source strategy is intended to facilitate reproducibility in LLM inference accelerator research, enable direct benchmarking across the community, and provide a reference platform for future advances in hardware–software co-design for long-context LLMs.
This approach is poised to influence accelerator design beyond conventional chatbot inference, addressing the agentic LLM workload class where long input contexts and memory-bound computation predominate.
7. Significance in the Context of LLM Hardware Acceleration
PLENA’s set of innovations—most notably, the flattened systolic array, asymmetric quantization, and natively fused FlashAttention pipeline—address inefficiencies that, in existing architectures, leave arithmetic units idle during memory-constrained inference and incur high off-chip I/O during attention. By explicitly targeting these limitations, with a full-stack, cross-layer approach, PLENA achieves measurable gains in utilization and throughput, confirmed under simulation against leading GPU and TPU architectures. The public release of the stack is expected to catalyze further research in scalable transformer model inference, particularly as context lengths and agentic workload complexity continue to increase.
PLENA thus represents a co-designed architecture for overcoming critical memory bottlenecks in long-context agentic LLM inference workloads, establishing both methodological and empirical reference points for future accelerator systems (Wu et al., 11 Sep 2025).