Prefill-Oriented Inference Architecture
- Prefill-Oriented Inference Architecture is a framework that separates the compute-intensive prefill stage from sequential decode, optimizing resource allocation for LLM inference.
- It employs dynamic scheduling, tailored model pruning, and hybrid batching to efficiently balance compute and memory constraints during processing.
- Techniques such as phase disaggregation, attention caching, and specialized hardware designs enable significant speedups while preserving inference accuracy.
A prefill-oriented inference architecture refers to any LLM or multimodal model (LMM) inference system in which the design, scheduling, hardware, parallelization, memory, and kernel strategies are optimized around the distinct attributes of the "prefill" stage—the initial, compute-intensive full-context forward pass that encodes input tokens and builds the key-value (KV) cache required for subsequent decoding or generation. This architectural concept arises from the deep asymmetry between prefill and decode: prefill is compute-bound and parallel across all input tokens, while decode is memory-bound and predominantly sequential. A broad taxonomy of prefill-oriented architectures thus encompasses approaches for: (1) hardware/software co-design for phase-disaggregation, (2) scheduling and batching explicitly aware of prefill workload characteristics, (3) prefill-specific model compression/pruning/approximate computation, and (4) memory, kernel, or FPGA-specific optimizations that exploit prefill's predictability and high arithmetic intensity.
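The compute/memory asymmetry can be made concrete with a back-of-the-envelope arithmetic-intensity estimate (a simplified roofline-style model; the parameter count and token counts below are illustrative, not drawn from any cited system):

```python
def arithmetic_intensity(params_bytes, tokens):
    """FLOPs per byte of weight traffic for one forward pass.

    Rough model: each parameter contributes ~2 FLOPs (multiply + add)
    per token, and every weight byte must be read once per pass.
    Attention and activation traffic are ignored for simplicity.
    """
    params = params_bytes / 2            # fp16: 2 bytes per parameter
    flops = 2 * params * tokens
    return flops / params_bytes

weights = 14e9  # ~7B parameters in fp16 (illustrative)
prefill = arithmetic_intensity(weights, tokens=2048)  # whole prompt at once
decode = arithmetic_intensity(weights, tokens=1)      # one token per step
print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
# prefill: 2048 FLOPs/byte, decode: 1 FLOPs/byte
```

Under this toy model, prefill's intensity scales with the number of prompt tokens processed per weight load, landing it in the compute-bound regime, while decode stays near one FLOP per weight byte and is bound by memory bandwidth.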
1. Phase Disaggregation and Workload Partitioning
Prefill-oriented architectures often begin with the separation of prefill and decode processing, concretely mapping phases onto distinct hardware resources or scheduling periods to exploit phase heterogeneity. In Cronus, a heterogeneous GPU cluster assigns the initial prefill work to low-end, memory-constrained GPUs and the remainder, together with the entire decode, to high-end GPUs, splitting the input tokens fractionally by measured per-token service rates to equalize completion times and maximize parallel device utilization (Liu et al., 22 Sep 2025). This partially disaggregated approach enables continuous streaming of requests, pipelines prefill and decode not just between but also within hardware classes, and dynamically rebalances the division through online profiling of per-GPU prefill and decode token costs.
Broader disaggregation strategies, such as prefill/decode (PD) separation, also underpin systems including TetriInfer (Hu et al., 2024), DOPD (Liao et al., 26 Nov 2025), ARES (Wang et al., 15 Oct 2025), SPAD (Zhang et al., 9 Oct 2025), PLA-Serve (She et al., 4 Jan 2026), and TD-Pipe (Zhang et al., 12 Jun 2025); here, specialized GPU pools, FPGAs, or custom ASICs handle the compute-bound prefill, while separate resources tackle the memory-bound, auto-regressive decode, often coupled with further sub-phase optimizations (e.g., splitting prefill tasks by request length in PLA-Serve).
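The prefill/decode separation these systems share can be sketched as two worker pools with a KV-cache handoff in between (a minimal toy model; the class and method names are hypothetical, and batching, transport, and scheduling are elided entirely):

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    rid: int
    prompt_len: int
    kv_cache: object = None          # filled in by the prefill pool
    generated: list = field(default_factory=list)

class DisaggregatedServer:
    """Toy prefill/decode (PD) separation: two worker pools, with the
    KV cache handed off after prefill (hypothetical interface)."""
    def __init__(self):
        self.prefill_q = deque()
        self.decode_q = deque()

    def submit(self, req):
        self.prefill_q.append(req)

    def prefill_step(self):
        # Compute-bound: process one whole prompt, emit its KV cache.
        if self.prefill_q:
            req = self.prefill_q.popleft()
            req.kv_cache = f"kv[{req.prompt_len} tokens]"  # stand-in
            self.decode_q.append(req)                      # KV handoff

    def decode_step(self):
        # Memory-bound: one token per in-flight request per step.
        for req in list(self.decode_q):
            req.generated.append("tok")

srv = DisaggregatedServer()
srv.submit(Request(rid=0, prompt_len=512))
srv.prefill_step()
srv.decode_step()
```

In real PD-disaggregated systems the handoff is the expensive part: the KV cache must be moved between the prefill and decode resources, which is why placement and transfer scheduling are co-optimized in practice.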
2. Mathematical Formulation and Load-Balancing Principles
Prefill-oriented inference requires formal decomposition of the workload and an explicit partitioning of compute and memory costs. For Cronus (Liu et al., 22 Sep 2025), the prefill split is:
- $t_L = T_L / r_L$ (time for the $T_L$ tokens assigned to the low-end GPU),
- $t_H = (T - T_L) / r_H$ (time for the remaining $T - T_L$ tokens on the high-end GPU),
- $t_L = t_H$ at the balanced split,
where $T$ is the total prompt length and $r_L$, $r_H$ are the measured per-token prefill service rates of each device class. Optimal workload partitioning is given by $T_L = T \cdot \frac{r_L}{r_L + r_H}$, balancing completion times between devices and enabling burst pipelining. Overlap is exploited in the schedule to minimize idle time, and the scheduling objective is typically to minimize high-percentile (P99) TTFT or maximize throughput, often formalized as $\min \, \mathrm{P99}(\{C_i\})$, where $C_i$ includes both prefill and decode completion for request $i$. Dynamic load balancing is crucial, with real-time adjustments to splitting fractions or resource assignments based on observed service rates (Liu et al., 22 Sep 2025), forecasted load (Liao et al., 26 Nov 2025), or instance pressure (She et al., 4 Jan 2026).
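The balanced-split rule can be computed directly (a minimal sketch; the function name and example rates are illustrative):

```python
def optimal_prefill_split(total_tokens, rate_low, rate_high):
    """Tokens assigned to the low-end GPU so both devices finish together:
    T_L / r_L = (T - T_L) / r_H  =>  T_L = T * r_L / (r_L + r_H)."""
    t_low = total_tokens * rate_low / (rate_low + rate_high)
    return t_low, total_tokens - t_low

# Illustrative per-token prefill service rates (tokens/s).
t_low, t_high = optimal_prefill_split(4096, rate_low=1000, rate_high=3000)
print(t_low, t_high)                 # 1024.0 3072.0
print(t_low / 1000, t_high / 3000)   # both 1.024 s: completion times match
```

Online profiling periodically re-estimates the rates and recomputes the split, which is the dynamic rebalancing described above.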
3. Prefill Optimization: Model, Memory, Attention, and Pruning
Prefill-oriented optimization techniques exploit the static, parallel, and often predictable nature of prefill computation for tailored efficiency gains.
- Model Pruning and Skipping: Prefill-only pruning (POP (He et al., 3 Feb 2026)) analyzes layer importance using gate-based second-order Taylor approximations, omitting deep layers (e.g., the last third) during prefill while retaining full depth at decode and designating the last prompt token as a strict stage boundary. This delivers substantial prefill speedups with sub-1% accuracy loss by computing only independent KV projections for the skipped layers.
- KV Cache Management and Distillation: SwiftKV (Qiao et al., 2024) skips late transformer layers during prefill by directly emitting later-layer KV-cache projections from earlier hidden states (SingleInputKV), coupled with knowledge-preserving distillation applied solely to the QKV projections. Further, layer-grouped cache sharing and memory compression (AcrossKV) reduce memory without affecting decode. PrefillOnly (Du et al., 12 May 2025) targets prefill-only workloads and keeps just the final-layer KV cache, shrinking KV-cache memory from a footprint proportional to the number of layers to that of a single layer.
- Sparse and Criticality-Based Attention: QUOKA (Jones et al., 9 Feb 2026) and CritiPrefill (Lv et al., 2024) accelerate prefill by selecting critical queries and keys: QUOKA identifies low cosine-similarity queries and their supporting keys, reducing attention computation to a small representative set and delivering attention-phase speedups on GPU with minimal accuracy loss; CritiPrefill partitions the sequence into segments and blocks, computing a segment-block importance matrix to focus computation on the blocks most critical to each query segment, achieving substantial prefill speedups with minimal quality degradation for 128K-token contexts.
- Attention Caching: AttnCache (Song et al., 29 Oct 2025) leverages attention-map similarity, using a vector database to retrieve per-layer attention maps for new inputs similar to previously cached sentences, thereby bypassing the expensive $QK^{\top}$/softmax computation on cache hits and cutting attention runtime by half or more for prefill-only inference.
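The segment-block criticality idea can be illustrated in a few lines of NumPy (a simplified sketch of CritiPrefill-style importance estimation, not the paper's algorithm: the pooling, scoring, and causal-mask handling are reduced to bare essentials, and all names are made up):

```python
import numpy as np

def critical_blocks(Q, K, seg=4, blk=4, keep=2):
    """Segment-block importance sketch: pool queries per segment and
    keys per block, score segments against blocks, and keep only the
    top-`keep` key blocks per query segment for full attention."""
    n, d = Q.shape
    q_seg = Q.reshape(n // seg, seg, d).mean(axis=1)    # pooled queries
    k_blk = K.reshape(n // blk, blk, d).mean(axis=1)    # pooled keys
    scores = q_seg @ k_blk.T                            # segments x blocks
    # NOTE: a real prefill kernel must also apply the causal mask here.
    return np.argsort(-scores, axis=1)[:, :keep]        # top blocks per segment

rng = np.random.default_rng(0)
n, d = 16, 8
sel = critical_blocks(rng.normal(size=(n, d)), rng.normal(size=(n, d)))
print(sel.shape)  # (4, 2): 4 query segments, 2 retained key blocks each
```

Attention is then computed only between each query segment and its selected key blocks, which is where the quadratic prefill cost is saved.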
4. Pipeline Scheduling, Hybrid Batching, and Kernel Fusion
Prefill-oriented systems employ advanced pipeline scheduling and batching to sustain GPU saturation across diverse workload regimes:
- Chunked and Layered Prefill: Chunked prefill splits prompt processing into uniform-length chunks to avoid compute underutilization and minimize large-batch overhead (TetriInfer (Hu et al., 2024); REDServe (Guo et al., 29 Sep 2025)), while layered prefill (Lee et al., 9 Oct 2025) vertically partitions the model by layer groups, interleaves prefill and decode across groups, and reduces redundant Mixture-of-Experts (MoE) weight reloads, yielding reductions in TTFT, end-to-end latency, and per-token energy on co-located hardware.
- Hybrid-Batch Attention Kernels: POD-Attention (Kamath et al., 2024) fuses prefill and decode attention into a single GPU kernel, statically partitioning thread blocks (CTAs) per multiprocessor and overlapping compute-bound and memory-bound operations. This fused kernel achieves substantial speedups over serially executed attention, with reduced mean TTFT and TBT and near-complete elimination of decode stalls for mixed-load batches.
- Temporally-Disaggregated Pipeline Parallelism: TD-Pipe (Zhang et al., 12 Jun 2025) separates prefill and decode phases temporally within pipeline parallelism, using a greedy phase-switching driver with memory simulation to maximize prefill progress before switching phases, dynamic work-stealing, and spatial-temporal switch logic to balance intensity and transitions. This yields substantial throughput gains over both tensor-parallel and pipeline-parallel baselines.
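The chunked-prefill pattern from the first bullet can be sketched as a scheduling loop (a toy model; `step_fn` stands in for the actual model call, and real systems pack prefill chunks and decode tokens into a single hybrid batch rather than alternating calls):

```python
def chunked_prefill(prompt_tokens, chunk_size, decode_batch, step_fn):
    """Process a long prompt in fixed-size chunks, running one decode
    step for co-scheduled requests between chunks. `step_fn(kind, work)`
    is a stand-in for the model forward pass."""
    kv_len = 0
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        step_fn("prefill", chunk)            # attends to kv_len + len(chunk) keys
        kv_len += len(chunk)                 # KV cache grows chunk by chunk
        if decode_batch:
            step_fn("decode", decode_batch)  # one token each, interleaved
    return kv_len

log = []
n = chunked_prefill(list(range(10)), chunk_size=4, decode_batch=["r1", "r2"],
                    step_fn=lambda kind, work: log.append((kind, len(work))))
print(n, log)
```

The chunk size is the main knob: it trades the long request's TTFT against interference with co-running decode steps.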
5. Scheduling and Resource Allocation Under Heterogeneous and Mixed Workloads
Scaling prefill-oriented inference to high concurrency, mixed-request, or heterogeneous environments introduces sophisticated scheduling, batching, and control policies.
- Dynamic P/D Ratio and Instance Assignment: DOPD (Liao et al., 26 Nov 2025) forecasts near-term load using time-series models (e.g., ARIMA) and computes the optimal number and tensor-parallel configuration of prefill and decode instances, driven by live metrics, to maintain high SLO attainment. Fine-grained length-aware batching and prioritization further minimize tail latency and queuing delays.
- Adaptive Rescheduling and Load Prediction: ARES (Wang et al., 15 Oct 2025) integrates a lightweight, continuously updated in-model length predictor (an MLP over the last decode token's embedding) to drive adaptive migrations, balancing live and forecasted decode-instance loads, suppressing OOM failures, and reducing time-per-output-token (TPOT).
- Multi-Class, Many-Server Control Theory: Prefill-oriented architectures for large-scale and service-tiered workloads are formalized as multiclass many-server queueing networks with phase-dependent, state-sensitive service rates (Lin et al., 3 Feb 2026). Optimal steady-state allocation is solved via a capacity-constrained LP, with negative-feedback occupancy tracking (“Gate-and-Route” policy) yielding asymptotically optimal throughput and SLI-compliant class fairness and latency. Practical scheduling is accomplished with only queue-length and a small amount of per-GPU state, and robustly outperforms static or FCFS baselines.
- Length-Aware, Dual-Queue Scheduling: PLA-Serve (She et al., 4 Jan 2026) isolates short- and long-prefill workloads (split by a prompt-length threshold) in mutually exclusive temporal or spatial queues, invoking a length-aware smart batching policy for short-prefill jobs, with adaptive batch windows and CUDA Graph-based kernel clustering for efficient launches. Instance-pressure balancing allows dynamic migration of instances between task pools, eliminating head-of-line blocking and maximizing throughput.
- Intra- and Inter-request Pipeline Coordination: In multi-modal serving, RServe (Guo et al., 29 Sep 2025) overlaps encoding and prefill both within and across requests, orchestrating chunked prefill that launches as soon as chunk-specific embeddings are available and globally batching requests by schedulable token count, thereby maximizing utilization and reducing time to first token.
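The dual-queue idea can be sketched as follows (a simplified illustration of length-aware routing and token-budget batching; the threshold, budget, and class names are hypothetical and do not reproduce PLA-Serve's actual policy):

```python
from collections import deque

class DualQueueScheduler:
    """Length-aware dual-queue routing sketch: short and long prefills
    go to separate queues so long prompts cannot head-of-line-block
    short ones; short-prefill jobs are batched under a token budget."""
    def __init__(self, threshold=1024, max_batch_tokens=4096):
        self.threshold = threshold
        self.max_batch_tokens = max_batch_tokens
        self.short_q, self.long_q = deque(), deque()

    def submit(self, rid, prompt_len):
        q = self.short_q if prompt_len <= self.threshold else self.long_q
        q.append((rid, prompt_len))

    def next_short_batch(self):
        # Greedy token-budget batching for the short-prefill pool.
        batch, budget = [], self.max_batch_tokens
        while self.short_q and self.short_q[0][1] <= budget:
            rid, n = self.short_q.popleft()
            batch.append(rid)
            budget -= n
        return batch

s = DualQueueScheduler()
for rid, n in [(0, 300), (1, 5000), (2, 700), (3, 200)]:
    s.submit(rid, n)
print(s.next_short_batch(), len(s.long_q))  # [0, 2, 3] 1
```

Request 1 (5000 tokens) lands in the long queue and never delays the short batch, which is the head-of-line-blocking property the bullet describes.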
6. Prefill-Only and Introspective Inference
Specialized architectures address cases where only a single token is generated:
- PrefillOnly Engine: For discriminative, prefill-only tasks (e.g., recommendation, data labeling), PrefillOnly (Du et al., 12 May 2025) drastically lowers memory by releasing all but the final layer's KV cache, enabling longer contexts on a single GPU and scheduling precisely from exact job-completion-time estimates, achieving higher QPS and lower mean latency than baselines.
- Self-Introspection During Prefill: IntroLM (Kasnavieh et al., 7 Jan 2026) introduces [CPX] introspection tokens in the prefill pass, plus token-conditional LoRA adapters and a classifier head, allowing LLMs to predict their own output success probability without affecting generation. This mechanism enables optimal multi-model routing, sharply reducing large-model usage and end-to-end latency.
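The routing decision such introspection enables can be sketched as a simple threshold rule (a hypothetical illustration: `success_prob` stands in for the classifier head over introspection-token states, which is not reproduced here):

```python
def route(prompt, small_model, large_model, success_prob, threshold=0.8):
    """Introspective routing sketch: if the small model's self-predicted
    success probability from its prefill pass clears the threshold, keep
    its answer; otherwise escalate to the large model."""
    p = success_prob(small_model, prompt)
    if p >= threshold:
        return small_model(prompt), "small"
    return large_model(prompt), "large"

# Toy stand-ins: confidence is high only for very short prompts.
small = lambda p: f"small:{p}"
large = lambda p: f"large:{p}"
conf = lambda model, p: 1.0 if len(p) < 10 else 0.3

print(route("hi", small, large, conf))  # ('small:hi', 'small')
print(route("a much longer prompt", small, large, conf))
```

Because the prediction comes from the small model's prefill pass alone, the escalation decision is made before any decode tokens are spent.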
7. Hardware Specialization and FPGA/ASIC Designs
Hardware implementations and enhancements for prefill-oriented inference include:
- Specialized Prefill Hardware (SPAD): Design of "Prefill Chips" with large systolic arrays, vector units reduced in favor of tensor compute, DRAM/GDDR7 in place of HBM, and L2 buffer optimization. Prefill chips deliver higher prompt throughput at lower hardware cost than the H100, with disaggregated deployments enabling overall cost reduction at fixed SLOs (Zhang et al., 9 Oct 2025).
- Edge FPGA Designs with Dynamic Reconfiguration (PD-Swap): Dynamic partial reconfiguration swaps the attention engine between a compute-heavy, token-parallel prefill microarchitecture and a bandwidth-optimized decode engine on a single edge FPGA, time-multiplexed without area penalty. This recovers LUT/URAM resources for deeper parallelism in each phase, sustaining throughput improvements for long prompts over static designs (Zhang et al., 12 Dec 2025).
In summary, prefill-oriented inference architectures tailor the computation, memory, scheduling, and hardware stack of LLM inference to exploit the structural and performance characteristics of the prefill stage. Techniques span dynamic cross-device load balancing, prefill-aware pruning and skipping, efficient attention through sparsity or cache reuse, concurrency-optimized scheduling and batching, specialized kernels, hardware disaggregation, and introspective or phase-specific logic. These methods deliver substantial throughput and latency improvements, particularly for long-context applications, and enable principled trade-offs tailored to workload and system heterogeneity across data center and edge environments (Liu et al., 22 Sep 2025, Qiao et al., 2024, He et al., 3 Feb 2026, Jones et al., 9 Feb 2026, Du et al., 12 May 2025, Liao et al., 26 Nov 2025, Zhang et al., 9 Oct 2025, Kamath et al., 2024, Lee et al., 9 Oct 2025, Lv et al., 2024, Zhang et al., 12 Jun 2025, Hu et al., 2024, Kasnavieh et al., 7 Jan 2026, Zhang et al., 12 Dec 2025, She et al., 4 Jan 2026, Guo et al., 29 Sep 2025, Lin et al., 3 Feb 2026).