LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling (2512.00083v1)

Published 26 Nov 2025 in cs.AR and cs.DC

Abstract: LLMs have achieved unprecedented success across various applications, but their substantial memory requirements pose significant challenges to current memory system designs, especially during inference. Our work targets last-level cache (LLC) based architectures, including GPUs (e.g., NVIDIA GPUs) and AI accelerators. We introduce LLaMCAT, a novel approach to optimize the LLC for LLM inference. LLaMCAT combines Miss Status Holding Register (MSHR)- and load balance-aware cache arbitration with thread throttling to address stringent bandwidth demands and minimize cache stalls in KV Cache access. We also propose a hybrid simulation framework integrating analytical models with cycle-level simulators via memory traces, balancing architecture detail and efficiency. Experiments demonstrate that LLaMCAT achieves an average speedup of 1.26x when the system is mainly bottlenecked by miss handling throughput, while baselines mostly show negative improvements since they are not optimized for this scenario. When the cache size is also limited, our policy achieves a speedup of 1.58x over the unoptimized version, and a 1.26x improvement over the best baseline (dyncta). Overall, LLaMCAT is the first to target LLM decoding-specific MSHR contention, a gap in previous work. It presents a practical solution for accelerating LLM inference on future hardware platforms.