Kinetics: Rethinking Test-Time Scaling Laws (2506.05333v3)

Published 5 Jun 2025 in cs.LG and cs.CL

Abstract: We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving over 60 points gains in low-cost regimes and over 5 points gains in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential and increasingly important with more computing invested, for realizing the full potential of test-time scaling where, unlike training, accuracy has yet to saturate as a function of computation, and continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.

Summary

  • The paper shows that incorporating memory access costs overturns FLOPs-only assumptions: under test-time strategies, smaller models are less efficient than previously thought.
  • It identifies a critical model size threshold where investing in longer generation methods offers better cost-effectiveness than simply enlarging model parameters.
  • The study introduces Sparse Kinetics with Block Top-K attention, which significantly reduces computational costs and enhances throughput for longer sequence generation.

This paper, "Kinetics: Rethinking Test-Time Scaling Laws" (2506.05333), challenges existing test-time scaling (TTS) laws for LLMs by arguing that they primarily focus on computation (FLOPs) and neglect memory access costs, which are critical bottlenecks in real-world inference, particularly for advanced TTS strategies like Long-CoT and Best-of-N.

The authors propose a new cost model that explicitly incorporates memory access costs, specifically the Key-Value (KV) cache size and its access frequency, in addition to computational costs. This leads to a new perspective they call the "Kinetics scaling law".
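A minimal sketch of what such a memory-aware cost model can look like is shown below. The function names, the fp16 byte counts, and the arithmetic-intensity constant used to convert memory traffic into FLOP-equivalents are illustrative assumptions, not the paper's calibrated formulation.

```python
# Sketch of a memory-aware, Kinetics-style per-request cost model.
# Constants (fp16 weights, the flops-per-byte conversion) are illustrative
# assumptions, not the paper's calibrated cost model.

def kinetics_cost(n_params: float, kv_bytes_per_token: float, gen_len: int,
                  flops_per_byte: float = 300.0) -> float:
    """Approximate cost (in FLOP-equivalents) of decoding gen_len tokens."""
    # Compute: roughly 2 FLOPs per parameter per generated token (linear layers).
    compute = 2.0 * n_params * gen_len

    # Memory access: every decode step re-reads the weights plus the KV cache
    # accumulated so far, so KV traffic grows quadratically in gen_len.
    weight_traffic = 2.0 * n_params * gen_len                        # fp16 weights
    kv_traffic = kv_bytes_per_token * gen_len * (gen_len + 1) / 2.0
    memory = (weight_traffic + kv_traffic) * flops_per_byte

    return compute + memory


def flops_only_cost(n_params: float, gen_len: int) -> float:
    """The compute-only cost that prior, FLOPs-centric scaling laws optimize."""
    return 2.0 * n_params * gen_len
```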

Using empirical evaluations on various models (Qwen3, DeepSeek-R1) across reasoning tasks like AIME and LiveCodeBench, the paper demonstrates that the Kinetics scaling law yields different insights compared to FLOPs-only scaling:

  1. Overestimation of Small Model Efficiency: Previous laws suggested small models are highly efficient, especially when combined with TTS strategies. Kinetics reveals that incorporating memory costs shows smaller models are less efficient than assumed. The Pareto frontier of accuracy vs. cost shifts, with larger models becoming more cost-effective even at lower accuracy levels.
  2. Critical Model Size Threshold: The analysis suggests that compute resources are most effectively allocated to increasing model size up to a certain threshold (empirically found to be around 14B parameters for Qwen3 and 7B for DeepSeek-R1) before investing heavily in increasing generation length (Long-CoT) or the number of trials (Best-of-N). Beyond this threshold, further compute investment in TTS strategies becomes more beneficial. This contradicts prior beliefs that scaling TTS strategies with small models is always the initial optimal step.
  3. Attention Cost Dominance: The shift is attributed to a roofline analysis showing that self-attention computation and KV cache access costs dominate linear layer costs, especially with longer generations common in TTS. This attention-related cost grows quadratically with generation length, while model parameter costs grow linearly. Furthermore, KV memory size often grows sub-linearly with model parameters, making smaller models disproportionately burdened by KV cache relative to their parameter count.
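To make the disproportionate-burden point concrete, here is a back-of-the-envelope comparison. The parameter counts and per-token KV sizes are hypothetical round numbers chosen only to illustrate sub-linear KV scaling, not measured configurations of Qwen3 or DeepSeek-R1.

```python
# Rough illustration of why KV-cache traffic burdens small models disproportionately.
# Parameter counts and per-token KV sizes are hypothetical illustrative values.

def traffic_ratio(n_params: float, kv_bytes_per_token: float, gen_len: int) -> float:
    """KV-cache traffic divided by weight traffic over one long decode."""
    weight_bytes = 2 * n_params * gen_len                       # fp16 weights re-read per step
    kv_bytes = kv_bytes_per_token * gen_len * gen_len / 2       # KV cache re-read as it grows
    return kv_bytes / weight_bytes

for name, n_params, kv_per_tok in [
    ("small (~1B params, 100 KB KV/token)", 1e9, 100e3),
    ("large (~14B params, 200 KB KV/token)", 14e9, 200e3),
]:
    print(name, "->", round(traffic_ratio(n_params, kv_per_tok, gen_len=32_000), 2))

# With these assumed numbers, the small model spends ~0.8x its weight traffic on the
# KV cache, versus ~0.11x for the large model; the gap widens further with longer
# generations or with parallel Best-of-N samples that share the same weights.
```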

Based on these findings, the paper introduces Sparse Kinetics, a new scaling paradigm centered on sparse attention. Sparse attention mitigates the attention bottleneck by reducing the cost of computing and accessing the KV cache from quadratic in generation length ($\mathcal{O}(L^2 D)$ over a full generation, or $\mathcal{O}(LD)$ per decoded token) to a cost that remains linear in generation length and depends instead on a smaller per-token KV budget ($\mathcal{O}(LBD)$), where $B$ is the sparse budget.
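As a quick sanity check on these asymptotics, the following few lines compare the two costs; the values of $L$, $D$, and $B$ are arbitrary illustrative choices.

```python
# Dense vs. sparse attention cost over a full decode, following the asymptotic
# forms above. L, D, and B are arbitrary illustrative values.

L = 32_000   # generated tokens
D = 4_096    # per-token KV width (illustrative)
B = 1_024    # sparse KV budget per decoding step (illustrative)

dense_attention  = L * L * D / 2   # each step attends to the whole growing cache: O(L^2 D)
sparse_attention = L * B * D       # each step attends to at most B cache entries: O(L B D)

print(f"dense / sparse attention cost ≈ {dense_attention / sparse_attention:.1f}x")
# ≈ 15.6x here; the ratio scales as L / (2B), so it keeps growing with longer generations.
```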

The authors demonstrate the potential of sparse attention using an oracle Top-$K$ attention mechanism and a more practical Block Top-$K$ attention:

  • Enhanced Performance and Efficiency: Sparse attention models consistently outperform dense models across benchmarks. They achieve significant accuracy gains (over 60 points in low-cost regimes and over 5 points in high-cost regimes) or require substantially less compute (over 10x reduction) to reach the same accuracy. This holds for both standard Transformer and Mixture-of-Experts (MoE) models.
  • Reshaped Kinetics: Sparse attention fundamentally changes the cost structure, making smaller models more competitive on the Pareto frontier again by reducing the severe penalty associated with long generations.
  • Optimal Resource Allocation with Sparsity: Under Sparse Kinetics, compute is most effectively allocated to increasing the number of generated tokens/trials rather than solely reducing sparsity (increasing KV budget) or dramatically increasing model size beyond a certain point. Doubling compute cost results in a larger increase in generated tokens than in KV budget.

For practical implementation, the paper focuses on Block Top-$K$ Attention. This method groups tokens into blocks and selects the most relevant blocks based on average key vectors, making it hardware-efficient and compatible with systems like paged attention. Experiments using a FlashInfer and torch.compile backend on H200 GPUs show substantial throughput improvements with Block Top-$K$ attention compared to dense attention, especially for smaller models and longer contexts (e.g., 23.6x to 33.3x for Qwen3-0.6B). While Block Top-$K$ might not fully match the oracle Top-$K$ performance, it provides a good balance between effectiveness and tractability.
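A minimal PyTorch sketch of the block-selection idea follows. It assumes a single query head, a cache length divisible by the block size, and omits the paged-attention and FlashInfer kernel integration used in the paper; the function name and default budget are hypothetical.

```python
import torch

def block_topk_attention(q, k, v, block_size=64, k_blocks=8):
    """Single-query, single-head sketch of Block Top-K attention.

    q: (d,) query for the current decode step
    k, v: (L, d) cached keys / values, with L assumed divisible by block_size.
    Blocks are scored by their mean key vector; only the top-k_blocks blocks
    take part in the softmax. This is a simplified illustration, not the
    paper's optimized kernel.
    """
    L, d = k.shape
    n_blocks = L // block_size

    # Score each block by the dot product of the query with its mean key.
    block_keys = k.view(n_blocks, block_size, d).mean(dim=1)   # (n_blocks, d)
    top = torch.topk(block_keys @ q, k=min(k_blocks, n_blocks)).indices

    # Gather the selected blocks and run ordinary softmax attention over them.
    sel_k = k.view(n_blocks, block_size, d)[top].reshape(-1, d)
    sel_v = v.view(n_blocks, block_size, d)[top].reshape(-1, d)
    weights = torch.softmax(sel_k @ q / d**0.5, dim=0)
    return weights @ sel_v                                     # (d,)


# Example: a 4,096-token cache attended with a budget of 8 x 64 = 512 tokens.
d, L = 128, 4096
q, k, v = torch.randn(d), torch.randn(L, d), torch.randn(L, d)
out = block_topk_attention(q, k, v)
```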

The paper concludes by highlighting the importance of co-designing model architectures, test-time strategies, and hardware systems, guided by the Kinetics scaling law and leveraging sparse attention, to achieve efficient and scalable LLM deployment, particularly as TTS strategies become more prevalent.

In summary, "Kinetics: Rethinking Test-Time Scaling Laws" provides a memory-aware cost model for LLM inference, revealing limitations of prior compute-centric scaling laws and demonstrating that sparse attention is a crucial enabler for efficient and scalable test-time performance, particularly unlocking greater potential in generating longer responses or more trials within a given resource budget.
