State Tuning: State-based Test-Time Scaling on RWKV-7 (2504.05097v1)

Published 7 Apr 2025 in cs.CL and cs.LG

Abstract: Test-time scaling has emerged as a prominent research direction in machine learning, enabling models to enhance their expressive capabilities during inference. Transformers, renowned for striking a delicate balance between efficiency and expressiveness, have benefited from test-time scaling techniques that leverage an expanding key-value (KV) cache to significantly improve performance. In this paper, we introduce a novel state-based approach to test-time scaling, which we term state tuning, tailored to the RNN-based RWKV-7 model. By exploiting the unique strengths of RWKV-7, our method achieves state-of-the-art performance on the target task without altering the model's pre-trained weights. Our approach centers on three key innovations. First, we develop an observer framework that allows a smaller model to replicate and learn the state dynamics of the RWKV-7 model. Second, we employ a kernel method to dynamically upscale the state size, enhancing the model's capacity to capture intricate patterns. Third, we integrate Decorrelated Backpropagation (DBP) to optimize the upscaled state matrix, thereby improving convergence and expressivity. By tuning only the state matrix, we demonstrate that a smaller model can outperform larger models on the given task. This method preserves the efficiency of the original RWKV-7 architecture while harnessing the power of test-time scaling to deliver superior results. Our findings underscore the potential of state tuning as an effective strategy for advancing model performance in resource-constrained settings. Our code is available at https://github.com/TorchRWKV/flash-linear-attention.

Summary

  • The paper introduces state tuning, a method that optimizes RWKV-7's internal state matrix without modifying its pre-trained weights.
  • It details four approaches (standard state tuning, dynamic scaling with kernel methods, DBP-enhanced tuning, and test-time scaling with larger-model guidance) to adapt the model for specific tasks.
  • Experiments show that DBP-enhanced tuning delivers the highest gains, significantly outpacing the baseline on benchmarks like MMLU and GSM8K.

This paper introduces "State Tuning," a set of techniques designed to enhance the performance of the RWKV-7 "Goose" LLM without modifying its pre-trained weights. The core idea is to optimize the model's internal state matrix $S_t$ to adapt it to specific tasks or improve its general capabilities during inference. This addresses the common challenge of needing large computational resources for training or fine-tuning LLMs, offering more efficient ways to boost smaller models like the 7B-parameter RWKV-7.

The paper proposes four distinct methods:

  1. Standard State Tuning: This baseline approach initializes a new state matrix $S_0$ (e.g., with zeros) and optimizes it directly using backpropagation on a target-task dataset. All other model parameters remain frozen. The state $S_t \in \mathbb{R}^{N \times N}$ is updated at each time step using the standard RWKV-7 recurrence relation, but only $S_t$ itself is treated as a trainable parameter. This allows the model to adapt its internal memory dynamics to the specific task while retaining its pre-trained knowledge (a minimal sketch of this setup appears after the list).
  2. Dynamic Scaling with Kernel Method: To increase the model's expressive capacity beyond the original state dimension $N$, this method uses a kernel trick to effectively upscale the state matrix to a higher dimension $M \times M$ (where $M > N$).
    • Support vectors $\{u_1, \dots, u_M\}$ are chosen.
    • A kernel function (e.g., the Gaussian kernel $K(u, v) = \exp(-\gamma \| u - v \|^2)$) maps the RWKV-7 internal vectors $w_t, k_t, a_t, v_t, r_t \in \mathbb{R}^N$ into higher-dimensional feature vectors $\phi(\cdot) \in \mathbb{R}^M$.
    • The state update and output computation happen in this $M$-dimensional space, using the transformed vectors $\phi(\cdot)$ and an upscaled state matrix $S_t \in \mathbb{R}^{M \times M}$.
    • The final output is projected back to the original dimension $N$ using a fixed projection matrix $Q$.
    • Only the upscaled state matrix $S_t \in \mathbb{R}^{M \times M}$ is tuned. This adds non-linearity and increases state capacity without changing the pre-trained weights (see the kernel-upscaling sketch after this list).
  3. DBP-Enhanced Dynamic State Tuning: This method builds upon the dynamic scaling approach by incorporating Decorrelated Backpropagation (DBP). DBP aims to improve training efficiency and model expressivity by decorrelating the inputs to layers. Here, it is adapted to decorrelate the kernel-transformed vectors $\phi(\cdot)$ before they are used in the state update.
    • A decorrelation matrix $R \in \mathbb{R}^{M \times M}$ is introduced and applied to the kernel features (e.g., $\phi(k_t)^{\text{decor}} = R\,\phi(k_t)$).
    • The state update uses these decorrelated vectors.
    • A decorrelation loss $\mathcal{L}_{\text{decor}}$ is added to the task loss, penalizing correlations between components of the transformed vectors and encouraging unit variance (a possible form of this loss is sketched after the list).
    • Both the upscaled state matrix $S_t$ and the decorrelation matrix $R$ are jointly optimized during training. DBP is expected to accelerate convergence and lead to more expressive state representations.
  4. Test-Time Scaling with Larger Model Guidance: This technique performs state tuning during inference, separately for each input sequence, guided by a larger and more capable LLM.
    • For a given input sequence, the larger LLM generates a step-by-step chain-of-thought (CoT) reasoning sequence.
    • At each generation step $t$ of the RWKV-7 model, its current state $S_t$ is optimized using reinforcement learning (RL).
    • A reward $R(S_t, x_{t+1})$ is defined by how well the RWKV-7 model's next predicted token $x_{t+1}$ aligns with the corresponding step in the larger LLM's CoT sequence (using log-probabilities from the larger LLM).
    • The gradient of this reward with respect to $S_t$ is computed, and $S_t$ is updated via gradient ascent for a few iterations (sketched after this list).
    • The tuned state $S_t$ is then used to generate the next token $x_{t+1}$.
    • This allows RWKV-7 to dynamically adapt its internal state at test time to perform more complex reasoning, guided by the larger model, without requiring prior training on similar reasoning tasks.
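
To make method 1 concrete, here is a minimal sketch, assuming a toy `rwkv7_step(state, x)` stand-in for the frozen RWKV-7 recurrence and a random readout matrix. Only the state matrix is registered as a trainable parameter; everything else stays fixed. The sizes and dummy data are purely illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

N, vocab = 64, 1000                                    # illustrative sizes only

# Frozen stand-in for the pre-trained readout (hypothetical, not the real RWKV-7 head).
W_readout = torch.randn(vocab, N)

def rwkv7_step(state, x):
    """One frozen recurrence step; a schematic placeholder for the real RWKV-7 update."""
    new_state = 0.9 * state + torch.outer(x, x)        # toy state transition
    logits = W_readout @ (new_state @ x)               # toy readout from the state
    return logits, new_state

S0 = torch.nn.Parameter(torch.zeros(N, N))             # the only trainable tensor
opt = torch.optim.Adam([S0], lr=1e-2)

# Dummy task data: one sequence of embeddings with target token ids.
data = [([torch.randn(N) for _ in range(8)],
         [torch.randint(vocab, (1,)) for _ in range(8)])]

for x_seq, y_seq in data:
    state, loss = S0, torch.zeros(())
    for x, y in zip(x_seq, y_seq):
        logits, state = rwkv7_step(state, x)
        loss = loss + F.cross_entropy(logits[None], y)
    opt.zero_grad(); loss.backward(); opt.step()       # gradients reach only S0
```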
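
For method 2, this sketch shows one plausible form of the kernel upscaling: fixed support vectors define a Gaussian-kernel feature map into $\mathbb{R}^M$, the tuned state lives in $\mathbb{R}^{M \times M}$, and a fixed matrix $Q$ projects outputs back to $\mathbb{R}^N$. The recurrence inside `upscaled_step` is a schematic placeholder rather than the exact RWKV-7 update.

```python
import torch

N, M, gamma = 64, 256, 0.1                             # M > N upscales the state
U = torch.randn(M, N)                                  # fixed support vectors u_1..u_M
Q = torch.randn(N, M)                                  # fixed projection back to dimension N

def phi(v):
    """Gaussian-kernel feature map: phi(v)_i = exp(-gamma * ||v - u_i||^2), in R^M."""
    return torch.exp(-gamma * ((U - v) ** 2).sum(dim=-1))

S_up = torch.nn.Parameter(torch.zeros(M, M))           # upscaled state, the only tuned tensor

def upscaled_step(S, k_t, v_t, r_t):
    """Schematic state update and readout carried out in the M-dimensional space."""
    S_next = 0.9 * S + torch.outer(phi(v_t), phi(k_t)) # placeholder recurrence in R^{M x M}
    out_M = S_next @ phi(r_t)                          # read out in the upscaled space
    return Q @ out_M, S_next                           # project back to R^N

# Usage: y_t, S_up_next = upscaled_step(S_up, k_t, v_t, r_t)
```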
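
For method 3, the snippet below sketches only the decorrelation piece: a trainable matrix `R` is applied to batches of kernel features, and a decorrelation loss penalizes off-diagonal covariance while pushing per-component variance toward one. This loss form is an assumption consistent with the description above, not necessarily the paper's exact $\mathcal{L}_{\text{decor}}$.

```python
import torch

M = 256
R = torch.nn.Parameter(torch.eye(M))                   # decorrelation matrix, trained jointly with the state

def decorrelate(features):
    """Apply R to a batch of kernel features phi(.) of shape (B, M)."""
    return features @ R.T

def decorrelation_loss(feats_decor):
    """Penalize cross-correlations and push each component toward unit variance."""
    B = feats_decor.shape[0]
    centered = feats_decor - feats_decor.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / max(B - 1, 1)
    off_diag = cov - torch.diag(torch.diagonal(cov))
    return (off_diag ** 2).mean() + ((torch.diagonal(cov) - 1.0) ** 2).mean()

# total_loss = task_loss + lambda_decor * decorrelation_loss(decorrelate(phi_batch))
```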
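
For method 4, the sketch below refines the state at a single generation step by gradient ascent on a differentiable surrogate reward, here taken to be RWKV-7's log-probability of the token proposed by the larger model's CoT step. The `rwkv7_logits(S, ctx)` helper (returning next-token logits from state `S`) and this surrogate are assumptions for illustration; the paper's actual reward also incorporates the larger model's log-probabilities.

```python
import torch
import torch.nn.functional as F

def tune_state_at_step(S_t, ctx, guide_token, rwkv7_logits, n_iters=3, lr=1e-2):
    """Gradient-ascent refinement of the state S_t for one generation step."""
    S = S_t.clone().detach().requires_grad_(True)
    for _ in range(n_iters):
        logits = rwkv7_logits(S, ctx)                  # frozen forward pass from state S
        # Surrogate reward: log-probability of the token suggested by the larger LLM's CoT.
        reward = F.log_softmax(logits, dim=-1)[guide_token]
        grad, = torch.autograd.grad(reward, S)
        S = (S + lr * grad).detach().requires_grad_(True)  # ascend on the reward
    return S.detach()                                  # tuned state used to emit the next token
```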

Experiments and Results:

The methods were evaluated on the RWKV-7 "Goose" 7B model using benchmarks like MMLU (general knowledge), GSM8K (math reasoning), WinoGrande (commonsense), and ARC-Challenge (scientific reasoning).

  • All proposed methods significantly outperformed the vanilla RWKV-7 baseline.
  • Standard State Tuning provided a solid improvement (an absolute gain of roughly 7-8 points on benchmarks such as MMLU and GSM8K).
  • Dynamic Scaling offered further improvements over standard tuning.
  • DBP-Enhanced Dynamic State Tuning achieved the best results across all benchmarks (e.g., 79.0% MMLU, 89.0% GSM8K), demonstrating the benefits of decorrelating state inputs for enhanced expressivity and potentially faster convergence during tuning.
  • Test-Time Scaling performed nearly as well as the DBP-enhanced method (e.g., 78.6% MMLU, 88.5% GSM8K), showcasing the effectiveness of dynamic, inference-time adaptation guided by a larger model.

Conclusion:

The paper successfully demonstrates that state tuning, in various forms, is an effective and computationally efficient strategy for enhancing the performance of the RWKV-7 model without altering its pre-trained weights. The DBP-enhanced method provides the highest performance gains among the training-based approaches, while the test-time scaling method offers a flexible way to leverage larger models for guidance during inference. These techniques present practical ways to improve smaller models for complex tasks, especially in resource-constrained settings.
