State Tuning: State-based Test-Time Scaling on RWKV-7 (2504.05097v1)

Published 7 Apr 2025 in cs.CL and cs.LG

Abstract: Test-time scaling has emerged as a prominent research direction in machine learning, enabling models to enhance their expressive capabilities during inference. Transformers, renowned for striking a delicate balance between efficiency and expressiveness, have benefited from test-time scaling techniques that leverage an expanding key-value (KV) cache to significantly improve performance. In this paper, we introduce a novel state-based approach to test-time scaling, which we term state tuning, tailored to the RNN-based RWKV-7 model. By exploiting the unique strengths of RWKV-7, our method achieves state-of-the-art performance on the target task without altering the model's pre-trained weights. Our approach centers on three key innovations. First, we develop an observer framework that allows a smaller model to replicate and learn the state dynamics of the RWKV-7 model. Second, we employ a kernel method to dynamically upscale the state size, enhancing the model's capacity to capture intricate patterns. Third, we integrate Decorrelated Backpropagation (DBP) to optimize the upscaled state matrix, thereby improving convergence and expressivity. By tuning only the state matrix, we demonstrate that a smaller model can outperform larger models on the given task. This method preserves the efficiency of the original RWKV-7 architecture while harnessing the power of test-time scaling to deliver superior results. Our findings underscore the potential of state tuning as an effective strategy for advancing model performance in resource-constrained settings. Our code is available at https://github.com/TorchRWKV/flash-linear-attention.

Summary

  • The paper introduces state tuning, a method that optimizes RWKV-7's internal state matrix without modifying its pre-trained weights.
  • It details four approaches (standard state tuning, dynamic scaling with kernel methods, DBP-enhanced tuning, and test-time scaling with larger-model guidance) to adapt the model for specific tasks.
  • Experiments show that DBP-enhanced tuning delivers the highest gains, significantly outpacing the baseline on benchmarks like MMLU and GSM8K.

This paper introduces "State Tuning," a set of techniques designed to enhance the performance of the RWKV-7 "Goose" LLM without modifying its pre-trained weights. The core idea is to optimize the model's internal state matrix $S_t$ to adapt it to specific tasks or improve its general capabilities during inference. This addresses the common challenge of needing large computational resources for training or fine-tuning LLMs, offering more efficient ways to boost smaller models like the 7B-parameter RWKV-7.

The paper proposes four distinct methods:

  1. Standard State Tuning: This baseline approach initializes a new state matrix $S_0$ (e.g., with zeros) and optimizes it directly using backpropagation on a target-task dataset. All other model parameters remain frozen. The state $S_t \in \mathbb{R}^{N \times N}$ is updated at each time step using the standard RWKV-7 recurrence relation, but only $S_t$ itself is treated as a trainable parameter. This allows the model to adapt its internal memory dynamics to the specific task while retaining its pre-trained knowledge (a minimal sketch of this setup appears after the list).
  2. Dynamic Scaling with Kernel Method: To increase the model's expressive capacity beyond the original state dimension $N$, this method uses a kernel trick to effectively upscale the state matrix to a higher dimension $M \times M$ (where $M > N$).
    • Support vectors $\{u_1, \dots, u_M\}$ are chosen.
    • A kernel function (e.g., the Gaussian kernel $K(u, v) = \exp(-\gamma \| u - v \|^2)$) maps the RWKV-7 internal vectors $w_t, k_t, a_t, v_t, r_t \in \mathbb{R}^N$ into higher-dimensional feature vectors $\phi(\cdot) \in \mathbb{R}^M$.
    • The state update and output computation happen in this $M$-dimensional space, using the transformed vectors $\phi(\cdot)$ and an upscaled state matrix $S_t \in \mathbb{R}^{M \times M}$.
    • The final output is projected back to the original dimension $N$ using a fixed projection matrix $Q$.
    • Only the upscaled state matrix $S_t \in \mathbb{R}^{M \times M}$ is tuned. This adds non-linearity and increases state capacity without changing the pre-trained weights (see the kernel-upscaling sketch after this list).
  3. DBP-Enhanced Dynamic State Tuning: This method builds upon the dynamic scaling approach by incorporating Decorrelated Backpropagation (DBP). DBP aims to improve training efficiency and model expressivity by decorrelating the inputs to layers. Here, it is adapted to decorrelate the kernel-transformed vectors $\phi(\cdot)$ before they are used in the state update.
    • A decorrelation matrix $R \in \mathbb{R}^{M \times M}$ is introduced and applied to the kernel features (e.g., $\phi(k_t)^{\text{decor}} = R\,\phi(k_t)$).
    • The state update uses these decorrelated vectors.
    • A decorrelation loss $\mathcal{L}_{\text{decor}}$ is added to the task loss, penalizing correlations between components of the transformed vectors and encouraging unit variance (a possible form of this loss is sketched after the list).
    • Both the upscaled state matrix $S_t$ and the decorrelation matrix $R$ are jointly optimized during training. DBP is expected to accelerate convergence and lead to more expressive state representations.
  4. Test-Time Scaling with Larger Model Guidance: This technique performs state tuning during inference, separately for each input sequence, guided by a larger and more capable LLM.
    • For a given input sequence, the larger LLM generates a step-by-step chain-of-thought (CoT) reasoning sequence.
    • At each generation step $t$ of the RWKV-7 model, its current state $S_t$ is optimized using reinforcement learning (RL).
    • A reward $R(S_t, x_{t+1})$ is defined by how well the RWKV-7 model's next predicted token $x_{t+1}$ aligns with the corresponding step in the larger LLM's CoT sequence (using log-probabilities from the larger LLM).
    • The gradient of this reward with respect to $S_t$ is computed, and $S_t$ is updated via gradient ascent for a few iterations (sketched after this list).
    • The tuned state $S_t$ is then used to generate the next token $x_{t+1}$.
    • This allows RWKV-7 to dynamically adapt its internal state at test time to perform more complex reasoning, guided by the larger model, without requiring prior training on similar reasoning tasks.
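
To make method 1 concrete, here is a minimal sketch, assuming a toy `rwkv7_step(state, x)` stand-in for the frozen RWKV-7 recurrence and a random readout matrix. Only the state matrix is registered as a trainable parameter; everything else stays fixed. The sizes and dummy data are purely illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

N, vocab = 64, 1000                                    # illustrative sizes only

# Frozen stand-in for the pre-trained readout (hypothetical, not the real RWKV-7 head).
W_readout = torch.randn(vocab, N)

def rwkv7_step(state, x):
    """One frozen recurrence step; a schematic placeholder for the real RWKV-7 update."""
    new_state = 0.9 * state + torch.outer(x, x)        # toy state transition
    logits = W_readout @ (new_state @ x)               # toy readout from the state
    return logits, new_state

S0 = torch.nn.Parameter(torch.zeros(N, N))             # the only trainable tensor
opt = torch.optim.Adam([S0], lr=1e-2)

# Dummy task data: one sequence of embeddings with target token ids.
data = [([torch.randn(N) for _ in range(8)],
         [torch.randint(vocab, (1,)) for _ in range(8)])]

for x_seq, y_seq in data:
    state, loss = S0, torch.zeros(())
    for x, y in zip(x_seq, y_seq):
        logits, state = rwkv7_step(state, x)
        loss = loss + F.cross_entropy(logits[None], y)
    opt.zero_grad(); loss.backward(); opt.step()       # gradients reach only S0
```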
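
For method 2, this sketch shows one plausible form of the kernel upscaling: fixed support vectors define a Gaussian-kernel feature map into $\mathbb{R}^M$, the tuned state lives in $\mathbb{R}^{M \times M}$, and a fixed matrix $Q$ projects outputs back to $\mathbb{R}^N$. The recurrence inside `upscaled_step` is a schematic placeholder rather than the exact RWKV-7 update.

```python
import torch

N, M, gamma = 64, 256, 0.1                             # M > N upscales the state
U = torch.randn(M, N)                                  # fixed support vectors u_1..u_M
Q = torch.randn(N, M)                                  # fixed projection back to dimension N

def phi(v):
    """Gaussian-kernel feature map: phi(v)_i = exp(-gamma * ||v - u_i||^2), in R^M."""
    return torch.exp(-gamma * ((U - v) ** 2).sum(dim=-1))

S_up = torch.nn.Parameter(torch.zeros(M, M))           # upscaled state, the only tuned tensor

def upscaled_step(S, k_t, v_t, r_t):
    """Schematic state update and readout carried out in the M-dimensional space."""
    S_next = 0.9 * S + torch.outer(phi(v_t), phi(k_t)) # placeholder recurrence in R^{M x M}
    out_M = S_next @ phi(r_t)                          # read out in the upscaled space
    return Q @ out_M, S_next                           # project back to R^N

# Usage: y_t, S_up_next = upscaled_step(S_up, k_t, v_t, r_t)
```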
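
For method 3, the snippet below sketches only the decorrelation piece: a trainable matrix `R` is applied to batches of kernel features, and a decorrelation loss penalizes off-diagonal covariance while pushing per-component variance toward one. This loss form is an assumption consistent with the description above, not necessarily the paper's exact $\mathcal{L}_{\text{decor}}$.

```python
import torch

M = 256
R = torch.nn.Parameter(torch.eye(M))                   # decorrelation matrix, trained jointly with the state

def decorrelate(features):
    """Apply R to a batch of kernel features phi(.) of shape (B, M)."""
    return features @ R.T

def decorrelation_loss(feats_decor):
    """Penalize cross-correlations and push each component toward unit variance."""
    B = feats_decor.shape[0]
    centered = feats_decor - feats_decor.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / max(B - 1, 1)
    off_diag = cov - torch.diag(torch.diagonal(cov))
    return (off_diag ** 2).mean() + ((torch.diagonal(cov) - 1.0) ** 2).mean()

# total_loss = task_loss + lambda_decor * decorrelation_loss(decorrelate(phi_batch))
```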
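
For method 4, the sketch below refines the state at a single generation step by gradient ascent on a differentiable surrogate reward, here taken to be RWKV-7's log-probability of the token proposed by the larger model's CoT step. The `rwkv7_logits(S, ctx)` helper (returning next-token logits from state `S`) and this surrogate are assumptions for illustration; the paper's actual reward also incorporates the larger model's log-probabilities.

```python
import torch
import torch.nn.functional as F

def tune_state_at_step(S_t, ctx, guide_token, rwkv7_logits, n_iters=3, lr=1e-2):
    """Gradient-ascent refinement of the state S_t for one generation step."""
    S = S_t.clone().detach().requires_grad_(True)
    for _ in range(n_iters):
        logits = rwkv7_logits(S, ctx)                  # frozen forward pass from state S
        # Surrogate reward: log-probability of the token suggested by the larger LLM's CoT.
        reward = F.log_softmax(logits, dim=-1)[guide_token]
        grad, = torch.autograd.grad(reward, S)
        S = (S + lr * grad).detach().requires_grad_(True)  # ascend on the reward
    return S.detach()                                  # tuned state used to emit the next token
```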

Experiments and Results:

The methods were evaluated on the RWKV-7 "Goose" 7B model using benchmarks like MMLU (general knowledge), GSM8K (math reasoning), WinoGrande (commonsense), and ARC-Challenge (scientific reasoning).

  • All proposed methods significantly outperformed the vanilla RWKV-7 baseline.
  • Standard State Tuning provided a solid improvement (an absolute gain of roughly 7-8 points on benchmarks such as MMLU and GSM8K).
  • Dynamic Scaling offered further improvements over standard tuning.
  • DBP-Enhanced Dynamic State Tuning achieved the best results across all benchmarks (e.g., 79.0% MMLU, 89.0% GSM8K), demonstrating the benefits of decorrelating state inputs for enhanced expressivity and potentially faster convergence during tuning.
  • Test-Time Scaling performed nearly as well as the DBP-enhanced method (e.g., 78.6% MMLU, 88.5% GSM8K), showcasing the effectiveness of dynamic, inference-time adaptation guided by a larger model.

Conclusion:

The paper successfully demonstrates that state tuning, in various forms, is an effective and computationally efficient strategy for enhancing the performance of the RWKV-7 model without altering its pre-trained weights. The DBP-enhanced method provides the highest performance gains among the training-based approaches, while the test-time scaling method offers a flexible way to leverage larger models for guidance during inference. These techniques present practical ways to improve smaller models for complex tasks, especially in resource-constrained settings.
