- The paper demonstrates that tool-integrated self-verification substantially improves the output accuracy of small LMs by offloading memorization-heavy checks to external tools.
- The methodology combines tool-based checks with reward-model scoring, with verification capabilities distilled from larger models, yielding robust performance across benchmarks.
- Results show that a Llama-3.2 1B model can outperform the much larger Llama-3.1 8B when test-time scaling is paired with integrated verification.
Tool-integrated Self-verification for Small LLMs
The paper "T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small LLMs" examines the challenges of test-time compute scaling in small LLMs (sLMs) and proposes Tool-integrated Self-verification (T1) to address them. The research targets verifying sLM outputs without relying on large verifier models, instead leveraging external tools to bolster verification accuracy.
Background and Motivation
Recent work has shown that test-time compute scaling can significantly improve sLM performance, letting these models approach the proficiency of much larger LLMs. However, sLMs remain weak at verification, particularly on tasks that demand substantial memorization, such as numerical calculation and fact-checking. Existing approaches often rely on larger models to verify outputs, which undermines the efficiency advantage of using sLMs in the first place. This research asks whether sLMs can verify their own outputs reliably when supplemented by external tools.
Methodology
The methodology involves a two-stage process:
- Tool-based Verification Stage: Here, sLMs are paired with external tools to facilitate verification, primarily focusing on reducing memorization demands. For tasks involving mathematical reasoning, a code interpreter can be used to verify computations. In knowledge-intensive tasks, a retriever tool provides relevant information to check factual accuracy.
- Reward Model-based Verification Stage: This stage involves using reward models, trained via knowledge distillation from larger models, to score solutions based on their logical consistency and correctness.
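The two stages above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper names are hypothetical, Python's `eval` stands in for a code interpreter, and the reward model is stubbed out.

```python
# Sketch of two-stage verification (all helper names are hypothetical).
# Stage 1 re-runs a candidate's computation with an external tool; here
# Python's eval() stands in for a code interpreter. Stage 2 scores the
# surviving candidates with a reward model, stubbed out below.

def tool_check(expression: str, claimed: float) -> bool:
    """Stage 1: tool-based verification of a numerical claim."""
    try:
        return abs(eval(expression) - claimed) < 1e-9
    except Exception:
        return False  # candidates the tool cannot verify are rejected

def reward_score(solution_text: str) -> float:
    """Stage 2 stub: a distilled reward model would score consistency."""
    return 1.0  # placeholder: uniform score

def select(candidates):
    """Filter by the tool, then pick the highest reward-model score."""
    passed = [c for c in candidates if tool_check(c["expr"], c["answer"])]
    return max(passed, key=lambda c: reward_score(c["text"]), default=None)

candidates = [
    {"expr": "17 * 24", "answer": 408, "text": "17 * 24 = 408"},  # correct
    {"expr": "17 * 24", "answer": 418, "text": "17 * 24 = 418"},  # arithmetic slip
]
best = select(candidates)  # the tool filters out the incorrect candidate
```

The point of the first stage is that the sLM never has to "remember" what 17 × 24 equals; the tool recomputes it, so the verifier only has to decide which checks to run.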
Knowledge distillation techniques are used to enhance the sLM's performance, transferring verification capabilities from larger models to sLMs. Multi-LoRA adapters help manage diverse tasks during this process.
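The distillation idea can be illustrated with a toy example: a minimal "student verifier" (here a one-parameter logistic model, an illustrative stand-in for an sLM's verification head) trained to match the soft correctness probabilities a larger teacher assigns to candidates. The features, targets, and model are assumptions for the sketch, not the paper's setup.

```python
import math

# Toy sketch of distilling verification capability into a small model:
# a logistic "student verifier" is trained to match the correctness
# probabilities a larger teacher assigns to candidate solutions.

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def distill_step(w: float, b: float, x: float, teacher_p: float, lr: float = 0.5):
    """One gradient step on soft-label binary cross-entropy."""
    p = sigmoid(w * x + b)   # student's verification score
    grad = p - teacher_p     # d(BCE)/d(logit) with a soft target
    return w - lr * grad * x, b - lr * grad

w, b = 0.0, 0.0
for _ in range(200):
    w, b = distill_step(w, b, 1.0, 0.9)   # teacher: likely correct
    w, b = distill_step(w, b, -1.0, 0.1)  # teacher: likely incorrect
# After training, the student reproduces the teacher's soft judgments.
```

Training on the teacher's probabilities rather than hard labels is what lets the student absorb graded judgments of solution quality, not just binary pass/fail decisions.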
Results
Experiments across benchmarks such as MATH500, GSM8K, and MMLU-Pro, supported by theoretical analysis, show that T1 considerably improves sLM performance. In particular, with T1 a Llama-3.2 1B model outperformed the substantially larger Llama-3.1 8B under test-time scaling. This suggests that external tool integration is especially valuable on tasks that have traditionally favored large models because of their memorization capacity.
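Under parallel test-time scaling, many candidate solutions are sampled and the verifier's scores are aggregated to pick a final answer. One common aggregation rule is verifier-weighted voting, sketched below; the exact rule and the scores shown are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

# Sketch of verifier-weighted voting over sampled candidates, a common
# aggregation rule under parallel test-time scaling (scores are made up
# for illustration).

def weighted_vote(candidates):
    """candidates: (answer, verifier_score) pairs; highest total wins."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

# Four sampled solutions to one problem, each scored by the verifier.
samples = [("408", 0.9), ("418", 0.4), ("408", 0.8), ("398", 0.3)]
winner = weighted_vote(samples)  # "408" wins with total score 1.7
```

Better verification directly improves this selection step, which is why stronger verifiers translate into larger gains as more samples are drawn.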
Implications
The implications of this research are significant both practically and theoretically. Practically, it enables deploying smaller, more cost-effective models without sacrificing problem-solving performance. Theoretically, it opens up the question of how tools can systematically offload memorization from model parameters, allowing smaller models to compete in areas previously dominated by larger ones.
Future Directions
Future research may integrate tool use into test-time scaling frameworks beyond parallel scaling, such as sequential scaling algorithms. Refinements to tool-use strategies could also improve verifier accuracy by further reducing false negatives, and the set of tools available to sLMs could be expanded.
In conclusion, integrating tool-based verification into sLMs is a robust way to strengthen their self-verification under test-time scaling, offering practical efficiency gains without sacrificing accuracy.