
Self-Improving LLM Agents at Test-Time (2510.07841v1)

Published 9 Oct 2025 in cs.LG, cs.AI, and cs.CL

Abstract: One paradigm of LLM fine-tuning relies on creating large training datasets, under the assumption that high quantity and diversity will enable models to generalize to novel tasks after post-training. In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive; worse, there is no guarantee that the resulting model will handle complex scenarios or generalize better. Moreover, existing techniques rarely assess whether a training sample provides novel information or is redundant with the knowledge already acquired by the model, resulting in unnecessary costs. In this work, we explore a new test-time self-improvement method to create more effective and generalizable agentic LMs on-the-fly. The proposed algorithm can be summarized in three steps: (i) first it identifies the samples that the model struggles with (self-awareness), (ii) then generates similar examples from detected uncertain samples (self-data augmentation), and (iii) uses these newly generated samples for test-time fine-tuning (self-improvement). We study two variants of this approach: Test-Time Self-Improvement (TT-SI), where the same model generates additional training examples from its own uncertain cases and then learns from them, and Test-Time Distillation (TT-D), where a stronger model generates similar examples for uncertain cases, enabling the student to adapt using distilled supervision. Empirical evaluations across different agent benchmarks demonstrate that TT-SI improves performance by a +5.48% absolute accuracy gain on average across all benchmarks and surpasses other standard learning methods, while using 68x fewer training samples. Our findings highlight the promise of TT-SI, demonstrating the potential of self-improvement algorithms at test-time as a new paradigm for building more capable agents toward self-evolution.

Summary

  • The paper introduces a test-time self-improvement mechanism that leverages uncertainty-based sample selection, synthetic data generation, and lightweight parameter updates to enhance LLM agent performance.
  • The approach achieves an average accuracy gain of +5.48% and demonstrates 68× higher sample efficiency compared to standard supervised fine-tuning.
  • Its modular design enables the integration of alternative uncertainty metrics and data synthesis methods, paving the way for future self-adapting, lifelong learning systems.

Test-Time Self-Improvement for LLM Agents: Framework, Empirical Analysis, and Implications

The paper "Self-Improving LLM Agents at Test-Time" (2510.07841) introduces a modular framework for enabling LLM agents to adapt and improve their performance during inference, without reliance on large-scale offline retraining. The proposed Test-Time Self-Improvement (TT-SI) algorithm leverages uncertainty estimation, targeted data synthesis, and lightweight parameter updates to achieve substantial gains in agentic tasks, with strong empirical results and efficiency advantages over conventional fine-tuning paradigms. Figure 1

Figure 1: Overview of the TT-SI framework, illustrating the three-stage process and average delta-accuracy gains across four agentic benchmarks.


Motivation and Problem Formulation

Traditional LLM agent fine-tuning relies on large, diverse datasets and expensive training cycles, yet suffers from several limitations: distributional shift between train and test, high annotation and compute costs, redundancy in training samples, and catastrophic forgetting. The TT-SI paradigm is motivated by transductive and local learning principles, as well as human self-regulated learning, where adaptation is focused on challenging, informative instances rather than exhaustive coverage.

TT-SI reframes agentic adaptation as a test-time process, where the model:

  1. Identifies uncertain test samples via a margin-based confidence estimator (self-awareness).
  2. Synthesizes similar training instances for each uncertain sample using the model itself (self-data augmentation) or a stronger teacher (the TT-D variant).
  3. Performs lightweight parameter updates (self-improvement) using PEFT (LoRA), then resets the parameters after inference.

This approach enables on-the-fly, instance-specific adaptation, targeting the model's weaknesses and surfacing latent knowledge.


Algorithmic Framework

The TT-SI algorithm is formalized as follows:

  • Uncertainty Estimation (self-awareness): For each test input $x_i$, compute the negative log-likelihood of each candidate action, normalize via Relative Softmax Scoring (RSS), and flag samples whose softmax margin $u(x_i) = p^{(1)} - p^{(2)}$ falls below a threshold $\tau$ (a minimal sketch of this check follows the list).
  • Data Synthesis (self-data augmentation): For each uncertain $x_i$, generate $K$ synthetic input-output pairs using a prompt-based LLM generation process, ensuring semantic proximity to the original query.
  • Test-Time Fine-Tuning (self-improvement): Temporarily update model parameters via LoRA on the synthesized data, perform inference, and restore the original weights.
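
As referenced above, a minimal sketch of the margin-based check is given below. It assumes the negative log-likelihoods of the candidate actions have already been computed; the function names, the plain-softmax normalization, and the default threshold are illustrative rather than the paper's reference implementation.

```python
import numpy as np

def margin_uncertainty(candidate_nlls) -> float:
    """Softmax margin u(x_i) = p^(1) - p^(2) over candidate actions.

    candidate_nlls: negative log-likelihoods of each candidate action for a
    single test input (assumes at least two candidates). The exact RSS
    normalization in the paper may differ from this plain softmax.
    """
    scores = -np.asarray(candidate_nlls, dtype=float)  # higher = more likely
    scores -= scores.max()                             # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()      # softmax over candidates
    top2 = np.sort(probs)[-2:]                         # two largest probabilities
    return float(top2[1] - top2[0])

def is_uncertain(candidate_nlls, tau: float = 0.2) -> bool:
    """Flag the input for test-time adaptation; tau = 0.2 is a placeholder value."""
    return margin_uncertainty(candidate_nlls) < tau
```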

Pseudocode for the full procedure is provided in the paper, and the modular design allows for substitution of uncertainty metrics, data generators, and update rules.
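
The paper's pseudocode is the authoritative reference; the sketch below is only a schematic rendering of that loop, with every helper injected as a placeholder callable (uncertainty check, data generator, LoRA fine-tuning, prediction, and weight reset) and with the `k` and `tau` defaults chosen for illustration.

```python
from typing import Callable, Iterable, List

def tt_si_inference(
    model,
    test_inputs: Iterable,
    *,
    is_uncertain: Callable,   # (model, x, tau) -> bool, e.g. the margin check sketched above
    synthesize: Callable,     # (model, x, k) -> list of (input, output) pairs near x
    finetune_lora: Callable,  # (model, pairs) -> temporarily adapted model
    predict: Callable,        # (model, x) -> prediction
    reset: Callable,          # discards the LoRA delta, restoring the base weights
    k: int = 5,
    tau: float = 0.2,
) -> List:
    """Schematic TT-SI loop; every callable stands in for one of the paper's components."""
    predictions = []
    for x in test_inputs:
        if not is_uncertain(model, x, tau):
            predictions.append(predict(model, x))  # confident case: answer directly
            continue
        pairs = synthesize(model, x, k)            # self-data augmentation (teacher model in TT-D)
        adapted = finetune_lora(model, pairs)      # temporary LoRA update on synthetic pairs
        predictions.append(predict(adapted, x))    # answer the uncertain query
        reset(adapted)                             # restore original weights before the next instance
    return predictions
```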

Figure 2: TT-SI accuracy and scaling behavior on SealTool, including ablations and sample efficiency analysis.


Empirical Results

TT-SI is evaluated on four agentic benchmarks: NexusRaven, SealTool, API-Bank, and ToolAlpaca, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. Key findings include:

  • Absolute accuracy gains: TT-SI improves baseline prompting by +5.48% on average, with consistent gains across all benchmarks.
  • Sample efficiency: TT-SI achieves higher accuracy than supervised fine-tuning (SFT) on SealTool using 68× fewer samples (190 vs. 13k), demonstrating strong efficiency.
  • Test-time distillation (TT-D): Using a stronger teacher for data synthesis yields further improvements, especially in context-heavy scenarios.
  • ICL variant: When training is infeasible, TT-SI with in-context learning (ICL) offers a training-free alternative, outperforming standard ICL baselines.
  • Uncertainty filtering: Ablations show that focusing adaptation on uncertain samples yields near-optimal accuracy with reduced computational cost, and the choice of $\tau$ controls the trade-off between coverage and efficiency (a toy illustration of this trade-off follows the figure captions below).

    Figure 3: RSS-based uncertainty estimation yields clearer separation between correct and incorrect predictions compared to perplexity-based baselines.


    Figure 4: Trade-off between true positive rate and false positive rate as the uncertainty threshold $\tau$ varies.
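
To make the coverage-versus-efficiency trade-off behind Figure 4 concrete, the toy sketch below sweeps the threshold and reports how many samples would be flagged, together with true/false positive rates; the margins and correctness labels are synthetic stand-ins, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: per-sample confidence margins u(x_i) and whether the
# unadapted model's answer was actually wrong (1 = wrong). Real values would
# come from the uncertainty estimator and the benchmark labels.
margins = rng.uniform(0.0, 1.0, size=1000)
is_wrong = (rng.uniform(size=1000) < 1.0 - margins).astype(int)

for tau in (0.1, 0.2, 0.4, 0.6):
    flagged = margins < tau  # samples that would be routed to test-time adaptation
    tpr = (flagged & (is_wrong == 1)).sum() / max(is_wrong.sum(), 1)
    fpr = (flagged & (is_wrong == 0)).sum() / max((is_wrong == 0).sum(), 1)
    print(f"tau={tau:.1f}  flagged={flagged.mean():5.1%}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```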


Ablation and Analysis

  • Scaling: TT-SI generalizes across model sizes, with larger relative gains for smaller models, supporting efficient deployment of compact agents.
  • Targeted adaptation: Training on uncertain samples is more effective than on certain or all samples, validating the sharpening hypothesis.
  • Data synthesis quality: UMAP visualizations show that self-generated samples are tightly clustered near the uncertain input, ensuring distributional alignment.
  • Cheating experiments: TT-SI achieves scores close to models explicitly trained on the test set, indicating that high-quality synthetic data can nearly match ground-truth adaptation.

    Figure 5: Comparison of TT-SI and baselines when trained on the test set (cheating experiment), highlighting the effectiveness of TT-SI.


Implementation and Resource Considerations

TT-SI is implemented using HuggingFace Transformers for uncertainty estimation, vLLM for data generation and inference, and LLaMA-Factory for LoRA-based fine-tuning. Average per-sample latency is 7.3 s for uncertain samples, while the overall wall-clock cost remains below that of full SFT. The framework is compatible with public agentic benchmarks and can be extended to other domains.
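
The LoRA step in the paper is run through LLaMA-Factory; as an illustrative stand-in, the sketch below shows one way to attach and then discard a temporary adapter with HuggingFace PEFT. The model name matches the evaluated Qwen2.5-1.5B-Instruct, but the LoRA rank, target modules, and the elided training loop are assumptions for the sake of the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_name = "Qwen/Qwen2.5-1.5B-Instruct"  # one of the models evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)

# Small LoRA adapter; rank, alpha, and target modules are placeholder choices.
peft_model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

# ... briefly fine-tune peft_model on the K synthetic pairs, then run inference
#     on the uncertain test input ...

# Detach the adapter so the next test instance starts from the original weights
# (base weights stay frozen under LoRA, so unloading restores the original model).
model = peft_model.unload()
```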


Theoretical and Practical Implications

TT-SI demonstrates that LLM agents can self-improve during inference by leveraging uncertainty-guided adaptation and self-generated data, surfacing latent knowledge without external supervision. This challenges the necessity of large-scale retraining and opens new directions for efficient, lifelong agent learning. The modular design allows for future integration of improved uncertainty estimators, adaptive data generation, and co-evolutionary training setups.

Limitations include sensitivity to the uncertainty threshold $\tau$ and the inherent knowledge boundary of the base model. TT-SI cannot recover information absent from the pretrained weights, suggesting the need for retrieval or external augmentation in such cases.


Conclusion

The TT-SI framework provides a principled, efficient approach for test-time adaptation of LLM agents, achieving strong empirical gains with minimal data and compute. By integrating self-awareness, targeted self-augmentation, and lightweight self-improvement, TT-SI advances the paradigm of self-improving agents and lays the groundwork for future research in self-evolving, lifelong learning systems. The results highlight both the practical utility and theoretical significance of uncertainty-driven, modular test-time learning for agentic NLP tasks.
