- The paper introduces Test-Time Preference Optimization (TPO), which aligns large language model outputs using iterative textual feedback without retraining parameters.
- The method repurposes the policy model's own instruction-following ability to translate numerical reward signals into textual critiques that refine outputs in real time.
- Empirical results show that only a few TPO iterations substantially improve benchmark performance, allowing an unaligned model to rival or surpass its training-time-aligned counterpart.
An Analysis of Test-Time Preference Optimization in LLMs
The paper Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback presents a novel approach to aligning LLMs with human preferences through a method called Test-Time Preference Optimization (TPO). The technique adapts model outputs to user preferences at inference time, bypassing the parameter retraining that traditional methods such as Reinforcement Learning from Human Feedback (RLHF) require. The paper addresses two fundamental questions: whether LLMs can match the performance of training-time alignment methods during inference, and whether interpretable textual feedback, rather than purely numerical scores, can drive the optimization.
Framework and Methodology
The TPO approach repurposes the intrinsic capabilities of the policy model to interpret reward signals, transforming numerical feedback into textual critiques. These textual rewards iteratively refine the model's output. Unlike the parameter updates of traditional preference alignment, TPO relies on the innate ability of LLMs to process and act upon feedback without any weight modification. The method mirrors the three components of gradient descent, defining a variable, computing a loss, and updating the variable, but executes each step in text: candidate responses play the role of the variable, critiques derived from reward-model scores act as a textual loss, and revised generations serve as the update.
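To make the analogy concrete, here is a minimal sketch of the loop in Python. The helper callables `policy` and `reward`, the prompt wording, and the default settings are illustrative assumptions, not the paper's actual prompts or implementation.

```python
# Minimal sketch of a TPO-style loop (hypothetical helper API, not the paper's code).
# policy(prompt) -> str wraps an instruction-following LLM;
# reward(query, response) -> float wraps a numerical reward model.

def tpo(query, policy, reward, num_samples=5, num_iterations=2):
    # "Define the variable": start from a pool of sampled candidate responses.
    candidates = [policy(query) for _ in range(num_samples)]

    for _ in range(num_iterations):
        # Score candidates with the reward model (numerical feedback).
        scored = sorted(candidates, key=lambda r: reward(query, r))
        worst, best = scored[0], scored[-1]

        # "Compute the loss": ask the policy to turn the score gap into a
        # textual critique contrasting the best and worst responses.
        critique = policy(
            f"Question: {query}\n\nChosen response:\n{best}\n\n"
            f"Rejected response:\n{worst}\n\n"
            "Explain why the chosen response is better and what the "
            "rejected response should improve."
        )

        # "Update the variable": generate revised responses conditioned on
        # the critique, refreshing the candidate pool.
        candidates = [
            policy(
                f"Question: {query}\n\nDraft response:\n{best}\n\n"
                f"Critique:\n{critique}\n\n"
                "Rewrite the draft to address the critique."
            )
            for _ in range(num_samples)
        ]

    # Return the highest-scoring response after the final iteration.
    return max(candidates, key=lambda r: reward(query, r))
```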
Empirical Evaluation
The effectiveness of TPO is evaluated across benchmarks covering instruction following, preference alignment, safety, and mathematical reasoning. Notably, the Llama-3.1-70B-SFT model, after only a few TPO iterations, outperforms its training-time-aligned counterpart, Llama-3.1-70B-Instruct, across several datasets. Benchmark results show substantial gains for both unaligned and aligned models with as few as two TPO steps, and the optimized models achieve strong results on AlpacaEval 2 and Arena-Hard, often surpassing established leaderboard entries.
Practical and Theoretical Implications
A salient implication of TPO is that it performs alignment without altering model parameters, which substantially reduces computational cost relative to conventional preference optimization techniques. TPO also scales test-time computation along two axes: search width (the number of responses sampled in parallel) and search depth (the number of revision iterations). Depth-wise revision compensates for the diminishing returns of simply increasing sampling width, effectively combining the benefits of parallel sampling and sequential revision.
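For contrast, width-only scaling corresponds to plain best-of-N sampling with no revision step. A minimal sketch, again assuming the hypothetical `policy` and `reward` callables from above:

```python
def best_of_n(query, policy, reward, n=20):
    # Width-only scaling: sample many candidates in parallel and keep the
    # highest-scoring one, with no sequential revision.
    candidates = [policy(query) for _ in range(n)]
    return max(candidates, key=lambda r: reward(query, r))
```

Depth-wise scaling instead keeps the per-iteration sample count modest and spends the budget on revision rounds, e.g. raising `num_iterations` in the earlier `tpo` sketch while holding `num_samples` fixed.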
Limitations and Future Directions
While TPO demonstrates that LLM outputs can be aligned at test time, it depends heavily on the model's instruction-following capability. Models lacking this ability, such as smaller LLMs, may not fully benefit from the approach. The authors suggest that future work could fine-tune models specifically for TPO-style interaction, broadening its applicability and effectiveness.
Conclusion
TPO offers a compelling test-time alternative to training-time preference optimization by exploiting the reasoning and interpretation capabilities of LLMs. The method provides a scalable, efficient way to align LLMs with human preferences dynamically and in real time, paving the way for interactive AI systems that adjust their outputs to user feedback without extensive computational overhead. Future work could refine the textual interaction protocols, improve the reward models, and adapt weaker LLMs for more effective TPO.