Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum (2510.00526v1)

Published 1 Oct 2025 in cs.CL and cs.LG

Abstract: Supervised fine-tuning (SFT) is the standard approach for post-training LLMs, yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. To this end, we study a general family of probability-based objectives and characterize their effectiveness under different conditions. Through comprehensive experiments and extensive ablation studies across 7 model backbones, 14 benchmarks, and 3 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. Our code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.

Summary

  • The paper introduces probability-based objectives that leverage strong model priors to refine predictions and outperform negative log likelihood in Model-Strong settings.
  • It demonstrates that NLL remains effective in Model-Weak scenarios while highlighting the need for adaptive strategies in Model-Intermediate cases.
  • Experiments in domains such as mathematical and medical reasoning validate the continuum approach, offering actionable insights for fine-tuning large language models.

Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

Introduction

The paper addresses limitations of negative log likelihood (NLL), the default objective for supervised fine-tuning (SFT) of LLMs. It proposes probability-based objectives as a more effective alternative, particularly when models already encode substantial task-relevant priors from pre-training on vast datasets.
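
To make the family of objectives concrete, the following is a minimal PyTorch sketch of the per-token losses discussed in the paper. It is not the authors' released implementation; the function name, argument layout, and the k = 10 default are illustrative.

    import torch
    import torch.nn.functional as F

    def token_losses(logits, targets, objective="nll", k=10):
        # logits: (batch, seq_len, vocab); targets: (batch, seq_len).
        log_p = F.log_softmax(logits, dim=-1)
        # Log-probability the model assigns to each target token.
        tgt_log_p = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        p = tgt_log_p.exp()
        if objective == "nll":        # standard SFT objective: -log p
            return -tgt_log_p
        if objective == "neg_p":      # prior-leaning: -p
            return -p
        if objective == "neg_p_k":    # sharper prior-leaning: -p^k (e.g., k = 10)
            return -p.pow(k)
        raise ValueError(f"unknown objective: {objective}")

The per-token values can then be averaged (optionally after masking) to form the training loss.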

Model-Capability Continuum

The effectiveness of different objectives depends on the model-capability continuum—a spectrum defining model strength based on how well pre-trained models can leverage prior knowledge. This continuum is divided into three categories:

  1. Model-Strong (MS) End: Models in this category have strong prior knowledge. Probability-based objectives that down-weight low-probability tokens, such as $-p$ and $-p^{10}$, significantly outperform NLL.
  2. Model-Weak (MW) End: For models with weak priors, NLL remains favorable due to its ability to broadly learn from all tokens.
  3. Model-Intermediate (MI) Region: No single objective dominates, highlighting the need for adaptive approaches (see Figure 1).

Figure 1: The model-capability continuum of SFT objectives in post-training. At the model-strong (MS) end, where base models already encode extensive priors, prior-leaning objectives consistently outperform NLL; at the model-weak (MW) end, standard NLL dominates.

Experimental Setup

The experiments validate the continuum view across three domains: mathematical reasoning, medical reasoning, and textual puzzles. Models such as LLaMA-3.1-8B and Qwen2.5-Math-7B were fine-tuned with AdamW, and the results show distinct performance patterns along the continuum.
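
A hypothetical training step tying one of these objectives to AdamW is sketched below; `model`, `loader`, and the hyperparameters are placeholders, not the paper's settings.

    # Illustrative SFT step with AdamW. A real run would also mask prompt
    # and padding tokens out of the loss.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
    for batch in loader:
        logits = model(input_ids=batch["input_ids"]).logits
        # Shift so position t predicts token t+1, then apply the chosen objective.
        loss = token_losses(logits[:, :-1, :], batch["input_ids"][:, 1:],
                            objective="neg_p").mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()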

Main Experimental Results

Model-Strong End: In this scenario, $-p$ consistently outperforms NLL by leveraging strong priors: it concentrates learning on tokens the model already assigns high probability and diminishes the influence of noise from low-probability tokens.

Model-Intermediate Results: There is no significant difference between $-p$ and NLL on medical reasoning tasks. This neutrality suggests that when priors are balanced, performance gains may depend more on other factors, such as data quality.

Model-Weak End: NLL's emphasis on low-probability tokens spreads learning broadly across predictions and helps correct errors, which is essential in domains with weak prior support, such as textual puzzles.

Theoretical Insights

A theoretical analysis establishes the conditions under which prior-leaning and prior-averse objectives outperform each other. The paper shows that prior-leaning objectives excel when strong priors exist, while prior-averse objectives are preferable where priors are inadequate.
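
The trading of places along the continuum can be seen directly from the chain rule: each objective in the family is the NLL gradient reweighted by a function of the token's current probability $p_t = p_\theta(y_t \mid y_{<t})$:

    $\nabla_\theta[-\log p_t] = -\nabla_\theta \log p_t$
    $\nabla_\theta[-p_t] = -p_t \, \nabla_\theta \log p_t$
    $\nabla_\theta[-p_t^{k}] = -k \, p_t^{k} \, \nabla_\theta \log p_t$

Thus $-p$ and $-p^{k}$ scale each token's update by $p_t$ (respectively $k p_t^{k}$): tokens the model already deems likely dominate, while low-probability, potentially noisy tokens are downweighted. This is the prior-leaning behavior that helps at the model-strong end and hurts at the model-weak end.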

Conclusion and Future Directions

The paper suggests a shift from fixed objectives to adaptive strategies that cater to the model's prior knowledge, capable of dynamically evolving during training. Future work could focus on integrating domain-specific supervision and exploring curriculum-style adaptation techniques.

Figures for Visual Reference

Figure 2

Figure 2: Performance under quantile thresholding for $-\log(p)$, $-p$, and $\log(1-p)$. $Q_{\text{percentile}}$ is the predicted-probability threshold used to control which probability ranges the loss emphasizes.
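
A rough sketch of how such quantile thresholding could be implemented follows; this is an assumption about how $Q_{\text{percentile}}$ is applied, and the paper's exact masking may differ.

    def quantile_thresholded_loss(tgt_log_p, q=0.5, base="nll"):
        # tgt_log_p: per-token log-probabilities of the target tokens.
        # Keep only tokens whose predicted probability clears the q-th
        # batch quantile; the rest contribute zero loss.
        p = tgt_log_p.exp()
        tau = torch.quantile(p.flatten(), q)
        mask = (p >= tau).float()
        per_token = -tgt_log_p if base == "nll" else -p
        return mask * per_token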

Figure 3

Figure 3: Analysis of the MS and MW ends in terms of objective convexity and likelihood estimation, highlighting the contrasting performance of concave versus convex objectives across the model-capability spectrum.
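
The convexity contrast in Figure 3 can be read off the second derivatives of the per-token losses with respect to $p$ (standard calculus, stated here for convenience):

    $\frac{d^2}{dp^2}[-\log p] = \frac{1}{p^2} > 0$ (convex)
    $\frac{d^2}{dp^2}[-p] = 0$ (linear)
    $\frac{d^2}{dp^2}[-p^{k}] = -k(k-1)\,p^{k-2} < 0$ for $k > 1$ (concave)

Convex losses like NLL blow up on low-probability tokens and keep pushing mass toward every supervised token, while concave losses saturate there, leaning on the model's prior instead.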
