
Unsupervised Elicitation of Language Models (2506.10139v1)

Published 11 Jun 2025 in cs.CL and cs.AI

Abstract: To steer pretrained LLMs for downstream tasks, today's post-training paradigm relies on humans to specify desired behaviors. However, for models with superhuman capabilities, it is difficult or impossible to get high-quality human supervision. To address this challenge, we introduce a new unsupervised algorithm, Internal Coherence Maximization (ICM), to fine-tune pretrained LLMs on their own generated labels, \emph{without external supervision}. On GSM8k-verification, TruthfulQA, and Alpaca reward modeling tasks, our method matches the performance of training on golden supervision and outperforms training on crowdsourced human supervision. On tasks where LMs' capabilities are strongly superhuman, our method can elicit those capabilities significantly better than training on human labels. Finally, we show that our method can improve the training of frontier LMs: we use our method to train an unsupervised reward model and use reinforcement learning to train a Claude 3.5 Haiku-based assistant. Both the reward model and the assistant outperform their human-supervised counterparts.

Summary

  • The paper presents ICM, which optimizes language model labels through mutual predictability and logical consistency without relying on external human annotations.
  • The method iteratively refines labels using a simulated annealing approach, achieving performance on tasks like TruthfulQA, GSM8K, and Alpaca comparable to golden supervision.
  • ICM highlights the potential of unsupervised label elicitation to match or surpass human and zero-shot benchmarks, despite challenges with non-salient concepts.

This paper introduces Internal Coherence Maximization (ICM), an unsupervised algorithm designed to fine-tune pretrained LMs using their own generated labels, thereby bypassing the need for external human supervision, especially for tasks where human expertise is limited or models exhibit superhuman capabilities. The core idea is that pretrained LMs already possess rich representations of many concepts, which can be "elicited" rather than taught.

Methodology: Internal Coherence Maximization (ICM)

The goal is to estimate labels $\{y_i\}$ for a set of inputs $\{x_i\}$ without using any ground-truth labels $\{y_i^*\}$. ICM achieves this by optimizing a scoring function $U(D) = \alpha \cdot \mathcal{P}_\theta(D) - \mathcal{I}(D)$, where $D = \{(x_i, y_i)\}$ is the set of model-generated labels.

  1. Mutual Predictability ($\mathcal{P}_\theta(D)$): This term measures how well the model can predict each label $y_i$ when conditioned on all other labeled examples $D \setminus \{(x_i, y_i)\}$. It is calculated as the sum of log probabilities: $\mathcal{P}_\theta(D) = \sum_{i=0}^N \log P_\theta(y_i \mid x_i, D \setminus \{(x_i, y_i)\})$. A high score indicates that the labels collectively represent a coherent concept for the model.
  2. Logical Consistency ($\mathcal{I}(D)$): This term penalizes inconsistencies between pairs of labels. It is defined as $\mathcal{I}(D) = \sum_{i=1}^N \sum_{j=1}^N c(x_i, y_i, x_j, y_j)$, where $c(x_i, y_i, x_j, y_j)$ is a binary indicator of whether the labels $y_i$ and $y_j$ for inputs $x_i$ and $x_j$ are logically inconsistent, so $\mathcal{I}(D)$ counts inconsistent pairs. Examples include:
    • For mathematical correctness, two solutions to the same problem with different final answers cannot both be "True."
    • For comparative datasets, if "A > B" is "True," then "B > A" cannot also be "True" (asymmetry).

The hyperparameter $\alpha$ balances these two terms.
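To make the scoring function concrete, here is a minimal sketch for a pairwise-comparison task. The helper `label_logprob` and the asymmetry-only `inconsistent` check are hypothetical stand-ins, not the paper's implementation; in the paper the consistency checks are task-specific (e.g., the math-correctness rule above).

```python
from itertools import combinations

def inconsistent(ex_i, ex_j):
    """Hypothetical asymmetry check for a comparison task: labeling both
    "A > B" and "B > A" as True for the same pair is inconsistent."""
    (a_i, b_i), y_i = ex_i
    (a_j, b_j), y_j = ex_j
    return a_i == b_j and b_i == a_j and y_i and y_j

def icm_score(dataset, label_logprob, alpha=50.0):
    """U(D) = alpha * mutual predictability - number of inconsistent label pairs.

    `dataset` is a list of (x, y) pairs; `label_logprob(x, y, context)` is assumed
    to return log P_theta(y | x, context) from the pretrained model, with the
    other labeled examples serialized into the context window.
    """
    # Mutual predictability: each label is predicted from all other labeled examples.
    mutual_pred = sum(
        label_logprob(x, y, context=[ex for ex in dataset if ex != (x, y)])
        for (x, y) in dataset
    )
    # Logical consistency: count pairwise violations (here, asymmetry only).
    n_inconsistent = sum(
        inconsistent(ei, ej) for ei, ej in combinations(dataset, 2)
    )
    return alpha * mutual_pred - n_inconsistent
```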

ICM Algorithm (Inspired by Simulated Annealing):

Finding the optimal label set is computationally infeasible. ICM uses an iterative search algorithm:

  1. Initialization: Start with an empty labeled set $D$. Randomly select and label $K$ examples (e.g., $K=8$) to initialize $D$. A small $K$ provides sufficient context without excessive initial noise. Then, run ConsistencyFix (Algorithm 2) to resolve initial inconsistencies.
  2. Iterative Labeling: For $n = 1, \dots, N$ iterations (a minimal sketch of this loop appears after the list):
    • a. Update Temperature: $T \leftarrow \max\left(T_{\min}, \frac{T_0}{1 + \beta \log(n)}\right)$.
    • b. Input Selection: Sample an example $x_i$ (either unlabeled or previously labeled). Unlabeled examples with consistency relationships to already-labeled ones are prioritized.
    • c. Label Assignment: Assign the label $\hat{y}_i = \arg\max_{y \in \mathcal{Y}} P_\theta(y \mid x_i, D \setminus \{(x_i, y_i)\})$.
    • d. Temporary Update: $\hat{D} \leftarrow D \cup \{(x_i, \hat{y}_i)\}$.
    • e. Fix Inconsistencies: Run ConsistencyFix on $\hat{D}$. This sub-algorithm samples inconsistent pairs, enumerates consistent label options, and selects the option maximizing $U(D)$.
    • f. Acceptance: Calculate $\Delta = U(\hat{D}) - U(D)$. If $\Delta > 0$, accept the new labels: $D \leftarrow \hat{D}$. Otherwise, accept with probability $\exp(\Delta/T)$: if $\text{random}(0,1) < \exp(\Delta/T)$, then $D \leftarrow \hat{D}$. This allows the search to escape local optima, with selectivity increasing over iterations.
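Below is a minimal sketch of that loop under the same assumptions as before. `icm_score` is a single-argument version of the scoring function above (e.g., with `label_logprob` and `alpha` bound via `functools.partial`), while `propose_label` (argmax label given the current context), `consistency_fix` (a stand-in for ConsistencyFix), and `sample_example` (the prioritized input sampler) are hypothetical helpers; hyperparameter defaults follow the appendix values reported below.

```python
import math
import random

def icm_search(examples, icm_score, propose_label, consistency_fix, sample_example,
               n_iters=1000, K=8, T0=10.0, T_min=0.01, beta=0.99):
    """Simulated-annealing-style search over model-generated labels (sketch)."""
    # Initialization: label a small random seed set, then repair inconsistencies.
    D = [(x, propose_label(x, context=[])) for x in random.sample(examples, K)]
    D = consistency_fix(D, icm_score)

    for n in range(1, n_iters + 1):
        # Logarithmic cooling schedule: high temperature early, near-greedy later.
        T = max(T_min, T0 / (1 + beta * math.log(n)))

        # Pick an example; unlabeled ones linked by consistency constraints come first.
        x = sample_example(examples, D)

        # Relabel it conditioned on all other current labels, then repair inconsistencies.
        context = [(xi, yi) for (xi, yi) in D if xi is not x]
        candidate = consistency_fix(context + [(x, propose_label(x, context))], icm_score)

        # Metropolis-style acceptance: always keep improvements, sometimes keep worse moves.
        delta = icm_score(candidate) - icm_score(D)
        if delta > 0 or random.random() < math.exp(delta / T):
            D = candidate
    return D
```

The design choice mirrored here is that worse moves are still accepted with probability $\exp(\Delta/T)$, so early iterations explore freely while the cooling schedule makes the search increasingly greedy.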

Experiment Setup

  • Datasets:
    • TruthfulQA: Classify answer choices as correct/incorrect.
    • GSM8K-verification: Classify LM-generated math solutions as correct/incorrect (golden labels verify final answers and reasoning steps using Claude 3.5 Sonnet).
    • Alpaca: Classify which of two assistant responses is more helpful/harmless.
  • Baselines:
    • Zero-shot (pretrained models with an optimized prompt).
    • Zero-shot (Chat) (commercially post-trained chat models).
    • Golden Label (many-shot prompting or fine-tuning with ground-truth labels).
    • Human Label (many-shot prompting or fine-tuning with crowdsourced human labels).
  • Models: Llama 3.1 8B, Llama 3.1 70B (open-weight pretrained), Claude 3 Haiku, Claude 3.5 Haiku (proprietary pretrained).

Key Experimental Findings

  1. Performance on Common NLP Tasks:
    • ICM matches the performance of "Golden Label" supervision on TruthfulQA and GSM8K-verification.
    • ICM outperforms "Human Label" (crowdsourced) supervision on Alpaca.
    • ICM outperforms "Zero-shot (Chat)" (heavily post-trained commercial models) on these benchmarks.
    • ICM's effectiveness scales with pretrained model capabilities (shown on TruthfulQA with Llama 8B vs. 70B).
  2. Limitations - Concept Salience: ICM fails when the target concept is not "salient" to the pretrained model. In a custom "poem ranking" task where preference was tied to the obscure criterion of mentioning the word "sun" (unknown to the LM), ICM performed no better than random.
  3. Eliciting Superhuman Capabilities: On an author gender prediction task (Blog Authorship Corpus), ICM (using Claude 3.5 Haiku) matched golden supervision accuracy (80%), significantly outperforming human annotator accuracy (60%). The logical consistency used was asymmetry: if A is more likely male than B, B cannot be more likely male than A.
  4. Training an Assistant Chatbot without Supervision:
    • Reward Model (RM) Training:
      • Used Claude 3 Haiku with ICM to label a 6K-example subset of a production preference dataset (~400K examples used for Claude 3.5 Haiku training).
      • Trained an initial RM on these 6K ICM labels, then used it to label the rest of the data (see the bootstrapping sketch after this list).
      • Trained a final unsupervised RM.
      • This unsupervised RM outperformed a human-supervised RM (trained on production-grade human labels) on RewardBench (75.0% vs. 72.2% accuracy).
    • Reinforcement Learning (RL) with Unsupervised RM:
      • Fine-tuned Claude 3.5 Haiku pretrained model using the unsupervised RM.
      • This RL policy won 60% of head-to-head comparisons against a policy trained with the human-supervised RM (evaluated by Claude 3.5 Sonnet’s production RM).
      • Both lagged behind the publicly released Claude 3.5 Haiku, which underwent more extensive training.
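As a rough illustration of that bootstrapping pipeline (not the paper's actual training code), the following sketch assumes hypothetical helpers `icm_label`, `train_reward_model`, and `rm_label` for ICM labeling, preference-model training, and pseudo-labeling respectively:

```python
def bootstrap_unsupervised_rm(preference_data, icm_label, train_reward_model, rm_label,
                              seed_size=6_000):
    """Two-stage unsupervised reward-model training (sketch).

    1. Label a small seed subset with ICM (no human labels).
    2. Train an initial RM on the seed labels.
    3. Use that RM to pseudo-label the remaining preference pairs.
    4. Train the final unsupervised RM on the full pseudo-labeled set.
    """
    seed, rest = preference_data[:seed_size], preference_data[seed_size:]

    seed_labels = icm_label(seed)                 # ICM-generated preference labels
    initial_rm = train_reward_model(seed, seed_labels)

    rest_labels = rm_label(initial_rm, rest)      # initial RM pseudo-labels the remaining pairs
    final_rm = train_reward_model(seed + rest, seed_labels + rest_labels)
    return final_rm
```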

Ablation Studies

  • ICM vs. Randomly Perturbed Labels: ICM-generated labels consistently outperform randomly perturbed labels of the same accuracy, suggesting ICM labels are more aligned with the model's internal understanding.
  • Robustness to Initialization: ICM is robust to the choice of initialization.
    • Golden initialization (semi-supervised) performs best.
    • Random initialization (default) achieves comparable average accuracy but slightly higher variance.
    • Worst-case initialization (all $K$ initial labels wrong) leads to only a moderate performance drop, thanks to ICM's iterative correction.
  • Impact of Logical Consistency:
    • On TruthfulQA, removing logical consistency only moderately worsened results, as the degenerate "all-same-label" solution was rare.
    • On Alpaca, logical consistency was crucial to prevent the degenerate solution. The paper concludes mutual predictability is the primary driver of success, with logical consistency mainly preventing collapse into simple degenerate solutions.

Discussion and Limitations

  • Role of Logical Consistency: Mutual predictability is considered the most important term. Logical consistency helps avoid degenerate solutions rather than being the primary source of performance.
  • Unsupervised Elicitation as an Alignment Method: ICM can augment methods like Constitutional AI by generating labels for whether responses adhere to a constitution, which can then be used to train an unsupervised RM for RL-based alignment. Human validation (e.g., via scalable oversight) is still needed.
  • Limitations:

    1. Concept Salience: ICM cannot elicit concepts that are not already salient in the pretrained model.
    2. Long Inputs: The mutual predictability calculation requires fitting many examples into the context window, which makes ICM difficult to apply to very long inputs.

Conclusion

The paper demonstrates that unsupervised elicitation, particularly through ICM, can match or even surpass human supervision in fine-tuning LMs for various tasks, including complex ones like helpfulness and harmlessness, and even for tasks where LMs have superhuman abilities. This approach holds promise for training advanced LMs on tasks difficult for humans to evaluate, reducing reliance on extensive human labeling.

Appendix Details

  • Hyperparameters: $T_0 = 10$, $T_\text{min} = 0.01$, $\beta = 0.99$. $\alpha$ starts at 50 and is adjusted to 20 or 30 based on search speed.
  • Data Statistics: Train/test splits for TruthfulQA (2560/1000), GSM8K-verification (2560/2971), Alpaca (2048/933).
  • Compute Costs: ICM requires 2–3.9 forward passes per datapoint on average when labeling $n=128$ datapoints.
  • Human Annotation (Gender Prediction): 5 annotators from Upwork labeled pairs of blog posts.