STARS: Segment-Level Token Alignment
- The paper introduces STARS, a decoding-time technique that aligns LLM outputs using segment-level reward-guided rejection sampling instead of full-sequence selection.
- It iteratively proposes, scores, and accepts short token blocks, allowing early corrections and efficient steering toward high-reward completions.
- Experimental results demonstrate that STARS achieves competitive alignment quality and efficiency compared to traditional fine-tuning and best-of-N methods.
Segment-level Token Alignment with Rejection Sampling (STARS) is a decoding-time technique for aligning LLM outputs to desirable properties, such as helpfulness, harmlessness, or positive sentiment, using segment-level reward-guided rejection sampling. Unlike conventional fine-tuning or full-sequence selection methods, STARS operates by iteratively proposing, scoring, and accepting or rejecting short token segments during generation, enabling efficient correction of generation trajectories and providing a practical inference-time alternative for robust alignment (Quamar et al., 5 Nov 2025).
1. Problem Formulation and Motivation
STARS is motivated by the challenge of aligning pretrained autoregressive LLMs, whose conditional next-token distributions can be expressed as $\pi_{\mathrm{lm}}(y_t \mid x, y_{<t})$ for prompt $x$ and response $y$. Alignment leverages a scalar reward function $r(x, y)$, typically trained from human preference data or classifiers, to define a “Gibbs-tilted” target distribution:

$$\pi^*(y \mid x) \propto \pi_{\mathrm{lm}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big),$$

which exponentially favors high-reward completions, with the temperature $\beta > 0$ controlling the strength of the tilt.
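When the candidate set is small enough to enumerate, the tilted distribution can be computed exactly; the following is a minimal sketch with hypothetical base probabilities and rewards:

```python
import math

def gibbs_tilt(base_probs, rewards, beta):
    """Reweight base-model sequence probabilities by exp(reward / beta)
    and renormalize, yielding the Gibbs-tilted target distribution."""
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Three hypothetical completions: equal base probability, increasing reward.
tilted = gibbs_tilt([1/3, 1/3, 1/3], [0.0, 1.0, 2.0], beta=1.0)
# After tilting, the highest-reward completion carries the most mass.
```

Lowering $\beta$ sharpens the tilt toward the highest-reward completion; raising it recovers the base distribution.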
Exact sampling from $\pi^*$ is intractable due to the exponential size of the sequence space. Traditional approaches, such as supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO), require resource-intensive gradient-based updates and produce static models with risks like mode collapse. Alternatively, best-of-$N$ (BoN) sampling scores $N$ independently generated continuations and selects the best, but is computationally inefficient because most sampling effort is wasted on low-reward outputs.
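For contrast, the BoN baseline can be sketched as follows, with hypothetical stand-ins for the sampler and reward function:

```python
import random

def best_of_n(sample_fn, reward_fn, prompt, n):
    """Draw n full completions independently and keep the highest-reward
    one; the other n - 1 completions are discarded entirely."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_fn(prompt, y))

# Toy stand-ins: completions are random floats, reward prefers large values.
random.seed(0)
best = best_of_n(lambda x: random.random(), lambda x, y: y, "prompt", n=16)
```

The inefficiency is visible in the structure: every candidate is generated to full length before any scoring occurs, so low-reward trajectories consume the same compute as the winner.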
STARS circumvents these inefficiencies by segmentally steering generation. Instead of scoring and selecting entire outputs or updating model weights, STARS samples blocks of fixed-length tokens, performs reward-based rejection sampling in each block, and thus guides the trajectory toward high-reward completions at intermediate stages.
2. The STARS Algorithm
STARS produces a $K$-segment response, each segment of length $B$ tokens, for a total of $B \cdot K$ new tokens. The algorithm consists of an outer loop over segments and an inner loop of proposal sampling and acceptance tests for each segment.
- Proposal: At segment $k$, STARS repeatedly samples a candidate block $s$ of $B$ tokens from $\pi_{\mathrm{lm}}(\cdot \mid x, y)$, conditioned on the prompt $x$ and the accepted prefix $y$.
- Scoring: The reward model computes the reward difference $\Delta r = r(x, y \oplus s) - \tau_r(k)$, with $\tau_r(k)$ a stage-specific reward threshold.
- Acceptance: The segment is accepted with probability $\alpha_k = \min\{1, \exp(\Delta r / \beta)\}$. Upon acceptance, the current block is appended to the output prefix and generation proceeds to the next segment.
- Threshold schedule: $\tau_r(k)$ follows a linear ramp from an initial value near the prompt's reward to a target set at a high percentile of held-out rewards.
- Sampling efficiency: To bound computational expense, the number of proposals per segment is capped (e.g., 20 attempts).
Blocks have a fixed length $B$ in experiments, with $K$ blocks yielding $B \cdot K$ tokens. Earlier segments accept more freely; selection tightens in later segments.
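The threshold schedule can be sketched as a simple linear interpolation (the endpoint values here are hypothetical):

```python
def tau_schedule(k, K, tau0, tauK):
    """Linearly interpolate the acceptance threshold from tau0 (segment 1)
    to tauK (segment K), so early segments accept more freely and
    selection tightens toward the end of generation."""
    if K == 1:
        return tauK
    return tau0 + (tauK - tau0) * (k - 1) / (K - 1)

# Hypothetical endpoints: start near a low prompt reward, end near a
# high percentile of held-out rewards.
thresholds = [tau_schedule(k, K=8, tau0=-0.5, tauK=0.9) for k in range(1, 9)]
```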
The STARS pseudocode:

```
Algorithm STARS(x, π_lm, r, B, K, τ_r(·), β):
    y = []
    for k in 1…K:
        accepted = False
        while not accepted:
            s = sample B tokens from π_lm(· | x, y)
            Δr = r(x, y ⊕ s) − τ_r(k)
            α_k = min{1, exp(Δr / β)}
            e = Uniform(0, 1)
            if e ≤ α_k:
                y = y ⊕ s
                accepted = True
    return y
```
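The loop above can be made runnable with hypothetical stand-ins (`sample_block`, `reward`) in place of a real LLM and reward model; a real deployment would also include the proposal cap described earlier:

```python
import math
import random

def stars(sample_block, reward, x, B, K, tau, beta, max_attempts=20):
    """Segment-level rejection sampling: propose a B-token block, accept
    it with probability min(1, exp((reward - threshold) / beta)), retry
    on rejection, and cap attempts per segment to bound cost."""
    y = []
    for k in range(1, K + 1):
        for _ in range(max_attempts):
            s = sample_block(x, y, B)
            delta = reward(x, y + s) - tau(k)
            if random.random() <= min(1.0, math.exp(delta / beta)):
                break  # accept this block
        y = y + s  # append the accepted (or final capped) proposal
    return y

# Toy stand-ins (hypothetical): "tokens" are random floats in [0, 1],
# and the reward is the running mean of the sequence so far.
random.seed(1)
out = stars(
    sample_block=lambda x, y, B: [random.random() for _ in range(B)],
    reward=lambda x, y: sum(y) / len(y),
    x="prompt",
    B=4,
    K=3,
    tau=lambda k: 0.4 + 0.05 * k,
    beta=0.1,
)
```

Keeping the last proposal when the cap is exhausted is one reasonable fallback; the paper's exact cap-handling behavior is not specified in this summary.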
3. Theoretical Justification
STARS generalizes classical rejection sampling to blockwise LLM decoding, targeting Gibbs-tilted sequence distributions. At each stage $k$, the acceptance probability for a candidate block $s$ is

$$\alpha_k(s) = \min\left\{1,\ \exp\!\big((r(x, y \oplus s) - \tau_r(k))/\beta\big)\right\},$$

where normalization is ensured via a dynamic constant, with the exponentiated threshold $\exp(\tau_r(k)/\beta)$ playing the role of a stage-specific normalizer. A composition lemma shows that, in the limit of sufficient proposals and precise thresholds, chaining together the segment-level rejection samplers produces unbiased samples from the full-sequence target $\pi^*(y \mid x)$.
Capping proposal attempts per block introduces a bias that shrinks as the cap grows, but with a small segment size $B$ and a moderate number of segments $K$ (typically $2$–$5$), the bias is negligible for typical proposal caps (e.g., $20$) (Quamar et al., 5 Nov 2025).
Because reward spreads within a block are smaller than across full sequences, rejection rates and computational overhead remain tractable, achieving considerably greater sample efficiency than BoN sampling, especially since low-reward trajectories that BoN would complete in full are eliminated early by STARS.
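A back-of-the-envelope comparison (all numbers hypothetical) makes the efficiency argument concrete: if a segment is accepted with probability $p$, the number of proposals per segment is geometric with mean $1/p$, so STARS samples about $K \cdot B / p$ tokens in expectation, versus $N \cdot K \cdot B$ for BoN:

```python
def stars_expected_tokens(K, B, p_accept):
    """Expected tokens sampled by STARS: K segments of B tokens each,
    with geometrically distributed retries (mean 1 / p_accept)."""
    return K * B / p_accept

def bon_tokens(N, K, B):
    """Tokens sampled by best-of-N: N full completions of K*B tokens."""
    return N * K * B

# Hypothetical setting: 8 segments of 32 tokens, 50% acceptance, N = 16.
stars_cost = stars_expected_tokens(K=8, B=32, p_accept=0.5)
bon_cost = bon_tokens(N=16, K=8, B=32)
```

Under these illustrative numbers STARS samples 512 expected tokens against BoN's 4096, an 8x reduction; the actual ratio depends on acceptance rates and the proposal cap.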
4. Experimental Methodology
Models: STARS was evaluated across six LLMs: Llama-3.1-8B (base), Mistral-7B (base), Qwen-2.5-7B (base), DeepSeek-LLM-7B (base), Llama-3.1-8B-SFT (SFT-tuned), Llama-3.1-8B-DPO (DPO-tuned).
Datasets and Tasks:
- Helpfulness & Harmlessness: HH-RLHF prompts, HarmfulQA adversarial questions.
- Positive Sentiment: IMDB reviews.
- Experiments used 300 held-out prompts per task.
Reward Models (PRMs):
- Harmlessness: DeBERTa-v3-large-v2 (0.3B parameters) trained on HH-RLHF/Toxicity.
- Sentiment: DistilBERT-base (0.07B) for positivity.
Baselines:
- Vanilla decoding: nucleus sampling (temperature=0.9, top_p=0.9, top_k=40, rep_penalty=1.1).
- BoN: sample $N$ full-sequence completions and select the highest-reward one.
- SFT and DPO: conventional fine-tuning baselines.
STARS Hyperparameters:
- Segment length $B$ fixed across tasks; $K$ segments per response ($B \cdot K$ total tokens).
- Up to $20$ proposals per segment.
- Threshold schedule: linear ramp $\tau_r(k)$, with the target set at a high percentile of validation rewards.
- Same decoding temperature, top-p, etc., as vanilla.
Metrics:
- Win rate: auto-judged by GPT-4.1 in pairwise preference against the baseline.
- Mean reward, diversity (token-level self-BLEU), perplexity, and coherence (Gemma-2-9B) (Quamar et al., 5 Nov 2025).
5. Empirical Results and Analysis
STARS outperforms or closely matches strong BoN and fine-tuned baselines across multiple settings:
Alignment Quality (Win-Rates):
| Model | Helpfulness/Harmlessness | HarmfulQA | Positive-Sentiment (IMDB) |
|---|---|---|---|
| Llama-8B | 73.3% (STARS) vs 74.6% (BoN) | 71.5% vs 74.7% | 71.6% vs 70.1% |
| Mistral-7B | 72.5% vs 70.1% | 75.9% vs 74.9% | 68.3% vs 70.3% |
| Qwen-2.5-7B | 71.2% vs 73.3% | – | 67.2% vs 66.2% |
| DeepSeek-7B | 66.0% vs 66.3% | – | 70.6% vs 72.3% |
| Llama-8B-SFT | 66.7% vs 65.9% | – | 59.7% vs 56.7% |
| Llama-8B-DPO | 62.5% vs 63.6% | – | 54.9% vs 54.2% |
Gaps vs Fine-Tuned Baselines:
- IMDB/SFT: STARS outperforms the SFT baseline on Qwen, Llama-8B, and Mistral.
- HH-RLHF/DPO: STARS outperforms the DPO baseline on Mistral and Llama.
Sample Efficiency and Robustness:
- On DeepSeek-7B, STARS's win-rate against Llama-405B compares favorably with both BoN and vanilla decoding.
- On red-teaming adversarial prompts, STARS yields marginal improvements over BoN (e.g., on Llama-8B).
Other Metrics (DeepSeek-7B):
| Metric | STARS (Harml.) | BoN (Harml.) | Vanilla (Harml.) | STARS (Sent.) | BoN (Sent.) | Vanilla (Sent.) |
|---|---|---|---|---|---|---|
| Mean reward | 0.19 | 0.18 | 0.15 | 0.79 | 0.49 | 0.45 |
| Diversity | 0.91 | 0.78 | 0.77 | 0.81 | 0.80 | 0.76 |
| Perplexity (↓) | 0.37 | 32.88 | 35.68 | 15.59 | 14.58 | 17.68 |
| Coherence | 0.52 | 0.36 | 0.32 | 0.40 | 0.41 | 0.37 |
Ablation Studies:
- The chosen segment length $B$ provides a practical balance between granularity and computational cost.
- Decoding overhead vs vanilla sampling is typically $1.5\times$ and up, compared to BoN's $10\times$ and up.
- Linear ramp threshold schedules outperform static or suboptimally chosen thresholds.
6. Discussion and Future Directions
Advantages:
- Training-free: STARS requires no gradient updates, auxiliary weights, or retraining.
- Fine-grained control: Segment-level granularity allows for early and frequent realignment.
- Computationally efficient: Achieves similar or superior alignment to BoN or fine-tuning at dramatically lower computational cost.
- Model-agnostic: Can be combined with any autoregressive LLM and scalar reward function.
Limitations:
- Dependent on the calibration and reliability of segment-level reward models (PRMs); errors or insensitivity on partial text can limit alignment.
- Sampling bias is introduced by proposal caps and approximate thresholds.
- Since model parameters are not updated, catastrophic errors or irreversible model behaviors may persist if not captured by the reward.
Potential Research Directions:
- Adaptive segment sizing to focus alignment effort on more difficult stretches of text.
- Guided proposals using importance sampling to further reduce expected number of attempts per block.
- Improved PRMs, including partial completion-based training for greater segment-level reward fidelity.
- Formal analysis of bias/variance due to capped proposals.
- Hybrid decoding that integrates STARS rejection with beam search or Metropolis–Hastings schemes.
STARS demonstrates that reward-guided rejection at the segment level constitutes a robust, generalizable alternative to both static fine-tuning and full-sequence ranking, efficiently aligning LLM outputs with desirable properties and enabling scalable, inference-time steering of pretrained models (Quamar et al., 5 Nov 2025).