SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization (2511.06411v1)
Abstract: The soft-thinking paradigm for LLM reasoning can outperform the conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexities of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects Gumbel noise into the logits, employs the Gumbel-Softmax technique to avoid soft-thinking tokens outside the pre-trained embedding space, and leverages the reparameterization trick in the policy gradient. We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% on average accuracy), while exhibiting a substantial uplift on Pass@32 (+2.19% on average accuracy). Code and weights are available at https://github.com/zz1358m/SofT-GRPO-master
Explain it Like I'm 14
Explaining “SofT-GRPO: Reinforcing LLM Soft-Thinking with Gumbel Noise”
1. What is this paper about?
This paper is about teaching LLMs to “think” better. Normally, LLMs reason by choosing one word at a time in a step‑by‑step chain (called chain‑of‑thought or CoT). The authors explore a different way called “soft‑thinking,” where instead of picking a single word, the model mixes several possible words together into one blended “thought.” They then show how to train this soft‑thinking style using a special reinforcement learning method they call SofT‑GRPO, so the model gets better at solving tough problems, especially math.
2. What questions are the researchers trying to answer?
They focus on three simple questions:
- Can soft‑thinking (mixing word options) help LLMs reason better than standard one‑word‑at‑a‑time thinking?
- Why is it hard to improve soft‑thinking with reinforcement learning (RL), and how can we fix that?
- If we create a new RL method for soft‑thinking, will it beat strong baselines on real tasks like math, science, and code?
3. How did they do it? (Methods in everyday language)
First, some key ideas in plain words:
- Chain‑of‑Thought (CoT): The model writes its reasoning step by step, choosing one token (word piece) at each step.
- Soft‑Thinking: Instead of choosing just one token, the model blends several likely tokens together, like making a smoothie of words instead of picking a single fruit.
- Embeddings: Every token has a “coordinate” in a map inside the model. Blending tokens blends these coordinates.
- Logits: The model’s raw scores for each token before turning them into probabilities.
- Reinforcement Learning (RL): A training style where the model tries solutions, gets a reward (like “correct” or “wrong”), and then learns to do better next time.
- GRPO: A popular RL method that compares groups of attempts and pushes the model toward higher‑reward ones.
- Pass@1 vs Pass@32: Pass@1 is the accuracy if the model gets only one try. Pass@32 measures how often the correct answer appears somewhere across 32 tries.
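As a quick reference, here is a minimal sketch of how such metrics are typically estimated from repeated runs. The unbiased combinatorial estimator below is the standard one from the code-generation evaluation literature; it is an assumption rather than a description of the paper's exact evaluation script.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n sampled answers, of which c are correct.

    Pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least one of
    k samples drawn from the n attempts is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 32 attempts with 5 correct answers.
print(pass_at_k(32, 5, 1))   # 0.15625 -> Pass@1, the single-try accuracy
print(pass_at_k(32, 5, 32))  # 1.0     -> Pass@32, at least one of 32 tries is correct
```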
What’s hard about soft‑thinking + RL?
- Soft‑thinking is very smooth and deterministic (less random), so it’s harder to “explore” different reasoning paths.
- Past attempts added random “Gaussian” noise directly to the blended input, which made it unclear how to update the model’s probabilities and sometimes didn’t match what the model expects.
The authors’ solution: SofT‑GRPO
- Inject “Gumbel noise” into logits: Think of Gumbel noise as a smart way to add randomness to the model’s raw scores so the model explores different reasoning paths more naturally.
- Use Gumbel‑Softmax: This turns the noisy scores into a smooth mixture of tokens that stays inside the model’s learned space, avoiding weird, invalid inputs.
- Reparameterization trick: A training technique that makes the randomness “trackable,” so the model can learn which probability changes led to better rewards. It’s like keeping a breadcrumb trail from the reward back to the choices the model made.
Analogy:
- Old RL for soft‑thinking added noise to the smoothie after blending (the input vector), which made it hard to know which ingredients mattered.
- SofT‑GRPO adds noise before blending (to the scores of each ingredient), so you can tell which fruits (tokens) helped make a better smoothie—and then tweak their amounts.
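To make the three ingredients above concrete, here is a minimal PyTorch-style sketch of one soft-thinking step under Gumbel-Softmax. The function and variable names (soft_thinking_token, embedding_table, tau_g) are illustrative, not taken from the paper's released code, and the top-p/top-k masking the paper applies before mixing is omitted.

```python
import torch
import torch.nn.functional as F

def soft_thinking_token(logits: torch.Tensor,
                        embedding_table: torch.Tensor,
                        tau_g: float = 0.1) -> torch.Tensor:
    """One soft-thinking step: turn next-token logits into a continuous input embedding.

    logits:          (vocab_size,) raw next-token scores (top-p/top-k masking omitted here)
    embedding_table: (vocab_size, d) pre-trained token embedding matrix
    tau_g:           Gumbel-Softmax temperature Tg (lower -> closer to a one-hot pick)
    """
    # Gumbel(0, 1) noise via inverse transform sampling: g = -log(-log(u)), u ~ Uniform(0, 1).
    u = torch.rand_like(logits).clamp_min(1e-10)
    g = -torch.log(-torch.log(u))

    # Gumbel-Softmax: perturb the scores with the noise, then soften at temperature tau_g.
    # (F.gumbel_softmax(logits, tau=tau_g) is PyTorch's built-in equivalent of these two steps.)
    y = F.softmax((logits + g) / tau_g, dim=-1)  # (vocab_size,) mixture weights

    # The soft token is a weighted sum of real token embeddings, so it stays within
    # the span of the pre-trained embedding space instead of drifting off-manifold.
    return y @ embedding_table  # (d,) continuous "thought" vector fed to the next step
```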
4. What did they find, and why does it matter?
Across three model sizes (about 1.5B, 3B, and 7B parameters) and several math benchmarks (like AIME, AMC, MATH‑500, GSM8K), SofT‑GRPO:
- Slightly improves one‑shot accuracy (Pass@1) compared to strong discrete‑token GRPO baselines.
- Clearly improves multi‑try accuracy: on average, +2.19% for Pass@32 and +1.80% for Pass@16. This means when the model can try multiple reasoning paths, soft‑thinking trained with SofT‑GRPO finds correct answers more often.
- Beats earlier soft‑thinking RL methods that used Gaussian noise.
- Shows gains on out‑of‑domain tasks too (a science Q&A and two coding benchmarks), meaning the improvement generalizes beyond math.
- Helps with “token efficiency” in some cases (fewer thinking steps needed), and works well with majority voting (picking the most common answer across many tries).
Why is this important?
- It proves that soft‑thinking can be effectively trained using RL, not just used “as is.”
- It helps models explore diverse reasoning paths while staying stable, which is great for hard problems where a single try might fail.
5. What’s the impact? (Implications)
If LLMs can reliably learn soft‑thinking with RL:
- They may become better at representing abstract ideas and complex steps, since they can blend multiple possibilities at once.
- They could solve tough math, science, and coding tasks more robustly, especially when allowed several attempts and then combining answers (majority voting).
- Training might rely less on expensive, hand‑written step‑by‑step labels, because RL can use simple rewards like “correct/incorrect.”
- This approach could extend to other areas, like models that understand both images and text, helping them reason more flexibly.
Overall, SofT‑GRPO shows that mixing tokens thoughtfully—and training that process with the right kind of randomness—can make LLMs better thinkers.
Knowledge Gaps
Unresolved Knowledge Gaps, Limitations, and Open Questions
Below is a focused list of specific gaps the paper leaves open and that future researchers could address:
- Theoretical rigor of the reparameterized soft-thinking likelihood: provide formal proofs that the log-probability formulations in Eq. (11–12) yield unbiased/low-variance gradient estimates under practical constraints (finite temperature, top-k/top-p truncation, and masked vocabularies), and characterize estimator variance relative to discrete GRPO.
- Softmax vs. argmax mismatch: Theorem 3.1 guarantees correct categorical sampling for argmax (Gumbel-max), but SofT-GRPO uses Gumbel-Softmax mixtures; quantify and bound the distributional bias introduced by using continuous mixtures (y_i) as inputs rather than discrete samples (see the toy sketch after this list).
- In-manifold validity of soft inputs: empirically and theoretically assess whether convex combinations of token embeddings actually remain within the model’s training manifold (beyond the linear span), and develop principled constraints (e.g., sparsity, projection onto a learned manifold, or mixture-of-k tokens) to prevent out-of-distribution soft inputs.
- Sensitivity and stability of training: the method collapses under small hyperparameter changes (e.g., top-p=1.0 or Tg=0.25); design adaptive schedules (KL coefficients, clipping ranges, temperature annealing) and provide robustness analyses across broader hyperparameter ranges and datasets.
- Scope of ablations: extend ablations beyond noise type to include group size G, advantage normalization, KL penalty strength/schedule, clip parameter ε, temperature T, and masking strategies (top-k/top-p), with clear causal attribution of performance changes.
- Credit assignment granularity: analyze how rewards propagate across soft-thinking steps versus answer-generation steps (the distinct log-prob terms in Eq. (8)), and test per-step reward shaping or subgoal/verifier signals to improve credit assignment to soft tokens.
- Comparison breadth: benchmark SofT-GRPO against a wider set of RLVR baselines (Dr-GRPO, DAPO, Lite-PPO, entropy-controlled methods) under matched settings to isolate gains attributable to the Gumbel reparameterization.
- Statistical significance and reliability: report confidence intervals or significance tests for Pass@1/@16/@32 improvements (the average Pass@1 gain is small at +0.13%), and quantify run-to-run variance.
- Fairness of experimental settings: reconcile differences in sampling hyperparameters across training and evaluation (e.g., top-k=5 in training vs. top-k=30 in some baselines) to avoid confounding improvements.
- Scale and generalization: evaluate on larger frontier models (e.g., ≥70B) and diverse task families beyond math/code/GPQA (reasoning with commonsense, multilingual, long-context, program synthesis with constraints) to test scalability and domain transfer.
- Sample efficiency and compute cost: quantify tokens processed, gradient steps, and wall-clock per gain (only a single 45-hour figure is provided), and compare the efficiency of SofT-GRPO vs. discrete GRPO per unit compute.
- Interaction with verifiers and self-consistency: integrate formal verifiers or self-consistency during RL (not just majority voting at inference) and measure whether verifier-guided shaping improves Pass@1 without relying on high sampling budgets.
- Handling of masked/zero-probability tokens: specify and evaluate numerical stability strategies for log-probabilities under truncation (top-k/top-p) and near-zero probabilities to avoid exploding gradients or biased updates.
- Calibration and uncertainty: assess whether Gumbel perturbations improve or harm probability calibration, and explore entropy/temperature control tailored to soft-thinking policies (not just discrete policies).
- Interpretability of soft-thinking trajectories: develop tools to visualize and analyze continuous soft tokens for semantic alignment and debugging, and test whether soft trajectories correspond to meaningful latent concepts/subgoals.
- Safety and alignment: evaluate whether continuous inputs increase vulnerability to adversarial prompts or degrade safety filters, and design mitigation strategies for soft-thinking RL.
- Token efficiency trade-offs: quantify end-to-end token savings across all models and tasks, and explore explicit multi-objective optimization to balance accuracy vs. think-length (beyond anecdotal reductions in LLaMA-3.2-3B).
- Majority voting vs. principled aggregation: compare majority voting to more sample-efficient aggregation (e.g., reranking/verifier scoring, diversity-promoting sampling) and assess training-time strategies that reduce inference-time sample budgets.
- Mixed-precision and numeric stability: investigate training-inference mismatches (e.g., FP16 vs. FP32) for Gumbel-based soft policies and apply recent remedies to ensure stable gradients and reproducibility.
- Reproducibility and implementation complexity: reduce dependency on custom rollout data (g_i, y_i) transmission and specialized frameworks, and provide standardized APIs or reference implementations to ease adoption and replication.
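As a toy illustration of the argmax-vs-softmax gap raised in the second bullet above, the sketch below contrasts a hard Gumbel-max sample with the continuous Gumbel-Softmax mixture that soft-thinking actually feeds back as input. The numbers and variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])         # toy next-token scores

u = torch.rand_like(logits).clamp_min(1e-10)
g = -torch.log(-torch.log(u))                        # Gumbel(0, 1) noise

# Gumbel-max (Theorem 3.1): the argmax of the perturbed scores is an exact categorical sample.
hard_token = torch.argmax(logits + g)

# Gumbel-Softmax: the same perturbed scores, softened at temperature Tg, give a
# continuous mixture that is used as the model input instead of the discrete token.
tau_g = 0.1
soft_mixture = F.softmax((logits + g) / tau_g, dim=-1)

print(hard_token.item(), soft_mixture.tolist())
# As Tg -> 0 the mixture approaches the one-hot vector at hard_token; at any finite Tg
# the soft input differs from the exact discrete sample, which is the bias in question.
```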
Practical Applications
Immediate Applications
Below are practical uses that can be deployed now, leveraging the paper’s findings on SofT-GRPO (a Gumbel-reparameterized policy optimization algorithm for soft-thinking in LLMs), its sampling/optimization methods, and its observed performance gains, particularly at higher sampling budgets (Pass@16/32). Each bullet lists the sector(s), potential tools/workflows/products, and critical dependencies/assumptions.
- Higher-accuracy math tutoring and assessment systems
- Sectors: Education, EdTech
- What to deploy: Fine-tune existing 1.5B–7B open models with SofT-GRPO for math solvers and graders that use verifiable rewards (e.g., Math-Verify) to train/evaluate. Add multi-sample inference with majority voting for difficult problems.
- Tools/workflows: SofT-GRPO training loop (SGLang rollouts + verL/HybridFlow policy optimizer), domain verifiers (Math-Verify), inference with top-p=0.95 and Gumbel-Softmax (Tg≈0.1), majority voting aggregator.
- Assumptions/dependencies: Requires verifiable answers; costs rise with Pass@16/32; careful KL control and hyperparameters (top-p, Tg) to avoid training collapse.
- Code assistance with unit-test-driven self-verification
- Sectors: Software engineering, DevTools, CI/CD
- What to deploy: Train code LLMs with SofT-GRPO using unit tests as verifiable rewards. At inference, run N soft-thinking samples, select solutions that pass tests, and majority-vote or early-exit when tests pass (see the sketch after this list).
- Tools/workflows: HumanEval/MBPP-like harnesses, project-specific unit tests as reward signals; CI integration that spins up multi-sample generation + execution sandbox; token-efficiency monitoring (soft-thinking often reduces thinking length).
- Assumptions/dependencies: Safe code execution sandbox; sufficient compute for multi-sampling; reward quality tied to test coverage and flakiness.
- Scientific Q&A assistants with calibrated multi-try accuracy
- Sectors: R&D, Knowledge Work, Scientific Computing
- What to deploy: Fine-tune assistants on verifiable subsets (e.g., MCQ with known keys, structured short-answer with checkers), then apply majority voting at inference to improve correctness on hard queries (as indicated by GPQA Diamond gains).
- Tools/workflows: SofT-GRPO training on verifiable science QA; inference budget controller (Pass@k curves); confidence estimation via agreement rates.
- Assumptions/dependencies: Availability of verifiable prompts/benchmarks; gains are larger with multiple samples; careful prompt design (e.g., standardized answer boxing).
- Cost-saving reasoning via shorter “soft-thinking” chains
- Sectors: Platform/LLM Ops, SaaS using LLMs
- What to deploy: Replace verbose text CoT with soft-thinking embeddings during reasoning to reduce output tokens (observed thinking-length reductions), then emit succinct final answers.
- Tools/workflows: Serve models with an inference stack that supports soft-thinking input reconstruction via Gumbel-Softmax; token accounting dashboards; automatic budget-based early exit when a verifier passes.
- Assumptions/dependencies: Serving stack must support passing mixed/expected embeddings between steps; not all frameworks do this out of the box.
- RLVR pipelines that avoid CoT label collection
- Sectors: Model labs, MLOps, Foundation-model post-training
- What to deploy: Swap SFT on expensive CoT traces for SofT-GRPO using task-specific verifiers and group rollouts to explore diverse reasoning paths while staying in the pre-trained embedding space.
- Tools/workflows: SGLang (parallel rollouts), Gumbel-Softmax sampling, replay/telemetry of y and g values, KL-regularized updates against a reference model, and verifiers as reward oracles.
- Assumptions/dependencies: Suitable verifiers; GPU scale (the paper used up to 8×H200); stable hyperparameters (top-p=0.95, Tg≈0.1).
- Product feature: “Multi-try with majority vote” mode
- Sectors: Consumer apps, Productivity suites, EdTech, Coding copilots
- What to deploy: Expose a “Try up to N solutions” button that runs soft-thinking multi-sampling and shows a confidence badge based on agreement and verifier pass rate. Useful for math, coding, and constrained QA.
- Tools/workflows: Front-end control for N, backend sampling + verifier + aggregator; user-facing uncertainty indicators.
- Assumptions/dependencies: Latency and cost budgets; only applicable where verifiable or aggregate-able answers exist.
- Benchmarking and evaluation upgrades for reasoning systems
- Sectors: Academia, Model evaluation platforms
- What to deploy: Adopt Mean@32/Pass@k reporting with and without soft-thinking and majority voting; integrate Math-Verify and test-based verifiers to ensure apples-to-apples comparisons.
- Tools/workflows: Evaluation harnesses with multi-run sampling; reproducible seeds for Gumbel noise; logging of KL to detect failure modes.
- Assumptions/dependencies: Access to standardized verifiers; compute for repeated trials.
- Training-data curation via verifiers
- Sectors: Data engineering, LLM training
- What to deploy: Use verifiable-reward tasks (math/code/MCQ) to generate high-quality synthetic reasoning traces through SofT-GRPO rollouts; filter by reward and agreement metrics to build stronger curricula.
- Tools/workflows: Auto-generation of reasoning paths; quality gates via unit tests/math checkers; storage of top-k rollouts for replay.
- Assumptions/dependencies: Quality verifiers; management of data contamination and test leakage.
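The multi-try workflow that recurs in several bullets above (code assistance with unit tests, the "multi-try with majority vote" product feature) can be summarized in a short, framework-agnostic sketch. Here generate and verify are placeholder callables standing in for a soft-thinking sampler and a task verifier; they are not APIs from the paper's repository.

```python
from collections import Counter
from typing import Callable, Optional

def multi_try_answer(generate: Callable[[], str],
                     verify: Optional[Callable[[str], bool]] = None,
                     n_samples: int = 32) -> str:
    """Draw up to n_samples answers; return early on a verified one, else majority-vote.

    generate: draws one stochastic answer (e.g., one soft-thinking rollout)
    verify:   optional verifier (unit tests, Math-Verify, an MCQ key); True means "passes"
    """
    answers = []
    for _ in range(n_samples):
        answer = generate()
        if verify is not None and verify(answer):
            return answer                      # early exit: a verified answer was found
        answers.append(answer)
    # No verified answer (or no verifier available): fall back to the most common answer.
    return Counter(answers).most_common(1)[0][0]
```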
Long-Term Applications
These require additional research, scaling, or infrastructure changes but are directly suggested by the paper’s methods and empirical trends.
- Multimodal and robot planning with soft-thinking RLVR
- Sectors: Robotics, Vision-Language, Autonomy
- Vision: Extend SofT-GRPO to VLMs that pass continuous latent “thoughts” between steps (e.g., planning vectors), with simulation-based verifiers (safety constraints, trajectory success).
- Potential products: Task planners that iteratively sample soft-thinking plans until a simulator/verifier passes; real-time majority-vote controllers for critical decisions.
- Dependencies/assumptions: Robust, fast simulators as reward oracles; safe exploration; generalization beyond text-only tasks.
- Enterprise decision copilots with verifiable constraints
- Sectors: Finance, Energy, Operations, Supply Chain
- Vision: Use soft-thinking multi-sample reasoning with domain verifiers (risk/constraint checkers) to propose plans that satisfy regulatory or operational constraints, selecting the first plan that passes checks.
- Potential products: “Constraint-verified” planning assistants; spreadsheet/BI copilots that auto-validate formulas/scenarios; energy dispatch planners respecting grid constraints.
- Dependencies/assumptions: High-quality verifiers (e.g., constraint solvers); integration with enterprise data; transparency and auditability requirements.
- Budget-aware, anytime reasoning stacks
- Sectors: LLM platforms, Cloud inference
- Vision: Dynamic controllers that adjust sampling (k) and soft-thinking temperature on-the-fly based on Pass@k curves, early-exiting when a verifier passes, trading latency for guaranteed correctness.
- Potential products: “Auto-budget” inference APIs; SLAs defined in terms of “probability of correctness within T seconds.”
- Dependencies/assumptions: Fast verifiers; accurate online calibration of Pass@k; robust failure handling.
- Soft-thinking–native model architectures and accelerators
- Sectors: Semiconductors, Systems/ML infra
- Vision: Kernels and serving stacks optimized for Gumbel-Softmax sampling, embedding mixing, and y/g reparameterization bookkeeping; architectures designed to be stable under continuous reasoning inputs.
- Potential products: Inference runtimes that natively support soft-thinking steps; GPU/ASIC features for efficient Gumbel sampling and weighted-embedding operations.
- Dependencies/assumptions: Community standardization of interfaces; demonstrable cost/performance advantage over text-CoT.
- Standardized safety, governance, and transparency for continuous reasoning
- Sectors: Policy, Risk, Compliance
- Vision: Guidelines to govern models that reason in continuous latent spaces, with requirements for logging stochastic seeds, KL drift monitoring, and verifier provenance to support audits and incident analysis.
- Potential outputs: Compliance checklists; provenance reports linking final outputs to sampled paths, verifiers used, and agreement statistics.
- Dependencies/assumptions: Sector-specific regulators accept verifier-backed guarantees; shared metadata schemas.
- Education platforms with personalized reasoning feedback
- Sectors: Education, EdTech research
- Vision: Systems that analyze multiple soft-thinking paths to diagnose misconceptions and generate targeted micro-feedback or hint sequences tailored to a student’s error profile.
- Potential products: Tutors that adaptively allocate attempts and explanations; dashboards for teachers showing aggregate reasoning patterns.
- Dependencies/assumptions: Robust mapping from latent paths to interpretable feedback; privacy-preserving telemetry.
- Neuro-symbolic solvers that combine verifiers and soft-thinking search trees
- Sectors: Theorem proving, Formal methods, Program synthesis
- Vision: Build latent search trees via soft-thinking, prune/expand using symbolic verifiers or type checkers, and optimize exploration via SofT-GRPO.
- Potential products: Assisted theorem provers; spec-driven code synthesis systems with verifiable guarantees.
- Dependencies/assumptions: Tight integration between continuous search and symbolic tooling; efficient reparameterized training signals.
- Domain expansion to healthcare and clinical decision support (with caution)
- Sectors: Healthcare, Bioinformatics
- Vision: Use verifiable sub-tasks (dose calculators, guideline-constrained checks, coding/billing validation) as reward signals inside larger clinical reasoning workflows.
- Potential products: Assistant modules that only act on tasks with strict verifiers, escalating ambiguous cases to humans.
- Dependencies/assumptions: Strict safety gating; high-quality medical verifiers; regulatory clearance; limited to sub-tasks with objective checks.
Cross-cutting Assumptions and Dependencies
- Verifiable rewards are central: SofT-GRPO relies on tasks with clear, automatic correctness checks (unit tests, math verifiers, MCQ keys). Open-ended tasks without verifiers are not a fit.
- Multi-sample inference economics: Largest gains occur at Pass@16/32; organizations must provision for the extra compute/latency or deploy early-exit based on verifiers.
- Serving/runtime support: Soft-thinking requires passing expected embeddings between decoding steps and Gumbel-Softmax resampling. Many production stacks will need modifications.
- Stability hinges on hyperparameters: The paper shows collapses with top-p=1.0 or a higher Gumbel temperature; typical stable settings are top-p≈0.95, model temperature T≈0.6, and Tg≈0.1 (collected in the sketch after this list).
- Scaling and reproducibility: Training used substantial GPU resources (e.g., 8×H200); reproducible seeds for Gumbel noise, KL monitoring, and careful logging are needed for reliable operations.
- Generalization: Results shown for 1.5B–7B models and specific benchmarks (math, code, scientific QA). Further validation is needed for larger models and other domains.
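For convenience, the stable settings quoted above can be collected into a single illustrative configuration. The key names below are hypothetical and do not correspond to the repository's actual config schema.

```python
# Illustrative decoding/training settings mirroring the stable values quoted above.
softt_grpo_sampling = {
    "top_p": 0.95,              # nucleus mass; top_p = 1.0 reportedly collapses training
    "top_k": 5,                 # candidate tokens mixed per soft-thinking step (training)
    "temperature": 0.6,         # model softmax temperature T
    "gumbel_temperature": 0.1,  # Gumbel-Softmax temperature Tg; 0.25 reportedly collapses
}
```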
Glossary
- Advantage function: A scalar reflecting how much better a trajectory’s reward is compared to a baseline in policy optimization. "Â_g represents the advantage function for the g-th CoT"
- Categorical distributions: Discrete probability distributions over token choices used to model next-token outputs. "the probability of sampled trajectories can be easily obtained from the output Categorical distributions."
- Chain-of-Thought (CoT): A reasoning strategy where models generate intermediate tokens that articulate a step-by-step solution path. "Discrete-token CoT reasoning processes seek to solve a |Q|-token question Q = (q_1, ..., q_{|Q|}) by generating |R| reasoning CoT tokens R = (r_1, ..., r_{|R|}) before outputting the answer prediction A = (a_1, ..., a_{|A|})"
- Dirichlet noise: Random perturbation applied to probability vectors using the Dirichlet distribution to induce stochasticity. "adding Dirichlet noise to predicted probabilities can be another choice"
- Dirichlet resampling technique: A method that resamples probability vectors from a Dirichlet distribution to create randomized soft inputs. "also tries to use the Dirichlet resampling technique as follows:"
- Dr. GRPO: A variant of GRPO that modifies group-relative updates; used in RLVR for reasoning. "Dr. GRPO (Liu et al., 2025a)"
- Gaussian noise: Continuous noise sampled from a normal distribution, added to inputs or embeddings to induce randomness. "adds Gaussian noise on the input s_t in Eq. (3) as follows:"
- Gaussian reparameterization trick: Technique to compute gradients through stochastic Gaussian nodes by expressing samples as deterministic functions of noise. "calculate the log probability for the soft-thinking reasoning path S with the Gaussian reparameterization trick as follows"
- Group Relative Policy Optimization (GRPO): An RL algorithm that samples groups of trajectories and updates policies to favor higher-reward members. "Group Relative Policy Optimization (GRPO) (Shao et al., 2024) has emerged as a particularly compelling framework."
- Gumbel distribution: A distribution used to model the maximum of samples; pivotal in the Gumbel-Softmax trick for differentiable sampling. "ε_i is a scalar noise sampled from the Gumbel distribution Gumbel(0, 1)"
- Gumbel-max Trick: A method to sample discrete categories by adding Gumbel noise to log-probabilities and taking an argmax. "Theorem 3.1 (Gumbel-max Trick). Let (p_1, ..., p_n) be nonnegative, and ε_1, ..., ε_n independent samples from Gumbel(0, 1) (Maddison et al., 2016),"
- Gumbel noise: Noise drawn from the Gumbel distribution, injected into logits to enable randomized yet stable sampling. "samples groups of soft-thinking reasoning paths by injecting sampled Gumbel noises into the output probabilities"
- Gumbel-Softmax: A differentiable approximation for sampling from categorical distributions by combining Gumbel noise with softmax. "employs the Gumbel-Softmax technique to avoid soft-thinking tokens outside the pre-trained embedding space"
- Inverse transform sampling: A sampling method that transforms uniform random variables to draw from target distributions. "Using the inverse transform sampling, we can sample ε_i by computing g = -log(-log(u)) where u ~ Uniform(0, 1)"
- KL divergence: A measure of difference between two probability distributions used to regularize policy updates. "D_KL represents the KL-divergence of policy π_θ and a reference model policy π_θ_ref."
- Latent reasoning: Performing reasoning in a continuous hidden space rather than explicit language tokens. "Similar to the soft-thinking pattern, latent reasoning methods pass continuous vectors between LLM steps."
- Latent search tree: A hypothesized multi-thread branching structure in continuous reasoning that explores alternatives implicitly. "implementing a possible latent search tree (Wu et al., 2025a)"
- Lite PPO: A lightweight variant of Proximal Policy Optimization tailored for LLM reinforcement finetuning. "RLVR fine-tuning methods such as GRPO (Liu et al., 2024), Dr. GRPO (Liu et al., 2025a), DAPO (Yu et al., 2025), and Lite PPO (Liu et al., 2025b)"
- Majority voting: An ensemble aggregation method that selects the most common answer across multiple model runs. "we design to boost SofT-GRPO with majority voting (Chen et al., 2024)."
- Mean@32: The average Pass@1 accuracy computed over 32 independent runs on a dataset. "Mean@32 judges the average Pass@1 accuracy over 32 runs on the datasets."
- Multinomial sampling: Drawing discrete tokens according to categorical probabilities; often simulated by Gumbel-max. "it may simulate the multinomial sampling"
- Next-token prediction (NTP) policy: The autoregressive rule that generates the next token conditioned on the previous context. "next-token prediction (NTP) policy of LLM π_θ as follows:"
- Off-policy REINFORCE: A policy-gradient method using importance weighting to learn from trajectories generated by an older policy. "update the soft-thinking policy with the off-policy REINFORCE (Williams, 1992) algorithm"
- Off-policy RLVR algorithm: An RL method that optimizes policies using trajectories sampled from a different (old) policy. "As an off-policy RLVR algorithm, for each Query Q, SofT-GRPO samples and restores a group of G soft-thinking CoTs"
- Pass@1: The probability that a single model attempt is correct; often averaged over seeds/runs. "Pass@1 (+0.13% on average accuracy)"
- Pass@32: The probability that at least one of 32 sampled answers is correct, measuring sample-efficient success. "Pass@32 (i.e., the pass rate with 32 attempts)"
- Policy gradient: The gradient of expected reward with respect to policy parameters used to update the model. "and leverages the reparameterization trick in policy gradient."
- Reparameterization trick: A technique to enable low-variance gradient estimation through stochastic nodes by expressing randomness via deterministic transforms. "leverages the reparameterization trick on the Gumbel distribution."
- Reinforcement Learning with Verifiable Rewards (RLVR): RL finetuning where rewards are computed by verifiers (e.g., exact-match checkers) rather than labels. "Reinforcement Learning with Verifiable Rewards (RLVR) approaches"
- Rollout: The process of sampling model trajectories under a given policy to compute rewards and advantages. "in the rollout process, SofT-GRPO samples groups of soft-thinking reasoning paths"
- Soft-thinking paradigm: A reasoning scheme that passes weighted sums of token embeddings (continuous vectors) between steps instead of discrete tokens. "The soft-thinking reasoning pattern replaces each discrete token in the chain-of-thought (CoT) with a continuous representation: a weighted sum of d-dimensional token embeddings"
- Supervised Fine-tuning (SFT): Training with labeled CoT trajectories to make models predict prescribed tokens and answers. "Supervised Fine-tuning (SFT) for Discrete-Token CoT LLM Reasoning."
- Temperature (Gumbel-Softmax temperature Tg): A parameter controlling the softness of Gumbel-Softmax sampling; lower values approach argmax behavior. "Tg is the temperature of Gumbel-Softmax."
- Top-k sampling: Sampling scheme that restricts choices to the k most probable tokens before normalization. "enabling both the top-p and top-k sampling strategies."
- Top-p sampling: Nucleus sampling that selects the smallest set of tokens whose cumulative probability exceeds p. "enabling both the top-p and top-k sampling strategies."