Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Abstract: While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose Abstract Chain-of-Thought, a discrete latent reasoning post-training mechanism in which the LLM produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen "abstract" tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to 11.6× fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across LLM families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.
Explain it Like I'm 14
What is this paper about?
This paper shows a way for large language models (LLMs) to “think” without writing long explanations in normal words. Instead, the model uses a very short sequence of special, made-up symbols as a private shorthand, then gives the final answer. The goal is to keep the good performance of step-by-step thinking while making it much faster and cheaper to run.
What questions did the researchers ask?
They focused on three simple questions:
- Can we replace long, wordy chains-of-thought with short, non‑verbal “abstract” symbols and still get good answers?
- How do we teach a model to use brand-new symbols it has never seen before?
- Will this make reasoning more efficient (fewer tokens, lower cost) without hurting accuracy?
How did they do it?
Think of the model learning a secret shorthand language. The team introduces a small “alphabet” of special tokens (like <TOKENA>, <TOKENB>, …) that don’t mean anything to humans at first. The model learns to write a short line of these symbols before producing the final answer.
The idea of abstract tokens (short secret symbols)
- A “token” is a small chunk of text, like a word or symbol.
- The researchers add a reserved “codebook” of 64 special tokens that the model can use only in a short middle step.
- During inference (when answering), the model first writes a short string of these symbols, then gives the answer. It’s like scribbling quick notes in symbols only it understands.
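As a rough illustration of the reserved codebook described above, the sketch below extends a toy vocabulary with 64 new tokens plus two delimiters. The token names `<ABS_00>` ... `<ABS_63>`, the helper functions, and the toy base vocabulary are assumptions made for this example; only the codebook size of 64 and the `<beginabstract>`/`<endabstract>` delimiter names come from this page.

```python
# Minimal sketch, not the paper's code. Token names and helpers are hypothetical.
CODEBOOK_SIZE = 64  # codebook size stated in the summary above


def build_codebook(size=CODEBOOK_SIZE):
    """Return reserved token strings <ABS_00> ... <ABS_63> (hypothetical names)."""
    return [f"<ABS_{i:02d}>" for i in range(size)]


def extend_vocab(base_vocab, new_tokens):
    """Copy a token->id map and append new tokens with fresh ids.

    Assumes the existing ids are contiguous from 0, so len(vocab) is the next free id.
    """
    vocab = dict(base_vocab)
    for tok in new_tokens:
        if tok in vocab:
            raise ValueError(f"token already in vocabulary: {tok}")
        vocab[tok] = len(vocab)
    return vocab


base_vocab = {"<pad>": 0, "hello": 1, "world": 2}   # toy stand-in for a real tokenizer
delimiters = ["<beginabstract>", "<endabstract>"]
codebook = build_codebook()
vocab = extend_vocab(base_vocab, delimiters + codebook)

# Ids the model would be allowed to emit during the abstract step.
abstract_ids = [vocab[tok] for tok in codebook]
```

In a real model, each new id would also need a new embedding row, which starts out untrained; that is the "cold start" the warm-up stage is meant to address.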
Warm-up stage: teaching the symbols to be useful
Adding new symbols is a “cold start”: they mean nothing at first. So the team uses a two-part warm-up.
- Bottlenecked supervised training (SFT):
  - The model sees a normal question, a full written chain-of-thought (a teacher’s step-by-step), and the correct answer.
  - But the model is only allowed to use the new abstract symbols as the bridge to the answer. The answer is not allowed to “look at” the teacher’s long explanation directly.
  - This creates a bottleneck: all the important info must pass through a short sequence of abstract tokens, forcing those symbols to learn to carry meaning.
- Self-distillation (the model teaches itself):
  - Next, the model stops using the teacher’s long explanations.
  - It tries to write the short abstract sequence from the question alone, then produce the answer.
  - Training uses the model’s own abstract sequences plus the known correct answers, helping the model get better at using its shorthand by itself.
- They repeat these two steps a few times so the symbols become reliable notes.
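The bottleneck in the first warm-up step can be pictured as an attention mask. The sketch below is one plausible reading, assuming the sequence is laid out as prompt, verbal CoT, abstract tokens, answer, and that abstract positions may still read the CoT while answer positions may not. The function name and the exact layout are assumptions for illustration, not the paper's specification.

```python
# Minimal sketch of a block-structured mask; layout and names are assumptions.
def bottleneck_mask(n_prompt, n_cot, n_abstract, n_answer):
    """Return (mask, segments) where mask[i][j] is True if position i may attend to j.

    Assumed layout: [prompt | verbal CoT | abstract tokens | answer].
    """
    segments = (["prompt"] * n_prompt + ["cot"] * n_cot
                + ["abstract"] * n_abstract + ["answer"] * n_answer)
    n = len(segments)
    # Start from a standard causal mask: each position sees itself and the past.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Bottleneck: answer positions are blocked from reading the teacher's CoT,
    # so any information from the CoT must pass through the abstract tokens.
    for i in range(n):
        if segments[i] == "answer":
            for j in range(n):
                if segments[j] == "cot":
                    mask[i][j] = False
    return mask, segments


mask, segments = bottleneck_mask(n_prompt=2, n_cot=3, n_abstract=2, n_answer=2)
```

With this layout, positions 7 and 8 (the answer) can attend to the prompt and the abstract tokens but not to positions 2 through 4 (the CoT).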
Reinforcement learning (RL): practicing with a coach
- After warm-up, they use RL to refine the shorthand.
- The model tries different short symbol sequences, then produces an answer.
- A separate “judge” model scores how good the answer is (this score is the “reward”).
- The model updates itself to pick abstract sequences that lead to better answers more often.
- “Constrained decoding” keeps the shorthand step limited to the special token list, so the model can’t wander back into long, wordy explanations.
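Constrained decoding, as described in the last bullet, can be sketched by sampling only among allowed token ids. The example below is a toy version under stated assumptions: `toy_logits` stands in for a model forward pass, the end-of-trace id and the stopping rule are hypothetical, and a production serving stack would apply the restriction inside its own decoding loop.

```python
# Minimal sketch of constrained sampling; the toy model and ids are hypothetical.
import math
import random


def constrained_sample(logits, allowed_ids, rng):
    """Sample one id from a softmax restricted to allowed_ids."""
    allowed = sorted(allowed_ids)
    peak = max(logits[i] for i in allowed)          # subtract max for stability
    weights = [math.exp(logits[i] - peak) for i in allowed]
    return rng.choices(allowed, weights=weights, k=1)[0]


def generate_abstract(next_logits, codebook_ids, end_id, m_max, rng):
    """Emit at most m_max codebook tokens, stopping early if end_id is sampled."""
    allowed = set(codebook_ids) | {end_id}
    trace = []
    while len(trace) < m_max:
        tok = constrained_sample(next_logits(trace), allowed, rng)
        if tok == end_id:
            break
        trace.append(tok)
    return trace


def toy_logits(trace):
    # Stand-in for a model: id 0 (an ordinary word token) has the highest logit,
    # but it can never be chosen because it is outside the allowed set.
    return [9.0, 0.5, 1.0, 1.5, 0.2, -1.0]


codebook_ids = [2, 3, 4]
trace = generate_abstract(toy_logits, codebook_ids, end_id=5, m_max=8,
                          rng=random.Random(0))
```

Every id in `trace` belongs to `codebook_ids`, and the trace never exceeds `m_max` tokens, regardless of how strongly the unconstrained model prefers ordinary words.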
In everyday terms: first the model learns to take good notes from a teacher (bottleneck), then it practices taking notes on its own (self-distillation), and finally it trains with a coach who scores each attempt (RL) so the notes get even better.
What did they find?
- Far fewer “thinking” tokens: Up to about 11.6× fewer reasoning tokens compared to normal written chains-of-thought, while keeping similar or even better accuracy.
- Strong performance across tasks:
  - Math problems (MATH‑500)
  - General instruction-following (AlpacaEval)
  - Multi‑hop question answering (HotpotQA)
  - Also competitive on harder tests (AIME’25, GPQA-Diamond)
- Works across different model families and sizes (e.g., Qwen3‑8B, Qwen3‑4B, Granite 3B).
- The warm-up matters: “RL only” from a cold start didn’t work well. Warm-up alone helped, but warm-up plus RL was best.
- Better than simple “pause tokens”: Just inserting blank “pause” tokens didn’t help much; the learned abstract tokens did.
- Learned “reasoning language” patterns: The special symbols ended up used in a power‑law pattern (a few symbols used a lot, many used less), similar to how real words appear in language. This suggests the model discovered reusable “concept notes” in its shorthand.
- Order and length still matter:
  - Scrambling the shorthand sequence hurt accuracy (so the sequence carries structured meaning).
  - Forcing very short traces harms performance less than cutting off long written CoT, because the abstract shorthand was trained to stay short and efficient.
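One simple way to check for the power-law pattern mentioned above is to rank tokens by frequency and fit a line to log(frequency) against log(rank). The sketch below uses synthetic counts, not data from the paper, and a least-squares fit on log-log axes is only a rough diagnostic rather than a rigorous power-law test.

```python
# Minimal sketch of a rank-frequency check on synthetic data (not paper results).
import math
from collections import Counter


def rank_frequency(traces):
    """Return token counts sorted from most to least frequent."""
    counts = Counter(tok for trace in traces for tok in trace)
    return sorted(counts.values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic traces: the k-th token appears roughly 1000 / k times (Zipf-like).
synthetic = [[f"T{k}"] * round(1000 / k) for k in range(1, 21)]
freqs = rank_frequency(synthetic)
slope = loglog_slope(freqs)
```

A slope near -1 on this synthetic input corresponds to the classic Zipf shape; real abstract-token logs would need their own fit and goodness-of-fit checks.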
Why is this important?
- Faster and cheaper: Short symbol sequences are much quicker and use fewer tokens than long explanations, reducing latency and cost.
- Keeps quality: You can still get strong answers on complex tasks.
- Flexible and general: Works with different models and tasks, not just math.
What could this lead to?
- Smarter “thinking budgets”: The model could choose longer or shorter shorthand depending on how hard the question is, saving compute on easy problems.
- New ways to study model reasoning: Even though the tokens aren’t human‑readable, their usage patterns can be tracked, monitored, and audited—like checking what kinds of notes the model tends to take.
- Building blocks for concepts: Over time, the shorthand could grow into reusable “subroutines” (symbol sequences that mean “do this kind of reasoning”), helping models solve tough, multi‑step problems more reliably.
- Practical deployment: Where human-readable reasoning isn’t required, this method can deliver quicker answers at lower cost without losing accuracy.
In short: the paper shows that LLMs can “think without words” by using a short, learned symbol language as a private scratchpad. With a careful warm-up and reinforcement learning, this shorthand makes reasoning both efficient and effective.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper’s approach and evaluation:
- End-to-end latency and throughput are not measured; quantify real-world wall-clock speedups (including constrained-decoding overhead) versus verbal CoT baselines under typical serving stacks.
- Training cost vs. inference savings is not analyzed; report compute/energy budget for warm-up (T iterations) and 1M RL episodes, and establish net cost-benefit and sample efficiency.
- Dependence on a single generative reward model (gpt-oss-20b) is unexamined; assess robustness to reward-model bias, variance across reward models, reward hacking, and alignment with human judgments on non-verifiable tasks.
- Data contamination and leakage checks are absent; verify training/evaluation independence between Dolci-Think datasets and test sets (MATH-500, AIME, GPQA, HotpotQA, AlpacaEval).
- Generalization to additional task families is untested: code generation, tool use/planning, constrained decoding tasks, retrieval-augmented QA, long-context reasoning, program-of-thought, and factuality benchmarks.
- Multilingual and cross-lingual generalization is unknown; evaluate whether reserved abstract tokens transfer across languages and scripts.
- Scaling behavior across model sizes is underexplored; systematically test 32B–70B+ models and analyze scaling laws for performance and token efficiency.
- Adaptive “thinking budget” policies are not implemented; learn variable-length abstract traces (stopping policies) and characterize the length–accuracy trade-off per task.
- Codebook size/structure design is heuristic; study larger M, hierarchical codebooks, grammar-like constraints, and subroutine composition to discover reusable reasoning motifs.
- Interpretability of abstract tokens is largely anecdotal; perform probing and causal interventions (token swaps, ablations, counterfactual token injections) to map tokens to functions/concepts.
- Power-law token usage suggests potential mode collapse; quantify diversity over time, add regularizers (entropy bonuses, frequency caps), and study performance–diversity trade-offs.
- Robustness is only lightly tested (permutation/truncation); evaluate under adversarial prompts, noisy contexts, distribution shift, and prompt-injection scenarios.
- Faithfulness remains unclear; test whether abstract sequences reflect causal reasoning (e.g., via process supervision, latent interventions, and consistency under rationale redaction).
- Safety impacts are unmeasured; assess hallucination rates, calibration, harmful content propensity, and whether abstract tokens bypass safety filters or create covert channels.
- Deployment constraints of adding new tokens are not addressed; explore alternatives when tokenizers cannot be extended (e.g., reusing rare tokens, byte-level markers) and measure side effects on base vocabulary.
- Serving characteristics are unknown: quantify effects under quantization, KV-cache behavior, speculative decoding, streaming, batching, and beam search/sampling variations.
- Constrained-decoding failure modes are not analyzed; characterize constraint violations, fallback strategies, and their impact on correctness and latency.
- Attention-mask design is fixed; ablate mask variants (e.g., partial answer access to verbal CoT) to determine minimal bottleneck conditions and their impact on learning.
- Cold-start RL underperforms; study improved initializations (synthetic pretraining with abstract tokens, curriculum schedules, imitation from teacher codes) to reduce warm-up compute.
- Learning dynamics and sample efficiency are not characterized; provide learning curves across T, m_max, and M, and identify minimal data needed to reach target accuracy/efficiency.
- Comparative baselines omit leading continuous/hybrid latent methods (e.g., Coconut, CODI, HybridCoT) under matched token budgets; run head-to-head evaluations.
- Effect on unrelated capabilities is unreported; test for regressions on general benchmarks (e.g., MMLU, HELM) to detect catastrophic interference from new token embeddings.
- Transferability across models is unknown; examine whether learned abstract embeddings/policies can be ported between model families or sizes (e.g., via alignment layers).
- Budget adaptivity by difficulty is only proposed; evaluate difficulty-aware policies and per-task/per-instance length controllers optimized via RL.
- Long-horizon problems remain challenging; benchmark on harder math (e.g., Olympiad-level beyond AIME), theorem proving, and multi-step planning with explicit horizon analysis.
- Interpretability for audit is not demonstrated; propose concrete auditing protocols (token-level attribution, rationale consistency checks) and measure auditor performance.
- Security and governance considerations are not explored; analyze whether abstract tokens enable obfuscation or data exfiltration, and design detectors/filters for non-verbal traces.
- Tokenization and position effects are unstudied; probe sensitivity to positional placement of <beginabstract>/<endabstract> and interactions with positional encodings.
- Hyperparameter sensitivity is underreported; sweep decoding temperatures, sampling strategies, KL coefficients (GRPO), and m_max to map stable operating regions.
- Response-length effects are not isolated; disentangle reasoning-token savings from potential response-length shifts to ensure fair token-efficiency accounting.
- Failure analysis is minimal; categorize typical errors (mathematical slips, reasoning jumps), relate them to abstract-sequence properties, and devise targeted fixes (e.g., token-level constraints).
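Several of the gaps above call for measurements that are cheap to prototype. As one example, the diversity of abstract-token usage could be tracked with Shannon entropy, normalized by the maximum possible for the codebook. The function names below are hypothetical and the inputs are toy data; this is a sketch of one possible metric, not an analysis from the paper.

```python
# Minimal sketch of a token-usage diversity metric; names and data are hypothetical.
import math
from collections import Counter


def usage_entropy_bits(traces):
    """Shannon entropy (in bits) of the empirical token distribution."""
    counts = Counter(tok for trace in traces for tok in trace)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tokens to measure")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalized_entropy(traces, codebook_size):
    """Entropy divided by log2(codebook_size): 1.0 means perfectly uniform usage."""
    return usage_entropy_bits(traces) / math.log2(codebook_size)


uniform_traces = [["A", "B"], ["C", "D"]]        # four tokens, each used once
collapsed_traces = [["A", "A"], ["A", "A"]]      # a single token used every time

uniform_bits = usage_entropy_bits(uniform_traces)
collapsed_bits = usage_entropy_bits(collapsed_traces)
share_of_max = normalized_entropy(uniform_traces, codebook_size=64)
```

Tracking this value across training checkpoints would show whether usage narrows toward a few tokens over time, which is the mode-collapse concern raised in the list.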
Practical Applications
Immediate Applications
The paper’s core innovation—replacing long natural-language rationales with short, discrete “abstract” token sequences learned post-training—enables practical reductions in inference cost while preserving performance. The following applications can be deployed now with moderate engineering effort.
- Cost- and latency-reduced model serving for reasoning-heavy LLMs
  - Sectors: software/cloud, enterprise SaaS, energy
  - What: Serve comparable-quality answers with 2–12× fewer intermediate reasoning tokens (e.g., up to 11.6× fewer on MATH-500), lowering latency and compute/billing while increasing throughput.
  - Tools/Products/Workflows:
    - “Concise reasoning mode” in LLM APIs with a knob for abstract budget m_max.
    - Server-side constrained decoding for abstract segments plus normal decoding for answers.
    - Observability dashboards tracking abstract-token budgets and performance.
  - Assumptions/Dependencies:
    - Ability to post-train existing instruction-tuned models (tokenizer extension, attention-mask customization).
    - Availability of teacher CoT data or a teacher model to bootstrap warm-up; access to a reward model for RL.
    - Constrained decoding support in serving stack.
- Edge and on-device assistants with faster response and lower power
  - Sectors: mobile, embedded, robotics
  - What: Shorter intermediate sequences reduce generation steps, making on-device or near-edge deployment more practical.
  - Tools/Products/Workflows:
    - Lightweight adapters that add abstract reasoning to 3–8B models for offline assistants.
    - Device-tier budgeting (e.g., m_max scaled to battery/thermals).
  - Assumptions/Dependencies:
    - Sufficient fine-tuning compute to adapt target models; careful latency benchmarking on device hardware.
- Enterprise chatbots and customer support with lower operating cost
  - Sectors: customer support, sales ops, IT helpdesk
  - What: Maintain instruction-following quality (e.g., AlpacaEval gains) while reducing per-interaction token usage and cost.
  - Tools/Products/Workflows:
    - Multi-tenant chat infrastructure that enforces abstract reasoning budgets by tenant and by intent.
  - Assumptions/Dependencies:
    - Need to verify task-specific quality; careful monitoring for edge cases where long chains are still beneficial.
- Privacy-friendly logging and reduced exposure of sensitive rationales
  - Sectors: healthcare admin, finance ops, legal services (non-clinical/non-adjudicative uses initially)
  - What: Store minimal, non-linguistic latent traces instead of long natural-language CoTs to reduce risk of leaking sensitive intermediate content.
  - Tools/Products/Workflows:
    - Log pipelines that store only abstract token sequences and answers, with optional redaction of abstract segments.
  - Assumptions/Dependencies:
    - Abstract tokens are not human-readable, which improves privacy but reduces interpretability for audits; some regulators may still require human-explainable traces.
- Training-time efficiency for RL-style post-training
  - Sectors: model providers, MLOps
  - What: RL with shorter intermediate rollouts reduces memory/compute per episode and eases scaling.
  - Tools/Products/Workflows:
    - GRPO pipelines that constrain the abstract channel and separately decode answers; token-optimal curriculum with policy-iteration warm-up then RL.
  - Assumptions/Dependencies:
    - Quality of the reward model strongly impacts outcomes; reproducibility requires stable RL tuning.
- Budget-aware serving with per-request or per-user limits
  - Sectors: cloud platforms, internal LLM platforms
  - What: Enforce hard caps on abstract tokens (m_max) to control spend and latency; dynamically adjust budgets by request complexity or SLA.
  - Tools/Products/Workflows:
    - Policy-based resource governance (e.g., higher m_max for premium tiers).
  - Assumptions/Dependencies:
    - Simple caps work today; advanced difficulty-aware policies are promising but not yet automated.
- “Reasoning-lite” modes for RAG and multi-hop QA
  - Sectors: knowledge management, enterprise search, education QA
  - What: Combine retrieval with an abstract reasoning channel to hold intermediate cues while keeping the overall generation short.
  - Tools/Products/Workflows:
    - RAG pipelines that slot a gated abstract segment before final synthesis.
  - Assumptions/Dependencies:
    - Requires integration testing; gains may vary by retrieval quality and task difficulty.
- Telemetry and safety through compute-budget monitoring
  - Sectors: trust & safety, platform governance
  - What: Detect anomalous or unexpectedly long reasoning by watching the abstract token counts and usage patterns; throttle or block as needed.
  - Tools/Products/Workflows:
    - Serving middleware enforcing regex-guarded abstract segments; alerts on bursts or unusual token distributions.
  - Assumptions/Dependencies:
    - Abstract tokens are not semantically transparent; monitoring is primarily quantitative without learned probes.
- Developer adapters to retrofit existing LLMs
  - Sectors: developer tooling, open-source ecosystems
  - What: Drop-in libraries that add a reserved vocabulary, attention masks for bottlenecked SFT, and constrained decoding.
  - Tools/Products/Workflows:
    - Hugging Face adapters; tokenizer patches; training scripts for warm-up + RL; CI to verify performance and token reduction.
  - Assumptions/Dependencies:
    - License constraints for training data and reward models; GPU availability for warm-up/RL.
- Pricing and product differentiation via concise-reasoning tiers
  - Sectors: API businesses, marketplaces
  - What: Offer “concise” plans that deliver similar accuracy with lower token counts and predictable latency.
  - Tools/Products/Workflows:
    - Tiered SKUs, metering by answer + abstract tokens; developer-facing knobs for budget control.
  - Assumptions/Dependencies:
    - Must communicate trade-offs for long-horizon tasks; ensure fallbacks when budgets are insufficient.
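The budget-aware serving and tiered-metering ideas in this list reduce to a small amount of policy logic. The sketch below shows one possible shape, assuming hypothetical tier names and cap values; none of these numbers come from the paper, and a real deployment would tune them against measured quality.

```python
# Minimal sketch of per-tier abstract budgets; tiers and caps are hypothetical.
TIER_BUDGETS = {"free": 8, "standard": 16, "premium": 32}  # illustrative m_max caps


def abstract_budget(tier, requested=None, default_tier="free"):
    """Return the m_max to enforce for a request.

    Unknown tiers fall back to the default tier; a requested budget is honored
    only up to the tier's cap and is never negative.
    """
    cap = TIER_BUDGETS.get(tier, TIER_BUDGETS[default_tier])
    if requested is None:
        return cap
    return max(0, min(requested, cap))


def billed_tokens(answer_tokens, abstract_tokens):
    """Meter a request by answer tokens plus abstract tokens."""
    return answer_tokens + abstract_tokens


premium_cap = abstract_budget("premium", requested=100)   # clipped to the tier cap
small_request = abstract_budget("standard", requested=4)  # below the cap, honored
fallback = abstract_budget("unknown-tier")                # falls back to "free"
total = billed_tokens(answer_tokens=120, abstract_tokens=premium_cap)
```

The returned value would be passed as the hard limit to the constrained-decoding step, so spend and latency stay bounded per request.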
Long-Term Applications
These opportunities require additional research, scaling, or tooling—particularly around interpretability, dynamic budgeting, and safety for high-stakes use.
- High-stakes decision support with dual channels (latent + explain-on-demand)
  - Sectors: healthcare, finance, law, public sector
  - What: Use abstract tokens for efficient internal reasoning, with a separate, audited module that compiles human-readable rationales from the latent trace when legally required.
  - Tools/Products/Workflows:
    - “Explanation compiler” that maps abstract sequences to validated, human-readable summaries.
  - Assumptions/Dependencies:
    - Rigorous clinical/financial validation, bias assessment, and regulatory approval; robust faithfulness guarantees are needed.
- Difficulty-aware, adaptive reasoning budgets
  - Sectors: cloud platforms, robotics, education tech
  - What: Automatically adjust m_max per query based on predicted difficulty or early stopping criteria, optimizing accuracy-cost tradeoffs.
  - Tools/Products/Workflows:
    - Controllers trained via RL that learn a policy over abstract budget allocation; early halt/extend mechanisms.
  - Assumptions/Dependencies:
    - More research on reliable difficulty estimation and stability of length-control policies.
- Hierarchical abstract codebooks and reusable “subroutines”
  - Sectors: software engineering, planning, robotics, scientific assistants
  - What: Organize tokens into reusable, compositional chunks (macros) to encode strategies or subtasks; support long-horizon reasoning with modular latent routines.
  - Tools/Products/Workflows:
    - Library of task-specific abstract macros; curriculum learning for hierarchy building.
  - Assumptions/Dependencies:
    - Requires training curricula and mechanisms for discovery, naming, and reuse; evaluation benchmarks for compositionality.
- Tool-use orchestration via latent “thought channel”
  - Sectors: agent platforms, DevOps, data engineering
  - What: Interpret specific abstract token patterns to trigger external tools (search, code execution, calculators) without verbose natural-language planning.
  - Tools/Products/Workflows:
    - Orchestrators that route based on abstract sequences; regex/probe-based routers; safety filters on tool invocation.
  - Assumptions/Dependencies:
    - Needs robust mapping from abstract patterns to tool semantics; potential risk of hidden or emergent tool-triggering behaviors.
- Safety and audit frameworks grounded in latent reasoning
  - Sectors: AI governance, compliance
  - What: Build probes that interpret abstract tokens into coarse concepts for auditing and anomaly detection, enabling “monitorable but concise” reasoning.
  - Tools/Products/Workflows:
    - Concept-probing classifiers; policy checks on forbidden abstract patterns; audit logs with compact latent summaries.
  - Assumptions/Dependencies:
    - Research required to validate probe fidelity and combat adversarial encoding; standards for acceptable transparency.
- Interoperability standards for an “abstract reasoning” sub-channel
  - Sectors: model providers, platform ecosystems
  - What: Define a cross-model interface (delimiters, budgets, token sets) so tools and orchestrators can consistently leverage the abstract channel across model families.
  - Tools/Products/Workflows:
    - Specs for delimiters (<beginabstract>/<endabstract>), codebook management, and budget negotiation in APIs.
  - Assumptions/Dependencies:
    - Industry coordination; aligning on safety and versioning; backward-compatibility with existing clients.
- Low-bandwidth and satellite/field deployments
  - Sectors: disaster response, remote sensing, defense
  - What: Minimize tokens transmitted over constrained links by compressing intermediate reasoning and optionally streaming only final answers.
  - Tools/Products/Workflows:
    - Edge inference nodes using abstract budgets; opportunistic synchronization of reward/policy updates when connectivity allows.
  - Assumptions/Dependencies:
    - Robustness to noisy conditions; verification that compressed reasoning suffices for mission-critical tasks.
- Research into mechanistic interpretability and concept dynamics
  - Sectors: academia, foundation model labs
  - What: Study emergent power-law usage of abstract tokens (Zipf-like) and map tokens to internal circuits/concepts to better understand reasoning.
  - Tools/Products/Workflows:
    - Causal tracing, activation patching, and representational similarity analyses on abstract positions; curriculum interventions.
  - Assumptions/Dependencies:
    - Access to model internals; standardized benchmarks for latent compositionality.
- Curriculum strategies for long-horizon tasks with stable budgets
  - Sectors: scientific discovery, theorem proving, multi-step planning
  - What: Gradually expand the abstract budget and/or hierarchy for problems that exceed small m_max while staying efficient.
  - Tools/Products/Workflows:
    - Stage-wise training from short to longer abstract traces; verifier-assisted rewards for correctness under tight budgets.
  - Assumptions/Dependencies:
    - Task-specific reward functions; scalable RL without regressions on shorter tasks.
- Hybrid latent reasoning (discrete + continuous) for better expressivity
  - Sectors: advanced agent systems, creative tools
  - What: Combine compact discrete sequences with continuous latent states to balance control and capacity.
  - Tools/Products/Workflows:
    - Training loops that interleave discrete codebook tokens with continuous “concept vectors”; gating policies between modes.
  - Assumptions/Dependencies:
    - Stability and credit assignment remain open research problems; careful optimization required.
- Regulatory policy and sustainability reporting with compute budgets
  - Sectors: public policy, sustainability
  - What: Tie per-request abstract budgets to energy/carbon reporting; define policy guidelines for “green” reasoning modes in public services.
  - Tools/Products/Workflows:
    - Auditable reporting of average abstract tokens per request; policy-compliant defaults for public-sector deployments.
  - Assumptions/Dependencies:
    - Agreement on measurement standards and acceptable accuracy-cost tradeoffs in public services.
Cross-cutting assumptions and risks (affecting both horizons)
- Abstract tokens are not human-readable; where explanations are required, additional explainability modules are needed.
- Method relies on warm-up with verbal CoT or a capable teacher model to bootstrap embeddings; quality of teacher/reward model affects outcomes.
- Constrained decoding and custom attention masks must be supported by serving and training stacks.
- Short budgets may degrade performance on very long-horizon tasks; difficulty-aware budgeting is advisable.
- Latent channels can encode unintended information; safety probes and policy controls are recommended to prevent misuse or hidden tool triggers.
- Results shown on specific benchmarks and model families; domain transfer (e.g., clinical or legal reasoning) requires careful validation and compliance review.
Glossary
- Abstract Chain-of-Thought (Abstract-CoT): A discrete latent reasoning approach where a model generates short sequences of special tokens as an internal “scratchpad” before answering. "We propose Abstract Chain-of-Thought (Abstract-CoT): instead of generating natural language reasoning, we induce the model to emit a bounded-length sequence of tokens from a reserved abstract vocabulary of distinguishable filler tokens."
- advantages: Normalized per-trajectory performance signals used in RL to weight policy updates. "We define advantages:"
- block-structured attention mask: An attention pattern that enforces information flow constraints between segments (prompt, CoT, abstract tokens, answer). "and define a block-structured attention mask that enforces an information bottleneck."
- causal decoder-only LLM: A unidirectional LLM that predicts each token from previous tokens only. "Let be a causal decoder-only LLM with parameters and base vocabulary ."
- causal masking: A masking scheme preventing tokens from attending to future positions to preserve autoregressive causality. "with all other entries following standard causal masking:"
- chain-of-thought (CoT): Step-by-step intermediate reasoning expressed in tokens to improve problem solving. "While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks,"
- constrained decoding: Restricting the model’s next-token choices to a predefined set (e.g., abstract vocabulary) during generation. "via constrained decoding with the codebook."
- data processing inequality: Information-theoretic principle stating downstream variables cannot contain more information about a source than intermediate representations do. "By the data processing inequality, any dependence between and must be bounded by the information that can be transmitted through the abstract segment:"
- discrete latent bottleneck: A restricted-capacity intermediate representation that mediates information flow between input and output using discrete tokens. "this training procedure can be seen as implementing a discrete latent bottleneck;"
- discrete latent variable: A hidden, non-observed discrete variable that influences generation (here, the abstract token sequence). "We view the abstract trace as a discrete latent variable, mediating reasoning."
- generative reward model: A model that assigns reward scores to generated outputs for RL training without verifiable labels. "We use a generative reward model -- specifically, gpt-oss-20b -- to score outputs in our experiments,"
- gist tokens: Learned tokens acting as a bottleneck that summarize and cache task-relevant context. "gist tokens which serve as a learned bottleneck,"
- GRPO: A reinforcement learning algorithm (Group Relative Policy Optimization) used to update policies from grouped trajectory rewards. "We optimize generation of abstract traces using GRPO, with constrained decoding to the abstract vocabulary."
- guided-regex constraint: A decoding constraint implemented via regular expressions to enforce structure in generated segments. "generate under a guided-regex constraint,"
- information bottleneck: A restricted channel that forces the model to compress relevant information into limited-capacity representations. "serving as an information bottleneck."
- instruction-tuned: Models further trained on instruction–response pairs to align with user directives. "it can be achieved purely through post-training instruction-tuned models."
- KL regularization: A penalty encouraging the policy to stay close to a reference distribution during RL updates. "KL regularization is applied over both the abstract and the response distributions."
- multi-hop reasoning: Answering questions that require combining information across multiple steps or sources. "instruction-following, and multi-hop reasoning,"
- on-policy generation: Producing training trajectories using the current policy rather than a separate teacher policy. "In subsequent iterations (), abstract sequences are generated on-policy:"
- pause tokens: Special tokens inserted to allocate “thinking time” and potentially improve reasoning. "introduce <pause> tokens,"
- policy iteration: An iterative scheme alternating between policy evaluation/generation and policy improvement. "we introduce a policy iteration-style warm-up loop"
- power law distribution: A heavy-tailed frequency pattern where a few items are used very often and many are rare. "an emergent power law distribution over the abstract vocabulary,"
- reference policy: A fixed (or slowly updated) baseline policy used to compute KL penalties and stabilize RL. "the reference policy (the warm-started model)"
- reinforcement learning (RL): Training via rewards from generated outputs to improve policies through exploration and optimization. "we apply reinforcement learning with a generative reward model"
- reserved abstract vocabulary: A set of newly added, non-verbal tokens used as the model’s latent reasoning language. "from the reserved abstract vocabulary inside"
- self-distillation: Training a model on its own generated intermediate representations to reduce reliance on teacher signals. "self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook."
- supervised fine-tuning (SFT): Training on labeled input–output pairs (and optionally intermediate traces) to adapt the model. "performing supervised fine-tuning,"
- tokenizer extension (abstract codebook): Adding new tokens to the vocabulary to serve as discrete latent representations. "We extend the tokenizer with a set of previously unseen (reserved) tokens in the abstract codebook,"
- vector quantization: Mapping continuous representations to discrete codebook entries to compress or discretize information. "latent tokens (learned via vector quantization) with remaining text tokens,"
- verbalized CoT: Chain-of-thought expressed in natural language, as opposed to latent or abstract tokens. "yet their performance lags behind verbalized CoT."
- warm-started RL: Running RL after initializing from a model already adapted to the task, improving stability and performance. "we refer to this as warm-started RL."
- Zipf's law: An empirical power-law pattern in token frequencies where rank and frequency are inversely related. "akin to Zipf's law,"
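The "advantages" and "GRPO" entries can be made concrete with a small calculation. This page does not reproduce the paper's exact formula, so the sketch below follows the commonly used group-relative form, in which each sampled trajectory's reward is normalized by the mean and standard deviation of its group. The use of the population standard deviation and the `eps` term are assumptions.

```python
# Minimal sketch of group-relative advantages; normalization details are assumed.
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples that
    beat their group average get positive weight and the rest get negative weight.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; eps guards against std == 0
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled abstract traces for one prompt, scored 1 (good) or 0 (bad).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, no trace is preferred over another.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

When all rewards in a group are identical the advantages are zero, so that group contributes no gradient signal; this is one reason a cold-start policy that rarely produces good answers can be slow to improve.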
