ExecTune: Effective Steering of Black-Box LLMs with Guide Models
Abstract: For LLMs deployed through black-box APIs, recurring inference costs often exceed one-time training costs. This motivates composed agentic systems that amortize expensive reasoning into reusable intermediate representations. We study a broad class of such systems, termed Guide-Core Policies (GCoP), in which a guide model generates a structured strategy that is executed by a black-box core model. This abstraction subsumes base, supervised, and advisor-style approaches, which differ primarily in how the guide is trained. We formalize GCoP under a cost-sensitive utility objective and show that end-to-end performance is governed by guide-averaged executability: the probability that a strategy generated by the guide can be faithfully executed by the core. Our analysis shows that existing GCoP instantiations often fail to optimize executability under deployment constraints, resulting in brittle strategies and inefficient computation. Motivated by these insights, we propose ExecTune, a principled training recipe that combines teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to directly optimize syntactic validity, execution success, and cost efficiency. Across mathematical reasoning and code-generation benchmarks, GCoP with ExecTune improves accuracy by up to 9.2% over prior state-of-the-art baselines while reducing inference cost by up to 22.4%. It enables Claude Haiku 3.5 to outperform Sonnet 3.5 on both math and code tasks, and to come within 1.7% absolute accuracy of Sonnet 4 at 38% lower cost. Beyond efficiency, GCoP also supports modular adaptation by updating the guide without retraining the core.
Explain it Like I'm 14
Plain-language summary of “EXECTUNE: Effective Steering of Black-Box LLMs with Guide Models”
What is this paper about?
This paper shows a smarter, cheaper way to use large language models (LLMs) that you can’t change or look inside (“black-box” APIs). The idea is to split the job into two parts:
- a small “guide” model writes a clear plan or strategy,
- a separate “core” model follows that plan to produce the final answer.
The authors call this setup a Guide–Core Policy (GCOP). They also introduce a training method, EXECTUNE, that teaches the guide to write plans the core can actually follow. This makes the system more accurate and less expensive to run.
What questions are the authors trying to answer?
In simple terms:
- Can we make cheaper models solve hard tasks by giving them better plans?
- What’s the single most important thing to train the guide to do so the whole system works well?
- How can we train the guide to produce plans that are both easy to follow and cost-efficient?
Their key insight: the main thing that controls overall performance is “executability,” which means “how often the core can parse and successfully follow the guide’s plan.”
How did they do it? (With easy analogies)
Think of the system like a sports team:
- The guide is the coach who writes the play.
- The core is the player who runs the play.
- The goal is to win points (correct answers) without wasting too much time or energy (costs like money, tokens, or latency).
If the coach writes plays that are confusing or too fancy, the player stumbles and you lose time and points. So the coach must learn to write clear, followable plays. That “followability” is executability.
To teach the coach (guide) to write good plays, the authors use EXECTUNE, a two-stage training recipe:
- Stage 1: Build good examples using “teacher-guided acceptance sampling”
- A strong “teacher” model suggests a plan.
- The target core tries to follow that plan.
- If the plan leads to a correct result, keep it as training data; if not, the teacher refines the plan.
- Then they do supervised fine-tuning (SFT): train the guide to imitate these accepted, successful plans.
- Analogy: the assistant coach proposes a play; if the team scores, we add that play to the playbook; the coach studies this playbook.
- Stage 2: Refine with structure-aware reinforcement learning (RL)
- The guide gets rewarded for:
- using the correct format (e.g., putting the plan inside a <strategy>…</strategy> block so the core can parse it),
- plans that the core follows to a correct answer,
- not “cheating” by sneaking the final answer into the plan,
- not making the core do worse than it would without a plan.
- Analogy: we reward the coach for plays that the player can read, that lead to points, that don’t give away the final score, and that never make the team perform worse.
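Stage 1 above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose`, `refine`, `run_core`, and `is_correct` are hypothetical callables standing in for the teacher model, the teacher's refinement step, the black-box core, and the task validator, and `max_rounds` is an assumed cap on refinement attempts.

```python
from typing import Callable, Optional


def accept_strategy(
    task: str,
    propose: Callable[[str], str],           # teacher: task -> candidate strategy
    refine: Callable[[str, str, str], str],  # teacher: (task, strategy, core answer) -> new strategy
    run_core: Callable[[str, str], str],     # core: (task, strategy) -> final answer
    is_correct: Callable[[str, str], bool],  # validator: (task, answer) -> success?
    max_rounds: int = 3,                     # assumed refinement budget
) -> Optional[str]:
    """Return a strategy the core executed successfully, or None if none was found."""
    strategy = propose(task)
    for _ in range(max_rounds):
        answer = run_core(task, strategy)
        if is_correct(task, answer):
            return strategy  # accepted: becomes an SFT training example
        strategy = refine(task, strategy, answer)
    return None


def build_sft_corpus(tasks, propose, refine, run_core, is_correct):
    """Keep only (task, strategy) pairs whose strategy passed validation."""
    corpus = []
    for task in tasks:
        strategy = accept_strategy(task, propose, refine, run_core, is_correct)
        if strategy is not None:
            corpus.append((task, strategy))
    return corpus
```

The key design point is that acceptance is decided by what the target core actually does with the plan, not by how good the plan looks to the teacher.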
What does “black-box” mean here? You can ask the core model questions and get answers, but you can’t change its internal settings or see how it thinks.
What does “cost-sensitive” mean? The paper treats the system like a scoreboard: total points = accuracy (correct answers) minus the cost of running the models (money/time/tokens). The goal is to get more points overall, not just more accuracy at any price.
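The scoreboard idea can be written as one line of math. This is a sketch in assumed notation, not the paper's exact formula: R stands for the task reward (for example, 1 for a correct answer and 0 otherwise), C for the cost of the guide and core calls, and λ for the cost-sensitivity weight that the applications later on this page refer to.

```latex
U \;=\; \mathbb{E}[R] \;-\; \lambda\,\mathbb{E}[C]
```

A larger λ means cost counts for more, so a slightly less accurate but much cheaper system can come out ahead.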
What did they find, and why is it important?
- Executability is the key. When the guide’s plans are easy to parse and follow, the whole system performs like a much bigger model—without the big cost.
- Their EXECTUNE-trained guides:
- improved accuracy by up to about 9.2% compared to previous steering methods,
- reduced inference cost by up to about 22.4%,
- helped a smaller core (Claude Haiku) match or beat a stronger, more expensive core (Claude Sonnet) on math and coding tasks.
On benchmarks:
- Math (GSM8K): With EXECTUNE, the guide+Haiku reached about 93.6% accuracy, higher than the Sonnet 3.5 result reported in the paper and far above Haiku alone.
- Code (KodCode, HumanEval): The guide+Haiku with EXECTUNE surpassed Sonnet-3.5 on both datasets.
- Overall, their method came within about 1.7% of a next-gen core (Sonnet 4) at roughly 38% lower cost.
Why this matters:
- You can get near “big model” performance while paying “small model” prices.
- It’s modular: you can update or swap the guide for new domains without retraining the core model.
What’s the bigger impact?
This approach helps companies and developers:
- Save money and reduce latency when using LLMs at scale.
- Make systems more reliable by turning vague “advice” into precise, parseable plans that models can follow.
- Adapt to new tasks quickly by retraining just the guide, not the core.
The authors note one limitation: they mostly tested single-turn tasks (one plan, one execution). Future work could explore multi-step settings where the guide and core interact over many turns, use tools, or recover from mistakes mid-process.
In short
The paper introduces a “coach-and-player” setup for LLMs and a training method (EXECTUNE) that teaches the coach to write plays the player can actually run. By focusing on executability—clear, structured plans that the core can reliably follow—the system gets more right answers for less cost, often matching or beating bigger, pricier models.
Knowledge Gaps
Below is a concise, actionable list of the paper’s unresolved knowledge gaps, limitations, and open questions to guide future work:
- Empirical validation of executability theory: Directly measure per-instance and guide-averaged executability at train/test time and quantify how changes in these quantities explain value gap closure, beyond the current theoretical bound.
- Strength of the “teacher-aligned good execution” assumption: Test how realistic this assumption is; characterize cases where “good execution” diverges from teacher behavior and how this affects the bound.
- Sensitivity to teacher choice and acceptance thresholds: Systematically ablate the teacher model, acceptance threshold, number of validation trials, and refinement iterations; report acceptance rates, dataset sizes, and resulting performance/cost trade-offs.
- Cost of acceptance sampling and data curation: Quantify the monetary/latency overhead of building the accepted-strategy corpus (including multiple core executions per candidate) and incorporate it into overall utility accounting.
- Missing incorporation of cost into RL objective: The shaped reward used in GRPO appears not to include a token/latency cost term; evaluate adding explicit cost penalties to optimize the stated net-utility objective during training.
- Strategy length vs. utility trade-offs: Analyze how strategy verbosity impacts both executability and inference cost, and learn length/format policies that optimize under context-window constraints.
- Reliability and bias of LLM-as-a-judge signals: Assess consistency, bias, and adversarial robustness of the judge score; compare to human judgments and rule-based heuristics; ablate the contribution of the judge to final performance.
- Reward component ablations: Provide rigorous ablations for the structure (I_str), judge-score, and no-negative-behavior penalty terms to isolate their individual and combined effects on executability and task performance.
- Reporting of executability diagnostics: Track and report parse success rates, fraction of malformed strategies, execution success rates conditioned on parse, and degradation rates during training and evaluation.
- Generalization beyond single-turn settings: Extend and test GCOP/EXECTUNE in multi-turn, tool-using, and long-horizon tasks, addressing strategy updating, recovery after failures, and credit assignment across turns.
- Distribution shift robustness: Evaluate on broader out-of-domain tasks (beyond HumanEval), adversarial prompts, and real-world coding/math distributions; study how executability and performance degrade under shift.
- Portability across cores and updates: Measure how a guide trained for one black-box core transfers to different cores or to updated versions of the same core; develop fast adaptation or robustness strategies for core drift.
- Safety and alignment impacts: Investigate whether guides can inadvertently reduce safety (e.g., prompt-injection susceptibility, jailbreaks) or amplify biases; create executability-aware safety constraints and evaluate them.
- Leakage definition and enforcement: Formalize “non-leakage” criteria across tasks (e.g., allowable code skeleton vs. full solutions), validate judge-based leakage detection, and test evasion/adversarial cases.
- Practical latency and systems constraints: Report end-to-end latency (including extra guide call and parsing) under realistic batching/parallelism; assess whether latency meets production SLAs.
- Scaling guide size and architecture: Explore the accuracy/cost frontier across guide sizes (sub-1B to ~7B+), architectures, and distillation methods; characterize diminishing returns and deployment memory constraints.
- Sample efficiency and stability of GRPO: Quantify RL sample efficiency, variance, and stability (e.g., catastrophic forgetting, oscillations); compare GRPO to alternative offline/online RL or bandit algorithms.
- Data contamination risks: Audit whether teacher outputs or curated strategies leak test answers (e.g., GSM8K/HumanEval contamination) and assess fairness of comparisons to stronger cores.
- Evaluation breadth: Test on additional reasoning/code benchmarks (e.g., MATH, MBPP, LeetCode-like suites, multilingual code) to gauge generality.
- Runtime fallbacks and robustness: Investigate multi-candidate strategies, fallback to core-only when parsing fails, or reject-option mechanisms to reduce brittleness at inference.
- Context budget management: Develop adaptive mechanisms to fit strategies within core context limits (e.g., truncation policies, content prioritization, or compressed formats).
- Guide-core interface design: Study alternative structured interfaces beyond a single <strategy> block (e.g., typed schemas, JSON, tool-call specifications) and their impact on executability.
- Validator dependence: Examine tasks without reliable automatic validators (soft metrics, open-ended outputs) and propose practical proxies or human-in-the-loop acceptance strategies.
- Modularity claims without experiments: Empirically validate touted modular benefits (domain adaptation, continual learning, targeted unlearning) with controlled studies showing adaptation speed and isolation from core behavior.
- Comprehensive cost accounting and reproducibility: Provide full token accounting (guide+core, retries, baseline comparisons), confidence intervals for metrics, released prompts/datasets/code, and clear hyperparameters to ensure replicability.
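The executability diagnostics requested in the list above (parse success, success conditioned on parse, degradation) can be computed from logged runs. The sketch below is illustrative only: the `Record` fields, the single-block parsing rule, and the metric names are assumptions, not definitions from the paper. It assumes Python 3.9 or later.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed schema: exactly one non-empty <strategy>...</strategy> block.
STRATEGY_RE = re.compile(r"<strategy>(.*?)</strategy>", re.DOTALL)


def parse_strategy(guide_output: str) -> Optional[str]:
    """Return the strategy text, or None if the output is malformed."""
    blocks = STRATEGY_RE.findall(guide_output)
    if len(blocks) != 1 or not blocks[0].strip():
        return None
    return blocks[0].strip()


@dataclass
class Record:
    guide_output: str
    guided_correct: bool    # core succeeded when given the strategy
    unguided_correct: bool  # core succeeded with no strategy at all


def diagnostics(records: list[Record]) -> dict[str, float]:
    """Empirical parse, execution, and degradation rates over logged runs."""
    n = len(records)
    parsed = [r for r in records if parse_strategy(r.guide_output) is not None]
    executed = [r for r in parsed if r.guided_correct]
    degraded = [r for r in records if r.unguided_correct and not r.guided_correct]
    return {
        "parse_rate": len(parsed) / n if n else 0.0,
        "success_given_parse": len(executed) / len(parsed) if parsed else 0.0,
        "executability": len(executed) / n if n else 0.0,
        "degradation_rate": len(degraded) / n if n else 0.0,
    }
```

Here "executability" is estimated as the fraction of guide outputs that both parse and lead the core to a correct result, which is one plausible empirical proxy for the guide-averaged quantity the paper analyzes.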
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that leverage GCOP and the EXECTUNE training recipe to improve accuracy-cost trade-offs when steering black-box LLMs.
- Software: Cost-optimized black-box LLM steering in production apps
- Use case: Insert a small, trainable guide that emits a structured strategy block before calling a cheaper black-box core (e.g., for chatbots, summarization, drafting, analytics).
- Tools/products/workflows: “Guide layer” microservice; strategy schema + deterministic parser; acceptance-sampling data pipeline; executability dashboard; non-degradation A/B tests; budget controller for λ in the utility objective.
- Impact: Reduce recurring API cost and latency while improving reliability; the paper reports up to 22.4% lower inference cost alongside higher accuracy.
- Assumptions/Dependencies: Access to a strong teacher for acceptance sampling; validator for task success; stable core API behavior; lightweight compute to fine-tune an open-weight guide.
- Software Engineering: Code generation and review with strategy steering
- Use case: IDE extensions that generate an explicit plan (tests-to-pass, function signatures, constraints) which a black-box model follows to produce code; CI uses unit-test-based acceptance sampling to curate strategy datasets.
- Tools/products/workflows: VS Code/JetBrains plugin; strategy-aware code-gen API; CI/CD “Guide CI” with schema checks; regression guardrails via non-degradation shaping.
- Impact: Higher Pass@1 on in-domain and out-of-domain code with lower API spend; fewer brittle completions.
- Assumptions/Dependencies: Test harnesses; reliable code validators; permission to store logs for strategy SFT.
- Education: Structured math reasoning and tutoring
- Use case: Tutor systems generate a parseable problem decomposition (without leaking the final answer) and have a core produce step-by-step solutions aligned to that plan.
- Tools/products/workflows: Strategy schema aligned to pedagogical rubrics; judge service to penalize answer leakage; cost-aware routing for practice vs assessment modes.
- Impact: More consistent reasoning quality on math word problems with lower per-session cost.
- Assumptions/Dependencies: Safe design to prevent revealing answers in the strategy; task reward/validator (exact match, rubric-based).
- Enterprise RAG/Tool Use: Reliable tool-call and workflow plans
- Use case: Guide emits a structured tool plan (which tools, in what order, what inputs), and a core executes it; improves reproducibility and auditability in customer support, legal research, or analytics pipelines.
- Tools/products/workflows: Strategy schemas for tool selection and parameterization; deterministic validators for tool effects; telemetry on executability q(s, z).
- Impact: Fewer tool-call failures; better traceability; lower latency/cost compared to free-form agent loops.
- Assumptions/Dependencies: Tool APIs with deterministic interfaces; lightweight judges to assess plan quality and leakage.
- LLMOps/Platform Engineering: A “Guide Layer” for model orchestration
- Use case: Add a reusable “strategy IR” layer in LLM platforms (LangChain, Semantic Kernel) that standardizes planning across prompts and models.
- Tools/products/workflows: Guide registry/versioning; canary deploys; executability SLOs; cost-utility monitoring; structured RL (GRPO) training jobs.
- Impact: Modular updates to strategies per domain or tenant without changing the core model; faster iteration cycles.
- Assumptions/Dependencies: Organizational buy-in to treat strategies as first-class interface contracts.
- FinOps for AI: Utility-aware routing and budgeting
- Use case: Apply the reward–cost objective to dynamically choose when to use a guide+cheap core vs a stronger (expensive) core; set λ based on latency/cost SLAs.
- Tools/products/workflows: Policy engine for λ and routing; agreement-based cascades integrated with GCOP; dashboards tracking value gap vs cost.
- Impact: Predictable spend with minimal quality loss; graceful degradation under load.
- Assumptions/Dependencies: Calibrated validators and confidence metrics; access to multiple core APIs.
- Safety/Compliance Operations: Modular alignment updates
- Use case: Update guides to incorporate new safety rules (red-team learnings, policy changes) without retraining or swapping the core.
- Tools/products/workflows: Safety strategy templates; non-leakage judges; anti-regression tests to ensure no harm vs unguided baseline.
- Impact: Faster alignment changes; reduced vendor lock-in.
- Assumptions/Dependencies: Effective judge prompts; periodic audits to detect drift.
- Data/Labeling: High-quality instruction and strategy dataset curation
- Use case: Generate training data by teacher-proposing strategies and accepting only those that the target core executes successfully; bootstraps SFT corpora for new domains.
- Tools/products/workflows: Acceptance-sampling pipeline; programmatic validators (exact match, unit tests, heuristics); dataset versioning.
- Impact: Smaller but higher-yield datasets; faster domain adaptation.
- Assumptions/Dependencies: Access to a sufficiently strong teacher and reliable validators.
- Public Sector/Procurement: Contracting on “executability” and utility
- Use case: Include executability and net-utility targets in vendor SLAs, emphasizing cost-quality trade-offs over raw accuracy.
- Tools/products/workflows: Standardized reporting of executability a(s), degradation rates, and cost per successful task.
- Impact: Transparent cost control; better comparability across black-box vendors.
- Assumptions/Dependencies: Agreement on metrics definitions and validation protocols.
- Knowledge Work/Daily Life: Personal assistants with guide-on-device + core-in-cloud
- Use case: A small on-device guide plans tasks (email triage steps, meeting prep agenda), while the cloud core drafts or executes; reduces tokens and improves responsiveness.
- Tools/products/workflows: Mobile guide runtime; cached strategy libraries; offline-first planners; budget-aware fallbacks.
- Impact: Lower latency and cost; improved predictability.
- Assumptions/Dependencies: Lightweight on-device model and privacy controls.
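The utility-aware routing use case in the list above (FinOps for AI) can be illustrated with a few lines. This is a hedged sketch: the `Route` type, the route names, and all numbers are hypothetical and are not taken from the paper. It simply picks whichever option maximizes accuracy minus λ times cost. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    expected_accuracy: float  # estimated on a held-out validation set
    expected_cost: float      # any consistent unit, e.g. dollars per 1,000 requests


def pick_route(routes: list[Route], lam: float) -> Route:
    """Choose the route with the highest accuracy - lam * cost."""
    return max(routes, key=lambda r: r.expected_accuracy - lam * r.expected_cost)


# Hypothetical numbers for illustration only.
ROUTES = [
    Route("guide+cheap-core", expected_accuracy=0.90, expected_cost=1.0),
    Route("strong-core", expected_accuracy=0.93, expected_cost=3.0),
]

if __name__ == "__main__":
    for lam in (0.0, 0.01, 0.05):
        print(lam, pick_route(ROUTES, lam).name)
```

With these made-up numbers the break-even point is λ = 0.015: below it the stronger core wins, above it the guided cheap core does. In practice the accuracy and cost estimates would come from calibrated validators and real billing data.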
Long-Term Applications
The following opportunities are promising but require further research, scaling, or standardization beyond the paper’s single-turn scope.
- Multi-turn Agent Systems and Tool Use at Scale (software, robotics, IT ops)
- Use case: Guides perform long-horizon planning and mid-course corrections; cores execute iterative steps with memory and recovery from failures.
- Tools/products/workflows: Turn-by-turn executability metrics; credit assignment across strategy updates; recovery policies.
- Dependencies: Robust multi-turn validators; strategy update protocols; memory safety and state management.
- Safety-Critical Domains (healthcare, finance, legal)
- Use case: Strategy-first clinical reasoning, risk analysis, or legal research plans with strict non-leakage and audit trails; core produces narratives consistent with the plan.
- Tools/products/workflows: Domain-specific schemas and validators (clinical guidelines, regulatory checklists); human-in-the-loop review; safety cases.
- Dependencies: Regulatory approvals, bias/harm audits, gold-standard validators, extensive trials.
- Enterprise Unlearning/Targeted Remediation via Guide Updates
- Use case: Remove or attenuate specific behaviors (e.g., deprecated policies, sensitive topics) by updating guides rather than retraining cores.
- Tools/products/workflows: Negative reward shaping with coverage tests; compliance attestations; rollback tooling.
- Dependencies: Verified absence of leakage paths; robust detection of residual behaviors.
- “Strategy IR” and LLM Compiler Toolchains
- Use case: Treat strategies as an intermediate representation optimized by compiler-like passes (normalize, simplify, cost-tune) before execution.
- Tools/products/workflows: IR spec; optimization passes; cost models; static analyzers for parseability and leakage.
- Dependencies: Broad community adoption of schemas; benchmarks for IR transformations.
- Marketplaces for Domain Guides
- Use case: Third-party, versioned guide packages (e.g., tax prep, biotech search) compatible with many cores; monetized via executability guarantees.
- Tools/products/workflows: Guide registry; interoperability certification; licensing and telemetry standards.
- Dependencies: IP frameworks for strategy artifacts; cross-vendor schema standards.
- Standardization and Policy: Executability Metrics and Audit Requirements
- Use case: Industry/standards bodies (e.g., ISO/IEEE) define “executability” KPIs, non-degradation thresholds, and disclosure formats for black-box LLM services.
- Tools/products/workflows: Certification suites; reference validators; reporting templates.
- Dependencies: Consensus on metrics; representative test corpora; regulator engagement.
- Dynamic Utility Control in Serving Stacks
- Use case: Real-time control of λ (cost sensitivity) to adapt reasoning effort under load, budgets, or latency SLAs; tie into autoscaling and routing.
- Tools/products/workflows: Control loops integrated with traffic managers; utility-aware schedulers; per-tenant policies.
- Dependencies: Stable and monotonic cost/quality curves; accurate short-term demand forecasts.
- Privacy-Preserving On-Device Guides + Cloud Cores
- Use case: Federated or differential-privacy-aware adaptation of guides on endpoints; acceptance sampling done locally with privacy budgets.
- Tools/products/workflows: Federated GRPO; private validators; secure telemetry aggregation.
- Dependencies: Efficient on-device training; privacy guarantees and audits.
- Curriculum and Lifelong Adaptation
- Use case: Systems that continuously curate accepted strategies and retrain guides for new domains/tasks, improving executability over time.
- Tools/products/workflows: Data flywheels with drift detection; domain adaptation protocols; safety rails to avoid catastrophic forgetting.
- Dependencies: High-quality feedback signals at scale; governance for data provenance and safety.
- Software Modernization at Scale (code)
- Use case: Strategy-driven refactoring/migration (e.g., Python 2→3, framework upgrades) with validator suites; cores execute consistent, incremental changes.
- Tools/products/workflows: Strategy templates for migration playbooks; regression test harnesses; batch executability monitors.
- Dependencies: Comprehensive test coverage; safe rollout and rollback mechanisms.
- Incident Response and Runbooks (IT/SRE)
- Use case: Guides propose structured incident runbooks; cores execute diagnostic and remediation steps with tool access.
- Tools/products/workflows: Tool schemas; safety interlocks; postmortem-aware feedback for acceptance sampling.
- Dependencies: Reliable tool instrumentation; strong validators to prevent harmful actions.
- Scientific Workflows (academia, biotech, materials)
- Use case: Strategy plans for experiment design or code notebooks; cores execute data analysis/code generation reproducibly.
- Tools/products/workflows: Domain validators (unit tests, statistical checks); provenance tracking; lab notebook integration.
- Dependencies: High-fidelity validators; domain-specific safety/ethics approvals.
Cross-cutting Assumptions and Dependencies
- Access to a strong “teacher” model and a cheaper “core” API; stable API behavior over time.
- Well-defined, automatable validators of success (exact match, unit tests, rubric, tool outcomes).
- Ability to parse and enforce a strategy schema; reliable LLM-as-a-judge for shaping signals.
- Compute and data to fine-tune small open-weight guides; MLOps/LLMOps maturity for CI/CD.
- Current results are single-turn; multi-turn, tool-rich settings require further research (credit assignment, memory, robustness).
- Data governance, privacy, and safety policies must be in place when logging strategies and outcomes.
Glossary
- Acceptance rate: The proportion of proposed strategies that are accepted by a validator during data curation or evaluation. "where As is the acceptance rate."
- Accepted-strategy distribution: A reweighted distribution over strategies that pass a validator’s acceptance criterion, used to train guides toward executable outputs. "This defines an accepted-strategy distribution Tacc (. | s) (see Appendix A.6)."
- Advisor-style models: Learned steering models that provide advice to a fixed black-box core, often optimizing advice quality without matching downstream execution constraints. "advisor-style models typically optimize advice quality in isolation rather than the downstream constraints imposed by a smaller, cost-constrained core."
- Agentic inference: Decision-making by LLM-based agents framed as optimizing reward under inference-time costs. "agentic inference can be formally written as a cost-sensitive net utility objective"
- Agentic systems: Composed LLM systems that separate reasoning and execution to amortize compute via reusable intermediate representations. "motivating the design of agentic systems that amortize expensive reasoning into reusable intermediate representations or memories."
- Auto-CoT: An automatic chain-of-thought prompting method that elicits intermediate reasoning steps. "Tool-augmented prompting methods such as ReAct (Yao et al., 2022) and Auto-CoT (Zhang et al., 2022) highlight how LLMs can dynamically generate intermediate reasoning or query tools to improve response quality."
- Black-box APIs: Model access interfaces that expose text I/O without internal weights or token-level controls. "For LLMs deployed through black-box APIs, recurring inference costs often dominate one-time training costs"
- Black-box core model: The fixed target LLM (accessed via API) that executes the strategy produced by a guide to generate the final output. "a black-box core model executes it to produce the final output for a given task."
- Chain-of-Thought distillation: Transferring step-by-step reasoning traces from larger to smaller models to teach reasoning. "Step-by-step distillation (Hsieh et al., 2023) and Chain-of-Thought distillation (Do et al., 2025) show that intermediate rationales help teach reasoning to smaller models."
- Constitutional AI: An alignment framework that replaces or supplements human feedback with AI-written principles and critiques. "They complement alignment frameworks such as InstructGPT (Ouyang et al., 2022) and Constitutional AI (Bai et al., 2022), which apply RLHF or AI-based critiques."
- Cost-sensitive net utility objective: An objective that balances task reward against inference-time cost to reflect deployment trade-offs. "agentic inference can be formally written as a cost-sensitive net utility objective"
- DExperts: A decoding-time steering method using expert and anti-expert models to bias generation. "FUDGE (Yang & Klein, 2021) and DExperts (et al., 2021a) steer generation through learned discriminators or ensemble scoring,"
- Direct Preference Knowledge Distillation (DPKD): A distillation approach that uses preference signals or reverse-KL to preserve quality and style in smaller models. "Direct Preference Knowledge Distillation (DPKD) (Li et al., 2024) use reward modeling or reverse-KL objectives to preserve quality and stylistic traits."
- Directional Stimulus Prompting: A learned prompting method that trains controllers to produce targeted steering instructions for black-box LLMs. "Guide Models (Asawa et al., 2025) and Directional Stimulus Prompting (Li et al., 2023) train lightweight controllers to craft tailored prompts"
- Environment/validator: The evaluation mechanism (tests, metrics, or rules) that determines success or failure of a model’s execution. "where G = 1 denotes 'good execution' under the environment/validator."
- Executability: The likelihood that a guide-produced strategy is parseable and can be faithfully executed by the core. "Our analysis identifies executability as the key component in GCoP: the probability that a guide-produced strategy is parseable and can be successfully followed by the core."
- Execution success: The outcome in which the core, conditioned on a strategy, completes the task correctly (e.g., passes tests). "explicit rewards for syntactic validity, execution success, and cost-efficient behavior"
- EXECTUNE: A two-stage training recipe (SFT + structure-aware RL) that optimizes guides for validity, success, and cost efficiency. "we propose EXECTUNE, a principled training recipe that combines teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning"
- Finite-horizon interactions: Tasks modeled over a fixed number of steps with bounded cumulative reward. "We model tasks as finite-horizon interactions with reward bounded by Rmax"
- FUDGE: A decoding-time control method that uses future discriminators to guide generation toward desired properties. "FUDGE (Yang & Klein, 2021) and DExperts (et al., 2021a) steer generation through learned discriminators or ensemble scoring,"
- GCOP (Guide-Core Policies): A composed policy family where a trainable guide emits a strategy that a black-box core executes. "Guide-Core Policies (GCoP) that decomposes reasoning into two stages: a small trainable-guide model generates a high-level plan, advice, or strategy, and a small black-box core model executes it"
- GRPO: A reinforcement learning algorithm used to refine guide models with structured rewards and penalties. "We further refine the guide with GRPO (Shao et al., 2024) algorithm using a shaped reward"
- Guide-averaged executability: The expected execution success probability over strategies sampled by the guide, controlling performance under GCOP. "guide-averaged executability: the probability that a strategy can be faithfully followed by the core."
- In-context learning (ICL): A core-only baseline that conditions on retrieved exemplars without parameter updates. "ICL (core-only), where we retrieve the three nearest (problem, solution) training pairs"
- Knowledge distillation: Techniques to transfer capabilities from large to smaller models using teacher outputs. "knowledge distillation techniques aim to transfer capabilities from large models to smaller ones."
- Least-to-Most Prompting: A prompting approach that decomposes problems into ordered subproblems to facilitate reasoning. "structured prompting techniques like Least-to-Most Prompting (Zhou et al., 2022) and Program-of-Thought prompting (Chen et al., 2022)"
- LLM-as-a-Judge: Using an LLM to grade or score generated strategies for quality, usefulness, or leakage. "Let J(s, z) ∈ [0, 1] be an LLM-as-a-Judge score applied to the parsed strategy"
- Matryoshka Pilot (M-Pilot): A framework that treats the LLM as an environment and uses a learned guide to decompose tasks. "Matryoshka Pilot (M-Pilot) (Li et al., 2025) treats the LLM as an environment, with a learned guide breaking complex tasks into subtasks across multiple turns."
- Model cascades: Deployment strategies that route queries among models of varying sizes to manage cost and accuracy. "system-level methods route queries through model cascades."
- Net utility: The objective value that subtracts deployment cost from task reward to balance performance and compute. "we optimize the net utility JT (So)"
- No-negative-behavior shaping: A training penalty that discourages strategies which degrade core performance relative to the unguided baseline. "No-negative-behavior shaping. Let y denote the final output produced by the core."
- Parseable strategy: A guide output that conforms to a strict schema so it can be deterministically checked and executed. "We require the guide to emit a well-formed <strategy> ... </strategy> block that is parseable by a deterministic checker."
- Pass@1: The proportion of tasks solved correctly by the first sampled solution, commonly used in code evaluation. "We report Pass@1 for code generation"
- Program-of-Thought prompting: A technique that separates computation from reasoning by eliciting structured intermediate steps. "structured prompting techniques like Least-to-Most Prompting (Zhou et al., 2022) and Program-of-Thought prompting (Chen et al., 2022)"
- ReAct: A prompting paradigm that interleaves reasoning and acting, often via tool calls during generation. "Tool-augmented prompting methods such as ReAct (Yao et al., 2022) and Auto-CoT (Zhang et al., 2022) highlight how LLMs can dynamically generate intermediate reasoning or query tools to improve response quality."
- Reverse-KL objectives: Divergence-based training objectives that bias a student toward a teacher’s distribution while preserving style/quality. "reverse-KL objectives to preserve quality and stylistic traits."
- Reward-cost trade-off: The empirical balance between task performance and inference expense under deployment constraints. "Figure 1: Reward-cost trade-off."
- RLHF: Reinforcement learning from human feedback, used for aligning LLM behaviors to human preferences. "which apply RLHF or AI-based critiques."
- Shaped reward: A modified reward signal that includes structural bonuses/penalties to reinforce executable and non-leaky strategies. "using a shaped reward that enforces a structured guide-core interface"
- State visitation distribution: The distribution over states encountered by a policy, used in theoretical bounds and analysis. "where dL is the (average) state visitation distribution under TL."
- Student-teacher mixture analysis: A modeling view that expresses the induced policy as a mixture of teacher-aligned and non-executable behaviors. "student-teacher mixture analysis (see Figure 2)"
- Structure-aware reinforcement learning: RL that explicitly encodes structural constraints (e.g., strategy format) and execution outcomes in the reward. "structure-aware reinforcement learning with explicit rewards for syntactic validity, execution success, and cost-efficient behavior"
- Supervised fine-tuning (SFT): Post-training that teaches the guide to imitate accepted strategies tailored to the core’s constraints. "teacher-guided acceptance sampling and supervised fine-tuning"
- Teacher-guided acceptance sampling: Data curation where a strong teacher proposes strategies that are accepted only if the target core succeeds under validation. "teacher-guided acceptance sampling"
- Tool-augmented prompting: Prompting that instructs models to call external tools or APIs during reasoning to improve accuracy. "Tool-augmented prompting methods such as ReAct (Yao et al., 2022) and Auto-CoT (Zhang et al., 2022)"
- Value gap: The performance difference between a composed policy and a stronger baseline, controlled by executability. "the value gap to a large black-box baseline is controlled by guide-averaged executability"
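Several entries above (parseable strategy, LLM-as-a-Judge, no-negative-behavior shaping, shaped reward) describe pieces of one training signal. The sketch below shows one way they could fit together. It is an assumption-heavy illustration, not the paper's reward: the additive form, the `degrade_penalty` weight, and the substring leakage check are all invented here for clarity (the paper's leakage signal appears to come from the judge).

```python
import re

# Assumed schema: the whole output is a single <strategy>...</strategy> block.
STRATEGY_RE = re.compile(r"^\s*<strategy>(.+?)</strategy>\s*$", re.DOTALL)


def shaped_reward(
    guide_output: str,
    guided_correct: bool,    # did the core succeed when following the strategy?
    unguided_correct: bool,  # did the core succeed with no strategy?
    judge_score: float,      # LLM-as-a-Judge score in [0, 1]
    final_answer: str,       # reference answer, used only for the leakage check
    degrade_penalty: float = 1.0,  # hypothetical weight
) -> float:
    """Toy shaped reward: structure gate, leakage gate, success, judge, non-degradation."""
    match = STRATEGY_RE.match(guide_output)
    if match is None:
        return 0.0  # malformed output: the core cannot parse it
    strategy = match.group(1)
    if final_answer and final_answer in strategy:
        return 0.0  # crude leakage check: the plan contains the final answer
    reward = float(guided_correct) + judge_score
    if unguided_correct and not guided_correct:
        reward -= degrade_penalty  # the strategy made the core worse than baseline
    return reward
```

Gating on structure first means a guide cannot earn reward for advice the core could never parse, which matches the emphasis on executability over advice quality in isolation.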