- The paper presents a training-free in-context distillation method that leverages teacher LLM demonstrations to guide low-cost student models.
- It integrates a self-consistency cascade that samples multiple candidate actions and defers to teacher outputs when inconsistencies arise.
- Empirical results on ALFWorld and AppWorld show significant cost reductions with near-teacher accuracy, streamlining agile LLM agent deployment.
In-Context Distillation with Self-Consistency Cascades for Rapid, Cost-Efficient LLM Agent Deployment
The paper presents a novel framework for reducing inference costs associated with deploying LLM-based agents, focusing on dynamic prototyping and large-scale execution scenarios where traditional model distillation and prompt engineering approaches are impractical. Classical model distillation, which requires extensive training of a student model to mimic a teacher and fine-tuning for each target domain, induces substantial deployment friction. Prompt engineering, although useful for boosting small model performance, demands significant human intervention and tends to produce brittle solutions. As agentic systems proliferate into bespoke workflows and adaptive automation, the demand for cost-efficient yet agile approaches becomes paramount.
The authors formalize the goal as maximizing the success rate of agentic task completion by an LLM-based agent over a massive evaluation set (T_test), while minimizing total inference cost. The key challenge is to render smaller, cheaper LLMs more effective on multi-step tasks without incurring retraining or extensive system-engineering overhead.
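A plausible rendering of this objective (the notation here is assumed for illustration, not quoted from the paper: π denotes the agent policy, success(·) a per-task completion indicator, and B an inference-cost budget) is:

```latex
% Assumed notation, not the paper's verbatim formulation:
% choose the agent policy \pi that maximizes task success on T_test
% while keeping total inference cost within a budget B.
\max_{\pi} \; \mathbb{E}_{t \sim T_{\mathrm{test}}}\!\left[\mathrm{success}(\pi, t)\right]
\quad \text{s.t.} \quad \sum_{t \in T_{\mathrm{test}}} \mathrm{cost}(\pi, t) \le B
```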
Methodology
In-Context Distillation
The core contribution is a training-free in-context distillation mechanism that utilizes existing high-capability ("teacher") LLMs to generate exemplar trajectories on a small set of demonstration tasks. These teacher demonstrations encapsulate goals, plans, observations, intermediate reasoning traces, and final actions, and are stored in a vector database with dense embedding indexes (e.g., built with MiniLM-L6-v2). At test time, a lower-cost, frozen student LLM dynamically retrieves the k nearest teacher exemplars pertinent to its current step and injects them into its prompt as in-context guidance. Thus, rather than updating model weights, the approach adaptively conditions agent behavior on high-quality trajectory fragments in real time.
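As a concrete illustration, the per-step retrieval might look like the following minimal Python sketch. This is not the authors' implementation: the fragment format, the function names `retrieve_exemplars` and `build_student_prompt`, and the use of sentence-transformers with the all-MiniLM-L6-v2 checkpoint are assumptions for illustration.

```python
# Minimal sketch of per-step exemplar retrieval (not the authors' code).
# Assumes teacher trajectory fragments are stored as text snippets pairing
# an observation with the action the teacher took.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # dense embedding model

# Offline: embed teacher demonstration fragments once.
teacher_fragments = [
    "Goal: heat an egg. Observation: you see a microwave. Action: open microwave",
    "Goal: clean a mug. Observation: you see a sink. Action: go to sink",
    # ... hundreds of fragments collected from teacher rollouts
]
fragment_vecs = encoder.encode(teacher_fragments, normalize_embeddings=True)

def retrieve_exemplars(step_context: str, k: int = 3) -> list[str]:
    """Return the k teacher fragments nearest to the student's current step."""
    query = encoder.encode([step_context], normalize_embeddings=True)[0]
    scores = fragment_vecs @ query  # cosine similarity (vectors are normalized)
    top = np.argsort(-scores)[:k]
    return [teacher_fragments[i] for i in top]

def build_student_prompt(step_context: str) -> str:
    """Inject retrieved teacher exemplars into the student prompt as guidance."""
    exemplars = "\n\n".join(retrieve_exemplars(step_context))
    return (
        f"Expert demonstrations:\n{exemplars}\n\n"
        f"Current step:\n{step_context}\nNext action:"
    )
```

Because retrieval is keyed to the current step rather than the whole task, only the most relevant trajectory fragments enter the context window at any given time.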
Self-Consistency Cascades
Relying solely on in-context distillation may leave the student uncertain when retrieved demonstrations are sparse or mismatched. The framework addresses this by coupling the student model's output with a self-consistency cascade. Multiple candidate actions (default N=3) are sampled for every agent step using the same in-context exemplars. If these outputs are mutually consistent (identical or semantically equivalent, optionally validated by an auxiliary LLM verifier), the student's action is executed. If the outputs diverge, the system adaptively defers to the expensive teacher model for that step. This mechanism exploits intrinsic model introspection and avoids the need for separate routing classifiers or trained confidence estimators.
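A minimal sketch of the gating logic follows, assuming `student` and `teacher` are callables wrapping the respective LLM APIs; the names and the exact-match agreement test are illustrative simplifications.

```python
# Minimal sketch of the self-consistency cascade (not the authors' code).
from collections import Counter
from typing import Callable

def cascade_step(
    prompt: str,
    student: Callable[[str], str],   # cheap model, sampled with temperature > 0
    teacher: Callable[[str], str],   # expensive model, used only on disagreement
    n_samples: int = 3,              # paper default N=3
) -> str:
    """Sample N student actions; execute if they agree, else defer to teacher."""
    candidates = [student(prompt) for _ in range(n_samples)]
    action, count = Counter(candidates).most_common(1)[0]
    if count == n_samples:  # all samples identical -> student is confident
        return action
    # Divergent samples signal uncertainty: route this step to the teacher.
    # (The paper optionally relaxes exact match to semantic equivalence,
    # judged by an auxiliary LLM verifier.)
    return teacher(prompt)
```

Because agreement is checked per step, the teacher is invoked only on the specific steps where the student's samples diverge, not on whole episodes.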
Empirical Results
Experiments are conducted on two multi-step agent benchmarks: ALFWorld (embodied reasoning) and AppWorld (API workflow automation). The student-teacher framework utilizes Claude Sonnet 4.5 as teacher and GPT-4.1-mini or Llama-3.3-70B as students.
Key findings include:
- On ALFWorld, in-context distillation alone boosts student accuracy to 97% of the teacher's at 43% of the teacher's cost. Adding self-consistency gating brings cost to 2.5x below the teacher's ($0.024 vs. $0.059 per episode) while slightly exceeding teacher accuracy (96% vs. 89%).
- On AppWorld, the combined system achieves 3.5x cost reduction, recovers 79% of teacher accuracy, and establishes a new Pareto frontier for cost-performance trade-offs.
- Data efficiency is pronounced: in ALFWorld, only 100 teacher demonstrations suffice for the student to recover >94% of teacher accuracy, with further improvements from scaling the database to 500 examples.
- Dynamic per-step retrieval of trajectory fragments is shown to match the accuracy of single-shot trajectory retrieval while significantly reducing context size and cost.
- Cost amortization analysis demonstrates that the upfront expense of collecting teacher demonstrations is rapidly offset; e.g., on ALFWorld, breakeven occurs after 843 episodes, with cumulative savings exceeding $34,900 over 1M processed episodes (see the worked check after this list).
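As a back-of-envelope check (assumed arithmetic using the reported ALFWorld figures), the breakeven and savings numbers are mutually consistent:

```python
# Back-of-envelope amortization check using the reported ALFWorld numbers.
teacher_cost = 0.059   # $ per episode (teacher alone)
cascade_cost = 0.024   # $ per episode (student + self-consistency cascade)
savings_per_episode = teacher_cost - cascade_cost            # $0.035

breakeven_episodes = 843                                     # reported breakeven
upfront_demo_cost = breakeven_episodes * savings_per_episode
print(f"implied upfront demo cost: ~${upfront_demo_cost:.2f}")     # ~ $29.5

episodes = 1_000_000
cumulative_savings = episodes * savings_per_episode - upfront_demo_cost
print(f"net savings at 1M episodes: ~${cumulative_savings:,.0f}")  # ~ $34,970
```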
Comparative Analysis and Practical Context
The method is evaluated against random and confidence-based cascades, zero-shot student performance, classical model selection, and more sophisticated compound agentic systems (IBM CuGA). Across all baselines, the in-context distillation cascade consistently dominates, either matching or exceeding teacher-level accuracy at a fraction of the cost. Notably, the training-free nature obviates the challenges encountered with fine-tuning (such as objective misalignment and stability issues), streamlining deployment cycles for practitioners lacking specialized ML infrastructure.
Difficulty-aware routing, which uses oracle knowledge of task complexity, can further improve cost-accuracy trade-offs in domains where such metadata is available, but the proposed self-consistency gating is robust when explicit difficulty metrics are absent.
The framework generalizes across both proprietary and open-weight LLMs, and is orthogonal to other prompt optimization, memory, or selection enhancements found in contemporary agent improvement literature.
Theoretical and Practical Implications
The paper elucidates several important implications:
- Non-Parametric Distillation: By leveraging retrieval-based in-context learning for trajectory transfer, the framework sidesteps weight optimization, enabling instant specialization of frozen models and immediate applicability to new domains.
- Cost-Efficient Adaptation: Combining in-context distillation with self-consistency routing allows for near-teacher accuracy at a fraction of the cost, making large-scale agent deployment viable for resource-constrained or specialized applications.
- Rapid Prototyping: The framework facilitates swift experimentation and iteration, essential for rapid agent design and deployment in dynamic environments.
- Scalability: Upfront demonstration costs amortize quickly, yielding large savings at deployment scale, as confirmed by the detailed cost analyses.
Future Directions
Extensions of this approach could plausibly include:
- Incorporation of more sophisticated retrieval (e.g., state-aware or trajectory-level predictors)
- Hybridization with online self-improving agents for continual adaptation
- Integration with caching, compression, or active demonstration selection to further minimize token usage
- Expansion to scenarios involving multi-agent coordination or environments with highly unstructured action spaces
The framework opens avenues for flexible, robust agent architectures that blend the strengths of large frozen models, non-parametric adaptation, and uncertainty-aware routing.
Conclusion
In-context distillation combined with self-consistency cascades presents a robust, training-free paradigm for deploying cost-efficient LLM agents capable of handling complex, multi-step tasks. The approach achieves strong cost reductions and teacher-level accuracy without retraining, human prompt engineering, or advanced system infrastructure, thereby substantially lowering barriers for agile agent prototyping and deployment. The methodology generalizes across diverse LLM architectures and application domains, promising scalable and economically viable agentic AI for industry and research (2512.02543).