Combee: Scaling Prompt Learning for Self-Improving Language Model Agents

This presentation explores Combee, a breakthrough framework that solves the critical bottleneck preventing large-scale deployment of self-improving language model agents. When agents learn from their experiences by updating shared prompts, naive parallel scaling destroys the quality of learned knowledge—a phenomenon called context overload. Combee introduces a distributed Map-Shuffle-Reduce architecture with hierarchical aggregation, augmented shuffling, and dynamic batch control to achieve up to 17× speedup while preserving or improving accuracy across reasoning, extraction, and agent tasks.
Script
When language model agents learn from experience by updating their own prompts in parallel, a hidden crisis emerges: the more agents you run simultaneously, the less they actually learn. This is context overload, and until now, it has blocked the path to truly scalable, self-improving agent collectives.
On the Formula benchmark, scaling from 1 to 100 parallel agents causes a catastrophic knowledge collapse. The aggregated prompt shrinks from 264 specific strategies down to just 21 generic platitudes, and accuracy drops 15 percentage points. This isn't a context-window limit; the text still fits comfortably. It's architectural information loss, and it appears across every task the researchers tested.
Combee treats prompt learning as a distributed systems challenge, not just a prompting trick.
Where naive scaling forces one aggregator to compress hundreds of reflections into a single prompt, Combee builds a balanced aggregation tree. Sub-batches merge hierarchically, like a tournament bracket for ideas. Augmented shuffling duplicates each reflection and shuffles it across groups, ensuring critical insights survive the reduction process. A dynamic controller profiles runtime performance to find the largest safe batch size for each task.
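The three mechanisms above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `llm_merge` stands in for an LLM call that condenses a small group of reflections into one, and the function names `aggregate` and `safe_batch_size` are hypothetical.

```python
import random


def aggregate(reflections, llm_merge, fan_in=8, copies=2, seed=0):
    """Sketch of hierarchical aggregation with augmented shuffling.

    llm_merge(group) -> str is an assumed callable that merges a small
    group of reflection strings into one condensed reflection.
    """
    rng = random.Random(seed)
    # Augmented shuffling: duplicate each reflection and shuffle, so every
    # insight lands in more than one group and can survive a lossy merge.
    pool = reflections * copies
    rng.shuffle(pool)
    # Hierarchical reduce: merge fixed-size groups level by level, like a
    # tournament bracket, until a single aggregated prompt remains.
    while len(pool) > 1:
        groups = [pool[i:i + fan_in] for i in range(0, len(pool), fan_in)]
        pool = [llm_merge(g) for g in groups]
    return pool[0]


def safe_batch_size(profile, floor_acc, sizes=(10, 20, 30, 40)):
    """Dynamic batch control sketch: profile(b) -> accuracy at batch size b.

    Returns the largest candidate size whose profiled accuracy stays at or
    above floor_acc, falling back to the smallest size if none qualifies.
    """
    ok = [b for b in sizes if profile(b) >= floor_acc]
    return max(ok) if ok else min(sizes)
```

The point of the duplication step is redundancy: because each reflection enters the bracket in two different groups, a single bad merge cannot erase it, which is why the reduction no longer collapses into generic platitudes at high parallelism.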
This snapshot shows the core result on AppWorld with DeepSeek V3.1. The horizontal axis represents training time; the vertical axis tracks task completion quality. Combee's curve hugs the top-left corner: it reaches the highest accuracy scores while slashing training duration by an order of magnitude. The key mechanism is preserving prompt content density even at batch size 40, where naive methods have already collapsed.
On Terminal Bench, Combee reduces training from 42 minutes to 2.4 minutes at batch 30 while recovering nearly all the accuracy that naive high-parallel runs destroy. Across AppWorld, Formula, and financial entity tagging, it consistently hits the quality ceiling of small-batch methods but at a fraction of the wall-clock cost. The final playbooks aren't just faster to produce—they're dramatically richer, containing 6,887 tokens of actionable knowledge versus 526 in collapsed prompts.
Combee proves that self-improving agents can scale without forgetting. By treating context aggregation as a distributed systems problem, it unlocks the next generation of parallelized, continuously learning language model collectives. Visit EmergentMind.com to explore the full paper and create your own research video.