Do LLMs Benefit From Their Own Words?
This presentation examines a surprising discovery about how large language models handle multi-turn conversations. Researchers challenged the default assumption that keeping all previous assistant responses improves conversation quality, revealing that omitting an LLM's own prior outputs often maintains or even improves response quality while dramatically reducing computational costs. Through controlled experiments on real-world conversations, the study demonstrates that selective context management can achieve 95% of full-context performance using only 70% of the tokens, with implications for efficiency, error propagation, and the design of conversational AI systems.

Script
When a language model responds in a conversation, does it actually benefit from seeing its own previous answers? The standard practice says yes, keep everything. But this paper reveals something counterintuitive: omitting those assistant turns often works just as well, and sometimes even better.
The researchers tested this on real technical conversations from WildChat and ShareLM, comparing full context against a version where assistant turns were replaced with simple placeholders. Four models were evaluated, from 4 billion to frontier scale.
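The ablation described here can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: it assumes the common role/content chat-message convention and a made-up placeholder string, replacing every assistant turn's content while leaving user turns intact.

```python
# Hypothetical placeholder text; the paper's exact wording is not shown here.
PLACEHOLDER = "[assistant response omitted]"

def omit_assistant_turns(messages):
    """Return a copy of the chat history with assistant contents replaced
    by a placeholder, keeping all user turns untouched."""
    return [
        {**m, "content": PLACEHOLDER} if m["role"] == "assistant" else m
        for m in messages
    ]

history = [
    {"role": "user", "content": "Write a function to parse dates."},
    {"role": "assistant", "content": "def parse_date(s): ..."},
    {"role": "user", "content": "Now add timezone support."},
]

trimmed = omit_assistant_turns(history)
# The assistant turn now reads "[assistant response omitted]"; the model
# receiving `trimmed` sees the user's requests but not its own prior output.
```

The key design point is that the conversational scaffolding (turn order, roles) is preserved, so the model still knows an exchange happened, only the content of its own replies is withheld.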
Why does this work? The answer lies in how people actually write follow-up prompts. Over 36% of user turns are completely self-contained: they carry their own concrete feedback or restate prior instructions, so they don't require the assistant's previous response at all.
But there's another issue lurking beneath the surface.
The researchers observed this repeatedly: a model makes a mistake in one turn, then doubles down on it in the next, treating its own flawed output as authoritative. Omitting that context breaks the cycle.
The choice between these strategies isn't binary. An adaptive approach that learns when to omit context per turn achieved 95% of full-context performance while using only 70% of the tokens.
This adaptive classifier fuses prompt type, conversation embeddings, and round metadata to decide on a per-turn basis. Simple heuristics like omitting on every new question proved inferior, confirming that learned, fine-grained filtering is essential.
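The per-turn decision described above can be illustrated with a toy scorer. Everything below is hypothetical (the feature names, weights, and logistic form are assumptions, not the paper's implementation); it only shows the idea of fusing the three feature groups mentioned, prompt type, a conversation embedding, and round metadata, into a single keep-or-omit decision.

```python
import math

def decide_omit(prompt_type_onehot, conv_embedding, round_index, weights, bias):
    """Toy logistic scorer: returns True when the classifier predicts
    the assistant turns can be omitted for this round."""
    # Fuse the three feature groups into one vector.
    features = prompt_type_onehot + conv_embedding + [float(round_index)]
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z)) > 0.5  # sigmoid, 0.5 threshold

# Made-up example: a self-contained follow-up early in the conversation.
omit = decide_omit(
    prompt_type_onehot=[1.0, 0.0, 0.0],   # e.g. "new instruction" category
    conv_embedding=[0.2, -0.1],           # stand-in for a real embedding
    round_index=2,
    weights=[1.5, -0.5, -0.5, 0.3, 0.3, 0.1],  # hypothetical learned weights
    bias=-0.4,
)
```

A learned model like this can express interactions a blanket rule ("always omit on new questions") cannot, which matches the paper's finding that simple heuristics underperform fine-grained filtering.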
Beyond efficiency, this challenges a foundational assumption in dialogue modeling. It suggests that multi-turn dependence is not inherent and is often overestimated, with direct implications for agent systems, tool-augmented reasoning, and conversational memory design.
The researchers note that future work should focus on selective retention at turn or artifact granularity, leveraging relevance models and extending human evaluation to validate automated judge reliability.
This research dismantles a default assumption and reveals that sometimes, less context is more. Language models don't always benefit from their own words—and knowing when to forget might be just as important as knowing what to remember. Visit EmergentMind.com to learn more and create your own videos.