LLMs Get Lost In Multi-Turn Conversation (2505.06120v1)

Published 9 May 2025 in cs.CL and cs.HC

Abstract: LLMs are conversational interfaces. As such, LLMs have the potential to assist their users not only when they can fully specify the task at hand, but also to help them define, explore, and refine what they need through multi-turn conversational exchange. Although analysis of LLM conversation logs has confirmed that underspecification occurs frequently in user instructions, LLM evaluation has predominantly focused on the single-turn, fully-specified instruction setting. In this work, we perform large-scale simulation experiments to compare LLM performance in single- and multi-turn settings. Our experiments confirm that all the top open- and closed-weight LLMs we test exhibit significantly lower performance in multi-turn conversations than single-turn, with an average drop of 39% across six generation tasks. Analysis of 200,000+ simulated conversations decomposes the performance degradation into two components: a minor loss in aptitude and a significant increase in unreliability. We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.

Summary

  • The paper shows that LLMs suffer a 39% average performance drop in multi-turn conversations compared to single-turn fully-specified interactions.
  • It introduces a sharding methodology that breaks instructions into atomic parts, enabling evaluation across six diverse generation tasks.
  • The analysis reveals that increased unreliability, rather than aptitude loss, primarily drives the degraded performance in multi-turn dialogues.

LLMs are increasingly used as conversational interfaces, promising assistance beyond fully specified single-turn instructions. This paper investigates LLM performance in multi-turn conversations where user requirements might be underspecified and clarified over time. In contrast to typical LLM evaluations focused on single-turn, fully-specified tasks, this research simulates multi-turn, underspecified conversations to assess performance degradation.

To evaluate LLMs in this setting, the authors developed a simulation environment based on a "sharding" process. Sharding transforms an existing high-quality single-turn instruction into a set of smaller, atomic "shards" that collectively contain the same information as the original instruction. A semi-automatic pipeline, using LLMs for initial segmentation and rephrasing followed by manual verification, was used to create the sharded instructions while enforcing properties such as information preservation and order insensitivity of the shards (except the first, which states the main intent).

The simulation environment involves an assistant LLM (the model being evaluated), a user simulator (an LLM tasked with revealing shards turn by turn based on the conversation context), and a system that classifies assistant responses and evaluates answer attempts. On each turn, the user simulator reveals one shard, the assistant responds, and the system classifies the response (e.g., answer attempt, clarification, refusal, discussion) and scores it if it is an answer attempt. The conversation ends when a correct answer is given or all shards have been revealed.
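A minimal sketch of this simulation loop, with the LLM-backed components abstracted as callables (hypothetical names and simplifications, not the paper's released code), could look like this:

```python
# Minimal sketch (assumed structure, not the paper's implementation) of the
# sharded-conversation simulation loop described above. The four callables are
# hypothetical stand-ins for the LLM-backed components the paper describes.

def run_sharded_conversation(shards, user_turn_fn, assistant_fn, classify_fn, score_fn):
    """Reveal shards one per user turn until a correct answer is given or shards run out.

    shards:        list of atomic instruction pieces; shards[0] states the main intent.
    user_turn_fn:  (shard, conversation) -> user message revealing that shard in context.
    assistant_fn:  (conversation) -> assistant reply (the model under evaluation).
    classify_fn:   (reply) -> one of "answer_attempt", "clarification", "refusal", "discussion".
    score_fn:      (reply) -> task score on a 0-100 scale for an answer attempt.
    """
    conversation, last_score = [], 0.0
    for shard in shards:
        conversation.append({"role": "user", "content": user_turn_fn(shard, conversation)})
        reply = assistant_fn(conversation)
        conversation.append({"role": "assistant", "content": reply})
        if classify_fn(reply) == "answer_attempt":
            last_score = score_fn(reply)
            if last_score == 100:  # a correct answer ends the conversation early
                break
    return last_score, conversation
```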

The study evaluated 15 open- and closed-weight LLMs across six diverse generation tasks:

  • Code: Generating Python functions from descriptions (sourced from HumanEval (Chen et al., 2021) and LiveCodeBench (Jain et al., 2024)).
  • Database: Generating SQL queries from natural language and a database schema (sourced from Spider (Yu et al., 2018)).
  • Actions: Generating API calls from user requests and API schemas (sourced from the Berkeley Function Calling Leaderboard, 2024).
  • Math: Solving elementary math word problems (sourced from GSM8K (Cobbe et al., 2021)).
  • Data-to-text: Generating captions for tables based on highlighted cells and metadata (sourced from ToTTo (Parikh et al., 2020)).
  • Summary: Generating multi-document summaries with citations from a corpus and query (sourced from Summary of a Haystack (Laban et al., 2024)).

These tasks span programming and natural language domains, involving binary correctness checks (Code, Database, Actions, Math) or continuous scores (Data-to-text, Summary). All task scores were mapped to a 0-100 scale.
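For illustration, here is an invented example (not drawn from the paper's datasets) of how a fully specified math word problem could be sharded, with the first shard carrying the main intent:

```python
# Invented illustration of sharding, not an example from the paper's data:
# the full instruction and its shards carry the same information, and only the
# first shard must come first because it states the main intent.
full_instruction = (
    "A bakery sells muffins for $3 each and cookies for $1 each. "
    "Jo buys 4 muffins and 6 cookies and pays with a $20 bill. "
    "How much change does Jo receive?"
)
shards = [
    "How much change does Jo get back at the bakery?",  # shard 1: main intent
    "Muffins cost $3 each.",
    "Cookies cost $1 each.",
    "Jo buys 4 muffins and 6 cookies.",
    "Jo pays with a $20 bill.",
]
```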

The experiments used five simulation types built from the sharded instructions; a sketch of how the corresponding inputs can be assembled from shards follows the list:

  • Full: Single-turn, original fully-specified instruction (baseline).
  • Sharded: Multi-turn, underspecified, shards revealed one by one.
  • Concat: Single-turn, all shards concatenated into a single prompt (tests effect of rephrasing/sharding format).
  • Recap: Sharded conversation followed by a final turn concatenating all shards (agentic intervention).
  • Snowball: Sharded conversation where each user turn adds a new shard and repeats all previous shards (agentic intervention).
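
As noted above, the following sketch shows one way the Concat, Recap, and Snowball inputs could be assembled from a list of shards. This is an assumed reading of the setup, not the paper's exact prompt templates.

```python
# Assumed sketch of how inputs are assembled from shards in the single-turn and
# repetition-based settings; the paper's actual prompt templates may differ.

def concat_prompt(shards):
    """Concat: all shards combined into one single-turn instruction."""
    return "\n".join(shards)

def recap_turns(shards):
    """Recap: reveal shards one per turn, then add a final turn restating all of them."""
    return list(shards) + ["To recap, here is everything so far:\n" + "\n".join(shards)]

def snowball_turns(shards):
    """Snowball: each user turn repeats every previously revealed shard plus one new shard."""
    return ["\n".join(shards[: i + 1]) for i in range(len(shards))]
```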

To account for the stochastic nature of LLMs, N = 10 simulations were run for each (LLM, instruction, simulation type) pair, totaling over 200,000 conversations. This enabled three key metrics, computed per pair over the 10 runs: averaged performance (P̄, the mean score), aptitude (A^90, the 90th-percentile score, i.e., best-case performance), and unreliability (U^90_10, the gap between the 90th- and 10th-percentile scores, i.e., best case versus worst case).
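A sketch of how these three metrics could be computed from the per-run scores, assuming standard percentile estimates (the paper may use a slightly different estimator):

```python
import numpy as np

def conversation_metrics(scores):
    """Compute averaged performance, aptitude, and unreliability from repeated runs.

    scores: per-run task scores on a 0-100 scale for one (LLM, instruction,
    simulation type) pair, e.g. the N = 10 simulations described above.
    """
    scores = np.asarray(scores, dtype=float)
    p_bar = scores.mean()                        # averaged performance (P-bar)
    a_90 = np.percentile(scores, 90)             # aptitude: best-case (90th percentile)
    u_10_90 = a_90 - np.percentile(scores, 10)   # unreliability: 90th - 10th percentile gap
    return p_bar, a_90, u_10_90

# Hypothetical example: eight failures and two successes on a binary-scored task.
print(conversation_metrics([0, 0, 0, 0, 0, 0, 0, 0, 100, 100]))  # (20.0, 100.0, 100.0)
```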

Table 1. Averaged performance (P̄) of the 15 LLMs on the six tasks under the Full, Concat, and Sharded settings. Each setting's cell lists scores in the order Code / Database / Actions / Data2Text / Math / Summary ("-" where a model was not evaluated on Summary). The final two columns report Concat and Sharded performance relative to Full, averaged across the six tasks (%).

| Model | Full | Concat | Sharded | Concat/Full | Sharded/Full |
|---|---|---|---|---|---|
| Llama 3.1-8B | 27.4 / 64.1 / 82.9 / 13.7 / 63.9 / 7.6 | 21.2 / 47.7 / 83.0 / 15.7 / 62.6 / 6.5 | 21.7 / 25.9 / 45.5 / 13.3 / 37.4 / 3.4 | 91.6 | 62.5 |
| OLMo2 | 18.8 / 54.8 / 56.1 / 17.2 / 80.0 / - | 16.3 / 40.5 / 49.8 / 14.3 / 80.1 / - | 14.4 / 22.4 / 13.8 / 9.0 / 46.3 / - | 86.5 | 50.5 |
| Claude 3-Haiku | 44.8 / 85.0 / 83.5 / 29.8 / 73.9 / 11.6 | 36.3 / 76.5 / 80.2 / 30.1 / 76.1 / 9.2 | 31.5 / 31.8 / 55.9 / 18.6 / 47.1 / 1.6 | 91.6 | 52.4 |
| GPT-4o-mini | 75.9 / 89.3 / 94.1 / 35.9 / 88.1 / 14.9 | 66.7 / 90.7 / 92.2 / 31.2 / 88.0 / 12.5 | 50.3 / 40.2 / 52.4 / 19.8 / 58.7 / 7.2 | 93.0 | 56.2 |
| Llama 3.3-70B | 72.0 / 91.1 / 95.0 / 34.1 / 91.7 / 15.8 | 52.7 / 87.9 / 97.0 / 32.0 / 91.8 / 14.7 | 51.6 / 35.4 / 71.0 / 22.4 / 61.5 / 10.5 | 93.2 | 64.2 |
| Phi-4 | 53.2 / 87.6 / 82.7 / 23.9 / 89.2 / - | 48.4 / 79.6 / 76.0 / 28.6 / 90.4 / - | 39.1 / 33.1 / 34.1 / 23.2 / 52.5 / - | 99.0 | 61.7 |
| CMD-A | 72.0 / 91.9 / 98.5 / 27.7 / 94.5 / 24.3 | 61.6 / 86.1 / 98.4 / 33.2 / 91.9 / 21.3 | 44.9 / 33.6 / 72.0 / 27.9 / 66.0 / 4.9 | 97.3 | 60.4 |
| Llama 4-Scout | 73.9 / 92.7 / 98.0 / 35.2 / 96.3 / 13.7 | 60.3 / 81.5 / 98.3 / 28.2 / 92.9 / 13.7 | 46.4 / 27.1 / 69.9 / 26.1 / 67.0 / 12.3 | 91.0 | 66.1 |
| o3 | 86.4 / 92.0 / 89.8 / 40.2 / 81.6 / 30.7 | 87.2 / 83.3 / 91.5 / 39.4 / 80.0 / 30.4 | 53.0 / 35.4 / 60.2 / 21.7 / 63.1 / 26.5 | 98.1 | 64.1 |
| Claude 3.7-Sonnet | 78.0 / 93.9 / 95.4 / 45.6 / 85.4 / 29.3 | 76.2 / 81.5 / 96.0 / 53.3 / 87.2 / 28.9 | 65.6 / 34.9 / 33.3 / 35.1 / 70.0 / 23.6 | 100.4 | 65.9 |
| Deepseek-R1 | 99.4 / 92.1 / 97.0 / 27.0 / 95.5 / 26.1 | 97.1 / 89.9 / 97.0 / 36.7 / 92.9 / 24.4 | 70.9 / 31.5 / 47.5 / 20.0 / 67.3 / 17.2 | 103.6 | 60.8 |
| GPT-4o | 88.4 / 93.6 / 96.1 / 42.1 / 93.8 / 23.9 | 82.9 / 91.7 / 97.1 / 32.2 / 91.9 / 23.9 | 61.3 / 42.3 / 65.0 / 20.5 / 67.9 / 10.6 | 94.5 | 57.9 |
| Gemini 2.5-Flash | 97.0 / 96.3 / 88.4 / 51.2 / 90.6 / 29.1 | 92.5 / 95.5 / 89.2 / 51.9 / 88.4 / 29.4 | 68.3 / 51.3 / 42.6 / 31.0 / 66.1 / 26.1 | 99.3 | 65.8 |
| GPT-4.1 | 96.6 / 93.0 / 94.7 / 54.6 / 91.7 / 26.5 | 88.7 / 86.5 / 98.5 / 54.4 / 89.7 / 26.8 | 72.6 / 46.0 / 62.9 / 28.6 / 70.7 / 13.3 | 97.9 | 61.8 |
| Gemini 2.5-Pro | 97.4 / 97.3 / 97.8 / 54.8 / 90.2 / 31.2 | 95.7 / 94.9 / 98.1 / 56.9 / 89.3 / 31.8 | 68.1 / 43.8 / 36.3 / 46.2 / 64.3 / 24.9 | 100.1 | 64.5 |

The main finding is that all tested LLMs exhibit significantly lower performance in multi-turn, underspecified (Sharded) conversations than in single-turn, fully-specified (Full) ones. As shown in Table 1, the average performance drop is 39% across the six tasks and 15 models; the authors term this the "Lost in Conversation" phenomenon. Performance in the single-turn Concat setting, where all shards are provided upfront, averages 95.1% of Full performance, indicating that the Sharded drop is not caused by the rephrasing or format of the sharded instructions themselves but by the multi-turn, underspecified nature of the interaction. Smaller models showed slightly larger Concat degradations, suggesting less robustness to rephrasing. The magnitude of the Sharded drop (30-40%) is similar for highly performant and less performant models alike. Additional test-time compute (as in reasoning models) did not mitigate the effect and even correlated with longer, more problematic responses.

Decomposing the performance degradation into aptitude (A^90) and unreliability (U^90_10) reveals that in single-turn settings (Full, Concat), higher aptitude generally correlates with lower unreliability. In the Sharded setting, however, aptitude drops only moderately (16% on average), while unreliability more than doubles (a 112% average increase). All models, regardless of their aptitude, exhibit very high unreliability in the multi-turn setting: the "Lost in Conversation" phenomenon is driven primarily by a significant increase in unreliability rather than a major loss in aptitude.

Qualitative analysis of simulation logs points to several potential root causes:

  • Premature Answer Attempts: Models often attempt to provide a full solution early in the conversation when information is still highly underspecified. Conversations where the first answer attempt occurs later tend to have significantly higher performance.
  • Answer Bloat: In multi-turn conversations, subsequent answer attempts tend to be significantly longer than initial attempts or solutions generated in single-turn settings, even for correct solutions. This suggests models struggle to correctly refine previous (potentially incorrect) attempts and instead overly rely on them.
  • Loss-in-Middle-Turns: Similar to the "lost in the middle" phenomenon in long-context, single-turn settings, LLMs in multi-turn conversations tend to over-rely on information presented in the earliest and latest turns, neglecting information provided in intermediary turns. This was observed in the Summary task by analyzing citation patterns.
  • Overly Verbose Responses: Across most tasks, conversations where the assistant generated shorter, more focused responses showed higher performance. Longer responses are hypothesized to introduce more assumptions and irrelevant information that derail the conversation.

Table 2. Averaged performance with the additional Recap and Snowball simulation types; both strategies repeat user-turn information.

| Model | Full | Concat | Sharded | Recap | Snowball |
|---|---|---|---|---|---|
| GPT-4o-mini | 86.8 | 84.4 | 50.4 | 66.5 | 61.8 |
| GPT-4o | 93.0 | 90.9 | 59.1 | 76.6 | 65.3 |

The findings have significant implications for different stakeholders:

  • System and Agent Builders: Relying solely on agentic frameworks to manage multi-turn interactions and using LLMs as single-turn operators may be insufficient. Agent-like interventions tested, such as Recap (final turn concatenation) and Snowball (turn-level repetition), improved performance over Sharded but still lagged significantly behind Full performance (Table 2). This suggests a need for native multi-turn understanding and reliability within the LLMs themselves.
  • LLM Builders: The study calls for prioritizing reliability alongside aptitude. An experiment varying the assistant and user-simulator temperatures showed that while lowering temperature improves reliability in single-turn settings, it is largely ineffective at mitigating the high unreliability of multi-turn settings (Table 3 of the paper). Even at T = 0, substantial unreliability remains because non-determinism compounds across turns. Builders should aim for LLMs that are reliable in multi-turn settings at standard temperatures (T = 1.0).
  • NLP Practitioners: The sharding methodology is released to facilitate evaluating multi-turn capabilities on new tasks. Tasks are hypothesized to cause LLMs to get lost in conversation if they are generative, sufficiently complex with multiple specifications, and require non-decomposable solutions (unlike the episodic Translation task, which showed no significant degradation in the Sharded setting; Table 6 of the paper).
  • Users of Conversational Systems: Users should be aware of LLM unreliability in multi-turn contexts. Practical recommendations include starting a new conversation if the current one is unproductive and consolidating all necessary information into a single prompt when retrying, as LLMs struggle with information dispersed across turns.

The study acknowledges limitations, primarily the reliance on automated simulation with an LLM user simulator, which may not fully capture the complexity and unpredictability of human-AI conversations. The focus on analytical tasks and on English-only text is a further limitation. However, the authors argue that the controlled, simplified nature of the simulation likely leads to an underestimation of the degradation and unreliability that occur in real-world, underspecified multi-turn interactions.

In conclusion, the research demonstrates that current LLMs get significantly lost in multi-turn, underspecified conversations, primarily due to catastrophic unreliability stemming from premature assumptions, bloated responses, context neglect, and verbosity. Standard techniques like lowering temperature or simple agentic recapitulation are insufficient remedies. The paper calls for LLM developers to prioritize multi-turn reliability alongside aptitude in future model development to bridge the gap between lab performance and real-world conversational utility.
