DD-GEPA: Prompt Optimization for Dialogue Disentanglement Focusing on Task Instruction and Utterance Representation

Published 5 Jun 2026 in cs.SE | (2606.07894v1)

Abstract: Multi-party chat often contains interleaved dialogues because multiple participants can discuss different topics at the same time. Dialogue disentanglement addresses this problem by separating an entangled utterance sequence into coherent dialogues. While LLMs are promising for this task, they still struggle with dialogue disentanglement and achieve low accuracy. This paper proposes an automatic prompt optimization for LLM based dialogue disentanglement. We decompose the prompt into three components: task instruction, utterance representation, and output instruction, and optimize them using GEPA, an optimization method for compound AI systems. Experiments on benchmark datasets show that the optimized prompts improve dialogue disentanglement accuracy over the original prompts and can surpass hand crafted prompts.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper introduces DD-GEPA, a modular framework that automatically optimizes dialogue disentanglement prompts through evolutionary search.
The paper demonstrates that the optimized prompt boosts key metrics (e.g., VI=94.12, ARI=75.87, Dialogue F1=42.52) compared to manual baselines.
The paper highlights limitations in trace diversity and signal bottlenecks, suggesting hybrid approaches for further advancements.

DD-GEPA: Automatic Prompt Optimization for Dialogue Disentanglement

Problem Setting and Motivation

Dialogue disentanglement refers to the task of partitioning entangled utterances from multi-party chat streams into coherent dialogue clusters, where clusters are defined by reply-to relations among utterances. This is a foundational preprocessing step for downstream applications such as dialogue state tracking and conversational response modeling, particularly in settings with high message concurrency (e.g., IRC, Slack). LLMs exhibit strong contextual reasoning, suggesting potential for state-of-the-art performance in disentanglement. However, prior studies have shown that LLM-based approaches are highly sensitive to prompt design and may underperform compared to specialized neural models and handcrafted features, especially for open-weight models with ≤30B parameters.

Manual prompt engineering is infeasible at scale due to the lack of comprehensive annotation guidelines, notable human disagreement over gold standards, and the fine-grained dependence of LLM performance on prompt formatting and structure. Therefore, this work introduces DD-GEPA: an automatic, modular prompt optimization framework that decomposes complex dialogue disentanglement prompts into three independently optimizable modules—task instruction, utterance representation, and output instruction—and uses an extension of GEPA (Reflective Prompt Evolution; [agrawal2026gepa]) to optimize these components through evolutionary search and natural language reflection.

Figure 1: Illustration of the dialogue disentanglement task, visualizing the partitioning of interleaved chat utterances into color-coded dialogue clusters.

DD-GEPA Framework

DD-GEPA adapts the GEPA optimization paradigm for the multi-faceted prompt structure of dialogue disentanglement. A "program" in DD-GEPA consists of the three prompt modules (instruction, representation, output), and a population of such programs is iteratively optimized in a Pareto-based evolutionary loop.

Candidate program selection: Programs are maintained in a pool, scored on a validation set by the binary accuracy of individual reply-to predictions.
Module update: At each iteration, a module is selected in round-robin and updated by an LLM via reflection on execution traces (inputs, LLM output, reasoning, correctness, and failure explanations for incorrect decisions).
Acceptance & evaluation: Updates are only accepted if the new program outperforms its predecessor on a minibatch; accepted programs are evaluated on the full validation set and re-inserted into the pool.
Representation/Output handling: Proposed modifications to utterance representation or output instruction modules require automatically generated conversion/interpreting code, ensuring compatibility with LLM input/output specifications.
Figure 2: DD-GEPA's automatic prompt optimization process: selection and update of a module using program traces, Pareto filtering, and minibatch validation.

This approach enables joint search over semantics and formatting, tailored for disentanglement tasks with complex, compositional prompt requirements.

Experimental Evaluation

Dataset and Baselines

Experiments are conducted on the Ubuntu IRC dataset ([kummerfeld2019large]), with standard development/test splits. The study benchmarks four prompt settings:

Seed 1: Minimal instruction and whitespace-separated representation; single-value output.
Seed 2: As above, with a two-key output scheme ("is_new_dialogue", "utterance_id").
Baseline: Manually engineered prompt with JSON utterance representation and two-key output, as in [TakadaMori2026Rethinking].
Optimum: The program produced by DD-GEPA after three update rounds (one per module).

All experiments use Qwen3-30B-A3B-Thinking-2507 ([yang2025qwen3]) for reply-to prediction; GPT-5.2 ([openai2025gpt52]) is used for comparative evaluation. Metrics include variation of information (VI), ARI, NMI, 1-1, S-F1, local accuracy, and dialogue-level (cluster) precision/recall/F1.

Main Results

DD-GEPA optimization consistently improves performance of open-weight Qwen3-30B over Seed and Baseline prompts across all metrics on the IRC test split.
The optimized Optimum prompt with Qwen3-30B achieves VI=94.12, ARI=75.87, 1-1=82.26, NMI=95.51, $Local_3$ =95.39, S-F1=84.80, Dialogue F1=42.52.
This setting outperforms the manual Baseline (F1=39.40), but does not close the gap with fully supervised non-LLM SOTA such as DiHRL ([li2025revisiting]) or with large proprietary LLMs (Gemini2.5pro, F1=61.27).
No further gains are observed after the fifth optimization iteration, as the candidate pool saturates—a reflection of local optima or limited prompt-update diversity.
Figure 3: Trajectory of validation scores during DD-GEPA optimization—substantial improvements are realized in early rounds, after which the validation score plateaus.

Analysis and Implications

Error Taxonomy: Predominant residual errors under Optimum are associated with ambiguities in phatic expressions, system messages, technical jargon, ambiguous addressee contexts, and cases with insufficient preceding context. These align with known annotator ambiguities in gold data and indicate fundamental challenges in both LLM and traditional models.

Optimization Limits: DD-GEPA's convergence correlates with both the representational expressiveness of the best prompt and limits in trace diversity (large token context requirements preclude larger trace sets during update). The prompt optimization dataset, although curated for difficulty, may be insufficiently challenging or diverse to drive breakthrough improvements.

Implications for Disentanglement and Prompt Engineering:

Modular, automatically optimized prompts can yield measurable accuracy gains in LLM-based dialogue disentanglement, demonstrating their value over manual engineering even for strong, large models.
Despite clear advances, prompt optimization alone does not bridge the gap with best-in-class non-LLM and scaled proprietary LLM systems—suggesting limitations inherent to both open-weight model capacity and the information bottleneck of prompts.
Extending beyond prompt search—possibly leveraging hybrid retrieval, in-context exemplars, or dynamic prompt adaptation—may be necessary for further advances.

Theoretical and Practical Impact

From a theoretical perspective, this work formalizes a modular, compositional approach to prompt optimization for structured NLP tasks, providing a blueprint for decomposable, high-dimensional prompt search in compound LLM systems. Practically, the findings highlight both the strengths and constraints of reflective evolutionary prompt search for real-world dialogue tasks—especially under privacy restrictions limiting LLM fine-tuning or data sharing.

Future developments may pursue:

Scalable trace-based optimization that circumvents token constraints (e.g., sampling, abstraction, or filter techniques for trace selection).
Integration with instruction tuning or demonstration-based prompt composition.
Expanding to domains with more complex interaction structures, beyond programming-related or publicly available data.

Conclusion

DD-GEPA demonstrates that automatic, modular prompt optimization provides tangible improvements for LLM-based dialogue disentanglement, surpassing manual prompt baselines but not yet matching specialized neural architectures or scaled commercial LLMs. This establishes a robust foundation for systematic, data-driven prompt design in compositional LLM programs and sets the stage for further research on context-efficient, high-fidelity disentanglement in multi-party chat environments.

Markdown Report Issue