DICE-BENCH: Evaluating LLM Tool-Use in Dialogues
- DICE-BENCH is a framework and evaluation benchmark that rigorously assesses LLM tool-use in multi-round, multi-party dialogue scenarios simulating realistic collaborative interactions.
- It introduces the DICE-SCORE metric to quantify the dispersion and fragmentation of tool-related information, revealing performance challenges as dialogue complexity increases.
- The benchmark comprises 1,607 high-fidelity dialogues with complex tool dependencies, offering actionable insights for improving LLM memory and context integration.
DICE-BENCH is a framework and evaluation benchmark designed to rigorously assess the tool-use capabilities of LLMs in multi-round, multi-party dialogue scenarios. In contrast to prior function-calling benchmarks that primarily evaluate models in single-turn settings, DICE-BENCH introduces a systematic means to measure and test a model's ability to integrate information distributed across extended, naturalistic conversations—mirroring the complexity of real-world collaborative interactions.
1. DICE-SCORE Metric: Quantifying Functional Dialogue Complexity
At the core of DICE-BENCH is DICE-SCORE, a metric that quantifies the dispersion and fragmentation of tool-related information (such as function names and parameters) within a dialogue. DICE-SCORE captures how challenging it is for an LLM to correctly identify and compile the requisite function call from contextually scattered conversational cues, as opposed to settings where all information is provided in a single concise request.
The metric is defined over a dialogue with $N$ utterances and $K$ function-call components to be extracted (the function name and all argument slots). For each utterance $u_i$, a count $c_i$ records the number of unique function-related items it contains, and the sequence $(c_1, \dots, c_N)$ is aggregated into a single score: one term counts the utterances that contain relevant information, an exponent term penalizes repeated or redundant mentions, and a scaling factor normalizes for sequence and output length. This formulation rewards dispersed, non-redundant, multi-utterance information (higher DICE-SCORE), reflecting the complexity of reference binding and cross-turn coreference, while penalizing concentrated or repetitive presentations.
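As an illustration of these properties only (this is not the paper's exact formula; the exponent value and the scaling used below are assumptions for demonstration), a dispersion score of this kind can be computed as follows:

```python
from typing import List

def dice_like_score(per_utterance_counts: List[int], num_components: int, alpha: float = 0.5) -> float:
    """Illustrative dispersion score; NOT the paper's exact DICE-SCORE formula.

    per_utterance_counts[i]: number of unique function-related items
    (function name or argument values) mentioned in utterance i.
    num_components: total items required for the gold call.
    alpha < 1 gives sublinear credit to utterances that pack several items
    together, so dispersed mentions score higher than concentrated ones.
    """
    if not per_utterance_counts or num_components == 0:
        return 0.0
    informative = [c for c in per_utterance_counts if c > 0]
    spread = sum(c ** alpha for c in informative)        # sublinear per-utterance credit
    return spread * len(informative) / num_components    # scale by how many turns carry information

# All five items in one utterance vs. spread over five utterances.
print(dice_like_score([5, 0, 0, 0, 0], num_components=5))  # low: concentrated
print(dice_like_score([1, 1, 1, 1, 1], num_components=5))  # high: dispersed
```

Packing every item into a single utterance yields a lower score than spreading the same items across several turns, matching the intended ordering of difficulty.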
Empirical validation indicates that human accuracy decreases sharply as DICE-SCORE increases (correlation of -0.984), confirming the metric's alignment with task difficulty and real-world multi-agent collaborative demands.
2. Framework Design and Dataset Construction
DICE-BENCH systematically generates high-fidelity, multi-party, multi-round dialogues through a multi-stage pipeline:
- Tool Graph Construction: A manually curated tool graph (124 nodes, 270 edges) provides the backbone, ensuring that functional and contextual dependencies are meaningful for real-world task workflows.
- Scenario Configuration: Tool chains (paths of length 1–4) are sampled from the tool graph, and each scenario is assigned a dialogue type (e.g., persuasion, information-seeking, eristic/conflict-driven) and a number of participants (2–4 agents); a sampling sketch follows this list.
- Persona Assignment: Each dialogue participant is assigned a distinct persona generated with LLM prompting, ensuring a rich diversity in tone and interaction style.
- Dialogue Generation: Multi-agent simulation orchestrates conversation, with agents exchanging information, negotiating, or collaborating to produce the function call input across sequential rounds. Contextual memory and the output of prior tools condition subsequent rounds, and semantic information is distributed to match real-life dialogue flow rather than concentrated or explicit statements.
- Filtering and Validation: Dialogues are filtered using automatic LLM-based criteria (fluency, persona consistency, integration, etc.), rule-based checks (e.g., appropriate addressing of the system), and a human-validated rubric (a 15-point checklist covering, e.g., real-world applicability).
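As referenced above, here is a minimal sketch of the scenario-configuration step, assuming the tool graph is represented as a directed adjacency map (the tool names and edges below are hypothetical placeholders; the curated graph has 124 nodes and 270 edges):

```python
import random
from typing import Dict, List

# Hypothetical adjacency map: an edge A -> B means tool B can consume
# the output of tool A (the real graph has 124 nodes and 270 edges).
TOOL_GRAPH: Dict[str, List[str]] = {
    "search_flights": ["book_flight"],
    "book_flight": ["send_confirmation_email"],
    "get_weather": ["suggest_activity"],
    "suggest_activity": [],
    "send_confirmation_email": [],
}

def sample_tool_chain(graph: Dict[str, List[str]], max_len: int = 4) -> List[str]:
    """Sample a dependency-respecting tool chain of length 1..max_len."""
    chain_len = random.randint(1, max_len)
    node = random.choice(list(graph))
    chain = [node]
    while len(chain) < chain_len and graph[node]:
        node = random.choice(graph[node])  # follow an outgoing dependency edge
        chain.append(node)
    return chain

def configure_scenario(graph: Dict[str, List[str]]) -> dict:
    """Bundle a sampled chain with a dialogue type and a party count."""
    return {
        "tool_chain": sample_tool_chain(graph),
        "dialogue_type": random.choice(["persuasion", "information-seeking", "eristic"]),
        "num_participants": random.randint(2, 4),
    }

print(configure_scenario(TOOL_GRAPH))
```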
This process results in a challenging dataset of 1,607 dialogues, with mean DICE-SCORE 3.64, significantly exceeding prior benchmarks.
3. Dataset Structure and Properties
The resulting DICE-BENCH dataset exhibits several distinguishing properties:
| Feature | DICE-BENCH Value | Comparative Value (prior benchmarks) |
|---|---|---|
| DICE-SCORE (mean) | 3.64 | API-Bank: 1.63; TaskBench: 0.64 |
| Instances | 1,607 | Varies; typically <1,000 for comparable benchmarks |
| Dialogue rounds (levels) | 1–4 (multi-round) | Primarily single-turn |
| Number of parties | 2–4 | Often single- or two-party |
The dataset spans task domains with complex tool dependencies and realistic conversational dynamics, encompassing a broad spectrum of dialogue types and multi-agent interactions.
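For concreteness, a single instance can be pictured roughly as follows; all field names here are illustrative placeholders, not the dataset's actual schema:

```python
# Illustrative shape of one multi-round, multi-party instance.
# Field names are hypothetical; consult the released dataset for the real schema.
example_instance = {
    "dialogue_id": "dice_000001",
    "participants": ["Alice", "Bob", "Chris"],  # 2-4 personas
    "rounds": [
        {
            "round": 1,
            "utterances": [
                {"speaker": "Alice", "text": "Can we book the Friday flight to Busan?"},
                {"speaker": "Bob", "text": "Sure, the 9 AM one, and put it under my name."},
            ],
            # Gold target: the tool call whose pieces are scattered across the turns above.
            "gold_call": {
                "name": "book_flight",
                "arguments": {"destination": "Busan", "time": "09:00", "passenger": "Bob"},
            },
        },
        # Later rounds condition on earlier tool outputs.
    ],
}
```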
4. Experimental Evaluation of LLMs
DICE-BENCH evaluates 19 LLMs (both open- and closed-source, supporting at least 8k context tokens), testing each on the full multi-round, multi-party dialogues. The task is to predict the correct function call—including precise parameter extraction—from the natural language conversation.
Experimental results demonstrate:
- Performance strongly inversely correlates with DICE-SCORE: As information is more dispersed, LLM accuracy (measured by exact match for function name and all argument slots) declines. GPT-4o, for example, achieves ~74% EM in the simplest round (R1, lower DICE-SCORE) but drops to ~59% in the most difficult round (R4, highest DICE-SCORE).
- Open-source models show progress but lag behind closed-source: Models such as Qwen2.5-32B and Phi-4-15B-Instruct improve over prior open releases but remain below the best proprietary models.
- Common failure modes include failing to bridge references across speakers and rounds, and strict format sensitivity: models often produce semantically correct calls but fail exact-match scoring due to minor JSON format mismatches (the scorer sketch at the end of this section illustrates this strictness), highlighting the challenge of robust structural adherence over multi-agent histories.
- Human performance declines with increasing DICE-SCORE, mirroring the models' difficulties and supporting the metric's grounding in genuine task difficulty.
A representative plot (see paper) underlines the steep performance drop-off (>20% gap) as dialogue rounds and information dispersion increase.
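To make the format-sensitivity point concrete, the following is a minimal sketch of a strict exact-match check, assuming predictions are emitted as JSON objects with `name` and `arguments` fields; this is an illustration, not the benchmark's official scorer:

```python
import json

def exact_match(prediction: str, gold: dict) -> bool:
    """Strict exact match: the predicted call must parse as JSON and agree
    with the gold call on the function name and every argument slot."""
    try:
        pred = json.loads(prediction)
    except json.JSONDecodeError:
        return False  # malformed output counts as a miss
    return (
        pred.get("name") == gold["name"]
        and pred.get("arguments") == gold["arguments"]
    )

gold = {"name": "book_flight", "arguments": {"destination": "Busan", "time": "09:00"}}

# Semantically right, but the arguments are nested under "args" instead of
# "arguments", so the strict check fails -- the failure mode described above.
print(exact_match('{"name": "book_flight", "args": {"destination": "Busan", "time": "09:00"}}', gold))
```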
5. Significance and Impact
DICE-BENCH highlights a crucial shortcoming in current function-calling benchmarks: prior resources largely test single-turn, single-party interactions with concentrated information—scenarios that are significantly less complex than real-world deployments. By contrast, DICE-BENCH’s construction and DICE-SCORE explicitly reflect the multi-agent, multi-round environments encountered in collaborative settings and real-world team workflows.
The DICE-BENCH approach illustrates that even state-of-the-art LLMs are not yet suitable as robust, context-integrating assistants for group, dialogue-driven tool use. This suggests a need for future research focused on enhancing LLM memory, dialogue coherence, long-context reasoning, and flexible reference resolution in distributed conversational settings.
The benchmark and its metric provide precise, interpretable signals for model development and evaluation, enabling direct measurement of dialogue-level context tracking and the bridging of information across complex conversational structures.
6. Public Resource Availability
All code and data for DICE-BENCH are publicly released:
- Code repository: https://github.com/snuhcc/DICE-Bench
- Dataset repository: https://huggingface.co/datasets/OfficerChul/DICE-BENCH
This ensures reproducibility and supports further methodological advances in LLM function-calling benchmarking.
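As a quick start, the dataset can presumably be loaded with the Hugging Face `datasets` library; the repository ID is taken from the link above, but the split name below is an assumption and should be checked against the dataset card:

```python
from datasets import load_dataset

# Repository ID from the dataset link above; the split name is an assumption.
dice = load_dataset("OfficerChul/DICE-BENCH", split="test")
print(dice[0])  # inspect one multi-round, multi-party dialogue instance
```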
7. Summary Table: Key Metrics and Findings
| Aspect | Detail (from paper) |
|---|---|
| DICE-SCORE (difficulty) | Ranges up to >5.0; mean 3.64 in DICE-BENCH; higher = harder |
| Dataset size | 1,607 high-quality, multi-round, multi-party dialogue scenarios |
| Human accuracy (vs. DICE-SCORE) | Drops from 80.5% (easy) to 49.3% (hard; R4); correlation -0.984 |
| LLM performance (GPT-4o) | ~74% (R1) → ~59% (R4) EM; best open-source models lag behind |
| Scenario diversity | Multi-level, multi-party, multi-persona, multi-tool dependencies |
| Public accessibility | Code and data fully available for research use |
DICE-BENCH establishes a rigorous, interpretable, and practically oriented testbed for LLMs' capabilities in realistic, collaborative, and memory-intensive function-calling tasks, providing actionable diagnostics for both model benchmarking and future research directions.