DICE-BENCH: Evaluating LLM Tool-Use in Dialogues

Updated 1 July 2025
  • DICE-BENCH is a framework and evaluation benchmark that rigorously assesses LLM tool-use in multi-round, multi-party dialogue scenarios simulating realistic collaborative interactions.
  • It introduces the DICE-SCORE metric to quantify the dispersion and fragmentation of tool-related information, revealing performance challenges as dialogue complexity increases.
  • The benchmark leverages 1,607 high-fidelity dialogues with complex tool dependencies, offering actionable insights for improving LLM memory and context integration.

DICE-BENCH is a framework and evaluation benchmark designed to rigorously assess the tool-use capabilities of LLMs in multi-round, multi-party dialogue scenarios. In contrast to prior function-calling benchmarks that primarily evaluate models in single-turn settings, DICE-BENCH introduces a systematic means to measure and test a model's ability to integrate information distributed across extended, naturalistic conversations—mirroring the complexity of real-world collaborative interactions.

1. DICE-SCORE Metric: Quantifying Functional Dialogue Complexity

At the core of DICE-BENCH is DICE-SCORE, a metric that quantifies the dispersion and fragmentation of tool-related information (such as function names and parameters) within a dialogue. DICE-SCORE captures how challenging it is for an LLM to correctly identify and compile the requisite function call from contextually scattered conversational cues, as opposed to settings where all information is provided in a single concise request.

The metric is formally defined as follows. Given a dialogue with $n$ utterances and $T$ function-call components to be extracted (the function name and all argument slots), let $S_i$ denote the number of unique function-related items present in utterance $i$. The sequence $S = (S_1, \ldots, S_n)$ is analyzed via:

$$\text{DICE}(S, T) = \frac{\min\!\left(|S_{\neq 0}|,\, T\right) \cdot \sqrt{|S| \cdot T}}{\sum_{i \in S} \ln(1 + \alpha S_i)}$$

where $|S_{\neq 0}|$ counts the utterances containing relevant information, $\alpha$ (typically $e^2$) penalizes repeated or redundant mentions, and scaling by $\sqrt{|S| \cdot T}$ normalizes for sequence and output length. This formulation rewards dispersed, non-redundant, multi-utterance information (yielding a higher DICE-SCORE), reflecting the complexity of reference binding and cross-turn coreference, while penalizing concentrated or repetitive presentations.
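The following is a minimal sketch of the metric as defined above, assuming the per-utterance counts $S_i$ have already been extracted and using the suggested $\alpha = e^2$; the released implementation may differ in edge-case handling.

```python
import math

def dice_score(s: list[int], t: int, alpha: float = math.e ** 2) -> float:
    """Sketch of DICE(S, T) from the definition above.

    s     -- per-utterance counts of unique function-related items (S_1..S_n)
    t     -- total number of function-call components to extract
    alpha -- redundancy penalty constant (typically e^2)
    """
    nonzero = sum(1 for s_i in s if s_i > 0)                    # |S_{!=0}|
    numerator = min(nonzero, t) * math.sqrt(len(s) * t)         # min(|S_{!=0}|, T) * sqrt(|S| * T)
    denominator = sum(math.log(1 + alpha * s_i) for s_i in s)   # sum ln(1 + alpha * S_i)
    return numerator / denominator if denominator else 0.0

# The same three components dispersed over three utterances score higher
# than when they are concentrated in a single utterance:
print(dice_score([1, 0, 1, 0, 1], t=3))  # ~1.82 (dispersed)
print(dice_score([3, 0, 0, 0, 0], t=3))  # ~1.23 (concentrated)
```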

Empirical validation indicates that human accuracy decreases sharply as DICE-SCORE increases (correlation $r \approx -0.984$), confirming its alignment with task difficulty and real-world multi-agent collaborative demands.

2. Framework Design and Dataset Construction

DICE-BENCH systematically generates high-fidelity, multi-party, multi-round dialogues through a multi-stage pipeline:

  • Tool Graph Construction: A manually curated tool graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ (124 nodes, 270 edges) provides the backbone, ensuring functional interdependencies and contextual dependencies that are meaningful for real-world task workflows.
  • Scenario Configuration: Tool chains (paths of length 1–4) are sampled from $\mathcal{G}$, each assigned a dialogue type (e.g., persuasion, information-seeking, eristic/conflict-driven) and 2–4 participating agents; a sampling sketch follows this list.
  • Persona Assignment: Each dialogue participant is assigned a distinct persona generated with LLM prompting, ensuring a rich diversity in tone and interaction style.
  • Dialogue Generation: Multi-agent simulation orchestrates conversation, with agents exchanging information, negotiating, or collaborating to produce the function call input across sequential rounds. Contextual memory and the output of prior tools condition subsequent rounds, and semantic information is distributed to match real-life dialogue flow rather than concentrated or explicit statements.
  • Filtering and Validation: Dialogues are filtered using automatic LLM-based criteria (fluency, persona consistency, integration, etc.), rule-based checks (e.g., appropriate addressing of system), and a human-validated rubric (15-point checklist, e.g., real-world applicability).
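As a rough illustration of the scenario-configuration step, the sketch below samples a tool chain of length 1–4 from a toy dependency graph and attaches a dialogue type and party count. The tool names and graph are hypothetical stand-ins for the curated 124-node graph, and the persona and dialogue-generation stages (which rely on LLM prompting) are omitted.

```python
import random

# Hypothetical adjacency list standing in for the curated tool graph G = (V, E);
# the real DICE-BENCH graph has 124 nodes and 270 edges.
TOOL_GRAPH = {
    "search_flights": ["book_flight"],
    "book_flight": ["add_calendar_event", "send_confirmation_email"],
    "add_calendar_event": ["send_confirmation_email"],
    "send_confirmation_email": [],
}

def sample_tool_chain(graph: dict[str, list[str]], max_len: int = 4) -> list[str]:
    """Random-walk a chain of 1-4 interdependent tools along graph edges."""
    chain = [random.choice(list(graph))]
    target_len = random.randint(1, max_len)
    while len(chain) < target_len and graph[chain[-1]]:
        chain.append(random.choice(graph[chain[-1]]))
    return chain

def configure_scenario(graph: dict[str, list[str]]) -> dict:
    """Bundle a sampled chain with a dialogue type and a 2-4 party count."""
    return {
        "tool_chain": sample_tool_chain(graph),
        "dialogue_type": random.choice(
            ["persuasion", "information-seeking", "eristic/conflict-driven"]
        ),
        "num_participants": random.randint(2, 4),
    }

print(configure_scenario(TOOL_GRAPH))
```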

This process results in a challenging dataset of 1,607 dialogues with a mean DICE-SCORE of 3.64, significantly exceeding prior benchmarks.

3. Dataset Structure and Properties

The resulting DICE-BENCH dataset exhibits several distinguishing properties:

| Feature | DICE-BENCH | Prior benchmarks |
|---|---|---|
| DICE-SCORE (mean) | 3.64 | API-Bank: 1.63; TaskBench: 0.64 |
| Instances | 1,607 | Varies; typically <1,000 for similar benchmarks |
| Dialogue rounds | 1–4 (multi-round) | Primarily single-turn |
| Number of parties | 2–4 | Often single- or two-party |

The dataset spans task domains with complex tool dependencies and realistic conversational dynamics, encompassing a broad spectrum of dialogue types and multi-agent interactions.
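For concreteness, a purely hypothetical instance layout is sketched below; the field names are illustrative assumptions rather than the released schema, but they convey the pairing of a multi-party, multi-round dialogue with a gold function call.

```python
# Illustrative only: field names are assumptions, not the released DICE-BENCH schema.
example_instance = {
    "round": 3,                                   # scenarios span 1-4 rounds
    "dialogue_type": "information-seeking",
    "participants": ["Alex", "Priya", "System"],  # 2-4 parties per dialogue
    "dialogue": [
        {"speaker": "Alex",  "utterance": "Let's take the 9am flight to Berlin."},
        {"speaker": "Priya", "utterance": "Fine, but book it under the team account."},
    ],
    "gold_function_call": {
        "name": "book_flight",
        "arguments": {"destination": "Berlin", "time": "09:00", "account": "team"},
    },
}
```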

4. Experimental Evaluation of LLMs

DICE-BENCH evaluates 19 LLMs (both open- and closed-source, supporting at least 8k context tokens), testing each on the full multi-round, multi-party dialogues. The task is to predict the correct function call—including precise parameter extraction—from the natural language conversation.

Experimental results demonstrate:

  • Performance strongly inversely correlates with DICE-SCORE: As information is more dispersed, LLM accuracy (measured by exact match for function name and all argument slots) declines. GPT-4o, for example, achieves ~74% EM in the simplest round (R1, lower DICE-SCORE) but drops to ~59% in the most difficult round (R4, highest DICE-SCORE).
  • Open-source models show progress but lag behind closed-source: Models such as Qwen2.5-32B and Phi-4-15B-Instruct improve over prior open releases but remain below the best proprietary models.
  • Common failure modes include an inability to bridge references across speakers and rounds, as well as strict format sensitivity: models often produce semantically correct calls but fail the benchmark due to minor JSON format mismatches, highlighting the challenge of robust structural adherence over multi-agent histories.
  • Human performance declines with increasing DICE-SCORE, mirroring LLM struggle: This supports the metric's grounding in genuine task difficulty.

A representative plot (see paper) underlines the steep performance drop-off (>20% gap) as dialogue rounds and information dispersion increase.
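Since accuracy is scored as an exact match over the function name and all argument slots, even a small formatting slip can zero out an otherwise correct prediction. The sketch below illustrates one plausible way such strict scoring behaves after JSON parsing; it is an assumption about the scoring logic, not the paper's released evaluation script.

```python
import json

GOLD = {"name": "book_flight",
        "arguments": {"destination": "Berlin", "time": "09:00"}}

def exact_match(pred_text: str, gold: dict) -> bool:
    """Strict EM: the function name and every argument slot must agree exactly."""
    try:
        pred = json.loads(pred_text)
    except json.JSONDecodeError:
        return False  # malformed JSON fails outright, a common failure mode
    return pred.get("name") == gold["name"] and pred.get("arguments") == gold["arguments"]

# Key order and whitespace are forgiven by parsing, but a missing slot is not:
print(exact_match('{"name": "book_flight", "arguments": {"time": "09:00", "destination": "Berlin"}}', GOLD))  # True
print(exact_match('{"name": "book_flight", "arguments": {"destination": "Berlin"}}', GOLD))                   # False
```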

5. Significance and Impact

DICE-BENCH highlights a crucial shortcoming in current function-calling benchmarks: prior resources largely test single-turn, single-party interactions with concentrated information—scenarios that are significantly less complex than real-world deployments. By contrast, DICE-BENCH’s construction and DICE-SCORE explicitly reflect the multi-agent, multi-round environments encountered in collaborative settings and real-world team workflows.

The DICE-BENCH approach illustrates that even state-of-the-art LLMs are not yet suitable as robust, context-integrating assistants for group, dialogue-driven tool use. This suggests a need for future research focused on enhancing LLM memory, dialogue coherence, long-context reasoning, and flexible reference resolution in distributed conversational settings.

The benchmark and its metric provide precise, interpretable signals for model development and evaluation, enabling direct measurement of dialogue-level context tracking and the bridging of information across complex conversational structures.

6. Public Resource Availability

All code and data for DICE-BENCH are publicly released.

This ensures reproducibility and supports further methodological advances in LLM function-calling benchmarking.

7. Summary Table: Key Metrics and Findings

| Aspect | Detail (from paper) |
|---|---|
| DICE-SCORE (difficulty) | Ranges up to >5.0; mean 3.64 in DICE-BENCH; higher = harder |
| Dataset size | 1,607 high-quality, multi-round, multi-party dialogue scenarios |
| Human accuracy (vs. DICE-SCORE) | Drops from 80.5% (easy) to 49.3% (hard, R4); correlation -0.984 |
| LLM performance (GPT-4o) | ~74% (R1) → ~59% (R4) EM; best open-source models lag behind |
| Scenario diversity | Multi-level, multi-party, multi-persona, multi-tool dependencies |
| Public accessibility | Code and data fully available for research use |

DICE-BENCH establishes a rigorous, interpretable, and practically oriented testbed for LLMs' capabilities in realistic, collaborative, and memory-intensive function-calling tasks, providing actionable diagnostics for both model benchmarking and future research directions.