
MTR-Bench: Interactive LLM Benchmark

Updated 12 April 2026
  • MTR-Bench is a comprehensive automated benchmark that evaluates multi-turn, interactive reasoning capabilities in large language models.
  • It employs 40 tasks with 3,600 instances to test iterative problem solving through feedback, adaptive planning, and difficulty calibration.
  • The benchmark highlights performance trade-offs between specialized reasoning models and general-purpose LLMs, focusing on accuracy, efficiency, and instruction adherence.

MTR-Bench is a comprehensive, automated benchmark designed to evaluate the capabilities of LLMs in multi-turn, interactive reasoning scenarios that closely resemble real-world problem solving. Unlike prior benchmarks predominantly focused on single-turn reasoning, MTR-Bench systematically probes models’ ability to act iteratively—incorporating feedback, adapting plans, and sustaining coherence across a sequence of turns. It consists of 40 interactive tasks grouped into four distinct reasoning modes, encompasses 3,600 total instances with fine-grained difficulty control, and provides a template-driven, end-to-end pipeline for scalable, reproducible assessment (Li et al., 21 May 2025).

1. Motivation and Scope

MTR-Bench was introduced to fill a gap in benchmarking LLMs' ability to handle iterative, interactive reasoning. In most practical contexts, complex reasoning requires a sequence of decisions in which each step is informed by previous outcomes, a dynamic that single-turn evaluation datasets (e.g., GSM8K, ProofWriter) fail to capture. By emphasizing multi-turn interaction, MTR-Bench assesses a model’s capacity to plan, adapt, and refine hypotheses in response to feedback, thereby exposing weaknesses not apparent in traditional benchmarks. Each of the 40 tasks is purposely constructed so that a solution cannot be achieved in a single pass but instead demands repeated action–feedback loops until the objective is attained or a turn limit is reached.

Reasoning Capability Classes

Tasks are organized into four modes, each foregrounding a characteristic reasoning process:

| Mode (Abbr.) | Primary Reasoning Process | Example Task |
|---|---|---|
| Information Probing (IP) | Inductive elimination | Find impostors by probing triplets for majority |
| Dynamic Adaptation (DA) | Abductive adaptation | Password-Breaker-XOR with changing target |
| State Operation (SO) | Deductive manipulation | Maze navigation with hidden swapped controls |
| Strategic Gaming (SG) | Strategic/adversarial planning | Knight battle on n×m board versus an opponent |

While multi-skill blending is encouraged, each mode serves as a diagnostic for a specific aspect: information aggregation (IP), hypothesis updating (DA), latent mechanism discovery (SO), or multi-step planning in adversarial contexts (SG).

2. Dataset Construction

The assembly of MTR-Bench uses a two-stage process: hand-crafted task templates and programmatic instantiation. Initial seeds for tasks are sourced from structured puzzles (e.g., Codeforces, NYT), each converted into a formal problem template detailing:

  • Legal action and query formats
  • Response mechanics, with numerically or symbolically precise feedback
  • Example dialogues demonstrating valid turn structure

A Generator module, $P(t, n, g_n)$, takes as input the template $t$, numeric difficulty parameter $n$, and random seeds $g_n$, outputting a unique problem instance $p$ and its hidden solution $s$. For each template, 30 instances are created at each of three calibrated difficulty levels (easy, medium, hard) by tuning $n$, resulting in $40 \times 3 \times 30 = 3600$ total instances. Difficulty is operationally defined via the growth of the underlying search space and calibrated empirically such that a reference model (o3-mini) yields a success gradient of roughly 60% (easy), 40% (medium), 20% (hard).
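The instantiation step can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the MTR-Bench code: the `Instance` record, the toy "hidden integer" template, the `task_00` … `task_39` identifiers, and the difficulty values in `LEVELS` are placeholders chosen only to show how a seeded generator yields reproducible instances and how the 3,600 total arises.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    template: str    # template identifier t
    difficulty: int  # numeric difficulty parameter n
    seed: int        # random seed g_n
    problem: str     # problem statement p shown to the model
    solution: int    # hidden solution s, never shown to the model


def generate(template: str, n: int, seed: int) -> Instance:
    """Toy stand-in for P(t, n, g_n): hide an integer in [1, 2**n]."""
    rng = random.Random(seed)  # local RNG, so the same seed gives the same instance
    upper = 2 ** n             # search space grows with the difficulty parameter
    secret = rng.randint(1, upper)
    problem = f"[{template}] Find the hidden integer in [1, {upper}]."
    return Instance(template, n, seed, problem, secret)


# Hypothetical difficulty settings; the benchmark calibrates n per template empirically.
LEVELS = {"easy": 4, "medium": 8, "hard": 12}
TEMPLATES = [f"task_{k:02d}" for k in range(40)]


def build_dataset(per_level: int = 30) -> list[Instance]:
    """Instantiate every template at every difficulty level with distinct seeds."""
    return [
        generate(t, n, seed)
        for t in TEMPLATES
        for n in LEVELS.values()
        for seed in range(per_level)
    ]


dataset = build_dataset()
print(len(dataset))  # 40 templates * 3 levels * 30 seeds = 3600
```

Seeding a local `random.Random` rather than the global generator is one simple way to make each instance reproducible from the triple (template, difficulty, seed) alone.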

3. Multi-Turn Interaction and Evaluation Protocol

Each iteration within an MTR-Bench task adheres to a standardized protocol:

  • At turn $i$, the model receives the problem context $C_i = (p, H_{i-1})$, where $H_{i-1}$ is the interaction history up to the previous turn.
  • It emits a query in the prescribed format.
  • The Monitor module validates the operation, updates the internal task state, and returns feedback.
  • The task concludes when the task state matches the hidden objective (success) or when the maximum turn threshold is reached.
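The protocol above can be sketched as a short loop. This is an illustrative toy under stated assumptions, not the benchmark's Monitor: the `GuessMonitor` class, the "higher"/"lower"/"correct" feedback strings, the `run_episode` helper, the binary-search policy standing in for a model, and the turn limit of 10 are all hypothetical.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (query, feedback) pairs, playing the role of H_{i-1}


class GuessMonitor:
    """Toy Monitor for a hidden-integer task (hypothetical, for illustration only)."""

    def __init__(self, secret: int, upper: int):
        self.secret, self.upper = secret, upper
        self.solved = False  # internal task state

    def step(self, query: str) -> Tuple[bool, str]:
        """Validate one query and return (is_valid, feedback)."""
        if not query.isdigit() or not 1 <= int(query) <= self.upper:
            return False, "invalid"
        guess = int(query)
        if guess == self.secret:
            self.solved = True
            return True, "correct"
        return True, ("higher" if guess < self.secret else "lower")


def run_episode(problem: str, monitor: GuessMonitor,
                policy: Callable[[str, History], str], max_turns: int) -> dict:
    """Run the action-feedback loop until success or the turn limit."""
    history: History = []
    invalid = False
    for turn in range(1, max_turns + 1):
        query = policy(problem, history)  # the model sees C_i = (p, H_{i-1})
        valid, feedback = monitor.step(query)
        invalid = invalid or not valid
        history.append((query, feedback))
        if monitor.solved:
            return {"success": True, "turns": turn, "invalid": invalid}
    return {"success": False, "turns": max_turns, "invalid": invalid}


def make_binary_search_policy(upper: int) -> Callable[[str, History], str]:
    """Stand-in for an LLM: narrows the range using the feedback history."""
    def policy(problem: str, history: History) -> str:
        lo, hi = 1, upper
        for query, feedback in history:
            if feedback == "higher":
                lo = int(query) + 1
            elif feedback == "lower":
                hi = int(query) - 1
        return str((lo + hi) // 2)
    return policy


result = run_episode("Find the hidden integer in [1, 16].",
                     GuessMonitor(secret=11, upper=16),
                     make_binary_search_policy(16), max_turns=10)
print(result)  # {'success': True, 'turns': 4, 'invalid': False}
```

Keeping the hidden solution inside the Monitor and passing only the problem text and history to the policy mirrors the separation described above: the model can learn about the objective only through feedback.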

Automated evaluation employs the following metrics:

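  1. Accuracy (Acc): The fraction of task instances solved within the turn limit.
  2. Efficiency (Eff): Computed over shared-success tasks, i.e. the tasks solved by all models under comparison, from the number of turns each model uses on each such task.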
  3. Invalid Rate (IR): Fraction of dialogues containing at least one invalid operation.
  4. Pattern Analysis (PA): Aggregates the per-turn frequency of high-level reasoning behaviors (Associate, Verify, Plan, Feedback).
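A rough sketch of how the first three metrics could be computed from per-dialogue records follows. The record layout and function names are hypothetical, and the efficiency aggregation is an assumption: the sketch takes Eff to be the mean number of turns on shared-success tasks (lower meaning more efficient), which may differ from the benchmark's exact definition.

```python
from statistics import mean
from typing import Dict, Optional

# results[model][task_id] = {"success": bool, "turns": int, "invalid": bool}
Results = Dict[str, Dict[str, dict]]


def accuracy(runs: Dict[str, dict]) -> float:
    """Fraction of dialogues that reached the objective."""
    return mean(1.0 if r["success"] else 0.0 for r in runs.values())


def invalid_rate(runs: Dict[str, dict]) -> float:
    """Fraction of dialogues containing at least one invalid operation."""
    return mean(1.0 if r["invalid"] else 0.0 for r in runs.values())


def efficiency(results: Results) -> Dict[str, Optional[float]]:
    """Mean turns per model on tasks every model solved (assumed aggregation)."""
    solved_sets = [
        {task for task, r in runs.items() if r["success"]}
        for runs in results.values()
    ]
    shared = set.intersection(*solved_sets)
    if not shared:  # no task solved by all models: efficiency is undefined
        return {model: None for model in results}
    return {
        model: mean(runs[task]["turns"] for task in shared)
        for model, runs in results.items()
    }


# Made-up records for two hypothetical models on three tasks.
results: Results = {
    "model_a": {
        "t1": {"success": True, "turns": 4, "invalid": False},
        "t2": {"success": True, "turns": 6, "invalid": True},
        "t3": {"success": False, "turns": 10, "invalid": False},
    },
    "model_b": {
        "t1": {"success": True, "turns": 2, "invalid": False},
        "t2": {"success": False, "turns": 10, "invalid": True},
        "t3": {"success": True, "turns": 8, "invalid": False},
    },
}

print(accuracy(results["model_a"]))  # 2 of 3 dialogues succeeded
print(efficiency(results))           # only t1 is solved by both models
```

Restricting the comparison to shared-success tasks avoids penalizing a model for spending many turns on tasks that the other models never solve at all.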

This automated, end-to-end architecture ensures scalability and the ability to support new tasks and rapid zero/few-shot evaluation, without human-labeled data.

4. Experimental Results and Model Performance

Experiments conducted on MTR-Bench encompassed 20 models, including specialized reasoning LLMs (e.g., o3-mini, R1, QwQ-32B) and high-profile general-purpose LLMs (e.g., GPT-4o, Qwen-Max, Llama-3.1).

Main empirical findings include:

  • Performance Degradation by Difficulty: Across all classes, models exhibit a monotonic drop in accuracy from easy to hard, consistent with the intended calibration.
  • Specialized vs. General Models: Purpose-built reasoning models substantially outperform general LLMs, even those of larger parameter count.
  • Ceilings in Multi-Turn vs. Single-Turn: The highest accuracy on easy interactive tasks is approximately 60% (o3-mini), falling below 20% on hard; single-turn benchmarks report >95% on their easy partitions.
  • Impact of Allowed Turns: Additional allowed steps substantially benefit IP (inductive) tasks, though the gain is limited in DA and SO modes, pointing to prevailing model limitations in abductive and deductive chain reasoning.
  • Efficiency vs. Accuracy Tradeoff: R1, while not the most accurate, is more turn-efficient than o3-mini on shared-success tasks, highlighting divergent model search and planning strategies.
  • Invalid Operation Sensitivity: Smaller and distilled models exhibit markedly high invalid rates (up to 50%), indicating instruction-following brittleness, while even top models incur ~10%.

5. Limitations, Failure Modes, and Research Directions

Analysis of MTR-Bench results reveals persistent gaps:

  • Feedback Chaining: Tasks that require robust chaining of feedback, especially in the DA (abductive) and SO (deductive) classes, prove difficult, as models often fail to build or update consistent latent state representations.
  • Instruction Stability: Distillation, while beneficial for single-turn format stability, degrades adherence to operation/response requirements in multi-turn settings, potentially remediable through RL-based fine-tuning.
  • Strategic Gaming Weaknesses: SG tasks currently employ random system responses; introducing principled adversarial opponents could yield stronger diagnostics for planning.
  • Efficiency–Accuracy Divergence: Differences in models' strategies and planning cost functions invite further work in multi-objective optimization for reasoning LLMs.

A plausible implication is that current LLM architectures are not inherently equipped for symbol-rich, iterative deduction and feedback integration, motivating further integration of symbolic-rule induction techniques and more dynamic environment feedback loops.

6. Reproducibility, Release, and Future Extensions

MTR-Bench will be made publicly available under the Apache 2.0 license, with the following resources:

  • 40 problem templates and dataset generation code
  • Monitor and Evaluator modules (Python)
  • Automated benchmarking scripts for OpenAI and open-source LLMs
  • Environment reproducibility artifacts (Dockerfile, dependency management), calibration instructions, example logs

This infrastructure enables the rapid instantiation and benchmarking of new interactive tasks at arbitrary difficulty levels and supports fully automated, human-free assessment. The anticipated repository is https://github.com/Alibaba-MTR/MTR-Bench.

The systematic, scalable nature of MTR-Bench enables direct comparison of emerging reasoning strategies and architectures in an interactive, open-ended problem-solving regime. Its adoption is expected to drive progress on planning, adaptation, and sustained reasoning in LLMs, providing new baselines and a rigorous challenge for iterative, interactive AI systems (Li et al., 21 May 2025).
