MTR-Bench: Interactive LLM Benchmark
- MTR-Bench is a comprehensive automated benchmark that evaluates multi-turn, interactive reasoning capabilities in large language models.
- It employs 40 tasks with 3,600 instances to test iterative problem solving through feedback, adaptive planning, and difficulty calibration.
- The benchmark highlights performance trade-offs between specialized reasoning models and general-purpose LLMs, focusing on accuracy, efficiency, and instruction adherence.
MTR-Bench is a comprehensive, automated benchmark designed to evaluate the capabilities of LLMs in multi-turn, interactive reasoning scenarios that closely resemble real-world problem solving. Unlike prior benchmarks predominantly focused on single-turn reasoning, MTR-Bench systematically probes models’ ability to act iteratively—incorporating feedback, adapting plans, and sustaining coherence across a sequence of turns. It consists of 40 interactive tasks grouped into four distinct reasoning modes, encompasses 3,600 total instances with fine-grained difficulty control, and provides a template-driven, end-to-end pipeline for scalable, reproducible assessment (Li et al., 21 May 2025).
1. Motivation and Scope
MTR-Bench was introduced to fill the critical gap in benchmarking LLMs' ability to handle iterative, interactive reasoning. In most practical contexts, complex reasoning requires a sequence of decisions where each step is informed by previous outcomes—phenomena that single-turn evaluation datasets (e.g., GSM8K, ProofWriter) fail to capture. By emphasizing multi-turn interaction, MTR-Bench assesses a model’s capacity to plan, adapt, and refine hypotheses in response to feedback, thereby exposing weaknesses not apparent in traditional benchmarks. Each of the 40 tasks is purposely constructed so that a solution cannot be achieved in a single pass but instead demands repeated action–feedback loops until the objective is attained or a turn limit is reached.
Reasoning Capability Classes
Tasks are organized into four modes, each foregrounding a characteristic reasoning process:
| Mode (Abbr.) | Primary Reasoning Process | Example Task |
|---|---|---|
| Information Probing (IP) | Inductive elimination | Find impostors by probing triplets for majority |
| Dynamic Adaptation (DA) | Abductive adaptation | Password-Breaker-XOR with changing target |
| State Operation (SO) | Deductive manipulation | Maze navigation with hidden swapped controls |
| Strategic Gaming (SG) | Strategic/adversarial planning | Knight battle on n×m board versus an opponent |
While multi-skill blending is encouraged, each mode serves as a diagnostic for a specific aspect: information aggregation (IP), hypothesis updating (DA), latent mechanism discovery (SO), or multi-step planning in adversarial contexts (SG).
2. Dataset Construction
The assembly of MTR-Bench uses a two-stage process: hand-crafted task templates and programmatic instantiation. Initial seeds for tasks are sourced from structured puzzles (e.g., Codeforces, NYT), each converted into a formal problem template detailing:
- Legal action and query formats
- Response mechanics, with numerically or symbolically precise feedback
- Example dialogues demonstrating valid turn structure
A Generator module takes as input the task template, a numeric difficulty parameter, and random seeds, and outputs a unique problem instance together with its hidden solution. For each template, 30 instances are created at each of three calibrated difficulty levels (easy, medium, hard) by tuning the difficulty parameter, resulting in 40 × 3 × 30 = 3,600 total instances. Difficulty is operationally defined via the growth of the underlying search space and calibrated empirically such that a reference model (o3-mini) yields a success gradient of roughly 60% (easy), 40% (medium), 20% (hard).
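The instantiation step can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `generate`, `build_dataset`, and `Instance`, the toy bit-string task, and the concrete difficulty values are hypothetical and are not taken from the benchmark's code. The sketch only shows how a template, a difficulty parameter, and a seed can determine a reproducible instance, and how the instance count arises.

```python
import random
from dataclasses import dataclass

# Assumed difficulty parameters; the benchmark's real values are not given in the text.
DIFFICULTY_PARAM = {"easy": 4, "medium": 8, "hard": 16}
INSTANCES_PER_LEVEL = 30


@dataclass(frozen=True)
class Instance:
    template: str
    level: str
    seed: int
    secret: tuple  # hidden solution, never shown to the model


def generate(template: str, level: str, seed: int) -> Instance:
    """Toy generator: the hidden solution is a random bit string whose length
    is the difficulty parameter d, so the search space grows as 2**d."""
    rng = random.Random(f"{template}:{level}:{seed}")  # seeded, hence reproducible
    d = DIFFICULTY_PARAM[level]
    secret = tuple(rng.randint(0, 1) for _ in range(d))
    return Instance(template, level, seed, secret)


def build_dataset(templates):
    """Instantiate every template at every level with a fixed number of seeds."""
    return [
        generate(t, level, seed)
        for t in templates
        for level in DIFFICULTY_PARAM
        for seed in range(INSTANCES_PER_LEVEL)
    ]


dataset = build_dataset([f"task_{i:02d}" for i in range(40)])
print(len(dataset))  # 40 templates * 3 levels * 30 seeds = 3600
```

Seeding the generator from the template name, level, and seed index means any instance can be regenerated on demand, which is one simple way to obtain the reproducibility the pipeline aims for.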
3. Multi-Turn Interaction and Evaluation Protocol
Each iteration within an MTR-Bench task adheres to a standardized protocol:
- At each turn, the model receives the problem context.
- It emits a query in the prescribed format.
- The Monitor module validates the operation, updates the internal task state, and returns feedback.
- The task concludes on success, when the hidden objective is matched, or upon reaching the maximum turn threshold.
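The turn protocol above can be sketched as a short loop. This is a toy under stated assumptions, not the benchmark's implementation: the `Monitor` class here, the number-guessing task, the feedback strings, `run_episode`, and the `max_turns` value are all hypothetical, and a scripted bisection policy stands in for an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Toy Monitor for a number-guessing task (hypothetical, for illustration)."""
    secret: int
    low: int = 1
    high: int = 100
    history: list = field(default_factory=list)  # internal task state

    def step(self, query: str):
        """Validate a query, update state, and return (feedback, solved, valid)."""
        text = query.strip()
        if not text.isdecimal() or not (self.low <= int(text) <= self.high):
            self.history.append((query, "invalid"))
            return "invalid", False, False
        guess = int(text)
        if guess == self.secret:
            feedback = "correct"
        else:
            feedback = "higher" if guess < self.secret else "lower"
        self.history.append((query, feedback))
        return feedback, guess == self.secret, True


def run_episode(model, monitor, max_turns):
    """Query the model until success or until the turn limit is reached."""
    context = f"Guess an integer in [{monitor.low}, {monitor.high}]."
    invalid = False
    for turn in range(1, max_turns + 1):
        query = model(context)
        feedback, solved, valid = monitor.step(query)
        invalid = invalid or not valid
        if solved:
            return {"success": True, "turns": turn, "invalid": invalid}
        context += f"\nTurn {turn}: you sent {query!r}, feedback: {feedback}"
    return {"success": False, "turns": max_turns, "invalid": invalid}


def make_binary_search_model(low=1, high=100):
    """Stand-in for an LLM: bisects using the last feedback line in the context."""
    bounds = {"low": low, "high": high, "last": None}

    def model(context: str) -> str:
        last_line = context.splitlines()[-1]
        if bounds["last"] is not None:
            if last_line.endswith("higher"):
                bounds["low"] = bounds["last"] + 1
            elif last_line.endswith("lower"):
                bounds["high"] = bounds["last"] - 1
        bounds["last"] = (bounds["low"] + bounds["high"]) // 2
        return str(bounds["last"])

    return model


# max_turns=10 is an assumed limit chosen for this toy only.
result = run_episode(make_binary_search_model(), Monitor(secret=37), max_turns=10)
print(result)  # {'success': True, 'turns': 3, 'invalid': False}
```

Keeping validation inside the Monitor means a malformed query is recorded as an invalid operation rather than crashing the episode, which is what makes a per-dialogue invalid rate measurable.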
Automated evaluation employs the following metrics:
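- Accuracy (Acc): Fraction of instances solved successfully within the turn limit.
- Efficiency (Eff): Computed over shared-success tasks, i.e., the tasks solved by all compared models, from the number of turns each model used on each such task.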
- Invalid Rate (IR): Fraction of dialogues containing at least one invalid operation.
- Pattern Analysis (PA): Aggregates the per-turn frequency of high-level reasoning behaviors (Associate, Verify, Plan, Feedback).
This automated, end-to-end architecture ensures scalability and the ability to support new tasks and rapid zero/few-shot evaluation, without human-labeled data.
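The metrics can be approximated from per-episode records, as in the sketch below. The record layout (`success`, `turns`, `invalid`), the function names, and the toy numbers are assumptions made for illustration and are not results from the benchmark. In particular, the exact efficiency formula is not reproduced here; the mean number of turns over shared-success tasks is used as a stand-in, where fewer turns means more efficient.

```python
def accuracy(results):
    """Fraction of episodes that ended in success."""
    return sum(r["success"] for r in results.values()) / len(results)


def invalid_rate(results):
    """Fraction of episodes containing at least one invalid operation."""
    return sum(r["invalid"] for r in results.values()) / len(results)


def mean_turns_on_shared_successes(results_by_model):
    """Average turns per model, restricted to tasks that every model solved.
    This is an assumed stand-in for Eff, not the benchmark's exact formula."""
    shared = set.intersection(*(
        {task for task, r in results.items() if r["success"]}
        for results in results_by_model.values()
    ))
    if not shared:
        return {}
    return {
        name: sum(results[t]["turns"] for t in shared) / len(shared)
        for name, results in results_by_model.items()
    }


# Made-up episode records for two hypothetical models.
runs = {
    "model_a": {
        "t1": {"success": True, "turns": 4, "invalid": False},
        "t2": {"success": True, "turns": 6, "invalid": True},
        "t3": {"success": False, "turns": 10, "invalid": False},
        "t4": {"success": True, "turns": 8, "invalid": False},
    },
    "model_b": {
        "t1": {"success": True, "turns": 3, "invalid": False},
        "t2": {"success": False, "turns": 10, "invalid": True},
        "t3": {"success": False, "turns": 10, "invalid": True},
        "t4": {"success": True, "turns": 5, "invalid": False},
    },
}

print(accuracy(runs["model_a"]), accuracy(runs["model_b"]))          # 0.75 0.5
print(invalid_rate(runs["model_a"]), invalid_rate(runs["model_b"]))  # 0.25 0.5
print(mean_turns_on_shared_successes(runs))  # {'model_a': 6.0, 'model_b': 4.0}
```

Restricting the turn comparison to tasks both models solved avoids penalizing the more accurate model for the extra turns it spends on harder tasks that the other model never completes. In this toy data, `model_a` is more accurate while `model_b` uses fewer turns on the shared tasks.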
4. Experimental Results and Model Performance
Experiments conducted on MTR-Bench encompassed 20 models, including specialized reasoning LLMs (e.g., o3-mini, R1, QwQ-32B) and high-profile general-purpose LLMs (e.g., GPT-4o, Qwen-Max, Llama-3.1).
Main empirical findings include:
- Performance Degradation by Difficulty: Across all classes, models exhibit a monotonic drop in accuracy from easy to hard, consistent with the intended calibration.
- Specialized vs. General Models: Purpose-built reasoning models substantially outperform general LLMs, even those of larger parameter count.
- Ceilings in Multi-Turn vs. Single-Turn: The highest accuracy on easy interactive tasks is approximately 60% (o3-mini), falling below 20% on hard; single-turn benchmarks report >95% on their easy partitions.
- Impact of Allowed Turns: Additional allowed steps substantially benefit IP (inductive) tasks, though the gain is limited in DA and SO modes, pointing to prevailing model limitations in abductive and deductive chain reasoning.
- Efficiency vs. Accuracy Tradeoff: R1, while not the most accurate, is more turn-efficient than o3-mini on shared-success tasks, highlighting divergent model search and planning strategies.
- Invalid Operation Sensitivity: Smaller and distilled models exhibit markedly high invalid rates (up to 50%), indicating instruction-following brittleness, while even top models incur ~10%.
5. Limitations, Failure Modes, and Research Directions
Analysis of MTR-Bench results reveals persistent gaps:
- Feedback Chaining: Tasks that require robust chaining of feedback, especially in the DA (abductive) and SO (deductive) classes, remain difficult, as models often fail to build or update consistent latent state representations.
- Instruction Stability: Distillation, while beneficial for single-turn format stability, degrades adherence to operation/response requirements in multi-turn settings, potentially remediable through RL-based fine-tuning.
- Strategic Gaming Weaknesses: SG tasks currently employ random system responses; introducing principled adversarial opponents could yield stronger diagnostics for planning.
- Efficiency–Accuracy Divergence: Differences in models' strategies and planning cost functions invite further work in multi-objective optimization for reasoning LLMs.
A plausible implication is that current LLM architectures are not inherently equipped for symbol-rich, iterative deduction and feedback integration, motivating further integration of symbolic-rule induction techniques and more dynamic environment feedback loops.
6. Reproducibility, Release, and Future Extensions
MTR-Bench will be made publicly available under the Apache 2.0 license, with the following resources:
- 40 problem templates and dataset generation code
- Monitor and Evaluator modules (Python)
- Automated benchmarking scripts for OpenAI and open-source LLMs
- Environment reproducibility artifacts (Dockerfile, dependency management), calibration instructions, example logs
This infrastructure enables the rapid instantiation and benchmarking of new interactive tasks at arbitrary difficulty levels and supports fully automated, human-free assessment. The anticipated repository is https://github.com/Alibaba-MTR/MTR-Bench.
The systematic, scalable nature of MTR-Bench enables direct comparison of emerging reasoning strategies and architectures in an interactive, open-ended problem-solving regime. Its adoption is expected to drive progress on planning, adaptation, and sustained reasoning in LLMs, providing new baselines and a rigorous challenge for iterative, interactive AI systems (Li et al., 21 May 2025).