
RotBench Benchmark Overview

Updated 21 August 2025
  • RotBench is a collection of benchmark frameworks that assess algorithm performance in rotated optimization, LLM tool learning, and multimodal spatial reasoning tasks.
  • The linear optimization variant transforms the Klee–Minty problem through rotation and translation to eliminate coordinate bias, enabling precise evaluation of evolutionary algorithms.
  • The benchmarks also challenge models with noise-corrupted tool names and ambiguous image rotations, revealing critical limitations in LLM robustness and MLLM spatial awareness.

RotBench refers to several distinct benchmark frameworks in contemporary AI research. It includes: (1) a scalable linear optimization benchmark for probabilistic search algorithms—specifically a rotated and translated Klee–Minty problem (Hellwig et al., 2018); (2) a multi-level robustness benchmark evaluating LLMs in tool learning under noise (Ye et al., 16 Jan 2024); (3) a manually curated image rotation identification benchmark for multimodal LLMs (MLLMs) (Niu et al., 19 Aug 2025). Each instantiation targets a fundamental limitation in evaluation and benchmarking of optimization, tool learning, or spatial reasoning systems.

1. Linear Constrained Optimization: Rotated Klee–Minty Problem

RotBench (Hellwig et al., 2018) introduces a modified Klee–Minty cube to evaluate the performance of evolutionary algorithms (EAs) on linear constrained optimization tasks. The original Klee–Minty construction generates exponentially hard instances for simplex-type linear programming solvers by perturbing the vertices of a unit hypercube; however, its optimal solution is at the origin, implicitly favoring coordinate-aligned search heuristics.

RotBench applies a translation and an orthogonal rotation to reposition the optimum away from the origin and axes, defining the transformation

$$T(\mathbf{y}) = \mathbf{R}\,(\mathbf{y} - \mathbf{t}),$$

where $\mathbf{R}$ is a rotation matrix constructed from two orthonormal vectors that span a two-dimensional subspace of $\mathbb{R}^N$, and $\mathbf{t} = (N^3, N^3, \ldots, N^3)^\top$. This modification removes alignment bias and preserves the problem's combinatorial hardness.
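A minimal sketch of this transformation in Python with NumPy; the specific pair of orthonormal vectors (and hence the rotation plane and angle) is an illustrative assumption, not the exact construction of Hellwig et al.:

```python
import numpy as np

def rotation_matrix(n: int, theta: float = np.pi / 4) -> np.ndarray:
    """Rotation by theta inside the 2-D subspace spanned by two
    orthonormal vectors u, v of R^n; the identity outside it.
    The choice of u, v (first and last coordinate axes) is illustrative."""
    u = np.zeros(n); u[0] = 1.0
    v = np.zeros(n); v[-1] = 1.0
    return (np.eye(n)
            + (np.cos(theta) - 1.0) * (np.outer(u, u) + np.outer(v, v))
            + np.sin(theta) * (np.outer(v, u) - np.outer(u, v)))

def transform(y: np.ndarray) -> np.ndarray:
    """T(y) = R (y - t) with t = (N^3, ..., N^3)^T."""
    n = y.shape[0]
    t = np.full(n, float(n) ** 3)
    return rotation_matrix(n) @ (y - t)
```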

The benchmark is scalable across multiple dimensions, $N \in \{2, 3, 5, 10, 20, 40\}$, with $2N$ inequality constraints and box constraints defining a perturbed hypercube feasible region. Reporting conventions involve lexicographic ordering of solutions with respect to both objective and constraint violation:

$$\mathbf{y} \preceq_{\text{lex}} \mathbf{z} \iff \begin{cases} f(\mathbf{y}) \leq f(\mathbf{z}), & \text{if } \nu(\mathbf{y}) = \nu(\mathbf{z}), \\ \nu(\mathbf{y}) < \nu(\mathbf{z}), & \text{otherwise}, \end{cases}$$

where $\nu(\mathbf{y})$ is the sum of constraint violations.
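The ordering can be implemented directly from this definition. A minimal sketch, assuming each constraint is encoded as a function $g$ with $g(\mathbf{y}) \le 0$ when satisfied:

```python
def violation(y, constraints) -> float:
    """nu(y): total constraint violation, where each constraint g
    is satisfied iff g(y) <= 0."""
    return sum(max(0.0, g(y)) for g in constraints)

def lex_leq(y, z, f, constraints) -> bool:
    """y precedes z lexicographically: compare objective values only
    when the violations agree; otherwise the smaller violation wins."""
    nu_y, nu_z = violation(y, constraints), violation(z, constraints)
    return f(y) <= f(z) if nu_y == nu_z else nu_y < nu_z
```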

Experimental results demonstrate that both LSHADE44 (a state-of-the-art differential evolution variant) and $\varepsilon$MAg-ES (an evolution strategy with matrix adaptation and lexicographic constraint handling) achieve solutions with error $O(10^{-8})$ across all dimensions, with $\varepsilon$MAg-ES exhibiting runtime advantages as $N$ increases. Compared to LP solvers (e.g., interior point methods), these EAs can deliver competitive or superior precision on the rotated instance, highlighting RotBench's utility as an unbiased, reproducible environment for evaluating probabilistic search algorithms.

2. Robustness Evaluation in LLM Tool Learning

RoTBench (Ye et al., 16 Jan 2024) serves as a multi-level robustness benchmark for LLMs’ tool learning capabilities under real-world noise. Its framework comprises five external “environments”:

| Environment | Noise Type(s)                     | Number of Test Cases |
|-------------|-----------------------------------|----------------------|
| Clean       | None                              | 105                  |
| Slight      | Insertion, omission, substitution | 210                  |
| Medium      | Reversal, nonsense                | 210                  |
| Heavy       | Exchange, addendum                | 210                  |
| Union       | Random mix                        | 105                  |

Each environment corrupts tool and parameter names in controlled ways: inserting, omitting, or substituting characters (“Slight”); reversing names or replacing them with nonsense strings (“Medium”); exchanging names and appending extraneous parameters (“Heavy”); or mixing these corruptions at random (“Union”).
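A minimal sketch of the Slight and Medium corruption families in Python (the Heavy and Union procedures, which operate across multiple tools, are omitted; the function name and sampling details are illustrative, not RoTBench's exact implementation):

```python
import random
import string

def corrupt(name: str, level: str, rng: random.Random) -> str:
    """Apply one character-level corruption to a tool or parameter name."""
    if level == "slight":  # insertion, omission, or substitution
        i = rng.randrange(len(name))
        c = rng.choice(string.ascii_lowercase)
        op = rng.choice(["insert", "omit", "substitute"])
        if op == "insert":
            return name[:i] + c + name[i:]
        if op == "omit":
            return name[:i] + name[i + 1:]
        return name[:i] + c + name[i + 1:]
    if level == "medium":  # reversal or a nonsense string
        if rng.random() < 0.5:
            return name[::-1]
        return "".join(rng.choice(string.ascii_lowercase)
                       for _ in range(len(name)))
    raise ValueError(f"unsupported noise level: {level}")
```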

RoTBench assesses three sequential stages of tool learning:

  1. Tool selection: $S_{TS} = I(t = t^*)$
  2. Parameter identification: $S_{PI} = S_{TS} \cdot I(P = P^*)$
  3. Content filling: $S_{CF} = S_{PI} \cdot \prod_{i=1}^{N} I(C_i = c_i)$
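The cascade means a later stage can only score if every earlier stage already succeeded. A minimal sketch, with dictionary-based parameter and value matching as an illustrative assumption:

```python
def cascade_scores(pred_tool, gold_tool, pred_params, gold_params,
                   pred_values, gold_values):
    """Per-example indicators for the three cascaded stages."""
    s_ts = int(pred_tool == gold_tool)                       # tool selection
    s_pi = s_ts * int(set(pred_params) == set(gold_params))  # parameter identification
    s_cf = s_pi * int(all(pred_values.get(k) == v            # content filling
                          for k, v in gold_values.items()))
    return s_ts, s_pi, s_cf
```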

Key findings include:

  • LLM accuracy degrades sharply with increased noise (e.g., GPT-4's tool-selection accuracy drops from 80.00 to 58.10 when moving from Clean to Union), while human accuracy stays nearly constant.
  • Noise on tool names affects models more strongly than noise on parameters.
  • Automatic noise correction in the GPT family paradoxically harms adaptability for subtle noise levels, with over-correction leading to new errors (most notable in “Slight” noise cases).

RoTTuning is introduced as a strategy to increase robustness, comprising query expansion (self-instruct), trajectory generation (using GPT-4 for rich function call records), environment augmentation (algorithmically injecting noise), and generalizability training (e.g., LoRA and position interpolation on RoTLLaMA). Empirical results show a mean improvement of 16.10 points; ablation confirms environment augmentation and LoRA fine-tuning are critical to this gain.
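The generalizability-training step can be sketched with an off-the-shelf LoRA setup; the snippet below uses Hugging Face's peft library, with the base checkpoint, rank, and target modules as illustrative assumptions rather than the paper's reported configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative base model; RoTLLaMA's actual checkpoint and
# hyperparameters are not reproduced here.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)  # then fine-tune on augmented trajectories
```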

3. Multimodal Rotation Identification and Spatial Reasoning

RotBench (Niu et al., 19 Aug 2025) provides a benchmark to quantify MLLMs’ ability to recognize image rotations, with tasks requiring discrimination among 0°, 90°, 180°, and 270° orientations. The benchmark consists of 350 manually curated images (lifestyle, portrait, landscape), processed through a two-stage review to ensure clear visual cues for orientation. Images without a well-defined “up” direction are excluded.

MLLMs (Qwen-2.5-VL-7B-Instruct, Llama-3.2-11B-Instruct, GPT-5, Gemini-2.5-Pro, o3, etc.) are evaluated via four-way classification, with the answer options randomly remapped to letter choices for each prompt. Performance is robust for 0° (right-side-up) and, for proprietary models, for 180° (upside-down), but systematically weak at discriminating 90° from 270° rotations, where persistent confusion is observed. Some models also show a directional bias, e.g., defaulting to counter-clockwise predictions.

Auxiliary information (captions, bounding boxes, scene graphs, depth maps, segmentation) and chain-of-thought (CoT) prompting are tested; though CoT reliably helps with 180°, improvements for 90°/270° are inconsistent or negative. Specialized strategies, including a “rotation grid” (all orientations shown side-by-side) and “normalized rotation voting” (final prediction by majority vote after normalizing per-rotation model outputs), offer moderate gains for reasoning-centric models but are costly and limited by discrete angle assumptions.
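Normalized rotation voting can be sketched as follows; `classify` stands in for a single model call returning a predicted rotation in degrees, and the PIL-style rotation convention is an assumption of this sketch:

```python
from collections import Counter

ROTATIONS = [0, 90, 180, 270]

def normalized_rotation_vote(image, classify) -> int:
    """Show the model each additionally-rotated copy, undo the known
    extra rotation in each answer, and take a majority vote."""
    votes = []
    for applied in ROTATIONS:
        rotated = image.rotate(-applied, expand=True)  # PIL rotates counter-clockwise
        predicted = classify(rotated)                  # model's predicted rotation
        votes.append((predicted - applied) % 360)      # map back to the original frame
    return Counter(votes).most_common(1)[0][0]
```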

Fine-tuning (Qwen-2.5-VL-7B-Instruct on MS COCO) reliably improves detection of upright and upside-down images but does not stably differentiate between sideways rotations. Accuracy for 90° and 270° fluctuates across runs, suggesting oscillation between two local optima and indicating a representational bottleneck that standard transfer learning does not resolve.

4. Benchmark Construction and Evaluation Protocols

Each RotBench variant adopts rigorous construction and reporting protocols. The rotated Klee–Minty benchmark (Hellwig et al., 2018) prescribes:

  • Dimensional scalability (explicit instance sets)
  • Lexicographic ordering for solution comparison
  • Clear termination criteria (objective within $10^{-8}$ of the optimum, or stagnation)
  • Standardized reporting through quality indicators: best/median function values, feasibility rate, and ECDF plots (sketched below)
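As an illustration of the ECDF convention, the sketch below computes the fraction of runs reaching the target precision within each function-evaluation budget; the run data and budget grid are invented for the example:

```python
import numpy as np

def ecdf_curve(evals_to_target, budgets):
    """Fraction of runs whose evaluation count to reach the target
    (e.g., error below 1e-8) does not exceed each budget; runs that
    never reach the target are recorded as inf."""
    evals = np.asarray(evals_to_target, dtype=float)
    return np.array([(evals <= b).mean() for b in budgets])

# Example: five runs, three of which reached the target.
curve = ecdf_curve([1200.0, 3500.0, 9000.0, np.inf, np.inf],
                   np.logspace(2, 5, num=50))
```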

RoTBench for LLM tool learning (Ye et al., 16 Jan 2024) uses controlled, programmatic corruption of tool names and attributes to construct challenging environments. Multi-level task decomposition enables granular assessment: errors are diagnosed as selection failures, parameter ambiguity, or incorrect content filling.

The image rotation RotBench (Niu et al., 19 Aug 2025) is built on two-level human curation and randomized prompt mapping. Quantitative performance is measured via zero-shot, auxiliary-cue, CoT, and grid/voting protocols, always averaged across multiple runs to reduce sampling variance.

5. Empirical Results and Limitations

All studies report clear empirical limitations in state-of-the-art systems:

  • EAs can match deterministic LP algorithms on highly rotated linear instances, but scaling exposes runtime disparities. RotBench's rotation eliminates coordinate-alignment bias, forcing an evaluation of generalized constraint handling.
  • LLMs are robust in clean or heavily corrupted tool learning scenarios but suffer from over-correction on mild noise—especially for GPT-family models. These artifacts demonstrate the necessity for training data diversity when deploying tool-using LLMs in realistic, noisy settings.
  • MLLMs tested on visual rotation consistently fail to resolve 90° versus 270°, even when aided by additional context, reasoning approaches, or fine-tuning. The underlying model architectures appear insufficiently rotation-aware, supporting the claim that spatial reasoning in deep neural networks is weakly aligned with human perceptual faculties.

6. Implications and Future Directions

Together, the RotBench frameworks expose persistent deficiencies in optimization, tool learning, and spatial reasoning across modern algorithms and models. Their contribution lies in the reproducibility, scalability, and targeted difficulty of their respective environments, compelling new research into unbiased benchmarking, robust training, and geometry-aware architectures.

A plausible implication is that future work should focus on more rotation-aware pre-training for MLLMs (including explicit geometric regularization); for LLM tool learning, expanding the spectrum of environmental noise during instruction-tuning may directly address the over-correction phenomenon. In black-box optimization, continued investigation into rotation-invariant constraint handling and reporting standards is suggested.

These benchmarks are increasingly central to establishing transparent state-of-the-art protocols, measuring algorithmic progress, and diagnosing model limitations in settings that precisely reflect real-world complexity.