CircularEval: A Robust Evaluation Protocol

Updated 1 June 2026
  • CircularEval is an evaluation protocol that cyclically shifts answer options to accurately assess a model's semantic understanding.
  • It mitigates label position and surface-pattern biases by enforcing correct answers across all circular permutations.
  • Empirical results from benchmarks like MMBench demonstrate significant performance drops, revealing reliance on superficial cues.

CircularEval is an evaluation protocol designed to robustly assess the semantic understanding and answer consistency of models, especially vision-language models (VLMs) and large language models (LLMs), on multiple-choice questions. Unlike conventional single-pass scoring, CircularEval requires a model to answer correctly under every circular permutation (“shift”) of the answer options for a given question. By systematically mitigating label-position and surface-pattern biases, CircularEval provides a sharper, more discriminative measure of model competence, as demonstrated in benchmarks such as MMBench and FoundaBench (Liu et al., 2023, Li et al., 2024).

1. Concept and Motivation

CircularEval evaluates a model’s performance on multiple-choice QA by cyclically rotating the positions of the answer options and scoring a question as correct if, and only if, the correct answer is selected under every circular permutation. This approach was introduced to address several deficiencies in classical multiple-choice evaluation:

  • Label and position bias: Models can exploit superficial correlations, such as favoring a particular choice label (e.g., “C”), irrespective of semantic content.
  • Random guessing: Single-pass scoring tolerates spurious correct responses at a rate of $1/N$ for $N$-way choices.
  • Surface-pattern bias: Some models latch onto familiar token or option patterns in the prompt structure.
  • Instruction-following variability: VLMs often answer in natural language or arbitrary phrasings, which makes rule-based mapping from free-form output to an option label brittle.
  • Scalability and reproducibility: Human-in-the-loop assessment is generally costly, slow, and prone to subjective bias.

CircularEval operationalizes robustness by requiring full consistency across all option orderings, thus substantially reducing spurious correctness and rewarding genuine semantic comprehension (Liu et al., 2023, Li et al., 2024).

2. Formal Algorithmic Workflow

Consider a dataset $Q$ of multiple-choice questions, each with $N$ labeled options $\{c_1, \ldots, c_N\}$ and a gold label $a^*$. For each question $q$, CircularEval executes the following procedure:

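    for s = 0, ..., N-1:
        shifted_options = Rotate(options, s)
        output = Model(q, shifted_options)
        predicted = ExtractLabel(output)
        if predicted != Correct_label_shifted(a*, s):
            mark q as incorrect and stop (early exit)
    mark q as correct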

“Rotate” denotes a cyclic shift of the answer options by $s$ positions. “Correct_label_shifted” is the shifted label corresponding to the true answer after the permutation. The label extraction process (discussed in Section 4) maps arbitrary model output to the fixed label set. Early exit saves computation: evaluation of further shifts terminates once a failure is detected for a question.
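The procedure above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation from either benchmark: the function names, the left-shift convention, and the `predict` callable (which stands in for the model call plus label extraction) are all assumptions made for the example.

```python
from typing import Callable, Sequence

LABELS = "ABCDEFGH"  # assumed label alphabet; supports up to 8 options


def rotate(options: Sequence[str], s: int) -> list[str]:
    """Cyclically shift the options left by s positions."""
    n = len(options)
    return [options[(i + s) % n] for i in range(n)]


def correct_label_shifted(gold_index: int, s: int, n: int) -> str:
    """Label carried by the gold option after a left shift by s positions."""
    return LABELS[(gold_index - s) % n]


def circular_eval_question(
    question: str,
    options: Sequence[str],
    gold_index: int,
    predict: Callable[[str, Sequence[str]], str],
) -> bool:
    """Return True only if `predict` picks the gold option under every shift."""
    n = len(options)
    for s in range(n):
        predicted = predict(question, rotate(options, s))
        if predicted != correct_label_shifted(gold_index, s, n):
            return False  # early exit: remaining shifts are skipped
    return True


if __name__ == "__main__":
    opts = ["Paris", "Rome", "Berlin", "Madrid"]

    def oracle(q, shown):
        return LABELS[shown.index("Paris")]  # follows the content

    def always_a(q, shown):
        return "A"  # pure position bias

    print(circular_eval_question("Capital of France?", opts, 0, oracle))    # True
    print(circular_eval_question("Capital of France?", opts, 0, always_a))  # False
```

The position-biased predictor passes the unshifted prompt but fails as soon as the gold option moves away from label “A”, which is exactly the behavior the protocol is meant to expose.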

3. Mathematical Foundation and Bias Reduction

Let $I_{q,s}$ be an indicator variable, with $I_{q,s}=1$ if the model answers correctly with the $s$-th (circularly shifted) prompt and $0$ otherwise. For each question $q$, define the overall score:

$$S_q = \prod_{s=0}^{N-1} I_{q,s}$$

(i.e., $S_q = 1$ only if all $N$ rotations are correct).

CircularEval accuracy across all questions is then:

$$\mathrm{Acc}_{\mathrm{circ}} = \frac{1}{|Q|} \sum_{q \in Q} S_q$$

This protocol estimates $P(\text{correct under all } N \text{ rotations})$ rather than $P(\text{correct under a single ordering})$, making the metric significantly more robust to label-position effects. The probability that a purely guessing model answers all $N$ permutations correctly is $(1/N)^N$ (e.g., $1/256 \approx 0.4\%$ for $N = 4$ choices), thus sharply suppressing spurious correctness. Empirically, CircularEval accuracy often drops 4–27% compared to one-pass evaluation, exposing gaps between models that may be masked by vanilla scoring (Liu et al., 2023, Li et al., 2024).
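To make the aggregation concrete, the sketch below computes single-pass and circular accuracy from a table of per-shift indicators $I_{q,s}$, along with the chance that independent uniform guessing survives every rotation. The function names and the toy indicator table are illustrative assumptions, not data from either benchmark.

```python
from fractions import Fraction
from typing import Sequence


def vanilla_accuracy(indicators: Sequence[Sequence[int]]) -> float:
    """Accuracy on the unshifted prompt only (shift s = 0)."""
    return sum(row[0] for row in indicators) / len(indicators)


def circular_accuracy(indicators: Sequence[Sequence[int]]) -> float:
    """Fraction of questions answered correctly under every shift."""
    return sum(all(row) for row in indicators) / len(indicators)


def guess_all_correct_probability(n: int) -> Fraction:
    """Chance that n independent uniform guesses over n options are all correct."""
    return Fraction(1, n) ** n


if __name__ == "__main__":
    # Rows are questions, columns are shifts s = 0..3 (toy values).
    indicators = [
        [1, 1, 1, 1],
        [1, 0, 1, 1],
        [0, 1, 1, 1],
        [1, 1, 1, 1],
    ]
    print(vanilla_accuracy(indicators))      # 0.75
    print(circular_accuracy(indicators))     # 0.5
    print(guess_all_correct_probability(4))  # 1/256
```

In the toy table, the second question counts as correct under single-pass scoring but not under CircularEval, because one of its rotations fails.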

4. Free-Form to Predefined Choice Mapping

Real-world models, especially VLMs, often output answers as free-form text or full sentences instead of canonical option indices (“A”, “B”, ...). CircularEval incorporates a robust two-stage label extraction:

  • Stage 1 (Heuristic Matching): If the output contains exactly one option label (“A”–“D”) or an unambiguous substring match to one of the choices, assign that label.
  • Stage 2 (LLM-Based Mapping): Otherwise, a strong LLM (e.g., GPT-4-0125) is prompted with the question, set of options, and raw model output to select the best match. The LLM outputs {A, B, C, D, Z}, with “Z” indicating no match.

If “Z” is returned, all shifts for that question are marked as incorrect. In MMBench, GPT-4-0125 achieved 91.5% agreement with human judges on ∼420 challenging cases, and LLM-based extraction led to up to +23% improvement in zero-shot model accuracy on some VLMs (Liu et al., 2023).
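A rough sketch of the two-stage extraction might look like the following. The regular expression, the function names, and the `llm_judge` callable are assumptions for illustration; the actual prompts and matching rules used in MMBench are not reproduced here, and no real LLM API is called.

```python
import re
from typing import Callable, Optional, Sequence

LABELS = "ABCD"  # assumed label set; "Z" means no match


def heuristic_match(output: str, options: Sequence[str]) -> Optional[str]:
    """Stage 1: accept exactly one standalone label or one option-text match."""
    labels = LABELS[: len(options)]
    # Standalone capital letters such as "B", "(B)" or "B."
    # Caveat: a sentence-initial article "A" would also be read as label A.
    found = set(re.findall(rf"\b([{labels}])\b", output))
    if len(found) == 1:
        return found.pop()
    if not found:
        lowered = output.lower()
        hits = [labels[i] for i, opt in enumerate(options) if opt.lower() in lowered]
        if len(hits) == 1:
            return hits[0]
    return None  # ambiguous or no match: defer to stage 2


def extract_label(
    question: str,
    output: str,
    options: Sequence[str],
    llm_judge: Optional[Callable[[str, Sequence[str], str], str]] = None,
) -> str:
    """Return a label from the fixed set, or "Z" if nothing matches."""
    label = heuristic_match(output, options)
    if label is not None:
        return label
    if llm_judge is not None:
        # Stage 2: hypothetical judge returning one label or "Z".
        verdict = llm_judge(question, options, output).strip().upper()
        if verdict in set(LABELS[: len(options)]):
            return verdict
    return "Z"
```

A caller would treat a returned “Z” as a failure for the question, matching the rule stated above. Keeping the judge as an injected callable means the heuristic stage can be tested without any network access.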

5. Practical Implementation Considerations

Several practical aspects are integral to correct and efficient CircularEval deployment:

  • Choice extractor LLM: GPT-4-0125 is the default, but alternatives (GPT-3.5-turbo, InternLM2-7B) produce similar model rankings.
  • Early exit: As soon as a model fails one shift, subsequent evaluations for that question terminate.
  • Quality control: Datasets undergo automated and manual filtering to remove “text-only” solvable items and noisy QA pairs.
  • Shift granularity: The number of shifts $N$ equals the number of answer choices (typically 2–4).
  • Unit of evaluation: Each question is evaluated independently; there is no batchwise contamination.

6. Empirical Observations and Results

CircularEval’s impact has been demonstrated across diverse domains. In MMBench, model rankings revealed wider accuracy gaps under CircularEval (e.g., LLaVA-v1.5-13B vs. 7B widened from 2.1% to 4.7%; GPT-4v dev-set dropped from 81.5% to 74.3%). In FoundaBench, substantial absolute accuracy drops were recorded:

Model            Raw (approx.)   CircularEval Common Sense   CircularEval K-12
InternLM-123B    ~75%            65.89%                      50.58%
InternLM-70B     -               67.11%                      45.35%
GPT-4            -               55.74%                      47.63%
GPT-3.5-turbo    -               39.70%                      18.64%
Qwen-14B         -               52.79%                      40.31%
Baichuan2-13B    -               47.72%                      23.35%
ChatGLM3-6B      -               26.80%                      11.88%

Observations reveal that models with lower raw accuracy experience larger absolute drops under CircularEval, implying greater reliance on bias or guessing. Reasoning questions show larger performance gaps than factual recall, and there exists a statistically significant negative correlation (QQ4, QQ5) between raw accuracy and CircularEval drop across models in FoundaBench (Li et al., 2024, Liu et al., 2023).

7. Limitations and Potential Extensions

Key limitations of CircularEval include increased computational overhead (each question is evaluated up to $N$ times), restricted bias mitigation (it covers position and label bias but not other surface-pattern or prompt-form biases), and inapplicability to open-ended generation tasks. The protocol is currently tailored to closed-form multiple-choice QA. Potential extensions discussed in the literature include:

  • Sampling random permutations beyond cyclic shifts.
  • Integrating model confidence or log-probabilities.
  • Cross-lingual evaluation to reveal language-specific biases.
  • Coupling option rotation with adversarial stem paraphrasing for advanced robustness (Li et al., 2024).
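As an illustration of the first extension, the sketch below samples distinct random permutations of the option order and tracks where the gold option lands. It is a hypothetical design, not something specified in the cited papers: the function names, the seeded sampling, and the choice to always include the original order are assumptions.

```python
import math
import random
from typing import Sequence


def sample_permutations(n: int, k: int, seed: int = 0) -> list[tuple[int, ...]]:
    """Return up to k distinct orderings of n options, identity first."""
    rng = random.Random(seed)
    identity = tuple(range(n))
    perms = [identity]
    seen = {identity}
    limit = min(k, math.factorial(n))  # cannot exceed n! distinct orderings
    while len(perms) < limit:
        candidate = tuple(rng.sample(range(n), n))
        if candidate not in seen:
            seen.add(candidate)
            perms.append(candidate)
    return perms


def apply_permutation(
    options: Sequence[str], gold_index: int, perm: Sequence[int]
) -> tuple[list[str], int]:
    """Reorder options so position i shows options[perm[i]]; return new gold index."""
    shuffled = [options[j] for j in perm]
    return shuffled, list(perm).index(gold_index)
```

Each sampled ordering could then be scored with the same all-must-be-correct rule used for cyclic shifts, at a cost that grows with the number of permutations sampled.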

CircularEval provides a reproducible, cost-effective, and highly discriminative protocol that sets a high standard for the objective assessment of multimodal models and LLMs (Liu et al., 2023, Li et al., 2024).
