CircularEval: A Robust Evaluation Protocol
- CircularEval is an evaluation protocol that cyclically shifts answer options to accurately assess a model's semantic understanding.
- It mitigates label position and surface-pattern biases by enforcing correct answers across all circular permutations.
- Empirical results from benchmarks like MMBench demonstrate significant performance drops, revealing reliance on superficial cues.
CircularEval is an evaluation protocol designed to robustly assess the semantic understanding and answer consistency of models, especially vision-language models (VLMs) and large language models (LLMs), on multiple-choice questions. Unlike conventional single-pass scoring, CircularEval requires a model to answer correctly under every circular permutation (“shift”) of a question's answer options. By systematically mitigating label-position and surface-pattern biases, CircularEval provides a sharper, more discriminative measure of model competence, as demonstrated in benchmarks such as MMBench and FoundaBench (Liu et al., 2023, Li et al., 2024).
1. Concept and Motivation
CircularEval evaluates a model’s performance on multiple-choice QA by cyclically rotating the positions of the answer options and scoring a question as correct if, and only if, the correct answer is selected under every circular permutation. This approach was introduced to address several deficiencies in classical multiple-choice evaluation:
- Label and position bias: Models can exploit superficial correlations, such as favoring a particular choice label (e.g., “C”), irrespective of semantic content.
- Random guessing: Single-pass scoring tolerates spurious correct responses at a rate of $1/N$ for $N$-way choices.
- Surface-pattern bias: Some models latch onto familiar token or option patterns in the prompt structure.
- Instruction-following variability: VLMs often emit natural language or arbitrary phrasings, causing brittle rule-based mapping from free-form output to option label.
- Scalability and reproducibility: Human-in-the-loop assessment is generally costly, slow, and prone to subjective bias.
CircularEval operationalizes robustness by requiring full consistency across all option orderings, thus substantially reducing spurious correctness and rewarding genuine semantic comprehension (Liu et al., 2023, Li et al., 2024).
2. Formal Algorithmic Workflow
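Consider a dataset of $M$ multiple-choice questions, each with $N$ labeled options and a gold label $y_i$. For each question $i$, CircularEval executes the following procedure:
- For each shift $k = 0, 1, \dots, N-1$: Rotate the options by $k$ positions, build the prompt from the shifted options, query the model, and extract a predicted label from its output.
- If the predicted label differs from Correct_label_shifted, mark question $i$ as incorrect and stop evaluating its remaining shifts.
- If the prediction is correct for all $N$ shifts, mark question $i$ as correct.
- After all questions are processed, report the fraction of questions marked correct.
“Rotate” denotes a cyclic shift of the answer options by $k$ positions. “Correct_label_shifted” is the label that the true answer carries after the shift. The label extraction process (discussed in Section 4) maps arbitrary model output to the fixed label set. Early exit saves computation: evaluation of further shifts terminates as soon as a failure is detected for a question.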
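The procedure can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation of either benchmark: the `predict` callable (a wrapper that returns an already-extracted option label), the `LABELS` alphabet, and the `(question, options, gold_index)` dataset layout are all assumptions made for the example.

```python
from typing import Callable, Sequence, Tuple

LABELS = "ABCDEFGH"  # assumed label alphabet; supports up to 8 options

# Hypothetical model wrapper: (question, shifted options) -> predicted label.
Predictor = Callable[[str, Sequence[str]], str]


def rotate(options: Sequence[str], k: int) -> list:
    """Cyclically shift the options left by k positions."""
    k %= len(options)
    return list(options[k:]) + list(options[:k])


def shifted_gold_index(gold_index: int, k: int, n: int) -> int:
    """Position of the gold option after a left shift by k."""
    return (gold_index - k) % n


def circular_eval_question(question: str, options: Sequence[str],
                           gold_index: int, predict: Predictor) -> bool:
    """True only if the model is correct under every circular shift."""
    n = len(options)
    for k in range(n):
        shifted = rotate(options, k)
        expected = LABELS[shifted_gold_index(gold_index, k, n)]
        if predict(question, shifted) != expected:
            return False  # early exit: remaining shifts are skipped
    return True


def circular_eval_accuracy(dataset: Sequence[Tuple[str, Sequence[str], int]],
                           predict: Predictor) -> float:
    """Fraction of questions answered correctly under all shifts."""
    if not dataset:
        return 0.0
    passed = sum(circular_eval_question(q, opts, gold, predict)
                 for q, opts, gold in dataset)
    return passed / len(dataset)
```

A predictor that always answers "C" passes vanilla scoring whenever the gold answer happens to sit in position C, but it fails here because the gold answer moves to a different label under the other shifts.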
3. Mathematical Foundation and Bias Reduction
Let $N$ be the number of answer options and $M$ the number of questions. Let $c_{i,k}$ be an indicator variable: $c_{i,k} = 1$ if the model answers question $i$ correctly with the $k$-th (circularly shifted) prompt and $c_{i,k} = 0$ otherwise. For each question $i$, define the overall score:
$$C_i = \prod_{k=0}^{N-1} c_{i,k}$$
(i.e., $C_i = 1$ only if all $N$ rotations are correct).
CircularEval accuracy across all questions is then:
$$\mathrm{Acc}_{\mathrm{Circ}} = \frac{1}{M} \sum_{i=1}^{M} C_i$$
This protocol estimates $P(\text{correct under all } N \text{ shifts})$ rather than $P(\text{correct under a single ordering})$, making the metric significantly more robust to label-position effects. The probability that a purely guessing model answers all $N$ permutations correctly is $(1/N)^N$ (e.g., $1/256 \approx 0.39\%$ for $N = 4$ choices), thus sharply suppressing spurious correctness. Empirically, CircularEval accuracy often drops 4–27% compared to one-pass evaluation, exposing gaps between models that may be masked by vanilla scoring (Liu et al., 2023, Li et al., 2024).
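As a small worked example of the scoring rule, the sketch below computes the chance rate of a uniform random guesser and compares single-ordering accuracy with all-shifts accuracy on a toy indicator matrix. The function names and the toy matrix are illustrative assumptions, and the chance-rate formula assumes the guesses under different shifts are independent and uniform.

```python
from fractions import Fraction
from typing import Sequence


def vanilla_chance(n: int) -> Fraction:
    """Chance of one uniform guess being correct among n options."""
    return Fraction(1, n)


def circular_chance(n: int) -> Fraction:
    """Chance that n independent uniform guesses are all correct."""
    return Fraction(1, n) ** n


def vanilla_accuracy(indicators: Sequence[Sequence[int]]) -> float:
    """Accuracy using only the unshifted prompt (column 0)."""
    return sum(row[0] for row in indicators) / len(indicators)


def circular_accuracy(indicators: Sequence[Sequence[int]]) -> float:
    """Accuracy where a question counts only if every shift is correct."""
    return sum(int(all(row)) for row in indicators) / len(indicators)


# Toy data (made up for illustration): rows are questions, columns are shifts.
toy = [
    [1, 1, 1, 1],  # consistent and correct
    [1, 0, 1, 1],  # correct on the original ordering, fails one shift
    [0, 0, 0, 0],  # always wrong
    [1, 1, 1, 1],
]

print(circular_chance(4))      # 1/256
print(vanilla_accuracy(toy))   # 0.75
print(circular_accuracy(toy))  # 0.5
```

The second toy question is the interesting case: it counts as correct under single-pass scoring but not under the all-shifts rule, which is exactly the kind of inconsistency the protocol is meant to surface.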
4. Free-Form to Predefined Choice Mapping
Real-world models, especially VLMs, often output answers as free-form text or full sentences instead of canonical option indices (“A”, “B”, ...). CircularEval incorporates a robust two-stage label extraction:
- Stage 1 (Heuristic Matching): If the output contains exactly one option label (“A”–“D”) or an unambiguous substring match to one of the choices, assign that label.
- Stage 2 (LLM-Based Mapping): Otherwise, a strong LLM (e.g., GPT-4-0125) is prompted with the question, set of options, and raw model output to select the best match. The LLM outputs {A, B, C, D, Z}, with “Z” indicating no match.
If “Z” is returned, all shifts for that question are marked as incorrect. In MMBench, GPT-4-0125 achieved 91.5% agreement with human judges on ∼420 challenging cases, and LLM-based extraction led to up to +23% improvement in zero-shot model accuracy on some VLMs (Liu et al., 2023).
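The two-stage extraction might be organized as follows. This is a rough sketch under stated assumptions: the regular expression and substring rule are one plausible reading of "heuristic matching", not the benchmarks' exact rules, and `llm_judge` is a hypothetical callable standing in for the LLM call (no real API is invoked here).

```python
import re
from typing import Callable, Optional, Sequence

LABELS = "ABCD"  # assumed label set; "Z" means no match

# Hypothetical LLM wrapper: (question, options, raw output) -> label or "Z".
Judge = Callable[[str, Sequence[str], str], str]


def heuristic_match(output: str, options: Sequence[str]) -> Optional[str]:
    """Stage 1: return a label only when the match is unambiguous."""
    valid = LABELS[:len(options)]
    # Standalone capital letters such as "B", "(B)" or "B.".
    # Caveat: a sentence-initial article "A" would also match.
    found = set(re.findall(rf"\b([{valid}])\b", output))
    if len(found) == 1:
        return found.pop()
    # Otherwise look for the text of exactly one option in the output.
    lowered = output.lower()
    hits = [valid[i] for i, opt in enumerate(options)
            if opt and opt.lower() in lowered]
    return hits[0] if len(hits) == 1 else None


def extract_label(question: str, options: Sequence[str], output: str,
                  llm_judge: Judge) -> str:
    """Stage 1 first, then fall back to the LLM judge (Stage 2)."""
    label = heuristic_match(output, options)
    if label is not None:
        return label
    verdict = llm_judge(question, options, output).strip().upper()
    # Anything outside the valid label set is treated as "no match".
    return verdict if verdict in LABELS[:len(options)] else "Z"
```

Keeping the judge as an injected callable makes the cheap heuristic path testable without any network access and lets the extractor model be swapped without touching the scoring loop.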
5. Practical Implementation Considerations
Several practical aspects are integral to correct and efficient CircularEval deployment:
- Choice extractor LLM: GPT-4-0125 is default, but alternatives (GPT-3.5-turbo, InternLM2-7B) produce similar rankings.
- Early exit: As soon as a model fails one shift, subsequent evaluations for that question terminate.
- Quality control: Datasets undergo automated and manual filtering to remove “text-only” solvable items and noisy QA pairs.
- Shift granularity: The number of shifts $N$ equals the number of answer choices (typically 2–4).
- Unit of evaluation: Each question is evaluated independently; there is no batchwise contamination.
6. Empirical Observations and Results
CircularEval’s impact has been demonstrated across diverse domains. In MMBench, model rankings revealed wider accuracy gaps under CircularEval (e.g., LLaVA-v1.5-13B vs. 7B widened from 2.1% to 4.7%; GPT-4v dev-set dropped from 81.5% to 74.3%). In FoundaBench, substantial absolute accuracy drops were recorded:
| Model | Raw (approx.) | CircularEval Common Sense | CircularEval K-12 |
|---|---|---|---|
| InternLM-123B | ~75% | 65.89% | 50.58% |
| InternLM-70B | — | 67.11% | 45.35% |
| GPT-4 | — | 55.74% | 47.63% |
| GPT-3.5-turbo | — | 39.70% | 18.64% |
| Qwen-14B | — | 52.79% | 40.31% |
| Baichuan2-13B | — | 47.72% | 23.35% |
| ChatGLM3-6B | — | 26.80% | 11.88% |
Observations reveal that models with lower raw accuracy experience larger absolute drops under CircularEval, implying greater reliance on bias or guessing. Reasoning questions show larger performance gaps than factual recall, and a statistically significant negative correlation is reported between raw accuracy and CircularEval drop across models in FoundaBench (Li et al., 2024, Liu et al., 2023).
7. Limitations and Potential Extensions
Key limitations of CircularEval include increased computational overhead (each question is evaluated up to $N$ times, once per shift), restricted bias mitigation (it covers position and label bias but not other surface-pattern or prompt-form biases), and inapplicability to open-ended generation tasks. The protocol is currently tailored to closed-form multiple-choice QA. Potential extensions discussed in the literature include:
- Sampling random permutations beyond cyclic shifts.
- Integrating model confidence or log-probabilities.
- Cross-lingual evaluation to reveal language-specific biases.
- Coupling option rotation with adversarial stem paraphrasing for advanced robustness (Li et al., 2024).
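To make the first extension concrete, the sketch below contrasts the $N$ cyclic shifts with a random sample drawn from all $N!$ orderings. It is only an illustration of the idea, not a method specified in the cited papers; the function names, the seeding scheme, and the choice to always include the original ordering are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def cyclic_shifts(n: int) -> List[Tuple[int, ...]]:
    """The n orderings used by CircularEval, as index tuples."""
    return [tuple((i + k) % n for i in range(n)) for k in range(n)]


def sampled_permutations(n: int, num_samples: int,
                         seed: int = 0) -> List[Tuple[int, ...]]:
    """Distinct random orderings, always including the original one."""
    rng = random.Random(seed)
    target = max(1, min(num_samples, math.factorial(n)))
    perms = {tuple(range(n))}
    while len(perms) < target:
        p = list(range(n))
        rng.shuffle(p)
        perms.add(tuple(p))
    return sorted(perms)


def apply_order(options: Sequence[str], gold_index: int,
                order: Sequence[int]) -> Tuple[List[str], int]:
    """Reorder options; order[i] is the original index shown at slot i."""
    new_options = [options[j] for j in order]
    new_gold = list(order).index(gold_index)
    return new_options, new_gold
```

Either list of orderings can drive the same all-orderings-correct rule; sampling trades the fixed cost of $N$ model calls per question for a tunable number of calls and coverage of non-cyclic arrangements.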
CircularEval provides a reproducible, cost-effective, and highly discriminative protocol that sets a high standard for the objective assessment of multimodal and large language models (Liu et al., 2023, Li et al., 2024).