
RoParQ: Paraphrase Robustness Benchmark

Updated 3 July 2026
  • RoParQ is a benchmark framework that tests whether large language models maintain consistent answers across semantically equivalent paraphrased questions.
  • The framework employs proprietary paraphrasing models and answer-order randomization to isolate failures due to reliance on surface forms.
  • It introduces the XParaCon metric and a reasoning-based supervised fine-tuning procedure to improve and evaluate paraphrase robustness.

RoParQ is a benchmark and alignment framework for studying whether LLMs preserve their answers when semantically equivalent multiple-choice questions are paraphrased. It targets a specific failure mode in closed-book QA: a model may answer one wording correctly while failing on another wording with the same meaning, indicating sensitivity to surface form rather than semantic invariance. The framework introduced in "RoParQ: Paraphrase-Aware Alignment of LLMs Towards Robustness to Paraphrased Questions" couples a selectively constructed benchmark with a reasoning-based, paraphrase-aware supervised fine-tuning procedure, and evaluates robustness with a dedicated metric, XParaCon (Choi, 26 Nov 2025).

1. Conceptual motivation and task definition

RoParQ is motivated by the observation that strong benchmark accuracy does not imply robust understanding. In the setting considered, the model receives a closed-book multiple-choice question with candidate set $\mathbb{C} = \{c_1, c_2, \dots, c_k\}$ and must select the correct answer without retrieval or passage support. The central concern is not ordinary accuracy alone, but whether the answer remains stable under paraphrastic variation of the question text (Choi, 26 Nov 2025).

The benchmark frames paraphrase robustness as a test of semantic invariance. If two questions are semantically identical, a robust model should preserve its prediction across both forms. The paper argues that inconsistency under paraphrase suggests reliance on superficial lexical or syntactic cues rather than abstract semantics. In that sense, RoParQ is not merely a paraphrased QA set; it is a stress test for whether closed-book performance reflects meaning-sensitive reasoning rather than memorized associations between familiar phrasings and answer choices.

RoParQ also isolates this issue from external retrieval quality by using a closed-book setup. That design makes the benchmark a probe of parametric knowledge and internal reasoning, rather than document matching or retrieval augmentation. The task domain is divided into two higher-level subsets: General Knowledge, formed from MMLU, ARC, and CommonsenseQA, and Math Reasoning, formed from MathQA.

2. Benchmark construction and selective data curation

RoParQ is built from the Unified MCQA collection, specifically its 4-choice and 5-choice subsets, using four source datasets: MMLU, ARC, CommonsenseQA, and MathQA. Before paraphrasing, the source data are aggressively filtered to preserve a clean closed-book MCQA setting. Retained questions must satisfy all of the following: no accompanying passage, no underbars or blanks, an ending question mark, at most three sentences, and at least ten words (Choi, 26 Nov 2025).
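The five retention criteria above can be sketched as a single predicate. This is a minimal illustration, not the paper's preprocessing code: the function name `keep_question`, the regex-based sentence splitter, and whitespace-based word counting are all assumptions, since the source does not say how sentences or words are counted.

```python
import re
from typing import Optional


def keep_question(question: str, passage: Optional[str] = None) -> bool:
    """Return True if a question passes the five stated filters (hypothetical helper)."""
    text = question.strip()
    if passage:                     # must have no accompanying passage
        return False
    if "_" in text:                 # no underbars or blanks
        return False
    if not text.endswith("?"):      # must end with a question mark
        return False
    # Assumption: a naive split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if len(sentences) > 3:          # at most three sentences
        return False
    if len(text.split()) < 10:      # at least ten whitespace-separated words
        return False
    return True
```

A real pipeline would likely need a more careful sentence splitter, because abbreviations and decimal numbers can confuse the regex used here.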

Each retained original question is paraphrased by two proprietary models, Gemini 2.5 Flash Lite and Claude 3.5 Sonnet, producing three variants per example:

$$q_1 = q_{\text{original}}, \quad q_2 = q_{\text{gemini}}, \quad q_3 = q_{\text{claude}}.$$

Only the question is paraphrased; the choices remain unchanged. The QA triplet is given to the paraphraser to preserve decision semantics.

A further design element is answer-order randomization. Each example is evaluated under 8 randomly shuffled permutations of the answer choices, $\mathbb{C}_1, \dots, \mathbb{C}_8$, yielding $3 \times 8 = 24$ judge responses per example. This is used to disentangle paraphrase sensitivity from answer-position bias.
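Answer-order randomization can be illustrated with a short sketch. The helper name `shuffled_choice_sets`, the fixed seed, and the use of independent shuffles (which may occasionally repeat an ordering) are assumptions; the source states only that eight randomly shuffled permutations are used per question variant.

```python
import random


def shuffled_choice_sets(choices, answer, n_perms=8, seed=0):
    """Return n_perms (shuffled_choices, answer_index) pairs for one question."""
    rng = random.Random(seed)       # seeded for reproducibility (assumption)
    out = []
    for _ in range(n_perms):
        perm = list(choices)
        rng.shuffle(perm)
        # Track where the gold answer landed after shuffling.
        out.append((perm, perm.index(answer)))
    return out


choice_sets = shuffled_choice_sets(["3", "4", "5", "6"], answer="4")
n_variants = 3                      # original, Gemini paraphrase, Claude paraphrase
n_responses = n_variants * len(choice_sets)   # 3 x 8 = 24 judge responses
```

Reusing the same eight orderings for all three variants would keep the comparison across paraphrases aligned; whether the paper does this is not stated.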

The benchmark is then selectively constructed with a judge model, Llama-3.1-8B-Instruct. A variant is treated as perfectly correct if the judge gives the ground-truth answer under all eight choice permutations. RoParQ keeps only examples that exhibit inconsistent confidence across the three paraphrases, meaning exactly one or two variants are perfectly correct. Examples where all three variants are perfectly correct or none are perfectly correct are excluded. This concentrates the benchmark on cases that actually expose cross-paraphrase instability.
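The selection rule can be written as a small predicate. This sketch assumes the judge's outputs have already been scored into one list of booleans per variant (one entry per permutation); the function and key names are hypothetical.

```python
def is_perfectly_correct(per_permutation_correct):
    """A variant is perfectly correct if the judge is right under every permutation."""
    return len(per_permutation_correct) > 0 and all(per_permutation_correct)


def keep_example(results_by_variant):
    """Keep an example only if exactly one or two of its variants are perfectly correct."""
    n_perfect = sum(is_perfectly_correct(r) for r in results_by_variant.values())
    return n_perfect in (1, 2)


example = {
    "original": [True] * 8,             # perfectly correct
    "gemini": [True] * 7 + [False],     # one miss, so not perfectly correct
    "claude": [True] * 8,               # perfectly correct
}
selected = keep_example(example)        # two of three variants are perfect, so kept
```

Under this rule a single wrong answer in any of the eight orderings is enough to mark a variant as not perfectly correct, which is what makes the criterion strict.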

| Source dataset | After preprocessing | After judge-based selection |
|---|---|---|
| MMLU | 3,899 | 707 |
| ARC | 2,174 | 344 |
| CommonsenseQA | 6,995 | 2,083 |
| MathQA | 18,542 | 7,140 |

The final benchmark is split 70/15/15. The General Knowledge subset contains train/validation/test counts of 2,194 / 470 / 470. The Math Reasoning subset contains 4,998 / 1,071 / 1,071. This yields a total benchmark size of 10,274, with 3,134 General Knowledge examples and 7,140 Math Reasoning examples.
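The reported counts can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the figures quoted in this section; the variable names are illustrative.

```python
# Per-dataset counts after judge-based selection, as reported above.
selected_counts = {"MMLU": 707, "ARC": 344, "CommonsenseQA": 2083, "MathQA": 7140}

# Reported (train, validation, test) sizes per subset.
splits = {
    "General Knowledge": (2194, 470, 470),
    "Math Reasoning": (4998, 1071, 1071),
}

general = selected_counts["MMLU"] + selected_counts["ARC"] + selected_counts["CommonsenseQA"]
math_total = selected_counts["MathQA"]
total = general + math_total

# Fraction of each subset assigned to train / validation / test.
fractions = {
    name: tuple(part / sum(parts) for part in parts)
    for name, parts in splits.items()
}
```

The per-dataset counts, the subset totals, and the split sizes agree with each other, and the resulting fractions match the stated 70/15/15 split to within rounding.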

3. Evaluation protocol and the XParaCon metric

RoParQ evaluates models with conventional accuracy and with a dedicated robustness metric, XParaCon. For each example $i$, the model is evaluated on the three paraphrase variants, and the per-example cross-paraphrase variability is measured as

$$STD_i = \operatorname{StdDev}\big(acc(q_{i,0}),\, acc(q_{i,1}),\, acc(q_{i,2})\big).$$

The dataset-level metric is then defined as

$$\mathrm{XParaCon} = -\log_2\left(\frac{1}{n}\sum_{i=1}^{n} STD_i\right),$$

where $n$ is the number of examples (Choi, 26 Nov 2025).

The metric is constructed so that higher is better. If a model behaves similarly across paraphrases, the standard deviation of the three per-variant accuracies is small, and XParaCon increases. The paper highlights an interpretable consequence of the $\log_2$ scaling: every increase of 1 point in XParaCon corresponds to halving the average cross-paraphrase standard deviation.
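A minimal sketch of the metric follows. Two choices here are assumptions rather than details confirmed by the source: the population standard deviation (`statistics.pstdev`) is used instead of the sample version, and each per-variant accuracy is taken to be a number in [0, 1], for instance the fraction of answer-order permutations answered correctly. The function name `xparacon` is illustrative.

```python
import math
import statistics


def xparacon(per_variant_acc):
    """per_variant_acc: one (acc_0, acc_1, acc_2) tuple per example."""
    # Per-example standard deviation across the three paraphrase variants.
    stds = [statistics.pstdev(accs) for accs in per_variant_acc]
    mean_std = sum(stds) / len(stds)
    if mean_std == 0:
        return math.inf             # perfectly consistent; the log is unbounded
    return -math.log2(mean_std)


data = [(1.0, 0.5, 0.0), (1.0, 1.0, 0.0)]
halved = [tuple(a * 0.5 for a in accs) for accs in data]
gain = xparacon(halved) - xparacon(data)   # halving every STD_i adds one point
```

The `gain` value illustrates the halving property: scaling all accuracies by one half scales every per-example standard deviation by one half, which raises the score by exactly one.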

RoParQ’s notion of confidence is operational rather than probabilistic. The paper repeatedly refers to inconsistent confidence, but it does not define a separate logit-, entropy-, or calibration-based confidence score. Instead, confidence is approximated behaviorally through the “perfectly correct under all 8 permutations” criterion. This is narrower than full probabilistic calibration, but well suited to the benchmark’s design objective of isolating wording sensitivity while controlling for answer-order effects.

The judge statistics reported for paraphrase quality show comparable, though not identical, difficulty across the three question forms. In General Knowledge, the judge’s overall accuracy is 0.785 on originals, 0.770 on Gemini paraphrases, and 0.792 on Claude paraphrases; the ratios of perfectly correct examples are 0.646, 0.636, and 0.664, respectively. In Math Reasoning, the corresponding accuracies are 0.597, 0.538, and 0.582, with perfectly correct ratios 0.380, 0.317, and 0.369. This suggests broad comparability but also indicates that paraphrase source can shift difficulty, especially in math.

4. Paraphrase-aware alignment by supervised fine-tuning

The training component associated with RoParQ is a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy. Its purpose is to align models toward semantic invariance rather than surface-form matching. The training prompt requires the model to restate the question, generate a meaning-preserving paraphrase, verify that the same answer applies under the paraphrased form, and then output the final answer (Choi, 26 Nov 2025).

The method differs from ordinary paraphrase augmentation. In simple augmentation, a model sees multiple phrasings with the same label, but it is not explicitly instructed to compare them. In RoParQ’s alignment procedure, semantic comparison and answer invariance are themselves part of the supervised behavior. For the Math Reasoning subset, the prompt instructs the model to generate its reasoning first, making the procedure more overtly reasoning-oriented in that domain.
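The structure of such a prompt might look like the sketch below. The wording of every instruction, the `build_prompt` helper, and the placement of the math-specific reasoning step at the start are hypothetical; the paper's actual prompt text is not reproduced here.

```python
# Hypothetical step wording that mirrors the four behaviors described above.
STEPS = [
    "Restate the question in your own words.",
    "Write a paraphrase of the question that preserves its meaning.",
    "Check that the same answer applies to the paraphrased question.",
    "Give the final answer as a single choice label.",
]


def build_prompt(question, choices, math_reasoning=False):
    """Assemble an illustrative paraphrase-aware prompt for one MCQA item."""
    steps = list(STEPS)
    if math_reasoning:
        # Assumption: the reasoning step precedes all other steps.
        steps.insert(0, "Work through your reasoning first.")
    labeled = [f"{chr(ord('A') + i)}. {c}" for i, c in enumerate(choices)]
    numbered = [f"{n}. {s}" for n, s in enumerate(steps, start=1)]
    lines = ["Question: " + question, *labeled, "", "Follow these steps:", *numbered]
    return "\n".join(lines)


prompt = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"], math_reasoning=True)
```

In an SFT setting the target completion would then contain the restatement, the paraphrase, the verification, and the answer, so that the comparison itself is part of what is supervised.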

The paper does not introduce a dedicated consistency regularizer, contrastive loss, KL constraint across paraphrases, or RL-style objective. The alignment method remains standard SFT implemented with LoRA, so the mechanism is best understood as prompt-mediated behavioral shaping under ordinary next-token supervision rather than explicit invariance regularization. Fine-tuned models are Llama-3.1-8B-Instruct, Mistral-7B-Instruct-v0.3, and Qwen3-4B-Instruct-2507.

Implementation details are reported explicitly. The SFT setup uses seed 42, learning rate 0.0002, linear scheduler, per-device train and eval batch sizes of 1, warmup ratio 0.03, weight decay 0, gradient accumulation steps 1, max grad norm 1, and bf16. The LoRA configuration uses $r = 16$, lora_alpha = 32, lora_dropout = 0.05, bias = none, and task_type = CAUSAL_LM. The paper positions this as an alignment recipe rather than a new optimization objective.
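The reported hyperparameters can be collected into plain dictionaries. The key names below follow the field names used by Hugging Face `TrainingArguments` and PEFT `LoraConfig`; that mapping is an assumption, since the source lists the values but does not name the training library.

```python
# Reported SFT hyperparameters (key names assumed to match TrainingArguments).
sft_config = {
    "seed": 42,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "linear",
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 1,
    "warmup_ratio": 0.03,
    "weight_decay": 0.0,
    "gradient_accumulation_steps": 1,
    "max_grad_norm": 1.0,
    "bf16": True,
}

# Reported LoRA hyperparameters (key names assumed to match LoraConfig).
lora_config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

# Standard LoRA scales the low-rank update by lora_alpha / r.
lora_scale = lora_config["lora_alpha"] / lora_config["r"]

# Examples consumed per optimizer step on each device.
per_device_effective_batch = (
    sft_config["per_device_train_batch_size"]
    * sft_config["gradient_accumulation_steps"]
)
```

With these values the LoRA update is scaled by 2 and each device takes an optimizer step after every single example.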

5. Empirical performance and observed trade-offs

The benchmark results show that model scale generally improves both ordinary accuracy and paraphrase robustness, but not uniformly across domains. On the General Knowledge subset, the highest reported accuracy is 0.882 for Llama-3.1-405B, while the highest XParaCon is 3.428 for Claude 3.5 Sonnet. On the Math Reasoning subset, Claude 3.5 Sonnet is strongest on both axes, with 0.959 accuracy and 5.164 XParaCon (Choi, 26 Nov 2025).

Several smaller models are notable for asymmetries between scale and robustness. In Math Reasoning, Qwen3-4B-Instruct-2507 achieves 0.942 accuracy and 4.489 XParaCon, outperforming many much larger models on robustness. This already suggests that paraphrase consistency is not reducible to parameter count alone.

The paraphrase-aware fine-tuning results are mixed but consequential. On General Knowledge, fine-tuning improves both metrics for all three adapted models:

  • Llama-3.1-8B-Instruct: accuracy 0.781 → 0.798, XParaCon 2.186 → 2.629
  • Mistral-7B-Instruct-v0.3: accuracy 0.697 → 0.735, XParaCon 2.663 → 2.854
  • Qwen3-4B-Instruct-2507: accuracy 0.802 → 0.821, XParaCon 2.848 → 2.920

On Math Reasoning, the outcomes reveal a clearer trade-off structure:

  • Llama-3.1-8B-Instruct: accuracy 0.738 → 0.685, XParaCon 1.924 → 2.316
  • Mistral-7B-Instruct-v0.3: accuracy 0.344 → 0.405, XParaCon 2.728 → 2.617
  • Qwen3-4B-Instruct-2507: accuracy 0.942 → 0.951, XParaCon 4.489 → 4.856

These results matter for interpretation. The alignment procedure often improves robustness, but not monotonically for all models and tasks. In particular, the Llama-8B math result shows that greater semantic stability need not coincide with higher raw accuracy. Conversely, the Qwen3-4B math result shows that both can improve simultaneously. The paper therefore supports the claim that paraphrase robustness is trainable, but it does not support a universal no-trade-off conclusion.
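Because XParaCon is the negative base-2 logarithm of the mean cross-paraphrase standard deviation, a reported score can be inverted to recover that mean, and a before/after pair can be turned into a ratio. The sketch below applies this to the Math Reasoning figures quoted above; the helper and variable names are illustrative.

```python
def mean_std_from_xparacon(score):
    """Invert XParaCon = -log2(mean STD) to recover the mean STD."""
    return 2.0 ** (-score)


# (before, after) XParaCon on Math Reasoning, as reported above.
math_results = {
    "Llama-3.1-8B-Instruct": (1.924, 2.316),
    "Mistral-7B-Instruct-v0.3": (2.728, 2.617),
    "Qwen3-4B-Instruct-2507": (4.489, 4.856),
}

# Ratio below 1 means the mean cross-paraphrase STD shrank after fine-tuning.
std_ratio = {
    name: mean_std_from_xparacon(after) / mean_std_from_xparacon(before)
    for name, (before, after) in math_results.items()
}

for name, ratio in std_ratio.items():
    print(f"{name}: mean STD multiplied by {ratio:.3f}")
```

Read this way, the Llama and Qwen ratios fall below 1 (less variability), while the Mistral ratio exceeds 1, matching the drop in its Math Reasoning XParaCon.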

The most striking comparison is the fine-tuned Qwen3-4B on Math Reasoning, which reaches 4.856 XParaCon. That exceeds the reported XParaCon of Llama-3.1-405B (3.762), Qwen3-30B-A3B-Instruct-2507 (4.034), Deepseek-R1 (4.405), and Gemini 2.5 Flash Lite (4.195), trailing only Claude 3.5 Sonnet (5.164). This suggests that targeted alignment can, in some cases, make a lightweight model competitive with or superior to much larger pretrained models on paraphrase consistency.

6. Limitations, interpretation, and significance

RoParQ is explicitly limited to English, closed-book multiple-choice QA, and an SFT-based post-training regime. It does not study open-ended generation, multilingual paraphrase robustness, or alternative alignment strategies such as RLHF or DPO. The paper also acknowledges limited model-scale coverage despite including some very large systems (Choi, 26 Nov 2025).

Several methodological caveats are central to interpreting the benchmark. First, paraphrase generation depends on proprietary models, namely Gemini 2.5 Flash Lite and Claude 3.5 Sonnet. Second, benchmark selection depends on a single open-source judge, Llama-3.1-8B-Instruct. A plausible implication is that the final benchmark is shaped by the particular failure profile of that judge. Third, the paper does not report human validation of paraphrase quality; semantic fidelity is enforced procedurally through prompting constraints, answer preservation during paraphrase generation, and judge-based filtering rather than manual annotation.

The benchmark’s operational definition of confidence also has limits. “Inconsistent confidence” refers to variation in perfect correctness across paraphrase variants under eight answer-order permutations, not to calibrated probabilities. This makes the framework behaviorally precise for its intended purpose, but it should not be conflated with full uncertainty calibration.

Within those constraints, RoParQ makes three durable contributions. It provides a concentrated benchmark for semantic invariance failures in MCQA; it introduces XParaCon as a compact measure of cross-paraphrase variability; and it demonstrates that paraphrase-aware alignment can materially improve robustness. More broadly, it shifts evaluation away from single-phrasing benchmark accuracy toward invariance under semantically preserving transformations. This suggests that closed-book QA competence should be assessed not only by whether a model knows an answer, but by whether that answer survives changes in wording.
