XParaCon: Cross-Paraphrase Consistency Metric

Updated 3 July 2026
  • The paper introduces XParaCon, a metric that quantifies a model’s robustness by measuring performance variations across three semantically equivalent question formulations.
  • XParaCon is computed as the negative base-2 logarithm of the mean per-example standard deviation of accuracies across paraphrase variants, so that higher values indicate greater consistency.
  • Empirical evaluations show that paraphrase-aware fine-tuning significantly boosts XParaCon values, highlighting enhanced stability and reliability in model predictions.

XParaCon is the Cross-Paraphrase Consistency metric introduced in "RoParQ: Paraphrase-Aware Alignment of LLMs Towards Robustness to Paraphrased Questions" to quantify robustness to paraphrased questions in closed-book multiple-choice QA (Choi, 26 Nov 2025). It is designed for the setting in which the same question is presented in three semantically equivalent formulations (an original version and two paraphrases) and measures how much a model’s accuracy varies across those variants. In the formulation reported for RoParQ, a high XParaCon score means small variation across paraphrases and therefore high robustness, whereas a low score indicates large variation and therefore low robustness.

1. Motivation and evaluative role

LLMs often give different answers, or exhibit different confidence levels, when the same multiple-choice question is rephrased, even though its semantics remain identical (Choi, 26 Nov 2025). The motivation for XParaCon is the observation that overall accuracy indicates how many questions a model gets right, but does not capture whether the model is stable under superficial linguistic variation. XParaCon is therefore intended to quantify how much a model’s performance “wiggles” across different paraphrases of the same question.

Within the RoParQ framework, this role is tightly connected to the benchmark design. RoParQ is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. XParaCon serves as the associated robustness metric for this benchmark: it operationalizes semantic invariance as low variability in accuracy across paraphrase variants.

A plausible implication is that XParaCon addresses a distinct failure mode from raw correctness. It is not primarily a measure of task competence in isolation, but of stability under semantically preserving rewording. For that reason, the authors recommend including it alongside accuracy in evaluations that care about robustness to rephrasings (Choi, 26 Nov 2025).

2. Formal definition

Let $n$ be the number of data examples. For each example $i$, three semantically equivalent variants are used:

  • $q_{i,0}$ = original formulation
  • $q_{i,1}$ = paraphrase by Gemini
  • $q_{i,2}$ = paraphrase by Claude

For each variant, the model’s accuracy is measured separately, averaging over random shufflings of the candidate choices. These quantities are denoted $\mathrm{acc}(q_{i,0})$, $\mathrm{acc}(q_{i,1})$, and $\mathrm{acc}(q_{i,2})$ (Choi, 26 Nov 2025).

The per-example inconsistency measure is the standard deviation of those three accuracy values. For each example $i = 1, \dots, n$:

$$\mathrm{STD}_i = \sqrt{\frac{1}{3}\sum_{j=0}^{2} \bigl( \mathrm{acc}(q_{i,j}) - \overline{\mathrm{acc}_i} \bigr)^2}$$

where

$$\overline{\mathrm{acc}_i} = \tfrac{1}{3}\bigl(\mathrm{acc}(q_{i,0})+\mathrm{acc}(q_{i,1})+\mathrm{acc}(q_{i,2})\bigr).$$

The overall XParaCon score is then defined as

$$\mathrm{XParaCon} = -\log_2\!\left(\frac{1}{n}\sum_{i=1}^{n} \mathrm{STD}_i\right)$$

This construction has two reported properties. First, the per-example standard deviation $\mathrm{STD}_i$ captures how inconsistent the model is for a single question across paraphrases. Second, the negative base-2 logarithm produces a scale on which higher values indicate better consistency (Choi, 26 Nov 2025).

3. Computation procedure

The reported computation requires three ingredients: a set of $n$ examples with three question variants each, a model that predicts an answer from a question and its choices, and a fixed number of random shufflings of the choice list for each variant. In the paper, the repeated shufflings are used to remove order bias (Choi, 26 Nov 2025).

Operationally, the procedure is as follows. For each example, the model is evaluated on each of the three variants under every one of the random permutations of the answer choices. The fraction of correct predictions over those shufflings is the accuracy for that variant. The standard deviation across the three resulting accuracies is then computed as $\mathrm{STD}_i$. After repeating this for all examples, the mean of the per-example standard deviations is taken, and XParaCon is obtained by applying $-\log_2$ to that mean.
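The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Example` layout, the `predict` callable (which returns the text of the chosen answer), the function names, and the default `num_shuffles=5` are all hypothetical placeholders chosen for this sketch. Returning infinity when the mean standard deviation is zero is likewise a choice made here.

```python
import math
import random
from statistics import pstdev
from typing import Callable, Sequence

# Hypothetical layout: (three question variants, answer choices, correct answer text).
Example = tuple[Sequence[str], Sequence[str], str]
# Hypothetical model interface: (question, choices) -> text of the chosen answer.
Predict = Callable[[str, Sequence[str]], str]


def variant_accuracy(predict: Predict, question: str, choices: Sequence[str],
                     answer: str, num_shuffles: int, rng: random.Random) -> float:
    """Fraction of correct predictions over random orderings of the choices."""
    correct = 0
    for _ in range(num_shuffles):
        shuffled = list(choices)
        rng.shuffle(shuffled)  # a fresh permutation of the choices for each trial
        if predict(question, shuffled) == answer:
            correct += 1
    return correct / num_shuffles


def xparacon(examples: Sequence[Example], predict: Predict,
             num_shuffles: int = 5, seed: int = 0) -> float:
    """Negative base-2 log of the mean per-example std of variant accuracies."""
    rng = random.Random(seed)
    stds = []
    for variants, choices, answer in examples:
        accs = [variant_accuracy(predict, q, choices, answer, num_shuffles, rng)
                for q in variants]
        stds.append(pstdev(accs))  # population std (divides by 3), as in the definition
    mean_std = sum(stds) / len(stds)
    # log2(0) is undefined; treat perfect consistency as +infinity (assumption).
    return math.inf if mean_std == 0 else -math.log2(mean_std)
```

Because `predict` returns answer text rather than a position, shuffling the choices needs no index bookkeeping; a model wrapper that returns a letter or index would have to map it back to the shuffled list first.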

The paper also gives pseudocode for this procedure, which follows the steps described above.

This procedure shows that XParaCon is computed over accuracies rather than over raw answer strings. That design is central to the metric’s coupling of correctness and consistency (Choi, 26 Nov 2025).

4. Interpretation and comparison with adjacent metrics

The paper contrasts XParaCon with two common alternatives. Overall accuracy measures correctness on each individual query while collapsing over all paraphrases, thereby ignoring stability. Pairwise agreement rates measure how often two phrasings lead to the same answer, but consider only agreement and not whether those answers are correct (Choi, 26 Nov 2025). XParaCon differs by using the standard deviation of correctness rates across all variants.

The reported interpretation is explicit. A model that answers each variant correctly $100\%$ of the time has $\mathrm{STD}_i = 0$ for every $i$, so the mean standard deviation is $0$ and XParaCon tends to $+\infty$; in practice, this appears as a large finite value capped by numeric precision. Conversely, a model whose performance oscillates widely across variants has a large mean standard deviation and therefore a small XParaCon score. With accuracies in $[0, 1]$ and the population standard deviation defined above, each $\mathrm{STD}_i$ is at most $\sqrt{2}/3 \approx 0.471$, so the score is bounded below by roughly $1.085$ and cannot be negative.
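The two extremes of the scale can be checked numerically. The sketch below assumes accuracies expressed as fractions in $[0, 1]$ and the population standard deviation from the formal definition; the helper name `score` and the input lists are hypothetical, and mapping a zero mean standard deviation to infinity is a choice made for this illustration.

```python
import math
from statistics import pstdev


def score(per_example_accs):
    """XParaCon from precomputed (acc_0, acc_1, acc_2) triples, one per example."""
    mean_std = sum(pstdev(a) for a in per_example_accs) / len(per_example_accs)
    return math.inf if mean_std == 0 else -math.log2(mean_std)


perfect = [(1.0, 1.0, 1.0)] * 4  # identical accuracy on every variant
worst = [(1.0, 0.0, 0.0)] * 4    # maximal spread for three values in [0, 1]

print(score(perfect))           # inf
print(round(score(worst), 3))   # 1.085
```

Note that any triple of identical accuracies, including a model that is consistently wrong, also yields a zero standard deviation, which is one reason the score is read alongside accuracy.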

The toy example provided in the paper illustrates the scale on a small set of questions: it lists each question’s accuracies across the three variants, computes the per-question standard deviations, averages them, and applies $-\log_2$ to the mean (Choi, 26 Nov 2025). The paper interprets the result as a modest consistency score on a tiny set.
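To make the arithmetic concrete, here is a worked example with hypothetical numbers that are not taken from the paper. Assume three questions whose per-variant accuracies are $(1, 1, 1)$, $(1, 0.5, 0.5)$, and $(1, 0, 1)$; the per-question means are then $1$, $2/3$, and $2/3$.

```latex
\begin{align*}
\mathrm{STD}_1 &= 0 \\
\mathrm{STD}_2 &= \sqrt{\tfrac{1}{3}\Bigl[\bigl(\tfrac{1}{3}\bigr)^2 + \bigl(\tfrac{1}{6}\bigr)^2 + \bigl(\tfrac{1}{6}\bigr)^2\Bigr]}
                = \sqrt{\tfrac{1}{18}} \approx 0.236 \\
\mathrm{STD}_3 &= \sqrt{\tfrac{1}{3}\Bigl[\bigl(\tfrac{1}{3}\bigr)^2 + \bigl(\tfrac{2}{3}\bigr)^2 + \bigl(\tfrac{1}{3}\bigr)^2\Bigr]}
                = \sqrt{\tfrac{2}{9}} \approx 0.471 \\
\tfrac{1}{3}\sum_{i=1}^{3} \mathrm{STD}_i &\approx \tfrac{1}{3}(0 + 0.236 + 0.471) \approx 0.236 \\
\mathrm{XParaCon} &= -\log_2(0.236) \approx 2.08
\end{align*}
```

Even though one of the three hypothetical questions is perfectly stable, the two unstable ones pull the score down to roughly the range reported for the weaker models in the tables below.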

A plausible implication is that XParaCon is most informative when paired with accuracy rather than substituted for it. That reading aligns with the authors’ explicit recommendation to report both measures.

5. Empirical values reported for RoParQ

The paper reports aggregate XParaCon values on two RoParQ subsets: general knowledge and math reasoning (Choi, 26 Nov 2025). The reported figures are as follows.

General knowledge subset

| Model | Accuracy | XParaCon |
|---|---|---|
| Llama-3.1-8B-Instruct | 0.781 | 2.186 |
| Llama-3.1-70B-Instruct | 0.869 | 3.219 |
| Claude 3.5 Sonnet (proprietary) | 0.876 | 3.428 |

Math reasoning subset

| Model | Accuracy | XParaCon |
|---|---|---|
| Qwen-3-4B-Instruct-2507 | 0.942 | 4.489 |
| Claude 3.5 Sonnet | 0.959 | 5.164 |

These values show that the benchmark distinguishes not only between models with different accuracies, but also between models with different cross-paraphrase stability. The authors also state that fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models (Choi, 26 Nov 2025). This suggests that XParaCon is sensitive to alignment interventions and not merely to parameter scale.

6. Relationship to paraphrase-aware fine-tuning and reported outlook

RoParQ is accompanied by a reasoning-based, paraphrase-aware supervised fine-tuning (SFT) strategy designed to align models toward semantic invariance (Choi, 26 Nov 2025). The paper reports that targeted alignment significantly enhances robustness, with XParaCon increasing after fine-tuning in every listed case.

The reported before/after values are:

| Model | Subset | Accuracy before FT | XParaCon before FT | Accuracy after FT | XParaCon after FT |
|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | General Knowledge | 0.781 | 2.186 | 0.798 | 2.629 |
| Mistral-7B-Instruct | General Knowledge | 0.697 | 2.663 | 0.735 | 2.854 |
| Qwen-3-4B-Instruct | Math Reasoning | 0.942 | 4.489 | 0.951 | 4.856 |

The paper characterizes these changes as a substantial increase in XParaCon and states that such improvements often close the consistency gap between small models and much larger baselines (Choi, 26 Nov 2025). In that interpretation, XParaCon is not only an evaluation metric but also a target for alignment-oriented training.

The authors’ recommendations and future directions are also explicit. They state that XParaCon provides a direct, quantitative gauge of semantic invariance in multiple-choice QA and recommend including it alongside accuracy whenever robustness to rephrasings matters. They further suggest that future work could extend XParaCon beyond multiple-choice QA to open-ended settings and other languages, or integrate it into reinforcement learning objectives such as RLHF to drive paraphrase-stable behavior at inference time. This suggests a broader research program in which paraphrase consistency becomes a first-class optimization target rather than only a post hoc diagnostic.
