XParaCon: Cross-Paraphrase Consistency Metric
- The paper introduces XParaCon, a metric that quantifies a model’s robustness by measuring performance variations across three semantically equivalent question formulations.
- XParaCon is computed as the negative base-2 logarithm of the mean per-question standard deviation of accuracy across the paraphrase variants, so higher values indicate greater consistency.
- Empirical evaluations show that paraphrase-aware fine-tuning significantly boosts XParaCon values, highlighting enhanced stability and reliability in model predictions.
XParaCon is the Cross-Paraphrase Consistency metric introduced in "RoParQ: Paraphrase-Aware Alignment of LLMs Towards Robustness to Paraphrased Questions" to quantify robustness to paraphrased questions in closed-book multiple-choice QA (Choi, 26 Nov 2025). It is designed for the setting in which the same question is presented in three semantically equivalent formulations (an original version and two paraphrases) and measures how much a model’s accuracy varies across those variants. In the formulation reported for RoParQ, a high XParaCon score means small variation across paraphrases and therefore high robustness, whereas a low score indicates large variation and therefore low robustness.
1. Motivation and evaluative role
LLMs often give different answers, or exhibit different confidence levels, when the same multiple-choice question is rephrased, even though its semantics remain identical (Choi, 26 Nov 2025). The motivation for XParaCon is the observation that overall accuracy reports how many questions a model answers correctly but does not capture whether the model is stable under superficial linguistic variation. XParaCon is therefore intended to quantify how much a model’s performance “wiggles” across different paraphrases of the same question.
Within the RoParQ framework, this role is tightly connected to the benchmark design. RoParQ is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. XParaCon serves as the associated robustness metric for this benchmark: it operationalizes semantic invariance as low variability in accuracy across paraphrase variants.
A plausible implication is that XParaCon addresses a distinct failure mode from raw correctness. It is not primarily a measure of task competence in isolation, but of stability under semantically preserving rewording. For that reason, the authors recommend including it alongside accuracy in evaluations that care about robustness to rephrasings (Choi, 26 Nov 2025).
2. Formal definition
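Let $N$ be the number of data examples. For each example $i$, three semantically equivalent variants are used:
- $q_i^{(0)}$ = original formulation
- $q_i^{(1)}$ = paraphrase by Gemini
- $q_i^{(2)}$ = paraphrase by Claude
For each variant, the model’s accuracy is measured separately, averaging over random shufflings of the candidate choices. These quantities are denoted $a_i^{(0)}$, $a_i^{(1)}$, and $a_i^{(2)}$ (Choi, 26 Nov 2025).
The per-example inconsistency measure is the standard deviation of those three accuracy values:
$$\sigma_i = \operatorname{std}\!\left(a_i^{(0)}, a_i^{(1)}, a_i^{(2)}\right),$$
where the deviations are measured from the per-example mean accuracy
$$\bar{a}_i = \frac{1}{3}\sum_{k=0}^{2} a_i^{(k)}.$$
The overall XParaCon score is then defined as
$$\mathrm{XParaCon} = -\log_2\!\left(\frac{1}{N}\sum_{i=1}^{N}\sigma_i\right).$$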
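The definition can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `xparacon` is hypothetical, the input accuracies are invented, and the population form of the standard deviation (dividing by 3) is an assumption, since the text here does not state which normalization the paper uses.

```python
import math
from statistics import pstdev


def xparacon(accuracy_triples):
    """Return -log2 of the mean per-question standard deviation.

    accuracy_triples: iterable of (a0, a1, a2), the accuracies on the
    original question and its two paraphrases, each in [0, 1].
    """
    triples = list(accuracy_triples)
    if not triples:
        raise ValueError("at least one question is required")
    # Assumption: population standard deviation (divide by 3, not 2).
    sigmas = [pstdev(t) for t in triples]
    mean_sigma = sum(sigmas) / len(sigmas)
    # A perfectly consistent model has mean_sigma == 0; the score is unbounded.
    return math.inf if mean_sigma == 0 else -math.log2(mean_sigma)


# Hypothetical accuracies for two questions (not taken from the paper).
score = xparacon([(1.0, 1.0, 1.0), (1.0, 0.5, 0.0)])
```

Returning `math.inf` for a zero mean is one possible convention for the degenerate case; an implementation could equally clamp the mean to a small positive constant.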
This construction has two reported properties. First, the per-example standard deviation $\sigma_i$ captures how inconsistent the model is for a single question across paraphrases. Second, the negative base-2 logarithm produces a scale on which higher values indicate better consistency (Choi, 26 Nov 2025).
3. Computation procedure
The reported computation requires three ingredients: a set of $N$ examples with three question variants each, a model $M$ that predicts an answer from a question and its choices, and $K$ random shufflings of the choice list for each variant. The paper uses a fixed number of shufflings $K$ to remove order bias (Choi, 26 Nov 2025).
Operationally, the procedure is as follows. For each example, the model is evaluated on each of the three variants under $K$ different random permutations of the answer choices. The fraction of correct predictions over those shufflings is the accuracy for that variant. The standard deviation across the three resulting accuracies is then computed as $\sigma_i$. After repeating this for all examples, the mean of the per-example standard deviations is taken, and XParaCon is obtained by applying $-\log_2$ to that mean.
The paper also gives pseudocode for this procedure.
This procedure shows that XParaCon is computed over accuracies rather than over raw answer strings. That design is central to the metric’s coupling of correctness and consistency (Choi, 26 Nov 2025).
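The procedure described in this section can be sketched end to end as follows. This is an illustrative reconstruction and not the paper's pseudocode: the `model(question, choices)` callable, the example dictionary layout, `toy_model`, the value `k=4`, and the use of the population standard deviation are all assumptions made for the sketch.

```python
import math
import random
from statistics import pstdev


def variant_accuracy(model, question, choices, answer, k, rng):
    """Fraction of k random choice orderings on which the model is correct."""
    correct = 0
    for _ in range(k):
        shuffled = list(choices)
        rng.shuffle(shuffled)  # new ordering of the answer choices
        correct += model(question, shuffled) == answer
    return correct / k


def xparacon(model, examples, k, seed=0):
    """-log2 of the mean per-example std of accuracy across the variants."""
    rng = random.Random(seed)
    sigmas = []
    for ex in examples:
        accs = [
            variant_accuracy(model, q, ex["choices"], ex["answer"], k, rng)
            for q in ex["variants"]  # original plus two paraphrases
        ]
        sigmas.append(pstdev(accs))  # assumption: population std
    mean_sigma = sum(sigmas) / len(sigmas)
    return math.inf if mean_sigma == 0 else -math.log2(mean_sigma)


def toy_model(question, choices):
    """Hypothetical stand-in: right only when the word 'capital' appears."""
    return "Paris" if "capital" in question.lower() else "Lyon"


examples = [{
    "variants": [
        "What is the capital of France?",
        "Which city is France's capital?",
        "France is governed from which city?",
    ],
    "choices": ["Paris", "Lyon", "Nice", "Lille"],
    "answer": "Paris",
}]
score = xparacon(toy_model, examples, k=4)
```

Here the toy model is correct on two variants and wrong on the third, so the per-variant accuracies are 1, 1, and 0 regardless of choice order. The score is computed from those accuracies alone, which is the sense in which the metric works on correctness rates rather than raw answer strings.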
4. Interpretation and comparison with adjacent metrics
The paper contrasts XParaCon with two common alternatives. Overall accuracy measures correctness on each individual query while collapsing over all paraphrases, thereby ignoring stability. Pairwise agreement rates measure how often two phrasings lead to the same answer, but consider only agreement and not whether those answers are correct (Choi, 26 Nov 2025). XParaCon differs by using the standard deviation of correctness rates across all variants.
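The reported interpretation is explicit. A model that answers each variant correctly $100\%$ of the time has $\sigma_i = 0$ for every $i$, so the mean standard deviation is $0$ and XParaCon tends to $+\infty$; in practice, this appears as a large finite value capped by numeric precision. Conversely, a model whose performance oscillates wildly across variants has a large mean standard deviation and therefore a small XParaCon score. Because each accuracy lies in $[0, 1]$, the standard deviation of three such values is at most $\sqrt{2}/3 \approx 0.47$ under the population normalization ($1/\sqrt{3} \approx 0.58$ under the sample normalization), so the score is bounded below by roughly $1.08$ (or $0.79$) and cannot become negative.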
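The low end of the scale can be checked numerically. The sketch below is an illustration under stated assumptions, not something taken from the paper: it searches a coarse grid of accuracy triples in $[0, 1]$ for the largest possible spread and converts that worst case into a score under both common normalizations of the standard deviation.

```python
import math
from itertools import product
from statistics import pstdev, stdev

# Search a grid of accuracy triples in [0, 1] for the largest spread.
# Variance is convex in each coordinate, so the maximum sits at a corner.
grid = [i / 10 for i in range(11)]
worst = max(product(grid, repeat=3), key=pstdev)

# Score if every question showed this worst-case spread.
floor_population = -math.log2(pstdev(worst))  # divide-by-3 normalization
floor_sample = -math.log2(stdev(worst))       # divide-by-2 normalization
```

The worst triple has two variants at one extreme and one at the other, which gives a floor of about 1.08 with the population form and about 0.79 with the sample form. Scores close to these values therefore indicate near-maximal inconsistency.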
The toy example provided in the paper illustrates the scale. For 4 questions, suppose the model accuracies across the three variants are
- 5
- 6
- 7
Then 8, while 9 and 0. The mean standard deviation is therefore approximately 1, which yields 2 (Choi, 26 Nov 2025). The paper interprets this as a modest consistency score on a tiny set.
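The arithmetic behind a score of this kind can be traced with a separate, hypothetical set of accuracies. The values below are chosen purely for illustration and are not those of the paper's toy example, and the population form of the standard deviation is assumed.

```latex
\begin{aligned}
\left(a_1^{(0)}, a_1^{(1)}, a_1^{(2)}\right) &= (1,\ 1,\ 1)
  &&\Rightarrow\quad \sigma_1 = 0 \\
\left(a_2^{(0)}, a_2^{(1)}, a_2^{(2)}\right) &= (1,\ 0.5,\ 0)
  &&\Rightarrow\quad \sigma_2 = \sqrt{\tfrac{1}{6}} \approx 0.408 \\
\left(a_3^{(0)}, a_3^{(1)}, a_3^{(2)}\right) &= (1,\ 1,\ 0)
  &&\Rightarrow\quad \sigma_3 = \tfrac{\sqrt{2}}{3} \approx 0.471 \\[4pt]
\frac{1}{3}\sum_{i=1}^{3}\sigma_i &\approx 0.293,
  &&\phantom{\Rightarrow}\quad \mathrm{XParaCon} = -\log_2(0.293) \approx 1.77
\end{aligned}
```

In this illustration a single fully stable question is not enough to lift the score much, because the two unstable questions dominate the mean standard deviation.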
A plausible implication is that XParaCon is most informative when paired with accuracy rather than substituted for it. That reading aligns with the authors’ explicit recommendation to report both measures.
5. Empirical values reported for RoParQ
The paper reports aggregate XParaCon values on two RoParQ subsets: general knowledge and math reasoning (Choi, 26 Nov 2025). The reported figures are as follows.
General knowledge subset
| Model | Accuracy | XParaCon |
|---|---|---|
| Llama-3.1-8B-Instruct | 0.781 | 2.186 |
| Llama-3.1-70B-Instruct | 0.869 | 3.219 |
| Claude 3.5 Sonnet (proprietary) | 0.876 | 3.428 |
Math reasoning subset
| Model | Accuracy | XParaCon |
|---|---|---|
| Qwen-3-4B-Instruct-2507 | 0.942 | 4.489 |
| Claude 3.5 Sonnet | 0.959 | 5.164 |
These values show that the benchmark distinguishes not only between models with different accuracies, but also between models with different cross-paraphrase stability. The authors also state that fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models (Choi, 26 Nov 2025). This suggests that XParaCon is sensitive to alignment interventions and not merely to parameter scale.
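Because the score is a negative base-2 logarithm, each reported value can be mapped back to the mean per-question standard deviation it implies. The snippet below is a small sketch of that inversion; the helper name `mean_sigma` and the dictionary labels are hypothetical, while the scores are the ones listed in the tables above.

```python
def mean_sigma(xparacon_score):
    """Invert XParaCon = -log2(mean sigma) to recover the mean sigma."""
    return 2.0 ** (-xparacon_score)


# XParaCon values as reported in the tables above.
reported = {
    "Llama-3.1-8B-Instruct (general)": 2.186,
    "Llama-3.1-70B-Instruct (general)": 3.219,
    "Claude 3.5 Sonnet (general)": 3.428,
    "Qwen-3-4B-Instruct-2507 (math)": 4.489,
    "Claude 3.5 Sonnet (math)": 5.164,
}
spreads = {name: round(mean_sigma(x), 3) for name, x in reported.items()}
```

Read this way, the general knowledge scores correspond to mean standard deviations of roughly 0.220, 0.107, and 0.093, and the math reasoning scores to roughly 0.045 and 0.028. An increase of one XParaCon point corresponds to halving the mean standard deviation.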
6. Relationship to paraphrase-aware fine-tuning and reported outlook
RoParQ is accompanied by a reasoning-based, paraphrase-aware Supervised Fine-Tuning strategy designed to align models toward semantic invariance (Choi, 26 Nov 2025). The paper reports that targeted alignment significantly enhances robustness, with XParaCon increasing after fine-tuning in every listed case.
The reported before/after values are:
| Model and setting | Before FT | After FT |
|---|---|---|
| Llama-3.1-8B-Instruct, General Knowledge | Acc=0.781, XParaCon=2.186 | Acc=0.798, XParaCon=2.629 |
| Mistral-7B-Instruct, General Knowledge | Acc=0.697, XParaCon=2.663 | Acc=0.735, XParaCon=2.854 |
| Qwen-3-4B-Instruct, Math Reasoning | Acc=0.942, XParaCon=4.489 | Acc=0.951, XParaCon=4.856 |
The paper characterizes these changes as a substantial increase in XParaCon and states that such improvements often close the consistency gap between small models and much larger baselines (Choi, 26 Nov 2025). In that interpretation, XParaCon is not only an evaluation metric but also a target for alignment-oriented training.
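One way to read the before/after table is to convert each gain in XParaCon into a relative change in the mean per-question standard deviation. The derivation below follows directly from the logarithmic definition; the symbol $\bar{\sigma}$ for the mean standard deviation and $X$ for the XParaCon score are notation introduced here for convenience, and the percentages are derived from the reported scores rather than stated in the paper.

```latex
\begin{aligned}
\frac{\bar{\sigma}_{\text{after}}}{\bar{\sigma}_{\text{before}}}
  &= 2^{-\left(X_{\text{after}} - X_{\text{before}}\right)} \\[4pt]
\text{Llama-3.1-8B-Instruct:}\quad & 2^{-(2.629 - 2.186)} = 2^{-0.443} \approx 0.736
  \quad (\text{about } 26\% \text{ lower}) \\
\text{Mistral-7B-Instruct:}\quad & 2^{-(2.854 - 2.663)} = 2^{-0.191} \approx 0.876
  \quad (\text{about } 12\% \text{ lower}) \\
\text{Qwen-3-4B-Instruct:}\quad & 2^{-(4.856 - 4.489)} = 2^{-0.367} \approx 0.775
  \quad (\text{about } 22\% \text{ lower})
\end{aligned}
```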
The authors’ recommendations and future directions are also explicit. They state that XParaCon provides a direct, quantitative gauge of semantic invariance in multiple-choice QA and recommend including it alongside accuracy whenever robustness to rephrasings matters. They further suggest that future work could extend XParaCon beyond multiple-choice QA to open-ended settings and other languages, or integrate it into reinforcement learning objectives such as RLHF to drive paraphrase-stable behavior at inference time. This suggests a broader research program in which paraphrase consistency becomes a first-class optimization target rather than only a post hoc diagnostic.