OrderProbe: Structural & Phase Diagnostics
- OrderProbe is a versatile framework that probes deterministic internal structure and phase order in both language models and experimental physics.
- In computational linguistics, it benchmarks LLMs on reconstructing scrambled logographic idioms, scoring structural recovery by exact match and semantic fidelity with separate metrics.
- In physics, the name covers high-order cumulant analyses that diagnose QCD phase transitions and ARPES techniques that detect hidden Fermi surface order.
OrderProbe denotes several distinct probes of internal structure and phase order across research domains. It appears in two prominent forms: (1) as a deterministic benchmark for LLM evaluation via structural reconstruction of scrambled logographic idioms in East Asian languages (He et al., 13 Jan 2026), and (2) as a set of high-order, volume-independent cumulant and clustering observables used in relativistic heavy-ion experiments to diagnose the order of QCD phase transitions (Neff, 2024). The concept is also referenced in spectroscopic approaches to probing hidden Fermi surface reconstructions in condensed matter pseudogap phases (Boyd et al., 2012). The sections below detail the methodologies, metrics, physical implications, and key empirical signatures.
1. Deterministic Structural Reconstruction: OrderProbe Benchmark for LLMs
OrderProbe, as introduced by He et al. (13 Jan 2026), is a framework for assessing how LLMs reconstruct deterministic internal structure from scrambled inputs. The task leverages four-character idioms (成语/成語/熟語/사자성어) in Chinese, Japanese, and Korean, which have a canonical, unambiguous character order. Each idiom is scrambled into all 23 non-identity permutations ($4! - 1 = 23$); models must recover the canonical sequence and generate a concise semantic explanation.
Unlike standard sentence-level restoration, which is ill-posed because paraphrase allows many acceptable word orders, OrderProbe's deterministic ground truth enables exact-match scoring. The dataset comprises 3,543 idioms across four scripts, filtered for strict canonical form by expert linguists and accompanied by controlled reference paraphrases for semantic evaluation.
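The scrambling step can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the helper name `scrambled_variants` and the example idiom are assumptions chosen for demonstration.

```python
from itertools import permutations


def scrambled_variants(idiom: str) -> list[str]:
    """Return every non-identity ordering of the idiom's characters.

    Hypothetical helper: permutes character positions, so a four-character
    idiom always yields 4! - 1 = 23 variants.
    """
    chars = tuple(idiom)
    identity = tuple(range(len(chars)))
    variants = []
    for order in permutations(identity):
        if order == identity:
            continue  # skip the canonical (unscrambled) order
        variants.append("".join(chars[i] for i in order))
    return variants


# Illustrative idiom with four distinct characters.
variants = scrambled_variants("画蛇添足")
print(len(variants))
```

Because positions rather than characters are permuted, an idiom with a repeated character would still produce 23 variants, some of them identical strings. How the dataset treats such cases is not stated in the text above.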
2. Formalization, Metrics, and Diagnostic Evaluation
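The core metric is Recovery Rate, the exact-match accuracy over scrambled inputs:
$$\text{Recovery} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right]$$
where $N$ is the number of scrambled inputs (total permutations), $\hat{y}_i$ is the model's prediction, and $y_i$ is the canonical target.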
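Exact-match recovery can be computed as the fraction of scrambled inputs for which the model's output equals the canonical idiom. The sketch below assumes predictions arrive as plain strings and that surrounding whitespace should be ignored; both the function name `recovery_rate` and that normalization are assumptions, not details taken from the benchmark.

```python
def recovery_rate(predictions: list[str], canonical: str) -> float:
    """Fraction of predictions that exactly match the canonical idiom.

    Assumption: leading/trailing whitespace is stripped before comparison;
    no other normalization is applied.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    hits = sum(pred.strip() == canonical for pred in predictions)
    return hits / len(predictions)


# Four hypothetical model outputs for scrambles of one idiom.
preds = ["画蛇添足", "蛇画添足", "画蛇添足 ", "添足画蛇"]
rate = recovery_rate(preds, "画蛇添足")
print(rate)  # 0.5
```

Exact match is only meaningful here because each idiom has a single canonical order; for free-form sentences the same check would penalize valid paraphrases.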
Beyond accuracy, OrderProbe diagnoses five structural and semantic dimensions per Table 2 (He et al., 13 Jan 2026):
| Metric Name | Technical Definition | Assessed Dimension |
|---|---|---|
| S | Weighted hybrid: cross-encoder, embedding, F | Semantic fidelity |
| S | Multilingual NLI entailment probability | Logical validity |
| S | Composite of deviation and rigidity sensitivity | Structural consistency |
| S | Sequential/structural robustness aggregation | Robustness to perturbation |
| S | ROUGE-precision penalized by brevity | Information density |
Each metric targets one aspect: semantic fidelity, hallucination stability, consistency across different inputs, degradation under perturbation, and content density. For example, S distinguishes genuine reconstruction from consistent hallucination; S identifies models sensitive to anchor displacements within idioms.
Zero-shot Recovery rates for leading models (Qwen-3-14B, GPT-4o, etc.) consistently fall below 35%. Semantic metrics often remain high despite these low exact-match rates, indicating a dissociation between semantic competence and structural planning. The negative control, Korean Hangul, exhibits near-random recovery, which points to the importance of local semantic anchors in logographic scripts.
3. Experimental Methodology: STAR QCD Phase Structure Probing
OrderProbe also refers to quantitative techniques for diagnosing the nature (crossover vs first-order) of the QCD phase transition in heavy-ion collisions (Neff, 2024). The approach utilizes high-order cumulants of net-proton multiplicity distributions and the variance of proton number in azimuthal partitions at the STAR experiment (RHIC).
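Cumulants $C_n$ up to hyper-order ($n = 5, 6$) are defined via the moment and cumulant generating functions:
$$M(t) = \sum_{N} P(N)\, e^{tN}, \qquad K(t) = \ln M(t), \qquad C_n = \left.\frac{d^n K}{dt^n}\right|_{t=0}$$
with $P(N)$ the probability for net-proton number $N$. Ratios such as $C_4/C_2$, $C_5/C_1$, and $C_6/C_2$ are volume-independent and sensitive to the correlation length $\xi$ (e.g., $C_3 \sim \xi^{4.5}$, $C_4 \sim \xi^{7}$), thereby amplifying critical or first-order clustering effects.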
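As a rough illustration of how cumulants $C_1$ through $C_6$ and their ratios are obtained from event-by-event counts, the sketch below uses the standard relations between cumulants and central moments. It is a naive plug-in estimator on toy data: the function name `cumulants` is hypothetical, and detector-efficiency and centrality bin-width corrections, which a real analysis needs, are deliberately omitted.

```python
import numpy as np


def cumulants(samples) -> dict[int, float]:
    """Plug-in estimates of cumulants C_1..C_6 from central moments."""
    x = np.asarray(samples, dtype=float)
    mean = x.mean()
    d = x - mean
    mu = {n: float(np.mean(d**n)) for n in range(2, 7)}  # central moments
    return {
        1: float(mean),
        2: mu[2],
        3: mu[3],
        4: mu[4] - 3 * mu[2] ** 2,
        5: mu[5] - 10 * mu[3] * mu[2],
        6: mu[6] - 15 * mu[4] * mu[2] - 10 * mu[3] ** 2 + 30 * mu[2] ** 3,
    }


# Toy baseline (an assumption, not STAR data): net-proton number as the
# difference of two independent Poisson counts (a Skellam distribution).
rng = np.random.default_rng(0)
net_p = rng.poisson(20.0, 200_000) - rng.poisson(2.0, 200_000)
c = cumulants(net_p)
print({"C4/C2": c[4] / c[2], "C5/C1": c[5] / c[1], "C6/C2": c[6] / c[2]})
```

For the Skellam baseline all three ratios equal 1 in the infinite-statistics limit, so departures from unity in data are what carry physical information. High-order estimates converge slowly, which is why the printed values fluctuate noticeably even with many events.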
STAR datasets (BES-I and BES-II) span $\sqrt{s_{NN}} = 3$–$200$ GeV; events are grouped by centrality, and tracks pass kinematic selection criteria (acceptance cuts, with momenta in GeV/$c$).
4. Physical Interpretation: Cumulant Ordering and Azimuthal Clustering
Cumulant ratios ordered by energy reveal distinct phase structure signatures:
- BES-I data ($7.7$–$200$ GeV): the cumulant ratios follow a monotonic ordering, in qualitative agreement with lattice QCD crossover predictions.
- BES-II ($3$ GeV): All three ratios become positive, matching UrQMD (hadronic rescattering) trends and departing from the crossover hierarchy, which may signal first-order behavior.
Azimuthal clustering is quantified by comparing the measured variance of proton counts in azimuthal ($\phi$) partitions to the binomial baseline:
$$\Delta\sigma^2 = \sigma^2_{\text{measured}} - \sigma^2_{\text{binomial}}$$
with positive $\Delta\sigma^2$ indicative of spatial clustering (attraction) and negative $\Delta\sigma^2$ of repulsion or momentum conservation. STAR data exhibit a pronounced upward drift in $\Delta\sigma^2$ at low energies that is absent in AMPT (no first-order transition), suggesting emergent clustering, a hallmark of first-order coexistence.
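The comparison against a binomial baseline can be sketched as follows. For events with a fixed number of protons $N$ and an azimuthal partition of width $w$, independent uniformly distributed tracks give a binomial count with $p = w / 2\pi$ and variance $Np(1-p)$. The code below takes the plain difference between the observed variance and that baseline; the function names and the absence of any further normalization are assumptions, since the exact definition used in the analysis is not given in the text.

```python
import numpy as np

TWO_PI = 2.0 * np.pi


def count_in_partition(phis, start: float, width: float) -> int:
    """Number of tracks with azimuth in [start, start + width), wrapping at 2*pi."""
    rel = np.mod(np.asarray(phis, dtype=float) - start, TWO_PI)
    return int(np.count_nonzero(rel < width))


def binomial_variance(n_tracks: int, width: float) -> float:
    """Variance of the partition count for independent, uniform tracks."""
    p = width / TWO_PI
    return n_tracks * p * (1.0 - p)


def delta_sigma2(counts, n_tracks: int, width: float) -> float:
    """Observed variance minus binomial baseline (unnormalized; an assumption)."""
    return float(np.var(counts)) - binomial_variance(n_tracks, width)


# Toy events with independent uniform tracks: the excess should scatter around zero.
rng = np.random.default_rng(0)
n_tracks, width = 20, np.pi / 3
counts = [
    count_in_partition(rng.uniform(0.0, TWO_PI, n_tracks), 0.0, width)
    for _ in range(5000)
]
print(delta_sigma2(counts, n_tracks, width))
```

A positive value means partition counts fluctuate more than independent tracks would, which is the signature of clustering; a negative value means they fluctuate less, as expected from repulsion or global momentum conservation.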
5. Generalization: OrderProbe Concepts in Condensed Matter Spectroscopy
In the context of cuprate superconductivity (Boyd et al., 2012), OrderProbe describes a nonequilibrium ARPES-based method to reconstruct hidden Fermi surface order resulting from phase-incoherent pairing. When a transport current is driven through an underdoped sample, paired electron pockets acquire a Doppler shift in their quasiparticle spectra, which in its standard form reads:
$$E_{\mathbf{k}} \to E_{\mathbf{k}} + \hbar\, \mathbf{k} \cdot \mathbf{v}_d$$
where $\mathbf{v}_d$ is the drift velocity. Spectral function analysis reveals partial or full restoration of gapped pockets at critical drift velocities; observing "turn-on" arcs as a function of $\mathbf{v}_d$ provides unambiguous evidence of precursor pairing and Fermi surface reconstruction.
6. Implications and Key Findings
OrderProbe methodologies across disciplines establish reproducible, unambiguous benchmarks and observables for diagnosing structural recovery or phase transitions:
- In language modeling, OrderProbe uncovers systematic gaps in structural planning, even among high-performance LLMs, highlighting the need for evaluation frameworks beyond semantic recall.
- In QCD phase mapping, hyper-order cumulants and azimuthal clustering observables help discriminate between crossover and first-order regions, constraining the location of the critical point and coexistence features.
- In nonequilibrium condensed matter physics, OrderProbe-type ARPES studies reveal hidden spectral order, directly informing the debate over pairing mechanisms in pseudogap regimes.
A plausible implication is that deterministic probes of internal order, whether structural, combinatorial, or statistical, are essential for identifying genuine robustness, detecting phase coexistence, and reconstructing hidden structure across complex physical and computational systems.