IRT Models for LLMs

Updated 17 June 2026

IRT for LLMs is a probabilistic framework that decomposes responses into latent ability, item difficulty, and discrimination for clear performance evaluation.
Recent advances include neural, geometric, and nonparametric extensions that enable multidimensional analysis and scalable inference methods.
Applications span multi-LLM routing, benchmark design, and cross-modal evaluation, providing actionable insights for model assessment and improvement.

Item Response Theory (IRT) provides a formal probabilistic framework for analyzing response data, traditionally from human test-takers on assessment items, and has been widely adapted for evaluating and interpreting LLMs. IRT models decompose observed responses into parameters reflecting model "ability," item "difficulty," and often discrimination and guessing tendencies. When robustly applied to LLMs, IRT yields interpretable measures of model competence, item informativeness, and evaluation reliability. Recent developments include simulation-based IRT using LLMs themselves, scalable computational techniques, extensions to multilingual and multimodal contexts, and geometric and neural IRT variants.

1. Classical IRT Formalism and Adaptation to LLMs

Classical IRT, especially the 2-parameter logistic (2PL) model, expresses the probability of a correct (or preferred) response as a function of latent model ability ( $\theta$ ), item difficulty ( $b$ ), and item discrimination ( $a$ ):

$P(y_{ij} = 1 | \theta_i, a_j, b_j) = \frac{1}{1 + \exp[-a_j(\theta_i - b_j)]}.$

For LLM evaluation, models act as "respondents" and benchmark items as "questions." The 1PL (Rasch) model assumes a constant discrimination $a$ shared by all items; the 3PL adds a pseudo-guessing parameter $c_j$ that sets a lower asymptote on the success probability (Cong et al., 30 Apr 2026). IRT parameters for LLMs are learned via maximum likelihood estimation (MLE), Bayesian inference, or scalable stochastic optimization (Qu et al., 7 May 2026, Frick et al., 2024).
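
To make the relationship between the 1PL, 2PL, and 3PL forms concrete, the following is a minimal sketch of their item characteristic curves. The function names and the example parameter values are illustrative assumptions, not taken from any of the cited works.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL: P(correct) for ability theta, discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def icc_1pl(theta, b, a=1.0):
    """1PL (Rasch): the 2PL with one discrimination shared by all items."""
    return icc_2pl(theta, a, b)

def icc_3pl(theta, a, b, c):
    """3PL: the 2PL rescaled so that c is the lower (guessing) asymptote."""
    return c + (1.0 - c) * icc_2pl(theta, a, b)

# Hypothetical abilities for three models on one item.
theta = np.array([-1.0, 0.0, 1.0])
p2 = icc_2pl(theta, a=1.5, b=0.0)
p3 = icc_3pl(theta, a=1.5, b=0.0, c=0.25)
```

At $\theta = b$ the 2PL gives probability 0.5, while the 3PL gives $c + (1 - c)/2$, which is why guessing inflates the apparent success rate on multiple-choice items.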

Extensions include multidimensional IRT (MIRT), where each model or item is associated with a vector embedding, allowing modeling of topical or skill-specific abilities as in (Song et al., 1 Jun 2025, Chen et al., 1 Oct 2025, Yao et al., 26 Sep 2025).
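
As an illustration of the multidimensional case, the sketch below uses one common compensatory MIRT form, in which the logit is the inner product of a model ability vector and an item discrimination vector minus an item offset. This particular parameterization, the function name, and the toy values are assumptions for illustration; the cited works may use different forms.

```python
import numpy as np

def mirt_prob(theta, a, b):
    """Compensatory MIRT success probabilities.

    theta: (n_models, k) ability vectors
    a:     (n_items, k) discrimination vectors
    b:     (n_items,) difficulty-like offsets
    Returns an (n_models, n_items) matrix of probabilities.
    """
    logits = theta @ a.T - b
    return 1.0 / (1.0 + np.exp(-logits))

# Two hypothetical models, each strong on a different skill dimension.
theta = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
# Three hypothetical items: skill 1 only, skill 2 only, and a mix.
a = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
b = np.zeros(3)
P = mirt_prob(theta, a, b)
```

In this toy setup the first model is more likely to succeed on the first item and the second model on the second item, so no single scalar ranking of the two models holds across all items.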

2. Neural, Geometric, and Nonparametric IRT Extensions

Recent work expands beyond classical logistic forms to model the complex and multidimensional nature of LLM performance:

Neural IRT: PSN-IRT employs pseudo-Siamese networks, mapping model/item IDs to ability and item parameters via multi-layer perceptrons, supporting full 4PL structure (difficulty, discrimination, guessing, feasibility) (Zhou et al., 21 May 2025).
Mixture-of-Experts (MoE) Neural IRT: IrtNet encodes model abilities and query discrimination/difficulty via neural networks, with end-to-end learning optimizing cross-entropy loss (Chen et al., 1 Oct 2025). This approach captures inter-model and inter-query diversity and supports nuanced routing and prediction tasks.
Geometric IRT: JE-IRT jointly embeds LLMs and benchmark items in a shared Euclidean space. For a question embedding $\mathbf{w}_q$ and a model embedding $\mathbf{v}_m$ , success probability is determined by the angle (representing semantic alignment) and the norm of $\mathbf{w}_q$ (representing difficulty). This relaxes the global ranking assumption of scalar $\theta$ , yielding a multidimensional and interpretable evaluation geometry (Yao et al., 26 Sep 2025).
Nonparametric IRT: GPIRT replaces parametric item response functions with Gaussian processes, flexibly modeling arbitrary, potentially non-monotonic response curves and enabling Bayesian inference over both abilities and item curves (Duck-Mayr et al., 2020).
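
The 4PL structure mentioned above can be sketched with the standard four-parameter logistic curve. This assumes that the "feasibility" parameter plays the role of the upper asymptote $d$ in the standard 4PL form; the neural networks that produce the parameters in PSN-IRT are not reproduced here, and all values are hypothetical.

```python
import numpy as np

def icc_4pl(theta, a, b, c, d):
    """Standard 4PL curve.

    a: discrimination, b: difficulty,
    c: lower asymptote (guessing), d: upper asymptote (assumed 'feasibility').
    """
    return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item that even very strong models solve only 90% of the time,
# for example because of label noise or ambiguity.
theta = np.linspace(-4.0, 4.0, 9)
p = icc_4pl(theta, a=1.2, b=0.0, c=0.2, d=0.9)
```

Setting $d = 1$ recovers the 3PL, and additionally setting $c = 0$ recovers the 2PL.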

3. Practical Frameworks and Scalable Inference

Computational scaling is a major bottleneck in LLM-scale IRT. Multiple strategies have been developed:

Coresets for Alternating Optimization: Compact weighted subsamples (coresets) provably approximate the alternating logistic regression subproblems in large-scale IRT, reducing both memory and runtime by orders of magnitude. This enables fitting 2PL/3PL models to very large evaluation matrices while retaining parameter accuracy (Frick et al., 2024).
Majorization–Minimization (MM) Matrix Factorization: Constrained block MM (cBMM) solves 2PL-IRT via a sequence of quadratic surrogate minimizations, effectively casting IRT parameter recovery as constrained matrix factorization. Each subproblem admits closed-form or nonnegative least squares updates. This yields an explicit per-iteration complexity bound and guarantees global convergence under mild identifiability conditions (Qu et al., 7 May 2026).
Identity-Link and Additive IRT for Pairwise Comparisons: For "label-free" evaluation scenarios based on total variation distance mutual information (TVD-MI), an identity link is empirically shown to best preserve additivity (a response that decomposes into a sum of an agent effect and an item effect) when mapping pairwise critic scores to (agent, item) responses, outperforming nonlinear logit/probit links in both interpretability and fit (Robertson, 16 Oct 2025).
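
The alternating structure that these methods accelerate can be sketched as a plain baseline: abilities are updated with item parameters held fixed, then item parameters are updated with abilities held fixed. The code below uses simple gradient ascent on the 2PL log-likelihood and is neither the coreset method nor cBMM. The function names, step size, iteration count, clipping range, and simulated data are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_2pl_alternating(Y, n_iters=300, lr=0.5):
    """Fit a 2PL model to a binary (n_models, n_items) matrix Y."""
    n_models, n_items = Y.shape
    theta = np.zeros(n_models)
    a = np.ones(n_items)
    b = np.zeros(n_items)
    for _ in range(n_iters):
        # Step 1: update abilities with item parameters held fixed.
        resid = Y - sigmoid(a * (theta[:, None] - b))
        theta = theta + lr * (resid * a).mean(axis=1)
        # Fix the location and scale of theta for identifiability.
        theta = (theta - theta.mean()) / theta.std()
        # Step 2: update item parameters with abilities held fixed.
        diff = theta[:, None] - b
        resid = Y - sigmoid(a * diff)
        grad_a = (resid * diff).mean(axis=0)
        grad_b = (-resid * a).mean(axis=0)
        a = np.clip(a + lr * grad_a, 0.05, 5.0)  # keep discriminations positive
        b = b + lr * grad_b
    return theta, a, b

# Simulated evaluation matrix: 100 hypothetical models, 300 hypothetical items.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=100)
true_a = rng.uniform(0.5, 2.0, size=300)
true_b = rng.normal(size=300)
probs = sigmoid(true_a * (true_theta[:, None] - true_b))
Y = (rng.random((100, 300)) < probs).astype(float)

theta_hat, a_hat, b_hat = fit_2pl_alternating(Y)
corr = np.corrcoef(theta_hat, true_theta)[0, 1]
```

Each step touches every entry of the response matrix, which is the cost that coreset subsampling and closed-form MM updates are designed to reduce.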

4. Simulation-Based and Generative IRT Using LLMs

Novel approaches use LLMs not only as test-takers but also as simulators of human/ability-conditioned responses, generating synthetic data to fit IRT item parameters:

Ability-Conditioned LLM Simulation: Large models such as Qwen-3 are finetuned using Low-Rank Adaptation (LoRA) to explicitly generate responses conditioned on discrete ability descriptors (e.g., "Proficient," "Exemplary"), covering a range of latent ability bands (Ormerod, 5 Jan 2026). Given prompts parameterized by semantic ability, LLMs generate choice probabilities which serve as "synthetic" item characteristic curves (ICCs). 2PL parameters are fit by minimizing the squared error between simulated probability curves and the logistic ICC, mapping discrete ability bands to expected values using the assumed latent ability distribution.
Advantages & Limitations: Simulation-based approaches reduce field testing requirements and produce strong discrimination modeling, but are subject to dataset scale limitations, regression-to-the-mean artifacts, and incomplete readiness for high-stakes assessment. Proposed extensions include chain-of-thought rationale modeling, direct 3PL fitting, and cross-domain scaling studies.
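
The curve-fitting step described above can be sketched as follows. The sketch assumes that the ability bands are equal-probability slices of a standard normal latent distribution, maps each band to its conditional mean, and then finds the 2PL parameters that minimize squared error against the simulated probabilities using a coarse grid search. The banding scheme, grid ranges, function names, and simulated probabilities are hypothetical; the cited work may use a different optimizer and mapping.

```python
import numpy as np
from statistics import NormalDist

def band_means(n_bands):
    """Expected theta within each of n equal-probability bands of N(0, 1)."""
    nd = NormalDist()
    cuts = [nd.inv_cdf(k / n_bands) for k in range(1, n_bands)]
    # Density at band edges; it is zero at minus and plus infinity.
    pdf = [0.0] + [nd.pdf(c) for c in cuts] + [0.0]
    # Truncated-normal mean: (pdf(lo) - pdf(hi)) / P(band), with P(band) = 1/n.
    return np.array([(pdf[k] - pdf[k + 1]) * n_bands for k in range(n_bands)])

def fit_2pl_to_curve(theta, p_sim):
    """Grid search for (a, b) minimizing squared error to simulated probabilities."""
    a_grid = np.linspace(0.1, 4.0, 40)
    b_grid = np.linspace(-4.0, 4.0, 81)
    A, B = np.meshgrid(a_grid, b_grid, indexing="ij")
    pred = 1.0 / (1.0 + np.exp(-A[..., None] * (theta - B[..., None])))
    sse = ((pred - p_sim) ** 2).sum(axis=-1)
    i, j = np.unravel_index(np.argmin(sse), sse.shape)
    return a_grid[i], b_grid[j]

# Four hypothetical ability bands and simulated P(correct) for one item.
theta_bands = band_means(4)
p_sim = np.array([0.12, 0.30, 0.58, 0.81])
a_hat, b_hat = fit_2pl_to_curve(theta_bands, p_sim)
```

With only a handful of bands per item, the fit is determined by very few points, which is one way to see why such estimates can be sensitive to how the bands are mapped onto the latent scale.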

5. Applications: Routing, Benchmarking, and Analysis

IRT frameworks underpin a range of advanced applications for LLM analysis:

Multi-LLM Routing: IRT-Router casts each LLM as a test-taker and each user query as an item. Latent abilities and item parameters are fit across potentially multidimensional axes. Routing is performed by maximizing a score that trades off predicted accuracy and cost (via IRT probabilities and per-model pricing), with mechanisms for online "warm-up" to address query cold-start (Song et al., 1 Jun 2025). The approach yields strong performance-cost tradeoffs and interpretable model/query embeddings.
Evaluating Short Answer Grading: IRT models, including testlet-effect variants, provide response-level analysis of grading ability and response difficulty in LLM-based ASAG, diagnosing where models degrade as item difficulty increases and revealing distinct patterns of error on ambiguous or difficult responses (Cong et al., 30 Apr 2026).
Benchmark Curation and Human Alignment: Fisher information derived from IRT parameter fits (especially discrimination at top model ability) supports diagnostic test construction. Methods such as PSN-IRT can identify the "most informative" items, enabling the construction of compact benchmarks whose model rankings align more closely with human preference (e.g., improving Kendall’s $\tau$ from 0.64 to 0.90 on a compact item subset) (Zhou et al., 21 May 2025).
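
The accuracy-versus-cost trade-off used in routing can be illustrated with a minimal sketch. It assumes scalar abilities, a 2PL success probability, and a linear score of the form predicted accuracy minus a weight times cost; the exact scoring function of IRT-Router is not reproduced here, and the names, costs, and weight below are hypothetical.

```python
import numpy as np

def route(theta, a, b, cost, lam):
    """Pick the model that maximizes predicted accuracy minus lam * cost.

    theta: (n_models,) model abilities
    a, b:  discrimination and difficulty of the incoming query
    cost:  (n_models,) per-model cost on a normalized scale
    lam:   weight on cost relative to accuracy
    """
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    score = p - lam * cost
    return int(np.argmax(score)), p

# Hypothetical small (cheap) and large (expensive) models.
theta = np.array([0.0, 1.5])
cost = np.array([0.1, 1.0])

easy_choice, _ = route(theta, a=1.0, b=-2.0, cost=cost, lam=0.2)
hard_choice, _ = route(theta, a=1.0, b=2.0, cost=cost, lam=0.2)
```

In this toy setup the easy query goes to the cheap model, because both models are likely to succeed, while the hard query goes to the stronger model, because its accuracy advantage outweighs its cost.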

6. Multilingual and Multimodal IRT Extensions

IRT has been adapted to evaluate LLMs across language and modality axes:

Multilingual IRT: Multilingual-IRT introduces per-language item difficulty deviations, content/language-specific discrimination, and per-model per-language ability residuals. This framework enables prediction of untested (item, LLM, lang) entries, principled detection of translation errors, and recovery of culture-specific items. Fitted language-correlation matrices recover known linguistic families and resource hierarchies (Lior et al., 14 Jun 2026).
Multimodal/Multidimensional IRT (M3IRT): The M3IRT framework decomposes both model ability and item difficulty into base, image-only, text-only, and cross-modal integration components. The model quantifies cross-modal ability/difficulty and ensures evaluation subsets prioritize "truly cross-modal" items over single-modality shortcuts. D-optimality and Fisher information criteria guide adaptive benchmark reduction, improving ranking fidelity and robustness to contamination (Uebayashi et al., 3 Mar 2026).
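
The kind of decomposition described for the multilingual case can be sketched by adding language-specific deviations to a 2PL logit. The additive form below, the array layout, and all names and values are assumptions made for illustration; they are not the parameterization of the cited Multilingual-IRT or M3IRT frameworks.

```python
import numpy as np

def multilingual_prob(i, j, l, theta, theta_res, b, b_dev, a):
    """P(model i answers item j correctly in language l).

    theta:     (n_models,) base abilities
    theta_res: (n_models, n_langs) per-model, per-language ability residuals
    b:         (n_items,) base item difficulties
    b_dev:     (n_items, n_langs) per-language difficulty deviations
    a:         (n_items, n_langs) language-specific discriminations
    """
    ability = theta[i] + theta_res[i, l]
    difficulty = b[j] + b_dev[j, l]
    return 1.0 / (1.0 + np.exp(-a[j, l] * (ability - difficulty)))

# One hypothetical model, one item, two languages.
theta = np.array([0.5])
theta_res = np.array([[0.0, -0.4]])  # model is weaker in language 1
b = np.array([0.0])
b_dev = np.array([[0.0, 0.6]])       # item is harder in language 1
a = np.array([[1.0, 1.0]])

p_lang0 = multilingual_prob(0, 0, 0, theta, theta_res, b, b_dev, a)
p_lang1 = multilingual_prob(0, 0, 1, theta, theta_res, b, b_dev, a)
```

Once such parameters are fitted, the same function can score any (item, model, language) triple, including untested ones, and an unusually large fitted difficulty deviation for one language is the kind of signal that could flag a translation problem.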

7. Interpretability, Diagnostic Utility, and Open Challenges

IRT models confer interpretability by mapping model performance to latent ability scales and item curves:

Interpretability and Visualization: Embedding-based and neural IRT models (e.g., JE-IRT, PSN-IRT, IrtNet) provide interpretable geometry—model-item alignments reveal strengths in semantic or domain-specific directions. IRT parameters (difficulty, discrimination) correlate with empirical difficulty, domain clusters, and identify non-discriminative or contaminated items (Yao et al., 26 Sep 2025, Zhou et al., 21 May 2025, Chen et al., 1 Oct 2025).
Diagnostic and Adaptive Testing: Bayesian and nonparametric IRT models (GPIRT) enable active learning and adaptive testing, selecting the most informative items for a given model to efficiently estimate ability (Duck-Mayr et al., 2020).
Challenges & Limitations: Open challenges include generalizing IRT fitting to non-parallel and open-ended benchmarks, extending to ordinal and graded response models, scaling in high-dimensional and sparse regimes, and formalizing uncertainty quantification and robust calibration.
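
Item informativeness, as used for adaptive testing and benchmark reduction, can be sketched with the Fisher information of a 2PL item, $I_j(\theta) = a_j^2 P_j(\theta)\,(1 - P_j(\theta))$. The selection rule below (take the items with the largest information at the current ability estimate) is a simple assumed strategy, and the item parameters are hypothetical.

```python
import numpy as np

def fisher_info_2pl(theta, a, b):
    """Fisher information of each 2PL item at ability theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def most_informative(theta, a, b, k=1):
    """Indices of the k items with the highest information at theta."""
    info = fisher_info_2pl(theta, a, b)
    return np.argsort(-info)[:k]

# Three hypothetical items: weakly discriminating, well targeted, far too hard.
a = np.array([0.5, 2.0, 2.0])
b = np.array([0.0, 0.0, 3.0])
info = fisher_info_2pl(0.0, a, b)
best = most_informative(0.0, a, b, k=1)
```

Information peaks where an item's difficulty matches the model's ability and grows with the square of the discrimination, so items that every model passes or every model fails contribute little to separating models.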

In summary, the application of IRT models to LLMs has evolved from classical psychometric calibration to encompass highly scalable, simulation-driven, neural, geometric, multilingual/multimodal, and diagnostic frameworks. Modern IRT-based LLM benchmarks provide a principled mechanism for quantifying model abilities, analyzing benchmark efficacy, supporting routing and test design, and enabling cross-lingual and multimodal evaluation—all with rigorous statistical interpretability and computational efficiency (Ormerod, 5 Jan 2026, Song et al., 1 Jun 2025, Zhou et al., 21 May 2025, Chen et al., 1 Oct 2025, Yao et al., 26 Sep 2025, Frick et al., 2024, Qu et al., 7 May 2026, Duck-Mayr et al., 2020, Cong et al., 30 Apr 2026, Lior et al., 14 Jun 2026, Uebayashi et al., 3 Mar 2026, Robertson, 16 Oct 2025).