
Multimodal Multidimensional IRT (M³IRT)

Updated 16 May 2026
  • The paper introduces M³IRT to decompose both subject abilities and task difficulties into text, image, and cross-modal dimensions, enabling precise evaluation of AI reasoning.
  • The framework extends classical IRT by incorporating multidimensional latent traits, facilitating adaptive benchmarking and targeted diagnostics for modern AI systems.
  • Empirical results demonstrate that M³IRT achieves high rank fidelity and effectively filters out shortcut items, thereby improving benchmark quality under heavy contamination.

Multimodal Multidimensional Item Response Theory (M³IRT) is an extension of classical Item Response Theory (IRT) designed to rigorously capture and evaluate the cross-modal reasoning abilities required by modern artificial intelligence systems, notably Multimodal LLMs (MLLMs). The M³IRT framework decomposes both subject (model) abilities and item (task/question) difficulties into dimensions associated with individual modalities (e.g., image-only, text-only) and their integration (genuine cross-modal reasoning). This multidimensional approach enables precise separation of shortcut questions solvable using only a single modality from genuinely cross-modal tasks, thereby improving benchmarking fidelity and enabling adaptive evaluation (Uebayashi et al., 3 Mar 2026).

1. Latent Traits and Modality-Aware Parameters

M³IRT equips both models (“subjects”) and items (“questions”) with three explicit modality-aligned latent traits:

  • Model abilities: For model $i$, the ability vector is $\boldsymbol\theta_i = (\theta_{\text{text}, i}, \theta_{\text{image}, i}, \theta_{\text{cross}, i})^{\top}$, where
    • $\theta_{\text{text}}$: text-only reasoning capability,
    • $\theta_{\text{image}}$: image-only (visual) reasoning capability,
    • $\theta_{\text{cross}}$: integrated cross-modal reasoning capability.
  • Item parameters: For item $j$,
    • $\beta_{\text{text},j}$: text-only difficulty,
    • $\beta_{\text{image},j}$: image-only difficulty,
    • $\beta_{\text{cross},j}$: cross-modal (integration) difficulty,
    • $a_{\text{text},j}$, $a_{\text{image},j}$, $a_{\text{cross},j}$: discrimination parameters for each dimension.

Large values of $\beta_{\text{cross},j}$ indicate questions that demand joint reasoning across modalities, while a low $\beta_{\text{cross},j}$ identifies shortcut items.

2. Three-Dimensional Item Response Model

M³IRT generalizes the conventional 2PL model to three modality-aligned latent dimensions. For model $i$ and item $j$,

$$P(y_{ij} = 1 \mid \boldsymbol\theta_i, \mathbf{a}_j, \boldsymbol\beta_j) = \sigma(z_{ij}),$$

where

$$z_{ij} = \sum_{m \in \{\text{text},\, \text{image},\, \text{cross}\}} a_{m,j}\,(\theta_{m,i} - \beta_{m,j}),$$

and $\sigma(z) = 1/(1 + e^{-z})$.
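As a minimal sketch of a three-dimensional 2PL response model, the snippet below assumes a compensatory form in which the logit is the sum, over the text, image, and cross dimensions, of discrimination times (ability minus difficulty). The function names and all parameter values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

MODALITIES = ("text", "image", "cross")

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def response_probability(theta: dict, a: dict, beta: dict) -> float:
    """P(correct) under the assumed form sigma(sum_m a_m * (theta_m - beta_m))."""
    z = sum(a[m] * (theta[m] - beta[m]) for m in MODALITIES)
    return sigmoid(z)

# Hypothetical values: one model, two items that differ only in cross-modal difficulty.
model = {"text": 1.0, "image": 0.5, "cross": -0.5}
unit_a = {m: 1.0 for m in MODALITIES}
shortcut_item = {"text": 0.0, "image": 0.0, "cross": -2.0}    # low cross-modal difficulty
cross_modal_item = {"text": 0.0, "image": 0.0, "cross": 2.0}  # high cross-modal difficulty

p_shortcut = response_probability(model, unit_a, shortcut_item)
p_cross = response_probability(model, unit_a, cross_modal_item)
print(f"shortcut item: {p_shortcut:.3f}, cross-modal item: {p_cross:.3f}")
```

With these values, a model with weak cross-modal ability still answers the shortcut item with high probability, while its success probability drops sharply on the item with high cross-modal difficulty.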

The structure naturally supports extension to additional modalities by increasing the dimensionality, or to a four-dimensional variant that includes a “base” ability/difficulty.

3. Parameter Estimation and Regularization

Model parameters $\boldsymbol\theta_i$, $\mathbf{a}_j$, and $\boldsymbol\beta_j$ are estimated to maximize the likelihood of the observed responses, given a sparse binary response matrix (or tensor) $Y$:

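$$\max_{\{\boldsymbol\theta_i\},\,\{\mathbf{a}_j\},\,\{\boldsymbol\beta_j\}} \; \sum_{(i,j) \in \Omega} \Big[\, y_{ij} \log P_{ij} + (1 - y_{ij}) \log (1 - P_{ij}) \,\Big],$$

where $\Omega$ is the set of observed (model, item) pairs, $y_{ij} \in \{0, 1\}$ is the recorded response, and $P_{ij}$ is the modeled probability that model $i$ answers item $j$ correctly.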

Mini-batch stochastic gradient descent with the Adam optimizer is employed to handle very large datasets and missing data (incomplete response matrices) without explicit imputation. Regularization via Gaussian priors or $L_2$ penalties on the parameters is standard to prevent overfitting; implementations often constrain abilities and difficulties to bounded intervals.
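The estimation idea can be sketched in pure Python under several stated assumptions. The sketch uses plain mini-batch gradient ascent rather than Adam, sums the likelihood over observed records only (so no imputation is needed), and applies a simple L2 pull on the parameters touched in each batch. The `mask` tuple that switches latent dimensions on or off per input condition, the synthetic data, and every hyperparameter value are hypothetical choices for this illustration, not details confirmed by the paper.

```python
import math
import random

D = 3  # latent dimensions: text, image, cross

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def prob(theta_i, a_j, beta_j, mask):
    """P(correct); mask switches dimensions on or off per input condition (assumed encoding)."""
    return sigmoid(sum(mask[m] * a_j[m] * (theta_i[m] - beta_j[m]) for m in range(D)))

def mean_nll(obs, theta, a, beta):
    """Average negative log-likelihood over observed (i, j, mask, y) records only."""
    total = 0.0
    for i, j, mask, y in obs:
        p = min(max(prob(theta[i], a[j], beta[j], mask), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(obs)

def fit(obs, n_models, n_items, epochs=30, batch=32, lr=0.05, lam=0.01, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * D for _ in range(n_models)]
    beta = [[0.0] * D for _ in range(n_items)]
    a = [[1.0] * D for _ in range(n_items)]
    obs = list(obs)
    for _ in range(epochs):
        rng.shuffle(obs)
        for start in range(0, len(obs), batch):
            g_theta, g_beta, g_a = {}, {}, {}
            for i, j, mask, y in obs[start:start + batch]:
                r = y - prob(theta[i], a[j], beta[j], mask)  # d(log-lik)/d(logit)
                gt = g_theta.setdefault(i, [0.0] * D)
                gb = g_beta.setdefault(j, [0.0] * D)
                ga = g_a.setdefault(j, [0.0] * D)
                for m in range(D):
                    gt[m] += r * mask[m] * a[j][m]
                    gb[m] -= r * mask[m] * a[j][m]
                    ga[m] += r * mask[m] * (theta[i][m] - beta[j][m])
            # Ascent step with an L2 pull toward 0 (toward 1 for discriminations).
            for i, g in g_theta.items():
                for m in range(D):
                    theta[i][m] += lr * (g[m] - lam * theta[i][m])
            for j, g in g_beta.items():
                for m in range(D):
                    beta[j][m] += lr * (g[m] - lam * beta[j][m])
            for j, g in g_a.items():
                for m in range(D):
                    a[j][m] += lr * (g[m] - lam * (a[j][m] - 1.0))
    return theta, a, beta

# Synthetic sparse data: each (model, item, condition) cell is observed with probability 0.5.
CONDITIONS = [(1, 0, 0), (0, 1, 0), (1, 1, 1)]  # text-only, image-only, both inputs
rng = random.Random(1)
n_models, n_items = 10, 40
true_theta = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(n_models)]
true_beta = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(n_items)]
obs = []
for i in range(n_models):
    for j in range(n_items):
        for mask in CONDITIONS:
            if rng.random() < 0.5:
                p = prob(true_theta[i], [1.0] * D, true_beta[j], mask)
                obs.append((i, j, mask, int(rng.random() < p)))

init_theta = [[0.0] * D for _ in range(n_models)]
init_beta = [[0.0] * D for _ in range(n_items)]
init_a = [[1.0] * D for _ in range(n_items)]
nll_before = mean_nll(obs, init_theta, init_a, init_beta)
theta_hat, a_hat, beta_hat = fit(obs, n_models, n_items)
nll_after = mean_nll(obs, theta_hat, a_hat, beta_hat)
print(f"mean NLL before: {nll_before:.3f}, after: {nll_after:.3f}")
```

Because unobserved cells never enter the loop, the sparsity of the response matrix costs nothing beyond the records that exist. Swapping in Adam or box constraints would change only the update step.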

4. Modality Decomposition of Item Difficulty

Item difficulty decomposes additively by input format:

$$\beta_j = \beta_{\text{text},j} + \beta_{\text{image},j} + \beta_{\text{cross},j}.$$

By exposing items under different conditions (no input, text-only, image-only, both), the separate contributions $\beta_{\text{text},j}$, $\beta_{\text{image},j}$, and $\beta_{\text{cross},j}$ can be identified through likelihood maximization. Crucially:

  • $\beta_{\text{text},j}$: how much harder the item is with only text,
  • $\beta_{\text{image},j}$: how much harder the item is with only the image,
  • $\beta_{\text{cross},j}$: the barrier when cross-modal integration is disabled.

A high $\beta_{\text{cross},j}$ flags that a question genuinely necessitates both modalities.

5. Evaluation Methodology and Empirical Results

The framework was evaluated on the MMMU, MathVista, and SEED-Bench benchmarks, covering 2,900+ multimodal items and 24 leading vision-LLMs. Key methodological features include:

  • Synthetic insertion of 50% low-quality “shortcut” items to simulate unbalanced or trivially solvable benchmarks.
  • Adaptive subset selection via computerized adaptive testing (Fisher information in the 2D case; D-optimality design in 3D).
  • Fidelity tracked by Spearman’s $\rho$ between rankings from the full benchmark and from small subsets.
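The D-optimality criterion named in the list above can be illustrated with a small greedy selector. This is a sketch under assumptions, not the paper's procedure: it takes the item information matrix of a compensatory logistic model, $p(1-p)\,\mathbf{a}\mathbf{a}^{\top}$, starts from an identity matrix as a stand-in prior precision, and repeatedly picks the item that maximizes the determinant of the accumulated information. The item names and parameter values are hypothetical.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def item_information(theta, a, beta):
    """Fisher information matrix p(1-p) * a a^T of one item at ability theta."""
    p = sigmoid(sum(am * (t - b) for am, t, b in zip(a, theta, beta)))
    w = p * (1.0 - p)
    return [[w * a[r] * a[c] for c in range(3)] for r in range(3)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def add(m1, m2):
    return [[m1[r][c] + m2[r][c] for c in range(3)] for r in range(3)]

def select_d_optimal(theta, items, k):
    """Greedily pick k items, each maximizing det of the accumulated information."""
    # Identity prior precision keeps the determinant nonzero at the start (assumption).
    info = [[1.0 if r == c else 0.0 for c in range(3)] for r in range(3)]
    remaining = dict(items)
    chosen = []
    for _ in range(min(k, len(remaining))):
        best = max(
            remaining,
            key=lambda name: det3(add(info, item_information(theta, *remaining[name]))),
        )
        info = add(info, item_information(theta, *remaining[best]))
        chosen.append(best)
        del remaining[best]
    return chosen

# Hypothetical item bank: name -> (discriminations, difficulties) in (text, image, cross) order.
items = {
    "text_heavy": ((2.0, 0.1, 0.1), (0.0, 0.0, 0.0)),
    "text_heavy_2": ((1.9, 0.1, 0.1), (0.0, 0.0, 0.0)),
    "image_heavy": ((0.1, 1.8, 0.1), (0.0, 0.0, 0.0)),
    "cross_heavy": ((0.1, 0.1, 1.6), (0.0, 0.0, 0.0)),
    "weak": ((0.2, 0.2, 0.2), (0.0, 0.0, 0.0)),
}
chosen = select_d_optimal((0.0, 0.0, 0.0), items, k=3)
print(chosen)
```

In this toy bank the selector skips the second text-heavy item even though it is individually more informative than the image-heavy and cross-heavy items, because a determinant criterion rewards covering all three latent dimensions rather than piling information onto one.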

Main findings (Uebayashi et al., 3 Mar 2026):

  • M³IRT achieves high rank fidelity (measured by Spearman’s $\rho$) with only 1–3% of items, whereas standard or random IRT requires over 30%.
  • Even with 50% contamination by shortcuts, M³IRT filters out low-quality items (<25%) more effectively than baselines.
  • ROC-AUC for prediction of held-out responses under heavy contamination matches that of classical IRT.
  • Genuinely cross-modal items receive the highest $\beta_{\text{cross},j}$ estimates and are prioritized when constructing adaptive or compact benchmark variants.
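Rank fidelity of the kind reported in these findings compares the model ranking obtained from a small item subset with the ranking from the full benchmark. A minimal stdlib sketch of Spearman's rank correlation follows; the accuracy values for five models are hypothetical and serve only to show the computation.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores for five models on the full benchmark and on a small subset.
full_scores = [0.81, 0.74, 0.69, 0.55, 0.42]
subset_scores = [0.80, 0.70, 0.75, 0.50, 0.40]
rho = spearman_rho(full_scores, subset_scores)
print(f"Spearman rho = {rho:.2f}")  # 0.90: one adjacent pair of models is swapped
```

A value near 1 means the subset reproduces the full-benchmark ordering of models, which is the property a compact benchmark needs to preserve.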

6. Benchmark Construction and Model Diagnostics

M³IRT enables the construction of lean, reliable benchmark subsets by prioritizing items with high estimated cross-modal difficulty, thus substantially reducing evaluation costs (up to 90%). The explicit separation of modality-dependent abilities and difficulties allows fine-grained diagnostics:

  • A low $\theta_{\text{cross},i}$ value reveals models reliant on unimodal “shortcuts,” whereas high values indicate robust cross-modal integration capability.
  • The per-modality scores in $\boldsymbol\theta_i$ can guide pretraining strategies and targeted fine-tuning for models that underperform in a specific modality.

The $\beta_{\text{cross},j}$ parameter serves as an objective filter to improve benchmark quality by removing questions solvable with only one modality, thus increasing the interpretive value of model comparisons.
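Filtering on estimated cross-modal difficulty can be sketched as a simple threshold split. The threshold value, the item identifiers, and the estimates below are hypothetical; the source does not state a specific cutoff, so in practice the cutoff would need to be chosen and validated for the benchmark at hand.

```python
def split_by_cross_difficulty(beta_cross, threshold):
    """Partition items into (kept, flagged) by estimated cross-modal difficulty."""
    kept = sorted(item for item, b in beta_cross.items() if b >= threshold)
    flagged = sorted(item for item, b in beta_cross.items() if b < threshold)
    return kept, flagged

# Hypothetical cross-modal difficulty estimates for five items.
beta_cross_hat = {"q1": 1.8, "q2": -0.9, "q3": 0.4, "q4": -1.5, "q5": 2.3}
kept, flagged = split_by_cross_difficulty(beta_cross_hat, threshold=0.0)
print("kept:", kept)
print("flagged as likely shortcuts:", flagged)
```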

M³IRT's structure is inherently extensible to an arbitrary number of modalities (e.g., adding audio, video), and can accommodate generative response settings via alternative link functions (e.g., normal-ogive). Hierarchical Bayesian estimation is feasible for item banks, supporting item pooling across tasks or time.

A related multidimensional latent class IRT model (Bacci et al., 2014) explicitly addresses non-ignorable missingness by distinguishing between abilities and a latent “propensity to respond,” estimated with the Expectation-Maximization algorithm. This approach demonstrates robust recovery of latent structure and highlights the importance of modeling non-ignorable missing-data patterns, which is plausibly relevant when M³IRT is applied to incomplete AI benchmark data in practice.
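For the normal-ogive link mentioned above as an alternative to the logistic link, a short stdlib sketch shows how close the two are once the logit is rescaled. The constant 1.702 is the classical scaling under which the logistic curve approximates the standard normal CDF; the function names are illustrative only.

```python
import math

def logistic_link(z: float) -> float:
    """Logistic link: 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def normal_ogive_link(z: float) -> float:
    """Normal-ogive link: the standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Largest gap between the normal ogive and the rescaled logistic on a grid over [-4, 4].
grid = [k / 100.0 for k in range(-400, 401)]
max_gap = max(abs(normal_ogive_link(z) - logistic_link(1.702 * z)) for z in grid)
print(f"max |Phi(z) - sigma(1.702 z)| on the grid: {max_gap:.4f}")
```

Because the two links differ by less than 0.01 in probability after rescaling, switching links mainly changes the scale of the discrimination parameters rather than the qualitative fit.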

M³IRT offers a principled psychometric foundation and practical methodology to refine and dynamically evaluate complex multimodal reasoning benchmarks in emerging AI systems (Uebayashi et al., 3 Mar 2026, Bacci et al., 2014).
