
Intraclass Correlation Coefficient (ICC)

Updated 30 November 2025
  • ICC is a statistical measure quantifying the proportion of total variance attributable to between-group differences versus measurement error.
  • It employs variance-components models (e.g., one-way, two-way, and mixed effects) to assess reliability across diverse fields including biomedical imaging and psychometrics.
  • Recent extensions such as rank-based, distance-based, and Bayesian approaches broaden ICC’s application to high-dimensional and non-Euclidean data.

The intraclass correlation coefficient (ICC) is a fundamental statistical index measuring the similarity of units within a group in repeated measurement or clustered data settings. It quantifies the proportion of total variance in an observed variable attributable to differences across groups, clusters, subjects, or experimental units, as opposed to within-unit (measurement) error or other sources of residual variation. ICCs have broad application across multilevel modeling, reliability assessment, biomedical imaging, machine learning, and clinical methodology.

1. Mathematical Formulations and Model Structures

Classic ICC definitions are grounded in variance-components models, usually with the random-effects form

$$X_{ij} = \mu + \mu_i + \varepsilon_{ij},$$

where $\mu_i \sim N(0,\sigma^2_\mu)$ models the between-group (subject, cluster) deviation, and $\varepsilon_{ij} \sim N(0,\sigma^2)$ captures within-group error. The canonical one-way ICC is given by

$$\mathrm{ICC} = \frac{\sigma^2_\mu}{\sigma^2_\mu + \sigma^2}.$$
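
As a concrete illustration, these variance components can be estimated from one-way ANOVA mean squares. The sketch below is a minimal NumPy implementation assuming a balanced design (every group measured the same number of times); the function name `icc_oneway` and the toy data are illustrative, not taken from any cited work.

```python
import numpy as np

def icc_oneway(x):
    """One-way, single-measurement ICC (ICC(1,1)) from a balanced n-by-k
    array: rows are groups/subjects, columns are repeated measurements."""
    n, k = x.shape
    grand_mean = x.mean()
    row_means = x.mean(axis=1)
    # ANOVA mean squares.
    ms_between = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((x - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    # Method-of-moments estimates of the variance components.
    sigma2_mu = (ms_between - ms_within) / k   # may be negative in small samples
    sigma2_eps = ms_within
    return sigma2_mu / (sigma2_mu + sigma2_eps)

# Toy example: 5 subjects, 3 repeated measurements each.
rng = np.random.default_rng(0)
subject_effects = rng.normal(0.0, 2.0, size=(5, 1))
data = subject_effects + rng.normal(0.0, 1.0, size=(5, 3))
print(round(float(icc_oneway(data)), 3))
```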

Extensions to two-way, mixed/random-effects models (e.g., ICC(2,1), ICC(3,1) in the Shrout–Fleiss taxonomy) include additional random variances for rater/method effects and their interactions. ICCs can also be computed directly from the mean squares of ANOVA or generalized variance-component models, recovering formulas such as

$$\mathrm{ICC}(2,1) = \frac{MS_{R} - MS_{E}}{MS_{R} + (k-1)MS_{E} + \frac{k}{n}(MS_{C}-MS_{E})},$$

where $MS_R$, $MS_C$, and $MS_E$ are the between-group, between-condition, and residual mean squares, respectively, with $k$ raters/measurements and $n$ groups (Bianchini et al., 2020).
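
A minimal sketch of this calculation, assuming a complete $n \times k$ subjects-by-raters table with a single score per cell (names and toy data are illustrative):

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random-effects, absolute agreement, single rater,
    computed from the ANOVA mean squares of an n-by-k matrix x
    (rows: subjects, columns: raters/conditions)."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)                       # subject means
    col_means = x.mean(axis=0)                       # rater/condition means
    ss_r = k * ((row_means - grand) ** 2).sum()      # between-subject SS
    ss_c = n * ((col_means - grand) ** 2).sum()      # between-rater SS
    ss_e = ((x - grand) ** 2).sum() - ss_r - ss_c    # residual SS
    ms_r, ms_c = ss_r / (n - 1), ss_c / (k - 1)
    ms_e = ss_e / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + (k / n) * (ms_c - ms_e))

# Toy example: 6 subjects each scored by 3 raters.
scores = np.array([[9, 2, 5], [6, 1, 3], [8, 4, 6],
                   [7, 1, 2], [10, 5, 6], [6, 2, 4]], dtype=float)
print(round(float(icc_2_1(scores)), 3))
```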

In mixed models for count or non-Gaussian data, ICC/VPC definitions generalize in terms of marginal variances and covariances, as in

$$\mathrm{ICC} = \frac{\operatorname{Var}[\mathbb{E}(Y_{ij}|u_j)]}{\operatorname{Var}(Y_{ij})},$$

with closed-form solutions given for Poisson, negative binomial, and hierarchical count models (Leckie et al., 2019).
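
For example, for a two-level random-intercept Poisson model with log link, $\log \lambda_{ij} = \beta_0 + u_j$ with $u_j \sim N(0, \sigma^2_u)$, one such closed form follows from the lognormal moments of $\exp(\beta_0 + u_j)$; the helper below is an illustrative sketch of that calculation, not code from the cited paper.

```python
import math

def poisson_vpc(beta0, sigma2_u):
    """Observation-scale VPC/ICC for a random-intercept Poisson model
    log(lambda_ij) = beta0 + u_j, u_j ~ N(0, sigma2_u)."""
    mu = math.exp(beta0 + sigma2_u / 2.0)             # marginal mean E[Y]
    between = mu ** 2 * (math.exp(sigma2_u) - 1.0)    # Var[E(Y | u)]
    within = mu                                       # E[Var(Y | u)] = E[lambda]
    return between / (between + within)

print(round(poisson_vpc(beta0=1.0, sigma2_u=0.5), 3))
```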

Nonparametric and modern extensions encompass:

  • Rank ICC: The Pearson correlation of mid-ranks (ridits) of intra-cluster pairs, robust to outliers and scale (Tu et al., 2023).
  • Distance-based ICC (dbICC): Replaces variance with general squared distances, yielding

$$\mathrm{dbICC} = 1 - \frac{\mathrm{MSD}_w}{\mathrm{MSD}_b},$$

applicable to non-Euclidean, high-dimensional, or graph-type data (Xu et al., 2019); a computational sketch is given after this list.

  • Bayesian Nonparametric (BNP) ICC: Under Dirichlet process random-effects mixtures, this approach defines a family of ICCs as the proportion of marginal variance explained by a mixture of latent classes, capturing latent heterogeneity (Mignemi et al., 28 Oct 2024, Mignemi et al., 2023).
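
As an illustration of the distance-based variant, the sketch below computes a dbICC-style estimate from a precomputed pairwise distance matrix and subject labels, averaging squared distances within versus between subjects; this is a simplified reading of the construction in Xu et al. (2019), with illustrative names and toy data.

```python
import numpy as np

def db_icc(dist, labels):
    """Distance-based ICC sketch: 1 - (mean squared within-subject distance)
    / (mean squared between-subject distance), for an m-by-m pairwise
    distance matrix `dist` and length-m subject labels."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    msd_within = (dist[same & off_diag] ** 2).mean()
    msd_between = (dist[~same] ** 2).mean()
    return 1.0 - msd_within / msd_between

# Toy example: 4 subjects, 3 replicate 10-dimensional measurements each,
# compared under the Euclidean distance.
rng = np.random.default_rng(1)
x = np.repeat(rng.normal(0, 2, size=(4, 10)), 3, axis=0) + rng.normal(0, 1, size=(12, 10))
labels = np.repeat(np.arange(4), 3)
dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
print(round(float(db_icc(dist, labels)), 2))
```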

2. Statistical Estimation and Implementation

Calculation of the ICC proceeds via estimation of model variance components, either through ANOVA mean squares, maximum likelihood, restricted maximum likelihood, or Bayesian/MCMC sampling for more complex or hierarchical structures.

  • Classical settings (biomedical, psychometric, imaging): ANOVA- or mixed-model-based ICCs are computed per-feature or per-voxel, and often used as explicit reliability selection criteria for feature inclusion (thresholds: ICC >0.90, "excellent" repeatability; 0.75–0.90, "good"; 0.50–0.75, "moderate"; <0.50, "poor") (Bianchini et al., 2020, Friedman et al., 2016, Zhou et al., 2022).
  • Generalized estimating equations (GEE2) for correlated outcomes use second-order moment equations to recover ICC under missing data and informative missingness, with extensions for IPW and doubly robust correction (Chen et al., 2018).
  • Bootstrap and nonparametric inference: Point estimators of ICC or dbICC are paired with cluster bootstrap schemes, bias correction for resampling artifacts (especially in high-dimensional settings), and asymptotic delta-method variance estimates for confidence intervals (Xu et al., 2019, Tu et al., 2023); a cluster-bootstrap sketch appears at the end of this subsection.
  • Bayesian implementation in BNP models uses blocked Gibbs sampling over Dirichlet processes (stick-breaking construction), with posterior moments yielding ICC distributions (mean, credible interval) (Mignemi et al., 28 Oct 2024, Mignemi et al., 2023).

Practical implementation includes joint modeling of means and pairwise correlations, careful handling of non-independence in clustered or repeated-measurement data, and simulation-based or plug-in criteria for assessing model validity and convergence.
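
The cluster bootstrap mentioned above can be sketched as follows: whole clusters are resampled with replacement (to respect within-cluster dependence), the chosen ICC estimator is recomputed on each resample, and percentile limits form the interval. The helper below is illustrative; `estimator` stands for any function mapping (values, cluster ids) to an ICC estimate, such as the ANOVA-based ones sketched earlier.

```python
import numpy as np

def cluster_bootstrap_ci(values, cluster_ids, estimator, n_boot=2000, alpha=0.05, seed=0):
    """Percentile cluster-bootstrap CI: resample whole clusters with
    replacement, recompute the statistic, and take the alpha/2 and
    1 - alpha/2 quantiles."""
    rng = np.random.default_rng(seed)
    clusters = np.unique(cluster_ids)
    stats = []
    for _ in range(n_boot):
        picked = rng.choice(clusters, size=len(clusters), replace=True)
        vals, ids = [], []
        for new_id, c in enumerate(picked):
            members = values[cluster_ids == c]
            vals.append(members)
            # Relabel so a cluster drawn twice counts as two distinct clusters.
            ids.append(np.full(len(members), new_id))
        stats.append(estimator(np.concatenate(vals), np.concatenate(ids)))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

# Usage (assuming some estimator `my_icc(values, cluster_ids)` is defined):
# lo, hi = cluster_bootstrap_ci(values, cluster_ids, my_icc)
```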

3. Applications Across Scientific Domains

ICCs play a central role in diverse formal and applied research contexts:

  • Medical imaging and radiomics: ICCs quantify test–retest repeatability for features extracted from images (MRI, CT, radiomic data) across scanners, protocols, or segmentation methods. High ICC is typically a prerequisite for feature inclusion in downstream predictive modeling for disease or structural quantification (Zhou et al., 2022, Bianchini et al., 2020, Yu et al., 2022).
  • Biometric and psychometric assessment: ICC is the gold standard for "temporal persistence" or test–retest reliability of candidate biometric markers (gait, oculomotor, behavioral features). Only highly reliable (high ICC) features are retained for robust identity recognition (Friedman et al., 2016).
  • Clustered randomized trials (CRTs): ICC directly informs design effect, power, and sample size calculations. Estimation of the ICC is vital for appropriately adjusting for intra-cluster correlation in treatment effect detection and heterogeneity analyses. A class of "ICC-ignorable" CRTs is identified where, under perfect within-cluster stratification, sample size for interaction effects does not depend on ICC value (Yang et al., 22 Apr 2025).
  • Speech and representation learning: ICC serves as a regularization and objective to maximize repeatability in deep embedding spaces, directly optimizing for discriminative, low intra-class variance representations (Zhang et al., 2023).
  • Multilevel modeling for counts and binomial data: Closed-form expressions for ICC/VPC guide interpretation of grouping effects and design decisions in educational, epidemiological, or other nested data structures, accounting for overdispersion and random coefficients (Leckie et al., 2019).
  • Graphs and network data: Generalizations such as the graphical ICC (GICC) address measurement reliability in multivariate binary graphical settings (e.g., brain connectomics) via latent multivariate probit mixed models (Yue et al., 2013).

4. Methodological Considerations and Limitations

The rigorous interpretability of the ICC depends on model choices and data structure:

  • Model specification: One-way vs two-way (random or mixed) models, absolute agreement vs consistency, and fixed vs random effects each yield different ICC flavors (Shrout–Fleiss taxonomy). Consistent reporting of the model type and calculation details is necessary for replication and cross-study comparison (Zhou et al., 2022, Friedman et al., 2016).
  • Scale and distribution: Parametric ICCs can be sensitive to skewness, heavy tails, or non-normality; nonparametric adaptations (rank ICC, discriminability) are robust alternatives (Tu et al., 2023, Wang et al., 2020).
  • Interpretive context: For moderate- or high-stakes applications (clinical tools, device approval), thresholds for “acceptably reliable” ICC are problem-specific and must be contextually justified.
  • Bias and robustness: Small sample sizes, variable cluster sizes, extreme values, or model misuse (e.g., treating ordered categories as continuous) can bias ICCs downward; explicit bias correction or resampling is advocated (Xu et al., 2019).
  • Heterogeneity: Uniform random-effects models can obscure latent multimodality (e.g., "strict" vs. "lenient" raters); BNP and mixture-model ICC variants are specifically designed to expose such structure and to support alternative summary indices of agreement and polarization (Mignemi et al., 2023, Mignemi et al., 28 Oct 2024).

5. Contemporary Extensions and Nonstandard ICCs

Recent work has generalized the ICC concept to accommodate high-dimensional, non-Euclidean, and complex data:

  • Distance-based ICC (dbICC) provides a unified framework for classical, functional, or matrix-valued data, requiring only a distance metric—not a “mean” or linear structure (Xu et al., 2019).
  • Rank-based ICC enables calculation for ordered categorical or skewed outcomes, preserving interpretation as a measure of within-cluster concordance (Tu et al., 2023).
  • Graphical ICC (GICC) extends the reproducibility assessment to binary or categorical multivariate data based on a latent probit mixed model (Yue et al., 2013).
  • Bayesian nonparametric ICC leverages Dirichlet processes to allow clustering in latent rater or subject distributions, providing posterior distributions for ICC—including distributions over cluster-specific or average-rater ICCs—and supporting the direct quantification of rating heterogeneity or polarization (Mignemi et al., 28 Oct 2024, Mignemi et al., 2023).
  • Machine learning objectives: ICC has been operationalized as a fully differentiable loss function in representation learning (e.g., as an explicit regularizer in neural network training) to drive low within-class variance and high between-class repeatability (Zhang et al., 2023); a schematic sketch follows this list.
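
To make the last point concrete, one way to write such an objective is as the batch-level fraction of embedding variance lying between classes, with $1 - \mathrm{ICC}$ as the loss. The PyTorch snippet below is an illustrative sketch of that idea under this simple formulation, not the exact loss used by Zhang et al. (2023).

```python
import torch

def icc_style_loss(embeddings, labels, eps=1e-8):
    """Schematic ICC-style loss over a batch: 1 - (between-class SS) /
    (between-class SS + within-class SS). Minimizing it drives within-class
    variance down and between-class variance up; all ops are differentiable."""
    grand_mean = embeddings.mean(dim=0)
    between = embeddings.new_zeros(())
    within = embeddings.new_zeros(())
    for c in labels.unique():
        cls = embeddings[labels == c]
        cls_mean = cls.mean(dim=0)
        between = between + cls.shape[0] * ((cls_mean - grand_mean) ** 2).sum()
        within = within + ((cls - cls_mean) ** 2).sum()
    return 1.0 - between / (between + within + eps)

# Toy batch: 16 embeddings of dimension 8 belonging to 4 classes.
emb = torch.randn(16, 8, requires_grad=True)
lab = torch.arange(4).repeat_interleave(4)
loss = icc_style_loss(emb, lab)
loss.backward()   # gradients flow back to the embeddings / upstream network
```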

6. Interpretation, Thresholds, and Empirical Utility

Interpretive standards for ICC are now codified in many domains, with widely cited guidelines such as: ICC < 0.5 ("poor"), 0.5–0.75 ("moderate"), 0.75–0.9 ("good"), ≥0.9 ("excellent") reliability. High ICC values are not sufficient alone: the model type, variance sources, and data context must be carefully considered. Empirical analyses in imaging, biometrics, and clinical trials have demonstrated that aggressive ICC-based feature selection improves downstream discriminative or predictive accuracy, reduces overfitting, and increases cross-site or cross-method reproducibility (Zhou et al., 2022, Friedman et al., 2016, Zhang et al., 2023).

Contemporary simulation studies confirm that under model mis-specification (non-Gaussianity, batch effects) nonparametric and rank-based ICCs or discriminability indices provide more stable and interpretable measures of repeatability and clustering compared to classical ANOVA-based ICCs (Tu et al., 2023, Wang et al., 2020). In genuinely clustered data where the number or type of clusters is unknown or complex (e.g., in rater studies with potential for multiple subpopulations), mixture- and BNP-based ICCs are advocated (Mignemi et al., 2023, Mignemi et al., 28 Oct 2024).

7. Practical Recommendations and Best Practices

For studies requiring reliability or clustering assessment via ICC:

  • Explicitly specify and justify the ICC type and model assumptions given the study design (one-way vs. two-way, absolute agreement vs. consistency, parametric vs. nonparametric) (Bianchini et al., 2020, Tu et al., 2023).
  • For non-Gaussian/ordinal/complex data, use rank-based or distance-based ICC estimators, and compute bootstrap confidence intervals (Tu et al., 2023, Xu et al., 2019).
  • In multilevel/clustered trial settings, carefully estimate and report the ICC, conduct sensitivity analyses for design-effect inflation, and in HTE-detection settings consider ICC-ignorable designs where appropriate (Yang et al., 22 Apr 2025, Chen et al., 2018).
  • For high-dimensional features (imaging, biometrics, radiomics), use ICC-based thresholds as a primary selection step, and document all preprocessing, model, and inferential choices (Bianchini et al., 2020, Friedman et al., 2016).
  • For raters, experts, or crowdsourced annotation, consider BNP or mixture models for ICC estimation to quantify latent rater heterogeneity or polarization (Mignemi et al., 28 Oct 2024, Mignemi et al., 2023).
  • In machine learning, consider explicit ICC regularization or loss-based optimization for embedding models requiring low intra-class variability and high class repeatability (Zhang et al., 2023).

An emerging consensus is that ICC—suitably defined, robustly estimated, and judiciously interpreted—remains a cornerstone of reliability analysis, now extensible across new data modalities, models, and computational paradigms.
