Bayesian TOST: Equivalence Testing
- Bayesian TOST is a framework for testing equivalence by assessing whether a parameter lies within a pre-specified region of practical equivalence (ROPE) rather than contrasting it with a null value.
- It employs methodologies such as constrained priors, Bayes factors, and direct posterior probability evaluation to compare evidence between equivalence and nonequivalence models across parametric, nonparametric, and multivariate contexts.
- The approach offers rigorous error control and enhanced interpretability, making it applicable in clinical bioequivalence, A/B testing, and functional data analysis while being sensitive to prior specifications.
Bayesian TOST (Two One-Sided Tests) is the family of Bayesian analogues to the classical frequentist procedure for equivalence testing, in which the goal is to establish that a parameter (e.g., effect size, mean difference) resides within a pre-specified region of practical equivalence (ROPE), rather than simply demonstrating that it differs from a null value. Bayesian TOST methods formalize probabilistic evidence for equivalence by adapting priors, posteriors, and Bayes factors to encode and test interval hypotheses. This yields a more transparent and interpretable quantification of equivalence and adapts flexibly to parametric, nonparametric, functional, and multivariate contexts.
1. Principles and Hypotheses in Bayesian TOST
In classical TOST, equivalence is declared if both one-sided null hypotheses at the equivalence margins are rejected. In Bayesian TOST, the hypotheses are reformulated as

$$H_0: \theta \notin (-\delta, \delta) \quad \text{vs.} \quad H_1: \theta \in (-\delta, \delta),$$

where $\delta > 0$ is the prespecified margin. Bayesian approaches assess support for $H_1$ by
- explicitly constraining priors or posteriors to the equivalence region,
- computing Bayes factors between equivalence and non-equivalence models,
- or directly evaluating the posterior probability that $\theta$ (or an infinite-dimensional parameter) lies in the ROPE.
This contrasts with classical TOST's reliance on dual one-sided p-values.
In practical terms, one either modifies the modeling structure so prior mass is assigned only to the equivalence or non-equivalence regions or, equivalently, calculates the posterior probability

$$P(\theta \in (-\delta, \delta) \mid \text{data}).$$

A threshold $\gamma$ (e.g., 0.95 or 0.99) is used to declare equivalence. Alternatively, one may compute the Bayes factor contrasting $H_1$ to $H_0$, i.e., the ratio of the marginal likelihoods of the two hypotheses (0906.4032).
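As an illustrative sketch (not taken from the cited papers), the posterior probability of the ROPE has a closed form whenever the posterior for the effect is approximately normal; the function name and numbers below are hypothetical:

```python
import math

def posterior_prob_equivalence(post_mean, post_sd, delta):
    """P(theta in (-delta, delta) | data) when the posterior for
    theta is N(post_mean, post_sd^2)."""
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return phi((delta - post_mean) / post_sd) - phi((-delta - post_mean) / post_sd)

# Posterior mean difference 0.05 with posterior sd 0.1, margin delta = 0.3:
p = posterior_prob_equivalence(0.05, 0.1, 0.3)
print(p > 0.95)  # True: most posterior mass lies inside the ROPE
```

The decision rule then reduces to comparing this probability with the chosen threshold.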
2. Bayesian TOST Methodologies: Parametric and Nonparametric
Parametric Bayesian TOST
The Bayesian t-test is commonly generalized for TOST by ensuring that the likelihood (and/or prior) only supports the equivalence region. For models in the exponential family with conjugate priors, as in

$$p(x \mid \theta) = h(x)\, \exp\{\eta(\theta)^\top T(x) - A(\theta)\}$$

and conjugate prior

$$\pi(\theta) \propto \exp\{\eta(\theta)^\top \tau - n_0 A(\theta)\},$$

the marginal likelihoods are integrated only over the equivalence region $\Theta_1 = \{\theta : |\theta| < \delta\}$ and its complement $\Theta_0$. The Bayes factor for equivalence is then the ratio of these marginal likelihoods (0906.4032):

$$BF_{10} = \frac{\int_{\Theta_1} p(x \mid \theta)\, \pi_1(\theta)\, d\theta}{\int_{\Theta_0} p(x \mid \theta)\, \pi_0(\theta)\, d\theta},$$

where $\pi_1$ and $\pi_0$ denote the prior restricted (and renormalized) to $\Theta_1$ and $\Theta_0$, and equivalence is inferred if this exceeds a chosen threshold (e.g., $BF_{10} > 3$ or $BF_{10} > 10$).

Alternatively, the posterior probability of equivalence can be computed directly as $P(\theta \in \Theta_1 \mid x)$.
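A minimal numerical sketch of such a region-restricted Bayes factor, assuming a normal likelihood for the observed mean and uniform priors restricted to each region (all names, margins, and values are hypothetical):

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def restricted_marginal(xbar, se, lo, hi, n_grid=2000):
    """Marginal likelihood of xbar ~ N(theta, se^2) under a uniform
    prior for theta on [lo, hi], via trapezoid integration."""
    h = (hi - lo) / n_grid
    total = 0.0
    for i in range(n_grid + 1):
        theta = lo + i * h
        weight = 0.5 if i in (0, n_grid) else 1.0
        total += weight * normal_pdf(xbar, theta, se)
    return total * h / (hi - lo)

def equivalence_bf(xbar, se, delta, outer=2.0):
    """BF_10 for |theta| < delta, with the non-equivalence prior spread
    uniformly over the two wings out to +/- outer."""
    m1 = restricted_marginal(xbar, se, -delta, delta)
    m0 = 0.5 * (restricted_marginal(xbar, se, -outer, -delta)
                + restricted_marginal(xbar, se, delta, outer))
    return m1 / m0

bf = equivalence_bf(xbar=0.02, se=0.05, delta=0.2)
print(bf > 10)  # True here: strong evidence for equivalence
```

In conjugate models the two marginal likelihoods are available in closed form; the grid integration above simply makes the restricted-prior construction explicit.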
Nonparametric Bayesian TOST
Nonparametric approaches (notably Dirichlet process mixtures, DPM) generalize the prior to the data-generating distribution itself. Here, modeling proceeds via density estimation of the unknown data-generating distribution $F$, leading to posterior draws over $F$, e.g.,

$$G \sim \mathrm{DP}(\alpha, G_0), \qquad x_i \mid G \overset{iid}{\sim} \int k(\cdot \mid \phi)\, dG(\phi),$$

with decision rules as above, but applied to arbitrary measurable functionals $\psi(F)$ (e.g., mean difference, quantile difference). The posterior over $\psi(F)$ is then directly estimated (e.g., via posterior sampling), and probabilistic evidence for $\psi(F) \in (-\delta, \delta)$ is determined (0906.4032).
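A lightweight sketch of this idea uses the Bayesian bootstrap (the non-informative limit of a Dirichlet process posterior) for the mean-difference functional; the function name and data are hypothetical:

```python
import random

def bb_equivalence_prob(x, y, delta, n_draws=2000, seed=0):
    """Posterior P(|mean(x) - mean(y)| < delta) via the Bayesian
    bootstrap: Dirichlet(1,...,1) weights drawn as normalized
    exponentials, one weighted mean per posterior draw."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        wx = [rng.expovariate(1.0) for _ in x]
        wy = [rng.expovariate(1.0) for _ in y]
        mx = sum(w * v for w, v in zip(wx, x)) / sum(wx)
        my = sum(w * v for w, v in zip(wy, y)) / sum(wy)
        hits += abs(mx - my) < delta
    return hits / n_draws

x = [0.0, 0.1, -0.1, 0.05, -0.05] * 4   # two similar samples
y = [0.02, 0.12, -0.08, 0.07, -0.03] * 4
print(bb_equivalence_prob(x, y, delta=0.3))  # close to 1
```

A full DPM would smooth the posterior draws with a kernel, but the functional-evaluation step is the same.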
3. Extensions to Functional and Multivariate Data
Functional Data
For equivalence between functions, Bayesian TOST requires that an entire function-valued difference (e.g., $d(t) = f_1(t) - f_2(t)$) lie within a prespecified band $(-\delta(t), \delta(t))$ for all $t$. Gaussian process priors are placed on $d$, sometimes structured as mixtures to ensure minimal prior support for equivalence, so that only strong empirical evidence can shift the posterior into the equivalence region (Fogarty et al., 2014).

The posterior probability is then

$$P\big(\,|d(t)| < \delta(t) \ \text{for all } t \mid \text{data}\,\big),$$

and equivalence is established if this exceeds a chosen threshold $\gamma$ (commonly 0.95).
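The band-containment check can be sketched in pure Python: draw posterior sample paths of the difference from a (hypothetical) Gaussian posterior on a grid and count the fraction that stay inside the band throughout. The kernel, grid, and numbers are illustrative, not from Fogarty et al.:

```python
import math, random

def rbf_cov(ts, length=0.3, var=0.01):
    """Squared-exponential covariance on a grid, with a small jitter."""
    return [[var * math.exp(-((a - b) ** 2) / (2 * length ** 2))
             + (1e-9 if i == j else 0.0)
             for j, b in enumerate(ts)] for i, a in enumerate(ts)]

def cholesky(A):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (math.sqrt(A[i][i] - s) if i == j
                       else (A[i][j] - s) / L[j][j])
    return L

def band_containment_prob(mean, cov, delta, n_draws=1000, seed=1):
    """Posterior P(|d(t)| < delta at every grid point t)."""
    L = cholesky(cov)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in mean]
        d = [mean[i] + sum(L[i][j] * z[j] for j in range(i + 1))
             for i in range(len(mean))]  # one correlated sample path
        hits += all(abs(v) < delta for v in d)
    return hits / n_draws

ts = [i / 10 for i in range(11)]
p = band_containment_prob([0.0] * 11, rbf_cov(ts), delta=0.5)
print(p)  # near 1: pointwise sd 0.1 sits far inside the +/- 0.5 band
```

Note that the simultaneous ("for all $t$") probability is generally smaller than any pointwise probability, which is exactly what this sampling approach captures.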
Multivariate Extensions
In multivariate settings, the TOST is applied componentwise, but the joint test is typically overly conservative due to the intersection-union principle. Finite-sample adjustments (such as the multivariate $\alpha$-TOST) recalibrate the nominal level $\alpha$ to achieve exact global size, increasing power while maintaining type I error control (Boulaguiem et al., 25 Nov 2024). In Bayesian multivariate settings, analogues can be constructed by assessing the joint posterior probability that all components lie within their respective equivalence bounds.
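In the Bayesian variant, the joint criterion reduces to a single probability estimated from posterior draws; a sketch with a hypothetical correlated bivariate posterior:

```python
import random

def joint_equivalence_prob(draws, deltas):
    """Joint posterior P(|theta_k| < delta_k for every component k),
    estimated from posterior draws (tuples of component effects)."""
    inside = sum(all(abs(t) < d for t, d in zip(draw, deltas))
                 for draw in draws)
    return inside / len(draws)

# Hypothetical bivariate posterior for two endpoints, correlated
# through a shared latent factor z:
rng = random.Random(0)
draws = []
for _ in range(5000):
    z = rng.gauss(0.0, 1.0)
    draws.append((0.05 + 0.10 * z,
                  -0.02 + 0.08 * z + 0.05 * rng.gauss(0.0, 1.0)))
print(joint_equivalence_prob(draws, deltas=(0.3, 0.3)))  # roughly 0.99
```

Because the joint probability accounts for the posterior correlation between components, no intersection-union conservatism adjustment is needed on the Bayesian side.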
4. Calibration, Operating Characteristics, and Comparison to Frequentist TOST
Calibration of Bayesian TOST procedures is critical to ensure error rates are controlled analogously to their frequentist counterparts. Extensive simulation studies show that if procedures (e.g., Bayesian HDI-ROPE, Bayes factor interval null, classical TOST) are calibrated to have the same maximal type I error rate at the equivalence margin, their power becomes essentially indistinguishable and their decision boundaries are almost identical (Campbell et al., 2021). This operational equivalence highlights that methodological and philosophical considerations, rather than empirical performance, drive the choice of inferential paradigm.
A widely used Bayesian TOST operating characteristic is the posterior mass percentage (PMP) inside the ROPE, which forms a natural basis for decision rules analogous to frequentist control of type I/II errors.
Bayesian TOST can be more or less conservative than its frequentist counterpart, depending on the informativeness of the prior, particularly for small-sample or low-power scenarios (Ochieng, 25 Jul 2025).
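The calibration step can be sketched by simulation: generate datasets at the equivalence margin (the worst case for type I error) and pick the posterior-probability threshold whose rejection rate there equals the desired α. The setup is hypothetical and assumes a flat-prior normal posterior:

```python
import math, random

def calibrate_threshold(se, delta, alpha=0.05, n_sim=4000, seed=0):
    """Posterior-probability threshold whose type I error rate at the
    boundary (true theta = delta) is alpha, found by simulation."""
    rng = random.Random(seed)
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    probs = []
    for _ in range(n_sim):
        xbar = delta + se * rng.gauss(0.0, 1.0)   # dataset at the margin
        probs.append(phi((delta - xbar) / se) - phi((-delta - xbar) / se))
    probs.sort()
    # exceeded by exactly an alpha fraction of boundary datasets
    return probs[int((1 - alpha) * n_sim)]

t = calibrate_threshold(se=0.1, delta=0.3, alpha=0.05)
print(t)  # close to 0.95 in this setting
```

That the calibrated threshold lands near $1 - \alpha$ here is consistent with the operational equivalence to classical TOST noted above.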
5. Practical Implementation and Computational Strategies
- Marginal Likelihoods: For exponential family or t-models, marginal likelihoods can often be evaluated in closed or semi-closed form for both constrained (equivalence) and complementary (non-equivalence) models.
- Posterior Sampling: In nonparametric and more complex parametric models, Markov chain Monte Carlo or other posterior sampling methods are used to estimate the posterior over effect-size summaries.
- Threshold Selection: Posterior probabilities or Bayes factors are compared to prespecified thresholds, the selection of which should be motivated by desired error rates or decision-theoretic principles.
- Priors: Priors must be elicited or constructed to assign minimal mass to the equivalence region unless strong prior information dictates otherwise (to avoid anti-conservatism); explicit mixture formulations or non-local priors can be used for sensitivity.
6. Advantages and Limitations
Advantages:
- Quantifies evidence or probability for practical equivalence, providing more interpretability than binary p-value thresholds.
- Admits flexible incorporation of prior information, including domain expertise or historical data, which can be especially critical in small-sample or high-variance contexts.
- Nonparametric models accommodate complex, non-Gaussian, or multimodal data-generating processes.
- Extensions to function-valued parameters, multivariate endpoints, and grouped hypotheses are natural and rigorous.
Limitations:
- Requires specification of equivalence margins, which is inherently subjective and context-dependent.
- Computational complexity can be substantial, especially for fully nonparametric or function-valued models.
- Sensitive to prior choices, particularly in finite samples or in models with weak identification.
7. Applications and Impact Across Domains
Bayesian TOST methods have been applied in clinical bioequivalence studies, A/B testing in business and healthcare, assessment of measurement systems, method comparison in functional data, and multivariate drug evaluation (0906.4032; Fogarty et al., 2014; Gronau et al., 2019; Boulaguiem et al., 25 Nov 2024; Ochieng, 25 Jul 2025). They have enabled more rigorous quantification of evidence for similarity, informed regulatory decision-making, and transparent integration of prior knowledge. Simulation studies and real-data applications consistently demonstrate their effective error control and enhanced interpretability relative to both uncalibrated frequentist approaches and non-interval Bayesian tests.
Bayesian TOST thus constitutes a principled framework for equivalence testing, providing theoretically sound and practically robust tools for establishing similarity under uncertainty in a broad range of settings.