Two One-Sided Test (TOST) Procedure
- TOST is a statistical framework that establishes equivalence by testing whether observed effects lie entirely within pre-specified margins using two one-sided tests.
- It uses intersection-union logic to control the type I error and is central to bioequivalence studies, method comparisons, and adaptive trial designs.
- Extensions of TOST to multivariate, functional, and Bayesian settings support sample size planning, power adjustment, and regulatory compliance across diverse research applications.
The Two One-Sided Test (TOST) procedure is a statistical framework developed to establish practical, rather than strictly statistical, equivalence between two parameters or processes. Unlike traditional hypothesis testing, which seeks to detect differences, TOST formalizes the demonstration that an effect or difference is confined within pre-specified equivalence margins. TOST is foundational in regulatory bioequivalence studies, method comparison, multivariate and functional data settings, adaptive designs, and equivalence-focused replication studies. Its methodology is codified through intersection-union logic and underpins several extensions and modifications for specific application domains.
1. Principles and Formulation of the TOST Procedure
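The core structure of TOST reverses the typical null/alternative paradigm. The null hypothesis asserts non-equivalence, typically $H_0: \theta \le \theta_L$ or $\theta \ge \theta_U$, while the alternative claims equivalence, $H_1: \theta_L < \theta < \theta_U$. The procedure splits the composite null into two one-sided tests:
- $H_{01}: \theta \le \theta_L$ vs. $H_{11}: \theta > \theta_L$
- $H_{02}: \theta \ge \theta_U$ vs. $H_{12}: \theta < \theta_U$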
TOST declares equivalence only if both one-sided tests reject their nulls at level $\alpha$ (i.e., the observed effect is neither too low nor too high). In practice, this is usually operationalized by constructing a $100(1-2\alpha)\%$ confidence interval for the parameter and confirming that the interval lies entirely within the equivalence bounds. The procedure is fundamentally an intersection-union test (IUT): the overall rejection region is the intersection of the individual one-sided rejection regions, which ensures strong control of the type I error at $\alpha$ (Li et al., 2023).
For bioequivalence with log-transformed parameters (e.g., $\log \mathrm{AUC}$ and $\log C_{\max}$), margins are typically symmetric, such as $\pm\log(1.25)$. The method's duality with confidence intervals is exact under the “equal-tailed” condition, with each tail at significance level $\alpha$ (Li et al., 2023).
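The two one-sided tests and their confidence-interval counterpart can be sketched in a few lines. The following is a minimal illustration, assuming a two-sample comparison of means with equal variances (pooled t-test) and symmetric margins $(-\delta, \delta)$; the function name `tost_two_sample` and its arguments are hypothetical and not taken from any cited work.

```python
import numpy as np
from scipy import stats

def tost_two_sample(x, y, delta, alpha=0.05):
    """TOST for mean(x) - mean(y) against symmetric margins (-delta, delta).

    Assumption: independent samples with equal variances (pooled t-test).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    diff = x.mean() - y.mean()
    df = nx + ny - 2
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / df
    se = np.sqrt(pooled_var * (1 / nx + 1 / ny))

    # H01: theta <= -delta, rejected when (diff + delta) / se is large
    p_lower = stats.t.sf((diff + delta) / se, df)
    # H02: theta >= +delta, rejected when (diff - delta) / se is small
    p_upper = stats.t.cdf((diff - delta) / se, df)
    p_tost = max(p_lower, p_upper)

    # Equal-tailed 100(1 - 2*alpha)% confidence interval for the difference
    half_width = stats.t.ppf(1 - alpha, df) * se
    ci = (diff - half_width, diff + half_width)

    return {
        "p_tost": float(p_tost),
        "ci": (float(ci[0]), float(ci[1])),
        "equivalent_by_tests": bool(p_tost < alpha),
        "equivalent_by_ci": bool(-delta < ci[0] and ci[1] < delta),
    }

# Usage: deterministic toy data with a small shift between groups
x = np.linspace(-1.0, 1.0, 50)
y = x + 0.01
print(tost_two_sample(x, y, delta=0.5))
```

Under the equal-tailed construction used here, the two decision flags agree: the interval lies inside the margins exactly when both one-sided p-values fall below `alpha`.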
2. Statistical Properties, Error Control, and Power
TOST is theoretically grounded in IUT, controlling type I error at $\alpha$ regardless of the dependency structure among the component tests (Li et al., 2023; Ochieng, 3 Jan 2024; Boulaguiem et al., 25 Nov 2024). In univariate settings, the procedure is known to be slightly conservative, and this conservatism increases sharply for the multivariate (multiple-outcome) extension, especially as dependencies weaken or outcome variances grow (Boulaguiem et al., 25 Nov 2024). The conservatism shows up as reduced empirical power as the number of marginal tests grows.
In finite samples, direct application of marginal TOST to each outcome (as in average bioequivalence for multiple outcomes) leads to a rapid loss of power and under-rejection under the null. To mitigate this, finite-sample adjusted TOST methods have been developed. Multivariate $\alpha$-TOST, for instance, calibrates the marginal significance levels through an iterative search so that the global type I error is maintained at the nominal level, restoring power to near the theoretical optimum (Boulaguiem et al., 25 Nov 2024). Simulation and theoretical studies confirm that as the effective test level is increased (from the nominal $\alpha$ to an adjusted $\alpha^*$), empirical power improves significantly while type I error control is maintained.
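The level-calibration idea can be illustrated in a deliberately simplified setting. The sketch below assumes a single outcome, a normally distributed estimator with known standard error, and symmetric margins $\pm\delta$; it is not the multivariate algorithm of the cited work, and the function names are hypothetical. It computes the rejection probability of TOST at the boundary $\theta = \delta$ and then searches for the adjusted level at which that probability equals the nominal $\alpha$.

```python
from scipy.optimize import brentq
from scipy.stats import norm

def tost_size(level, delta, se):
    """Rejection probability of TOST at the boundary theta = delta.

    Assumption: theta_hat ~ N(theta, se^2) with known se. TOST rejects when
    -delta + z*se < theta_hat < delta - z*se, where z = Phi^{-1}(1 - level).
    """
    z = norm.ppf(1 - level)
    prob = norm.cdf(-z) - norm.cdf(z - 2 * delta / se)
    return max(prob, 0.0)  # the rejection region is empty when prob < 0

def calibrate_level(alpha, delta, se):
    """Find the adjusted level (>= alpha) whose boundary size equals alpha.

    Assumes delta / se is large enough that a solution exists below 0.5 and
    small enough that the unadjusted test is measurably conservative.
    """
    return brentq(lambda a: tost_size(a, delta, se) - alpha, alpha, 0.5 - 1e-9)

# Usage: margin of 0.2 with standard error 0.1
alpha, delta, se = 0.05, 0.2, 0.1
adjusted = calibrate_level(alpha, delta, se)
print("size at nominal level: ", tost_size(alpha, delta, se))
print("adjusted level:        ", adjusted)
print("size at adjusted level:", tost_size(adjusted, delta, se))
```

In this toy setting the size at the nominal level falls below $\alpha$, which is the conservatism described above, and the root search raises the level until the boundary size matches $\alpha$.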
3. TOST Extensions: Functional and Adaptive Designs
Functional Data
For functional responses, TOST extends from scalar summaries to testing entire difference functions (e.g., $\delta(t)$) over a domain $\mathcal{T}$. The hypotheses are formulated pointwise, and the global null of non-equivalence is rejected only if equivalence is established at every $t \in \mathcal{T}$. Operationalization employs nonparametric bootstrap resampling to construct pointwise one-sided confidence intervals over a finely discretized grid. For each $t$, two checks are made:
- Is $\delta(t)$ within the equivalence band for location?
- Is the variance ratio at $t$ within its own equivalence band?
This pointwise requirement makes the test stringent: a single violation anywhere on $\mathcal{T}$ prevents a declaration of equivalence. Unique challenges include grid selection, handling autocorrelation, and designing bootstrap schemes for paired or hierarchical data (Fogarty et al., 2014).
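A minimal sketch of the location check is given below. It assumes paired difference curves observed on a common grid, a constant band $\pm$`band`, and percentile bootstrap bounds obtained by resampling whole curves; the variance-ratio check is omitted, and the function name is hypothetical rather than taken from the cited work.

```python
import numpy as np

def functional_tost_location(diff_curves, band, alpha=0.05, n_boot=1000, seed=0):
    """Pointwise equivalence check for the mean difference curve.

    diff_curves: array of shape (n_subjects, n_grid), one paired
    difference curve per subject, all evaluated on the same grid.
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(diff_curves, float)
    n = d.shape[0]

    # Resample subjects (entire curves) to preserve within-curve dependence
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = d[idx].mean(axis=1)          # shape (n_boot, n_grid)

    # One-sided percentile bounds at every grid point
    lower = np.quantile(boot_means, alpha, axis=0)
    upper = np.quantile(boot_means, 1 - alpha, axis=0)

    pointwise_ok = (lower > -band) & (upper < band)
    # Equivalence is declared only if every grid point passes
    return bool(pointwise_ok.all()), pointwise_ok

# Usage with simulated curves (60 subjects, 20 grid points)
rng = np.random.default_rng(1)
curves = rng.normal(0.0, 0.5, size=(60, 20))
print(functional_tost_location(curves, band=0.5)[0])
print(functional_tost_location(curves + 1.0, band=0.5)[0])
```

Resampling whole curves rather than individual grid values is one simple way to respect the autocorrelation along the domain; hierarchical designs would need a scheme that resamples at the appropriate level.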
Adaptive and Replication Designs
In modern clinical research, adaptive two-stage TOST has been incorporated to permit interim futility/efficacy stopping, sample size reassessment, and familywise error control over multiple endpoints (Østerdal et al., 2022). For confirmatory trials, combination tests aggregate stage-wise p-values for each one-sided hypothesis, and “decision freezing” ensures valid inference when interim boundaries are crossed.
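One common way to aggregate stage-wise p-values is the inverse-normal combination function. The source does not specify which combination function is used, so the sketch below is an assumption for illustration: it combines the two stages separately for each one-sided hypothesis with pre-specified weights and declares equivalence only if both combined p-values are at most $\alpha$. Interim stopping boundaries and sample size reassessment are left out, and all names are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

def inverse_normal_combination(p1, p2, w1=0.5):
    """Combine two stage-wise one-sided p-values.

    w1 is the pre-specified weight of stage 1 (stage 2 gets 1 - w1);
    the weights must be fixed before the interim data are seen.
    """
    w2 = 1.0 - w1
    z = sqrt(w1) * norm.isf(p1) + sqrt(w2) * norm.isf(p2)
    return float(norm.sf(z))

def two_stage_tost(p_lower, p_upper, alpha=0.05, w1=0.5):
    """p_lower and p_upper are (stage 1, stage 2) p-values for H01 and H02."""
    c_lower = inverse_normal_combination(*p_lower, w1=w1)
    c_upper = inverse_normal_combination(*p_upper, w1=w1)
    # Intersection-union logic: both combined tests must reject
    return max(c_lower, c_upper) <= alpha, (c_lower, c_upper)

# Usage: stage-wise p-values for the lower and upper one-sided tests
decision, combined = two_stage_tost(p_lower=(0.04, 0.03), p_upper=(0.20, 0.01))
print(decision, combined)
```

Because the intersection-union rule is applied to the combined p-values, a weak second stage for either one-sided hypothesis is enough to block a declaration of equivalence.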
In the context of replication studies, the sceptical TOST adapts reverse-Bayes methodology, using sceptical prior integration to aggregate evidence across original and replication data. This increases project power and allows flexible sample size recalculation, even when the original study yields non-significant results (Micheloud et al., 2022).
4. Implementation for Power, Sample Size, and Multiple Testing
Sample size calculations in TOST contexts must account for non-normality of test statistics in small samples. Noniterative formulas augment normal-approximation-based sample size estimates with corrections for small-sample variance bias and the t-distribution’s heavier tails (Tang, 2018). For equivalence or bioequivalence trials, these corrections ensure that type I error and actual power more closely achieve target specifications, even with covariate adjustment (ANCOVA or MMRM).
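For orientation, the normal-approximation baseline that such corrections start from can be written down directly. The sketch below assumes two parallel groups of equal size, a known common standard deviation, symmetric margins $\pm\delta$, and a true difference of zero; it does not reproduce the small-sample corrections of the cited work, and the function names are hypothetical.

```python
from math import ceil, sqrt
from scipy.stats import norm

def tost_power_normal(n, delta, sigma, theta=0.0, alpha=0.05):
    """Normal-approximation power of TOST with n subjects per group."""
    se = sigma * sqrt(2.0 / n)
    z = norm.ppf(1 - alpha)
    power = norm.cdf((delta - theta) / se - z) + norm.cdf((delta + theta) / se - z) - 1
    return max(float(power), 0.0)

def tost_n_per_group(delta, sigma, alpha=0.05, target_power=0.80):
    """Closed-form n per group, assuming the true difference is zero.

    Solves 2 * Phi(delta / se - z_{1-alpha}) - 1 = target_power for n.
    """
    beta = 1.0 - target_power
    z_sum = norm.ppf(1 - alpha) + norm.ppf(1 - beta / 2)
    return ceil(2.0 * sigma**2 * z_sum**2 / delta**2)

# Usage: margin of half a standard deviation
n = tost_n_per_group(delta=0.5, sigma=1.0)
print(n, tost_power_normal(n, delta=0.5, sigma=1.0))
```

Because this baseline treats the variance as known, it tends to understate the required sample size when the variance is estimated from few observations, which is the gap the t-based corrections described above are meant to close.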
The design and interpretation of power in TOST is nontrivial in bounded discrete spaces or with composite nulls. Recent work has established randomized $p$-value methodologies (e.g., RAND2) to restore uniformity of the $p$-value distribution under the null. In large-scale or high-dimensional multiple testing, adaptive Bonferroni or FDR-based procedures leverage these properties to provide less conservative control of familywise error and more accurate estimation of the proportion of true nulls (Ochieng, 3 Jan 2024; Ochieng, 25 Jul 2025).
5. Bayesian TOST and Alternative Evidence Measures
Bayesian analogues of TOST operate by computing posterior probabilities that the parameter falls within the equivalence region. With suitable prior choices (e.g., noninformative Beta or Normal), the Bayesian posterior “p-value” can be less conservative and more powerful than the frequentist TOST, provided the prior variance is not excessively large. For each one-sided hypothesis, the Bayesian evidence measure is the posterior mass in the corresponding non-equivalence region, and the overall test is defined analogously to the frequentist TOST by combining the two one-sided measures (Ochieng, 25 Jul 2025).
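A conjugate Normal model gives a compact illustration of these quantities. The sketch below assumes a normally distributed estimate with known standard error, a Normal prior, and symmetric margins $\pm\delta$; taking the larger of the two one-sided posterior masses as the overall evidence measure is an assumption made here by analogy with the frequentist rule, not a formula taken from the cited work. All names are hypothetical.

```python
from scipy.stats import norm

def bayes_tost_normal(estimate, se, delta, prior_mean=0.0, prior_sd=10.0, alpha=0.05):
    """Posterior evidence for equivalence under a Normal-Normal model."""
    # Conjugate update: precisions add, means are precision-weighted
    prior_prec, data_prec = 1.0 / prior_sd**2, 1.0 / se**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    post_sd = post_var**0.5

    p_low = float(norm.cdf(-delta, loc=post_mean, scale=post_sd))  # P(theta <= -delta | data)
    p_high = float(norm.sf(delta, loc=post_mean, scale=post_sd))   # P(theta >= +delta | data)
    evidence = max(p_low, p_high)  # assumed combination rule (see lead-in)

    return {
        "p_low": p_low,
        "p_high": p_high,
        "prob_equivalent": 1.0 - p_low - p_high,
        "equivalent": evidence <= alpha,
    }

# Usage: estimate near zero versus estimate on the margin
print(bayes_tost_normal(estimate=0.0, se=0.1, delta=0.3))
print(bayes_tost_normal(estimate=0.3, se=0.1, delta=0.3))
```

Shrinking `prior_sd` pulls the posterior toward `prior_mean`, so the decision can be sensitive to the prior when the data are weak.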
Simulation studies demonstrate that with diffuse priors and wide margins, the Bayesian TOST achieves higher power with type I error under control; as the prior becomes concentrated or margins narrow, Bayesian and frequentist TOSTs converge in their properties. In the multiple testing context, the power of the Bayesian version under an FDR framework is typically lower, but the disparity diminishes with larger equivalence margins or less informative priors.
6. Regulatory Context, Confidence Interval Duality, and Practical Implications
TOST is embedded in regulatory guidance for bioequivalence, notably via the FDA and EMA, which require a 90% confidence interval (for $\alpha = 0.05$) for the geometric mean ratio of PK parameters (typically AUC and $C_{\max}$, analyzed on the log scale) to fall within $[0.80, 1.25]$. This interval-based equivalence test is algebraically equivalent to the two one-sided tests if and only if the confidence interval is “equal-tailed,” i.e., uncertainty is split evenly across both tails (Li et al., 2023).
Deviation from equal tails or from symmetry in log-margins invalidates this equivalence, necessitating careful construction of test statistics and intervals. In multivariate or high-dimensional settings, practitioners must be vigilant for excessive conservatism in marginal TOST, and the use of finite-sample corrected methods is recommended to restore operating characteristics.
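The interval criterion on the ratio scale can be sketched as follows. This is a simplified illustration that assumes paired test and reference measurements on the same subjects and analyzes their log differences with a one-sample t interval; actual crossover analyses also model period and sequence effects, which are ignored here. The function names and the example data are hypothetical.

```python
import numpy as np
from scipy import stats

def gmr_confidence_interval(test, reference, alpha=0.05):
    """Equal-tailed 100(1 - 2*alpha)% CI for the geometric mean ratio."""
    d = np.log(np.asarray(test, float)) - np.log(np.asarray(reference, float))
    n = d.size
    se = d.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha, n - 1)
    # Build the interval on the log scale, then back-transform
    gmr = float(np.exp(d.mean()))
    ci = (float(np.exp(d.mean() - t_crit * se)), float(np.exp(d.mean() + t_crit * se)))
    return gmr, ci

def is_bioequivalent(ci, limits=(0.80, 1.25)):
    """True when the whole interval lies within the acceptance limits."""
    return bool(limits[0] <= ci[0] and ci[1] <= limits[1])

# Usage with made-up exposure values for eight subjects
reference = np.array([100, 120, 90, 110, 105, 95, 115, 100], float)
test = reference * np.array([1.02, 0.98, 1.05, 0.97, 1.01, 1.03, 0.99, 1.00])
gmr, ci = gmr_confidence_interval(test, reference)
print(gmr, ci, is_bioequivalent(ci))
```

With `alpha=0.05` the interval is the 90% interval referred to above, and because it is equal-tailed on the log scale, checking it against the limits matches running the two one-sided t-tests at the 5% level.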
The strictness of TOST in both functional and multivariate domains ensures strong control of false equivalence but comes at the cost of reduced power; newer methods and adaptive designs partially redress this imbalance while adhering to regulatory and statistical rigor.
Summary Table: TOST Applications, Extensions, and Limitations
Context/Extension | Adjusted Methods/Key Features | Limitation Addressed |
---|---|---|
Functional Equivalence | Pointwise TOST; bootstrap CIs over grid | Retains function structure; high stringency |
Multivariate Equivalence | Multivariate $\alpha$-TOST (iterative calibration) | Conservatism, power loss for many outcomes |
Adaptive/Replication | Stagewise combination tests; sceptical TOST | Type I error control, increased power |
Bayesian TOST | Posterior probability-based decision | Sensitivity to prior, power vs. conservatism |
Multiple Testing | RAND2 p-values; adaptive Bonferroni | Discreteness, control of familywise error |
TOST is thus a versatile and rigorously validated framework for equivalence assessment, with theoretical consistency and practical adaptations across scalar, functional, multivariate, and adaptive designs. It forms the backbone of regulatory equivalence testing and continues to evolve with methodological innovations.