BloodHound Equivalency Test
- BloodHound Equivalency Test is a statistical method that evaluates whether two measurement methods are equivalent using a pre-specified RMS margin.
- It employs a generalized pivotal quantity approach to jointly assess mean and variance components, enhancing accuracy in small to moderate sample studies.
- Monte Carlo simulation is used to derive hypothesis tests and confidence intervals, making it a robust tool for diagnostic device comparison.
The BloodHound Equivalency Test refers to a rigorous statistical methodology for assessing whether two measurement methods are equivalent—up to a pre-specified performance margin—based on paired repeated measures data. Developed in the context of diagnostic device comparison studies, such as oximetry, this test is grounded in a generalized pivotal quantity approach that jointly evaluates both mean and variance components via a root mean square (RMS) criterion. The methodology addresses limitations of large-sample normal approximations, especially in small or moderate sample size settings, and provides procedures for hypothesis testing and confidence interval estimation for practical equivalence (Bai et al., 2019).
1. Root Mean Square Criterion and Model Framework
In diagnostic device studies, the equivalency of two methods is often evaluated by controlling the absolute difference in measurements, summarized as paired differences $d_{ij}$ for subject $i = 1, \ldots, n$ and replicate $j = 1, \ldots, n_i$. The statistical model underlying these differences is a one-factor random-effects ANOVA:

$$d_{ij} = \mu + a_i + e_{ij}, \qquad a_i \sim N(0, \sigma_a^2), \quad e_{ij} \sim N(0, \sigma_e^2),$$

with the $a_i$ and $e_{ij}$ mutually independent. Here, $\mu$ denotes the mean difference, $\sigma_a^2$ the between-subject variance, and $\sigma_e^2$ the within-subject variance. The primary performance metric is the root mean square (RMS) difference between methods:

$$\theta = \sqrt{\mu^2 + \sigma_a^2 + \sigma_e^2}.$$

This composite parameter integrates both systematic bias and total variability, matching regulatory requirements (e.g., FDA) that specify equivalence in terms of a pre-specified upper bound on $\theta$.
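As a numerical illustration (the component values below are hypothetical, not taken from the source), the RMS criterion combines squared bias with both variance components:

```python
import math

def rms_difference(mu, sigma_a2, sigma_e2):
    """Composite RMS metric: theta = sqrt(mu^2 + sigma_a^2 + sigma_e^2)."""
    return math.sqrt(mu ** 2 + sigma_a2 + sigma_e2)

# Hypothetical oximetry example: bias 1.0, between-subject variance 1.5,
# within-subject variance 1.0 (saturation units squared).
theta = rms_difference(1.0, 1.5, 1.0)
print(round(theta, 3))  # sqrt(3.5) ≈ 1.871
```

Note that a small mean bias can be offset by low variability and vice versa; the composite $\theta$ penalizes both.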
2. Hypothesis Formulation for Equivalence
Equivalency testing targets the composite RMS metric, using hypotheses of the form:

$$H_0: \theta \geq \delta \quad \text{versus} \quad H_1: \theta < \delta,$$

so that rejecting $H_0$ establishes equivalence within the margin. The threshold $\delta$ must be specified a priori based on clinical or regulatory criteria. For pulse oximetry, $\delta = 3$ (in saturation percentage points) is a typical margin based on FDA guidance.
3. Generalized Pivotal Quantity Construction
To formulate a statistically rigorous test and confidence interval, the BloodHound approach leverages generalized pivotal quantities (GPQs):
- Summarize the data with:
  - Per-subject means $\bar d_i = n_i^{-1} \sum_{j=1}^{n_i} d_{ij}$
  - Sum of squared errors $SSE = \sum_{i=1}^{n} \sum_{j=1}^{n_i} (d_{ij} - \bar d_i)^2$
- Let $N = \sum_{i=1}^{n} n_i$.
- Use Cochran's theorem to relate sums of squares to scaled chi-square distributions:
$$\frac{SSE}{\sigma_e^2} \sim \chi^2_{N-n},$$
independently of the per-subject means $\bar d_i \sim N(\mu, \sigma_a^2 + \sigma_e^2 / n_i)$.
- Define GPQs for each component:
  - $R_{\sigma_e^2} = SSE / U$, with $U \sim \chi^2_{N-n}$
  - $R_{\sigma_a^2}$, an explicit function of the between-subject variation that solves for $\sigma_a^2$ given $R_{\sigma_e^2}$
  - $R_\mu = \tilde d - Z \sqrt{\mathrm{Var}(\tilde d)}$, where $Z \sim N(0, 1)$, $\tilde d$ is the inverse-variance weighted mean of the $\bar d_i$, and its variance is evaluated at the GPQ values $R_{\sigma_a^2}$ and $R_{\sigma_e^2}$.
The generalized pivotal quantity for $\theta$ is formulated as:

$$R_\theta = \sqrt{R_\mu^2 + R_{\sigma_a^2} + R_{\sigma_e^2}}.$$

Testing $\theta < \delta$ is algebraically equivalent to testing $\theta^2 < \delta^2$, so practical implementation proceeds with the sum $R_{\theta^2} = R_\mu^2 + R_{\sigma_a^2} + R_{\sigma_e^2}$.
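A minimal sketch of one joint GPQ draw, assuming a balanced design ($n$ subjects, $m$ replicates each) so that the inverse-variance weighted mean reduces to the simple mean; the zero-truncation of the between-subject component is a common convention assumed here rather than a detail confirmed by the source, and the function name and numbers are illustrative:

```python
import numpy as np

def gpq_draw(dbar_i, sse, m, rng):
    """One joint GPQ draw (R_mu, R_sigma_a2, R_sigma_e2) for a balanced
    design: n subjects with m replicates each, given per-subject means
    dbar_i and the within-subject sum of squared errors sse."""
    n = dbar_i.size
    N = n * m
    ssb = m * np.sum((dbar_i - dbar_i.mean()) ** 2)  # between-subject SS
    u = rng.chisquare(N - n)   # Cochran: SSE / sigma_e^2 ~ chi2_{N-n}
    w = rng.chisquare(n - 1)   # SSB / (sigma_e^2 + m*sigma_a^2) ~ chi2_{n-1}
    r_e2 = sse / u                        # GPQ for sigma_e^2
    r_tot = ssb / w                       # GPQ for sigma_e^2 + m*sigma_a^2
    r_a2 = max(0.0, (r_tot - r_e2) / m)   # GPQ for sigma_a^2, truncated at 0
    r_mu = dbar_i.mean() - rng.standard_normal() * np.sqrt(r_tot / N)
    return r_mu, r_a2, r_e2

# Hypothetical summary statistics for n = 10 subjects, m = 3 replicates.
rng = np.random.default_rng(42)
dbar_i = np.array([0.8, 1.2, 0.5, 1.0, 0.9, 1.4, 0.7, 1.1, 0.6, 1.3])
r_mu, r_a2, r_e2 = gpq_draw(dbar_i, sse=18.0, m=3, rng=rng)
```

Each call produces one realization of $(R_\mu, R_{\sigma_a^2}, R_{\sigma_e^2})$; repeating the draw many times yields the Monte Carlo distribution of $R_{\theta^2}$ used in the next section.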
4. Algorithmic Procedure via Monte Carlo Simulation
The practical implementation involves Monte Carlo sampling:
- Set the number of simulations $B$ (e.g., $B = 10^4$ or larger).
- For $b = 1, \ldots, B$:
  - Simulate $U_b \sim \chi^2_{N-n}$ and the chi-square variate driving the between-subject component.
  - Compute $R_{\sigma_e^2}^{(b)} = SSE / U_b$ and $R_{\sigma_a^2}^{(b)}$.
  - Sample $Z_b \sim N(0, 1)$ to obtain $R_\mu^{(b)}$.
  - Form $R_{\theta^2}^{(b)} = \big(R_\mu^{(b)}\big)^2 + R_{\sigma_a^2}^{(b)} + R_{\sigma_e^2}^{(b)}$.
- Compute the generalized $p$-value
$$p = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\big\{ R_{\theta^2}^{(b)} \geq \delta^2 \big\},$$
rejecting $H_0$ at level $\alpha$ when $p < \alpha$.
- The two-sided $100(1 - \alpha)\%$ confidence interval for $\theta$ is obtained by sorting the draws, $R_{\theta^2}^{(1)} \leq \cdots \leq R_{\theta^2}^{(B)}$, and taking square roots of the empirical $\alpha/2$ and $1 - \alpha/2$ quantiles.
Analytically integrating over the normal variate $Z$ (Section 2.3 of Bai et al., 2019), in place of repeated normal sampling, can further enhance numerical accuracy.
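The steps above can be sketched end-to-end for a balanced design. This is a minimal illustration under assumed conventions (balanced data, zero-truncated between-subject GPQ, equal-tailed quantile interval), not the RAMgt implementation; the function name and example data are hypothetical:

```python
import numpy as np

def gpq_equivalence_test(d, delta, B=10_000, alpha=0.05, seed=0):
    """Generalized pivotal test of H0: theta >= delta for a balanced
    n-subjects-by-m-replicates matrix of paired differences d.
    Returns (p_value, two-sided CI for theta)."""
    rng = np.random.default_rng(seed)
    n, m = d.shape
    N = n * m
    dbar_i = d.mean(axis=1)                          # per-subject means
    sse = np.sum((d - dbar_i[:, None]) ** 2)         # within-subject SS
    ssb = m * np.sum((dbar_i - dbar_i.mean()) ** 2)  # between-subject SS

    u = rng.chisquare(N - n, size=B)       # drives R_{sigma_e^2}
    w = rng.chisquare(n - 1, size=B)       # drives sigma_e^2 + m*sigma_a^2
    z = rng.standard_normal(size=B)        # drives R_mu
    r_e2 = sse / u
    r_tot = ssb / w
    r_a2 = np.maximum(0.0, (r_tot - r_e2) / m)
    r_mu = dbar_i.mean() - z * np.sqrt(r_tot / N)
    r_theta2 = r_mu ** 2 + r_a2 + r_e2     # GPQ draws for theta^2

    p_value = np.mean(r_theta2 >= delta ** 2)  # generalized p-value for H0
    lo, hi = np.quantile(r_theta2, [alpha / 2, 1 - alpha / 2])
    return p_value, (np.sqrt(lo), np.sqrt(hi))

# Hypothetical study: 20 subjects, 3 replicates, modest bias and variability.
rng = np.random.default_rng(1)
d = rng.normal(0.5, 1.0, size=(20, 3)) + rng.normal(0.0, 0.8, size=(20, 1))
p, ci = gpq_equivalence_test(d, delta=3.0)
```

Vectorizing all $B$ draws at once, as above, keeps the computation to a few array operations rather than an explicit loop.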
5. Performance Margin Selection and Sensitivity Considerations
Selecting the equivalency threshold $\delta$ requires consultation with clinical guidelines, device specifications, or subject-matter experts. For pulse oximetry, the FDA frequently uses $\delta = 3$ in saturation units. Sensitivity analyses over a plausible range for $\delta$ (e.g., $2$–$4$) are recommended to contextualize conclusions, especially when margins are based on pragmatic or evolving standards.
6. Performance Characteristics and Method Comparison
Extensive simulation studies reveal that the generalized pivotal test (GT) maintains well-controlled type I error near nominal levels across balanced and unbalanced study designs, outperforming large-sample normal approximations. The score-based test is conservative, while the Wald-style test is anti-conservative and not recommended. The GT also provides substantially higher power, particularly in small-sample or stringent-alpha scenarios; for example, in one reported configuration GT achieves power of 82.1% compared to 72.1% for the score test, and in more stringent settings the difference is even more pronounced (Bai et al., 2019).
7. Software Implementation and Practical Guidelines
The BloodHound-style equivalency test is implemented in the R package RAMgt, available on GitHub and CRAN. The test requires only summary statistics—replicate counts, within-subject means, and sums of squared errors—obviating the need for full linear mixed model fits when subject-level summaries are available. Small-to-moderate numbers of subjects with moderate replicates per subject are typical. The Monte Carlo sample size $B$ can be scaled for desired precision, and batch reuse of random draws is enabled for multiple thresholds or significance levels. Pre-study sample size calculation, grounded in prior estimates of the mean difference and variance components, is advised to ensure adequate power.
| Step | Input | Output |
|---|---|---|
| Data summary | paired differences $d_{ij}$ | $n_i$, $\bar d_i$, $SSE$ |
| Monte Carlo algorithm | summary statistics, $\delta$, $B$ | $p$-value, confidence interval |
| R package use | `ng`, `mus`, `sse` | `p.value`, `ci.lower`, `ci.upper` |
8. Practical Considerations and Study Design Recommendations
The BloodHound equivalency test is particularly effective for studies with small to medium sample sizes where large-sample approximations fail. When complete subject-level data are unavailable, summary statistics suffice for inference. Computation is efficient in R for standard study sizes and simulation batch sizes. Pre-study power and sample size analyses are essential to calibrate operating characteristics to regulatory or clinical demands. The method's type I error control and statistical power make it the preferred approach for paired repeated measures equivalency testing in diagnostic device evaluation scenarios (Bai et al., 2019).