Efficient estimation of relative risk, odds ratio and their logarithms for rare events

Published 5 Apr 2026 in stat.ME and math.ST | (2604.04278v1)

Abstract: Sequential estimators are proposed for the relative risk, odds ratio, log relative risk or log odds ratio of a dichotomous attribute in two populations. The estimators take the same number of observations from each population, and guarantee that the relative mean-square error for the relative risk or odds ratio, or the mean-square error for their logarithmic versions, is less than a given target. The efficiency of the estimators, defined in terms of the Cramér-Rao bound, is high when the considered attribute is rare or moderately rare.


Authors (1)

Luis Mendo

Summary

The paper introduces a sequential estimation framework that guarantees unbiased estimators for RR, OR, LRR, and LOR in rare event regimes.
The paper develops Bernoulli factory-style algorithms with inverse binomial sampling to precisely control the mean-square error in parameter estimation.
The paper demonstrates that its methodology achieves efficiency nearing the Cramér–Rao lower bound, ensuring minimal sample waste even under strict accuracy demands.

Efficient Estimation of Relative Risk, Odds Ratio, and Their Logarithms for Rare Events

Introduction

This paper addresses unbiased, high-efficiency estimation of the relative risk (RR), odds ratio (OR), log relative risk (LRR), and log odds ratio (LOR) for a dichotomous attribute in two populations, focusing on the rare event regime. The author proposes a sequential estimation methodology in which observations are taken in pairs, one from each population, such that the mean-square error (MSE) for the logarithmic parameters, or the relative MSE for the non-logarithmic parameters, is guaranteed to stay below a given target. Special attention is given to scenarios with small Bernoulli probabilities $p_1, p_2$, which are common in biomedical and social science applications.

Sequential Estimation Approach

The methodology constructs a composite Bernoulli variable from paired samples in the two populations, such that estimation of the parameter of interest (RR, OR, etc.) is reduced to estimation of the odds or log-odds on a suitably transformed Bernoulli sequence. Specifically:

For RR and LRR, the transformation defines $p = p_1/(p_1 + p_2)$.
For OR and LOR, $p = p_1(1-p_2)/[p_1(1-p_2)+p_2(1-p_1)]$.
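
As a quick check (a routine calculation, not quoted from the paper), the odds of the composite variable equal the target parameter in each case:

$$\frac{p}{1-p} = \frac{p_1}{p_2} = \mathrm{RR} \quad \text{for } p = \frac{p_1}{p_1+p_2},$$

$$\frac{p}{1-p} = \frac{p_1(1-p_2)}{p_2(1-p_1)} = \mathrm{OR} \quad \text{for } p = \frac{p_1(1-p_2)}{p_1(1-p_2)+p_2(1-p_1)}.$$

Taking logarithms, the log-odds of $p$ equals LRR in the first case and LOR in the second.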

To generate samples from this composite Bernoulli, the paper introduces two Bernoulli factory-style algorithms that consume paired observations until a success in one of the populations terminates the process. This forms the inner loop of the estimation framework.
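
The paper's two algorithms are not reproduced in this summary. As a minimal sketch of the idea, assuming each population can be sampled as an independent Bernoulli stream, the Python functions below (hypothetical names `composite_or` and `composite_rr`) show one simple way to produce a bit with success probability $p_1(1-p_2)/[p_1(1-p_2)+p_2(1-p_1)]$ or $p_1/(p_1+p_2)$ from paired observations. The RR variant relies on an auxiliary fair coin, which is an assumption of this sketch and may differ from the paper's construction.

```python
import random

def composite_or(p1, p2, rng):
    """Bit with success probability p1(1-p2) / [p1(1-p2) + p2(1-p1)].

    Returns (bit, number of pairs consumed).
    """
    pairs = 0
    while True:
        x1 = rng.random() < p1  # simulated observation from population 1
        x2 = rng.random() < p2  # simulated observation from population 2
        pairs += 1
        if x1 != x2:  # discordant pair: exactly one success
            return int(x1), pairs

def composite_rr(p1, p2, rng):
    """Bit with success probability p1 / (p1 + p2).

    Uses an auxiliary fair coin (assumption of this sketch) to choose
    which observation of the pair is inspected.
    Returns (bit, number of pairs consumed).
    """
    pairs = 0
    while True:
        x1 = rng.random() < p1
        x2 = rng.random() < p2
        pairs += 1
        inspect_first = rng.random() < 0.5
        if inspect_first and x1:
            return 1, pairs  # happens with probability p1/2 per pair
        if not inspect_first and x2:
            return 0, pairs  # happens with probability p2/2 per pair

# Illustrative values only, not taken from the paper.
rng = random.Random(0)
p1, p2, n = 0.05, 0.02, 20000
freq_rr = sum(composite_rr(p1, p2, rng)[0] for _ in range(n)) / n
freq_or = sum(composite_or(p1, p2, rng)[0] for _ in range(n)) / n
target_rr = p1 / (p1 + p2)
target_or = p1 * (1 - p2) / (p1 * (1 - p2) + p2 * (1 - p1))
print(freq_rr, target_rr)
print(freq_or, target_or)
```

Both loops stop only at a pair that contains at least one success, so the expected number of pairs per output bit grows as the events become rarer.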

On top of this, the outer estimation loop applies inverse binomial sampling (IBS) to the composite Bernoulli sequence, running it twice: once until a set target number of successes is reached, and once until a set target number of failures is reached. The estimator is built from the ratio of the resulting sample sizes, and because observations are always taken in pairs, both populations contribute the same number of observations.
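
To make the two-run idea concrete, here is a minimal Python sketch, assuming a generic Bernoulli source `draw` with success probability $p$ in place of the composite sequence. The function names and the specific combination of the two sample sizes are illustrative and may differ from the paper's exact estimator. The sketch relies on two standard facts: if $N_1$ is the number of draws needed to reach $r \geq 2$ successes, then $(r-1)/(N_1-1)$ is unbiased for $p$; and if $N_0$ is the number of draws needed to reach $r$ failures, then $N_0/r$ is unbiased for $1/(1-p)$. With independent runs, the product is unbiased for the odds $p/(1-p)$.

```python
import random

def ibs_trials(draw, r, target):
    """Draw until `target` (0 or 1) has been seen r times; return the number of draws."""
    n = hits = 0
    while hits < r:
        n += 1
        if draw() == target:
            hits += 1
    return n

def odds_estimate(draw, r):
    """Unbiased estimate of p/(1-p) from two independent IBS runs (requires r >= 2)."""
    n1 = ibs_trials(draw, r, 1)  # run until r successes: E[(r-1)/(n1-1)] = p
    n0 = ibs_trials(draw, r, 0)  # run until r failures:  E[n0/r] = 1/(1-p)
    return (r - 1) / (n1 - 1) * n0 / r

# Illustrative values only, not taken from the paper.
rng = random.Random(1)
p, r, runs = 0.3, 20, 5000
draw = lambda: int(rng.random() < p)
mean_estimate = sum(odds_estimate(draw, r) for _ in range(runs)) / runs
true_odds = p / (1 - p)
print(mean_estimate, true_odds)
```

When `draw` is replaced by a generator of the composite Bernoulli variable, the estimated odds correspond to RR or OR, depending on which transformation is used.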

The estimators for the various parameters are carefully constructed to be unbiased and, critically, are parameter-robust: their MSE (or relative MSE, for the non-log parameters) is upper-bounded by a function of the IBS target and does not depend on $(p_1, p_2)$.

Statistical Guarantees and Efficiency Analysis

The primary statistical guarantee is that, for all $p_1, p_2 \in (0,1)$, the estimator's relative MSE for RR and OR and the MSE for LRR and LOR are strictly below a user-chosen threshold, enforced via the selection of the IBS target parameter. This strict error control is realized regardless of the prevalence parameters.

A crucial metric is estimation efficiency, defined as the ratio of the Cramér–Rao lower bound (CRLB) for a fixed-sample estimator to the achieved variance, normalized by mean sample usage. The paper establishes that, for small $p_1, p_2$ (rare events), the efficiency approaches 1 for all four parameters. This efficiency is quantified as better than $0.9$ (i.e., sample waste is less than 10%) even for moderate accuracy demands and remains high as one enforces stricter estimation error or as event probabilities decrease. These results are supported by both analytic lower bounds and extensive Monte Carlo experiments.
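
For reference, the fixed-sample Cramér–Rao bounds have simple closed forms. The Python sketch below computes them for $n$ independent observations per population, using the standard Fisher information of a Bernoulli variable. The `efficiency` helper is one plausible formalization of the ratio described above (bound evaluated at the average number of pairs, divided by the achieved error) and is an assumption of this sketch, not a formula quoted from the paper.

```python
def crlb_log_rr(p1, p2, n):
    """CRLB for unbiased estimation of log(p1/p2) with n observations per population.

    The bound on the relative variance of an RR estimator has the same value,
    because the bound for RR is RR**2 times the bound for log RR.
    """
    return ((1 - p1) / p1 + (1 - p2) / p2) / n

def crlb_log_or(p1, p2, n):
    """CRLB for unbiased estimation of the log odds ratio (same value for relative OR)."""
    return (1 / (p1 * (1 - p1)) + 1 / (p2 * (1 - p2))) / n

def efficiency(bound_fn, p1, p2, mean_pairs, achieved_error):
    """Assumed definition: fixed-sample bound at n = mean_pairs over the achieved (relative) MSE."""
    return bound_fn(p1, p2, mean_pairs) / achieved_error

# Illustrative values only: for rare events the two bounds nearly coincide.
print(crlb_log_rr(0.01, 0.005, 10000))
print(crlb_log_or(0.01, 0.005, 10000))
```

For small $p_1, p_2$ both bounds are approximately $(1/p_1 + 1/p_2)/n$, which is why RR and OR behave similarly in the rare event regime.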

By contrast, for non-rare events (larger $p_1, p_2$), efficiency degrades. However, the regime of primary interest in epidemiology and related fields is precisely the one in which events are rare.

The methodology is also compared with existing sequential and group estimators in the literature (notably [Mendo, Stat Papers 2026]), showing efficiency gains, especially when sample pairing is enforced and when ratios are extreme or probabilities are small.

Practical and Theoretical Implications

Practical implications include:

The estimators are deployable in clinical or survey contexts with strict sample-size constraints and accuracy guarantees.
The sequential IBS framework is amenable to real-time or online estimation where sampling cost and error control are critical.
The method automatically adapts to the underlying event rate without requiring prior knowledge or parameter tuning.

Theoretical implications:

The unbiasedness and guaranteed-risk tradeoff are achieved for all $p_1, p_2$, not only asymptotically.
The sequential pairing framework and the Bernoulli factory algorithms provide a novel blueprint for constructing high-efficiency, strictly controlled estimators in other multi-population settings.
Robustness to the entire parameter range ensures suitability for meta-analytic or sensitivity applications.

Future Directions

Potential avenues for extension include:

Generalization to more than two populations, or to cases with covariate adjustment, leveraging the presented pairing and transformation techniques.
Adaptation to dependent Bernoulli sequences (e.g., longitudinal or clustered data).
Further tightening of the analytic bounds for efficiency in the intermediate-probability regime.
Investigation of the method's properties under misspecified models and real-world signal contamination.

Conclusion

The paper presents a rigorous, efficient, and unbiased sequential estimation strategy for RR, OR, LRR, and LOR, with strong theoretical guarantees and high practical efficiency in the rare event regime. The framework's robustness across the parameter range and its strict error control make it a strong option for association parameter estimation when both accuracy and efficiency are essential, particularly in medical, epidemiological, and large-scale survey contexts.
