Randomized Email Audit Studies
- Randomized email audit studies are controlled field experiments that employ A/B tests, factorial designs, and adaptive multi-armed bandit methods to infer causal effects.
- They provide robust internal validity by using random assignment and model-free inference to measure outcomes like discrimination, phishing susceptibility, and engagement.
- Applications span assessing fairness in AI, optimizing behavioral nudges, and guiding policy interventions while ensuring adequate power and balanced design.
Randomized email audit studies are controlled field experiments in which email messages—systematically varied along key dimensions—are sent to recipients (individuals or organizations) to measure behavioral outcomes or infer causal effects. These studies provide empirical evidence about human decision-making, discrimination, engagement, and intervention effectiveness with robust internal validity due to random assignment. By leveraging factorial, A/B, and adaptive designs, randomized email audit studies have become essential tools for both social science and computational research into discrimination, fairness interventions, phishing susceptibility, and behavioral nudging.
1. Core Experimental Designs in Email Audit Studies
Randomized email audit studies utilize a set of experimental methodologies to maximize causal identification and statistical power:
A/B Experiments:
Simple randomization divides subjects into treatment and control groups, allowing comparison of binary outcomes such as reply or engagement rates. For instance, a randomized A/B comparison was applied in an undergraduate homework setting by randomly shuffling students into email reminder (treatment) and no-reminder (control) groups over several weeks (Zavaleta-Bernuy et al., 2022).
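A minimal sketch of this workflow, assuming a binary outcome (e.g., started homework) and a hand-computed two-proportion z-test; the sample sizes and rates are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Randomly shuffle subjects into treatment (reminder email) and control (no reminder).
n_subjects = 400
assignment = rng.permutation(np.r_[np.ones(n_subjects // 2), np.zeros(n_subjects // 2)])

# ... emails are sent and binary outcomes are recorded; simulated here for illustration ...
outcomes = rng.binomial(1, np.where(assignment == 1, 0.55, 0.48))

# Two-proportion z-test on engagement rates.
treated, control = outcomes[assignment == 1], outcomes[assignment == 0]
p1, p0 = treated.mean(), control.mean()
p_pool = outcomes.mean()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / len(treated) + 1 / len(control)))
z = (p1 - p0) / se
p_value = 2 * stats.norm.sf(abs(z))
print(f"treatment rate={p1:.3f}, control rate={p0:.3f}, z={z:.2f}, p={p_value:.3f}")
```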
Factorial ($2^K$) Designs:
Audit experiments often deploy multifactor randomization, combining signals—such as race, gender, and income—across $K$ dimensions at two levels each (e.g., Black/White, Male/Female, Low/High income), resulting in $2^K$ treatment groups (Pashley et al., 13 Mar 2025). Assignment is handled via completely randomized allocation, sometimes with equal group sizes.
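A sketch of completely randomized allocation for a $2^3$ factorial audit; the factor labels follow the race/gender/income example above and the sample size is illustrative:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Enumerate the 2^3 = 8 treatment cells from three two-level factors.
factors = {
    "race": ["Black", "White"],
    "gender": ["Male", "Female"],
    "income": ["Low", "High"],
}
cells = list(itertools.product(*factors.values()))  # 8 combinations

# Completely randomized allocation with equal group sizes.
n_emails = 800
cell_ids = rng.permutation(np.repeat(np.arange(len(cells)), n_emails // len(cells)))
assignments = [dict(zip(factors, cells[i])) for i in cell_ids]
print(assignments[0])  # e.g., {'race': 'White', 'gender': 'Male', 'income': 'Low'}
```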
Adaptive Multi-Armed Bandit (MAB) Designs:
Dynamic experiments use algorithms such as Thompson Sampling (TS) or hybrid methods (TS†) to adaptively allocate subjects to different email conditions (“arms”) based on accumulating outcome data (e.g., email open rates). These policies balance exploration (gathering data across all arms) and exploitation (assigning more subjects to promising arms) (Yanez et al., 2022).
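A minimal Thompson Sampling allocation loop with Beta–Bernoulli posterior updates; the arm set and true open rates are simulated assumptions, and a hybrid TS† variant would additionally mix in uniform-random assignments:

```python
import numpy as np

rng = np.random.default_rng(1)
true_open_rates = [0.20, 0.25, 0.22]   # unknown in practice; simulated for illustration
alpha = np.ones(3)                     # Beta prior successes per arm (e.g., subject line)
beta = np.ones(3)                      # Beta prior failures per arm

for step in range(500):
    # Exploration/exploitation: sample a plausible open rate per arm, pick the max.
    sampled = rng.beta(alpha, beta)
    arm = int(np.argmax(sampled))
    # Send one email under the chosen condition and observe whether it is opened.
    opened = rng.binomial(1, true_open_rates[arm])
    # Posterior update for the chosen arm only.
    alpha[arm] += opened
    beta[arm] += 1 - opened

print("posterior mean open rates:", np.round(alpha / (alpha + beta), 3))
```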
Table 1: Key Design Features

| Design Type | Randomization Method | Dimensionality |
|---|---|---|
| A/B | Simple shuffle | 1 (Treatment/Control) |
| Factorial ($2^K$) | Orthogonal random assignment | $K$ factors ($2^K$ groups) |
| Multi-Armed Bandit | TS/UR/hybrid adaptive | $K$ arms (e.g., subject lines) |
2. Measurement and Statistical Analysis
Outcomes in email audit studies span binary responses (clicked/replied), behavioral metrics (timing or engagement), and survey/interview data. Statistical analysis leverages randomization-driven inference, power calculations, and both linear and non-linear estimands.
Randomization-Based Neymanian Inference:
Factorial effects are expressed as contrasts among the observed group proportions $\bar{Y}_1, \ldots, \bar{Y}_{2^K}$ and estimated as

$$\hat{\boldsymbol{\tau}} = \mathbf{C}^{\top} \bar{\mathbf{Y}},$$

where $\mathbf{C}$ is an orthogonal contrast matrix (Pashley et al., 13 Mar 2025). Variance estimation is model-free, relying on assignment-induced sampling variability.
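A sketch of this estimator for a $2^2$ design, assuming equal-sized groups and a standard orthogonal contrast matrix (columns for the two main effects and their interaction; the cell proportions are illustrative):

```python
import numpy as np

# Observed reply proportions and sizes for the 2^2 = 4 cells, ordered (-,-), (-,+), (+,-), (+,+).
y_bar = np.array([0.10, 0.12, 0.18, 0.22])   # illustrative values
n = np.array([100, 100, 100, 100])
var_hat = y_bar * (1 - y_bar) / n            # sampling variance of each cell proportion

# Orthogonal contrast matrix: columns = factor A, factor B, A x B interaction.
C = 0.5 * np.array([
    [-1, -1, +1],
    [-1, +1, -1],
    [+1, -1, -1],
    [+1, +1, +1],
])

tau_hat = C.T @ y_bar                         # estimated factorial effects
se_hat = np.sqrt((C**2).T @ var_hat)          # model-free (Neyman-style) standard errors
for name, t, s in zip(["A", "B", "A:B"], tau_hat, se_hat):
    print(f"effect {name}: {t:+.4f} (SE {s:.4f})")
```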
Panel Regression and Distribution Tests:
Panel data models quantify within-subject and between-period variation, e.g.,

$$Y_{it} = \alpha_i + \gamma_t + \beta\, T_{it} + \varepsilon_{it},$$

where $T_{it}$ is the binary indicator for treatment exposure (Zavaleta-Bernuy et al., 2022). Distributional differences are tested with two-sample z-tests and Kolmogorov–Smirnov statistics.
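A sketch of the panel specification and the distributional test, assuming a long-format table with one row per student-week; the file and column names (student_id, week, treated, started_homework, hours_before_deadline) are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical panel data: one row per (student, week).
df = pd.read_csv("reminder_panel.csv")

# Linear panel model with student and week fixed effects, clustered by student.
model = smf.ols("started_homework ~ treated + C(student_id) + C(week)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["student_id"]}
)
print(model.params["treated"], model.bse["treated"])

# Kolmogorov-Smirnov test comparing, e.g., homework start times across conditions.
treated_times = df.loc[df["treated"] == 1, "hours_before_deadline"].dropna()
control_times = df.loc[df["treated"] == 0, "hours_before_deadline"].dropna()
print(stats.ks_2samp(treated_times, control_times))
```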
Multi-Armed Bandit Update Mechanisms:
TS updates arm allocations via Beta priors, while TS† combines evidence from uniform random and TS assignments, aiming to mitigate premature convergence and maintain exploration (Yanez et al., 2022).
Nonlinear Estimands:
Logarithmic and logit-scale factorial effects extend reporting to causal risk ratios and odds ratios, useful for interpreting effect sizes in audit studies (Pashley et al., 13 Mar 2025).
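A small worked example with two illustrative cell proportions, showing how the same contrast reads on the additive, log (risk-ratio), and logit (odds-ratio) scales:

```python
p_treated, p_control = 0.18, 0.12           # illustrative reply proportions

risk_difference = p_treated - p_control      # additive factorial effect
risk_ratio = p_treated / p_control           # exponentiated log-scale effect
odds_ratio = (p_treated / (1 - p_treated)) / (p_control / (1 - p_control))  # logit scale

print(f"RD={risk_difference:.3f}, RR={risk_ratio:.2f}, OR={odds_ratio:.2f}")
```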
3. Behavioral Factors and Decision Processes
Empirical investigations into email response behaviors, especially in phishing contexts, have identified multifaceted cognitive and situational influences:
- Sender Legitimacy and Link Perception:
Every participant in the cited scenario-based phishing study considered sender details and link appearance, but often relied on superficial cues (domain familiarity, "no-reply" addresses, apparent HTTPS) (Jayatilaka et al., 2021).
- Emotional and Situational Cues:
Approximately 95% of participants acknowledged that emotions (excitement, anxiety, urgency) shaped their response tendencies. The perceived likelihood of receiving a message and the granularity of information further modulated trust levels (e.g., detailed emails seen as more legitimate) (Jayatilaka et al., 2021).
- Validation Behaviors and Habits:
Users differentiated between safe validation (cross-checking outside the email client) and unsafe practices (clicking links for verification). Idiosyncratic habits and past phishing experiences variably reinforced accurate or inaccurate detection strategies.
- Statistical Modeling:
Decision outcomes can be represented as a function of the identified cues, e.g.,

$$\Pr(\text{unsafe action}) = f(x_1, x_2, \ldots, x_K),$$

where each $x_k$ corresponds to one of the influential behavioral dimensions above (Jayatilaka et al., 2021); see the sketch after this list.
- Disjunction Between Cognition and Action:
The empirical results show that correctly judging an email as phishing does not always guarantee safe behavior, especially when habitual or emotional factors intervene.
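As referenced under Statistical Modeling above, one way such a decision model could be operationalized is a logistic regression over cue indicators; the features, coefficients, and simulated data below are purely illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Hypothetical binary cue matrix: one row per (participant, email); columns map to the
# behavioral dimensions above: sender legitimacy, link appearance, urgency/emotion,
# situational plausibility, and level of detail.
X = rng.integers(0, 2, size=(500, 5))

# Simulated ground truth for illustration only: unsafe action becomes more likely when
# urgency is present and legitimacy cues are superficially satisfied.
logits = -1.0 + 0.8 * X[:, 0] + 0.6 * X[:, 1] + 1.2 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

clf = LogisticRegression().fit(X, y)
print(dict(zip(["sender", "link", "urgency", "context", "detail"], clf.coef_[0].round(2))))
```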
4. Applications in Discrimination, Fairness, and Intervention Effectiveness
Randomized email audit studies have broad and significant applications:
Discrimination Detection:
Multifactor audits have been deployed to measure racial, gender, and income bias by randomizing signals in email communication with legal professionals, revealing significant disparities in response rates attributable to race alone (main effect estimate = 0.1875, SE ≈ 0.0917) (Pashley et al., 13 Mar 2025).
Fairness Interventions in AI Systems:
Audit study data provide a principled benchmark for evaluating the fairness of automated classifiers. Conventional techniques, such as base rate equalization, may yield misleading parity when only convenience samples are used. Label bias—where human decision labels are already discriminatory—can create an illusion of fairness (Sariola et al., 2 Jul 2025).
ITE-Based Label Repair:
Counterfactual estimation of the individual treatment effect (ITE) with virtual twins ($\widehat{\mathrm{ITE}}(x) = \hat{f}_1(x) - \hat{f}_0(x)$, the difference between outcome models fit on treated and control units) makes it possible to iteratively "repair" labels, resulting in more equitable model decisions (Sariola et al., 2 Jul 2025).
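A minimal virtual-twins sketch using scikit-learn; the model class, feature names, and repair threshold are assumptions for illustration, not the cited paper's exact procedure:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def virtual_twin_ite(X, treatment, y, seed=0):
    """Fit separate outcome models on treated and control units and return
    ITE(x) = f1(x) - f0(x) as the difference of predicted probabilities."""
    f1 = RandomForestClassifier(random_state=seed).fit(X[treatment == 1], y[treatment == 1])
    f0 = RandomForestClassifier(random_state=seed).fit(X[treatment == 0], y[treatment == 0])
    return f1.predict_proba(X)[:, 1] - f0.predict_proba(X)[:, 1]

# Illustrative repair rule (hypothetical threshold): flip labels whose estimated ITE
# indicates the observed decision was driven by the protected signal.
# ite = virtual_twin_ite(X, race_signal, replied)
# repaired = np.where(np.abs(ite) > 0.1, 1 - replied, replied)
```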
Behavioral Nudges and Educational Engagement:
Randomized A/B audits demonstrated that well-designed email prompts can increase the proportion of students starting homework (by ≈4–9 percentage points), but design variants (sender, subject, content) have differential and sometimes non-significant impacts on precise timing or engagement (Zavaleta-Bernuy et al., 2022).
Adaptive Experimentation:
Multi-armed bandit methods allow scalable, real-time optimization of email interventions (e.g., subject line personalization or urgency cues), but pose challenges: over-exploitation may occur when differences are not statistically meaningful, calling for hybrid approaches and ongoing uniform exploration (Yanez et al., 2022).
5. Sample Size, Power, and Design Optimization
Audit studies must balance statistical validity, resource constraints, and ethical considerations:
Power Analysis and Sample Size Determination:
Randomization-based formulas ensure sufficient sample size to detect pre-specified effect sizes in factorial experiments; for example, with equal allocation of $n$ units to each of the $2^K$ cells, detecting a factorial effect of size $\tau_0$ at significance level $\alpha$ with power $1-\beta$ requires

$$n \;\geq\; \frac{(z_{1-\alpha/2} + z_{1-\beta})^{2} \sum_{j=1}^{2^{K}} \sigma_j^{2}}{2^{2(K-1)}\, \tau_0^{2}},$$

where pilot data inform the group variances $\sigma_j^{2}$ and expected group proportions (Pashley et al., 13 Mar 2025). This safeguards inference against underpowered studies.
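A sketch of the resulting calculation under the equal-allocation formula above; the pilot proportions, target effect, and error rates are illustrative:

```python
import numpy as np
from scipy import stats

K = 3                                   # number of two-level factors
p_pilot = np.full(2**K, 0.15)           # pilot response proportions per cell
sigma2 = p_pilot * (1 - p_pilot)        # Bernoulli variances per cell
tau0 = 0.05                             # smallest factorial effect worth detecting
alpha, power = 0.05, 0.80

z = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
# Equal allocation of n units to each of the 2^K cells:
n_per_cell = np.ceil(z**2 * sigma2.sum() / (2**(2 * (K - 1)) * tau0**2))
print(f"n per cell = {int(n_per_cell)}, total N = {int(n_per_cell * 2**K)}")
```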
Optimal Allocation Strategies:
D-optimal designs maximize information per assignment, while A- and E-optimal approaches enable trade-offs between variance reduction and resource efficiency. When sample sizes are small, type I error is controlled via comparison-wise or Bonferroni-adjusted multiple-testing procedures.
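A small sketch of how candidate allocations can be compared on the D-optimality criterion, i.e., the log-determinant of the information matrix of a coded main-effects-plus-interaction model; the candidate weight vectors are illustrative:

```python
import itertools
import numpy as np

# Design matrix for a 2^2 model (coded -1/+1): intercept, A, B, A x B; rows = cells.
X = np.array([[1, a, b, a * b] for a, b in itertools.product([-1, 1], repeat=2)])

def log_det_information(weights):
    """Information matrix M(w) = sum_j w_j x_j x_j^T for allocation weights w."""
    M = (X.T * weights) @ X
    return np.linalg.slogdet(M)[1]

candidates = {
    "balanced": np.full(4, 0.25),
    "skewed": np.array([0.40, 0.30, 0.20, 0.10]),
}
for name, w in candidates.items():
    print(name, round(log_det_information(w), 3))  # balanced allocation scores highest
```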
6. Challenges, Limitations, and Future Directions
Several methodological and ethical challenges are endemic to randomized email audit studies:
- Premature Exploitation in Adaptive Designs:
When observed differences between arms are not statistically significant, adaptive algorithms such as TS may rapidly allocate most subjects to one arm, undermining reliable inference (Yanez et al., 2022). Hybrid (TS†) allocation and maintained uniform-random exploration offer partial remedies.
- Confounding and Operational Constraints:
Signals (e.g., names for racial identity) in factorial designs can inadvertently confound unmeasured social attributes. Simultaneous manipulation of multiple signals and rigorous orthogonal assignment help minimize confounding, though real-world constraints may restrict the number of feasible arms or sample sizes (Pashley et al., 13 Mar 2025).
- Label and Selection Bias:
Reliance on convenience samples not generated by random audit processes can produce substantial bias in fairness evaluations—remedied only via rigorously randomized audit data and causal ITE-based label repair (Sariola et al., 2 Jul 2025).
- Balance Between Engagement and Annoyance:
In behavioral interventions, excessive or non-personalized reminders may provoke negative emotional responses, as seen in the educational context (Zavaleta-Bernuy et al., 2022). This underscores the need for adaptive, personalized audit interventions and measurement of both short- and long-term outcomes.
- Non-Stationarity and Contextual Adaptation:
Future research is needed to incorporate dynamic (non-stationary) behavioral contexts into allocation frameworks, potentially by integrating contextual covariates or moving from batch to real-time updates (Yanez et al., 2022).
Conclusion
Randomized email audit studies, deploying factorial, A/B, and adaptive multi-armed bandit designs, have become central to quantifying discrimination, assessing fairness interventions, and optimizing behavioral nudging. The methodology’s model-free inference, robust design features, and nuanced attention to behavioral factors enable principled causal conclusions unattainable with convenience samples. Functionally, these studies guide both academic research and practical policy, complemented by rigorous sample size calculation, optimal allocation, and careful interpretation of effect metrics—including nonlinear estimands such as risk ratios and odds ratios. As research progresses, the integration of contextual adaptation, causal ITE-based label repair, and real-world behavioral metrics will enhance the precision, scale, and ethical underpinnings of audit-based empirical research.