Sequential Testing Framework
- Sequential Testing Framework is a dynamic statistical method that determines sample size based on accumulating data and adapts testing procedures in real time.
- It employs techniques like SPRT and self-tuning generalized likelihood ratios to enforce precise error control with calibrated stopping rules.
- The framework achieves asymptotic optimality by minimizing expected sample size and supports adaptive designs, including computerized adaptive testing.
A sequential testing framework provides statistical decision procedures in which the sample size is not fixed in advance but determined dynamically based on the incoming data and, optionally, adaptive experiment selection. This approach underlies classical sequential probability ratio tests (SPRT), modern generalized likelihood ratio (GLR) procedures, and their extensions to adaptive designs, non-parametric models, and real-time applications such as computerized adaptive testing (CAT). Contemporary sequential frameworks optimize expected sample size subject to rigorous control of type I and II error probabilities across both fixed-length and open-ended settings, and adaptively focus sampling on critical regions of uncertainty.
1. Fundamental Model and GLR Construction
Let $X_1, X_2, \ldots$ be a sequence of observations under an exponential-family model, with densities
$$ f_\theta(x) = \exp\{\theta x - \psi(\theta)\} $$
with respect to a common dominating measure.
Observations may be i.i.d., or, in adaptive designs, generated according to item-specific models (e.g., in CAT, each item $j$ has its own response density $f_{j,\theta}$ and a corresponding Kullback–Leibler information $I_j(\theta, \lambda) = E_\theta\!\left[\log\{f_{j,\theta}(X)/f_{j,\lambda}(X)\}\right]$).
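To make the item-specific quantities concrete, the following sketch computes the Kullback–Leibler information of a single binary item under a two-parameter logistic (2PL) response model; the 2PL form, parameter values, and function names are illustrative assumptions rather than part of the framework's specification.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response for ability theta,
    discrimination a, and difficulty b (illustrative model choice)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def kl_information(theta, lam, a, b):
    """Kullback-Leibler information I_j(theta, lam) of one binary item:
    expected log-likelihood ratio of theta against lam when theta is true."""
    p, q = p_correct(theta, a, b), p_correct(lam, a, b)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Example: information for separating theta = 0.5 from theta = -0.5
print(kl_information(0.5, -0.5, a=1.2, b=0.0))
```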
The sequential test considers composite hypotheses defined via cut-points $\theta_0 < \theta_1$ for "mastery":
$$ H_0: \theta \le \theta_0 \quad \text{versus} \quad H_1: \theta \ge \theta_1, $$
with an "indifference region" $(\theta_0, \theta_1)$ in which either decision is acceptable.
The classical SPRT utilizes the fixed-point likelihood ratio $\sum_{i=1}^{n}\log\{f_{\theta_1}(X_i)/f_{\theta_0}(X_i)\}$. Modern frameworks generalize this to the self-tuning generalized likelihood ratio (GLR)
$$ l_n(\hat\theta_n, \theta) = \sum_{i=1}^{n} \log\frac{f_{\hat\theta_n}(X_i)}{f_\theta(X_i)}, $$
where $\hat\theta_n$ is the MLE after $n$ observations, and $\theta$ is a context-specific reference value (typically $\theta_0$ or $\theta_1$).
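As a concrete instance, the sketch below evaluates the self-tuning GLR for i.i.d. Bernoulli observations, where the MLE is simply the sample proportion; the Bernoulli parameterization and the helper name are illustrative assumptions.

```python
import numpy as np

def bernoulli_glr(x, p_ref):
    """Self-tuning GLR l_n(p_hat, p_ref) for i.i.d. Bernoulli data:
    log-likelihood at the MLE p_hat minus log-likelihood at p_ref."""
    x = np.asarray(x, dtype=float)
    n, s = len(x), x.sum()
    p_hat = s / n                      # MLE after n observations
    eps = 1e-12                        # guard against log(0) at the boundary
    def loglik(p):
        p = min(max(p, eps), 1 - eps)
        return s * np.log(p) + (n - s) * np.log(1 - p)
    return loglik(p_hat) - loglik(p_ref)

# GLR of the observed responses against the lower cut-point p0 = 0.6 (illustrative value)
responses = [1, 1, 0, 1, 1, 1, 0, 1]
print(bernoulli_glr(responses, p_ref=0.6))
```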
2. Stopping Rules and Error Control via Modified Haybittle–Peto Procedure
Sequential frameworks enforce a maximum sample size $N$ and control type-I ($\alpha$) and type-II ($\beta$) error probabilities. The modified Haybittle–Peto procedure is defined as follows, with a burn-in period $n_0$ and tuning parameter $\varepsilon \in (0, 1)$:
- For each stage $n$ with $n_0 \le n < N$, compute $\hat\theta_n$ and the GLR statistics $l_n(\hat\theta_n, \theta_0)$ and $l_n(\hat\theta_n, \theta_1)$.
- Decision boundaries:
- Reject ("mastery") if $\hat\theta_n \ge \theta_1$ and $l_n(\hat\theta_n, \theta_0) \ge b$,
- Accept ("non-mastery") if $\hat\theta_n \le \theta_0$ and $l_n(\hat\theta_n, \theta_1) \ge \tilde b$.
- At $n = N$, declare mastery if $\hat\theta_N \ge \theta_0$ and $l_N(\hat\theta_N, \theta_0) \ge c$; otherwise declare non-mastery.
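A schematic rendering of these boundaries, reusing the Bernoulli GLR helper above with cut-points p0 < p1 standing in for the mastery cut-points, might look like the following; the thresholds b, b_tilde, c are assumed to have been calibrated separately, and the exact terminal rule should be checked against the source.

```python
def modhp_decision(x, p0, p1, n0, N, b, b_tilde, c):
    """Interim/terminal decision for the modified Haybittle-Peto rule
    (sketch; thresholds b, b_tilde, c are assumed pre-calibrated)."""
    n = len(x)
    p_hat = sum(x) / n
    if n0 <= n < N:
        if p_hat >= p1 and bernoulli_glr(x, p0) >= b:
            return "reject H0 (mastery)"
        if p_hat <= p0 and bernoulli_glr(x, p1) >= b_tilde:
            return "accept H0 (non-mastery)"
        return "continue"
    if n >= N:  # terminal analysis at the maximum test length
        if p_hat >= p0 and bernoulli_glr(x, p0) >= c:
            return "reject H0 (mastery)"
        return "accept H0 (non-mastery)"
    return "continue"  # still in the burn-in period
```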
Thresholds $b$, $\tilde b$, and $c$ are calibrated so that
$$ P_{\theta_0}\{\text{reject } H_0 \text{ before stage } N\} = \varepsilon\alpha, \qquad P_{\theta_1}\{\text{accept } H_0 \text{ before stage } N\} = \varepsilon\beta, \qquad P_{\theta_0}\{\text{reject } H_0\} = \alpha, $$
achieving exact overall error rates.
Threshold calibration is performed via Monte Carlo simulation, normal-approximation recursions, or Siegmund’s closed-form formulas.
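For the Monte Carlo option, one possible calibration loop is sketched below: it tunes the early-rejection threshold b by bisection so that the early type-I error under the lower cut-point is approximately εα. The bisection scheme, simulation sizes, and parameter values are illustrative assumptions, and the sketch reuses the Bernoulli helpers above.

```python
import numpy as np

rng = np.random.default_rng(0)

def early_reject_prob(b, p0, p1, n0, N, n_sim=2000):
    """Estimate P_{p0}(early rejection) for a candidate threshold b
    by simulating i.i.d. Bernoulli(p0) response sequences."""
    count = 0
    for _ in range(n_sim):
        x = rng.binomial(1, p0, size=N).tolist()
        for n in range(n0, N):
            xs = x[:n]
            if sum(xs) / n >= p1 and bernoulli_glr(xs, p0) >= b:
                count += 1
                break
    return count / n_sim

def calibrate_b(target, p0, p1, n0, N, lo=0.0, hi=20.0, iters=12):
    """Bisection on b so the early type-I error matches target (= eps * alpha)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if early_reject_prob(mid, p0, p1, n0, N) > target:
            lo = mid          # too many early rejections: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative settings: alpha = 0.05, eps = 0.5, so the early-rejection budget is 0.025
b = calibrate_b(target=0.025, p0=0.6, p1=0.75, n0=10, N=50)
```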
3. Asymptotic Optimality and Theory of Sequential Experiment Selection
Define $T$ as the random stopping time (the number of observations at termination). Among all tests that stop by stage $N$ and satisfy the error constraints, the modified Haybittle–Peto test achieves
$$ E_\theta[T] = \big(1 + o(1)\big)\, \inf_{T'} E_\theta[T'] \quad \text{as } \alpha, \beta \to 0, $$
where the infimum is over all tests in this class, meaning no other such test can asymptotically achieve a lower expected sample size at any parameter value $\theta$.
Extensions to adaptive experiment selection (e.g., CAT):
- At each stage $n$, select an item $j_n$ informed by past data, and observe the response $X_n \sim f_{j_n, \theta}$.
- Provided long-run item-selection frequencies exist and all item models satisfy a uniform convexity bound, the modHP procedure remains asymptotically optimal in the adaptive setting.
- If items fall into classes with common response models, optimality persists when only the limiting class frequencies (rather than item-level frequencies) are required to exist.
Proofs rely on Hoeffding-type lower bounds for the expected sample size and a martingale central limit theorem for the GLR increments.
4. Sequential CAT Algorithmic Realization
For item pools with parameters $(a_j, b_j, c_j)$ under three-parameter logistic (3PL) models, $P_j(\theta) = c_j + (1 - c_j)/\{1 + e^{-a_j(\theta - b_j)}\}$, the algorithm selects at each step the unused item maximizing a chosen information index at the current ability estimate $\hat\theta_n$:
- Fisher information $I_j(\hat\theta_n) = \{P_j'(\hat\theta_n)\}^2 / \big[P_j(\hat\theta_n)\{1 - P_j(\hat\theta_n)\}\big]$,
- Kullback–Leibler information $K_j(\hat\theta_n, \theta) = E_{\hat\theta_n}\!\left[\log\{f_{j,\hat\theta_n}(X_j)/f_{j,\theta}(X_j)\}\right]$, evaluated at the cut-points $\theta_0$ and $\theta_1$.
After observing the response $X_n$, update the log-likelihood, recompute the MLE
$$ \hat\theta_n = \arg\max_\theta \sum_{i=1}^{n} \log f_{j_i, \theta}(X_i), $$
and check the stopping-rule conditions.
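Putting these steps together, the following sketch simulates one examinee under a hypothetical 3PL item pool, selecting items by Fisher information at the current ability estimate and recomputing the MLE on a grid after each response; the pool parameters, grid, and maximum length are illustrative assumptions, and the stopping-rule check is left as a placeholder.

```python
import numpy as np

rng = np.random.default_rng(1)

def p3pl(theta, a, b, c):
    """3PL item response function (probability of a correct response)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta."""
    p = p3pl(theta, a, b, c)
    dp = (1 - c) * a * np.exp(-a * (theta - b)) / (1 + np.exp(-a * (theta - b))) ** 2
    return dp ** 2 / (p * (1 - p))

def mle_on_grid(admin, responses, grid):
    """Ability MLE computed by maximizing the log-likelihood over a theta grid."""
    ll = np.zeros_like(grid)
    for (a, b, c), x in zip(admin, responses):
        p = np.clip(p3pl(grid, a, b, c), 1e-9, 1 - 1e-9)
        ll += x * np.log(p) + (1 - x) * np.log(1 - p)
    return grid[np.argmax(ll)]

# Illustrative (hypothetical) item pool of (a, b, c) parameters and one simulated examinee
pool = [(rng.uniform(0.8, 2.0), rng.normal(), rng.uniform(0.1, 0.25)) for _ in range(200)]
theta_true, grid = 0.4, np.linspace(-4, 4, 161)
theta_hat, admin, responses, used = 0.0, [], [], set()

for step in range(50):                                    # maximum test length N = 50
    j = max((k for k in range(len(pool)) if k not in used),
            key=lambda k: fisher_info(theta_hat, *pool[k]))
    used.add(j)
    x = int(rng.binomial(1, p3pl(theta_true, *pool[j])))  # observe the response
    admin.append(pool[j]); responses.append(x)
    theta_hat = mle_on_grid(admin, responses, grid)        # recompute the MLE
    # ...the modified Haybittle-Peto stopping rule from Section 2 would be checked here...
```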
5. Real-Time Adaptive Mastery Testing and Performance Benchmarking
The sequential testing protocol enables:
- Early stopping for clear mastery ($\theta \ge \theta_1$) or clear non-mastery ($\theta \le \theta_0$),
- Prolonged testing within the indifference region $(\theta_0, \theta_1)$.
The self-tuning GLR statistic dynamically concentrates statistical information on the hardest-to-classify examinees.
Empirical comparison using a large test-item pool (ETS Chauncey data, 1136 items) reveals:
- Classical truncated SPRT yields inflated type-I error (≈16%, target 5%) and longer average test length.
- Modified Haybittle–Peto test (modHP) achieves error rates (α, β) exactly, and reduces average test length by 40–50% compared to fixed-length and TSPRT designs, without exceeding the maximum allowed N.
- Exposure-control and content-balancing overlays can be applied without compromising statistical validity as long as item selection remains outcome-adaptive and limiting frequencies exist.
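A benchmarking harness along these lines can be sketched as follows: it replays the Bernoulli modHP rule from the Section 2 sketch on many simulated examinees at the lower cut-point and reports the empirical rejection rate and mean test length. It illustrates the protocol only; the thresholds and settings are hypothetical and do not reproduce the reported figures.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_operating_characteristics(p_true, p0, p1, n0, N, b, b_tilde, c, n_sim=2000):
    """Empirical rejection rate and mean test length of the modHP rule
    for i.i.d. Bernoulli responses at a given true parameter."""
    rejections, lengths = 0, []
    for _ in range(n_sim):
        x, decision = [], "continue"
        while decision == "continue":
            x.append(int(rng.binomial(1, p_true)))
            decision = modhp_decision(x, p0, p1, n0, N, b, b_tilde, c)
        rejections += decision.startswith("reject")
        lengths.append(len(x))
    return rejections / n_sim, float(np.mean(lengths))

# Type-I error and mean test length at the lower cut-point (illustrative settings)
alpha_hat, mean_len = simulate_operating_characteristics(
    p_true=0.6, p0=0.6, p1=0.75, n0=10, N=50, b=2.5, b_tilde=2.5, c=1.5)
```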
6. Calibration, Implementation, and Robustness Considerations
Calibration of thresholds is accomplished via:
- Monte Carlo routines: estimation of implied alternatives for fixed-N tests and subsequent simulation to resolve target error rates.
- Normal-approximation formulas: use the signed-root statistic $\operatorname{sgn}(\hat\theta_n - \theta)\sqrt{2\, l_n(\hat\theta_n, \theta)}$, which is approximately standard normal, enabling efficient computation of boundary-crossing probabilities via recursive numerical integration (a minimal sketch follows this list).
- Empirical choices for the burn-in $n_0$ and the tuning parameter $\varepsilon$ deliver robust practical performance.
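The signed-root transformation itself is simple to compute directly; the sketch below applies it to the Bernoulli GLR helper from the earlier sketch, with the approximate standard normality serving only as the heuristic behind the normal-approximation calibration.

```python
import numpy as np

def signed_root(x, p_ref):
    """Signed-root GLR statistic: sign(p_hat - p_ref) * sqrt(2 * l_n(p_hat, p_ref)),
    approximately N(0, 1) under p_ref for moderate n (normal-approximation heuristic)."""
    p_hat = sum(x) / len(x)
    return np.sign(p_hat - p_ref) * np.sqrt(2.0 * bernoulli_glr(x, p_ref))

# Example: signed-root statistic of 30 responses against p_ref = 0.6
responses = [1] * 22 + [0] * 8
print(signed_root(responses, p_ref=0.6))
```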
Exposure-control/content-balancing layers can be safely added when item-selection protocols satisfy long-run frequency existence.
7. Summary of Theoretical and Practical Advances
By deploying self-tuning GLR thresholds in modified Haybittle–Peto boundaries, rigorously calibrated via simulation or analytic approximations, the modern sequential testing framework for CAT and related domains:
- Enforces exact type-I/type-II error control at pre-specified levels $(\alpha, \beta)$,
- Guarantees not to exceed a user-chosen maximum test length $N$,
- Adapts in real time to individual subject ability,
- Achieves asymptotic optimality in expected sample size among all procedures meeting the constraints,
- Demonstrates in simulation 30–50% reduction in mean sample size compared to classical and fixed-length sequential approaches, with robust empirical and analytic validation (Bartroff et al., 2011).