CRSAIL: Conformalized Rejection Sampling for AIL
- The paper introduces CRSAIL, a query-efficient algorithm that uses conformal prediction and novelty metrics to selectively query expert demonstrations.
- Methodologically, CRSAIL employs a k-nearest neighbor state novelty score and a globally calibrated threshold to maintain rigorous query rate control.
- Empirical results on MuJoCo benchmarks demonstrate up to 96% query reduction compared to DAgger while achieving expert-level performance.
Conformalized Rejection Sampling for Active Imitation Learning (CRSAIL) is a query-efficient algorithm for active imitation learning (AIL) that leverages geometric state-space novelty and conformal prediction to select which states require expert demonstration. By coupling nearest-neighbor-based novelty assessment with a globally calibrated threshold, CRSAIL enables principled, distribution-free control of expert labeling budgets, significantly reducing query costs compared to methods such as DAgger while maintaining or exceeding expert-level performance (Firouzkouhi et al., 29 Nov 2025).
1. Problem Formulation and Motivation
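Imitation learning is framed within an unknown Markov Decision Process (MDP),

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r),$$

where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P$ the transition kernel, $r$ the reward (used for evaluation only), and episodes terminate on reaching a terminal state or at a maximum horizon $T$. The expert policy $\pi^*$ provides demonstration pairs $(s, \pi^*(s))$, each incurring a unit query cost. A parametric learner $\pi_\theta$ is trained to minimize the on-policy imitation loss,

$$J(\theta) = \mathbb{E}_{s \sim d_{\pi_\theta}}\big[\ell\big(\pi_\theta(s), \pi^*(s)\big)\big],$$

with states drawn from trajectories generated under $\pi_\theta$ (state distribution $d_{\pi_\theta}$) and $\ell$ a per-state imitation loss. Pure behavior cloning suffers from covariate shift, leading to compounding errors when $\pi_\theta$ visits underrepresented states. AIL mitigates this by querying the expert selectively, but the total cost is often dominated by the cost per demonstration, especially in GPU-intensive, human-in-the-loop, or repetitive-state settings.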
Existing techniques such as DAgger query too frequently, while others require real-time interventions or rely on action uncertainty, which does not always capture state-space novelty. CRSAIL addresses this by quantifying geometric novelty and applying conformal prediction to control the query rate through a single, globally calibrated threshold.
2. Novelty Quantification: $k$-th Nearest Neighbor Distance
The core of CRSAIL is its state-space novelty score based on the distance to the $k$-th nearest neighbor in the expert-labeled state set. For episode $i$, let $\mathcal{D}_i$ denote all expert-labeled state-action pairs, and $\mathcal{S}_i$ the projection of $\mathcal{D}_i$ onto states. For query state $s$, the nonconformity (novelty) score is defined as:

$$d_k(s; \mathcal{S}_i) = \lVert s - s_{(k)} \rVert_2,$$

where $s_{(k)}$ is the $k$-th nearest neighbor of $s$ in $\mathcal{S}_i$ under the Euclidean norm. This score measures the radius of the smallest Euclidean ball centered at $s$ encompassing at least $k$ expert states. High values indicate state-space sparsity, guiding the algorithm to query only in underrepresented regions.
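The novelty score described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `knn_novelty`, its signature, and the toy data are assumptions made for this example.

```python
import numpy as np

def knn_novelty(query_state, expert_states, k=1):
    """Euclidean distance from query_state to its k-th nearest expert state.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    expert_states = np.asarray(expert_states, dtype=float)
    query_state = np.asarray(query_state, dtype=float)
    if not 1 <= k <= len(expert_states):
        raise ValueError("k must be between 1 and the number of expert states")
    # Brute-force distances from the query to every expert-labeled state.
    dists = np.linalg.norm(expert_states - query_state, axis=1)
    # np.partition places the k-th smallest distance at index k - 1.
    return float(np.partition(dists, k - 1)[k - 1])

# Toy expert-labeled states in a 2-D state space (made-up values).
expert_states = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
near = knn_novelty([0.1, 0.0], expert_states, k=2)    # well-covered region
far = knn_novelty([10.0, 10.0], expert_states, k=2)   # sparse region
```

A state in a well-covered region (`near`) receives a small score, while a state far from all expert data (`far`) receives a large one, which is what makes the score usable as a query trigger.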
3. Conformal Calibration: Distribution-Free Threshold Selection
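CRSAIL sets a single query threshold $\tau$ via conformal prediction, enabling rigorous statistical control of the expected query rate $\alpha$. Calibration proceeds as follows:
- Calibration Dataset: Execute $N_{\text{cal}}$ episodes under the initial behavior-cloned policy $\pi_{\theta_0}$, collecting all visited states into $\mathcal{S}_{\text{cal}}$.
- Score Computation: For each $s \in \mathcal{S}_{\text{cal}}$, calculate $d_k(s; \mathcal{S}_0)$, its $k$-th nearest-neighbor distance to the initial expert-labeled states $\mathcal{S}_0$.
- Threshold Selection: Define $n = |\mathcal{S}_{\text{cal}}|$ and $m = \lceil (n+1)(1-\alpha) \rceil$, and set $\tau = d_{(m)}$, with $d_{(1)} \le \dots \le d_{(n)}$ the sorted scores.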
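Under exchangeability, the conformal guarantee ensures:

$$\Pr\big[d_k(s_{\text{new}}; \mathcal{S}_0) > \tau\big] \le \alpha,$$

so at most an $\alpha$ fraction of new states will be queried in expectation (with $\tau$ the calibrated threshold and $\alpha$ the target query rate). Larger $\alpha$ lowers $\tau$ and increases the nominal query rate; smaller $\alpha$ raises $\tau$ and decreases queries. This statistic is robust to outliers due to the high quantile selection and the $k$-th-neighbor scoring.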
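The calibration step can be illustrated with the standard split-conformal quantile rule. The sketch below assumes that rule (the $\lceil (n+1)(1-\alpha) \rceil$-th smallest calibration score); the function name `conformal_threshold` and the synthetic exponential scores are assumptions for illustration only and are not results from the paper.

```python
import math
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Hypothetical helper; assumes the standard split-conformal quantile rule.
    """
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n = len(scores)
    m = math.ceil((n + 1) * (1.0 - alpha))
    if m > n:
        # Too few calibration points for this alpha: no finite threshold
        # satisfies the guarantee, so nothing would be queried.
        return float("inf")
    return float(scores[m - 1])

# Synthetic, exchangeable scores standing in for novelty scores (assumption).
rng = np.random.default_rng(0)
cal = rng.exponential(size=999)
tau = conformal_threshold(cal, alpha=0.1)

# Fraction of fresh scores exceeding tau, i.e. the simulated query rate.
fresh = rng.exponential(size=100_000)
rate = float(np.mean(fresh > tau))
```

When calibration and test scores are drawn from the same distribution, `rate` should land close to the chosen `alpha`. In CRSAIL the learner's state distribution shifts during training, so this idealized check only illustrates the exchangeable case.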
4. The CRSAIL Algorithm and Computational Properties
CRSAIL alternates between closed-loop rollouts, batch (post hoc) expert queries, dataset aggregation, and policy updates, governed by budgets on total environment steps and total expert queries. The protocol is:
Step 1: Radius Calibration
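- Roll out $\pi_{\theta_0}$ for $N_{\text{cal}}$ episodes to collect $\mathcal{S}_{\text{cal}}$.
- Compute novelty scores $d_k(s; \mathcal{S}_0)$ for all $s \in \mathcal{S}_{\text{cal}}$.
- Set $\tau = d_{(m)}$, where $m = \lceil (|\mathcal{S}_{\text{cal}}|+1)(1-\alpha) \rceil$.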
Step 2: Iterative Training
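- At iteration $i$, roll out $\pi_{\theta_i}$, obtaining a trajectory $\xi_i = (s_1, \dots, s_{T_i})$.
- Query the expert at $s_t$ if $d_k(s_t; \mathcal{S}_i) > \tau$, forming $\mathcal{D}_i^{\text{new}} = \{(s_t, \pi^*(s_t)) : d_k(s_t; \mathcal{S}_i) > \tau\}$.
- Aggregate: $\mathcal{D}_{i+1} = \mathcal{D}_i \cup \mathcal{D}_i^{\text{new}}$.
- Update policy: $\theta_{i+1} = \arg\min_\theta \sum_{(s,a) \in \mathcal{D}_{i+1}} \ell(\pi_\theta(s), a)$.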
- Increment counters and repeat until budgets are exhausted.
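The query-and-aggregate part of one training iteration might look roughly like the sketch below. It is an illustration under stated assumptions: `kth_nn_distance`, `crsail_iteration`, the scalar-action `expert` callable, and the toy data are all hypothetical, and the rollout, policy update, and budget bookkeeping are deliberately left out.

```python
import numpy as np

def kth_nn_distance(states, reference, k):
    """k-th nearest-neighbor Euclidean distance from each row of states to reference."""
    diffs = states[:, None, :] - reference[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)            # shape (n_states, n_reference)
    return np.partition(dists, k - 1, axis=1)[:, k - 1]

def crsail_iteration(trajectory, data_states, data_actions, expert, tau, k):
    """Post hoc query step: label only the visited states whose score exceeds tau.

    Returns the aggregated states, aggregated actions, and number of queries.
    """
    # Scores are computed against the dataset as it was before this episode.
    scores = kth_nn_distance(trajectory, data_states, k)
    novel = trajectory[scores > tau]
    if len(novel) == 0:
        return data_states, data_actions, 0
    labels = np.array([expert(s) for s in novel])    # one unit of query cost each
    return (np.vstack([data_states, novel]),
            np.concatenate([data_actions, labels]),
            len(novel))

# Hypothetical expert with scalar actions, plus made-up toy data.
expert = lambda s: -float(np.sum(s))
data_states = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
data_actions = np.array([expert(s) for s in data_states])
trajectory = np.array([[0.05, 0.05], [2.0, 2.0], [0.0, 0.05]])

data_states, data_actions, n_queries = crsail_iteration(
    trajectory, data_states, data_actions, expert, tau=0.5, k=1)
```

In this toy run only the state at `[2.0, 2.0]` lies outside the calibrated radius, so a single expert label is requested. A full loop would refit the policy on the aggregated data and stop once the step or query budget is spent.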
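Naive complexity per episode is $O(T\,|\mathcal{S}_i|\,d)$ for distance computations and $O(T\,|\mathcal{S}_i|)$ for $k$-th-nearest selection, for $T$ visited states of dimension $d$ scored against the expert-labeled state set $\mathcal{S}_i$. Batch computation and small $k$ render this overhead minor compared to environment simulation and policy optimization.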
5. Theoretical Guarantees and Hyperparameter Robustness
The conformally calibrated threshold $\tau$ provides finite-sample coverage: under exchangeability, at most an $\alpha$ fraction of new states should trigger queries in expectation. CRSAIL's query rate depends monotonically on $\alpha$, and performance is robust to both $\alpha$ and the neighbor index $k$. Empirically, an appropriately chosen $\alpha$ balances query efficiency and convergence rate. CRSAIL is less sensitive to its query threshold setting than action-uncertainty-based AIL, and all tested values of $k$ yielded robust convergence on benchmarks, with a small $k$ suggested as a default.
6. Empirical Results on MuJoCo Robotics Benchmarks
CRSAIL was evaluated on three MuJoCo environments (Inverted Double Pendulum, Pusher, and Hopper) against DAgger, EnsembleDAgger, and ThriftyDAgger using metrics such as convergence rate, queries to convergence, and total queries. Key findings, averaged over five offline datasets and all tested hyperparameter values, include:
- Inverted Double Pendulum: CRSAIL reduced queries by ~96% versus DAgger, and ~65% versus the best prior method.
- Pusher: ~72% fewer queries than DAgger, ~48% fewer than the best prior.
- Hopper: Still competitive, outperforming ThriftyDAgger in total queries and matching or exceeding EnsembleDAgger in efficiency.
Empirical query rates closely tracked the target rate $\alpha$ across all tasks, affirming the efficacy of conformal calibration in distribution-free query rate control.
| Environment | Query Savings vs DAgger | Query Savings vs Best Prior | Notes |
|---|---|---|---|
| Inverted Double Pendulum | ~96% | ~65% | 100% convergence; robust across $\alpha$ and $k$ |
| Pusher | ~72% | ~48% | 100% convergence; insensitive to hyperparameter choice |
| Hopper | Competitive | Superior to ThriftyDAgger | Task harder; state-space novelty less effective |
7. Discussion, Limitations, and Future Extensions
Advantages:
CRSAIL eliminates the need for real-time expert takeovers or action-uncertainty gating by adopting batch, post hoc querying. Its principled, distribution-free thresholding via conformal prediction exposes the target query rate $\alpha$ as a global, easily interpretable control for expert query budgets. State-space novelty avoids conflating aleatoric and epistemic uncertainty and requires no auxiliary networks or complex estimators. Hyperparameter robustness to both $\alpha$ and the neighbor index $k$ was observed empirically across all evaluated domains.
Limitations:
Conformal guarantees require exchangeability, which may be violated as the learner policy $\pi_\theta$ evolves, though query-rate control remains reliable empirically. Tasks where success pivots on fine-grained action choices in narrow state regions (e.g., Hopper) can reveal failure modes for state-space novelty approaches. The use of a static threshold may be suboptimal as coverage increases, potentially motivating nonstationary query policies.
Potential Extensions:
Natural directions include implementing a time-varying target rate $\alpha$ or recalibrating the threshold $\tau$ to promote query rate decay as state-space coverage improves; integrating action-space distances or learned adaptive metrics (e.g., Mahalanobis distances) into the novelty score; and extending conformal methods to nonexchangeable or online settings.
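As one illustration of the metric extension mentioned above, the Euclidean norm in the novelty score could be swapped for a Mahalanobis distance. The sketch below is speculative and is not part of CRSAIL as described: it estimates the covariance from the expert states themselves rather than learning a metric, and the function name, the `ridge` regularizer, and the toy data are assumptions.

```python
import numpy as np

def mahalanobis_knn_novelty(query_state, expert_states, k=1, ridge=1e-6):
    """k-th nearest-neighbor distance under a Mahalanobis metric.

    Hypothetical variant: the covariance is estimated from expert_states,
    with a small ridge term (an assumption) to keep it invertible.
    """
    expert_states = np.asarray(expert_states, dtype=float)
    query_state = np.asarray(query_state, dtype=float)
    d = expert_states.shape[1]
    cov = np.cov(expert_states, rowvar=False).reshape(d, d) + ridge * np.eye(d)
    # With inv(cov) = L @ L.T, the Mahalanobis distance equals ||L.T @ diff||_2.
    L = np.linalg.cholesky(np.linalg.inv(cov))
    diffs = (expert_states - query_state) @ L
    dists = np.linalg.norm(diffs, axis=1)
    return float(np.partition(dists, k - 1)[k - 1])

# Made-up expert states: spread widely along x, tightly along y.
expert_states = np.array([[x, 0.01 * j] for x in range(-5, 6) for j in (-1, 0, 1)],
                         dtype=float)
novelty_x = mahalanobis_knn_novelty([7.0, 0.0], expert_states, k=1)
novelty_y = mahalanobis_knn_novelty([0.0, 2.0], expert_states, k=1)
```

Both query points sit about two Euclidean units from the nearest expert state, yet the deviation along the tightly clustered y-axis scores far higher. A Euclidean score would treat the two cases as nearly equal.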
CRSAIL represents a geometric, statistically principled approach to expert query control in AIL, providing state-of-the-art query efficiency alongside interpretability and practical deployment robustness (Firouzkouhi et al., 29 Nov 2025).