CRSAIL: Conformalized Rejection Sampling for AIL

Updated 6 December 2025
  • The paper introduces CRSAIL, a query-efficient algorithm that uses conformal prediction and novelty metrics to decide when to query the expert for demonstrations.
  • Methodologically, CRSAIL employs a k-nearest neighbor state novelty score and a globally calibrated threshold to maintain rigorous query rate control.
  • Empirical results on MuJoCo benchmarks demonstrate up to 96% query reduction compared to DAgger while achieving expert-level performance.

Conformalized Rejection Sampling for Active Imitation Learning (CRSAIL) is a query-efficient algorithm for active imitation learning (AIL) that leverages geometric state-space novelty and conformal prediction to select which states require expert demonstration. By coupling nearest-neighbor-based novelty assessment with a globally calibrated threshold, CRSAIL enables principled, distribution-free control of expert labeling budgets, significantly reducing query costs compared to methods such as DAgger while maintaining or exceeding expert-level performance (Firouzkouhi et al., 29 Nov 2025).

1. Problem Formulation and Motivation

Imitation learning is framed within an unknown Markov Decision Process (MDP),

$$M = (\mathcal{X}, \mathcal{U}, P, r, \mathcal{X}_0, \mathcal{X}_T, T_{\max}),$$

where $\mathcal{X} \subset \mathbb{R}^d$ is the state space, $\mathcal{U}$ is the action space, $P$ the transition kernel, $r$ the reward (used for evaluation only), and episodes terminate on $\mathcal{X}_T$ or at $T_{\max}$. The expert policy $\pi_E$ provides demonstration pairs $(x, u_E)$, each incurring a unit query cost. A parametric learner $\pi_\theta$ is trained to minimize the on-policy imitation loss,

$$J(\theta) = \mathbb{E}_{x_0 \sim \mathcal{X}_0}\, \mathbb{E}\!\left[ \sum_{t=0}^{L-1} \ell\big(\pi_\theta(x_t), u_{E,t}\big) \right],$$

with trajectories of length $L$ generated under $\pi_\theta$. Pure behavior cloning suffers from covariate shift, leading to compounding errors when $\pi_\theta$ visits underrepresented states. AIL mitigates this by querying the expert selectively, but total training cost is often dominated by the per-demonstration query cost, especially in GPU-intensive, human-in-the-loop, or repetitive-state settings.
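
As a small illustration of this objective, the sketch below computes a Monte Carlo estimate of $J(\theta)$ from on-policy trajectories; the `policy_fn`, `loss_fn` callables and the trajectory format are illustrative assumptions, not interfaces from the paper.

```python
import numpy as np

def imitation_loss(policy_fn, loss_fn, trajectories):
    """Monte Carlo estimate of J(theta): the expected sum of per-step losses
    between learner actions and expert labels along on-policy trajectories.

    trajectories: list of episodes, each a list of (state, expert_action)
                  pairs collected by rolling out pi_theta and querying the expert.
    """
    per_episode = [
        sum(loss_fn(policy_fn(x), u_e) for x, u_e in episode)
        for episode in trajectories
    ]
    return float(np.mean(per_episode))
```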

Existing techniques such as DAgger query too frequently, while others require real-time interventions or rely on action uncertainty, which does not always capture state-space novelty. CRSAIL addresses this by quantifying geometric novelty and applying conformal prediction to control the query rate through a single, globally calibrated threshold.

2. Novelty Quantification: $K$-th Nearest Neighbor Distance

The core of CRSAIL is its state-space novelty score, based on the distance to the $K$-th nearest neighbor in the expert-labeled state set. For episode $i$, let $D^{(i)}_{\text{exp}}$ denote all expert-labeled state-action pairs, and $D_X^{(i)} = \{ x : (x, u) \in D^{(i)}_{\text{exp}} \}$ the projection onto states. For a query state $x$, the nonconformity (novelty) score is defined as:

$$s_K(x; D^{(i)}_{\text{exp}}) := \inf\{ r \ge 0 : |B(x, r) \cap D_X^{(i)}| \ge K \},$$

where $B(x, r) = \{ z : \|z - x\| \le r \}$. This score measures the radius of the smallest Euclidean ball centered at $x$ encompassing at least $K$ expert states. High $s_K$ values indicate state-space sparsity, guiding the algorithm to query only in underrepresented regions.
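
A minimal sketch of this score in Python, assuming states are flat $d$-dimensional NumPy arrays; the helper name and the use of SciPy's KD-tree are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def novelty_scores(query_states, expert_states, k=5):
    """s_K(x; D_exp): Euclidean distance from each query state to its K-th
    nearest neighbor in the expert-labeled state set, i.e. the radius of the
    smallest ball around x containing at least K expert states."""
    tree = cKDTree(np.asarray(expert_states))            # (N, d) expert states
    dists, _ = tree.query(np.asarray(query_states), k=k) # 1st..k-th NN distances
    dists = np.asarray(dists)
    return dists if k == 1 else dists[:, -1]              # K-th NN distance
```

With `k=1` the score reduces to the distance to the closest expert-labeled state; larger `k` makes it less sensitive to isolated points in the expert set.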

3. Conformal Calibration: Distribution-Free Threshold Selection

CRSAIL sets a single query threshold $R$ via conformal prediction, enabling rigorous statistical control of the expected query rate $\alpha$. Calibration proceeds as follows (a minimal code sketch appears after the list):

  1. Calibration Dataset: Execute $M_{\text{cal}}$ episodes under the initial behavior-cloned policy $\pi_{\theta_0}$, collecting all visited states into $X_{\text{cal}} = \{ x_j \}_{j=1}^{N_{\text{cal}}}$.
  2. Score Computation: For each $x_j$, calculate $s_j = s_K(x_j; D^{(0)}_{\text{exp}})$.
  3. Threshold Selection: Define $m = \lceil (N_{\text{cal}} + 1)(1 - \alpha) \rceil$ and set $R = s_{(m)}$, with $\{ s_{(1)}, \ldots, s_{(N_{\text{cal}})} \}$ the sorted scores.
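
A minimal sketch of steps 2–3, assuming the calibration scores have been computed (e.g., with the `novelty_scores` helper above); the clipping of $m$ into $[1, N_{\text{cal}}]$ is an implementation guard, not part of the paper's description.

```python
import math
import numpy as np

def calibrate_threshold(cal_scores, alpha):
    """Conformal threshold R = s_(m) with m = ceil((N_cal + 1) * (1 - alpha)),
    where s_(1) <= ... <= s_(N_cal) are the sorted calibration scores."""
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n_cal = scores.size
    m = math.ceil((n_cal + 1) * (1.0 - alpha))
    m = min(max(m, 1), n_cal)        # guard: keep m within 1..N_cal
    return float(scores[m - 1])      # m-th order statistic (1-indexed)
```

For example, `R = calibrate_threshold(novelty_scores(X_cal, expert_states, k=5), alpha)` would tie the two sketches together, with `X_cal` the collected calibration states and `expert_states` the state projection of the initial expert dataset (names assumed for illustration).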

Under exchangeability, the conformal guarantee ensures:

$$P_{x \sim \text{on-policy}}\big( s_K(x; D^{(0)}_{\text{exp}}) \le R \big) \ge 1 - \alpha,$$

so at most an $\alpha$ fraction of new states will be queried in expectation. Larger $\alpha$ lowers $R$ and increases the nominal query rate; smaller $\alpha$ raises $R$ and decreases queries. The threshold is robust to outliers due to the high-quantile selection and the $K$-th-neighbor scoring.
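
As a concrete illustration of the quantile rule (numbers chosen for illustration only):

$$N_{\text{cal}} = 999,\ \alpha = 0.1 \;\Rightarrow\; m = \lceil 1000 \cdot 0.9 \rceil = 900,\qquad R = s_{(900)},$$

so roughly the top 10% most novel calibration states exceed $R$, and at most about 10% of on-policy states are expected to trigger queries.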

4. The CRSAIL Algorithm and Computational Properties

CRSAIL alternates between closed-loop rollouts, batch (post hoc) expert queries, dataset aggregation, and policy updates, governed by budgets for total environment steps $T_{\text{train}}$ and queries $B$. The protocol is:

Step 1: Radius Calibration

  • Roll out $\pi_{\theta_0}$ for $M_{\text{cal}}$ episodes to collect $X_{\text{cal}}$.
  • Compute novelty scores $s_j = s_K(x_j; D^{(0)}_{\text{exp}})$ for all $x_j$.
  • Set $R \leftarrow s_{(m)}$, where $m = \lceil (N_{\text{cal}} + 1)(1 - \alpha) \rceil$.

Step 2: Iterative Training

  • At iteration $i$, roll out $\pi_{\theta_i}$, obtaining a trajectory $\{x_t\}_{t=0}^{L_i - 1}$.
  • Query the expert at $x_t$ if $s_K(x_t; D^{(i)}_{\text{exp}}) > R$, forming $Q_i$.
  • Aggregate: $D^{(i+1)}_{\text{exp}} \leftarrow D^{(i)}_{\text{exp}} \cup Q_i$.
  • Update policy: $\theta_{i+1} \leftarrow \text{Update}(\theta_i, D^{(i+1)}_{\text{exp}})$.
  • Increment counters and repeat until budgets are exhausted (see the sketch after this list).
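
A minimal sketch of Step 2, reusing the `novelty_scores` helper from Section 2; the `rollout_fn`, `expert_fn`, and `update_fn` callables and the budget handling are illustrative assumptions rather than the paper's implementation.

```python
def crsail_train(rollout_fn, expert_fn, update_fn, policy,
                 expert_states, expert_actions, R, k=5,
                 total_steps=100_000, query_budget=1_000):
    """Iterative CRSAIL phase: roll out, batch-query novel states, aggregate, update.

    rollout_fn(policy)                  -> list of visited states for one episode
    expert_fn(state)                    -> expert action for a queried state
    update_fn(policy, states, actions)  -> policy retrained on the aggregated data
    """
    steps_used, queries_used = 0, 0
    while steps_used < total_steps and queries_used < query_budget:
        traj = rollout_fn(policy)                       # closed-loop rollout under pi_theta_i
        steps_used += len(traj)

        # Post hoc batch querying: only states whose K-th NN distance to the
        # current expert-labeled set exceeds the calibrated radius R are labeled.
        scores = novelty_scores(traj, expert_states, k=k)
        novel = [x for x, s in zip(traj, scores) if s > R]
        novel = novel[: max(0, query_budget - queries_used)]   # respect query budget B
        queries_used += len(novel)

        for x in novel:                                 # dataset aggregation Q_i
            expert_states.append(x)
            expert_actions.append(expert_fn(x))
        policy = update_fn(policy, expert_states, expert_actions)  # policy update
    return policy
```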

Naive complexity per episode is $O(d \cdot |D_{\text{exp}}| \cdot L_i)$ for distance computations and $O(L_i \cdot |D_{\text{exp}}|)$ for $K$-nearest selection. Batch computation and a small $K$ (e.g., $K = 5$) render this overhead minor compared to environment simulation and policy optimization.

5. Theoretical Guarantees and Hyperparameter Robustness

The conformally calibrated threshold $R$ provides finite-sample coverage: under exchangeability, at most an $\alpha$ fraction of new states should trigger queries in expectation. The query rate depends monotonically on $\alpha$, and performance is robust to both $\alpha$ and $K$. Empirically, setting $\alpha$ in $[0.9, 0.95]$ balances query efficiency and convergence rate. CRSAIL is less sensitive to $\alpha$ than action-uncertainty-based AIL, and all $K \in \{1, 3, 5, 7, 9\}$ yielded robust convergence on the benchmarks, with $K = 5$ suggested as a default.

6. Empirical Results on MuJoCo Robotics Benchmarks

CRSAIL was evaluated on the MuJoCo environments Inverted Double Pendulum, Pusher, and Hopper against DAgger, EnsembleDAgger, and ThriftyDAgger, using metrics such as convergence rate, queries to convergence, and total queries. Key findings, averaged over five offline datasets and all $M$ values, include:

  • Inverted Double Pendulum: CRSAIL reduced queries by ~96% versus DAgger, and ~65% versus the best prior method.
  • Pusher: ~72% fewer queries than DAgger, ~48% fewer than the best prior.
  • Hopper: Still competitive, outperforming ThriftyDAgger in total queries and matching or exceeding EnsembleDAgger in efficiency.

Empirical query rates closely tracked $\alpha$ across all tasks, affirming the efficacy of conformal calibration in distribution-free query rate control.

| Environment | Query Savings vs DAgger | Query Savings vs Best Prior | Notes |
| --- | --- | --- | --- |
| Inverted Double Pendulum | ~96% | ~65% | 100% convergence; robust across $\alpha$, $K$ |
| Pusher | ~72% | ~48% | 100% convergence; insensitive to $K$ |
| Hopper | Competitive | Superior to ThriftyDAgger | Task harder; state-space novelty less effective |

7. Discussion, Limitations, and Future Extensions

Advantages:

CRSAIL eliminates the need for real-time expert takeovers or action-uncertainty gating by adopting batch, post hoc querying. Its principled, distribution-free thresholding via conformal prediction exposes $\alpha$ as a global, easily interpretable control for expert query budgets. State-space novelty avoids conflating aleatoric and epistemic uncertainty and requires no auxiliary networks or complex estimators. Hyperparameter robustness to both $\alpha$ and $K$ was observed empirically across all evaluated domains.

Limitations:

Conformal guarantees require exchangeability, which may be violated as $\pi_\theta$ evolves, though $\alpha$ remains a reliable control empirically. Tasks where success pivots on fine-grained action choices in narrow state regions (e.g., Hopper) can reveal failure modes for state-space novelty approaches. The use of a static threshold $R$ may be suboptimal as coverage increases, potentially motivating nonstationary query policies.

Potential Extensions:

Natural directions include implementing a time-varying $\alpha$ or recalibrating $R$ to promote query-rate decay as state-space coverage improves; integrating action-space distances or learned adaptive metrics (e.g., Mahalanobis distances) into $s_K$; and extending conformal methods to nonexchangeable or online settings.

CRSAIL represents a geometric, statistically principled approach to expert query control in AIL, providing state-of-the-art query efficiency alongside interpretability and practical deployment robustness (Firouzkouhi et al., 29 Nov 2025).

References

1. Firouzkouhi et al. "Conformalized Rejection Sampling for Active Imitation Learning (CRSAIL)." 29 Nov 2025.
