CRSAIL: Conformalized Rejection Sampling for AIL

Updated 6 December 2025
  • The paper introduces CRSAIL, a query-efficient algorithm that uses conformal prediction and novelty metrics to decide when to query the expert for demonstrations.
  • Methodologically, CRSAIL employs a k-nearest neighbor state novelty score and a globally calibrated threshold to maintain rigorous query rate control.
  • Empirical results on MuJoCo benchmarks demonstrate up to 96% query reduction compared to DAgger while achieving expert-level performance.

Conformalized Rejection Sampling for Active Imitation Learning (CRSAIL) is a query-efficient algorithm for active imitation learning (AIL) that leverages geometric state-space novelty and conformal prediction to select which states require expert demonstration. By coupling nearest-neighbor-based novelty assessment with a globally calibrated threshold, CRSAIL enables principled, distribution-free control of expert labeling budgets, significantly reducing query costs compared to methods such as DAgger while maintaining or exceeding expert-level performance (Firouzkouhi et al., 29 Nov 2025).

1. Problem Formulation and Motivation

Imitation learning is framed within an unknown Markov Decision Process (MDP),

$$M = (\mathcal{X}, \mathcal{U}, P, r, \mathcal{X}_0, \mathcal{X}_T, T_{\max}),$$

where $\mathcal{X} \subset \mathbb{R}^d$ is the state space, $\mathcal{U}$ is the action space, $P$ the transition kernel, $r$ the reward (used for evaluation only), and episodes terminate on $\mathcal{X}_T$ or at $T_{\max}$. The expert policy $\pi_E$ provides demonstration pairs $(x, u_E)$, each incurring a unit query cost. A parametric learner $\pi_\theta$ is trained to minimize the on-policy imitation loss,

$$J(\theta) = \mathbb{E}_{x_0 \sim \mathcal{X}_0}\, \mathbb{E}\!\left[ \sum_{t=0}^{L-1} \ell\big(\pi_\theta(x_t), u_{E,t}\big) \right],$$

with trajectories of length $L$ generated under $\pi_\theta$. Pure behavior cloning suffers from covariate shift, leading to compounding errors when $\pi_\theta$ visits underrepresented states. AIL mitigates this by querying the expert selectively, but total training cost is often dominated by the per-demonstration query cost, especially in GPU-intensive, human-in-the-loop, or repetitive-state settings.
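
As a small illustration of this objective, the sketch below computes a Monte Carlo estimate of $J(\theta)$ from on-policy trajectories; the `policy_fn`, `loss_fn` callables and the trajectory format are illustrative assumptions, not interfaces from the paper.

```python
import numpy as np

def imitation_loss(policy_fn, loss_fn, trajectories):
    """Monte Carlo estimate of J(theta): the expected sum of per-step losses
    between learner actions and expert labels along on-policy trajectories.

    trajectories: list of episodes, each a list of (state, expert_action)
                  pairs collected by rolling out pi_theta and querying the expert.
    """
    per_episode = [
        sum(loss_fn(policy_fn(x), u_e) for x, u_e in episode)
        for episode in trajectories
    ]
    return float(np.mean(per_episode))
```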

Existing techniques such as DAgger query too frequently, while others require real-time interventions or rely on action uncertainty, which does not always capture state-space novelty. CRSAIL addresses this by quantifying geometric novelty and applying conformal prediction to control the query rate through a single, globally calibrated threshold.

2. Novelty Quantification: $K$-th Nearest Neighbor Distance

The core of CRSAIL is its state-space novelty score, based on the distance to the $K$-th nearest neighbor in the expert-labeled state set. For episode $i$, let $D^{(i)}_{\text{exp}}$ denote all expert-labeled state-action pairs, and $D_X^{(i)} = \{ x : (x, u) \in D^{(i)}_{\text{exp}} \}$ the projection onto states. For a query state $x$, the nonconformity (novelty) score is defined as:

$$s_K(x; D^{(i)}_{\text{exp}}) := \inf\{ r \ge 0 : |B(x, r) \cap D_X^{(i)}| \ge K \},$$

where $B(x, r) = \{ z : \|z - x\| \le r \}$. This score measures the radius of the smallest Euclidean ball centered at $x$ encompassing at least $K$ expert states. High $s_K$ values indicate state-space sparsity, guiding the algorithm to query only in underrepresented regions.
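
A minimal sketch of this score in Python, assuming states are flat $d$-dimensional NumPy arrays; the helper name and the use of SciPy's KD-tree are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def novelty_scores(query_states, expert_states, k=5):
    """s_K(x; D_exp): Euclidean distance from each query state to its K-th
    nearest neighbor in the expert-labeled state set, i.e. the radius of the
    smallest ball around x containing at least K expert states."""
    tree = cKDTree(np.asarray(expert_states))            # (N, d) expert states
    dists, _ = tree.query(np.asarray(query_states), k=k) # 1st..k-th NN distances
    dists = np.asarray(dists)
    return dists if k == 1 else dists[:, -1]              # K-th NN distance
```

With `k=1` the score reduces to the distance to the closest expert-labeled state; larger `k` makes it less sensitive to isolated points in the expert set.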

3. Conformal Calibration: Distribution-Free Threshold Selection

CRSAIL sets a single query threshold $R$ via conformal prediction, enabling rigorous statistical control of the expected query rate $\alpha$. Calibration proceeds as follows (a minimal code sketch appears after the list):

  1. Calibration Dataset: Execute $M_{\text{cal}}$ episodes under the initial behavior-cloned policy $\pi_{\theta_0}$, collecting all visited states into $X_{\text{cal}} = \{ x_j \}_{j=1}^{N_{\text{cal}}}$.
  2. Score Computation: For each $x_j$, calculate $s_j = s_K(x_j; D^{(0)}_{\text{exp}})$.
  3. Threshold Selection: Define $m = \lceil (N_{\text{cal}} + 1)(1 - \alpha) \rceil$ and set $R = s_{(m)}$, with $\{ s_{(1)}, \ldots, s_{(N_{\text{cal}})} \}$ the sorted scores.
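
A minimal sketch of steps 2–3, assuming the calibration scores have been computed (e.g., with the `novelty_scores` helper above); the clipping of $m$ into $[1, N_{\text{cal}}]$ is an implementation guard, not part of the paper's description.

```python
import math
import numpy as np

def calibrate_threshold(cal_scores, alpha):
    """Conformal threshold R = s_(m) with m = ceil((N_cal + 1) * (1 - alpha)),
    where s_(1) <= ... <= s_(N_cal) are the sorted calibration scores."""
    scores = np.sort(np.asarray(cal_scores, dtype=float))
    n_cal = scores.size
    m = math.ceil((n_cal + 1) * (1.0 - alpha))
    m = min(max(m, 1), n_cal)        # guard: keep m within 1..N_cal
    return float(scores[m - 1])      # m-th order statistic (1-indexed)
```

For example, `R = calibrate_threshold(novelty_scores(X_cal, expert_states, k=5), alpha)` would tie the two sketches together, with `X_cal` the collected calibration states and `expert_states` the state projection of the initial expert dataset (names assumed for illustration).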

Under exchangeability, the conformal guarantee ensures:

$$P_{x \sim \text{on-policy}}\big( s_K(x; D^{(0)}_{\text{exp}}) \le R \big) \ge 1 - \alpha,$$

so at most an $\alpha$ fraction of new states will be queried in expectation. Larger $\alpha$ lowers $R$ and increases the nominal query rate; smaller $\alpha$ raises $R$ and decreases queries. The threshold is robust to outliers due to the high-quantile selection and the $K$-th-neighbor scoring.
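
As a concrete illustration of the quantile rule (numbers chosen for illustration only):

$$N_{\text{cal}} = 999,\ \alpha = 0.1 \;\Rightarrow\; m = \lceil 1000 \cdot 0.9 \rceil = 900,\qquad R = s_{(900)},$$

so roughly the top 10% most novel calibration states exceed $R$, and at most about 10% of on-policy states are expected to trigger queries.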

4. The CRSAIL Algorithm and Computational Properties

CRSAIL alternates between closed-loop rollouts, batch (post hoc) expert queries, dataset aggregation, and policy updates, governed by budgets for total environment steps $T_{\text{train}}$ and queries $B$. The protocol is:

Step 1: Radius Calibration

  • Roll out $\pi_{\theta_0}$ for $M_{\text{cal}}$ episodes to collect $X_{\text{cal}}$.
  • Compute novelty scores $s_j = s_K(x_j; D^{(0)}_{\text{exp}})$ for all $x_j$.
  • Set $R \leftarrow s_{(m)}$, where $m = \lceil (N_{\text{cal}} + 1)(1 - \alpha) \rceil$.

Step 2: Iterative Training

  • At iteration $i$, roll out $\pi_{\theta_i}$, obtaining a trajectory $\{x_t\}_{t=0}^{L_i - 1}$.
  • Query the expert at $x_t$ if $s_K(x_t; D^{(i)}_{\text{exp}}) > R$, forming $Q_i$.
  • Aggregate: $D^{(i+1)}_{\text{exp}} \leftarrow D^{(i)}_{\text{exp}} \cup Q_i$.
  • Update policy: $\theta_{i+1} \leftarrow \text{Update}(\theta_i, D^{(i+1)}_{\text{exp}})$.
  • Increment counters and repeat until budgets are exhausted (see the sketch after this list).
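
A minimal sketch of Step 2, reusing the `novelty_scores` helper from Section 2; the `rollout_fn`, `expert_fn`, and `update_fn` callables and the budget handling are illustrative assumptions rather than the paper's implementation.

```python
def crsail_train(rollout_fn, expert_fn, update_fn, policy,
                 expert_states, expert_actions, R, k=5,
                 total_steps=100_000, query_budget=1_000):
    """Iterative CRSAIL phase: roll out, batch-query novel states, aggregate, update.

    rollout_fn(policy)                  -> list of visited states for one episode
    expert_fn(state)                    -> expert action for a queried state
    update_fn(policy, states, actions)  -> policy retrained on the aggregated data
    """
    steps_used, queries_used = 0, 0
    while steps_used < total_steps and queries_used < query_budget:
        traj = rollout_fn(policy)                       # closed-loop rollout under pi_theta_i
        steps_used += len(traj)

        # Post hoc batch querying: only states whose K-th NN distance to the
        # current expert-labeled set exceeds the calibrated radius R are labeled.
        scores = novelty_scores(traj, expert_states, k=k)
        novel = [x for x, s in zip(traj, scores) if s > R]
        novel = novel[: max(0, query_budget - queries_used)]   # respect query budget B
        queries_used += len(novel)

        for x in novel:                                 # dataset aggregation Q_i
            expert_states.append(x)
            expert_actions.append(expert_fn(x))
        policy = update_fn(policy, expert_states, expert_actions)  # policy update
    return policy
```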

Naive complexity per episode is $O(d \cdot |D_{\text{exp}}| \cdot L_i)$ for distance computations and $O(L_i \cdot |D_{\text{exp}}|)$ for $K$-nearest selection. Batch computation and a small $K$ (e.g., $K = 5$) render this overhead minor compared to environment simulation and policy optimization.

5. Theoretical Guarantees and Hyperparameter Robustness

The conformally calibrated threshold $R$ provides finite-sample coverage: under exchangeability, at most an $\alpha$ fraction of new states should trigger queries in expectation. The query rate depends monotonically on $\alpha$, and performance is robust to both $\alpha$ and $K$. Empirically, setting $\alpha$ in $[0.9, 0.95]$ balances query efficiency and convergence rate. CRSAIL is less sensitive to $\alpha$ than action-uncertainty-based AIL, and all $K \in \{1, 3, 5, 7, 9\}$ yielded robust convergence on the benchmarks, with $K = 5$ suggested as a default.

6. Empirical Results on MuJoCo Robotics Benchmarks

CRSAIL was evaluated on the MuJoCo environments Inverted Double Pendulum, Pusher, and Hopper against DAgger, EnsembleDAgger, and ThriftyDAgger, using metrics such as convergence rate, queries to convergence, and total queries. Key findings, averaged over five offline datasets and all $M$ values, include:

  • Inverted Double Pendulum: CRSAIL reduced queries by ~96% versus DAgger, and ~65% versus the best prior method.
  • Pusher: ~72% fewer queries than DAgger, ~48% fewer than the best prior.
  • Hopper: Still competitive, outperforming ThriftyDAgger in total queries and matching or exceeding EnsembleDAgger in efficiency.

Empirical query rates closely tracked $\alpha$ across all tasks, affirming the efficacy of conformal calibration in distribution-free query rate control.

| Environment | Query Savings vs DAgger | Query Savings vs Best Prior | Notes |
| --- | --- | --- | --- |
| Inverted Double Pendulum | ~96% | ~65% | 100% convergence; robust across $\alpha$, $K$ |
| Pusher | ~72% | ~48% | 100% convergence; insensitive to $K$ |
| Hopper | Competitive | Superior to ThriftyDAgger | Task harder; state-space novelty less effective |

7. Discussion, Limitations, and Future Extensions

Advantages:

CRSAIL eliminates the need for real-time expert takeovers or action-uncertainty gating by adopting batch, post hoc querying. Its principled, distribution-free thresholding via conformal prediction exposes $\alpha$ as a global, easily interpretable control for expert query budgets. State-space novelty avoids conflating aleatoric and epistemic uncertainty and requires no auxiliary networks or complex estimators. Hyperparameter robustness to both $\alpha$ and $K$ was observed empirically across all evaluated domains.

Limitations:

Conformal guarantees require exchangeability, which may be violated as $\pi_\theta$ evolves, though $\alpha$ remains a reliable control empirically. Tasks where success pivots on fine-grained action choices in narrow state regions (e.g., Hopper) can reveal failure modes for state-space novelty approaches. The use of a static threshold $R$ may be suboptimal as coverage increases, potentially motivating nonstationary query policies.

Potential Extensions:

Natural directions include implementing a time-varying $\alpha$ or recalibrating $R$ to promote query-rate decay as state-space coverage improves; integrating action-space distances or learned adaptive metrics (e.g., Mahalanobis distances) into $s_K$; and extending conformal methods to nonexchangeable or online settings.

CRSAIL represents a geometric, statistically principled approach to expert query control in AIL, providing state-of-the-art query efficiency alongside interpretability and practical deployment robustness (Firouzkouhi et al., 29 Nov 2025).

References

1. Firouzkouhi et al. "Conformalized Rejection Sampling for Active Imitation Learning (CRSAIL)." 29 Nov 2025.
