Taylor Series Imitation Learning (TaSIL)
- Taylor Series Imitation Learning (TaSIL) is a framework that augments behavioral cloning by incorporating higher-order derivatives of the expert policy in order to reduce the imitation gap.
- It employs a composite loss that penalizes discrepancies in both action outputs and their derivatives along expert trajectories, yielding explicit closed-loop performance guarantees under stability assumptions on the expert.
- Empirical results demonstrate that TaSIL improves sample efficiency and robustness to distribution shift, achieving near-expert performance on challenging continuous control benchmarks.
Taylor Series Imitation Learning (TaSIL) is an imitation learning framework that augments conventional behavioral cloning by incorporating higher-order Taylor series terms of the expert policy into the imitation loss. By directly penalizing both function value and derivative mismatches between learned and expert policies along expert trajectories, TaSIL provides explicit closed-loop performance guarantees for continuous control, particularly under assumptions of system stability such as incremental input-to-state stability (δ-ISS) or contraction. The approach delivers robustness to policy-induced distribution shift and specifies sample-complexity requirements for high-probability generalization bounds. TaSIL has demonstrated empirical improvements over standard behavior cloning and related interactive imitation learning techniques when evaluated on a range of continuous control benchmarks (Pfrommer et al., 2022, Gahlawat et al., 19 Dec 2025).
1. Taylor-Augmented Imitation Loss
TaSIL modifies the standard behavioral cloning (BC) loss, which minimizes the mean squared error between learned and expert actions on recorded expert state trajectories, by also penalizing discrepancies in higher-order derivatives (Jacobian, Hessian, etc.) of the policy with respect to the state. The $p$-th order TaSIL loss for an expert trajectory $\xi^\star = \{x^\star_t\}_{t=0}^{T-1}$ is

$$\ell^{(p)}_{\mathrm{TaSIL}}(\pi;\xi^\star) \;=\; \frac{1}{T}\sum_{t=0}^{T-1}\sum_{k=0}^{p}\big\|\partial^k \pi(x^\star_t) - \partial^k \pi^\star(x^\star_t)\big\|,$$

where $\partial^0\pi = \pi$ and $\partial^k\pi$ denotes the $k$-th derivative of the policy with respect to the state. The empirical risk minimization (ERM) objective optimizes over the function class $\Pi$:

$$\hat\pi \in \operatorname*{arg\,min}_{\pi\in\Pi}\;\frac{1}{n}\sum_{i=1}^{n}\ell^{(p)}_{\mathrm{TaSIL}}(\pi;\xi^\star_i).$$

In practice, weighted combinations of the different-order discrepancies may be used:

$$\ell^{(p)}_{\lambda}(\pi;\xi^\star) \;=\; \frac{1}{T}\sum_{t=0}^{T-1}\sum_{k=0}^{p}\lambda_k\big\|\partial^k \pi(x^\star_t) - \partial^k \pi^\star(x^\star_t)\big\|.$$

Here, $\lambda_k \ge 0$ are user-chosen regularization weights for each derivative order (Pfrommer et al., 2022, Gahlawat et al., 19 Dec 2025).
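As a concrete illustration, the sketch below computes a weighted first-order ($p = 1$) version of this loss in JAX. The `policy` network, the pre-recorded `expert_actions` and `expert_jacobians` arrays, and the weights `lam0`, `lam1` are hypothetical stand-ins, and the Frobenius norm on the Jacobian mismatch is used as a convenient differentiable surrogate; this is a minimal sketch, not the reference implementation of (Pfrommer et al., 2022).

```python
import jax
import jax.numpy as jnp

def policy(params, x):
    """Hypothetical one-hidden-layer policy network: u = W2 @ tanh(W1 @ x + b1) + b2."""
    W1, b1, W2, b2 = params
    return W2 @ jnp.tanh(W1 @ x + b1) + b2

def tasil_loss_p1(params, states, expert_actions, expert_jacobians,
                  lam0=1.0, lam1=0.1):
    """Weighted zeroth- plus first-order TaSIL loss, averaged over one expert trajectory.

    states:           (T, n_x) expert states x_t*
    expert_actions:   (T, n_u) expert actions pi*(x_t*)
    expert_jacobians: (T, n_u, n_x) expert Jacobians d pi*/dx evaluated at x_t*
    """
    # Zeroth-order term: action mismatch pi(x) - pi*(x) at each expert state.
    pred_actions = jax.vmap(lambda x: policy(params, x))(states)
    zeroth = jnp.linalg.norm(pred_actions - expert_actions, axis=-1)

    # First-order term: Jacobian mismatch d pi/dx - d pi*/dx, measured in the
    # Frobenius norm as a differentiable surrogate for the operator norm.
    pred_jacs = jax.vmap(jax.jacfwd(lambda x: policy(params, x)))(states)
    first = jnp.linalg.norm(
        (pred_jacs - expert_jacobians).reshape(states.shape[0], -1), axis=-1)

    return jnp.mean(lam0 * zeroth + lam1 * first)
```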
2. Stability Assumptions and Imitation Gap
Closed-loop system stability is central to the theoretical guarantees of TaSIL. The key assumption is that the expert closed-loop system is incrementally input-to-state stable ($\delta$-ISS) or contracting. For a discrete-time system $x_{t+1} = f(x_t, \pi^\star(x_t) + \Delta_t)$, $\delta$-ISS means that for any bounded disturbance sequence $\{\Delta_t\}$,

$$\|x_t - x^\star_t\| \;\le\; \beta\big(\|x_0 - x^\star_0\|,\,t\big) + \gamma\Big(\sup_{0\le\tau<t}\|\Delta_\tau\|\Big),$$

where $x^\star_t$ is the undisturbed expert trajectory, $\beta$ is a class-$\mathcal{KL}$ function, and $\gamma$ is a class-$\mathcal{K}$ gain. For contracting continuous-time systems with closed-loop dynamics $\dot{x} = f_{\mathrm{cl}}(x)$, the assumption is the existence of a uniformly positive definite metric $M(x)$ and a rate $\lambda > 0$ such that

$$\dot{M}(x) + \Big(\frac{\partial f_{\mathrm{cl}}}{\partial x}\Big)^{\!\top} M(x) + M(x)\,\frac{\partial f_{\mathrm{cl}}}{\partial x} \;\preceq\; -2\lambda\, M(x),$$
implying exponential convergence of trajectories and robustness to policy perturbations (Gahlawat et al., 19 Dec 2025).
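As an assumed worked example (not drawn from the cited papers), the contraction condition can be certified in closed form for a linear closed-loop system: a constant metric $M$ follows from a Lyapunov equation and the rate $\lambda$ from a generalized eigenvalue problem.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, eigh

# Hypothetical expert closed loop: double integrator with PD feedback u = -K x,
# so xdot = (A - B K) x. Illustrative example, not from the cited papers.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[2.0, 3.0]])
A_cl = A - B @ K                                   # eigenvalues -1 and -2 (Hurwitz)

# Solve A_cl^T M + M A_cl = -Q for a constant contraction metric M.
Q = np.eye(2)
M = solve_continuous_lyapunov(A_cl.T, -Q)

# Largest rate lambda with A_cl^T M + M A_cl <= -2*lambda*M, i.e. Q >= 2*lambda*M.
lam = 0.5 * eigh(Q, M, eigvals_only=True).min()

print("metric M:\n", M)
print("contraction rate lambda:", lam)             # > 0 certifies exponential convergence
```

For nonlinear experts the same matrix inequality must hold along trajectories, which is typically certified with a hand-designed or learned metric rather than a single Lyapunov solve.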
3. Theoretical Guarantees and Sample Complexity
The central performance metric is the imitation gap, defined as the maximum expected state discrepancy between the learned and expert trajectories:

$$\Delta(\hat\pi) \;:=\; \sup_{0\le t\le T}\,\mathbb{E}_{x_0}\big[\|\hat{x}_t - x^\star_t\|\big],$$

where $\hat{x}_t$ and $x^\star_t$ are the closed-loop states under $\hat\pi$ and $\pi^\star$ started from the same initial condition $x_0$.
TaSIL provides finite-sample, high-probability bounds on the imitation gap, with rates depending on the expert system's sensitivity ($\delta$-ISS gain) function $\gamma$:
- If $\gamma(s) \lesssim s^{\alpha}$ with $\alpha > 1$ (the gain decays rapidly near the origin), matching only zeroth-order terms (standard BC with sufficiently small loss) suffices for robust imitation.
- If $\gamma(s) \lesssim s^{\alpha}$ with $\alpha \le 1$ (slow decay), it is necessary to match derivatives up to an order $p$ satisfying $\alpha(p+1) > 1$; in particular, first-order matching ($p = 1$) suffices for the linear gains ($\alpha = 1$) characteristic of exponentially $\delta$-ISS or contracting experts.
For the realizable case ($\pi^\star \in \Pi$), the expected TaSIL loss generalizes at rate $\tilde{\mathcal{O}}(d/n)$, where $d$ is the parameter dimension of the policy class and $n$ is the number of expert trajectories. The number of expert demonstrations required to achieve an imitation gap of at most $\varepsilon$ (with high probability) decreases as the robustness of the expert policy increases (Pfrommer et al., 2022, Gahlawat et al., 19 Dec 2025).
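A heuristic sketch of where the exponent condition comes from (this condenses the argument and is not the papers' formal proof): if $\hat\pi$ matches $\pi^\star$ up to order $p$ with per-order error $\varepsilon$ on the expert trajectory and the $(p+1)$-th derivatives are bounded by $L$, Taylor's theorem bounds the disturbance injected by the learned policy in terms of the deviation $\delta_t = \|\hat{x}_t - x^\star_t\|$,

$$\|\Delta_t\| = \|\hat\pi(\hat{x}_t) - \pi^\star(\hat{x}_t)\| \;\lesssim\; \varepsilon\,(1 + \delta_t^{\,p}) + L\,\delta_t^{\,p+1}.$$

Feeding this into the $\delta$-ISS bound with $\gamma(s) \lesssim s^{\alpha}$ yields the self-consistency condition

$$\delta \;\lesssim\; \big(\varepsilon\,(1+\delta^{\,p}) + L\,\delta^{\,p+1}\big)^{\alpha},$$

which admits a small solution $\delta = \mathcal{O}(\varepsilon^{\alpha})$ for small $\varepsilon$ precisely when the higher-order term is superlinear in $\delta$, i.e. when $\alpha(p+1) > 1$, matching the order requirement stated above.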
4. Algorithmic Implementation
The practical training procedure for TaSIL entails:
- Sampling minibatches of initial conditions and corresponding expert trajectories.
- Computing both action and derivative discrepancies between learned and expert policies at each step along the trajectory.
- Employing automatic differentiation or finite-difference methods to estimate derivatives if the expert is a black-box.
- Constructing and minimizing the composite Taylor-augmented empirical loss via stochastic gradient descent.
Pseudocode structure:
- For each expert trajectory, store states, expert actions, and their derivatives.
- For each training iteration:
  - Evaluate the learned policy and all required derivatives at the stored expert states.
  - Accumulate the loss across sample points and Taylor orders.
  - Update policy parameters via stochastic gradient descent.
Even when gradients of the expert are unavailable, finite-difference approximations along random directions can provide effective Jacobian estimates for first-order TaSIL (Pfrommer et al., 2022, Gahlawat et al., 19 Dec 2025).
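The sketch below illustrates one such finite-difference scheme, assuming the JAX stack; `expert` is a hypothetical black-box callable $x \mapsto \pi^\star(x)$ and `fd_jacobian` an illustrative helper, not an API from the cited papers.

```python
import jax
import jax.numpy as jnp

def fd_jacobian(expert, x, num_dirs=8, h=1e-3, key=jax.random.PRNGKey(0)):
    """Estimate the Jacobian of a black-box expert at x by central differences
    along random unit directions, then reconstruct it by least squares."""
    n = x.shape[0]
    V = jax.random.normal(key, (num_dirs, n))
    V = V / jnp.linalg.norm(V, axis=1, keepdims=True)        # random unit directions
    D = jnp.stack([(expert(x + h * v) - expert(x - h * v)) / (2 * h) for v in V])
    # Solve J @ V^T = D^T in least squares; with num_dirs >= dim(x) this recovers
    # the full Jacobian up to finite-difference error.
    return D.T @ jnp.linalg.pinv(V.T)

# Sanity check against a known linear "expert" u* = K x (hypothetical example):
K = jnp.array([[2.0, 3.0, 0.5]])
x0 = jnp.array([0.1, -0.2, 0.4])
J_hat = fd_jacobian(lambda x: K @ x, x0)
print(jnp.max(jnp.abs(J_hat - K)))   # small: the finite-difference estimate recovers K
```

The estimated Jacobians can be stored alongside the expert actions and substituted for the true expert Jacobians in the first-order loss of Section 1, which is then minimized with standard stochastic gradient methods as in the pseudocode above.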
5. Robustness to Distribution Shift
TaSIL addresses the compounding errors and distribution shift of standard imitation learning by penalizing mismatch in both the actions and the local sensitivity (derivatives) of the learned policy relative to the expert. This Taylor-based regularization is shown, theoretically and empirically, to reduce drift when the learned policy deviates from the expert trajectory during deployment. Unlike interactive approaches such as DAgger, TaSIL is a single-batch (offline) method that does not require querying the expert on states visited by the learned policy, yet it achieves comparable or superior closed-loop error guarantees, provided the expert policy is sufficiently smooth and differentiable (Pfrommer et al., 2022, Gahlawat et al., 19 Dec 2025).
6. Empirical Evaluation
TaSIL achieves significant performance improvements across a range of continuous control tasks:
- On MuJoCo benchmarks (e.g., Walker2d-v3, HalfCheetah-v3, Ant-v3, Humanoid-v3), BC augmented with first-order TaSIL (BC+1-TaSIL) attains near-expert cumulative reward with as few as 20–50 expert demonstrations, particularly on challenging tasks where standard BC fails in low-data regimes.
- Augmenting DAgger or DART with TaSIL further improves sample efficiency and performance.
- On less challenging environments (e.g., Swimmer), all methods, including standard BC, quickly match the expert.
- Synthetic experiments varying the exponent $\alpha$ in $\gamma(s) \propto s^{\alpha}$ confirm the theoretical predictions: matching higher-order TaSIL terms becomes necessary as the expert's robustness exponent $\alpha$ decreases (Pfrommer et al., 2022).
7. Practical Considerations and Limitations
- The choice of Taylor order $p$ should reflect the robustness of the expert policy; highly robust (contracting) experts may only require $p = 1$, while fragile experts necessitate higher $p$.
- Regularization weights $\lambda_k$ for each Taylor order are often tuned on a validation set, with decreasing weight for higher derivatives due to their growing dimensionality.
- Finite-difference gradient estimation is effective when expert models are non-differentiable or available only as black boxes.
- Computational cost increases with the Taylor order $p$: the $k$-th derivative of a policy mapping $\mathbb{R}^{n_x} \to \mathbb{R}^{n_u}$ has $n_u n_x^{k}$ entries, so memory and compute scale as $\mathcal{O}(n_x^{p})$ in the state dimension $n_x$; in practice, $p = 1$ captures most of the benefit with manageable overhead.
- TaSIL relies on access to expert derivatives or their estimates; it does not, in isolation, address distribution shifts due to aleatoric disturbances or model mismatch, motivating the integration with robust control layers in architectures such as DRIP (Gahlawat et al., 19 Dec 2025).
References
- "TaSIL: Taylor Series Imitation Learning" (Pfrommer et al., 2022)
- "Distributionally Robust Imitation Learning: Layered Control Architecture for Certifiable Autonomy" (Gahlawat et al., 19 Dec 2025)