Compliant Residual DAgger (CR-DAgger)
- Compliant Residual DAgger (CR-DAgger) is a method for improving robot manipulation by allowing humans to provide smooth, real-time force-aware corrections to a base policy.
- It employs a compliant kinesthetic interface for human delta corrections and trains a lightweight residual policy to predict adjustments based on sensory and force inputs.
- CR-DAgger substantially enhances success rates in contact-rich tasks with minimal human effort, outperforming standard policy retraining or finetuning techniques.
Compliant Residual DAgger (CR-DAgger) is a dataset aggregation (DAgger) variant designed for real-world contact-rich robotic manipulation, with emphasis on enabling efficient policy improvement through human corrections. CR-DAgger introduces a compliant intervention interface for human-in-the-loop data collection and a force-aware residual policy formulation, enabling robust policy adaptation using minimal but high-fidelity correction data.
1. Compliant Intervention Interface for On-Policy Human Correction
The compliant intervention interface in CR-DAgger addresses the challenge of safely and effectively integrating human corrections during the execution of a learned policy on real robotic systems. It departs from conventional methods that suspend or fully override the robot's policy when a human intervenes. Instead, it permits delta corrections—smooth, incremental adjustments provided by the human in real-time—without halting policy execution.
- Implementation: The robot follows its current learned policy, while a human applies gentle correction forces through a kinesthetic handle mounted to the end effector. The handle is instrumented with a 6-axis force/torque sensor (e.g., ATI 6D F/T). An admittance controller, parameterized for moderate compliance (e.g., ~1000 N/m stiffness), fuses the robot and human inputs, ensuring corrections are executed compliantly and stably at all times.
- Benefits: This configuration enables humans to feel the policy's action and provide precise, context-aware modifications, while preserving the distribution of states visited and preventing abrupt control transitions. Fine-grained, in-distribution correction signals are logged without disrupting base policy execution.
- Data Logging: Robot commands, correction episodes (button-annotated), and measured forces are all recorded for subsequent policy learning; a minimal sketch of this compliant control-and-logging loop follows below.
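The sketch below illustrates how such a compliant fusion and logging loop might be structured. The translational admittance law, all parameter values other than the ~1000 N/m stiffness mentioned above, and the interface names (`base_policy`, `ft_sensor`, `robot`, `logger`) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative admittance parameters; only the ~1000 N/m stiffness comes from the text,
# the remaining values are assumptions for this sketch.
VIRTUAL_MASS = 2.0   # kg
DAMPING = 80.0       # N*s/m
STIFFNESS = 1000.0   # N/m (moderate compliance)
DT = 0.02            # 50 Hz control loop


class TranslationalAdmittance:
    """Minimal translational admittance law:
        M * x_ddot + D * x_dot + K * (x - x_ref) = f_ext
    The human's correction force f_ext compliantly displaces the commanded
    position away from the base-policy reference x_ref instead of being
    rejected by a stiff position controller."""

    def __init__(self, x0):
        self.x = np.asarray(x0, dtype=float)   # commanded position (controller state)
        self.x_dot = np.zeros(3)               # commanded velocity

    def step(self, x_ref: np.ndarray, f_ext: np.ndarray) -> np.ndarray:
        x_ddot = (f_ext - DAMPING * self.x_dot
                  - STIFFNESS * (self.x - x_ref)) / VIRTUAL_MASS
        self.x_dot += x_ddot * DT
        self.x += self.x_dot * DT
        return self.x                          # forwarded to the low-level pose controller


def run_correction_episode(base_policy, ft_sensor, robot, logger):
    """Execute the base policy while a human nudges the handle; log everything.
    `base_policy`, `ft_sensor`, `robot`, and `logger` are hypothetical interfaces."""
    adm = TranslationalAdmittance(robot.get_position())
    while not robot.task_done():
        obs = robot.get_observation()
        base_target = base_policy(obs)               # base-policy position command
        f_human = ft_sensor.read_force()             # force measured at the handle
        commanded = adm.step(base_target, f_human)   # compliant fusion of both inputs
        robot.send_position(commanded)
        logger.record(obs=obs, base_cmd=base_target,
                      human_force=f_human, executed=commanded)
```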
2. Compliant Residual Policy for Correction Integration
CR-DAgger employs a residual policy approach, training a lightweight, force-aware neural network to predict only the corrective actions—and not the full task policy. This residual is combined with the fixed, pre-trained base policy to yield the final action.
- Inputs: The residual policy receives the base policy's original sensory inputs (e.g., images, proprioception) plus high-frequency force signals.
- Outputs: The network predicts SE(3) delta pose corrections (9D) and target wrenches (force/torque, 6D), for a total of 15D per step, over a receding horizon (e.g., 5 frames at 50 Hz = 0.1s).
- Execution: The final commanded action at time t is a_t = (p_t^base ⊕ Δp_t, w_t), where p_t^base is the base policy's pose command, Δp_t is the residual pose correction, and w_t is the predicted wrench. Both the composed target pose and the wrench are sent to the admittance controller for compliant execution.
- Correction vs. No-Correction Data: Segments with no human intervention are labeled as zero residual, enabling the model to learn when not to override the base policy. Episodes with always-on low-level corrections enhance robustness.
Network Design: The residual policy reuses the base policy's (frozen) image encoder, combines its features with force signals processed by temporal convolutions (e.g., a WaveNet-style stack), and produces actions through a multilayer perceptron (MLP). A sketch of this architecture is shown below.
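The following is a minimal PyTorch-style sketch of a residual policy with this structure. The layer sizes, the shape of the force window, and the way the base pose and residual are composed at the end are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn


class ForceEncoder(nn.Module):
    """Dilated temporal convolutions over a window of 6-axis force/torque readings
    (a small WaveNet-style stack; sizes are illustrative)."""
    def __init__(self, in_dim=6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, force_window):                    # (B, 6, T)
        return self.net(force_window).squeeze(-1)       # (B, hidden)


class ResidualPolicy(nn.Module):
    """Predicts per-step corrections: 9D delta pose + 6D target wrench (15D total)
    over a short receding horizon, on top of a frozen base-policy image encoder."""
    def __init__(self, image_encoder, img_dim=512, proprio_dim=9,
                 horizon=5, act_dim=15):
        super().__init__()
        self.image_encoder = image_encoder
        for p in self.image_encoder.parameters():       # reuse the base encoder, frozen
            p.requires_grad = False
        self.force_encoder = ForceEncoder()
        self.head = nn.Sequential(
            nn.Linear(img_dim + proprio_dim + 64, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, horizon * act_dim),
        )
        self.horizon, self.act_dim = horizon, act_dim

    def forward(self, image, proprio, force_window):
        feat = torch.cat([self.image_encoder(image), proprio,
                          self.force_encoder(force_window)], dim=-1)
        out = self.head(feat).view(-1, self.horizon, self.act_dim)
        delta_pose, wrench = out[..., :9], out[..., 9:]   # split the 15D output
        return delta_pose, wrench


# Composition at execution time (schematic): the commanded pose is the base-policy
# pose composed with the predicted delta, and the predicted wrench is passed to the
# admittance controller, e.g.
#   pose_cmd = compose_se3(base_pose, delta_pose[:, 0])   # hypothetical SE(3) helper
#   admittance_controller.execute(pose_cmd, wrench[:, 0])
```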
3. Empirical Performance and Efficiency
CR-DAgger substantially improves the success rate of real-world robot manipulation policies with a small number of high-quality correction episodes.
Book Flipping Task:
- Base policy: 40% success (failure modes: missed insertions, incomplete flips).
- CR-DAgger: ~100% success, an improvement of 60 percentage points. The residual policy learns strategies such as finger pitching and forceful upright motion.
Belt Assembly Task:
- Base policy: 20% success (failure modes: missing pulleys, incorrect height).
- CR-DAgger: 70% success, an improvement of 50 percentage points. The policy refines the belt height using contact feedback and adapts its trajectory dynamically.
In both tasks, fewer than 50 intervention episodes sufficed for substantial improvement, demonstrating sample efficiency. Stagewise task evaluation shows CR-DAgger is most effective in phases requiring delicate or dynamic force adaptation.
4. Comparison to Retraining, Finetuning, and Prior Methods
CR-DAgger has been evaluated against three main baselines:
| Task | Base Policy | Retrain | Finetune | Residual (position-only) | CR-DAgger |
|---|---|---|---|---|---|
| Book Flipping | 40% | 38.3% | 10% | 70% | 100% |
| Belt Assembly | 20% | 25% | 5% | 50% | 70% |
- Retraining from scratch: Trains the whole policy anew on the combined original and correction data. Improvements are marginal due to the imbalance between the large original dataset and the small correction set, and the approach incurs substantial computational cost.
- Finetuning: Updates selected parameters of the base policy on the correction data alone. This approach is unstable and often degrades performance (e.g., a 30-percentage-point drop on book flipping).
- Position-only residual: Omits force inputs and targets. This ablation results in markedly lower success rates than CR-DAgger, especially in contact-heavy stages.
CR-DAgger’s force-aware residual formulation thus provides not only sample efficiency but also robustness and adaptability—exceeding standard approaches in all scenarios examined.
5. Practical Implementation Guidance
The following empirically-driven principles are recommended for deploying CR-DAgger or its variants in real-world manipulation:
- Policy Update: Train the residual policy in a single batch after collecting all correction data, rather than incrementally updating after small batches, to avoid catastrophic forgetting.
- Data Labeling: Annotate no-correction intervals as zero residual, and sample densely immediately after intervention onset for maximum learning impact (see the labeling and training sketch after this list).
- Initiation Window: Begin corrections once the base policy consistently achieves roughly 10–20% success. Corrections are most effective once the policy's state distribution is stable but clear failure modes remain.
- Sensor Modalities: Add force feedback and wrench prediction to the residual even if the base policy used position only.
- Tooling: Employ kinesthetic, compliant teaching interfaces with haptic feedback and real-time force sensing.
- Compute Requirements: CR-DAgger training and inference are lightweight and feasible on standard GPUs.
- Limitations: If the base policy’s success rate is below 10%, residual correction becomes impractical; foundational competency is needed.
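A minimal sketch of the labeling and single-batch training steps recommended above is given here; the frame dictionary layout, the simplified 15D target construction, and the `residual_policy.update(...)` call are hypothetical placeholders, not the authors' code.

```python
import numpy as np


def build_residual_dataset(episodes):
    """Convert logged correction episodes into (observation, residual-target) pairs.
    Frames without human intervention get a zero residual, so the model learns
    when *not* to override the base policy."""
    samples = []
    for episode in episodes:
        for frame in episode:
            if frame["correcting"]:                     # button-annotated interval
                # Simplified: treat the 9D pose as a flat vector; a real
                # implementation would compose poses on SE(3).
                delta_pose = frame["executed_pose"] - frame["base_pose"]          # 9D
                target = np.concatenate([delta_pose, frame["measured_wrench"]])   # 15D
            else:
                target = np.zeros(15)                   # zero-residual label
            samples.append((frame["obs"], target))
    return samples


def train_single_batch(residual_policy, samples, epochs=50):
    """Train once on the full correction set after collection finishes,
    rather than updating incrementally after each small batch."""
    for _ in range(epochs):
        np.random.shuffle(samples)
        for obs, target in samples:
            residual_policy.update(obs, target)         # hypothetical optimizer step
```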
6. Broader Context and Theoretical Underpinnings
CR-DAgger builds on the theoretical analysis of imitation learning from converging supervisors (arXiv:1907.03423). Residual DAgger variants, compliant or otherwise, are a natural fit for frameworks where the supervisor (here, the human providing delta corrections) evolves or improves over time. The regret bounds proven for converging supervisors, namely static and dynamic regret sublinear in the number of correction rounds provided the supervisor's labels converge, apply directly so long as the human correction process becomes more stable and consistent. This provides a formal justification for the sample-efficient, stable improvements observed with CR-DAgger in real-world settings.
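For reference, the standard online-learning regret notions invoked here can be written as follows; these are generic definitions, and the precise loss and comparator classes used in arXiv:1907.03423 may differ in detail.

```latex
% Static regret: cumulative loss of the learner's policies \pi_i over N rounds,
% relative to the best fixed policy in hindsight.
R^{\mathrm{static}}_N = \sum_{i=1}^{N} \ell_i(\pi_i) - \min_{\pi \in \Pi} \sum_{i=1}^{N} \ell_i(\pi)

% Dynamic regret: comparison against the per-round minimizers instead.
R^{\mathrm{dynamic}}_N = \sum_{i=1}^{N} \ell_i(\pi_i) - \sum_{i=1}^{N} \min_{\pi \in \Pi} \ell_i(\pi)

% Sublinearity (R_N / N -> 0 as N grows) means the average excess loss vanishes;
% the converging-supervisor analysis guarantees this when the labels converge.
```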
The compliant design—admittance control, kinesthetic intervention, real-time force feedback—enables human corrections to be smoothly integrated, logged, and learned from without instabilities or distributional shift, emphasizing safety and practical adoption on physical hardware.
Summary Table: Core Features and Benefits of CR-DAgger
Component | CR-DAgger Approach | Benefit |
---|---|---|
Correction Interface | On-policy, compliant kinesthetic delta correction with force sensing | Precise, smooth, and in-distribution human corrections |
Policy Update | Force-aware, lightweight residual model (images + force → delta+force) | High-frequency, robust corrections, enables new sensor modalities |
Sample Efficiency | <50 intervention episodes per task for 50–60 percentage-point gains | Fast, low-cost policy refinement |
Baseline Comparison | Outperforms retrain-from-scratch, finetuning, position-only variants | Avoids instability, data inefficiency, distribution shift |
Implementation | Single batch, targeted sampling, suitable initial policy, multimodal | Real-world applicability and stability |
Conclusion
CR-DAgger advances practical robot learning in contact-rich manipulation by enabling continuous, compliant human intervention and efficient, force-sensitive residual adaptation. It achieves significant performance gains over base and conventional policies with minimal corrective data, and is supported by theoretical regret guarantees applicable to converging, evolving supervisors. Its design principles, empirical results, and implementation recommendations form a foundation for effective deployment of human-in-the-loop learning in complex, real-world robotic environments.
Results videos and resources: https://compliant-residual-dagger.github.io/