Distilling Realizable Students from Unrealizable Teachers (2505.09546v1)

Published 14 May 2025 in cs.RO and cs.LG

Abstract: We study policy distillation under privileged information, where a student policy with only partial observations must learn from a teacher with full-state access. A key challenge is information asymmetry: the student cannot directly access the teacher's state space, leading to distributional shifts and policy degradation. Existing approaches either modify the teacher to produce realizable but sub-optimal demonstrations or rely on the student to explore missing information independently, both of which are inefficient. Our key insight is that the student should strategically interact with the teacher, querying only when necessary and resetting from recovery states, to stay on a recoverable path within its own observation space. We introduce two methods: (i) an imitation learning approach that adaptively determines when the student should query the teacher for corrections, and (ii) a reinforcement learning approach that selects where to initialize training for efficient exploration. We validate our methods in both simulated and real-world robotic tasks, demonstrating significant improvements over standard teacher-student baselines in training efficiency and final performance. The project website is available at: https://portal-cornell.github.io/CritiQ_ReTRy/

Summary


The paper "Distilling Realizable Students from Unrealizable Teachers" addresses a central challenge in policy distillation for robotic systems operating under partial observability: a student policy restricted to partial observations must learn from a teacher policy that enjoys full-state access. Because of this information asymmetry, naive imitation of the teacher leads to distributional shift and policy degradation.

Key Contributions

  1. Framework and Challenges: The paper formalizes policy distillation as a Contextual Markov Decision Process (CMDP) and highlights the resulting information mismatch: the teacher acts on privileged information that the student cannot observe. This asymmetry produces state aliasing, whereby multiple teacher states map to the same student observation, so the teacher's demonstrations can prescribe conflicting actions for inputs the student cannot distinguish (a minimal formal illustration follows this list).
  2. Critical State Query (CritiQ): CritiQ is an imitation learning method that queries the teacher only at critical states, where the student risks veering off a recoverable path. By acting autonomously elsewhere, the student avoids absorbing the aliased teacher labels that continuous imitation would accumulate, improving both stability and final performance (a schematic query loop is sketched after this list).
  3. Resetting to Teacher Recovery (ReTRy): ReTRy is a reinforcement learning method that adapts the reset distribution iteratively. By resetting the student to states visited by both the teacher and the student, it improves exploration efficiency and learning robustness. Unlike reset strategies that consider only teacher paths, ReTRy incorporates unexplored yet recoverable states identified from the student's own experience (see the reset-buffer sketch after this list).
  4. Performance Analysis: The paper analyzes both methods theoretically. CritiQ tightens the error bound of DAgger-style algorithms in the CMDP setting by limiting the accumulation of aliasing errors, and the Performance Difference Lemma (restated below) is used to show that ReTRy's adaptive reset strategy keeps density ratios bounded, yielding efficient sample usage and rapid convergence toward optimal behavior.
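
To make the aliasing problem concrete, here is a minimal formal illustration (the notation is ours, not the paper's): two privileged states that the student cannot distinguish receive conflicting teacher labels, so any purely imitative student can only fit a mixture of them.

```latex
% Two distinct privileged states alias to one student observation o
% under the observation map O, yet the teacher \pi^{T} prescribes
% different actions in each:
\exists\, s_1 \neq s_2:\quad O(s_1) = O(s_2) = o,
\qquad \pi^{T}(s_1) = a_1 \neq a_2 = \pi^{T}(s_2).
% A student policy conditioned only on o cannot match both labels;
% naive behavior cloning converges to a posterior-weighted mixture:
\pi^{S}(\cdot \mid o) \;\approx\; \Pr[s_1 \mid o]\,\delta_{a_1}
  + \Pr[s_2 \mid o]\,\delta_{a_2}.
```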
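
A minimal sketch of the kind of query loop CritiQ describes is given below. This is our schematic reading of the method, not the authors' implementation: `criticality`, `tau`, and the environment interface are hypothetical placeholders.

```python
def rollout_with_critical_queries(env, student, teacher, criticality, tau, dataset):
    """One rollout of a CritiQ-style loop (schematic; names are ours).

    student(obs)       -> action from the partial observation `obs`
    teacher(state)     -> action from the privileged full state
    criticality(obs)   -> scalar estimate of how close the student is
                          to leaving a recoverable path
    tau                -> query threshold
    """
    state, obs = env.reset()
    done = False
    while not done:
        if criticality(obs) > tau:
            # Critical state: query the teacher for a corrective label,
            # record it for the next imitation update, and follow it to
            # stay on a recoverable path.
            action = teacher(state)
            dataset.append((obs, action))
        else:
            # Non-critical state: act autonomously, avoiding aliased
            # teacher labels the student cannot realize anyway.
            action = student(obs)
        state, obs, done = env.step(action)
    return dataset
```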
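
Similarly, ReTRy's adaptive resets can be sketched as a buffer that mixes teacher states with student-verified recovery states. Again, the class and its API are our assumptions for illustration, not the paper's code.

```python
import random

class AdaptiveResetDistribution:
    """ReTRy-style reset buffer (schematic; the API is our assumption).

    Iteratively mixes in states the student has demonstrably recovered
    from, rather than resetting only along teacher trajectories.
    """

    def __init__(self, teacher_states, mix=0.5):
        self.teacher_states = list(teacher_states)  # states on teacher paths
        self.recovered = []                         # student-verified recovery states
        self.mix = mix                              # probability of a teacher reset

    def sample(self):
        # Fall back to teacher states until recovery states exist.
        if not self.recovered or random.random() < self.mix:
            return random.choice(self.teacher_states)
        return random.choice(self.recovered)

    def update(self, trajectory, succeeded):
        # After a successful episode, every visited state was, by
        # construction, recoverable under the current student policy.
        if succeeded:
            self.recovered.extend(trajectory)
```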
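
For reference, the Performance Difference Lemma invoked in the analysis is standard; in its discounted form it reads as follows. The connection to bounded density ratios is our gloss: resetting from states the student actually visits keeps the reset and student state distributions close, so the advantage term can be estimated from on-policy samples.

```latex
% Performance Difference Lemma (Kakade & Langford, 2002), discounted form:
% the return gap between two policies equals the advantage of the new
% policy averaged over its own normalized discounted state distribution.
J(\pi') - J(\pi)
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d^{\pi'},\; a \sim \pi'(\cdot \mid s)}
    \left[ A^{\pi}(s, a) \right].
```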

Empirical Validation

The authors validate both methods in simulated and real-world robotic tasks, including navigation and manipulation problems where the student lacks the semantic information available to the teacher, such as object locations. CritiQ and ReTRy show significant improvements over imitation learning baselines (e.g., Behavior Cloning, DAgger) and reinforcement learning baselines (e.g., SAC).

Implications and Future Directions

This research has practical implications for building more robust robot learning pipelines in which deployed agents must operate effectively without full environmental knowledge. The approaches could significantly enhance robotic autonomy and adaptability, especially in complex, unstructured environments.

Theoretically, the proposed techniques could extend to more complex POMDP settings and to other domains where agents must act under partial observability. Future work might focus on further optimizing critical state discovery and on generalizing the reset strategies to broader classes of CMDPs.

In conclusion, the paper shows how strategic student-teacher interaction can overcome the challenges posed by unrealizable teacher demonstrations, setting the stage for more scalable and robust robot learning frameworks.
