Value-suboptimality guarantees under suboptimal demonstrators for general bounded rewards

Determine whether value-suboptimality guarantees can be achieved for general bounded reward model classes when the demonstrator is suboptimal, within the learning-from-correct-demonstrations framework that relies on low-cardinality reward model classes rather than policy-class assumptions.
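To fix what "value suboptimality" asks for, here is a standard formulation in illustrative notation (the symbols $V^\pi$, $\pi^\star$, $\hat\pi$, and $\epsilon$ are assumed for exposition and are not taken verbatim from the paper):

```latex
% Illustrative notation: V^{\pi}(r) is the expected return of policy \pi
% under reward function r (bounded), \pi^{\star} an optimal policy, and
% \hat{\pi} the learned policy. A value-suboptimality guarantee requires
V^{\pi^{\star}}(r) \;-\; V^{\hat{\pi}}(r) \;\le\; \epsilon .
```

The open question is whether a bound of this form is achievable for general bounded reward model classes when the demonstrator itself does not satisfy it.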

Background

The paper extends its main results from binary to bounded rewards and provides guarantees when demonstrations are optimal. For suboptimal demonstrators, the authors develop a method that competes with the demonstrator's loss up to a constant factor, but they note that such loss guarantees do not, in general, translate into value-suboptimality guarantees in non-binary reward settings.
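The gap the authors point to can be sketched as follows, in assumed illustrative notation (the loss $L$, demonstrator $h_{\mathrm{demo}}$, constant $C$, and policies are not the paper's exact symbols): a competitive loss bound does not by itself control value suboptimality once rewards are non-binary.

```latex
% Assumed notation: L(\cdot) is the loss on demonstrations, h_{\mathrm{demo}}
% the (possibly suboptimal) demonstrator, \hat{h} the learned predictor with
% induced policy \hat{\pi}, and \pi^{\star} an optimal policy. The method gives
L(\hat{h}) \;\le\; C \cdot L(h_{\mathrm{demo}}) \,+\, \epsilon,
% but for general bounded (non-binary) rewards this need not imply
V^{\pi^{\star}}(r) \;-\; V^{\hat{\pi}}(r) \;\le\; \epsilon' .
```

Intuitively, two predictors with similar loss on the demonstration distribution can induce policies whose values differ substantially when rewards take intermediate values.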

They contrast this with prior work under stronger policy-class assumptions, which can achieve value suboptimality via distribution matching, and explicitly raise the question of whether analogous value guarantees can be obtained under their reward-model-class assumption when the demonstrator is suboptimal.

References

We leave it as an interesting and important direction for future work whether we can achieve $\epsilon$-value suboptimality for general bounded reward model classes even if the demonstrator is suboptimal.

Learning to Answer from Correct Demonstrations (2510.15464 - Joshi et al., 17 Oct 2025) in Remark (Value Suboptimality), Section 6 (Learning from Suboptimal Demonstrator)