Source of representation gains in C51: distributional backups vs. cross-entropy loss

Determine whether the representation improvements observed with the C51 categorical distributional reinforcement learning algorithm arise primarily from modeling the distribution of returns through distributional Bellman backups or from training value functions using a categorical cross-entropy loss.

Background

Distributional reinforcement learning methods such as C51 represent returns as categorical distributions and are trained with a cross-entropy loss. Prior analyses have attributed some of C51’s performance to improved representations, but it is unclear whether these gains stem from the distributional modeling itself or from the optimization benefits of cross-entropy training.
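To make the first candidate mechanism concrete, the following is a minimal NumPy sketch (illustrative only, not the original C51 implementation) of the distributional Bellman backup: the next-state return distribution is shifted by the reward, discounted, and projected back onto a fixed categorical support; the projected distribution then serves as the cross-entropy target. The function name `c51_project` is chosen here for illustration.

```python
import numpy as np

def c51_project(next_probs, reward, gamma, support):
    """Distributional Bellman backup for C51: project the shifted/discounted
    distribution r + gamma * z onto the fixed categorical support.

    next_probs: probabilities over the support atoms at the next state
    support:    fixed, evenly spaced atom locations z_0 < ... < z_{n-1}
    Returns the projected target distribution (the cross-entropy target).
    """
    v_min, v_max = support[0], support[-1]
    n = len(support)
    delta_z = (v_max - v_min) / (n - 1)
    projected = np.zeros(n)
    for z_j, p_j in zip(support, next_probs):
        # Shift and discount each atom, clipping to the support range.
        tz = np.clip(reward + gamma * z_j, v_min, v_max)
        # Distribute the atom's probability to its two neighboring atoms.
        b = (tz - v_min) / delta_z
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:
            projected[lo] += p_j
        else:
            projected[lo] += p_j * (hi - b)
            projected[hi] += p_j * (b - lo)
    return projected

def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Categorical cross-entropy between target and predicted distributions."""
    return -np.sum(target_probs * np.log(pred_probs + eps))
```

Note that both ingredients appear together in this single update: the projection step is the distributional backup, while `cross_entropy` is the loss; the research question is which of the two drives the representation gains.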

Clarifying the origin of these representational gains is important for understanding which components are essential for scaling value-based deep RL—whether modeling return distributions is necessary, or whether similar benefits can be achieved by reframing scalar value regression as classification using cross-entropy losses without explicitly modeling distributions.
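The alternative hypothesis can also be sketched in isolation: a scalar regression target is converted into a "two-hot" categorical target over a fixed support and trained with cross-entropy, with no distributional backup involved. This is a minimal illustrative sketch of that reframing (the helper name `two_hot` is an assumption, not code from the referenced paper).

```python
import numpy as np

def two_hot(value, support):
    """Encode a scalar regression target as a 'two-hot' categorical target:
    probability mass is split between the two support atoms bracketing the
    value so that the distribution's expectation equals the value exactly.
    """
    v_min, v_max = support[0], support[-1]
    n = len(support)
    delta = (v_max - v_min) / (n - 1)
    value = float(np.clip(value, v_min, v_max))
    b = (value - v_min) / delta
    lo, hi = int(np.floor(b)), int(np.ceil(b))
    target = np.zeros(n)
    if lo == hi:
        target[lo] = 1.0
    else:
        target[lo] = hi - b  # weight toward the lower atom
        target[hi] = b - lo  # weight toward the upper atom
    return target

def decode(probs, support):
    """Recover the scalar prediction as the expectation over the support."""
    return float(np.dot(probs, support))
```

Under this scheme, a standard scalar TD target (not a distribution of returns) is encoded with `two_hot` and the network is trained with the same categorical cross-entropy loss as C51, which is what allows the two candidate sources of representation gains to be separated experimentally.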

References

Lyle et al. (2019) showed that C51's gains can be partially attributed to improved representations, but it remained unknown whether those representations benefit from backing up distributions of returns or from the use of a cross-entropy loss.

Stop Regressing: Training Value Functions via Classification for Scalable Deep RL (Farebrother et al., 2024, arXiv:2403.03950), Section "Does Classification Learn More Expressive Representations?" (sec:repr_analysis)