Unknown normalization details for Gato’s Procgen results

Determine the score of the data collection policy that Gato uses as its reference on Procgen, and establish how Gato’s reported performance maps onto the standard Procgen normalization scores, so that a direct, fair comparison becomes possible.

Background

When situating GEA’s Procgen results against prior work, the GEA authors (Szot et al., 2024) note that Gato reports its Procgen scores relative to a data collection policy whose score has not been released. Without that score, Gato’s relative numbers cannot be converted into the standard Procgen normalization, which prevents a fair comparison.

Clarifying the normalization procedure and the underlying score used for Gato’s reporting would resolve this reproducibility and comparability gap in Procgen evaluations.
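
The dependence on the missing score can be made concrete with a small sketch. It assumes the min-max form commonly used for Procgen, (R - R_min) / (R_max - R_min), and, hypothetically, that a Gato-style relative score is a linear rescaling between a random-policy score and the data collection policy's score; the source does not state the exact form Gato uses. All function names and numeric values below are illustrative placeholders, not published numbers.

```python
def procgen_normalized(raw: float, r_min: float, r_max: float) -> float:
    """Min-max normalization of a raw episode return for one environment."""
    return (raw - r_min) / (r_max - r_min)


def raw_from_relative(relative: float, policy_score: float,
                      random_score: float = 0.0) -> float:
    """Invert a relative score back to a raw return.

    Assumption (not confirmed by the source):
        relative = (raw - random_score) / (policy_score - random_score)
    where policy_score is the data collection policy's raw score.
    """
    return random_score + relative * (policy_score - random_score)


def relative_to_procgen(relative: float, policy_score: float,
                        r_min: float, r_max: float,
                        random_score: float = 0.0) -> float:
    """Convert a relative score to a min-max normalized score.

    This only works if policy_score is known, which is exactly the
    quantity reported as unreleased.
    """
    raw = raw_from_relative(relative, policy_score, random_score)
    return procgen_normalized(raw, r_min, r_max)


if __name__ == "__main__":
    # Placeholder values: the same relative score of 0.8 maps to different
    # normalized scores depending on the unknown policy score.
    for policy_score in (7.0, 9.0, 10.0):
        value = relative_to_procgen(0.8, policy_score, r_min=5.0, r_max=10.0,
                                    random_score=5.0)
        print(policy_score, round(value, 3))
```

Under these assumptions, a single reported relative score is consistent with a range of normalized scores, so the conversion is underdetermined until the data collection policy's score is known.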

References

While Gato also reports numbers in Procgen, we are not able to compare to these numbers because Gato reports performance relative to unknown score of the data collection policy. To the best of our knowledge the score of the data collection policy is not released. Thus, it is unclear how the Gato Procgen performance is normalized according to the standard Procgen normalization scores rendering a direct comparison impossible.

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons (2412.08442 - Szot et al., 11 Dec 2024) in Appendix: Further Experimental Details, Additional Baseline Details (Procgen)