
PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators (2406.20083v1)

Published 28 Jun 2024 in cs.RO and cs.CV

Abstract: We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real-world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training with high throughput. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement. PoliFormer can also be trivially extended to a variety of downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation with no finetuning.

Authors (9)
  1. Kuo-Hao Zeng (22 papers)
  2. Zichen Zhang (30 papers)
  3. Kiana Ehsani (31 papers)
  4. Rose Hendrix (12 papers)
  5. Jordi Salvador (15 papers)
  6. Alvaro Herrasti (11 papers)
  7. Ross Girshick (75 papers)
  8. Aniruddha Kembhavi (79 papers)
  9. Luca Weihs (46 papers)
Citations (4)

Summary

Overview of PoliFormer: Scaling On-Policy RL with Transformers for Masterful Navigation

The paper introduces PoliFormer, a transformer-based policy model trained with on-policy reinforcement learning (RL) that achieves state-of-the-art (SoTA) performance on embodied navigation tasks. The system uses RGB inputs exclusively and operates effectively in both simulated and real-world environments despite being trained solely in simulation. The research emphasizes the efficacy of combining a foundational vision transformer encoder with a causal transformer decoder. PoliFormer is trained at scale in the AI2-THOR simulator across diverse environments and significantly improves on previous benchmarks.

Key Contributions

  1. Model Architecture:
    • PoliFormer leverages a vision transformer (ViT), utilizing a foundational model (DINOv2) for visual encoding.
    • The architecture includes a transformer state encoder and a causal transformer decoder, providing comprehensive state summarization and temporal memory modeling.
    • The use of KV-cache in the causal transformer decoder reduces computational overhead, enhancing training efficiency.
  2. Training Efficiency:
    • The RL training setup employs parallelized, multi-machine rollouts, facilitating high throughput and efficient learning.
    • The batch size and training steps are scaled progressively to optimize learning without compromising stability.
    • This methodology allows for a high volume of environment interactions, further improving the model's robustness.
  3. Benchmarking and Results:
    • PoliFormer excels across four navigation benchmarks: CHORES-S, ProcTHOR-10k, ArchitecTHOR, and AI2-iTHOR.
    • The model achieves an 85.5% success rate on the CHORES-S benchmark, a 28.5% absolute improvement over the previous best.
    • Gains of 8.7% to 10.0% on the other three benchmarks further solidify PoliFormer's superiority in navigation tasks.
  4. Generalization and Adaptability:
    • PoliFormer demonstrates notable zero-shot transfer to real-world scenarios with the LoCoBot and Stretch RE-1 robots, achieving higher success rates than existing baselines.
    • The model can be extended trivially to various downstream applications, such as multi-object navigation and object tracking, without additional fine-tuning.
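The KV-cache mentioned in the architecture section can be illustrated with a minimal single-head causal attention sketch. This is not the authors' implementation; all names and dimensions here are illustrative, using NumPy in place of a deep-learning framework. The point is that caching keys and values makes each decoding step attend over the stored history in O(t) new work, while producing the same outputs as full causal attention recomputed from scratch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CachedCausalAttention:
    """Single-head causal self-attention with a KV-cache (illustrative sketch).

    Each new timestep's query attends to all cached keys/values, so per-step
    cost grows linearly with history length instead of recomputing the full
    quadratic attention over the whole sequence every step.
    """
    def __init__(self, d, rng):
        self.d = d
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        self.k_cache, self.v_cache = [], []

    def step(self, x):
        # x: (d,) embedding for the newest timestep
        q = x @ self.Wq
        self.k_cache.append(x @ self.Wk)
        self.v_cache.append(x @ self.Wv)
        K = np.stack(self.k_cache)          # (t, d) keys seen so far
        V = np.stack(self.v_cache)          # (t, d) values seen so far
        attn = softmax(K @ q / np.sqrt(self.d))
        return attn @ V

rng = np.random.default_rng(0)
d, T = 8, 5
layer = CachedCausalAttention(d, rng)
X = rng.standard_normal((T, d))

# Incremental decoding with the cache, one timestep at a time.
incremental = np.stack([layer.step(x) for x in X])

# Full causal attention over the whole sequence, for comparison.
Q, K, V = X @ layer.Wq, X @ layer.Wk, X @ layer.Wv
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
full = softmax(scores) @ V

assert np.allclose(incremental, full)
```

The equivalence check at the end confirms the cached incremental path matches the full masked computation, which is what makes the cache a pure efficiency optimization.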

Implications and Future Directions

Practical Implications:

  • The research suggests that utilizing large-scale training and high-capacity models can significantly improve the navigation capabilities of embodied agents.
  • The methodology of leveraging parallel rollouts and scalable environments can be adapted to other complex RL tasks beyond navigation.
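The parallel-rollout pattern described above can be sketched in a few lines. This is a toy stand-in, not the paper's multi-machine infrastructure: `ToyEnv`, the trivial policy, and the horizon are all invented for illustration, with a thread pool standing in for distributed workers. The structural idea is the same: many environments step concurrently, and their trajectories are flattened into one on-policy batch.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

class ToyEnv:
    """Stand-in environment; the paper uses AI2-THOR scenes at scale."""
    def __init__(self, seed, horizon=8):
        self.rng = np.random.default_rng(seed)
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.rng.standard_normal(4)

    def step(self, action):
        self.t += 1
        obs = self.rng.standard_normal(4)
        reward = float(action == 0)
        done = self.t >= self.horizon
        return obs, reward, done

def collect_rollout(env, policy):
    """Roll one environment to termination, recording (obs, action, reward)."""
    obs = env.reset()
    traj = []
    while True:
        a = policy(obs)
        next_obs, r, done = env.step(a)
        traj.append((obs, a, r))
        obs = next_obs
        if done:
            return traj

policy = lambda obs: int(obs.sum() > 0)        # placeholder for the learned policy
envs = [ToyEnv(seed=i) for i in range(16)]

# Collect from all environments concurrently, as parallel workers would.
with ThreadPoolExecutor(max_workers=8) as pool:
    rollouts = list(pool.map(lambda e: collect_rollout(e, policy), envs))

# Flatten per-environment trajectories into a single on-policy batch.
batch = [step for traj in rollouts for step in traj]
```

In the real system the workers run on separate machines and feed a central learner; the flattening step is where the high interaction throughput turns into large training batches.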

Theoretical Implications:

  • The combination of vision transformers and causal transformers for state representation and memory modeling demonstrates the potential of transformer architectures in RL and embodied AI.
  • The incremental scaling of batch sizes and effective temporal caching strategies can provide insights into training large neural models in other domains.
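The incremental batch-size scaling can be expressed as a simple step schedule. The milestones and factors below are made up for illustration; the summary does not give the paper's actual schedule, only that batch size grows progressively as training advances.

```python
def batch_size_at(step, base=128,
                  milestones=((1_000_000, 2), (5_000_000, 4))):
    """Illustrative progressive batch-size schedule.

    Returns the batch size in effect at a given interaction count: the base
    size is multiplied by the factor of the last milestone passed. The
    specific milestones and factors here are hypothetical.
    """
    scale = 1
    for milestone, factor in milestones:
        if step >= milestone:
            scale = factor
    return base * scale
```

A schedule like this keeps early training stable with small batches while letting later training exploit higher rollout throughput.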

Future Developments:

  • Further scaling of model parameters and training steps is warranted to explore the upper limits of navigation task performance. The current results indicate that PoliFormer's performance has not saturated and can benefit from additional computational resources.
  • Detailed exploration of cross-embodiment models could reveal the feasibility of training a single PoliFormer variant adaptable to multiple robotic platforms without separate training.
  • Extending PoliFormer to incorporate additional sensory inputs, like depth sensors, could enhance its applicability to mobile manipulation tasks, broadening its utility in varied real-world applications.

In conclusion, PoliFormer exemplifies a significant advancement in the application of transformers in RL, setting a high bar for future research in embodied navigation. The combination of efficient training methodologies and robust architecture potentially paves the way for the development of more generalized and highly capable AI agents in complex environments.