
Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning (2307.01849v3)

Published 4 Jul 2023 in cs.RO, cs.CV, and cs.LG

Abstract: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.


Summary

  • The paper presents Crossway Diffusion, a novel approach that jointly optimizes diffusion loss with a self-supervised state reconstruction objective.
  • It integrates a state decoder that reconstructs raw inputs from intermediate representations to improve visuomotor policy learning.
  • Experiments on simulated and real robot tasks show up to a 15.7% success rate improvement and robust performance under distractions.

The paper presents "Crossway Diffusion," a method designed to enhance diffusion-based visuomotor policy learning. It builds upon existing diffusion policy approaches by incorporating a state decoder and an auxiliary self-supervised learning (SSL) objective. The key idea is to reconstruct input states (raw image pixels and other state information) from the intermediate representations generated during the reverse diffusion process. This reconstruction is enforced through a specifically designed state decoder. The whole model is optimized jointly using the SSL objective and the standard diffusion loss.

Here's a breakdown of the key components and contributions:

  • Problem: Standard diffusion-based policies for robot imitation learning leave room for improvement, particularly in the quality of the visual representations they learn.
  • Method: Crossway Diffusion introduces:
    • State Decoder: A neural network that reconstructs raw image pixels and other state information from intermediate representations within the reverse diffusion process. This forces the model to learn better intermediate representations.
    • Self-Supervised Learning (SSL) Objective: A loss function that measures the difference between the reconstructed states and the original input states, encouraging accurate reconstruction. This loss is combined with the original diffusion loss to train the entire model.
    • Intersection Transformation: A transformation applied to the intermediate representation of the reverse diffusion process before it is fed to the state decoder.
  • Architecture: The Crossway Diffusion model consists of a state encoder, an action encoder, and an action decoder (as in Diffusion Policy), plus the new state decoder. The intermediate representation is dubbed the "intersection" because both flows of information, denoising and state reconstruction, pass through it (see the structural sketch after this list).
  • Experiments:
    • Evaluated on simulated robot tasks from Robomimic and a Push-T task from IBC.
    • Demonstrated on real-world robot manipulation tasks.
    • Compared against Diffusion Policy and Implicit Behavioral Cloning (IBC).
  • Results:
    • Crossway Diffusion consistently outperforms the baseline Diffusion Policy and IBC.
    • Shows significant improvement on tasks with demonstrations of varied proficiency (e.g., a 15.7% gain in success rate on "Transport, mh").
    • Qualitative results show good image reconstruction, suggesting effective representation learning.
    • The method exhibits robustness to distractions such as unseen objects and partial occlusions in real-world settings.
    • Ablation studies validate the design choices, including the state decoder architecture and the auxiliary reconstruction objective.
  • Ablation Studies:
    • Different designs of the state decoder.
    • Predicting the future state versus reconstructing the current state with the state decoder.
    • Using contrastive learning instead of image reconstruction as an SSL objective.
  • Key Findings from Ablations:
    • Forcing the two flows of information (denoising and state reconstruction) to intersect is important. Design D, where these two flows were disentangled, showed worse performance.
    • Reconstructing the current state is more beneficial than predicting future states as an auxiliary objective.
    • Not all SSL objectives are beneficial. Contrastive learning as an auxiliary loss performed worse than the image reconstruction auxiliary task.
  • Contributions:
    • A novel method (Crossway Diffusion) for improving diffusion-based visuomotor policies.
    • Extensive experiments on simulated and real-world tasks.
    • Detailed ablation studies to justify the design choices.
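
To show where the "intersection" sits relative to the two flows, here is a hedged structural sketch. The 1-D convolutional layout, hidden sizes, and fixed 96x96 reconstruction target are assumptions for illustration only; the paper builds on Diffusion Policy's conditional U-Net denoiser, which is more elaborate than this skeleton.

```python
import torch
import torch.nn as nn

class NoisePredNetSketch(nn.Module):
    """Illustrative skeleton only; layer choices and shapes are assumptions."""

    def __init__(self, action_dim, cond_dim, hidden_dim=256):
        super().__init__()
        # Denoising flow, first half: encode the noisy action sequence (1-D over time).
        self.action_encoder = nn.Conv1d(action_dim, hidden_dim, 3, padding=1)
        # The "intersection": the single feature that both information flows pass through.
        self.mid_block = nn.Conv1d(hidden_dim + cond_dim, hidden_dim, 3, padding=1)
        # Denoising flow, second half: predict the injected noise.
        self.action_decoder = nn.Conv1d(hidden_dim, action_dim, 3, padding=1)

    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, action_dim, horizon); cond: (B, cond_dim, horizon).
        # The diffusion timestep t would normally be embedded and injected here;
        # it is ignored in this skeleton to keep the sketch short.
        h = self.action_encoder(noisy_actions)
        intersection = self.mid_block(torch.cat([h, cond], dim=1))
        return self.action_decoder(intersection), intersection


class StateDecoderSketch(nn.Module):
    """Reconstructs raw pixels from the intersection; the pooling + linear head
    and the fixed 96x96 output are placeholders for the paper's decoder."""

    def __init__(self, hidden_dim=256, img_hw=96):
        super().__init__()
        self.img_hw = img_hw
        self.head = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                  nn.Linear(hidden_dim, 3 * img_hw * img_hw))

    def forward(self, intersection):
        return self.head(intersection).view(-1, 3, self.img_hw, self.img_hw)
```

Under these assumptions, the training step sketched earlier would pass `intersection` to the state decoder and sum the diffusion and reconstruction losses.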

In essence, Crossway Diffusion leverages self-supervised learning through state reconstruction to improve the visual representations learned by diffusion-based policies, leading to better performance in robot imitation learning tasks.
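
As the abstract notes, at test time a diffusion-based policy iteratively generates the action sequence from random noise conditioned on the input states; the auxiliary state decoder is only needed during training. The sketch below assumes a diffusers-style scheduler interface (`set_timesteps`, `step`, `prev_sample`), which is an assumption rather than the paper's code.

```python
import torch

@torch.no_grad()
def sample_actions(images, state_encoder, noise_pred_net, noise_scheduler,
                   horizon, action_dim, num_inference_steps=50):
    # Encode the current observations once; they condition every denoising step.
    obs_feat = state_encoder(images)

    # Start the action sequence from pure Gaussian noise.
    actions = torch.randn(images.shape[0], action_dim, horizon,
                          device=images.device)

    noise_scheduler.set_timesteps(num_inference_steps)
    for t in noise_scheduler.timesteps:
        # Only the denoising flow runs here; the state decoder is not used.
        pred_noise, _ = noise_pred_net(actions, t, cond=obs_feat)
        actions = noise_scheduler.step(pred_noise, t, actions).prev_sample

    return actions
```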
