CLIPSwarm: Generating Drone Shows from Text Prompts with Vision-Language Models (2403.13467v1)
Abstract: This paper introduces CLIPSwarm, a new algorithm that automates the design of drone swarm formations from natural language. The algorithm first enriches a provided word to compose a text prompt, which drives an iterative search for the formation that best matches that word. Formations are refined iteratively to align with the textual description, alternating between "exploration" and "exploitation" steps. The framework is currently evaluated on simple formation targets, limited to contour shapes. Each formation is rendered as an alpha-shape contour, filled with the most representative color automatically selected for the input word. To measure the similarity between the description and this visual representation, we use CLIP [1] to encode text and image into vectors and compare them. The algorithm then rearranges the formation to represent the word more faithfully within the constraints of the available drones. Finally, control actions are assigned to the drones to guarantee feasible, collision-free motion. Experimental results demonstrate the system's efficacy in accurately modeling robot formations from natural-language descriptions, and its versatility is showcased by executing drone shows with varying shapes in photorealistic simulation. We refer the reader to the supplementary video for a visual reference of the results.
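As a rough illustration of the scoring step described in the abstract, the Python sketch below renders a candidate formation as a filled contour and measures its CLIP similarity to a text prompt. It is a minimal approximation, assuming a Hugging Face CLIP checkpoint (openai/clip-vit-base-patch32) and a convex hull as a stand-in for the paper's alpha-shape contour; the helper names render_formation and clip_score and the random 20-drone example are illustrative, not the authors' implementation.

```python
# Minimal sketch of a CLIP-based formation-scoring step (illustrative, not the
# authors' pipeline): render drone positions as a filled contour, then compare
# CLIP embeddings of the rendering and the enriched text prompt.
import io

import numpy as np
import torch
from PIL import Image
from matplotlib import pyplot as plt
from scipy.spatial import ConvexHull
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def render_formation(positions: np.ndarray, color: str = "red") -> Image.Image:
    """Render 2D drone positions as a filled contour.

    A convex hull is used here as a simple stand-in for the alpha-shape
    contour described in the paper."""
    hull = ConvexHull(positions)
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.fill(positions[hull.vertices, 0], positions[hull.vertices, 1], color=color)
    ax.set_axis_off()
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of the rendering and the prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = txt / txt.norm(dim=-1, keepdim=True)
    img = img / img.norm(dim=-1, keepdim=True)
    return float((txt @ img.T).item())


# Example: score a random 20-drone formation against an enriched prompt.
positions = np.random.rand(20, 2)
score = clip_score(render_formation(positions), "a red heart shape on a white background")
print(f"CLIP similarity: {score:.3f}")
```

In the iterative search sketched by the abstract, a score like this would serve as the fitness used to decide whether an "exploration" or "exploitation" update to the formation is kept.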
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Int. Conf. on Machine Learning, 2021, pp. 8748–8763.
- “ChatGPT,” https://openai.com/blog/chatgpt, accessed: 2024-02-29.
- W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, “VoxPoser: Composable 3D value maps for robotic manipulation with language models,” in Conference on Robot Learning. PMLR, 2023, pp. 540–562.
- L. Scalera, S. Seriani, A. Gasparetto, and P. Gallina, “Non-photorealistic rendering techniques for artistic robotic painting,” Robotics, vol. 8, no. 1, p. 10, 2019.
- Z. Ma, S. Duenser, C. Schumacher, R. Rust, M. Baecher, F. Gramazio, M. Kohler, and S. Coros, “Stylized robotic clay sculpting,” Computers & Graphics, vol. 98, pp. 150–164, 2021.
- P. Pueyo, J. Dendarieta, E. Montijano, A. C. Murillo, and M. Schwager, “CineMPC: A fully autonomous drone cinematography system incorporating zoom, focus, pose, and scene composition,” IEEE Transactions on Robotics, 2024.
- D. Nar and R. Kotecha, “Optimal waypoint assignment for designing drone light show formations,” Results in Control and Optimization, vol. 9, p. 100174, 2022.
- R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman et al., “Foundation models in robotics: Applications, challenges, and the future,” arXiv preprint arXiv:2312.07843, 2023.
- F. Zeng, W. Gan, Y. Wang, N. Liu, and P. S. Yu, “Large language models for robotics: A survey,” arXiv preprint arXiv:2311.07226, 2023.
- C. Zhang, J. Chen, J. Li, Y. Peng, and Z. Mao, “Large language models for human-robot interaction: A review,” Biomimetic Intelligence and Robotics, p. 100131, 2023.
- Y. Cui, S. Niekum, A. Gupta, V. Kumar, and A. Rajeswaran, “Can foundation models perform zero-shot task specification for robot manipulation?” in Learning for Dynamics and Control Conference. PMLR, 2022, pp. 893–905.
- W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in International Conference on Machine Learning. PMLR, 2022, pp. 9118–9147.
- C. Kim, Y. Seo, H. Liu, L. Lee, J. Shin, H. Lee, and K. Lee, “Guide your agent with adaptive multimodal rewards,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- N. Wake, A. Kanehira, K. Sasabuchi, J. Takamatsu, and K. Ikeuchi, “GPT-4V(ision) for robotics: Multimodal task planning from human demonstration,” arXiv preprint arXiv:2311.12015, 2023.
- S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, “ChatGPT for robotics: Design principles and model abilities,” Microsoft Auton. Syst. Robot. Res., vol. 2, p. 20, 2023.
- A. Beltramello, L. Scalera, S. Seriani, and P. Gallina, “Artistic robotic painting using the palette knife technique,” Robotics, vol. 9, no. 1, 2020.
- A. Karimov, E. Kopets, G. Kolev, S. Leonov, L. Scalera, and D. Butusov, “Image preprocessing for artistic robotic painting,” Inventions, vol. 6, no. 1, p. 19, 2021.
- G. Chen, S. Baek, J.-D. Florez, W. Qian, S.-w. Leigh, S. Hutchinson, and F. Dellaert, “GTGraffiti: Spray painting graffiti art from human painting motions with a cable-driven parallel robot,” in Int. Conf. on Robotics and Automation, 2022, pp. 4065–4072.
- H. Peng, C. Zhou, H. Hu, F. Chao, and J. Li, “Robotic dance in social robotics—a taxonomy,” IEEE Transactions on Human-Machine Systems, vol. 45, no. 3, pp. 281–293, 2015.
- R. Bonatti, W. Wang, C. Ho, A. Ahuja, M. Gschwindt, E. Camci, E. Kayacan, S. Choudhury, and S. Scherer, “Autonomous aerial cinematography in unstructured environments with learned artistic decision-making,” Journal of Field Robotics, vol. 37, no. 4, pp. 606–641, 2020.
- P. Pueyo, E. Montijano, A. C. Murillo, and M. Schwager, “CineTransfer: Controlling a robot to imitate cinematographic style from a single example,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 10044–10049.
- J. Alonso-Mora, A. Breitenmoser, M. Rufli, R. Siegwart, and P. Beardsley, “Multi-robot system for artistic pattern formation,” in IEEE Int. Conf. on Robotics and Automation, 2011, pp. 4512–4517.
- J. Alonso-Mora, A. Breitenmoser, M. Rufli, R. Siegwart, and P. Beardsley, “Image and animation display with multiple mobile robots,” The International Journal of Robotics Research, vol. 31, no. 6, pp. 753–773, 2012.
- S. Hauri, J. Alonso-Mora, A. Breitenmoser, R. Siegwart, and P. Beardsley, “Multi-robot formation control via a real-time drawing interface,” in 8th Int. Conf. of Field and Service Robotics, 2013, pp. 175–189.
- M. Waibel, B. Keays, and F. Augugliaro, “Drone shows: Creative potential and best practices,” ETH Zurich, Tech. Rep., 2017.
- H.-J. Kim and H.-S. Ahn, “Realization of swarm formation flying and optimal trajectory generation for multi-drone performance show,” in IEEE/SICE Int. Symposium on System Integration, 2016, pp. 850–855.
- H. Sun, J. Qi, C. Wu, and M. Wang, “Path planning for dense drone formation based on modified artificial potential fields,” in 2020 39th Chinese Control Conference, 2020, pp. 4658–4664.
- S. Luo and W. Hu, “Diffusion probabilistic models for 3d point cloud generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2837–2845.
- L. Hui, R. Xu, J. Xie, J. Qian, and J. Yang, “Progressive point cloud deconvolution generation network,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. Springer, 2020, pp. 397–413.
- H. Edelsbrunner, “Alpha shapes – a survey,” in Tessellations in the Sciences: Virtues, Techniques and Applications of Geometric Tilings, 2011.
- J. Snape and D. Manocha, “Navigating multiple simple-airplanes in 3D workspace,” in 2010 IEEE International Conference on Robotics and Automation. IEEE, 2010, pp. 3974–3980.
- S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, 2018, pp. 621–635.
- P. Pueyo, E. Cristofalo, E. Montijano, and M. Schwager, “CinemAirSim: A camera-realistic robotics simulator for cinematographic purposes,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2020, pp. 1186–1191.
- “ORCA 3D,” https://github.com/mtreml/Python-RVO2-3D, accessed: 2023-03-02.