MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments (2402.00290v3)
Abstract: With the rapid development of LLMs, embodied intelligence has attracted increasing attention. However, prior work on embodied intelligence typically encodes scene or historical memory in a unimodal manner, either visual or linguistic, which complicates aligning the model's action planning with embodied control. To overcome this limitation, we introduce the Multimodal Embodied Interactive Agent (MEIA), which translates high-level tasks expressed in natural language into a sequence of executable actions. Specifically, we propose a novel Multimodal Environment Memory (MEM) module that facilitates integrating embodied control with large models through a visual-language memory of scenes. This memory enables MEIA to generate executable action plans that account for diverse task requirements and the robot's capabilities. Furthermore, with the help of an LLM, we construct an embodied question answering dataset based on a dynamic virtual cafe environment. In this environment, we conduct zero-shot experiments with multiple large models and carefully design scenarios covering a variety of situations. The experimental results demonstrate the promising performance of MEIA on various embodied interactive tasks.
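To make the idea of a visual-language scene memory driving executable planning more concrete, here is a minimal sketch. It is not the authors' implementation: the action set, prompt format, `MultimodalEnvironmentMemory` class, and `query_llm` hook are all illustrative assumptions. It only shows the general pattern of pairing visual snapshots with language descriptions, serializing that memory into an LLM prompt, and keeping just the lines of the reply that map onto executable robot primitives.

```python
# Illustrative sketch only (not the paper's code): a multimodal environment
# memory pairing visual snapshots with language captions, plus a planner that
# filters an LLM reply down to executable primitive actions.
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical primitive actions a low-level controller could execute.
EXECUTABLE_ACTIONS = {"move_to", "pick", "place", "open", "close", "speak"}

@dataclass
class MemoryEntry:
    image_id: str    # handle to a stored scene image (visual modality)
    caption: str     # language description of the same observation
    timestamp: float

@dataclass
class MultimodalEnvironmentMemory:
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, image_id: str, caption: str, timestamp: float) -> None:
        self.entries.append(MemoryEntry(image_id, caption, timestamp))

    def as_prompt(self) -> str:
        # Serialize the visual-language memory so an LLM can condition on it.
        return "\n".join(
            f"[t={e.timestamp:.1f}] image={e.image_id}: {e.caption}"
            for e in self.entries
        )

def plan(task: str,
         memory: MultimodalEnvironmentMemory,
         query_llm: Callable[[str], str]) -> List[str]:
    """Ask the LLM for a plan, then keep only lines that start with an executable action."""
    prompt = (
        "Scene memory:\n" + memory.as_prompt()
        + f"\n\nTask: {task}\n"
        + "Reply with one action per line, e.g. 'move_to table_1'."
    )
    reply = query_llm(prompt)
    steps = []
    for line in reply.splitlines():
        line = line.strip()
        if line and line.split()[0] in EXECUTABLE_ACTIONS:
            steps.append(line)  # grounded, executable step
    return steps

if __name__ == "__main__":
    mem = MultimodalEnvironmentMemory()
    mem.add("img_001", "a coffee cup on the counter next to the espresso machine", 0.0)
    mem.add("img_002", "a customer is seated at table_1, which is empty", 3.2)

    # Stand-in for a real LLM call; returns a canned plan for this demo.
    fake_llm = lambda prompt: "move_to counter\npick coffee_cup\nmove_to table_1\nplace coffee_cup"
    print(plan("Serve the coffee to the customer at table 1", mem, fake_llm))
```

Running the sketch prints the four filtered action steps; in a real system the `fake_llm` stand-in would be replaced by an actual large-model call and each step would be dispatched to the robot's control API.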
Authors: Yang Liu, Xinshuai Song, Kaixuan Jiang, Weixing Chen, Jingzhou Luo, Guanbin Li, Liang Lin