MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World (2401.08577v1)

Published 16 Jan 2024 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.RO

Abstract: Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal LLMs, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied LLM that could incorporate multisensory interactive data, including visual, audio, tactile, and thermal information into LLMs, thereby establishing the correlation among words, actions, and percepts. To this end, we first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform instruction tuning with pre-trained LLM on such generated data, we first encode the 3D scene as abstracted object-centric representations and then introduce action tokens denoting that the embodied agent takes certain actions within the environment, as well as state tokens that represent the multisensory state observations of the agent at each time step. In the inference time, MultiPLY could generate action tokens, instructing the agent to take the action in the environment and obtain the next multisensory state observation. The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens. We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks involving object retrieval, tool use, multisensory captioning, and task decomposition.

Overview of MultiPLY

The MultiPLY framework extends LLMs beyond the passive absorption of multisensory data, enabling an embodied agent to actively interact with three-dimensional (3D) environments. Rather than receiving a fixed set of inputs, the agent gathers information from its surroundings through multiple senses: visual, auditory, tactile, and thermal.

Data Collection and Representation

Underpinning the framework is the newly collected Multisensory Universe dataset, which contains roughly 500k instances of multisensory interaction data. To build it, an LLM-powered embodied agent is deployed in diverse 3D scenes to collect observations across several sensory modalities. Each 3D scene is encoded as an abstracted, object-centric representation that tells the LLM which objects are present and how they are arranged spatially. On top of this high-level view, the LLM is trained to emit action tokens corresponding to specific interactions, such as navigating to an object or touching it to acquire tactile information.
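
To make the object-centric encoding concrete, here is a minimal sketch of how a scene might be serialized into an object-centric prompt for the LLM. The `SceneObject` fields, the `<OBJ>` token format, and `encode_scene` are illustrative assumptions, not the paper's actual interface; in MultiPLY the object representations are abstracted features rather than plain text strings.

```python
# Hypothetical sketch of an object-centric scene encoding (not the authors' code).
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str          # e.g. "mug"
    position: tuple    # (x, y, z) coordinates in the 3D scene
    feature_id: int    # index of the object's abstracted visual feature

def encode_scene(objects: list) -> str:
    """Serialize an abstracted, object-centric view of the scene for the LLM."""
    parts = []
    for obj in objects:
        x, y, z = obj.position
        parts.append(f"<OBJ>{obj.name}@({x:.1f},{y:.1f},{z:.1f})#{obj.feature_id}</OBJ>")
    return " ".join(parts)

scene = [SceneObject("mug", (1.2, 0.0, 0.8), 0),
         SceneObject("kettle", (1.5, 0.0, 0.9), 1)]
prompt = encode_scene(scene) + " Task: find the object that is still hot."
print(prompt)  # the LLM would continue this prompt with an action token
```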

State Tokens and Inference

After performing an action, the collected multisensory observation is communicated back to the LLM using state tokens, allowing the model to continuously update its understanding of the environment and determine the next action. This cycle repeats, enabling the agent to methodically explore its surroundings and gather comprehensive sensory data to generate text or further action tokens. MultiPLY's performance exceeds existing baselines across various tasks, including object retrieval, tool usage, multisensory captioning, and task decomposition.
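
The generate-act-observe cycle can be illustrated with a small, runnable sketch. The `ToyEnv` and `ToyLLM` classes, the `<TOUCH>...</TOUCH>` action format, and the `<STATE>` wrapper are hypothetical stand-ins for the paper's learned tokens and simulator, shown only to clarify the control flow.

```python
# Minimal, hypothetical sketch of the action/state token loop (not the paper's API).
import re

class ToyEnv:
    """Stand-in environment: touching the kettle returns a thermal reading."""
    def step(self, action: str) -> str:
        return "temperature=hot" if "kettle" in action else "temperature=cold"

class ToyLLM:
    """Stand-in LLM that emits one action token, then a final text answer."""
    def __init__(self):
        self.turn = 0
    def generate(self, context: str) -> str:
        self.turn += 1
        return "<TOUCH>kettle</TOUCH>" if self.turn == 1 else "The kettle is the hot object."

def run_episode(llm, env, prompt: str, max_steps: int = 10) -> str:
    """Alternate between LLM-emitted action tokens and environment state observations."""
    context = prompt
    for _ in range(max_steps):
        out = llm.generate(context)
        context += " " + out
        match = re.search(r"<(NAVIGATE|TOUCH|TAP|PICK_UP)>(.*?)</\1>", out)
        if match is None:                   # no action token -> final text answer
            break
        obs = env.step(match.group(0))      # execute the action, get a multisensory observation
        context += f" <STATE>{obs}</STATE>" # feed the observation back as state tokens
    return context

print(run_episode(ToyLLM(), ToyEnv(), "Task: find the hot object."))
```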

Experimental Findings

Across the evaluated tasks, MultiPLY's interactive, multisensory approach outperforms previous models that only process passive inputs and produce one-off outputs. The gap is most apparent in object retrieval, where reasoning over multiple modalities is often decisive in identifying the correct object among visually similar candidates. In tool-use scenarios, interacting with the environment lets the model reason more effectively about an object's functionality from its multisensory attributes, yielding more accurate solutions. In multisensory captioning, the model draws on the different sensory inputs to describe objects more comprehensively. Finally, the iterative interaction loop lends itself naturally to task decomposition, where complex activities are broken down into sequential actions.

By supporting a richer, more human-like way of interacting with the environment, MultiPLY marks a significant step for embodied AI research. It not only expands the potential uses of LLMs but also broadens the range of modalities through which AI systems can learn from and engage with the world around them.

Authors (6)
  1. Yining Hong (23 papers)
  2. Zishuo Zheng (5 papers)
  3. Peihao Chen (28 papers)
  4. Yian Wang (26 papers)
  5. Junyan Li (17 papers)
  6. Chuang Gan (195 papers)
Citations (12)