LLM-enhanced Scene Graph Learning for Household Rearrangement (2408.12093v2)

Published 22 Aug 2024 in cs.RO and cs.CV

Abstract: The household rearrangement task involves spotting misplaced objects in a scene and finding proper places for them. It depends both on common-sense knowledge on the objective side and on human user preference on the subjective side. To accomplish this task, we propose to mine object functionality with user preference alignment directly from the scene itself, without relying on human intervention. To do so, we work with a scene graph representation and propose LLM-enhanced scene graph learning, which transforms the input scene graph into an affordance-enhanced graph (AEG) with information-enhanced nodes and newly discovered edges (relations). In the AEG, the nodes corresponding to receptacle objects are augmented with context-induced affordance, which encodes what kinds of carriable objects can be placed on them. New edges are added for newly discovered non-local relations. With the AEG, we perform task planning for scene rearrangement by detecting misplaced carriables and determining a proper placement for each of them. We test our method by implementing a tidying robot in a simulator and evaluate it on a new benchmark we build. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on misplacement detection and the subsequent rearrangement planning.

Authors (8)

Wenhao Li
Zhiyuan Yu
Qijin She
Zhinan Yu
Yuqing Lan
Chenyang Zhu
Ruizhen Hu
Kai Xu

Summary

The paper "LLM-enhanced Scene Graph Learning for Household Rearrangement" tackles household rearrangement by augmenting scene graph representations with knowledge from LLMs. Its core is the affordance-enhanced graph (AEG), which enriches a conventional scene graph with contextual information and affordances inferred from a combination of textual and visual data. The method avoids explicit human intervention by mining object functionality and aligning with user preferences directly from the scene.

Methodology

The pipeline begins by extracting an initial vanilla scene graph (SG) from an RGB-D scan of the scene. This scene graph contains object instances and their spatial relationships. It is then transformed into an AEG through a two-stage context analysis:

Local Context Analysis: This stage uses the surrounding spatial and visual information to create a detailed affordance description for each object. A prompt template organizes the textual and visual information for LLM-based affordance extraction. The analysis identifies the current use and potential interactions of objects within their immediate context (e.g., a shelf near clothes being marked as a "cloakroom utility shelf").
Global Context Analysis: This stage broadens the scope by incorporating larger scene contexts such as areas and room-level information. By organizing objects into a hierarchical structure (object-area-room), the method allows for the identification of non-local relationships (e.g., a sofa and a TV being linked due to their functional relationship in a living room). This enrichment is critical for identifying suitable placements for objects by considering how they fit into the broader scene.
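
To make the two stages concrete, here is a minimal sketch of how such a graph might be held in memory. It is not the authors' implementation: the class names, field names, relation labels, and prompt wording are all hypothetical, and the LLM reply is replaced by a hard-coded string.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One object in the scene graph (hypothetical schema)."""
    name: str
    area: str                   # e.g. "closet"
    room: str                   # e.g. "bedroom"
    is_receptacle: bool = False
    affordance: str = ""        # filled in by the local context analysis


@dataclass
class AffordanceEnhancedGraph:
    nodes: dict = field(default_factory=dict)   # name -> Node
    edges: set = field(default_factory=set)     # (source, relation, target)

    def add_node(self, node):
        self.nodes[node.name] = node

    def add_edge(self, source, relation, target):
        self.edges.add((source, relation, target))

    def hierarchy(self):
        """Group object names as room -> area -> [objects]."""
        tree = {}
        for node in self.nodes.values():
            tree.setdefault(node.room, {}).setdefault(node.area, []).append(node.name)
        return tree


def build_local_prompt(node, neighbours):
    """Assemble a text prompt for affordance extraction (wording is made up)."""
    nearby = ", ".join(neighbours) if neighbours else "nothing"
    return (f"Object: {node.name}. Nearby objects: {nearby}. "
            "Describe what kinds of items this object should hold.")


aeg = AffordanceEnhancedGraph()
aeg.add_node(Node("shelf", area="closet", room="bedroom", is_receptacle=True))
aeg.add_node(Node("coat", area="closet", room="bedroom"))
aeg.add_node(Node("sofa", area="seating area", room="living room", is_receptacle=True))
aeg.add_node(Node("tv", area="media wall", room="living room"))
aeg.add_edge("coat", "on", "shelf")             # spatial edge from the vanilla SG

# Local stage: the prompt would go to an LLM; the reply here is a stand-in.
prompt = build_local_prompt(aeg.nodes["shelf"], ["coat"])
aeg.nodes["shelf"].affordance = "cloakroom utility shelf"

# Global stage: add a non-local edge discovered from room-level context.
aeg.add_edge("sofa", "functionally_related", "tv")
```

Calling `aeg.hierarchy()` returns the object-area-room grouping as nested dictionaries, which is one simple way to expose room-level context to a later prompt.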

The resulting AEG is used to identify misplaced objects and determine optimal placements through a two-step process:

Misplaced Object Detection: An LLM-based scorer evaluates if objects are placed reasonably by analyzing affordances. Objects scoring below a threshold are flagged as misplaced.
Placement Decision Generation: For each misplaced object, receptacle candidates are retrieved from the AEG using a Retrieval-Augmented Generation (RAG) strategy. The top candidates are then fed into a placement decision generator, which finalizes the rearrangement plan.
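
The two steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: the scorer is passed in as a plain function in place of an LLM call, retrieval uses simple word overlap instead of a real RAG retriever, and the threshold value, function names, and toy data are all invented.

```python
def detect_misplaced(placements, score_fn, threshold=0.5):
    """Return carriables whose placement score falls below the threshold."""
    return [obj for obj, receptacle in placements.items()
            if score_fn(obj, receptacle) < threshold]


def retrieve_candidates(query, affordances, top_k=2):
    """Rank receptacles by word overlap between the query and affordance text.

    Word overlap is a stand-in for the retrieval step of a RAG pipeline.
    """
    query_words = set(query.lower().split())

    def overlap(item):
        return len(query_words & set(item[1].lower().split()))

    ranked = sorted(affordances.items(), key=overlap, reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Toy affordance descriptions, as might be stored on receptacle nodes.
affordances = {
    "closet shelf": "stores folded clothes and coats",
    "kitchen counter": "holds food and cooking tools",
    "bookcase": "holds books and magazines",
}
placements = {"coat": "kitchen counter", "novel": "bookcase"}

# Hard-coded scores replace the LLM-based scorer in this sketch.
toy_scores = {("coat", "kitchen counter"): 0.1, ("novel", "bookcase"): 0.9}
misplaced = detect_misplaced(placements, lambda o, r: toy_scores[(o, r)])

candidates = retrieve_candidates("where to store coats and clothes", affordances)

# The paper passes the top candidates to a decision generator; taking the
# first candidate here is only a placeholder for that step.
plan = {obj: candidates[0] for obj in misplaced}
```

Retrieving a short candidate list first keeps the final decision prompt small, which is the usual motivation for a retrieval step ahead of generation.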

Evaluation

The authors evaluate their method on both the Tidybot benchmark and an annotated dataset based on the Habitat Synthetic Scenes Dataset (HSSD 200). Together these cover a diverse range of scenarios, including multi-room environments. The evaluation covers two tasks:

Misplacement Detection: The approach achieved high scores in accuracy, recall, precision, and F1 score, highlighting the ability to accurately detect misplaced objects.
Rearrangement Planning: Compared against existing methods like Housekeep and Tidybot, the proposed method consistently achieved higher Normalized Discounted Cumulative Gain (NDCG) across various testing criteria.
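
For readers unfamiliar with NDCG, the sketch below computes it for a ranked list of graded relevance values. It uses the common linear-gain form with a log2 rank discount; which variant and which relevance labels the paper uses are not stated in this summary, so treat both as assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), ranks from 1."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Relevance of each proposed receptacle, in the order the planner ranked them.
perfect_ranking = ndcg([3, 2, 1])
reversed_ranking = ndcg([1, 2, 3])
```

A ranking that lists the best receptacle first scores 1.0, and any ordering that pushes better receptacles down the list scores lower.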

Implications and Future Directions

The work has both practical and theoretical implications:

Practical Implications:

Autonomous Household Assistants: The method could improve the capabilities of household robots, enabling them to perform user-preference-aligned tidying tasks without extensive human input.
Human-Robot Interaction: By embedding common-sense reasoning and personalized affordance understanding into scene graphs, the interactions between humans and robots become more intuitive and practical.

Theoretical Implications:

Scene Graph Augmentation: The approach sets a precedent for integrating LLMs with scene graphs, paving the way for more nuanced scene understanding.
Affordance Analysis: It advances the concept of affordance from simple spatial relationships to include comprehensive contextual understanding, which could inform future research in embodied AI.

Future Directions:

Joint Scene Graph Optimization: Further work could explore optimizing the initial scene graph construction process using end-to-end learning approaches.
Keyframe Selection Improvement: Developing advanced techniques for selecting keyframes that minimize ambiguities and maximize informative content could enhance the robustness of visual context analysis.

Overall, the method is a step toward context-aware systems that can carry out household rearrangement with little human input. As LLMs continue to improve, their integration with scene graph learning could yield more capable and versatile systems.
