GraphPad: Adaptive 3D Scene Graph Memory

Updated 3 March 2026

GraphPad is a dynamic memory system for embodied agents characterized by a mutable 3D scene graph that updates in real time based on linguistic queries.
It integrates structured components such as a navigation log, graphical scratch-pad, and auxiliary frame memory with API-driven updates for efficient spatial and functional reasoning.
On the OpenEQA embodied question answering benchmark, GraphPad reaches 55.3% accuracy from 5 initial keyframes, compared with 52.3% for an image-only baseline given 25 frames, highlighting its practical advantage over static scene representations.

GraphPad is a structured, modifiable memory system for embodied agents that enables inference-time updates of 3D scene graphs to support task-conditional spatial reasoning. Unlike traditional static scene representations built prior to task specification, GraphPad maintains a dynamic, mutable scene graph and associated memory structures that evolve in response to real-time linguistic queries and reasoning requirements. Its architecture is specifically designed to resolve the “task–memory mismatch” endemic to embodied question answering, providing just-in-time object, relation, and attribute acquisition without excessive up-front computational overhead (Ali et al., 1 Jun 2025).

1. Core Architecture and Data Structures

GraphPad’s structured scene memory (SSM) comprises three primary components as well as an auxiliary frame buffer:

Mutable 3D Scene Graph ($G = (\mathcal{N}, \mathcal{E})$):
- Nodes ($n_i$): Object tracks, each containing a point cloud ($P_i \in \mathbb{R}^{N_i \times 3}$), CLIP-based visual embedding ($V_i \in \mathbb{R}^d$), BGE-based language embedding ($L_i \in \mathbb{R}^k$), consolidated caption ($C_i$), room/floor label ($\ell_i$), and frame visibility list ($F_i$).
- Edges ($r_{ij}$): Directed, labeled relations among objects (e.g., “on_top_of,” “contained_in,” “subpart_of,” “attached_to”), each with an associated justification string.
Navigation Log ($\mathcal{L}$): For each RGB-D keyframe $f$, stores the semantic room ID ($\rho_f$), FOV textual tag ($\tau_f$), egocentric motion label ($\Delta\mathrm{pose}_f$), and set of visible object nodes ($V_f$). This acts as an index for frame-object association and targeted search.
Graphical Scratch-Pad: Per-object, free-form note fields [$\mathrm{Notes}_i$] for task- or subquery-specific annotations, enabling sub-results or references to be accumulated in situ.
Frame Memory (Auxiliary): List of RGB-D keyframes and indexed camera poses, initialized with a sparse subset ($n_\mathrm{img}$ frames), expandable through VLM-driven API calls as new task-relevant information is required.
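
As a rough illustration of how these components fit together, the following Python sketch models the memory with standard-library dataclasses. The class and field names (ObjectNode, Edge, NavLogEntry, SceneMemory, and so on) are assumptions made for this example and are not taken from the GraphPad implementation; embeddings and point clouds are shown as plain lists rather than arrays.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """One object track n_i (field names are illustrative)."""
    node_id: int
    caption: str                                              # C_i
    room_label: str                                           # l_i
    points: list[tuple[float, float, float]] = field(default_factory=list)  # P_i
    visual_emb: list[float] = field(default_factory=list)     # V_i (CLIP)
    language_emb: list[float] = field(default_factory=list)   # L_i (BGE)
    visible_frames: list[int] = field(default_factory=list)   # F_i
    notes: list[str] = field(default_factory=list)            # Notes_i (scratch-pad)


@dataclass
class Edge:
    """Directed, labeled relation r_ij with its justification string."""
    src: int
    dst: int
    label: str
    justification: str


@dataclass
class NavLogEntry:
    """Navigation-log record for one keyframe f."""
    room_id: str                                              # rho_f
    fov_tag: str                                              # tau_f
    motion_label: str                                         # egocentric motion label
    visible_nodes: set[int] = field(default_factory=set)      # V_f


@dataclass
class SceneMemory:
    nodes: dict[int, ObjectNode] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)
    nav_log: dict[int, NavLogEntry] = field(default_factory=dict)
    frame_ids: list[int] = field(default_factory=list)        # frame memory


# Hypothetical contents, for illustration only.
memory = SceneMemory()
memory.nodes[0] = ObjectNode(node_id=0, caption="television", room_label="living room")
memory.nodes[1] = ObjectNode(node_id=1, caption="tv stand", room_label="living room")
memory.edges.append(Edge(0, 1, "on_top_of", "TV rests on the stand in frame 3"))
memory.nav_log[3] = NavLogEntry("living room", "facing the TV wall", "turn_left", {0, 1})
memory.nodes[0].notes.append("screen is off")
```

Keeping the scratch-pad notes on the node itself mirrors the per-object note fields described above; a separate mapping from node ID to notes would work equally well.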

2. Inference-Time Update Protocol and APIs

GraphPad is engineered for real-time, task-driven modification through a structured agentic reasoning loop mediated by a vision-language model (VLM). At each step, the current memory state (scene graph, navigation log, scratch-pad, frame memory) is serialized into the VLM prompt and, given a linguistic query, the VLM either produces a direct answer or issues API calls for environment-specific probing.

Update APIs (language-callable; each takes a frame ID $f$ and a subquery $s$, and returns a JSON patch):
- find_objects($f, s$): A VLM detector identifies bounding boxes relevant to $s$ in frame $f$, back-projects them to generate $P_j$, computes $V_j$ and $L_j$, and matches the result to an existing node or spawns a new $n_j$.
- analyze_objects($f, [n_{i1}, ...], s$): Given a set of nodes visible in $f$, crops their bounding boxes, queries the VLM about property $s$, and appends the results to $\mathrm{Notes}_{i}$; falls back to find_objects if the nodes are not visible.
- analyze_frame($f, s$): Object discovery and annotation for $s$ over the entire frame; used for joint search and explanation.

Update patches insert new nodes, edges, and scratch-pad notes, and may append frames to the memory buffer, synchronized via a well-defined patch-application protocol.
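
The source does not spell out the patch format, so the following is only a minimal sketch of additive patch application under an assumed schema: a JSON object with optional "nodes", "edges", "notes", and "frames" keys. The function name apply_patch, the key names, and the node IDs are all hypothetical.

```python
import json


def apply_patch(memory: dict, patch_json: str) -> dict:
    """Merge a JSON patch into memory in place; additions only, nothing is deleted."""
    patch = json.loads(patch_json)
    for node in patch.get("nodes", []):
        # setdefault keeps an existing track instead of overwriting it
        memory["nodes"].setdefault(node["id"], node)
    for edge in patch.get("edges", []):
        if edge not in memory["edges"]:
            memory["edges"].append(edge)
    for node_id, notes in patch.get("notes", {}).items():
        memory["notes"].setdefault(node_id, []).extend(notes)
    for frame_id in patch.get("frames", []):
        if frame_id not in memory["frames"]:
            memory["frames"].append(frame_id)
    return memory


# Hypothetical memory state and patch, for illustration only.
memory = {"nodes": {"n0": {"id": "n0", "caption": "television"}},
          "edges": [], "notes": {}, "frames": [12]}
patch = json.dumps({
    "nodes": [{"id": "n7", "caption": "white air conditioner"}],
    "edges": [{"src": "n7", "dst": "n0", "label": "on_top_of",
               "justification": "seen above the TV in frame 12"}],
    "notes": {"n7": ["white, wall-mounted"]},
    "frames": [12, 40],
})
apply_patch(memory, patch)
```

Because the patch only adds material, an erroneous detection stays in memory once applied, which matches the error-propagation limitation the authors report.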

3. Integration with Perceptual and Reasoning Modules

GraphPad leverages a modular perception stack and a VLM to enable dynamic, evidence-driven memory refinement:

Perceptual Modules:
- Object detection and captioning: Gemini-2.0 Flash.
- Mask extraction: Segment Anything Model (SAM).
- Visual embedding: CLIP ViT-L/14 (pooled).
- Language embedding: BGE model.

Prompting at each step serializes the entire scene memory state in JSON (graph nodes, navigation log, scratch-pad) alongside image interleaves and camera poses. Chain-of-thought instructions explicitly require references to both visual (frame ID) and semantic (scratch-pad) evidence.

Batching and caching are employed for computational efficiency: embeddings ($V_i$, $L_i$) and point clouds ($P_i$) are cached, and edge discovery is performed in batched increments every three frames. In the presented configuration, prompt length grows with memory but no eviction is performed.
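
One way to picture the caching and batching described here is the sketch below. It assumes a per-node embedding cache and a helper that triggers edge discovery once three frames have accumulated; EmbeddingCache, EdgeBatcher, and the stand-in functions are hypothetical names, and the real system's embedding and relation models are replaced by trivial stubs.

```python
class EmbeddingCache:
    """Computes an embedding once per node and reuses it afterwards."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}

    def get(self, node_id, caption):
        if node_id not in self._store:
            self._store[node_id] = self._embed_fn(caption)
        return self._store[node_id]


class EdgeBatcher:
    """Collects frames and runs edge discovery once per `every` frames."""

    def __init__(self, discover_edges, every=3):
        self._discover_edges = discover_edges
        self._every = every
        self._pending = []

    def add_frame(self, frame_id):
        self._pending.append(frame_id)
        if len(self._pending) < self._every:
            return []  # not enough frames yet; defer edge discovery
        batch, self._pending = self._pending, []
        return self._discover_edges(batch)


# Stand-ins for the real embedding model and relation extractor.
embed_calls = []

def fake_embed(caption):
    embed_calls.append(caption)
    return [float(len(caption))]

batches = []

def fake_discover(frames):
    batches.append(list(frames))
    return [("n0", "n1", "on_top_of")]

cache = EmbeddingCache(fake_embed)
cache.get("n0", "television")
cache.get("n0", "television")  # served from the cache, no second model call

batcher = EdgeBatcher(fake_discover)
results = [batcher.add_frame(f) for f in range(1, 8)]
```

With seven frames, discovery runs twice (frames 1 to 3 and 4 to 6), and frame 7 waits for the next batch.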

4. Empirical Results and Comparative Performance

GraphPad was empirically evaluated on the OpenEQA benchmark for embodied question answering. Using $n_\mathrm{img}=5$ keyframes and $m=20$ maximum API calls, GraphPad attained an overall accuracy of 55.3%, outperforming the image-only Gemini-2.0 Flash baseline (52.3% with 25 frames) while processing five times fewer frames. The 3D-Mem (GPT-4V) system reported 57.2% (25 frames), with human performance at 86.8%.

Category-Level Performance (Accuracy %)

Category	GraphPad	Gemini
Attribute Recognition	66.8	46.5
Object State Recognition	69.6	66.5
Functional Reasoning	59.2	53.5
Spatial Understanding	47.7	52.4

GraphPad delivered pronounced gains in attribute recognition (+20.3 points) and functional reasoning (+5.7 points), while trailing the Gemini baseline on spatial understanding by 4.7 points.

Component Ablation (Accuracy %)

Configuration	Accuracy
Frame Memory only	32.9
+ static Scene Graph	34.6
+ Navigation Log	42.5
+ Scene Graph + Navigation Log	46.9
+ Frame-Level API	50.5
+ Node-Level API	47.1

Navigation log inclusion yielded a $+9.6$ percentage point improvement over raw frames (32.9 to 42.5); the frame-level inference-time API added $+3.6$ points relative to the static scene graph combined with the navigation log (46.9 to 50.5).

This suggests that online, language-guided, object/attribute acquisition substantially increases reasoning utility beyond static representations.
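
The quoted deltas can be recomputed directly from the ablation table. The short check below copies the accuracies from the rows above; the helper name delta is an assumption of this example, and which rows are compared is inferred from the arithmetic rather than stated in the table.

```python
# Accuracies (%) copied from the component ablation table.
ablation = {
    "Frame Memory only": 32.9,
    "+ static Scene Graph": 34.6,
    "+ Navigation Log": 42.5,
    "+ Scene Graph + Navigation Log": 46.9,
    "+ Frame-Level API": 50.5,
    "+ Node-Level API": 47.1,
}


def delta(after: str, before: str) -> float:
    """Difference in percentage points between two configurations."""
    return round(ablation[after] - ablation[before], 1)


nav_gain = delta("+ Navigation Log", "Frame Memory only")
api_gain = delta("+ Frame-Level API", "+ Scene Graph + Navigation Log")
print(nav_gain, api_gain)  # 9.6 3.6
```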

5. Update Mechanism: Agentic Reasoning Loop

The agentic reasoning protocol cycles through serialization, VLM-driven action selection, and patch application, as outlined in the following algorithmic schema:

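Input: initial SSM = {G, 𝓛, ScratchPad, FrameMemory}, query q, max_calls m.
calls = 0
while calls < m:
    # Serialize the full memory state into the VLM prompt
    prompt_state = serialize(G, 𝓛, ScratchPad, FrameMemory)
    action = VLM_reason(prompt_state, q)
    if action.type == "answer":
        return action.answer
    # Otherwise the action names an API, a frame, and a subquery
    patch = call_API(action.api_name, action.frame_id, action.subquery)
    apply_patch(SSM, patch)
    calls += 1
# Call budget exhausted: answer from the final memory state
final_state = serialize(G, 𝓛, ScratchPad, FrameMemory)
return VLM_reason(final_state, q).answer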

If the VLM identifies a knowledge gap, it triggers an API call (e.g., “analyze_frame”) corresponding to targeted discovery or annotation. Patches extend the scene graph and associated structures, progressively refining the agent’s internal workspace in line with task-driven demands.
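
To make the control flow concrete, here is a small runnable Python sketch of the same loop with scripted stand-ins for the VLM, the API layer, and patch application. Everything outside the loop structure is an assumption of this example: the Action fields, the stub functions, and the list-of-names memory are placeholders, not the GraphPad implementation.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    type: str                       # "answer" or "api_call"
    answer: Optional[str] = None
    api_name: Optional[str] = None
    frame_id: Optional[int] = None
    subquery: Optional[str] = None


def run_loop(memory, query, vlm_reason, call_api, apply_patch, max_calls):
    """Serialize, let the VLM act, apply the patch; stop on answer or budget."""
    calls = 0
    while calls < max_calls:
        action = vlm_reason(json.dumps(memory, sort_keys=True), query)
        if action.type == "answer":
            return action.answer
        patch = call_api(action.api_name, action.frame_id, action.subquery)
        apply_patch(memory, patch)
        calls += 1
    # Budget exhausted: ask once more on the final state. A real system would
    # need to force an answer here; this sketch returns whatever the VLM gives.
    return vlm_reason(json.dumps(memory, sort_keys=True), query).answer


# Scripted stand-ins (hypothetical) so the loop can run without a model.
def stub_vlm(prompt_state, query):
    state = json.loads(prompt_state)
    if "air conditioner" in state["nodes"]:
        return Action(type="answer", answer="air conditioner")
    return Action(type="api_call", api_name="analyze_frame",
                  frame_id=12, subquery=query)


def stub_api(api_name, frame_id, subquery):
    return {"nodes": ["air conditioner"]}


def stub_apply(memory, patch):
    for name in patch["nodes"]:
        if name not in memory["nodes"]:
            memory["nodes"].append(name)


memory = {"nodes": ["tv", "couch", "lamp"]}
answer = run_loop(memory, "What white object is on top of the TV?",
                  stub_vlm, stub_api, stub_apply, max_calls=20)
```

In this scripted run the first step issues one analyze_frame call, the patch adds the missing object, and the second step answers.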

6. Typical Use Cases and Episodic Flow

GraphPad is particularly suited for question answering tasks where the set of relevant scene graph elements cannot be known a priori. For example, in the episode “What white object is on top of the TV?”, the initial memory contains nodes for the TV, couch, and lamp, but not the queried object. The VLM issues an “analyze_frame” API call on the most promising frame, which, after augmentation, results in insertion of a new “air conditioner” node and a supporting “on_top_of” edge. The scratch-pad is simultaneously updated with evidence. Without such dynamic augmentation, legacy systems relying solely on the initial scene graph would fail to answer correctly (Ali et al., 1 Jun 2025).
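
In GraphPad the VLM itself chooses which frame to probe. Purely as an illustration of how the navigation log could narrow that choice, the sketch below filters frames by whether an anchor object (here the TV) is visible; candidate_frames, the frame IDs, and the object names are hypothetical, and this hand-written filter is a stand-in rather than the system's actual selection logic.

```python
def candidate_frames(nav_log: dict[int, set[str]], anchor: str) -> list[int]:
    """Return frame IDs whose visible-object set contains the anchor node."""
    return sorted(f for f, visible in nav_log.items() if anchor in visible)


# Hypothetical navigation log: frame ID -> names of visible object nodes.
nav_log = {
    3: {"couch", "lamp"},
    12: {"tv", "lamp"},
    27: {"tv"},
}
frames = candidate_frames(nav_log, "tv")
print(frames)  # [12, 27]
```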

7. Limitations and Prospects

Stated limitations include error propagation (irreversible mis-detections/annotations), API expressivity (restricted to three basic update functions), inference latency (2–3 seconds per API call), confinement to static, pre-recorded scenes, and issues of prompt scalability as the memory grows. Prospective extensions involve integration of low-confidence filtering modules, design of richer domain-adaptive APIs (e.g., “merge_objects,” “delete_edge”), incremental graph condensation strategies to control prompt size, and adaptation to dynamic tasks such as navigation planning or robotic manipulation. Joint training of VLM policies for graph updates is anticipated to mitigate inference cost and further improve applicability (Ali et al., 1 Jun 2025).

References (1)

Ali et al. (2025). GraphPad: Inference-Time 3D Scene Graph Updates for Embodied Question Answering.
