Spatial Scratchpad: Geometry-Aware Memory

Updated 22 January 2026
  • Spatial Scratchpad is a structured working memory that encodes explicit object-level spatial relationships, geometric data, and semantic attributes for 2D/3D environments.
  • It enables agentic models to perform iterative spatial reasoning, scene planning, and geometry-aware content generation through a composite representation of scene, points, and relational hypergraphs.
  • Empirical results demonstrate improved scene synthesis, ergonomic layout planning, and robust visual-semantic performance in downstream tasks like interactive editing and zero-shot navigation.

A spatial scratchpad is a specialized, structured representation that serves as working memory for agentic models—particularly vision-language models (VLMs)—enabling iterative spatial reasoning, scene planning, and geometry-aware content generation. Unlike simple text or unstructured multimodal embeddings, a spatial scratchpad encodes explicit object-level spatial relationships, geometric data, and semantic attributes for downstream agentic operations in 2D or 3D environments (Liu et al., 26 May 2025). This construct provides the scaffolding for tasks such as 3D scene synthesis, interactive editing, navigation planning, and affordance reasoning, supporting both human-in-the-loop and fully autonomous workflows.

1. Core Components of a Spatial Scratchpad

Contemporary agentic frameworks structure the spatial scratchpad as a composite tuple $C = (S, P, G)$, where:

  • $S$ (Scene Portrait): High-level semantic blueprint combining structured text (descriptions of layout, style, atmosphere, and entity lists) and reference images (either user-supplied or synthesized).
  • $P$ (Semantically Labeled Point Cloud): A collection $P = \{(x_i, c_i, l_i)\}_{i=1}^N$ of 3D coordinates $x_i \in \mathbb{R}^3$, per-point RGB colors $c_i \in \mathbb{R}^3$, and instance labels $l_i$ generated from segmentation models (e.g., Grounded-SAM).
  • $G$ (Scene Hypergraph): A hypergraph $G = (V, E)$ with nodes $V = \{v_j\}$ representing object instances and edges $E = \bigcup_k E^{(k)}$ encoding $k$-ary spatial relations (unary: clearance; binary: contact/alignment; ternary and higher: symmetry, circularity) (Liu et al., 26 May 2025).

This triplet forms an explicit, continually evolving memory structure, which agentic VLMs read from and update during iterative reasoning and scene creation.
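As a concrete sketch, the composite $C = (S, P, G)$ might be modeled as plain Python dataclasses. All class and field names here are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class ScenePortrait:
    """S: high-level semantic blueprint (structured text + reference images)."""
    description: str                 # layout, style, atmosphere
    entities: list[str]              # entity list
    reference_images: list[str] = field(default_factory=list)  # image paths


@dataclass
class LabeledPoint:
    """One element of P: 3D position x_i, RGB color c_i, instance label l_i."""
    x: tuple[float, float, float]
    c: tuple[float, float, float]
    l: str


@dataclass
class HyperEdge:
    """One element of E: a k-ary spatial relation over object instances."""
    relation: str            # e.g. "contact", "clearance", "symmetry"
    nodes: tuple[str, ...]   # the k participating instance labels


@dataclass
class Scratchpad:
    """Composite spatial scratchpad C = (S, P, G)."""
    S: ScenePortrait
    P: list[LabeledPoint]
    E: list[HyperEdge]       # G = (V, E); V is implied by instance labels in P

    def objects(self) -> set[str]:
        """Node set V, recovered from the labeled point cloud."""
        return {p.l for p in self.P}
```

An agent reading the scratchpad would serialize `S` and `E` into its prompt and render `P`; an agent updating it would mutate `P` and `E` in place.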

2. Functional Role in Agentic Pipelines

The spatial scratchpad is fundamental to agentic scene construction, supporting bidirectional interaction:

  • Readout: VLMs ingest $S$ directly; $P$ is visualized via rendered maps from canonical and input views; $G$ is serialized into textual relation descriptions.
  • Update: When object geometry or pose changes (e.g., mesh restoration, pose alignment), $P$ and $G$ are updated to maintain spatial and semantic consistency (Liu et al., 26 May 2025).
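The update step can be sketched in NumPy, assuming $P$ is stored as parallel arrays of positions and instance labels and $G$ as a list of (relation, nodes) pairs — a hypothetical representation, not the paper's code:

```python
import numpy as np


def update_object_pose(points, labels, target, R, t):
    """Apply a rigid transform x -> R @ x + t to all points of one instance in P."""
    pts = points.copy()
    mask = labels == target          # select the moved object's points
    pts[mask] = pts[mask] @ R.T + t  # row-vector convention: (R x)^T = x^T R^T
    return pts


def affected_edges(edges, target):
    """Hyperedges of G involving the moved object, which must be re-verified."""
    return [e for e in edges if target in e[1]]
```

After the transform, only the returned edges need their relational constraints (contact, clearance, alignment, …) rechecked, which keeps updates local.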

In agentic 3D pipelines, the scratchpad guides four coordinated stages:

  1. Asset generation and mesh restoration per object.
  2. Coarse layout planning (mesh-to-point cloud alignment).
  3. Environment setup with auto-verification via round-trip rendering, agentic feedback, and code correction.
  4. Ergonomic adjustment via constraint optimization over the hypergraph.

3. Spatial Constraints and Optimization

Spatial relationships in $G$ are enforced by solving

$$\min_{R_v \in SO(3),\; t_v \in \mathbb{R}^3} \sum_{e \in E} \lambda_{r_e}\, L_{r_e}(\{R_v, t_v\}_{v \in e})$$

where $R_v$, $t_v$ are the rotation and translation for each object, and $L_{r_e}$ is a loss function specialized for the relation $r_e$, e.g.:

  • Contact: $L_{\mathrm{contact}} = \big[\min_{p \in M_{v_i},\, q \in M_{v_j}} \|R_i p + t_i - (R_j q + t_j)\| - \epsilon\big]_+^2$
  • Clearance: $L_{\mathrm{clearance}}(v) = \sum_{v' \ne v} \big[d_{\min}(v) - \|o_v - o_{v'}\|\big]_+^2$
  • Alignment/Symmetry: Quadratic losses along chosen axes or centers (Liu et al., 26 May 2025).
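A toy version of these hinge losses, with a translation-only finite-difference descent standing in for the full $SO(3) \times \mathbb{R}^3$ optimization (illustrative only; a real system would use a proper solver):

```python
import numpy as np


def contact_loss(P_i, P_j, eps=0.01):
    """[min_{p,q} ||p - q|| - eps]_+^2 over already-transformed point sets."""
    d = np.linalg.norm(P_i[:, None, :] - P_j[None, :, :], axis=-1).min()
    return max(d - eps, 0.0) ** 2


def clearance_loss(centers, v, d_min):
    """sum_{v' != v} [d_min - ||o_v - o_v'||]_+^2 over object centers."""
    o_v = centers[v]
    return sum(max(d_min - np.linalg.norm(o_v - o), 0.0) ** 2
               for k, o in centers.items() if k != v)


def descend_translation(P_i, P_j, steps=200, lr=0.5, h=1e-4):
    """Minimize the contact loss over a translation t of object i only,
    using central finite differences as a stand-in for analytic gradients."""
    t = np.zeros(3)
    for _ in range(steps):
        g = np.zeros(3)
        for d in range(3):
            e = np.zeros(3)
            e[d] = h
            g[d] = (contact_loss(P_i + t + e, P_j) -
                    contact_loss(P_i + t - e, P_j)) / (2 * h)
        t -= lr * g
    return t
```

The hinge form $[\cdot]_+^2$ makes each loss zero once its constraint is satisfied, so satisfied relations stop contributing gradient and the optimizer only works on violated ones.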

Navigation and path planning leverage $P$ and $G$: the environment is discretized, path costs penalize clearance violations, and standard graph search (A*, Dijkstra) is applied.
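The discretized search can be sketched as textbook A* over a cost grid, where blocked cells are infinite and cells near objects carry clearance penalties (a generic sketch, not the paper's planner):

```python
import heapq


def astar(grid_cost, start, goal):
    """A* over a 2D grid; grid_cost[r][c] is the per-cell traversal cost.
    float('inf') marks blocked cells; high finite values encode clearance
    penalties near objects. Returns the cheapest path as a list of cells."""
    rows, cols = len(grid_cost), len(grid_cost[0])

    def h(p):  # Manhattan distance; admissible since every move costs >= 1 here
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    frontier = [(h(start), 0.0, start, [start])]  # (f, g, cell, path)
    best = {start: 0.0}
    while frontier:
        _, g, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (pos[0] + dr, pos[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols:
                ng = g + grid_cost[nxt[0]][nxt[1]]  # inf cost never improves
                if ng < best.get(nxt, float("inf")):
                    best[nxt] = ng
                    heapq.heappush(frontier, (ng + h(nxt), ng, nxt, path + [nxt]))
    return None
```

Raising the penalty on cells adjacent to objects trades path length for clearance, which is how soft clearance constraints from $G$ feed into the planner.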

4. Downstream Tasks Enabled by the Scratchpad

Injecting spatial context via the scratchpad unlocks a range of capabilities:

  • Interactive Scene Editing: Move, add, or remove objects and immediately update spatial relations.
  • Ergonomic Planning: Adjust layouts (e.g., furniture) for physical plausibility and functional comfort.
  • Zero-Shot Navigation: Compute collision-free motion plans taking into account object geometry and clearance (Liu et al., 26 May 2025).
  • Semantic Querying: The model can answer open-vocabulary questions about objects and relations (e.g., “Is the chair touching the desk?”).
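Relational queries of this kind reduce to lookups against the hyperedges of $G$. A minimal sketch, assuming edges are stored as (relation, object-labels) pairs — a hypothetical representation chosen for illustration:

```python
def query_relation(edges, relation, *objects):
    """Answer a relational query against the scene hypergraph G.
    query_relation(E, "contact", "chair", "desk") answers
    "Is the chair touching the desk?"; queries are order-insensitive."""
    wanted = set(objects)
    return any(rel == relation and wanted <= set(nodes) for rel, nodes in edges)


def relations_of(edges, obj):
    """All relations an object participates in — the per-object slice of G
    that would be serialized into text for the VLM's readout."""
    return [(rel, nodes) for rel, nodes in edges if obj in nodes]
```

Open-vocabulary phrasing ("touching", "in contact with") would be mapped to the canonical relation names by the VLM before hitting this lookup.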

5. Empirical Results and Metric Benchmarks

Experiments show significant gains over baselines on scene synthesis and manipulation by injecting spatial scratchpads:

  • CLIP/BLIP similarity: Agentic VLMs with scratchpad representation achieve up to 0.737 CLIP and 0.571 LPIPS on challenging inputs, outperforming prior models (Holodeck, DreamScene, ACDC) (Liu et al., 26 May 2025).
  • Functional Plausibility: Human/GPT-4o scores rank spatial scratchpad–enabled scenes highest on realness and ergonomic layout (AQ=1.0, FP=1.0).
  • Generalization: Framework generalizes robustly to diverse prompts and multi-view scenarios, supporting downstream editing and path planning.

Ablations demonstrate that geometric restoration and hypergraph-guided adjustment are critical—removing them increases collisions, reduces semantic accuracy, and degrades overall plausibility.

6. Applications and Limitations

Spatial scratchpads underpin key applications:

  • Embodied AI: Test environment generation for robot learning platforms (Habitat, SAPIEN, BEHAVIOR).
  • AR/VR/Game Prototyping: Semantic, editable 3D worlds synthesized from text.
  • Architectural Design: Automated ergonomic layout respecting spacing/symmetry constraints.
  • Scene Understanding Research: Precise, queryable memory for tool-using agents in IR3D-Bench tasks (Liu et al., 29 Jun 2025).

Limitations include sensitivity to segmentation quality (small or occluded objects may be missed), dependency on underlying geometry extraction tools, and current hypergraph support for only unary–ternary relations. Extending to richer constraints and object dynamics, as well as integration with physics engines, is an open avenue.

7. Conceptual and Historical Context

The spatial scratchpad synthesizes ideas from cognitive psychology (working memory, spatial reasoning), formal scene graphs in computer vision, and explicit state representations from robotics planning and agentic simulation. Its emergence addresses the limitations of unstructured multimodal embeddings, allowing agentic models to bridge the gap between abstract semantic intent and precise geometric arrangement. Contemporary frameworks (e.g., SceneWeaver (Yang et al., 24 Sep 2025), Scenethesis (Ling et al., 5 May 2025), Agentic Scene Policies (Morin et al., 23 Sep 2025)) converged on variants of the spatial scratchpad for robust planning, tool-use, and physical plausibility in 3D scene creation and manipulation.

In sum, the spatial scratchpad is an explicit, geometry- and relation-aware working memory construct foundational to agentic scene planning, supporting iterative reasoning, constraint enforcement, and generalizable scene synthesis and manipulation by vision-language agents.