Complex Spatial Reference
- Complex spatial reference is a framework for modeling spatial relationships using defined frames of reference and contextual cues across physical and abstract domains.
- The methodology integrates geometric, semantic, and neural approaches to ground and resolve spatial ambiguities from language, vision, and sensor data.
- Advances in this field improve spatial cognition tasks and human–robot interaction by addressing multi-scale integration, dynamic scene understanding, and perspective disambiguation.
Complex spatial reference encompasses the representation, interpretation, and computational modeling of spatial relationships in environments ranging from physical 3D space to abstract relational networks. It underlies tasks from natural language grounding and human–robot interaction to spatial cognition and spatial information retrieval. Complex spatial reference is characterized by (1) reference frame ambiguity, (2) multi-scale and multi-modal information integration, (3) compositional structures, and (4) dependence on context, perspective, and prior knowledge. This article surveys foundational formalisms, core computational mechanisms, key representational challenges, and recent advances in the modeling and evaluation of complex spatial reference across language, vision, and sensor-based applications.
1. Foundational Formalisms for Spatial Reference
At the core of spatial reference lies the notion of a frame of reference (FoR), a coordinate or perspective system for specifying how objects relate in space. Three principal FoR types, following Levinson (2003), are:
- Absolute FoR: Fixed to global directions (e.g., north/south/up/down).
- Intrinsic FoR: Aligned to the inherent axes of the relatum object (e.g., front, back, left, right of a car as defined by its canonical orientation).
- Relative FoR: Determined by the observer (e.g., camera or speaker viewpoint).
A spatial reference instance is therefore not only a tuple of objects and relations but is always defined relative to a particular FoR—sometimes marked explicitly (e.g., “to my left”) and sometimes implicit, requiring disambiguation via multimodal cues or pragmatic inference (Premsri et al., 25 Feb 2025, Zhang et al., 22 Oct 2024).
Mathematically, spatial reference can be formalized as a quadruple $(l, r, \rho, T)$, where $l$ and $r$ are the locatum and relatum entities, $\rho$ a spatial relation, and $T$ specifies the transformation matrix applied to positions. For example, the relative “left of” relation is computed by mapping the locatum's displacement into the observer's frame, $p' = R^{-1}(p_l - p_r)$, with $R$ the observer's orientation matrix; the resulting arrangement can be thresholded along axes to infer relation predicates (e.g., $p'_x < 0$ for “left”) (Premsri et al., 25 Feb 2025).
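As a concrete illustration, the relative-FoR computation can be sketched in a few lines. The function name, the restriction to 2D, and the tie-breaking thresholds below are our own simplifications, not the cited paper's implementation:

```python
import math

def relative_relation(locatum, relatum, observer_yaw):
    """Classify a locatum's position relative to a relatum under a
    relative frame of reference defined by the observer's heading.
    Positions are (x, y) tuples; observer_yaw is in radians,
    measured counter-clockwise from the +x axis."""
    # Displacement of the locatum from the relatum in world coordinates.
    dx = locatum[0] - relatum[0]
    dy = locatum[1] - relatum[1]
    # Rotate the displacement into the observer's frame: the inverse
    # (transpose) of the observer's 2x2 rotation matrix.
    c, s = math.cos(observer_yaw), math.sin(observer_yaw)
    fwd = c * dx + s * dy    # component along the observer's facing axis
    lat = -s * dx + c * dy   # component along the observer's left axis
    # Threshold along axes to pick a qualitative relation predicate.
    if abs(lat) >= abs(fwd):
        return "left" if lat > 0 else "right"
    return "in_front" if fwd > 0 else "behind"
```

An observer facing along +x sees a locatum offset toward +y as “left”; rotating the observer by pi flips the judgment, which is exactly the frame dependence discussed above.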
When referencing spatial directions quantum-mechanically, transfer of a spatial reference frame (e.g., between distant observers sharing entangled singlet states) can be formalized in terms of mutual information or Bayesian inference over measurement results, with protocols guaranteeing transfer up to axis inversion, and with errors that decrease as the number of quantum measurements grows (Bahder, 2014).
2. Semantic Representation and Composition
Complex spatial reference is not limited to simple pairwise relations. Representation schemes must handle:
- Hierarchical and compositional spatial structures: Each spatial clause—static or dynamic—can be decomposed into modular configuration units (C), each capturing trajector, landmarks, path (if dynamic), spatial indicator(s), frame of reference, viewer perspective, and qualitative type (topological, directional, distance). For example, the extended AMR (Abstract Meaning Representation)-linked spatial configuration formalism defines each unit as a tuple $C = (\text{trajector}, \text{landmarks}, \text{path}, \text{indicators}, \text{FoR}, \text{perspective}, \text{type})$, allowing fine-grained grounding and composition (Dan et al., 2020).
- Semantic decomposition axes: Topology (containment/adjacency), directionality (axes), distance (qualitative or metric), path segmentation (for dynamic phenomena), and frame of reference specification. Each axis can be reasoned over independently or in combination depending on the complexity of the spatial instruction or description.
- AMR Integration: AMR graphs have been extended to encode spatial reference at clause, phrase, and discourse levels, including explicit roles for FoR, axes, spatial extent, and function tags, facilitating downstream parsing and task execution (e.g., in robotic instruction following or simulated environments).
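The configuration-unit decomposition above can be captured as a small data structure; the field names below paraphrase the roles listed in the text rather than reproducing the formalism's exact notation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SpatialConfiguration:
    """One modular configuration unit C in the spirit of the extended-AMR
    spatial formalism described above (field names are our paraphrase)."""
    trajector: str                          # entity being located
    landmarks: List[str]                    # reference entities
    spatial_indicators: List[str]           # e.g. "on", "behind", "left of"
    frame_of_reference: str                 # "absolute" | "intrinsic" | "relative"
    viewer_perspective: Optional[str] = None    # e.g. "speaker", "addressee"
    path: Optional[List[str]] = None        # path segments, if dynamic
    qualitative_type: str = "topological"   # "topological" | "directional" | "distance"

# A static, directional configuration for "the mug behind the laptop".
config = SpatialConfiguration(
    trajector="mug",
    landmarks=["laptop"],
    spatial_indicators=["behind"],
    frame_of_reference="relative",
    viewer_perspective="speaker",
    qualitative_type="directional",
)
```

Because each semantic axis is an explicit field, downstream reasoners can inspect or override one axis (say, the frame of reference) without re-parsing the clause.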
3. Computational Models and Pipelines
Complex spatial reference in computational systems is addressed via layered models capable of mapping input signals (language, vision, time-series) to explicit spatial predicates and reference frames.
3.1 Geometric and Symbolic Approaches
- Spatial knowledge graphs: Nodes represent spatial entities; edges encode symbolic spatial predicates (e.g., “near”, “on”, “behind”, “aligned”); predicates are formally evaluated via geometric primitives (oriented 3D bounding boxes, pairwise distances, axis projections). For example, the Spatial Reasoner formalizes a predicate such as “on” as vertical adjacency with horizontal overlap (Häsler et al., 25 Apr 2025).
- Pipeline-based reasoning: Fact extraction → predicate deduction → rule application → dynamic evaluation, enabling robust, extensible inference over evolving scene graphs.
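A minimal sketch of geometric predicate evaluation over bounding boxes, in the spirit of the pipeline above; the use of axis-aligned (rather than oriented) boxes and the tolerance value are our simplifications:

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned 3D bounding box: (x, y) footprint plus vertical extent."""
    x0: float
    x1: float
    y0: float
    y1: float
    z0: float
    z1: float

def footprint_overlap(a: Box, b: Box) -> bool:
    # Horizontal (x, y) ranges intersect.
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1

def on(locatum: Box, relatum: Box, eps: float = 0.02) -> bool:
    """'on' as vertical adjacency (bottom of locatum meets top of relatum)
    combined with horizontal overlap. The tolerance eps is our choice."""
    vertically_adjacent = abs(locatum.z0 - relatum.z1) <= eps
    return vertically_adjacent and footprint_overlap(locatum, relatum)

table = Box(0.0, 2.0, 0.0, 1.0, 0.0, 0.75)
cup = Box(0.5, 0.6, 0.4, 0.5, 0.75, 0.85)
```

Here `on(cup, table)` holds because the cup's bottom coincides with the table's top and their footprints overlap, while the reversed query fails the adjacency test; such evaluations populate edges in the scene graph for rule application.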
3.2 Multimodal and Embodied Approaches
- Perspective-taking and embodied reference alignment: Recent approaches implement “view rotation” to re-center spatial data into the sender’s coordinate system, learning direction vectors (body orientation, gesture) and fusing with language for referent localization. Spatial attention maps (via cosine similarity in sender-centered coordinates) are layered with gesture and language attention to enable resolution of complex, multimodal references (e.g., “the mug behind the laptop, to the left of the keyboard, while pointing diagonally”) (Shi et al., 2023).
- Complex spectro-spatial filtering: In audio (e.g., target speaker extraction), joint complex-valued spectral and spatial filtering enables the network to learn beampatterns (directive filters) for reference location disambiguation, with trainable injection of target direction (DoA) for context-sensitive extraction (Briegleb et al., 2022).
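The perspective-taking step in these embodied approaches (re-centering coordinates into the sender's frame and scoring candidates against a pointing direction) can be sketched as follows; the 2D geometry, function names, and softmax weighting are our assumptions:

```python
import math

def rotate_into_sender_frame(point, sender_pos, sender_yaw):
    """Re-center a world-coordinate point into the sender's coordinate
    system ("view rotation"). 2D sketch; the cited work is richer."""
    dx, dy = point[0] - sender_pos[0], point[1] - sender_pos[1]
    c, s = math.cos(sender_yaw), math.sin(sender_yaw)
    return (c * dx + s * dy, -s * dx + c * dy)

def gesture_attention(candidates, sender_pos, sender_yaw, pointing_dir):
    """Score candidate referents by cosine similarity between the pointing
    direction and each candidate's direction, both expressed in
    sender-centered coordinates; normalized via a softmax."""
    scores = []
    for p in candidates:
        v = rotate_into_sender_frame(p, sender_pos, sender_yaw)
        norm = math.hypot(*v) * math.hypot(*pointing_dir)
        cos = (v[0] * pointing_dir[0] + v[1] * pointing_dir[1]) / norm
        scores.append(math.exp(cos))
    total = sum(scores)
    return [s / total for s in scores]
```

In a full system this gesture attention map would be fused with language and body-orientation cues before picking the referent.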
3.3 Reference Object Reasoning
- Reference objects as scale anchors: Quantitative spatial reasoning (distances, areas) in vision-LLMs (VLMs) is enhanced by automatically identifying scene reference objects with canonical real-world dimensions, using these as anchors for pixel-to-metric conversion during numeric estimation (Liao et al., 15 Sep 2024). Prompting VLMs to find and use reference objects yields significant gains (>19 percentage points) in spatial QA accuracy.
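The scale-anchor idea reduces to a simple proportion once a reference object with a canonical real-world dimension has been identified; the example values below (a door's canonical height, the pixel counts) are hypothetical:

```python
def metric_estimate(target_px, reference_px, reference_real_m):
    """Convert a pixel measurement to meters using a reference object with
    a known (canonical) real-world dimension as a scale anchor.
    Assumes the target and reference lie at comparable depth; a strong
    simplification of the cited VLM prompting strategy."""
    meters_per_pixel = reference_real_m / reference_px
    return target_px * meters_per_pixel

# Hypothetical scene: a door (canonical height ~2.0 m) spans 400 px,
# and the queried object spans 100 px, so it is roughly 0.5 m tall.
estimate = metric_estimate(target_px=100, reference_px=400, reference_real_m=2.0)
```

The VLM's role in the cited work is to find the reference object and its canonical size; the conversion itself is this elementary ratio.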
4. Grounding and Disambiguation in Natural Language
Complex spatial reference emerges in free-form text, especially in natural language place descriptions and instructions. Formal approaches such as the place graph framework reconstruct explicit spatial relationship graphs from text, encompassing:
- Directed multi-graphs of place nodes (each with possibly multiple surface references) and spatial relationships (edges labeled by standardized relations: cardinal directions, distance, topology, relative directions).
- Constraint modeling: Each relationship is modeled as a geometric or topological constraint (e.g., “north_of” as a half-plane above the relatum’s centroid, “near” as an adaptive buffer, “inside” as region containment).
- Incremental geo-referencing: Anchor places are matched to gazetteers; other places are localized via intersecting spatial constraints (approximate location regions); ambiguity is resolved by best-matching references and spatial similarity scores. Final outputs cover both gazetteered and vernacular/non-gazetteered places, with precision up to 90% for anchors and ≥80% for derived regions (Chen et al., 2017).
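Constraint-based localization can be sketched by intersecting constraints over a sampled grid around a gazetteer-matched anchor; the constraint forms (a fixed-radius disc standing in for the adaptive buffer) and grid parameters below are our simplifications:

```python
import math

def north_of(point, relatum_centroid):
    """Half-plane constraint: point lies above (greater y than) the
    relatum's centroid."""
    return point[1] > relatum_centroid[1]

def near(point, relatum_centroid, buffer_m):
    """Buffer constraint, here a fixed-radius disc for brevity
    (the cited framework uses an adaptive buffer)."""
    return math.dist(point, relatum_centroid) <= buffer_m

def candidate_region(anchor, constraints, extent=10.0, step=0.5):
    """Approximate the location region of an unlocated place as the set of
    grid points around a gazetteer-matched anchor that satisfy every
    spatial constraint. A coarse stand-in for exact region intersection."""
    cells = []
    n = int(2 * extent / step)
    for i in range(n + 1):
        for j in range(n + 1):
            p = (anchor[0] - extent + i * step, anchor[1] - extent + j * step)
            if all(c(p) for c in constraints):
                cells.append(p)
    return cells

anchor = (0.0, 0.0)
region = candidate_region(anchor, [
    lambda p: north_of(p, anchor),
    lambda p: near(p, anchor, 3.0),
])
```

Each additional relationship extracted from the text tightens the region, which is how non-gazetteered places acquire approximate locations.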
5. Evaluation of Frames of Reference in AI Models
Robust handling of FoRs is a central challenge and critical benchmark for LLMs and vision-LLMs (VLMs):
- Benchmarking frameworks: COMFORT and FoREST systematically probe FoR comprehension via synthetic and real scene layouts, ambiguous versus disambiguated prompts, and cross-linguistic trials (Zhang et al., 22 Oct 2024, Premsri et al., 25 Feb 2025).
- Key evaluation metrics:
- Region parsing errors: deviation of model-assigned relations from analytic region boundaries (e.g., cosine- or hemisphere-based acceptance regions).
- Consistency and robustness (symmetry, opposite labels, noise, standard deviation across variants).
- Frame bias and flexibility: Most current models strongly default to English-centric egocentric relative and reflected axes, failing to adapt flexibly to intrinsic, addressee, or non-English conventions—even when prompted explicitly.
- Spatial-guided prompting: Eliciting explicit FoR, topology, direction, and distance before QA or layout generation substantially improves performance (e.g., +7% on FoREST QA for GPT-4o) and layout-generation (+3.5% accuracy for left/right relations) (Premsri et al., 25 Feb 2025).
| Benchmark | FoR Types Tested | Main Deficits Found | Best Mitigation |
|---|---|---|---|
| COMFORT (Zhang et al., 22 Oct 2024) | ego-rel, add-rel, intrinsic | Bias to English egocentric reflected FoR; poor adaptation to explicit prompts | Cross-lingual data, perspective modules |
| FoREST (Premsri et al., 25 Feb 2025) | absolute, relative, intrinsic (+internal/external) | Systematic FoR bias, especially in ambiguous contexts; errors in perspective conversion | Spatial-Guided prompting |
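A minimal sketch of a hemisphere-style region metric in the spirit of these benchmarks; the 2D geometry, function names, and probe setup are ours:

```python
import math

def hemisphere_correct(pred_relation, locatum, relatum, observer_yaw):
    """Accept a 'left'/'right' label iff the locatum falls in the matching
    analytic half-space (hemisphere) of the observer-centered frame."""
    dx, dy = locatum[0] - relatum[0], locatum[1] - relatum[1]
    c, s = math.cos(observer_yaw), math.sin(observer_yaw)
    lat = -s * dx + c * dy            # signed left(+)/right(-) component
    regions = {"left": lat > 0, "right": lat < 0}
    return regions.get(pred_relation, False)

def region_parsing_error(model, relatum, observer_yaw, probes):
    """Fraction of probe locatum positions where the model's predicted
    relation falls outside the analytic acceptance region."""
    wrong = sum(
        not hemisphere_correct(model(p, relatum, observer_yaw),
                               p, relatum, observer_yaw)
        for p in probes
    )
    return wrong / len(probes)

# A trivially correct "model" scores zero region-parsing error.
perfect = lambda p, r, yaw: "left" if hemisphere_correct("left", p, r, yaw) else "right"
probes = [(1.0, 1.0), (1.0, -1.0), (-2.0, 0.5)]
err = region_parsing_error(perfect, (0.0, 0.0), 0.0, probes)
```

Consistency probes (symmetry, opposite labels) follow the same pattern: re-query the model on systematically transformed scenes and measure the disagreement rate.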
6. Manifold and Network-Based Complex Spatial Reference
Complex spatial relationships in large-scale environments (e.g., geographic, social, or traffic systems) are modeled by embedding interaction networks into low-dimensional metric spaces or “geographic manifolds”:
- Inverse friction metric: Normalizes observed interaction flows between pairs by mass-like characteristics, yielding an impedance-measuring scalar that is mapped to a distance metric. The entire network is then embedded into a low-dimensional manifold via Isomap or t-SNE, allowing spatial analytics that reflect intrinsic geographic constraints (Jiang et al., 31 Oct 2024).
- Local Euclideanity: Empirically, neighborhood size distributions and simplex occurrence rates demonstrate that most spatial networks are locally (and even globally) embeddable in 2-3 dimensions.
- Applications: Location choice and dissemination/propagation can be reformulated as tessellations or diffusion on manifolds, leading to uniformity and isotropy unattainable on raw maps.
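The inverse-friction construction can be sketched numerically; the gravity-style normalization and log transform below are illustrative choices, not the cited paper's exact formula:

```python
import math

def friction(flow, mass_i, mass_j):
    """Impedance between two places: mass-like characteristics
    (e.g., population) over observed interaction flow, so that
    strong flows between large places mean low friction."""
    return (mass_i * mass_j) / flow

def manifold_distance(flow, mass_i, mass_j):
    """Map impedance to a distance metric, monotone in friction so that
    strongly interacting pairs land close together once the pairwise
    distances are embedded. The log1p transform is one common choice."""
    return math.log1p(friction(flow, mass_i, mass_j))
```

The resulting pairwise distance matrix would then be handed to Isomap or t-SNE for the low-dimensional embedding step described above.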
7. Implications, Limitations, and Future Research
Complex spatial reference is a cross-cutting concern in AI models, spatial analytics, and human–machine interaction:
- Ambiguity and Flexibility: Human spatial cognition flexibly switches between FoRs, leverages reference objects, and composes relations hierarchically; most current ML models encode only a subset of these capabilities, with significant cross-lingual and context-generalization limitations (Zhang et al., 22 Oct 2024, Premsri et al., 25 Feb 2025).
- Prompt Engineering and Model Design: Explicit prompting for spatial structure (FoR, topology, direction, distance) reliably enhances model performance, while training or architectural inclusion of explicit FoR and perspective-taking modules is a major future direction.
- Evaluation protocols: Robust assessment requires not only accuracy but region/symmetry/error metrics and scenario coverage—spanning ambiguous, compositional, multimodal, and cross-cultural references.
- Integration Needs: Extending symbolic spatial configuration schemes (e.g., AMR-based) to perceptual data and fusing manifold/network–based spatial references with geometric and LLMs remain active frontiers.
Complex spatial reference thus constitutes both a core challenge and a driver for theoretical, methodological, and applications-focused advances across spatial cognition, AI, and computational geography.