Geometric–Semantic Reasoner Model

Updated 3 March 2026

Geometric–Semantic Reasoner is a unified reasoning architecture that combines oriented 3D spatial data with symbolic predicates to support spatial inference and chain-of-thought reasoning.
The model employs a multi-stage pipeline that extracts geometric features, evaluates pairwise predicates, and constructs spatial knowledge graphs for rule-based inference.
Advanced multimodal extensions integrate vision encoders and language models to boost performance in applications such as XR, security inspection, and robotics.

The Geometric–Semantic Reasoner (GSR) model refers to a class of reasoning architectures that unify low-level geometric data with semantic and symbolic predicates, enabling spatial inference, chain-of-thought reasoning, and explicit scene-graph manipulation across domains such as XR (extended reality), robotics, security inspection, and 3D vision. Across foundational works, GSR models integrate parametric geometric representations, predicate calculus, graph-based abstraction, and multimodal learning to bridge perception and high-level reasoning.

1. Geometric Representations and Predicate Foundations

GSR frameworks encode the spatial state of environments via oriented 3D bounding boxes, point-based spatial tokens, or 6D pose graphs. In XR applications (Häsler et al., 25 Apr 2025), each object $o$ is formalized by an oriented box $B_o = (c_o, R_o, e_o)$ , where $c_o \in \mathbb{R}^3$ is the center, $R_o \in SO(3)$ is a rotation (usually yaw-only in practice), and $e_o$ are the half-extents along axes, defining

$B_o = \{ c_o + R_o \cdot \operatorname{diag}(e_o) \cdot u\ |\ u\in [-1,1]^3 \}$

Spatial relationships are captured through a family of binary predicates $P(o_i, o_j)$ spanning:

Topological relations: $Disjoint(o_i,o_j),\ Inside(o_i,o_j),\ Overlap(o_i,o_j)$
Connectivity: $On(o_i,o_j),\ In(o_i,o_j),\ By(o_i,o_j)$
Directional: $Left(o_i,o_j),\ Right(o_i,o_j),\ Ahead(o_i,o_j),\ Behind(o_i,o_j)$ (using local object-centric frames)
Proximity: $Near(o_i,o_j),\ Far(o_i,o_j)$
Orientation: $Aligned(o_i,o_j),\ Orthogonal(o_i,o_j)$

Explicit thresholds ( $\delta_{\max}, \rho, \theta_{\max}$ ) parameterize proximity and orientation conditions. Formally, predicates are ground facts derived from geometric computation (intersection, distance, angular comparison) on object pairs.

2. Symbolic Mapping, Rule Evaluation, and Knowledge Graphs

Geometric–semantic mapping proceeds as follows (Häsler et al., 25 Apr 2025):

Extraction: Compute geometry for all $n$ objects.
Pairwise Predicate Evaluation: For pairs $(i,j)$ , evaluate predicate templates (topology, touch, directionality) conditioned on distance prefiltering.
Fact Emission: A passing geometric test emits a symbolic fact $P(o_i,o_j)$ .
Rule-based Inference: Logical rules (e.g., $Touch(o_i,o_j) \wedge Above(o_i,o_j) \Rightarrow On(o_i,o_j)$ ) are applied for higher-order relational facts.
Spatial Knowledge Graph Construction: Build a directed, labeled multi-edge graph $G = (V, E)$ where $V$ are objects and $E \subseteq V \times V \times L$ are predicate-labeled relational edges.

These knowledge graphs support both forward-chaining and backward-chaining inference, and scale efficiently with spatial indexing and fact caching.

3. Multimodal and Chain-of-Thought Extensions

For multimodal security screening (Peng et al., 23 Nov 2025), GSR architectures are extended to reason over paired views (e.g., top- and side-view X-ray images), fusing geometric cues and semantic information through hierarchical token streams. This is operationalized by:

A vision encoder (e.g., ViT-L/14) producing separate patch token embeddings for each view.
A multimodal alignment MLP projects per-view embeddings to the LLM’s semantic space.
Sequence tokens $<$ top $>$ , $<$ side $>$ , and $<$ conclusion $>$ control the flow of cross-view geometric reasoning and semantic fusion.
A language reasoner (MoE-based) performs multi-head attention over these tokens, generating structured chain-of-thought (CoT) reasoning that explicitly decomposes spatial observations from each view before semantic integration.

Supervision leverages GSXray’s three-stage CoT annotations, training the model with targets:

$\mathcal{L}_{geo} = -\sum_{t=1}^{T_2}\log p(s_t|\cdots), \quad \mathcal{L}_{sem} = -\sum_{t=T_2+1}^{T_3}\log p(s_t|\cdots), \quad \mathcal{L} = \lambda_{geo}\mathcal{L}_{geo} + \lambda_{sem}\mathcal{L}_{sem}$

Explicit separation of geometric and semantic loss yields substantial improvements in multi-view diagnostic accuracy (e.g. +27% mIoU compared to single-view (Peng et al., 23 Nov 2025)), confirming the value of treating the second view as a “language-like” modality.

4. Reasoning over 3D Scene-Graphs in Embodied Manipulation

In embodied reasoning (Hu et al., 2 Feb 2026), GSR refers to Grounded Scene-graph Reasoning, where at each timestep $t$ , the world-state is a 3D scene graph $G_t=(O_t, E_t)$ :

Node features: $p_i=(x_i, y_i, z_i, \theta_i, \phi_i, \psi_i),\ s_i\in \mathbb{R}^{d^s}$
Edge features: $e_{ij}$ for predicates $r_{ij}$ (e.g., $on$ , $in$ , $adjacent$ , $is\_open$ )
Transition Function: Actions $a_t$ produce graph transitions $G_{t+1} = \Phi(G_t, a_t)$ .
Precondition Checking: Logical predicates evaluated over $G_t$ ensure feasibility.
Action Consequence Prediction: Graph neural modules (MLPs, relation-aware attention) predict edge/node transitions in response to actions.

Supervised objectives combine world-state reconstruction, action-prediction, and goal-state satisfaction losses, guided by the Manip-Cognition-1.6M dataset. Explicit world modeling and predicate-based planning yield significant gains in zero-shot task completion and long-horizon temporal coherence (e.g., RLBench KitchenOps 92.5%, Pick&Place 82.0%) (Hu et al., 2 Feb 2026).

5. Integration with ML, NLP, and Rule Systems

GSR models offer deep interoperability with upstream vision and downstream reasoning technologies:

ML Integration: Accept outputs from vision detectors (bounding boxes, confidences) to emit symbolic spatial relations, supporting closed-loop refinement with trackers or detection models (Häsler et al., 25 Apr 2025).
NLP/LLM Integration: Predicate naming schemes are near-natural-language, facilitating mappings from user utterances (e.g., “move the chair near the window”) to formal predicates and supporting seamless language-to-action pipelines (Häsler et al., 25 Apr 2025, Peng et al., 23 Nov 2025).
Rule and Ontology Integration: GSR can load custom ontologies (OWL/RDF), enabling taxonomic reasoning within knowledge graphs and export to logic programming (e.g., ASP, SPARQL endpoints).

6. Empirical Performance, Scalability, and Application Cases

Benchmark evaluations demonstrate efficient and scalable reasoning:

Client and mobile XR deployments: 100 objects × 12 relation categories can be processed at 30–60 Hz; scenes with 1,000+ objects achieve sub-second updates via spatial indexing (Häsler et al., 25 Apr 2025).
Multimodal inspection: GSR achieves 65.4% accuracy and 52.3% mIoU on DualXrayBench, substantially outstripping prior VLMs (Peng et al., 23 Nov 2025).
Robotics/embodiment: Scene-graph centric GSR shows robust long-horizon performance across RLBench and LIBERO task suites, maintains robustness to moderate edge noise, and supports complex manipulation involving precondition verification and explicit goal planning (Hu et al., 2 Feb 2026).

Representative applications include AR/VR scene understanding, X-ray threat detection, robotic assembly, tool retrieval, and semantic map generation.

7. Core Innovations and Theoretical Significance

GSR’s principal innovation lies in its formal unification of metric geometric state with symbolic predicate calculus, enabling spatial facts to propagate through dynamic, logically consistent knowledge graphs. By externalizing reasoning from latent representations and enforcing explicit precondition/action/result structure, GSR architectures are positioned to support real-time, interpretable, and scalable spatial intelligence within a broad range of perceptual and interactive systems (Häsler et al., 25 Apr 2025, Peng et al., 23 Nov 2025, Hu et al., 2 Feb 2026).

A plausible implication is that further advances in GSR-like architectures—especially those combining continuous geometric encoding, explicit predicate abstraction, multi-hop logical inference, and large language modeling—will continue to drive improvements in embodied agents, multimodal diagnostic systems, and natural task specification in human–machine collaboration.