Relative Move Embedding Overview
- Relative move embedding is a technique that encodes transitions by focusing on relative offsets, crucial for sequence modeling, spatial analysis, and robotic applications.
- Kernelized formulations, such as logarithmic CPD kernels, enhance self-attention by reducing long-context perplexity while using fewer parameters.
- Applications in spatial and generative domains demonstrate high cluster fidelity and effective metric preservation even when learning from sparse data.
Relative move embedding refers to a family of techniques for representing and modeling relationships, transformations, or transitions—typically in spatial, temporal, or sequential domains—through vectorized encodings that make explicit the relative difference, offset, or displacement between entities or positions. Approaches based on relative move embeddings differ from standard absolute position encodings or location-only embeddings by concentrating on the topological or metric relationship between pairs rather than their coordinates in a global system. Relative move embedding has emerged as a central concept in sequence modeling, robotics, and spatial representation, supporting extrapolation, generalization, and efficient learning from sparse data.
1. Foundational Concepts and Definitions
Relative move embedding formally encodes the change or move between two objects or states, rather than their underlying positions. In sequential models, this often manifests through a symmetric or antisymmetric function of a positional (or state/action) difference; in spatial contexts, it may encode transition vectors or pairwise displacements.
In the context of self-attention models, relative move embeddings typically operate on index differences. If $i$ and $j$ are positions, only the relative offset $i-j$ is encoded, ensuring shift invariance. In spatial movement modeling (e.g., in DeepMove), embeddings correspond to transitions (origin → destination) and are parameterized and learned through co-occurrence or movement statistics, independent of absolute position (Zhou et al., 2018).
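The shift-invariance property can be checked directly: shifting every absolute position by the same amount leaves the matrix of relative offsets unchanged. A toy NumPy illustration (not from either cited paper):

```python
import numpy as np

# Absolute positions change under a global shift, but the matrix of
# relative offsets i - j does not.
pos = np.arange(5)
offsets = pos[:, None] - pos[None, :]                      # offsets[i, j] = i - j
shifted = (pos + 100)[:, None] - (pos + 100)[None, :]      # same positions, shifted
assert np.array_equal(offsets, shifted)                    # shift invariance holds
```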
2. Kernelized Relative Move Embedding in Self-Attention
In transformer-based architectures, relative move embeddings are used to replace or augment absolute positional encodings. The KERPLE framework generalizes the construction of relative positional biases via the theory of conditionally positive definite (CPD) kernels (Chi et al., 2022). The central property is shift-invariance: the positional bias between two tokens at positions $i$ and $j$ depends only on the offset $i-j$,
$$b_{ij} = k(i-j).$$
This allows a range of kernel functions, including:
- Power-distance kernels: $k(i-j) = -r_1\,|i-j|^{r_2}$, with $r_1 > 0$ and $0 < r_2 \le 2$.
- Logarithmic-distance kernels: $k(i-j) = -r_1 \log\!\left(1 + r_2\,|i-j|\right)$, with $r_1, r_2 > 0$.
The kernelized bias augments the pre-softmax logits in self-attention:
$$a_{ij} = q_i^\top k_j + k(i-j),$$
where $q_i$, $k_j$ are query and key projections. Because the softmax is invariant to adding a constant to all logits, any constant offset needed to render the total kernel positive definite can be ignored at inference time.
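As a concrete illustration, the logarithmic-kernel bias and its addition to the attention logits can be sketched in a few lines of NumPy. Parameter names (`r1`, `r2`) and the toy dimensions below are illustrative, not taken from the paper's released code:

```python
import numpy as np

def log_cpd_bias(seq_len, r1=1.0, r2=1.0):
    """Logarithmic CPD relative-position bias: -r1 * log(1 + r2 * |i - j|)."""
    idx = np.arange(seq_len)
    rel = np.abs(idx[:, None] - idx[None, :])     # |i - j|, shape (L, L)
    return -r1 * np.log1p(r2 * rel)

def attention_weights(q, k):
    """Pre-softmax logits q_i . k_j, augmented with the shift-invariant bias."""
    logits = q @ k.T + log_cpd_bias(q.shape[0])
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
weights = attention_weights(q, k)  # rows sum to 1; distant pairs are penalized
```

Because the bias grows only logarithmically with $|i-j|$, distant tokens are down-weighted gently, which is the property credited with the strong length extrapolation reported below.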
Empirical studies demonstrate that logarithmic CPD kernels support strong length extrapolation, matching or outperforming strong baselines such as bucketed T5 biases and ALiBi—with as few as two kernel parameters per attention head (Chi et al., 2022).
3. Relative Move Embedding in Spatial and Mobility Data
In spatial representation, relative move embeddings capture the structural information underlying the movement patterns between locations. DeepMove applies this principle by learning separate origin and destination embedding matrices $V$ and $U$ for points-of-interest (POIs), treating each movement (trip) as an (origin, destination) pair analogous to (center, context) pairs in word2vec (Zhou et al., 2018). Given trip data and negative sampling, the model optimizes the skip-gram objective
$$\mathcal{L} = \sum_{(o,d)} \Bigl[\log \sigma\bigl(u_d^\top v_o\bigr) + \sum_{n=1}^{N} \mathbb{E}_{d_n \sim P_N}\,\log \sigma\bigl(-u_{d_n}^\top v_o\bigr)\Bigr],$$
where $v_o$ is the origin embedding, $u_d$ the destination embedding, and the $d_n$ are negative destinations drawn from a noise distribution $P_N$.
The relative move embedding for a trip may be computed as the difference $u_d - v_o$, or via concatenation or averaging of the origin and destination embeddings. No additional parametric distance or temporal encoding is incorporated; the learned geometry arises entirely from the movement-induced co-occurrence graph.
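A minimal sketch of the negative-sampling objective over (origin, destination) trip pairs follows; the embedding sizes, matrix names, and sampling details are illustrative assumptions, not DeepMove's exact implementation:

```python
import numpy as np

def sgns_loss(V, U, trips, neg_ids):
    """Skip-gram negative-sampling loss over (origin, destination) trips.

    V: origin embeddings, U: destination embeddings, both (num_pois, dim).
    neg_ids: sampled negative destinations per trip, shape (num_trips, K).
    """
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = 0.0
    for (o, d), negs in zip(trips, neg_ids):
        loss -= np.log(sig(U[d] @ V[o]))          # pull the true destination
        for n in negs:
            loss -= np.log(sig(-(U[n] @ V[o])))   # push sampled negatives away
    return loss / len(trips)

rng = np.random.default_rng(1)
V = rng.normal(scale=0.1, size=(10, 8))
U = rng.normal(scale=0.1, size=(10, 8))
trips = [(0, 1), (2, 3)]
negs = rng.integers(0, 10, size=(2, 5))
loss = sgns_loss(V, U, trips, negs)
move = U[1] - V[0]  # one choice of relative move embedding for trip (0, 1)
```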
DeepMove achieves high cluster fidelity: on New York City taxi data, the OD variant attains a 97% category match rate and a silhouette coefficient of $0.85$ for clusters, surpassing feature-based and frequency-based alternatives (Zhou et al., 2018).
4. Metric Preservation and Topology Discovery in Generative Models
In high-dimensional robotics and manipulation tasks, model collapse and poor coverage of the action space are common when only sparse observations are available. "Neural Embedding for Physical Manipulations" introduces a normalized pairwise distance constraint between latent and output/action spaces, enforcing that the diversity of generated actions matches the spread present in the latent space (Zhang et al., 2019). This is a form of relative move embedding where the relationship between samples in the latent and in the output enforces a metric-preserving mapping:
$$\mathcal{L}_{\text{div}} = \sum_{m,n} \max\!\bigl(0,\; \hat{D}^{z}_{mn} - \hat{D}^{a}_{mn} - \alpha\bigr),$$
where $\hat{D}^{z}$, $\hat{D}^{a}$ are per-sample normalized pairwise distance matrices over latent codes and generated actions, and $\alpha$ is a slack margin.
This constraint prevents "mode collapse" by penalizing mappings where large latent-space differences are not preserved in action space, ensuring that distinctions are respected and the full topology of the action manifold is discoverable—even with limited data (Zhang et al., 2019).
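One way to sketch this constraint in NumPy is a hinge penalty on the gap between the two normalized distance matrices. The normalization scheme, hinge form, and names below are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

def normalized_dists(X):
    """Pairwise Euclidean distances within a batch, scaled into [0, 1]."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return D / (D.max() + 1e-8)

def diversity_loss(z, a, alpha=0.1):
    """Hinge penalty when latent spread exceeds action spread by more than alpha."""
    Dz, Da = normalized_dists(z), normalized_dists(a)
    return np.maximum(0.0, Dz - Da - alpha).sum()

rng = np.random.default_rng(2)
z = rng.normal(size=(8, 3))          # latent codes
a_collapsed = np.zeros((8, 5))       # mode-collapsed actions: all identical
a_spread = rng.normal(size=(8, 5))   # diverse actions
# A collapsed decoder is penalized far more than one that preserves spread.
assert diversity_loss(z, a_collapsed) > diversity_loss(z, a_spread)
```

The collapsed case is penalized because every latent-space distance is "forgotten" in action space, which is exactly the failure mode the constraint is designed to rule out.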
5. Empirical Findings and Comparative Performance
Relative move embedding techniques, when operationalized through appropriate kernel functions, loss constraints, or negative-sampling paradigms, deliver robust performance in extrapolative and data-sparse regimes.
- In KERPLE, the logarithmic CPD kernel reduced long-context perplexity by 5–10% compared to the linear ALiBi and matched or outperformed bucketed log-bias T5 variants, at considerably lower parameter cost (Chi et al., 2022).
- In DeepMove, OD-based relative embedding models achieved 97% categorical match rate and silhouette $0.85$, outperforming spatial proximity and check-in-based baselines, and retaining performance with only $1/12$ of the data volume (Zhou et al., 2018).
- In metric-preserving generative models, normalized diversity losses yielded Fréchet distances of $3.5$ (vs. $21.6$–$26.8$ for VAEs and GANs) and Jensen-Shannon divergences of $0.02$ (vs. $0.05$–$0.16$), with visually superior coverage of action-space topologies (Zhang et al., 2019).
6. Implementation Guidelines and Extensions
Designing relative move embeddings requires matching the kernel function or structure to the domain's geometry and the desired generalization properties.
- In sequence models, prefer CPD kernels with controllable tails (e.g., the logarithmic kernel $-r_1 \log(1 + r_2\,|i-j|)$), and limit the parameter count per head/layer to ensure scalability and regularize learning (Chi et al., 2022).
- In spatial and trip embedding, extract movement-induced co-occurrence pairs; avoid explicit distance regularization unless justified by domain prior. Rely on negative sampling and distributional assumptions to learn a geometry reflecting movement structure (Zhou et al., 2018).
- For generative coverage of unknown manifolds, enforce normalized diversity constraints at the latent–output mapping, tuning the margin parameter to balance precision and spread, and use uniform priors over bounded hypercubes to prevent drift and collapse (Zhang et al., 2019).
Extensions are directly realizable for learned transformations in continuous domains, including 3D pose interpolation, deformable object reshaping, and any sequence or action manifold where topology discovery is critical (Zhang et al., 2019). The relative move embedding paradigm, rooted in metric and kernel theory as well as deep learning practice, provides a robust methodology for modeling the structure of transitions and differences across a wide range of domains.