Hierarchical Continuous Coordinate Encoding
- Hierarchical Continuous Coordinate Encoding is a framework that uses continuous, multilevel codes to represent spatial coordinates with improved smoothness and precision.
- It integrates techniques like multi-head attention and dynamic loss weighting to overcome binary encoding limits and boost accuracy in pose estimation and image super-resolution.
- HCCE enables ultra-dense 2D–3D correspondence and efficient neural representation for tasks such as visual localization and adaptive image compression.
Hierarchical Continuous Coordinate Encoding (HCCE) is a mathematical framework and neural network strategy for encoding multidimensional positional or spatial information at multiple resolutions by leveraging continuous, multi-level representations. HCCE is utilized in computer vision, pose estimation, visual localization, image compression, and continuous image super-resolution. It overcomes limitations of strict binary encoding by employing a hierarchy of continuous codes that facilitate smoother prediction, greater precision, and enhanced stability during neural network training. HCCE can be integrated with dense correspondence construction, multi-head attention, and dynamic training loss weighting to further refine accuracy and applicability across various domains.
1. Fundamental Principles of Hierarchical Continuous Coordinate Encoding
HCCE represents each dimension of a coordinate (e.g., $x$, $y$, or $z$) with a succession of continuous values computed recursively in a hierarchical fashion. For a normalized coordinate $x \in [0, 1]$, the encoding at the first level is simply $h_1 = x$. Higher encoding levels ($l > 1$) employ a mirroring operation, defined by:

$$h_l = 1 - \bigl|\, 2 h_{l-1} - 1 \,\bigr|$$

This mirroring folds the coordinate space, ensuring continuity and avoiding the abrupt transitions present in binary encodings. The continuous hierarchy can be converted to binary codes for decoding. For level $l$, the binarization employs thresholding at $\tfrac{1}{2}$:

$$g_l = \mathbb{1}\!\left[ h_l \geq \tfrac{1}{2} \right], \qquad b_l = g_1 \oplus g_2 \oplus \cdots \oplus g_l,$$

where the cumulative XOR converts the Gray-code-like folded bits $g_l$ into positional bits $b_l$. The encoded value can then be approximated with a weighted sum of binary codes:

$$x \approx \sum_{l=1}^{L} b_l \, 2^{-l}.$$

Eight levels ($L = 8$) are typical, as additional hierarchies yield diminishing returns in precision.
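This encode/decode cycle is compact enough to express directly. Below is a minimal NumPy sketch, assuming the mirroring fold and threshold-at-$\tfrac{1}{2}$ binarization reconstructed above; the function names are illustrative:

```python
import numpy as np

def hcce_encode(x: np.ndarray, levels: int = 8) -> np.ndarray:
    """Hierarchically encode normalized coordinates x in [0, 1].

    Level 1 is the coordinate itself; each subsequent level applies the
    mirroring fold h_l = 1 - |2 h_{l-1} - 1|, so every level is continuous.
    """
    codes = [x]
    for _ in range(1, levels):
        codes.append(1.0 - np.abs(2.0 * codes[-1] - 1.0))  # mirroring fold
    return np.stack(codes, axis=-1)  # shape (..., levels)

def hcce_decode(h: np.ndarray) -> np.ndarray:
    """Approximately recover coordinates from hierarchical continuous codes.

    Thresholding each level at 0.5 yields Gray-code-like bits; a cumulative
    XOR converts them to positional bits, which are summed with weights 2^-l
    (plus a half-bin term that centers the estimate in its quantization bin).
    """
    g = (h >= 0.5).astype(np.int64)            # per-level binarization
    b = np.cumsum(g, axis=-1) % 2              # cumulative XOR: Gray -> binary
    levels = h.shape[-1]
    weights = 2.0 ** -(np.arange(levels) + 1)
    return b @ weights + 2.0 ** -(levels + 1)  # weighted sum of binary codes

x = np.array([0.3, 0.72])
print(hcce_decode(hcce_encode(x)))  # ≈ [0.3, 0.72], within 2^-9 of the input
```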
2. Application in Ultra-Dense 2D–3D Correspondence Construction
HCCE is a core component of pose estimation frameworks such as HccePose(BF) (Wang et al., 11 Oct 2025). Here, neural networks predict 3D correspondences for both the front and back surfaces of objects from single 2D images. Encoding these coordinates with HCCE avoids the discontinuities associated with binary stripes and makes learning more stable and accurate.
For each image pixel, ultra-dense interior correspondences are created by interpolating between the front ($\mathbf{p}^{F}$) and back ($\mathbf{p}^{B}$) surface coordinates:

$$\mathbf{p}_k = (1 - \lambda_k)\,\mathbf{p}^{F} + \lambda_k\,\mathbf{p}^{B}, \qquad \lambda_k \in [0, 1].$$

This generates a continuum of 3D samples between the object surfaces. The resulting correspondences are fed into RANSAC-PnP algorithms to estimate object poses, increasing pose precision across multiple datasets.
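The densification step and the downstream pose solve can be sketched as follows, using synthetic stand-ins for the network's per-pixel predictions (`pix`, `front_xyz`, `back_xyz`) and camera intrinsics `K`, and OpenCV's `solvePnPRansac` as the RANSAC-PnP solver:

```python
import numpy as np
import cv2

def interpolate_correspondences(pix, front_xyz, back_xyz, n_samples=8):
    """Densify 2D-3D correspondences by interpolating between the predicted
    front- and back-surface 3D points associated with each pixel.

    pix: (N, 2) pixel coordinates; front_xyz, back_xyz: (N, 3) 3D points.
    Returns (N*n_samples, 2) image points and (N*n_samples, 3) object points.
    """
    lam = np.linspace(0.0, 1.0, n_samples)                # interpolation weights
    obj = ((1.0 - lam)[None, :, None] * front_xyz[:, None, :]
           + lam[None, :, None] * back_xyz[:, None, :])   # (N, n_samples, 3)
    img = np.repeat(pix[:, None, :], n_samples, axis=1)   # pixel repeated per sample
    return (img.reshape(-1, 2).astype(np.float64),
            obj.reshape(-1, 3).astype(np.float64))

# Synthetic stand-ins for network outputs and intrinsics (illustration only).
N = 100
pix = np.random.rand(N, 2) * 640.0
front_xyz = np.random.rand(N, 3)
back_xyz = front_xyz + np.array([0.0, 0.0, 0.05])  # back surface behind the front
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])

img_pts, obj_pts = interpolate_correspondences(pix, front_xyz, back_xyz)
ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
```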
3. Training Loss Formulation and Dynamic Error Weighting
Training deep models to predict hierarchical continuous codes introduces the challenge of balancing per-level errors. HCCE incorporates histogram-based dynamic loss weighting:

$$w_l = \frac{1}{e_l + \epsilon}.$$

Here, $e_l$ denotes the proportion of incorrect binary codes at hierarchy level $l$, and $\epsilon$ is a smoothness constant. The weights are normalized and integrated into the loss:

$$\mathcal{L} = \sum_{l=1}^{L} \frac{w_l}{\sum_{j=1}^{L} w_j}\, \mathcal{L}_l.$$

This approach stabilizes learning by down-weighting loss from difficult levels, thereby improving convergence and the final accuracy of coordinate prediction.
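A minimal PyTorch sketch of this weighting scheme, assuming the $w_l = 1/(e_l + \epsilon)$ form reconstructed above and an illustrative per-level L1 regression loss:

```python
import torch

def dynamic_level_weights(pred, gt, eps=0.05):
    """Histogram-based dynamic weights over hierarchy levels.

    pred, gt: (B, L) continuous codes per hierarchy level. e_l is the
    fraction of wrongly binarized codes at level l; levels with high error
    are down-weighted via w_l = 1 / (e_l + eps), then normalized to sum to 1.
    """
    wrong = ((pred >= 0.5) != (gt >= 0.5)).float()
    e = wrong.mean(dim=0)        # per-level error rate (the histogram)
    w = 1.0 / (e + eps)          # difficult levels receive small weights
    return w / w.sum()

def hcce_loss(pred, gt):
    per_level = torch.abs(pred - gt).mean(dim=0)   # L1 error at each level
    with torch.no_grad():                          # weights carry no gradient
        w = dynamic_level_weights(pred, gt)
    return (w * per_level).sum()
```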
4. HCCE in Hierarchical Scene Coordinate Regression and Localization
In the visual localization domain, hierarchical encoding (conceptually similar to HCCE) is employed to partition ground-truth 3D coordinates via hierarchical clustering (e.g., k-means) and then regress fine-scale, continuously encoded residuals (Li et al., 2019). Multiple outputs are constructed in a coarse-to-fine hierarchy, with coarse labels conditioning fine stages. Conditioning is achieved using spatially varying FiLM-like layers:

$$\tilde{\mathbf{F}} = \boldsymbol{\gamma} \odot \mathbf{F} + \boldsymbol{\beta},$$

where $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are spatial maps generated from the discrete labels, and $\odot$ denotes element-wise multiplication. This enables global context propagation and local refinement, resulting in improved disambiguation and scalability across large environments.
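A minimal PyTorch sketch of such spatially varying FiLM-like conditioning, assuming one-hot coarse labels and $1 \times 1$ convolutions to generate the $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ maps (the paper's exact projection may differ):

```python
import torch
import torch.nn as nn

class SpatialFiLM(nn.Module):
    """Modulate fine-stage features with spatial maps derived from coarse labels."""

    def __init__(self, num_labels: int, channels: int):
        super().__init__()
        self.to_gamma = nn.Conv2d(num_labels, channels, kernel_size=1)
        self.to_beta = nn.Conv2d(num_labels, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor, labels_onehot: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); labels_onehot: (B, K, H, W) coarse cluster labels
        gamma = self.to_gamma(labels_onehot)  # spatially varying scale map
        beta = self.to_beta(labels_onehot)    # spatially varying shift map
        return gamma * feats + beta           # element-wise FiLM modulation
```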
5. HCCE and Continuous Super-Resolution via Hierarchical Positional Encoding
In image super-resolution, HIIF (Jiang et al., 4 Dec 2024) introduces a hierarchical positional encoding that re-parameterizes local coordinates at several scales. For local pixel offsets $(\delta x, \delta y)$, the encoding at scale $s$ takes the form:

$$\mathbf{e}_s = \left( \frac{\delta x}{r_s},\; \frac{\delta y}{r_s} \right),$$

where $r_s$ is a scaling factor that shrinks with $s$, and $\delta x$, $\delta y$ are normalized with respect to the local latent features. This enables embeddings to be shared at coarse scales while allowing detailed differentiation at fine scales. Combined with implicit image functions and multi-head linear attention, HIIF improves PSNR by up to 0.17 dB over prior methods, confirming the efficacy of hierarchical continuous coordinate encoding for flexible, precise super-resolution.
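A schematic PyTorch sketch of this multi-scale re-parameterization; the dyadic scale factors $r_s = 2^{-s}$ and the channel-wise concatenation are illustrative assumptions, not HIIF's exact formulation:

```python
import torch

def hierarchical_pos_encoding(delta: torch.Tensor, num_scales: int = 4) -> torch.Tensor:
    """Re-parameterize local offsets at several scales.

    delta: (..., 2) offsets (dx, dy) normalized w.r.t. the nearest latent
    feature. Dividing by r_s = 2^-s magnifies the offset as s grows, so the
    coarse scales vary slowly (shared across nearby queries) while the fine
    scales separate them.
    """
    return torch.cat([delta / (2.0 ** -s) for s in range(num_scales)], dim=-1)

# Example: encode offsets for a batch of query points at 4 scales.
delta = torch.rand(1024, 2) - 0.5       # offsets in [-0.5, 0.5)
enc = hierarchical_pos_encoding(delta)  # shape (1024, 8)
```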
6. HCCE in Coordinate-Based Neural Representations for Compression
COOL-CHIC (Ladune et al., 2022) uses coordinate-based neural representations and hierarchical latent spaces for image compression. Here, an image is represented as a mapping

$$(x, y) \mapsto \hat{I}(x, y) = f_\theta\bigl( \hat{z}_1(x, y), \ldots, \hat{z}_L(x, y) \bigr)$$

using a compact MLP $f_\theta$. Hierarchical latent variables $\hat{z}_l$ at multiple resolutions are upsampled and concatenated to form the decoder input. This structure allows the codec to balance global and local detail adaptively, achieving competitive rate-distortion trade-offs with only 629 parameters and 680 multiplications per decoded pixel. In the HCCE context, this supports real-time decoding and flexible bitrate adaptation without architectural changes.
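A minimal PyTorch sketch of this synthesis structure with illustrative layer sizes; the real codec also quantizes and entropy-codes the latents, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalLatentDecoder(nn.Module):
    """COOL-CHIC-style synthesis: hierarchical latent grids are upsampled to
    full resolution, concatenated per pixel, and mapped to RGB by a tiny MLP."""

    def __init__(self, h: int, w: int, num_levels: int = 4, hidden: int = 12):
        super().__init__()
        # One learnable latent grid per resolution level (coarse to fine).
        self.latents = nn.ParameterList(
            nn.Parameter(torch.zeros(1, 1, h // 2 ** l, w // 2 ** l))
            for l in range(num_levels)
        )
        self.mlp = nn.Sequential(        # compact per-pixel synthesis MLP
            nn.Linear(num_levels, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )
        self.size = (h, w)

    def forward(self) -> torch.Tensor:
        ups = [F.interpolate(z, size=self.size, mode="bilinear", align_corners=False)
               for z in self.latents]    # upsample every level to full resolution
        feats = torch.cat(ups, dim=1)    # (1, num_levels, H, W)
        return self.mlp(feats.permute(0, 2, 3, 1))  # (1, H, W, 3) RGB

decoder = HierarchicalLatentDecoder(h=64, w=64)
image = decoder()  # latents and MLP are optimized to reconstruct a single image
```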
7. Theoretical Context and Relation to Hierarchical Multi-Scale Learning
The notion of continuous, hierarchical coordinate encoding is comparable to the multi-scale memory structures of hierarchical multiscale neural networks (Wolf et al., 2018). In those models, representations are allocated at different time scales (short-, medium-, and long-term), facilitated by online meta-learners and anchored by elastic weight consolidation. Both frameworks formalize hierarchical, dynamic representations and recurrent anchoring to stable reference points as fundamental to robust continual learning, context adaptation, and the mitigation of catastrophic forgetting.
8. Practical Implications and Future Directions
HCCE is broadly applicable in any domain requiring high-precision spatial encoding, scalable learning, and smooth transitions between representations. By employing continuous codes in hierarchical architectures, neural networks better exploit spatial locality, avoid discretization artifacts, and integrate global context. Methods such as distribution-aware coordinate decoding (e.g., Taylor-expansion-based refinement, as in DARK (Chatzitofis et al., 2021)) and non-quantized ground-truth heatmaps align well with the HCCE principle of unbiased supervision. A plausible implication is that further extensions of continuous, hierarchical coordinate representation could benefit 3D scene understanding, generative modeling, and systems that require arbitrary scaling or precise localization under resource constraints.