Locality-Aware INR Decoder
- Locality-aware INR decoders are neural models that incorporate partitioned processing and localized attention to reconstruct continuous signals with high detail.
- They employ methods like selective token aggregation, local MLP partitions, and cluster-based coding to efficiently capture both global coherence and fine local structures.
- These architectures outperform global-only INRs in fidelity and efficiency, demonstrating superior PSNR, reduced latency, and adaptability in image, video, and 3D applications.
A locality-aware Implicit Neural Representation (INR) decoder is a neural architecture that uses local contextual information when reconstructing signals from continuous coordinates, through mechanisms such as partitioned processing, coordinate-wise attention, or local code modulation. It addresses the limited expressivity and the inefficiency of traditional global MLP-based INRs by giving the decoder spatially or contextually localized representations, which improves fidelity, granularity, and flexibility for images, videos, point clouds, and general continuous signals. Locality-aware INR decoders have emerged in several variants, including selective token attention, cluster-based context embedding, and regionally partitioned sub-networks, with representative instantiations in image, 3D geometry, and signal processing domains.
1. Architectural Principles of Locality-Aware INR Decoders
Locality-aware INR decoders diverge from monolithic global neural representations by explicitly conditioning either intermediate features or output predictions on local context. The primary design strategies reflected across current literature include:
- Local Feature Modulation via Latent Codes: A transformer or encoder extracts a set of latent tokens, each encoding localized information about a region or entity (e.g., pixels, rays, or 3D clusters). For each query coordinate, a decoder leverages selective aggregation (typically via cross-attention) over these tokens to generate a modulation vector, guiding the INR network to focus on appropriate local detail (Lee et al., 2023).
- Partitioned Local MLPs Fused with Global Context: The input domain is divided into axis-aligned partitions, with each handled by a distinct local MLP ("Local-Global SIREN"). These local MLPs are lightweight and process only their assigned partition, but each layer also accesses a global context vector generated by a shared global MLP, merged via a dedicated operator at each depth (Ashkenazi et al., 2024).
- Local Cluster-wise Code Modulation: For irregular domains such as point clouds, local clusters or patches are defined (e.g., via KNN over seed points), and a patch encoder produces local latent codes. Upon query, the local code corresponding to the parent cluster modulates a compact shared MLP that produces the residual or signal value, thereby enforcing locality (Xu et al., 2024).
These principles allow the INR decoder to specialize signal representation at varying granularities, balancing global coherence with fine-grained expressivity.
2. Mathematical Formulations and Computational Procedures
The three designs below differ mainly in how they aggregate local context for a query coordinate and how they decode the signal from it.
- Selective Token Aggregation via Cross-Attention (Lee et al., 2023):
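A Transformer encoder extracts a set of latent tokens from the input. The query coordinate is Fourier-encoded, and a projection of that encoding serves as the query of a multi-head cross-attention over the latent tokens, producing a modulation vector for the coordinate. Band-specific features are then extracted from the coordinate at several frequency bandwidths and combined with the modulation vector, and decoding proceeds coarse to fine, starting from the lowest band and progressively adding higher ones.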
- Partitioned Forward-Pass in Local-Global SIREN (Ashkenazi et al., 2024):
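For an input coordinate that maps to a given partition, each layer runs three steps. The global pathway applies the shared global MLP layer to the global features. The local pathway applies the corresponding layer of that partition's own MLP to the local features. A merge operator then fuses the global context into the local features before the next layer. The final output is produced after the last such layer.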
- Cluster-based Coding for Point Cloud INR (Xu et al., 2024):
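A local cluster is encoded into a latent code by the patch encoder. For a query point in that cluster, the shared MLP, modulated by the cluster's code, predicts the residual or signal value at the query. The network is trained to minimize a rate-distortion objective.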
These methods implement locality both in the structure of latent space modulation and in the architectural routing of information for each coordinate query.
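The selective token aggregation step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of Lee et al.: it uses a single attention head, random stand-in tokens, an untrained projection, and the tokens themselves as both keys and values. The names `fourier_encode` and `cross_attend` are hypothetical.

```python
import math
import random

def fourier_encode(coord, num_freqs=4):
    """Map each coordinate x to sin/cos features at frequencies 2^k * pi."""
    feats = []
    for x in coord:
        for k in range(num_freqs):
            feats.append(math.sin((2 ** k) * math.pi * x))
            feats.append(math.cos((2 ** k) * math.pi * x))
    return feats

def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, tokens):
    """Single-head attention: score every token against the query, then mix."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    mixed = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
    return mixed, weights

rng = random.Random(0)
token_dim = 8
# Stand-in latent tokens (assumption: a real encoder would produce these).
tokens = [[rng.gauss(0, 1) for _ in range(token_dim)] for _ in range(5)]
encoded = fourier_encode((0.25, 0.75))  # 2 coords * 4 freqs * 2 = 16 features
# Untrained projection from the Fourier features to the token dimension.
proj = [[rng.gauss(0, 0.3) for _ in encoded] for _ in range(token_dim)]
query = matvec(proj, encoded)
modulation, weights = cross_attend(query, tokens)
```

The attention weights sum to one, so the modulation vector is a convex combination of the tokens: a coordinate draws mostly on the few tokens its query matches, which is the sense in which the aggregation is selective.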
3. Spatial and Spectral Locality: Modulation and Attention
Spatial and spectral locality is realized via selective modulation and multi-band filtering. In image and video INRs, cross-attention over spatial latent tokens encodes local pixel or region information; downstream multi-band modulation parses the encoded signal into frequency bands, decoding coarse features before incrementally injecting finer detail. This coarse-to-fine strategy enhances the decoder's ability to preserve high-frequency information in the presence of globally ambiguous or aliased signals (Lee et al., 2023).
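The coarse-to-fine idea can be seen on a toy 1D signal. The sketch below assumes the per-band components are already known (in a real decoder a network predicts them) and only shows how the reconstruction error falls as finer bands are injected. The band list and the function names are hypothetical.

```python
import math

N = 64
xs = [i / N for i in range(N)]
# (frequency, amplitude) pairs ordered coarse to fine; hypothetical values.
bands = [(1, 1.0), (4, 0.5), (16, 0.25)]

def band_signal(freq, amp, x):
    return amp * math.sin(2 * math.pi * freq * x)

target = [sum(band_signal(f, a, x) for f, a in bands) for x in xs]

def progressive_decode(bands, xs):
    """Return the running reconstruction after each band is added."""
    recon = [0.0] * len(xs)
    stages = []
    for freq, amp in bands:
        recon = [r + band_signal(freq, amp, x) for r, x in zip(recon, xs)]
        stages.append(list(recon))
    return stages

stages = progressive_decode(bands, xs)
errors = [sum((r - t) ** 2 for r, t in zip(stage, target)) / N for stage in stages]
# errors is approximately [0.156, 0.031, 0.0]: each finer band removes its
# share of the residual energy.
```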
In 3D point clouds, spatial locality is defined via KNN clustering, allowing each latent code and its neighborhood context to tailor the refinement process, yielding higher rate-distortion efficiency and effective surface refinement even under variable resampling rates (Xu et al., 2024). Partitioned SIREN architectures instantiate locality spatially via axis-aligned subdomains; each partition is handled by an independent local sub-MLP whose intermediate computations are fused with global information, permitting cropping or extension on demand (Ashkenazi et al., 2024).
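A minimal sketch of KNN-based cluster construction follows, assuming brute-force search and hand-picked seed points. How seeds are chosen and how the clusters are encoded is not specified here, and `knn_clusters` is a hypothetical helper.

```python
import math

def knn_clusters(points, seeds, k):
    """For each seed, return the indices of its k nearest points (brute force)."""
    clusters = []
    for seed in seeds:
        ranked = sorted(range(len(points)), key=lambda i: math.dist(points[i], seed))
        clusters.append(ranked[:k])
    return clusters

points = [
    (0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
    (1.0, 1.0, 1.0), (0.9, 1.0, 1.0), (1.0, 0.9, 1.0),
]
seeds = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
clusters = knn_clusters(points, seeds, k=3)
# The first cluster holds points 0-2, the second holds points 3-5.
```

Brute-force search costs one distance evaluation per point per seed. A spatial index such as a k-d tree would be the usual replacement for large clouds.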
A plausible implication is that spectral and spatial locality can be synergistically harnessed to produce both sharp reconstruction and spatial editability, as empirically seen in segmentation, cropping, and arbitrary-scale upsampling tasks.
4. Comparative Performance and Scalability
Locality-aware INR decoders consistently outperform global-only INR baselines in tasks demanding detail and efficiency. On image datasets (e.g., ImageNette, FFHQ), selective-token cross-attention with multi-band feature modulation yields substantial PSNR improvements: e.g., 3.3 dB gain over global-modulation on 256² FFHQ images (Lee et al., 2023). Training speed is also improved by an order of magnitude, with convergence to high-fidelity reconstructions in fewer epochs.
In point cloud geometry compression, the dual-layer approach (a non-learned base layer plus a locality-aware refinement INR) cuts model size by more than two orders of magnitude (0.056M vs. 21.4M parameters) and decoding latency by nearly two orders of magnitude (0.02 s vs. 1.45 s) compared to other SOTA methods, while uniquely offering arbitrary-scale upsampling (Xu et al., 2024).
In partitioned SIREN decoders, the Local-Global strategy attains higher PSNR/SSIM and accelerates training (e.g., for 16×16 splits on DIV2K images, 34.13 dB versus 33.57 dB for full SIREN, with significantly reduced iterations) (Ashkenazi et al., 2024). Batchwise and automatic partitioning strategies empirically deliver further gains.
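To put the reported PSNR gaps in perspective, a gain in dB can be converted into a ratio of mean-squared errors using the standard definition PSNR = 10 log10(peak² / MSE). The helper names below are hypothetical; the dB figures are the ones quoted above.

```python
import math

def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a signal with the given peak value."""
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio(psnr_gain_db):
    """MSE of the better model divided by MSE of the baseline."""
    return 10.0 ** (-psnr_gain_db / 10.0)

siren_gap = mse_ratio(34.13 - 33.57)  # about 0.879, i.e. roughly 12% lower MSE
ffhq_gap = mse_ratio(3.3)             # about 0.468, i.e. roughly 53% lower MSE
```

A gap of about half a dB therefore corresponds to a modest error reduction, while the 3.3 dB gap more than halves the squared error.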
5. Flexibility, Editing, and Practical Applications
The structure of locality-aware INR decoders enables functionality that global models cannot offer. In Local-Global SIREN architectures, cropping is a direct weight-removal operation: removing the local MLPs for selected partitions eliminates the associated region from the signal, precisely and with a proportional parameter reduction, all without retraining (Ashkenazi et al., 2024). This facilitates real-time modification and extension of INR-encoded signals, including expansion to new regions by initializing and fine-tuning new partition modules.
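Cropping by weight removal can be illustrated with a toy model. The sketch assumes a hypothetical 4x4 partitioning of the unit square, a one-layer stand-in for each local MLP, and a simple additive fusion of the global context; none of these are the layers or the merge operator of Ashkenazi et al.

```python
import math
import random

GRID = 4  # hypothetical 4x4 axis-aligned partitioning of [0, 1)^2

def partition_of(x, y):
    return (min(int(x * GRID), GRID - 1), min(int(y * GRID), GRID - 1))

def make_model(seed=0):
    rng = random.Random(seed)
    shared = [rng.uniform(-1, 1) for _ in range(3)]  # global weights (wx, wy, b)
    local = {(i, j): [rng.uniform(-1, 1) for _ in range(3)]
             for i in range(GRID) for j in range(GRID)}
    return {"global": shared, "local": local}

def decode(model, x, y):
    key = partition_of(x, y)
    if key not in model["local"]:
        return None  # this region was cropped away
    gx, gy, gb = model["global"]
    lx, ly, lb = model["local"][key]
    context = math.sin(gx * x + gy * y + gb)         # global pathway
    return math.sin(lx * x + ly * y + lb + context)  # local pathway plus context

def crop(model, keep):
    """Drop the local weights of every partition outside `keep`; no retraining."""
    return {"global": model["global"],
            "local": {k: w for k, w in model["local"].items() if k in keep}}

def num_params(model):
    return len(model["global"]) + sum(len(w) for w in model["local"].values())

full = make_model()
left_half = {(i, j) for i in range(GRID // 2) for j in range(GRID)}
cropped = crop(full, left_half)
```

In this toy, `decode(cropped, 0.1, 0.1)` equals `decode(full, 0.1, 0.1)` because the kept weights are untouched, `decode(cropped, 0.9, 0.9)` returns `None`, and the parameter count drops from 51 to 27.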
Arbitrary-scale upsampling and adaptive density decoding are afforded in context-aware point cloud INRs. By using local latent codes tied to clusters and enabling generalization via shared MLPs, the decoder supports smooth surface interpolation, geometric editing, and efficient compression (Xu et al., 2024).
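Because the decoder is a continuous function conditioned on a local code, output density is a decode-time choice. The sketch below is a toy stand-in under stated assumptions: `shared_decoder` is a fixed analytic function rather than a trained MLP, the two-element code (scale, shift) and the unit-square patch parameterization are hypothetical, and nothing here reproduces the model of Xu et al.

```python
import math

def shared_decoder(code, u, v):
    """Toy shared decoder: the local code scales and shifts a fixed basis."""
    scale, shift = code
    return scale * math.sin(math.pi * u) * math.sin(math.pi * v) + shift

def decode_cluster(center, code, samples_per_side):
    """Sample the patch around `center` on a regular grid of the chosen density."""
    cx, cy, cz = center
    pts = []
    for i in range(samples_per_side):
        for j in range(samples_per_side):
            u = (i + 0.5) / samples_per_side
            v = (j + 0.5) / samples_per_side
            pts.append((cx + u - 0.5, cy + v - 0.5, cz + shared_decoder(code, u, v)))
    return pts

center, code = (0.0, 0.0, 0.0), (0.2, 0.05)
coarse = decode_cluster(center, code, samples_per_side=2)  # 4 points
dense = decode_cluster(center, code, samples_per_side=8)   # 64 points
```

The same stored code yields either output; only the number of queries changes, which is what makes the upsampling factor arbitrary.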
Applications span signal editing, image and video compression, super-resolution, denoising, computed tomography reconstruction, and generative modeling, with downstream tasks such as image generation demonstrating improvements from locality-aware representations (Lee et al., 2023, Ashkenazi et al., 2024).
6. Training Regimes and Implementation Practices
Typical training regimes for locality-aware INR decoders use Adam or AdamW optimizers, moderate batch sizes (16–32 for image tasks), and partitioned or patchified input sampling to enhance data diversity and regularize learning. Loss functions are task-dependent, with mean-squared error for continuous signal regression in images and videos, or L1 rate–distortion objectives for geometry compression in point clouds. Multi-band modulation often requires a small set (L = 2–4) of frequency bands, Fourier encoding of coordinates, and hidden sizes of 256–768 for latent and modulation vectors.
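A stripped-down version of this regime, Adam on a mean-squared-error loss over Fourier features, looks as follows. It is a sketch under simplifying assumptions: the model is linear in the features rather than an MLP, the batch holds 16 coordinate samples of a single 1D signal, and the learning rate and step count are arbitrary illustrative values.

```python
import math
import random

def features(x, num_freqs=2):
    """Constant term plus sin/cos at integer multiples of one full period."""
    out = [1.0]
    for k in range(1, num_freqs + 1):
        out.append(math.sin(2 * math.pi * k * x))
        out.append(math.cos(2 * math.pi * k * x))
    return out

def predict(w, x):
    return sum(wi * fi for wi, fi in zip(w, features(x)))

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def adam_fit(data, steps=500, batch_size=16, lr=0.05,
             beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    rng = random.Random(seed)
    n = len(features(0.0))
    w, m, v = [0.0] * n, [0.0] * n, [0.0] * n
    for t in range(1, steps + 1):
        batch = rng.sample(data, batch_size)
        grad = [0.0] * n
        for x, y in batch:
            err = predict(w, x) - y
            for i, f in enumerate(features(x)):
                grad[i] += 2.0 * err * f / batch_size  # gradient of the batch MSE
        for i in range(n):
            m[i] = beta1 * m[i] + (1 - beta1) * grad[i]
            v[i] = beta2 * v[i] + (1 - beta2) * grad[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
            w[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

data = [(i / 64, math.sin(2 * math.pi * i / 64) + 0.3 * math.cos(4 * math.pi * i / 64))
        for i in range(64)]
initial_loss = mse([0.0] * len(features(0.0)), data)
weights = adam_fit(data)
final_loss = mse(weights, data)
```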
Partition granularity, local/global hidden dimensions, and merge operator widths are tuned according to task and parameter budget. Implementation frameworks exploit tensorized locally connected layers for efficiency, as well as automatic domain partitioning routines for scalable deployment (Lee et al., 2023, Ashkenazi et al., 2024). For 3D data, KNN clustering and graph-based aggregation are central to efficient local context definition (Xu et al., 2024).
All major published architectures provide code releases with fully specified hyperparameter configurations, ensuring reproducibility and adoption within the INR research community.