Neural Geometry Image-Based Representation
- Neural geometry image-based representation is a design pattern that re-encodes 3D geometry and materials into regular image grids to leverage efficient 2D convolutional processing.
- It spans various paradigms such as neural G-buffers, UV geometry images, graph signals, and image-conditioned implicit fields, each tailored to specific tasks like inverse rendering or mesh compression.
- The choice of method involves trade-offs in continuity, editability, and computational cost, directly impacting applications from photorealistic relighting to 3D object detection.
Neural geometry image-based representation denotes a family of representations in which geometry is encoded, queried, or processed through image-like neural structures rather than only through explicit meshes, voxels, or raw Euclidean grids. In current usage, the term spans at least three closely related formulations: neural fields that output per-pixel geometry and material buffers for subsequent shading; geometry images that map irregular surfaces to regular 2D grids so that standard image networks can reconstruct, compress, or generate meshes; and geometry-tied neural representations on non-Euclidean domains, where latent signals remain attached to meshes, graphs, or canonical 3D coordinates instead of being flattened into ordinary image lattices (Wang et al., 2023, Gao et al., 24 Nov 2025, Jiang et al., 2020, Wu et al., 15 May 2026). From a representation perspective, there is no unique representation that works well for all applications, and the choice of image-based geometry encoding is tightly coupled to the target task, the available supervision, and the trade-off among continuity, editability, and computational cost (Xiao et al., 2020).
1. Conceptual foundations and historical scope
Image-based 3D representation has long been attractive because 2D images live on regular grids and can therefore exploit the full ecosystem of 2D convolutional architectures. The representation survey on deep geometry learning treats depth and multi-view images as a primary family of 3D representations, precisely because regular image grids are efficient to process, even though some geometry features are inevitably lost relative to fully 3D representations (Xiao et al., 2020). In that sense, neural geometry image-based representation inherits a basic tension already visible in early image-based geometry learning: geometric structure is made compatible with image-space computation by projecting, parameterizing, or otherwise re-indexing geometry into regular arrays.
A non-neural precursor is the graph-based representation for multiview image coding, which stores one reference image together with graph links connecting pixels across views according to scene geometry. Its links describe the proximity between pixels in 3D space and adapt the transmitted geometry information to prediction complexity; the scheme achieved a gain of $2$ dB in reconstructed quality over depth-based schemes operating at similar rates (Maugey et al., 2013). Although not learned, this formulation already exhibits a defining principle of the later neural literature: geometry need not be transmitted as raw depth if an image-indexed structure can encode exactly the correspondences required by the downstream task.
Modern work generalizes this principle in two directions. One direction treats images as supervision for continuous 3D fields, as in SDF- and NeRF-derived models; the other treats geometry itself as an image-like signal, such as UV geometry images, G-buffers, or semantic/stitch maps over a packed atlas (Schirmer et al., 10 Nov 2025, Elizarov et al., 2024, Pham et al., 19 Mar 2026).
2. Main representational paradigms
The literature does not converge on a single canonical data structure. Instead, several recurrent paradigms instantiate the same design idea: bind geometry to image-like organization, or bind image processing to geometry-aware coordinates.
| Paradigm | Core representation | Representative papers |
|---|---|---|
| Neural G-buffer / deferred shading | Per-pixel depth, normals, albedo, and materials predicted by a neural field | (Wang et al., 2023) |
| UV geometry images | Regular 2D grids storing XYZ, normals, masks, semantics, or seams | (Wang et al., 2020, Elizarov et al., 2024, Gao et al., 24 Nov 2025, Pham et al., 19 Mar 2026) |
| Graph/manifold signals | Latent signals attached to meshes or graph hierarchies, with geometry-dependent operators | (Jiang et al., 2020) |
| Image-conditioned implicit fields | Canonical SDF or radiance fields queried at arbitrary 3D points from image features | (Schirmer et al., 10 Nov 2025, Wu et al., 15 May 2026) |
| Geometry-aware voxel volumes | 3D voxel grids lifted from images and gated by learned surface probabilities | (Tu et al., 2023) |
| Explicit Gaussian scaffolds for images | Trainable Gaussian primitives used as continuous, editable image representations | (Zhang et al., 2024, Waczyńska et al., 2024, Jakubowska et al., 25 Nov 2025) |
A recurrent misconception is to equate the topic with only classical UV geometry images. The record is broader: some systems convert irregular surfaces into 2D atlases; some predict image-space geometry buffers from neural fields; some keep signals on graphs or voxel grids while still deriving geometry from images; and some use explicit Gaussian primitives to endow image representations with interpretable spatial structure (Wang et al., 2020, Wang et al., 2023, Jiang et al., 2020, Tu et al., 2023, Waczyńska et al., 2024). A plausible implication is that the term is best understood as a design pattern rather than a single architecture.
3. Hybrid neural fields and the neural G-buffer formulation
A particularly explicit use of the term appears in FEGR, which defines a neural geometry image-based representation for urban inverse rendering by coupling a neural intrinsic field, an explicit mesh, and a physically based renderer (Wang et al., 2023). The scene representation is a neural intrinsic field that maps each 3D location $\mathbf{x}$ to a tuple of intrinsic properties,
$$\mathbf{x} \mapsto \left(s, \mathbf{n}, \mathbf{k}_d, \mathbf{k}_m\right),$$
where $s$ is the signed distance, $\mathbf{n}$ the surface normal, $\mathbf{k}_d$ the diffuse or base color, and $\mathbf{k}_m$ the roughness and metallic parameters of a Disney PBR BRDF. Outdoor lighting is represented by an HDR environment map at infinity, and the zero level set of the SDF is periodically converted to a triangle mesh with Marching Cubes.
The decisive split is between primary and secondary rays. Primary camera rays are handled through volumetric rendering of the neural intrinsic field, which produces a neural G-buffer containing a normal map, base color map, material map, and depth map. These per-pixel attributes are explicitly identified as “neural geometry images.” Secondary rays are then traced on the extracted mesh with OptiX to evaluate visibility, cast shadows, and specular transport using 512 secondary directions and multiple importance sampling. The resulting system keeps continuity and differentiability where image formation begins, but hands off high-order transport to an explicit geometric structure where ray tracing is efficient (Wang et al., 2023).
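The primary-ray step can be illustrated with a minimal sketch, assuming standard emission-absorption volume rendering weights. The function names, array shapes, and toy inputs are hypothetical and are not taken from FEGR's implementation; the sketch only shows how per-sample attributes along a ray could be blended into per-pixel geometry buffers.

```python
import numpy as np

def render_weights(sigma, deltas):
    """Volume-rendering weights w_i = T_i * (1 - exp(-sigma_i * delta_i))."""
    alpha = 1.0 - np.exp(-sigma * deltas)              # per-sample opacity
    # Transmittance T_i = prod_{j<i} (1 - alpha_j), with T_0 = 1.
    ones = np.ones_like(alpha[..., :1])
    trans = np.cumprod(np.concatenate([ones, 1.0 - alpha[..., :-1]], axis=-1), axis=-1)
    return trans * alpha

def composite_gbuffer(sigma, deltas, t_vals, normals, base_color, material):
    """Blend per-sample attributes along each ray into per-pixel buffers."""
    w = render_weights(sigma, deltas)                  # (rays, samples)
    n = (w[..., None] * normals).sum(axis=-2)
    n = n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), 1e-8)
    return {
        "depth": (w * t_vals).sum(axis=-1),
        "normal": n,
        "base_color": (w[..., None] * base_color).sum(axis=-2),
        "material": (w[..., None] * material).sum(axis=-2),
    }

# Toy ray (assumed values): density is concentrated at the third sample.
sigma = np.array([[0.0, 0.0, 1e3, 0.0]])
deltas = np.full((1, 4), 0.5)
t_vals = np.array([[0.5, 1.0, 1.5, 2.0]])
normals = np.tile(np.array([0.0, 0.0, 1.0]), (1, 4, 1))
base_color = np.tile(np.array([0.2, 0.4, 0.6]), (1, 4, 1))
material = np.tile(np.array([0.5, 0.0]), (1, 4, 1))     # (roughness, metallic)
gbuffer = composite_gbuffer(sigma, deltas, t_vals, normals, base_color, material)
```

In this toy case almost all weight falls on the third sample, so the composited depth is that sample's distance along the ray. A deferred shading pass would then read these buffers per pixel instead of querying the field again.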
This architecture is not only representational but epistemic: it is designed to disentangle geometry, materials, and lighting. FEGR therefore augments its rendering loss with optional LiDAR depth supervision, an auxiliary radiance field for early geometry stabilization, a normal consistency term linking predicted normals to SDF gradients, and a semantic shading prior that constrains each semantic region to a limited albedo capacity so that shadows are forced into lighting rather than texture. The method is trained in two stages—geometry initialization followed by full inverse rendering—with the mesh rebuilt every 20 iterations as the SDF changes (Wang et al., 2023).
Within this formulation, the “image-based” aspect is not merely supervision from RGB images. It is the fact that the operative representation exposed to shading is an image-structured geometry buffer predicted by a neural field. FEGR’s relighting, virtual object insertion, and novel-view rendering all proceed from this intermediate geometry-image layer rather than from direct RGB regression (Wang et al., 2023).
4. Geometry images and UV atlases as neural surface coordinates
A second major lineage treats geometry itself as an image. In the geometry-image-based generator for point cloud generation, the generator maps a latent code to a geometry image whose three channels store the $(x, y, z)$ coordinates of surface points, and the image is reshaped directly into a point cloud. The core argument is that such a geometry image is a completely regular 2D array that contains the surface points of the 3D object, simultaneously leveraging 2D regularity and the geodesic neighborhood of the 3D surface (Wang et al., 2020).
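The reshaping step is simple enough to sketch directly. The snippet below assumes a geometry image stored as an `(H, W, 3)` NumPy array; the helper names are hypothetical. The grid triangulation is an added illustration of how pixel adjacency can supply mesh connectivity, not a step described for the generator above.

```python
import numpy as np

def geometry_image_to_points(gim):
    """Flatten an (H, W, 3) geometry image into an (H*W, 3) point cloud."""
    return gim.reshape(-1, 3)

def grid_triangles(h, w):
    """Two triangles per grid cell, indexing the flattened pixel array."""
    idx = np.arange(h * w).reshape(h, w)
    a, b = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()   # top-left, top-right
    c, d = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()     # bottom-left, bottom-right
    return np.concatenate([np.stack([a, b, c], axis=1),
                           np.stack([b, d, c], axis=1)])

# Toy geometry image (assumed): a flat 4 x 5 patch in the z = 0 plane.
v, u = np.mgrid[0:4, 0:5]
gim = np.stack([u, v, np.zeros_like(u)], axis=-1).astype(float)
points = geometry_image_to_points(gim)
faces = grid_triangles(4, 5)
```

Because neighbors in the image are neighbors on the surface within a chart, a 2D convolution over `gim` aggregates points that are close along the surface rather than merely close in Euclidean space.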
Subsequent work broadens both the scale and the role of the geometry image. Geometry Image Diffusion uses multi-chart geometry images built from UV atlases and produces paired geometry and texture images as its final output. It trains a dedicated VAE on four-channel geometry images and couples a trainable geometry diffusion model to a frozen Stable Diffusion v2.1 albedo branch through Collaborative Control. The result is a text-to-3D pipeline that operates entirely in image space while generating meshes with semantically meaningful separate parts and internal structures at speeds comparable to current text-to-image models (Elizarov et al., 2024).
For mesh compression and restoration, another line of work transforms irregular meshes into regular geometry images and then applies standard image super-resolution networks. “Neural Geometry Image-Based Representations with Optimal Transport (OT)” uses Ricci-flow conformal initialization followed by Optimal Transport to obtain strictly area-preserving parameterizations, stores low-resolution position and normal geometry-image mipmaps, and reconstructs full-resolution geometry images in a single forward pass. The paper reports a separate compression ratio for each of three stored resolutions and emphasizes decoder-free, continuous level of detail through standard GPU mipmapping (Gao et al., 24 Nov 2025).
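The storage side of this idea can be sketched as follows, assuming a plain 2x2 box filter for mip generation and counting only raw grid samples. The helper names and the 512 and 128 resolutions are illustrative assumptions; they are not the filter, resolutions, or compression ratios reported in the paper.

```python
import numpy as np

def downsample2x(gim):
    """Average 2x2 blocks of an (H, W, C) image with even H and W."""
    h, w, c = gim.shape
    return gim.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def mip_chain(gim, levels):
    """Return [level 0, level 1, ...], halving the resolution at each level."""
    chain = [gim]
    for _ in range(levels):
        chain.append(downsample2x(chain[-1]))
    return chain

def renormalize(normal_image):
    """Averaged unit normals are no longer unit length, so rescale them."""
    length = np.linalg.norm(normal_image, axis=-1, keepdims=True)
    return normal_image / np.maximum(length, 1e-8)

def raw_sample_ratio(full_res, stored_res):
    """Ratio of sample counts between two square grids (ignores any further coding)."""
    return (full_res / stored_res) ** 2

rng = np.random.default_rng(0)
positions = rng.normal(size=(512, 512, 3))          # stand-in position geometry image
chain = mip_chain(positions, levels=2)               # 512 -> 256 -> 128
ratio = raw_sample_ratio(512, 128)
```

Storing only the coarsest level and restoring the finest one with a super-resolution network trades storage for one forward pass; intermediate levels of detail come from ordinary texture filtering between mip levels.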
SwiftTailor specializes the same principle to garments. Its Garment Geometry Image consists of three aligned images over a shared UV layout: a geometry image storing 3D coordinates, a semantic image storing panel types, and a stitching image storing seam IDs. A Dense Prediction Transformer with a ViT-L encoder predicts the geometry image from the semantic image, while post-processing reconstructs the mesh through UV-grid remeshing and dynamic stitching. The training objective combines an edge-aware regression loss, a stitch Chamfer loss, and a normal regularizer, and the resulting two-stage pipeline reduces 3D garment inference time by replacing a long simulation stage with dense image prediction plus algorithmic inverse mapping (Pham et al., 19 Mar 2026).
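For readers unfamiliar with the Chamfer term, the following is a minimal sketch of a generic symmetric Chamfer distance between two point sets. How SwiftTailor selects the stitch points and weights the term is not shown here, and the function name and toy inputs are hypothetical.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Mean nearest-neighbor distance from a to b, plus the same from b to a.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Toy seam (assumed): two edges that should coincide after stitching.
edge_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
edge_b = edge_a + np.array([0.0, 0.1, 0.0])
gap = chamfer_distance(edge_a, edge_b)
```

A term of this kind is small when points on the two sides of a seam land on each other in 3D, which is the behavior a stitching loss is meant to encourage.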
A broader generalization is Geometry Distributions, which explicitly positions itself relative to Geometry Images but replaces a regular UV grid with a diffusion-defined distribution of surface points. Its canonical domain is Gaussian noise space rather than a square atlas, and the learned ODE transports samples from a Gaussian source to a geometry distribution on the surface, producing a continuous, resolution-free, image-like representation of geometry (Zhang et al., 2024).
5. Beyond UV grids: graph, voxel, and canonical implicit variants
Neural geometry image-based representation is not restricted to UV atlases. In geometry-dependent inverse image reconstruction for ECGI, both unknown electrical potentials and measurements live on non-Euclidean domains: heart and torso meshes converted to graphs. Signals are processed by spatio-temporal graph CNNs using SplineCNN kernels, and the latent mapping between torso and heart is implemented as a spline convolution on a dense bipartite graph whose edge attributes encode relative geometry between the two surfaces. The representation remains attached to mesh hierarchies throughout encoding, latent physics, and decoding, and the authors explicitly frame this as a neural representation of an image sequence whose domain is a geometry rather than a regular grid (Jiang et al., 2020).
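Image-based supervision can also drive continuous implicit fields directly. The survey on geometric implicit neural representations for signed distance functions characterizes geometric INRs as MLPs $f_\theta : \mathbb{R}^3 \to \mathbb{R}$ trained to approximate SDFs with geometric losses such as Eikonal regularization, normal alignment, and curvature-related constraints. In the image-based setting, these methods use posed images, sample points along rays, convert SDF values to densities, and minimize a photometric data term of the form
$$\mathcal{L}_{\text{data}} = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \left\| \hat{C}(r) - C(r) \right\|,$$
where $\mathcal{R}$ is a batch of camera rays, $\hat{C}(r)$ is the volume-rendered color of ray $r$, and $C(r)$ is the observed pixel color.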
This family includes NeuS, VolSDF, NeuS2, and Neuralangelo, and the survey explicitly treats them as central instances of neural geometry image-based representation (Schirmer et al., 10 Nov 2025).
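The ingredients named above can be sketched in a few lines. The density mapping below follows the Laplace-CDF form used by VolSDF, which is only one of the choices in this family; the finite-difference Eikonal term, the L1 photometric term, the function names, and the analytic sphere standing in for a trained MLP are all assumptions made for illustration.

```python
import numpy as np

def laplace_cdf(x, beta):
    """CDF of a zero-mean Laplace distribution with scale beta."""
    return 0.5 + 0.5 * np.sign(x) * (1.0 - np.exp(-np.abs(x) / beta))

def sdf_to_density(sdf, beta=0.05):
    """VolSDF-style density (1/beta) * Psi_beta(-sdf): high inside, near zero outside."""
    return laplace_cdf(-sdf, beta) / beta

def eikonal_loss(sdf_fn, pts, eps=1e-4):
    """Mean of (|grad f| - 1)^2, with the gradient from central finite differences."""
    grads = np.stack(
        [(sdf_fn(pts + eps * e) - sdf_fn(pts - eps * e)) / (2.0 * eps) for e in np.eye(3)],
        axis=-1,
    )
    return np.mean((np.linalg.norm(grads, axis=-1) - 1.0) ** 2)

def photometric_loss(pred_rgb, gt_rgb):
    """Mean absolute color error over a batch of rays."""
    return np.mean(np.abs(pred_rgb - gt_rgb))

# Stand-in for a trained network: the exact SDF of a unit sphere.
def sphere_sdf(p):
    return np.linalg.norm(p, axis=-1) - 1.0

rng = np.random.default_rng(0)
pts = rng.uniform(0.5, 2.0, size=(64, 3))
eik = eikonal_loss(sphere_sdf, pts)
density = sdf_to_density(np.array([-1.0, 0.0, 1.0]))   # inside, on surface, outside
```

An exact SDF has unit gradient norm, so the Eikonal term is close to zero for the sphere; for a learned field the same term acts as a regularizer that pushes the network toward a valid distance function.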
IVGT extends that paradigm to pose-free multi-view images. It learns a continuous neural scene representation in the coordinate frame of the first input image, retrieves local features for arbitrary 3D query points by projecting them into multiple views, and predicts both SDF values and colors with lightweight decoders. Training combines RGB, depth, and normal supervision with Eikonal and smoothness regularization, and the model supports rendering RGB images, depth maps, and surface normal maps from arbitrary viewpoints, as well as mesh extraction via Marching Cubes (Wu et al., 15 May 2026).
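The feature-retrieval step can be illustrated with a small sketch of projecting query points into several views and sampling their feature maps. It assumes pinhole cameras with intrinsics `K` and extrinsics `R`, `t` available for each view; in a pose-free model these would have to be estimated internally, which is not shown. The helper names and the mean pooling across views are hypothetical choices, not IVGT's architecture.

```python
import numpy as np

def project(points, K, R, t):
    """Pinhole projection of (N, 3) world points to (N, 2) pixels and (N,) depths."""
    cam = points @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

def bilinear_sample(feat, uv):
    """Sample an (H, W, C) feature map at continuous (u, v) = (column, row) pixels."""
    h, w, _ = feat.shape
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    u0 = np.minimum(np.floor(u).astype(int), w - 2)
    v0 = np.minimum(np.floor(v).astype(int), h - 2)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    top = (1 - du) * feat[v0, u0] + du * feat[v0, u0 + 1]
    bottom = (1 - du) * feat[v0 + 1, u0] + du * feat[v0 + 1, u0 + 1]
    return (1 - dv) * top + dv * bottom

def gather_features(points, feats, cams):
    """Average the per-view samples for each query point (mean pooling is an assumption)."""
    samples = [bilinear_sample(f, project(points, *c)[0]) for f, c in zip(feats, cams)]
    return np.mean(samples, axis=0)

# Toy setup (assumed): two identical views whose features encode (column, row).
rows, cols = np.mgrid[0:8, 0:10]
feat = np.stack([cols, rows], axis=-1).astype(float)
K = np.array([[10.0, 0.0, 4.0], [0.0, 10.0, 3.0], [0.0, 0.0, 1.0]])
cam = (K, np.eye(3), np.zeros(3))
query = np.array([[0.1, 0.05, 1.0]])
pooled = gather_features(query, [feat, feat], [cam, cam])
```

A lightweight decoder would take the pooled feature, together with the query coordinates, and output an SDF value and a color for that point.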
At a more task-specific end of the spectrum, ImGeoNet builds a regular 3D voxel volume from multi-view image features, computes per-voxel mean and variance across views, and uses a 3D geometry-shaping network to predict a surface-probability field $S$. Writing $V$ for the aggregated feature volume, the geometry-aware volume is then
$$V_{\text{geo}} = S \odot V,$$
so that free-space voxels are suppressed before 3D detection. This is not a mesh-reconstruction method, but it is still an image-induced geometry-aware representation in which geometry is learned explicitly inside an image-derived volumetric grid (Tu et al., 2023).
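A minimal sketch of the two operations described above, per-voxel statistics across views and gating by a surface probability, is given below. The array layouts, function names, and the hand-set probability field are assumptions; in ImGeoNet the probabilities come from a learned network that is not reproduced here.

```python
import numpy as np

def view_statistics(view_feats, valid):
    """Per-voxel mean and variance over views.

    view_feats: (V, X, Y, Z, C) features back-projected from V views.
    valid:      (V, X, Y, Z) mask of voxels that fall inside each view.
    """
    m = valid[..., None].astype(view_feats.dtype)
    count = np.maximum(m.sum(axis=0), 1.0)               # avoid division by zero
    mean = (view_feats * m).sum(axis=0) / count
    var = (((view_feats - mean) ** 2) * m).sum(axis=0) / count
    return mean, var

def gate_volume(volume, surface_prob):
    """Scale each voxel's feature vector by its surface probability."""
    return volume * surface_prob[..., None]

# Toy volume (assumed): two views on a 2 x 2 x 2 grid with 3 feature channels.
view_feats = np.stack([np.ones((2, 2, 2, 3)), 3.0 * np.ones((2, 2, 2, 3))])
valid = np.ones((2, 2, 2, 2), dtype=bool)
mean, var = view_statistics(view_feats, valid)
surface_prob = np.zeros((2, 2, 2))
surface_prob[0, 0, 0] = 1.0                               # only one voxel is "on the surface"
gated = gate_volume(mean, surface_prob)
```

After gating, only the voxel marked as surface keeps its features, so a downstream 3D detector sees far fewer spurious activations in free space.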
6. Applications, misconceptions, and persistent constraints
The application range is correspondingly broad. FEGR uses neural geometry images for inverse rendering, photorealistic relighting with specular and shadow effects, and virtual object insertion with ray-traced shadow casting (Wang et al., 2023). Geometry Image Diffusion uses geometry images for text-to-3D generation with semantically meaningful parts and internal structures (Elizarov et al., 2024). OT-based geometry-image representations target storage-efficient mesh restoration with single-pass reconstruction and continuous level of detail (Gao et al., 24 Nov 2025). SwiftTailor uses garment-specific geometry images to unify sewing-pattern reasoning and fast 3D garment synthesis (Pham et al., 19 Mar 2026). The ECGI formulation uses geometry-tied latent signals to generalize across rotations and new anatomies without retraining (Jiang et al., 2020). ImGeoNet uses an image-induced geometry-aware voxel representation to improve multi-view 3D object detection and attains higher detection accuracy than VoteNet in scenarios with sparse and noisy point clouds or many diverse small objects (Tu et al., 2023).
Parallel 2D work shows that the same representational logic also applies to continuous image models. Image-GS represents images by anisotropic 2D Gaussians and reports only 0.3K MACs to decode a pixel, together with a smooth level-of-detail hierarchy (Zhang et al., 2024). MiraGe embeds 2D images as flat objects in 3D using mirror reflections and flat-controlled Gaussians, enabling realistic image modifications and physics-based image manipulation (Waczyńska et al., 2024). GaINeR combines trainable Gaussian distributions with an INR decoder and reports reconstruction quality in dB on the Kodak and DIV2K benchmarks, while exposing explicit geometric primitives for local editing (Jakubowska et al., 25 Nov 2025). These 2D systems are not surface-reconstruction methods, but they reinforce the general principle that geometry-aware primitives can replace monolithic coordinate MLPs when editability and localized control are primary objectives.
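The idea of an image built from explicit Gaussian primitives can be sketched generically. The renderer below blends every Gaussian at every pixel with normalized weights; it is an assumption-laden illustration and not the rendering rule of Image-GS, MiraGe, or GaINeR, and all names and toy parameters are hypothetical.

```python
import numpy as np

def render_gaussians(means, inv_covs, colors, height, width):
    """Normalized blend of anisotropic 2D Gaussians evaluated at integer pixel positions.

    means: (N, 2) centers as (x, y); inv_covs: (N, 2, 2) inverse covariances;
    colors: (N, 3) RGB values.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    pix = np.stack([xs, ys], axis=-1).astype(float)           # (H, W, 2)
    d = pix[:, :, None, :] - means[None, None, :, :]          # (H, W, N, 2)
    maha = np.einsum("hwni,nij,hwnj->hwn", d, inv_covs, d)    # squared Mahalanobis distance
    w = np.exp(-0.5 * maha)
    return (w[..., None] * colors).sum(axis=2) / (w.sum(axis=2)[..., None] + 1e-8)

# Toy image (assumed): a red and a blue Gaussian on an 8 x 8 canvas.
means = np.array([[2.0, 2.0], [6.0, 6.0]])
inv_covs = np.stack([np.eye(2), np.eye(2)])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
image = render_gaussians(means, inv_covs, colors, 8, 8)

# Local edit: moving one primitive changes only the region it covers.
moved = means.copy()
moved[1] = [6.0, 2.0]
edited = render_gaussians(moved, inv_covs, colors, 8, 8)
```

Because each primitive has an explicit position, shape, and color, an edit is a change to a few parameters rather than a retraining of a coordinate network, which is the editability argument made above.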
Several misconceptions can therefore be stated precisely. Neural geometry image-based representation is not synonymous with a single UV image of XYZ values; it may instead denote a neural G-buffer, a graph signal on a mesh hierarchy, a canonical SDF queried from image features, a geometry-aware voxel field, or an explicit Gaussian scaffold (Wang et al., 2023, Jiang et al., 2020, Wu et al., 15 May 2026, Tu et al., 2023, Waczyńska et al., 2024). It also does not imply a learned mesh decoder: the OT-based mesh representation is explicitly decoder-free, and SwiftTailor reconstructs meshes through deterministic remeshing and stitching rather than a mesh-specific neural decoder (Gao et al., 24 Nov 2025, Pham et al., 19 Mar 2026). Nor is every method aimed at full geometric fidelity; ImGeoNet is optimized for detection rather than for high-fidelity reconstruction (Tu et al., 2023).
The constraints are equally consistent across the literature. Image-based representations can lose geometry features relative to more direct 3D encodings (Xiao et al., 2020). FEGR remains ill-posed under single illumination, assumes static scenes, and uses approximate visibility gradients because mesh updates are discrete (Wang et al., 2023). Geometry-image pipelines may exhibit seams, small-feature failures, or UV-orientation artifacts (Elizarov et al., 2024, Gao et al., 24 Nov 2025). Geometry-aware graph models remain computationally heavy on large meshes and degrade when test geometries differ strongly from training (Jiang et al., 2020). SwiftTailor still lacks high-frequency wrinkles and depends on robust upstream pattern prediction (Pham et al., 19 Mar 2026). A plausible implication is that the central research problem is no longer whether geometry should be image-based, but which form of image-based geometry best matches a given downstream requirement: physically based rendering, generation, compression, detection, editing, or scientific inverse reconstruction.