
O-Voxel: Advanced 3D Scene Encoding

Updated 18 December 2025
  • O-Voxel representation is a versatile 3D encoding approach that augments classical voxel grids with flexible, sparse structures supporting arbitrary topology and material fidelity.
  • It employs dual-grid constructions, Bayesian fusion, and neural latent compression to efficiently capture detailed geometric, semantic, and appearance data.
  • The method enables real-time scene mapping and 3D generative modeling by integrating probabilistic semantic queries and adaptive voxelization techniques.

O-Voxel representation is a family of advanced 3D scene encoding methods that generalize the classical voxel grid to support flexible, open-vocabulary semantics, arbitrary surface topology, high-fidelity material attributes, and efficient neural-network latent compression. These representations are pivotal in bridging geometric, semantic, and appearance information for 3D assets, robotics, and embodied intelligence. “O-Voxel” can denote: (1) explicit sparse, geometry+material representations for 3D generative modeling (Xiang et al., 16 Dec 2025); (2) probabilistic open-vocabulary semantic grids for real-time mapping (Deng et al., 23 Feb 2025); (3) hybrid vision-language pipelines for semantic voxel extraction (Dao et al., 27 Mar 2025); or (4) implicit fields with per-voxel language features for online interactive comprehension (Tie et al., 10 Apr 2024). This article details the mathematical definitions, representational structures, encoding and learning modalities, performance characteristics, and algorithmic implications of O-Voxel representations.

1. Mathematical Formulation and Representational Structures

O-Voxel representations extend standard voxel grids by attaching richer information to active voxels and by employing sparse or adaptive indexing for scalability.

1.1 Geometry and Material Encoding (Sparse Structured O-Voxel)

The omni-voxel (“O-Voxel”) defined in (Xiang et al., 16 Dec 2025) is a surface-tied, field-free, sparse representation at resolution $N \times N \times N$, encoding only “active” voxels (those intersecting the 3D asset’s surface):

$$\mathcal{F} = \{ (f^{\mathrm{shape}}_i,\, f^{\mathrm{mat}}_i,\, p_i) \}_{i=1}^{L}$$

where $p_i \in \{0, \ldots, N-1\}^3$ is a 3D grid index; $f^{\mathrm{shape}}_i$ includes a local dual vertex $v_i \in [0,1]^3$, edge-intersection flags $\delta_i \in \{0,1\}^3$, and a quad-splitting weight $\gamma_i \in \mathbb{R}_+$; and $f^{\mathrm{mat}}_i$ stores PBR attributes: base color $c_i \in [0,1]^3$, metallic $m_i$, roughness $r_i$, and opacity $\alpha_i$.

This formulation is field-free: no underlying scalar field (e.g., SDF or occupancy) is needed. The dual grid is built dynamically by minimizing a quadratic error function (QEF), ensuring accurate surface placement, sharp features, and support for open and non-manifold geometry.
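To make the structure concrete, the following is a minimal sketch of a sparse O-Voxel container in Python, assuming NumPy arrays; the field names mirror the formulation above rather than any official implementation.

```python
# Minimal sketch of a sparse O-Voxel container (assumed layout, not the
# paper's official data structure).
from dataclasses import dataclass
import numpy as np

@dataclass
class SparseOVoxel:
    # L active voxels out of an N^3 grid; only surface-intersecting cells stored.
    indices: np.ndarray      # (L, 3) int grid coordinates p_i in {0..N-1}^3
    dual_vertex: np.ndarray  # (L, 3) float, v_i in [0,1]^3 local to each cell
    edge_flags: np.ndarray   # (L, 3) bool, delta_i: intersection on 3 primal edges
    quad_weight: np.ndarray  # (L,)  float, gamma_i for quad splitting
    base_color: np.ndarray   # (L, 3) float in [0,1]
    metallic: np.ndarray     # (L,)
    roughness: np.ndarray    # (L,)
    opacity: np.ndarray      # (L,)
    resolution: int          # N

    def world_vertices(self) -> np.ndarray:
        """Dual vertices mapped into normalized [0,1]^3 world coordinates."""
        return (self.indices + self.dual_vertex) / self.resolution
```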

1.2 Probabilistic Open-Vocabulary Semantic Voxels

For instance-level open-vocabulary mapping (Deng et al., 23 Feb 2025), the O-Voxel representation is:

$$v^j = \big( \theta^j,\, p^{j}_{\mathrm{occ}} \big)$$

where $p^j_{\mathrm{occ}}$ is the occupancy probability (often $1$ for observed voxels), and $\theta^j$ is a categorical semantic distribution over instances $\Gamma$, typically parameterized by Dirichlet counts:

$$\theta^{j,\gamma}_t = \frac{\alpha^j_{t,\gamma}}{\sum_{\tau \in \Gamma} \alpha^j_{t,\tau}}$$

Each instance $\gamma$ is associated with a high-dimensional language embedding $f^\gamma \in \mathbb{R}^D$ supporting free-form semantic queries.
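As a sketch of the Dirichlet parameterization, the toy class below accumulates per-voxel instance evidence and exposes the posterior-mean distribution $\theta$; the observation weighting is illustrative, not OpenVox's exact update rule.

```python
# Hedged sketch of the per-voxel categorical-Dirichlet update.
import numpy as np

class VoxelSemantics:
    def __init__(self, num_instances: int, prior: float = 1.0):
        # alpha holds Dirichlet pseudo-counts over the instance set Gamma.
        self.alpha = np.full(num_instances, prior)

    def observe(self, instance_id: int, weight: float = 1.0):
        """Accumulate evidence that this voxel belongs to `instance_id`."""
        self.alpha[instance_id] += weight

    def theta(self) -> np.ndarray:
        """Posterior mean: theta_gamma = alpha_gamma / sum_tau alpha_tau."""
        return self.alpha / self.alpha.sum()

v = VoxelSemantics(num_instances=4)
for obs in [2, 2, 1, 2]:   # e.g., instance IDs from back-projected 2D masks
    v.observe(obs)
print(v.theta())           # instance 2 now dominates the distribution
```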

1.3 Hybrid Vision-Language O-Voxel Grids

In the VoxRep pipeline (Dao et al., 27 Mar 2025), the voxel grid is defined as $V \in \mathbb{R}^{W \times H \times D \times C}$, with $C = 1$ (occupancy) or $C = 4$ (occupancy + RGB). Slices along a chosen axis are normalized, encoded via a frozen or fine-tuned 2D vision-language encoder, and aggregated; the resulting feature vector is decoded by an LLM to extract semantic, color, count, and positional information for each 3D object.
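A minimal illustration of the slicing step, assuming an occupancy+RGB grid ($C = 4$); the per-slice normalization is a placeholder rather than VoxRep's exact preprocessing.

```python
# Illustrative voxel-grid slicing; normalization details are assumptions.
import numpy as np

def slice_voxel_grid(V: np.ndarray, axis: int = 2) -> list[np.ndarray]:
    """Split V (W, H, D, C) into per-slice 2D 'images' along `axis`."""
    slices = np.moveaxis(V, axis, 0)             # e.g., (D, W, H, C)
    out = []
    for s in slices:
        lo, hi = s.min(), s.max()
        out.append((s - lo) / (hi - lo + 1e-8))  # normalize to [0, 1] per slice
    return out

V = np.random.rand(32, 32, 32, 4)   # toy occupancy+RGB grid
images = slice_voxel_grid(V)        # 32 slices, each (32, 32, 4)
# Each slice would then be fed to the 2D vision-language encoder.
```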

1.4 Implicit Neural Field with Voxel-based Language Features

O2V-Mapping (Tie et al., 10 Apr 2024) maintains three parallel grids: geometric ($\phi^d$), appearance ($\phi^c$), and a queue $Q(v)$ of recent CLIP-based language features plus confidence scores per voxel. At query time, interpolation and per-voxel voting enable crisp open-vocabulary 3D queries and semantic rendering.
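A hedged sketch of such a per-voxel feature queue with confidence-weighted voting, assuming unit-norm CLIP-style embeddings and an arbitrary queue length:

```python
# Sketch of a per-voxel language-feature queue; queue length and the
# weighted-mean vote are assumptions, not the paper's exact scheme.
from collections import deque
import numpy as np

class VoxelFeatureQueue:
    def __init__(self, maxlen: int = 8):
        self.queue = deque(maxlen=maxlen)   # holds (feature, confidence) pairs

    def push(self, feature: np.ndarray, confidence: float):
        self.queue.append((feature, confidence))

    def vote(self) -> np.ndarray:
        """Confidence-weighted mean of recent features, renormalized."""
        feats = np.stack([f for f, _ in self.queue])
        w = np.array([c for _, c in self.queue])
        fused = (w[:, None] * feats).sum(0) / w.sum()
        return fused / (np.linalg.norm(fused) + 1e-8)
```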

2. Representation Construction, Encoding, and Learning

2.1 Sparse Dual-Grid Construction and Material Assignment

In (Xiang et al., 16 Dec 2025), O-Voxel is extracted from raw mesh+PBR-texture pairs, assigning dual vertices via Hermite data and QEF optimization, edge flags for surface intersection, and material features by projecting voxel centers to surface triangle textures. These features enable precise surface/topology capture and native PBR rendering support.
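The dual-vertex placement is the standard dual-contouring QEF solve over Hermite samples (edge-intersection points with normals); the sketch below is a generic least-squares version, not the paper's exact solver.

```python
# Generic dual-contouring QEF solve from Hermite data (assumed formulation).
import numpy as np

def solve_qef(points: np.ndarray, normals: np.ndarray) -> np.ndarray:
    """Minimize sum_k (n_k . (v - p_k))^2 for the dual vertex v."""
    A = normals                                  # (K, 3)
    b = np.einsum("kd,kd->k", normals, points)   # n_k . p_k
    # Solve relative to the centroid so that rank-deficient cases (flat
    # patches) fall back to the mean point via the minimum-norm solution.
    centroid = points.mean(axis=0)
    v, *_ = np.linalg.lstsq(A, b - A @ centroid, rcond=1e-3)
    return centroid + v
```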

2.2 Probabilistic Incremental Fusion

OpenVox (Deng et al., 23 Feb 2025) operates by associating 2D segmentations and caption embeddings with 3D voxels via back-projection, performing instance association through joint geometric and feature similarity, and updating per-voxel categorical-Dirichlet parameters. Language embeddings are fused per instance via weighted averages, and the entire system employs a sparse hash-grid for memory efficiency.
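The following sketch illustrates the association and fusion steps; the geometric/semantic similarity mixture, the acceptance threshold, and the running-average fusion are assumptions, not OpenVox's published constants.

```python
# Illustrative instance association and caption-embedding fusion.
import numpy as np

def associate(det_feat, det_centroid, instances, thresh=0.5, w_geo=0.5):
    """Match a 2D detection to an existing 3D instance, or return None."""
    best_id, best_score = None, -1.0
    for inst_id, (feat, centroid) in instances.items():
        sem = float(det_feat @ feat)                        # cosine (unit-norm)
        geo = float(np.exp(-np.linalg.norm(det_centroid - centroid)))
        score = w_geo * geo + (1 - w_geo) * sem
        if score > best_score:
            best_id, best_score = inst_id, score
    return best_id if best_score >= thresh else None

def fuse_embedding(old_feat, new_feat, old_weight, new_weight=1.0):
    """Weighted running average of per-instance language embeddings."""
    fused = old_weight * old_feat + new_weight * new_feat
    return fused / np.linalg.norm(fused), old_weight + new_weight
```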

2.3 2D Slice Encoding and Multi-Slice Aggregation

VoxRep (Dao et al., 27 Mar 2025) slices the voxel grid along the depth axis, pre-processes and encodes each slice with a 2D vision-language encoder, and aggregates the slice embeddings (via self-attention or a recurrent mechanism). The pooled feature is used for language-generation tasks (e.g., object-list extraction, attribute summarization), trained with composite losses (cross-entropy for token sequences, $\ell_2$ for positions, $\ell_1$ for voxel counts).
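A compact sketch of such a composite loss in PyTorch follows; the loss weights (here lam_pos, lam_cnt) are assumptions, as the paper's exact coefficients are not reproduced here.

```python
# Composite training loss sketch: cross-entropy for tokens, l2 for
# positions, l1 for counts; weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def voxrep_style_loss(token_logits, token_targets, pos_pred, pos_gt,
                      count_pred, count_gt, lam_pos=1.0, lam_cnt=0.1):
    # token_logits: (B, T, V); token_targets: (B, T)
    l_tok = F.cross_entropy(token_logits.flatten(0, 1), token_targets.flatten())
    l_pos = F.mse_loss(pos_pred, pos_gt)      # l2 on 3D object positions
    l_cnt = F.l1_loss(count_pred, count_gt)   # l1 on voxel/object counts
    return l_tok + lam_pos * l_pos + lam_cnt * l_cnt
```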

2.4 Online Neural Field Training and Adaptive Voxelization

O2V-Mapping (Tie et al., 10 Apr 2024) performs per-frame updates by extracting semantic masks and language features, projecting to voxel queues, and updating neural field parameters via differentiable volume rendering losses (geometry, color, semantics). Adaptive voxel splitting is invoked at semantic boundaries, and multi-view voting yields robust and consistent feature aggregation.
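A toy version of the adaptive split criterion appears below: a voxel subdivides when its queued features disagree, serving as a proxy for a semantic boundary; the agreement threshold and the octree-style subdivision are assumptions, not the paper's exact rule.

```python
# Toy semantic-boundary split criterion and octree-style subdivision.
import numpy as np

def should_split(features: np.ndarray, agreement_thresh: float = 0.9) -> bool:
    """Split when mean pairwise cosine similarity of queued features is low."""
    if len(features) < 2:
        return False
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f.T
    mean_sim = (sims.sum() - len(f)) / (len(f) * (len(f) - 1))  # off-diagonal mean
    return mean_sim < agreement_thresh

def split_voxel(origin: np.ndarray, size: float):
    """Return the 8 child cells (origin, half-size) of an axis-aligned voxel."""
    half = size / 2.0
    offsets = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])
    return [(origin + off * half, half) for off in offsets]
```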

3. Expressive Power, Topology, and Semantic Support

O-Voxel representations differ fundamentally from classical voxels by supporting:

  • Arbitrary topology: O-Voxel does not require a watertight volume or implicit field and directly enumerates intersected voxels, supporting open, non-manifold, and fully enclosed geometries (Xiang et al., 16 Dec 2025).
  • Material and appearance fidelity: Each active voxel can encode PBR attributes (base color, roughness, metallic, opacity), supporting high-fidelity rendering and transfer.
  • Semantic richness: Open-vocabulary instance labeling is supported via per-voxel categorical distributions and language embeddings (Deng et al., 23 Feb 2025, Tie et al., 10 Apr 2024).
  • Hierarchical and free-form semantics: By storing queues or distributions over features and supporting multi-scale segmentations, per-voxel semantics can reflect hierarchical or composite object categories.
  • Volume and surface mapping: O-Voxel supports both surface-focused parametrization (dual grid) and volumetric occupancy or semantic fields as needed by downstream applications.

4. Compression, Efficiency, and Scalability

Sparse O-Voxel implementations exploit massive sparsity, storing features only at active voxels (typically $L \ll N^3$). In (Xiang et al., 16 Dec 2025), a $1024^3$ asset is reduced to $\sim 9.6$K tokens via $16\times$ downsampling in a Sparse Compression VAE, greatly outperforming dense and prior sparse methods (e.g., $\sim 225$K tokens for SparseFlex at $1024^3$).

Probabilistic O-Voxel grids (OpenVox) use sparse hash grids for live memory efficiency, with practical memory requirements: a $10 \times 10 \times 3$ m scene at $4$ cm resolution fits $\sim 1.9$M voxels in $50$ MB (Deng et al., 23 Feb 2025). Online implicit approaches (Tie et al., 10 Apr 2024) combine adaptivity (finer voxels only at semantic boundaries) with online neural field optimization, yielding scalability suitable for real-time applications.
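A back-of-envelope check of these numbers (a worked sketch; the bytes-per-voxel figure is simply the reported memory divided by the reported voxel count):

```python
# Arithmetic check: 10 m x 10 m x 3 m scene at 4 cm resolution.
dim = (10.0, 10.0, 3.0)   # meters
res = 0.04                # 4 cm voxels
cells = 1
for d in dim:
    cells *= round(d / res)
print(f"dense cells: {cells:,}")   # 4,687,500 potential voxels
# The ~1.9M actually-stored voxels reflect hash-grid sparsity; at 50 MB,
# that implies roughly 26 bytes per stored voxel on average.
print(f"bytes/voxel: {50e6 / 1.9e6:.1f}")
```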

5. Key Algorithmic Properties: Aggregation, Fusion, and Query

  • Aggregation: Self-attention over slice features (VoxRep), Bayesian Dirichlet fusion (OpenVox), and multi-view voting with adaptive subdivision (O2V-Mapping) are all employed as aggregation mechanisms, depending on representational focus.
  • Fusion: Language and geometric features may be fused probabilistically (categorical-Dirichlet), averaged with view-dependent weights, or composed in queues for later querying.
  • Semantic Query: Open-vocabulary O-Voxel mapping supports natural language queries (“find the red toolbox”) by matching prompt embeddings to instance codebooks and per-voxel semantics (Deng et al., 23 Feb 2025, Tie et al., 10 Apr 2024); see the sketch after this list. Hierarchical queries (“door handle” vs. “door”) are resolved by the multi-scale, feature-queue nature of semantic voxels.
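A minimal sketch of the query step, assuming the prompt embedding comes from a CLIP-style text encoder and that the instance codebook stores unit-norm embeddings:

```python
# Open-vocabulary query sketch: rank instances by cosine similarity to the
# prompt embedding. The codebook layout is an assumed simplification.
import numpy as np

def query_instances(prompt_feat: np.ndarray, codebook: dict[int, np.ndarray],
                    top_k: int = 3) -> list[tuple[int, float]]:
    """Return the top-k (instance_id, score) matches for a text prompt."""
    q = prompt_feat / np.linalg.norm(prompt_feat)
    scores = {i: float(q @ (f / np.linalg.norm(f))) for i, f in codebook.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

# Voxels are then highlighted via their per-voxel theta over the matched
# instance labels, so "find the red toolbox" lights up the right region.
```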

6. Applications and Benchmarks

O-Voxel representations have been validated in multiple domains:

  • 3D generative modeling: O-Voxel latent spaces, via Sparse Compression VAE and flow-matching transformers, allow high-fidelity 3D asset generation with compact representations of $\sim 9.6$K tokens for full PBR assets (Xiang et al., 16 Dec 2025).
  • Real-time mapping and robotics: OpenVox achieves state-of-the-art zero-shot 3D segmentation (e.g., mIoU $27.30\%$ for OpenVox vs. $16.49\%$ for ConceptGraphs), ontology retrieval R@1 of $0.905$, and real-world, real-time operation (7–14 Hz) on commodity hardware (Deng et al., 23 Feb 2025).
  • Online open-vocabulary scene construction: O2V-Mapping demonstrates higher mIoU than LERF ($0.39$ vs. $0.35$) and OVSeg, with higher frame rates and clearer object boundaries, even for fine-grained hierarchical queries (Tie et al., 10 Apr 2024).
  • Graph contact representations: Classical voxel representations for graph embedding have provable bounds: $\Theta(n^2)$ for general graphs, refined to $\Theta(n\tau)$ for treewidth-$\tau$ graphs and $O((g+1)^2 n \log^2 n)$ for genus-$g$ graphs; computing an optimal representation is NP-complete (Alam et al., 2015).

7. Comparative Analysis and Limitations

The table summarizes important differences among modern 3D representations:

| Representation | Topology Support | Compression | Material Attributes | Open-Vocabulary Semantics |
|---|---|---|---|---|
| O-Voxel (sparse; Xiang et al., 16 Dec 2025) | Arbitrary; open/non-manifold | Sparse, $\ll N^3$ | Full PBR | Not inherent (can add) |
| OpenVox (Deng et al., 23 Feb 2025) | Volumetric; instance-level | Sparse | Occupancy only | Yes (Dirichlet + embedding) |
| VoxRep (Dao et al., 27 Mar 2025) | Volumetric | Dense (tiled) | RGB/color | Yes (language head) |
| O2V-Mapping (Tie et al., 10 Apr 2024) | Volumetric + field | Adaptive sparse | Color (implicit) | Yes (CLIP queue) |
| Mesh | Irregular | None | Texture/PBR | No |
| NeRF | Continuous, implicit | Latent | View-dependent | No (traditionally) |

O-Voxel representations bridge the gap between structured geometry, open-category semantics, and differentiable neural compression. Dense approaches remain memory-bound at high resolutions; field-based methods struggle with open and non-manifold surfaces; mesh- and point-based methods lack latent regularity and semantic compositionality. O-Voxel’s field-free nature, sparsity, and extensibility to arbitrary semantics position it at the forefront of 3D scene understanding and generation research.
