
O-Voxel: Advanced 3D Scene Encoding

Updated 18 December 2025
  • O-Voxel representation is a versatile 3D encoding approach that augments classical voxel grids with flexible, sparse structures supporting arbitrary topology and material fidelity.
  • It employs dual-grid constructions, Bayesian fusion, and neural latent compression to efficiently capture detailed geometric, semantic, and appearance data.
  • The method enables real-time scene mapping and 3D generative modeling by integrating probabilistic semantic queries and adaptive voxelization techniques.

O-Voxel representation is a family of advanced 3D scene encoding methods that generalize the classical voxel grid to support flexible, open-vocabulary semantics, arbitrary surface topology, high-fidelity material attributes, and efficient neural-network latent compression. These representations are pivotal in bridging geometric, semantic, and appearance information for 3D assets, robotics, and embodied intelligence. “O-Voxel” can denote: (1) explicit sparse, geometry+material representations for 3D generative modeling (Xiang et al., 16 Dec 2025); (2) probabilistic open-vocabulary semantic grids for real-time mapping (Deng et al., 23 Feb 2025); (3) hybrid vision-language pipelines for semantic voxel extraction (Dao et al., 27 Mar 2025); or (4) implicit fields with per-voxel language features for online interactive comprehension (Tie et al., 10 Apr 2024). This article details the mathematical definitions, representational structures, encoding and learning modalities, performance characteristics, and algorithmic implications of O-Voxel representations.

1. Mathematical Formulation and Representational Structures

O-Voxel representations extend standard voxel grids by attaching richer information to active voxels and by employing sparse or adaptive indexing for scalability.

1.1 Geometry and Material Encoding (Sparse Structured O-Voxel)

The omni-voxel (“O-Voxel”) defined in (Xiang et al., 16 Dec 2025) is a surface-tied, field-free, sparse representation at resolution $N \times N \times N$, encoding only “active” voxels (those intersecting the 3D asset’s surface):

$$\mathcal{F} = \{ (f^{\mathrm{shape}}_i,\, f^{\mathrm{mat}}_i,\, p_i) \}_{i=1}^{L}$$

where $p_i \in \{0, \ldots, N-1\}^3$ is a 3D grid index; $f^{\mathrm{shape}}_i$ includes a local dual vertex $v_i \in [0,1]^3$, edge-intersection flags $\delta_i \in \{0,1\}^3$, and a quad-splitting weight $\gamma_i \in \mathbb{R}_+$; and $f^{\mathrm{mat}}_i$ stores PBR attributes: base color $c_i \in [0,1]^3$, metallic $m_i$, roughness $r_i$, and opacity $\alpha_i$.

This formulation is field-free: no underlying scalar field (e.g., SDF or occupancy) is needed. The dual grid is built dynamically by minimizing a quadratic error function (QEF), ensuring accurate surface placement, sharp features, and support for open and non-manifold geometry.
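To make the structure concrete, the following is a minimal sketch of a sparse O-Voxel container in Python, assuming NumPy arrays; the field names mirror the formulation above rather than any official implementation.

```python
# Minimal sketch of a sparse O-Voxel container (assumed layout, not the
# paper's official data structure).
from dataclasses import dataclass
import numpy as np

@dataclass
class SparseOVoxel:
    # L active voxels out of an N^3 grid; only surface-intersecting cells stored.
    indices: np.ndarray      # (L, 3) int grid coordinates p_i in {0..N-1}^3
    dual_vertex: np.ndarray  # (L, 3) float, v_i in [0,1]^3 local to each cell
    edge_flags: np.ndarray   # (L, 3) bool, delta_i: intersection on 3 primal edges
    quad_weight: np.ndarray  # (L,)  float, gamma_i for quad splitting
    base_color: np.ndarray   # (L, 3) float in [0,1]
    metallic: np.ndarray     # (L,)
    roughness: np.ndarray    # (L,)
    opacity: np.ndarray      # (L,)
    resolution: int          # N

    def world_vertices(self) -> np.ndarray:
        """Dual vertices mapped into normalized [0,1]^3 world coordinates."""
        return (self.indices + self.dual_vertex) / self.resolution
```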

1.2 Probabilistic Open-Vocabulary Semantic Voxels

For instance-level open-vocabulary mapping (Deng et al., 23 Feb 2025), the O-Voxel representation is:

$$v^j = \big( \theta^j,\, p^{j}_{\mathrm{occ}} \big)$$

where $p^j_{\mathrm{occ}}$ is the occupancy probability (often $1$ for observed voxels), and $\theta^j$ is a categorical semantic distribution over instances $\Gamma$, typically parameterized by Dirichlet counts:

$$\theta^{j,\gamma}_t = \frac{\alpha^j_{t,\gamma}}{\sum_{\tau \in \Gamma} \alpha^j_{t,\tau}}$$

Each instance $\gamma$ is associated with a high-dimensional language embedding $f^\gamma \in \mathbb{R}^D$ supporting free-form semantic queries.
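As a sketch of the Dirichlet parameterization, the toy class below accumulates per-voxel instance evidence and exposes the posterior-mean distribution $\theta$; the observation weighting is illustrative, not OpenVox's exact update rule.

```python
# Hedged sketch of the per-voxel categorical-Dirichlet update.
import numpy as np

class VoxelSemantics:
    def __init__(self, num_instances: int, prior: float = 1.0):
        # alpha holds Dirichlet pseudo-counts over the instance set Gamma.
        self.alpha = np.full(num_instances, prior)

    def observe(self, instance_id: int, weight: float = 1.0):
        """Accumulate evidence that this voxel belongs to `instance_id`."""
        self.alpha[instance_id] += weight

    def theta(self) -> np.ndarray:
        """Posterior mean: theta_gamma = alpha_gamma / sum_tau alpha_tau."""
        return self.alpha / self.alpha.sum()

v = VoxelSemantics(num_instances=4)
for obs in [2, 2, 1, 2]:   # e.g., instance IDs from back-projected 2D masks
    v.observe(obs)
print(v.theta())           # instance 2 now dominates the distribution
```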

1.3 Hybrid Vision-Language O-Voxel Grids

In the VoxRep pipeline (Dao et al., 27 Mar 2025), the voxel grid is defined as $V \in \mathbb{R}^{W \times H \times D \times C}$, with $C = 1$ (occupancy) or $C = 4$ (occupancy + RGB). Slices along a chosen axis are normalized, encoded via a frozen or fine-tuned 2D vision-language encoder, and aggregated; the resulting feature vector is decoded by an LLM to extract semantic, color, count, and positional information for each 3D object.
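A minimal illustration of the slicing step, assuming an occupancy+RGB grid ($C = 4$); the per-slice normalization is a placeholder rather than VoxRep's exact preprocessing.

```python
# Illustrative voxel-grid slicing; normalization details are assumptions.
import numpy as np

def slice_voxel_grid(V: np.ndarray, axis: int = 2) -> list[np.ndarray]:
    """Split V (W, H, D, C) into per-slice 2D 'images' along `axis`."""
    slices = np.moveaxis(V, axis, 0)             # e.g., (D, W, H, C)
    out = []
    for s in slices:
        lo, hi = s.min(), s.max()
        out.append((s - lo) / (hi - lo + 1e-8))  # normalize to [0, 1] per slice
    return out

V = np.random.rand(32, 32, 32, 4)   # toy occupancy+RGB grid
images = slice_voxel_grid(V)        # 32 slices, each (32, 32, 4)
# Each slice would then be fed to the 2D vision-language encoder.
```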

1.4 Implicit Neural Field with Voxel-based Language Features

O2V-Mapping (Tie et al., 10 Apr 2024) maintains three parallel grids: geometric ($\phi^d$), appearance ($\phi^c$), and a queue $Q(v)$ of recent CLIP-based language features plus confidence scores per voxel. At query time, interpolation and per-voxel voting enable crisp open-vocabulary 3D queries and semantic rendering.
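A hedged sketch of such a per-voxel feature queue with confidence-weighted voting, assuming unit-norm CLIP-style embeddings and an arbitrary queue length:

```python
# Sketch of a per-voxel language-feature queue; queue length and the
# weighted-mean vote are assumptions, not the paper's exact scheme.
from collections import deque
import numpy as np

class VoxelFeatureQueue:
    def __init__(self, maxlen: int = 8):
        self.queue = deque(maxlen=maxlen)   # holds (feature, confidence) pairs

    def push(self, feature: np.ndarray, confidence: float):
        self.queue.append((feature, confidence))

    def vote(self) -> np.ndarray:
        """Confidence-weighted mean of recent features, renormalized."""
        feats = np.stack([f for f, _ in self.queue])
        w = np.array([c for _, c in self.queue])
        fused = (w[:, None] * feats).sum(0) / w.sum()
        return fused / (np.linalg.norm(fused) + 1e-8)
```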

2. Representation Construction, Encoding, and Learning

2.1 Sparse Dual-Grid Construction and Material Assignment

In (Xiang et al., 16 Dec 2025), O-Voxel is extracted from raw mesh+PBR-texture pairs, assigning dual vertices via Hermite data and QEF optimization, edge flags for surface intersection, and material features by projecting voxel centers to surface triangle textures. These features enable precise surface/topology capture and native PBR rendering support.
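The dual-vertex placement is the standard dual-contouring QEF solve over Hermite samples (edge-intersection points with normals); the sketch below is a generic least-squares version, not the paper's exact solver.

```python
# Generic dual-contouring QEF solve from Hermite data (assumed formulation).
import numpy as np

def solve_qef(points: np.ndarray, normals: np.ndarray) -> np.ndarray:
    """Minimize sum_k (n_k . (v - p_k))^2 for the dual vertex v."""
    A = normals                                  # (K, 3)
    b = np.einsum("kd,kd->k", normals, points)   # n_k . p_k
    # Solve relative to the centroid so that rank-deficient cases (flat
    # patches) fall back to the mean point via the minimum-norm solution.
    centroid = points.mean(axis=0)
    v, *_ = np.linalg.lstsq(A, b - A @ centroid, rcond=1e-3)
    return centroid + v
```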

2.2 Probabilistic Incremental Fusion

OpenVox (Deng et al., 23 Feb 2025) operates by associating 2D segmentations and caption embeddings with 3D voxels via back-projection, performing instance association through joint geometric and feature similarity, and updating per-voxel categorical-Dirichlet parameters. Language embeddings are fused per instance via weighted averages, and the entire system employs a sparse hash-grid for memory efficiency.
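The following sketch illustrates the association and fusion steps; the geometric/semantic similarity mixture, the acceptance threshold, and the running-average fusion are assumptions, not OpenVox's published constants.

```python
# Illustrative instance association and caption-embedding fusion.
import numpy as np

def associate(det_feat, det_centroid, instances, thresh=0.5, w_geo=0.5):
    """Match a 2D detection to an existing 3D instance, or return None."""
    best_id, best_score = None, -1.0
    for inst_id, (feat, centroid) in instances.items():
        sem = float(det_feat @ feat)                        # cosine (unit-norm)
        geo = float(np.exp(-np.linalg.norm(det_centroid - centroid)))
        score = w_geo * geo + (1 - w_geo) * sem
        if score > best_score:
            best_id, best_score = inst_id, score
    return best_id if best_score >= thresh else None

def fuse_embedding(old_feat, new_feat, old_weight, new_weight=1.0):
    """Weighted running average of per-instance language embeddings."""
    fused = old_weight * old_feat + new_weight * new_feat
    return fused / np.linalg.norm(fused), old_weight + new_weight
```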

2.3 2D Slice Encoding and Multi-Slice Aggregation

VoxRep (Dao et al., 27 Mar 2025) slices the voxel grid along the depth axis, pre-processes and encodes each slice with a 2D vision-language encoder, and aggregates the slice embeddings (via self-attention or a recurrent mechanism). The pooled feature is used for language-generation tasks (e.g., object-list extraction, attribute summarization), trained with composite losses (cross-entropy for token sequences, $\ell_2$ for positions, $\ell_1$ for voxel counts).
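A compact sketch of such a composite loss in PyTorch follows; the loss weights (here lam_pos, lam_cnt) are assumptions, as the paper's exact coefficients are not reproduced here.

```python
# Composite training loss sketch: cross-entropy for tokens, l2 for
# positions, l1 for counts; weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def voxrep_style_loss(token_logits, token_targets, pos_pred, pos_gt,
                      count_pred, count_gt, lam_pos=1.0, lam_cnt=0.1):
    # token_logits: (B, T, V); token_targets: (B, T)
    l_tok = F.cross_entropy(token_logits.flatten(0, 1), token_targets.flatten())
    l_pos = F.mse_loss(pos_pred, pos_gt)      # l2 on 3D object positions
    l_cnt = F.l1_loss(count_pred, count_gt)   # l1 on voxel/object counts
    return l_tok + lam_pos * l_pos + lam_cnt * l_cnt
```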

2.4 Online Neural Field Training and Adaptive Voxelization

O2V-Mapping (Tie et al., 10 Apr 2024) performs per-frame updates by extracting semantic masks and language features, projecting to voxel queues, and updating neural field parameters via differentiable volume rendering losses (geometry, color, semantics). Adaptive voxel splitting is invoked at semantic boundaries, and multi-view voting yields robust and consistent feature aggregation.
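A toy version of the adaptive split criterion appears below: a voxel subdivides when its queued features disagree, serving as a proxy for a semantic boundary; the agreement threshold and the octree-style subdivision are assumptions, not the paper's exact rule.

```python
# Toy semantic-boundary split criterion and octree-style subdivision.
import numpy as np

def should_split(features: np.ndarray, agreement_thresh: float = 0.9) -> bool:
    """Split when mean pairwise cosine similarity of queued features is low."""
    if len(features) < 2:
        return False
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f.T
    mean_sim = (sims.sum() - len(f)) / (len(f) * (len(f) - 1))  # off-diagonal mean
    return mean_sim < agreement_thresh

def split_voxel(origin: np.ndarray, size: float):
    """Return the 8 child cells (origin, half-size) of an axis-aligned voxel."""
    half = size / 2.0
    offsets = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])
    return [(origin + off * half, half) for off in offsets]
```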

3. Expressive Power, Topology, and Semantic Support

O-Voxel representations differ fundamentally from classical voxels by supporting:

  • Arbitrary topology: O-Voxel does not require a watertight volume or implicit field and directly enumerates intersected voxels, supporting open, non-manifold, and fully enclosed geometries (Xiang et al., 16 Dec 2025).
  • Material and appearance fidelity: Each active voxel can encode PBR attributes (base color, roughness, metallic, opacity), supporting high-fidelity rendering and transfer.
  • Semantic richness: Open-vocabulary instance labeling is supported via per-voxel categorical distributions and language embeddings (Deng et al., 23 Feb 2025, Tie et al., 10 Apr 2024).
  • Hierarchical and free-form semantics: By storing queues or distributions over features and supporting multi-scale segmentations, per-voxel semantics can reflect hierarchical or composite object categories.
  • Volume and surface mapping: O-Voxel supports both surface-focused parametrization (dual grid) and volumetric occupancy or semantic fields as needed by downstream applications.

4. Compression, Efficiency, and Scalability

Sparse O-Voxel implementations exploit massive sparsity, storing features only at active voxels (typically $L \ll N^3$). In (Xiang et al., 16 Dec 2025), a $1024^3$ asset is reduced to $\sim 9.6$K tokens via $16\times$ downsampling in a Sparse Compression VAE, greatly outperforming dense and prior sparse methods (e.g., $\sim 225$K tokens for SparseFlex at $1024^3$).

Probabilistic O-Voxel grids (OpenVox) use sparse hash grids for live memory efficiency, with practical memory requirements: a $10 \times 10 \times 3$ m scene at $4$ cm resolution fits $\sim 1.9$M voxels in $50$ MB (Deng et al., 23 Feb 2025). Online implicit approaches (Tie et al., 10 Apr 2024) combine adaptivity (finer voxels only at semantic boundaries) with online neural field optimization, yielding scalability suitable for real-time applications.
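A back-of-envelope check of these numbers (a worked sketch; the bytes-per-voxel figure is simply the reported memory divided by the reported voxel count):

```python
# Arithmetic check: 10 m x 10 m x 3 m scene at 4 cm resolution.
dim = (10.0, 10.0, 3.0)   # meters
res = 0.04                # 4 cm voxels
cells = 1
for d in dim:
    cells *= round(d / res)
print(f"dense cells: {cells:,}")   # 4,687,500 potential voxels
# The ~1.9M actually-stored voxels reflect hash-grid sparsity; at 50 MB,
# that implies roughly 26 bytes per stored voxel on average.
print(f"bytes/voxel: {50e6 / 1.9e6:.1f}")
```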

5. Key Algorithmic Properties: Aggregation, Fusion, and Query

  • Aggregation: Self-attention over slice features (VoxRep), Bayesian Dirichlet fusion (OpenVox), and multi-view voting with adaptive subdivision (O2V-Mapping) are all employed as aggregation mechanisms, depending on representational focus.
  • Fusion: Language and geometric features may be fused probabilistically (categorical-Dirichlet), averaged with view-dependent weights, or composed in queues for later querying.
  • Semantic Query: Open-vocabulary O-Voxel mapping supports natural language queries (“find the red toolbox”) by matching prompt embeddings to instance codebooks and per-voxel semantics (Deng et al., 23 Feb 2025, Tie et al., 10 Apr 2024); see the sketch after this list. Hierarchical queries (“door handle” vs. “door”) are resolved by the multi-scale, feature-queue nature of semantic voxels.
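A minimal sketch of the query step, assuming the prompt embedding comes from a CLIP-style text encoder and that the instance codebook stores unit-norm embeddings:

```python
# Open-vocabulary query sketch: rank instances by cosine similarity to the
# prompt embedding. The codebook layout is an assumed simplification.
import numpy as np

def query_instances(prompt_feat: np.ndarray, codebook: dict[int, np.ndarray],
                    top_k: int = 3) -> list[tuple[int, float]]:
    """Return the top-k (instance_id, score) matches for a text prompt."""
    q = prompt_feat / np.linalg.norm(prompt_feat)
    scores = {i: float(q @ (f / np.linalg.norm(f))) for i, f in codebook.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

# Voxels are then highlighted via their per-voxel theta over the matched
# instance labels, so "find the red toolbox" lights up the right region.
```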

6. Applications and Benchmarks

O-Voxel representations have been validated in multiple domains:

  • 3D generative modeling: O-Voxel latent spaces, via Sparse Compression VAE and flow-matching transformers, allow high-fidelity 3D asset generation with compact representations of $\sim 9.6$K tokens for full PBR assets (Xiang et al., 16 Dec 2025).
  • Real-time mapping and robotics: OpenVox achieves state-of-the-art zero-shot 3D segmentation (e.g., mIoU $27.30\%$ for OpenVox vs. $16.49\%$ for ConceptGraphs), ontology retrieval R@1 of $0.905$, and real-world, real-time operation (7–14 Hz) on commodity hardware (Deng et al., 23 Feb 2025).
  • Online open-vocabulary scene construction: O2V-Mapping demonstrates higher mIoU than LERF ($0.39$ vs. $0.35$) and OVSeg, with higher frame rates and clearer object boundaries, even for fine-grained hierarchical queries (Tie et al., 10 Apr 2024).
  • Graph contact representations: Classical voxel representations for graph embedding have provable bounds: $\Theta(n^2)$ for general graphs, refined to $\Theta(n\tau)$ for treewidth-$\tau$ graphs and $O((g+1)^2 n \log^2 n)$ for genus-$g$ graphs; computing an optimal representation is NP-complete (Alam et al., 2015).

7. Comparative Analysis and Limitations

The table summarizes important differences among modern 3D representations:

| Representation | Topology Support | Compression | Material Attributes | Open-Vocabulary Semantics |
|---|---|---|---|---|
| O-Voxel (sparse; Xiang et al., 16 Dec 2025) | Arbitrary; open/non-manifold | Sparse, $\ll N^3$ | Full PBR | Not inherent (can add) |
| OpenVox (Deng et al., 23 Feb 2025) | Volumetric; instance-level | Sparse | Occupancy only | Yes (Dirichlet + embedding) |
| VoxRep (Dao et al., 27 Mar 2025) | Volumetric | Dense (tiled) | RGB/color | Yes (language head) |
| O2V-Mapping (Tie et al., 10 Apr 2024) | Volumetric + field | Adaptive sparse | Color (implicit) | Yes (CLIP queue) |
| Mesh | Irregular | None | Texture/PBR | No |
| NeRF | Continuous, implicit | Latent | View-dependent | No (traditionally) |

O-Voxel representations bridge the gap between structured geometry, open-category semantics, and differentiable neural compression. Dense approaches remain memory-bound at high resolutions; field-based methods struggle with open and non-manifold surfaces; mesh- and point-based methods lack latent regularity and semantic compositionality. O-Voxel’s field-free nature, sparsity, and extensibility to arbitrary semantics position it at the forefront of 3D scene understanding and generation research.
