
Graph-Based Scene Representations

Updated 25 August 2025
  • Graph-based scene representations are a method of modeling scenes as graphs where nodes represent entities and edges capture spatial, semantic, or functional relationships.
  • They enable efficient multiview image coding by using sparse color and geometry matrices to capture photometric details and integer disparity links for accurate view synthesis.
  • Their design offers controlled geometry loss and improved reconstruction quality, achieving up to a 2 dB PSNR gain over traditional depth-based methods.

A graph-based scene representation encodes the semantic, geometric, and relational structure of a scene as a graph whose nodes correspond to entities (such as pixels, objects, or image regions) and whose edges capture context-sensitive relationships (such as spatial proximity, disparity, or temporal correspondence). This formalism permits both efficient storage and precise manipulation of complex scene information, supporting a range of tasks from multi-view image coding to high-level semantic reasoning. The following sections provide a comprehensive account of the principles, methodologies, advantages, and implications of graph-based scene representations, with particular emphasis on their technical construction and performance characteristics.

1. Fundamentals of Graph-Based Scene Representations

At their core, graph-based scene representations model the composition and structure of a scene using a graph G = (V, E), where each node v \in V encodes a scene entity (which could be as fine as a pixel or as coarse as an object), and each edge e \in E expresses a specific geometric, semantic, or functional relation. In multiview image coding, for example, a row-wise graph is constructed where:

  • Each node stores a color (luminance) value, formalized in a color matrix \Gamma_r = [\gamma^r_{i,j}], with i as the view (or level) index and j as the column (pixel position) index.
  • The node connections are determined by a geometry matrix \Lambda_r = [\lambda^r_{i,j}], where each entry either indicates the integer-valued horizontal disparity to a child or encodes jump/discontinuity information when visibility changes occur (e.g., at occlusion/disocclusion boundaries).

This two-matrix formalism simultaneously encodes photometric data and geometric scene structure, with edge semantics adapting dynamically to scene complexity and viewpoint variation.
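As a concrete illustration, the sketch below shows one way the per-row color and geometry matrices could be laid out in memory. The class and field names are illustrative rather than taken from the paper, and NaN is used here as a placeholder for inherited pixels that are not re-stored.

```python
import numpy as np

class RowGraph:
    """Two-matrix graph for a single image row r (illustrative layout)."""

    def __init__(self, n_views: int, width: int):
        # Color matrix Γ_r: gamma[i, j] holds the luminance of the node at
        # level i (view index) and column j; NaN marks inherited pixels
        # that are not re-stored.
        self.gamma = np.full((n_views, width), np.nan)
        # Geometry matrix Λ_r: lam[i, j] holds the integer disparity link
        # (the full scheme also uses jump/discontinuity codes).
        self.lam = np.zeros((n_views, width), dtype=int)

    def set_reference_row(self, row_pixels: np.ndarray) -> None:
        # The first level stores the full reference view row, with
        # λ^r_{1,j} = 0 as described in Section 2.
        self.gamma[0, :] = row_pixels
        self.lam[0, :] = 0
```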

2. Construction Methodology and Algorithmic Workflow

The process of constructing a graph-based scene representation for multiview images operates as follows:

  1. Reference View Initialization:
    • All pixel color values I_1(r, j) from the reference view are stored in \gamma^r_{1,j}, forming the first row of the color matrix, and \lambda^r_{1,j} = 0.
  2. Depth Map to Disparity Conversion:
    • For subsequent views, geometric correspondences are derived from a rectified depth map, with disparities calculated using camera intrinsics and extrinsics:

    D(r, c) = \lceil \frac{f \delta}{Z(r, c)} + 0.5 \rceil

    where f is the focal length, \delta the camera baseline, and Z(r, c) the depth at pixel (r, c). Rounding to integer disparities facilitates compact link encoding at the minor cost of quantization error (see the code sketch after this workflow).

  3. Graph Expansion (Row-by-Row and View-by-View):

    • For each row and new view, the algorithm identifies pixel categories—appearing, disoccluded, occluded, and disappearing—and records only the new (“non-inherited”) pixel values in higher matrix levels. The geometry matrix is sparsely populated: at each discontinuity or boundary, an explicit link (disparity “jump”) is stored, capturing both smooth correspondences and complex occlusion/disocclusion transitions.
  4. Chained Graph Traversal for Reconstruction:
    • At the decoder, each target view I_i is reconstructed by recursively tracing links down the matrix levels, using the integer disparities in \lambda^r_{i,j} to identify the parent pixel in the previous view, and copying or correcting values as required (a simplified traversal is sketched below). Residual images may be generated and transmitted at boundaries to compensate for errors stemming from disparity rounding or occlusion complexity.

The outcome is a highly sparse, multi-level graph that encodes only the data needed for lossless (or near-lossless) view synthesis, with explicit handling of complex 3D scene phenomena.
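The two most mechanical steps above, depth-to-disparity conversion and chained link traversal, can be sketched in code. The following is a simplified illustration rather than the paper's implementation: every entry of the geometry matrix is treated as a plain disparity link, and the jump/discontinuity codes and full occlusion bookkeeping are omitted.

```python
import numpy as np

def depth_to_disparity(Z: np.ndarray, f: float, delta: float) -> np.ndarray:
    """Integer disparity D = ceil(f*delta / Z + 0.5), per the formula above."""
    return np.ceil(f * delta / Z + 0.5).astype(int)

def reconstruct_row(gamma: np.ndarray, lam: np.ndarray, level: int,
                    residual=None) -> np.ndarray:
    """Rebuild one row of the view at `level` by chaining links toward the
    reference view (level 0). Inherited pixels are marked NaN in `gamma`."""
    width = gamma.shape[1]
    out = np.empty(width)
    for j in range(width):
        i, col = level, j
        # Trace parent links until a stored (non-inherited) color is found.
        while i > 0 and np.isnan(gamma[i, col]):
            col += lam[i, col]      # integer disparity to the parent pixel
            i -= 1
        out[j] = gamma[i, col]
    if residual is not None:
        out += residual             # boundary correction for rounding errors
    return out
```

Because only non-inherited colors and non-zero links are stored, the traversal touches exactly the sparse data the encoder transmitted.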

3. Compactness, Rate-Distortion Performance, and Adaptability

Graph-based scene representations achieve marked compression efficiency by coding only new, non-inherited pixels and encoding geometric links with succinct integer values. For example, in a typical “squares 1” multi-view dataset, the non-zero entries of \Lambda require merely 2.2 kb for lossless geometry description, compared to 4.7 kb for a full depth image.

  • Rate-Distortion Trade-Off:
    • Adjusting the number of levels (L) in the graph (i.e., choosing which and how many views to send explicitly) directly controls the trade-off between bitrate and reconstruction accuracy. Omitting levels and interpolating at the receiver saves bandwidth at the expense of larger residual corrections.
    • Experimental results on several datasets (“squares,” “venus,” “sawtooth”) show that—at fixed total bitrates—graph-based representations can achieve a 2 dB gain in output image PSNR compared to traditional depth-based coding using JPEG2000-compressed depth maps.
  • Residual Coding:
    • The introduction of residual frames ensures error control at geometric discontinuities or where disparity quantization introduces visible artifacts. The bit cost of these corrections is generally low, thanks to the underlying precision of the graph links.

Adaptability is a core design feature: the method can be tuned to selectively increase or reduce the detail and rate by controlling the number of transmitted graph levels and the granularity of disparity quantization.
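For context on how these gains are measured: PSNR is the standard metric in such comparisons, and a 2 dB gain at a fixed bitrate corresponds to roughly 37% lower mean squared error. A standard computation, assuming 8-bit images, is:

```python
import numpy as np

def psnr(reference: np.ndarray, reconstructed: np.ndarray,
         peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB (peak = 255 for 8-bit images)."""
    mse = np.mean((reference.astype(float) - reconstructed.astype(float)) ** 2)
    if mse == 0:
        return float("inf")   # lossless reconstruction
    return 10.0 * np.log10(peak ** 2 / mse)
```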

4. Comparison with Depth-Based Coding and Evaluation

Depth-based scene representations encode geometry using compressed depth maps. While this approach is flexible, lossy depth compression introduces errors that propagate non-locally, particularly at occlusion/disocclusion boundaries, producing artifacts in synthesized views that are costly to repair.

Graph-based coding contrasts with this in several respects:

  • Controlled Geometry Loss: Only integer disparities are transmitted. Any approximation error is explicitly tracked and compensated by residual images rather than propagating globally.
  • Sparse, Scene-Driven Encoding: No redundant pixel encoding—only pixels visible in a new view but absent from previous views are transmitted.

In systematic experiments, the graph-based approach yields both:

  • Higher reconstruction quality (2 dB or more improvement for the same bitrate across various multiview datasets).
  • Reduced complexity in residual signals (i.e., smaller residual frame bitrates), reflecting tighter control over geometry-induced artifacts.

5. Implementation Considerations and Limitations

Practical implementation of graph-based scene representations as proposed in (Maugey et al., 2013) involves:

  • Memory Layout: The per-row, per-view color and geometry matrices (\Gamma, \Lambda) should be stored in sparse formats for efficient encoding and decoding (see the sketch after this list).
  • Integer Disparity Quantization: The design is best suited to rectified, purely horizontally translated view sequences; more complex camera arrangements may require generalizing the link encoding.
  • Residual Handling: Boundary cases—e.g., around thin structures or sharp depth discontinuities—necessitate careful management of the residual pipeline to meet quality requirements.
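As a minimal sketch of the sparse-storage point, assuming SciPy's CSR format and illustrative link values, a geometry matrix with explicit links only at discontinuities compresses naturally:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Illustrative dense geometry matrix Λ for one row: most entries are zero
# because pixels whose correspondence is unchanged need no explicit link.
lam_dense = np.zeros((4, 8), dtype=int)
lam_dense[1, 3] = 2    # example disparity "jump" at a discontinuity
lam_dense[2, 6] = -1   # example link at a disocclusion boundary

# CSR retains only the non-zero links, matching the sparsity argument above.
lam_sparse = csr_matrix(lam_dense)
print(lam_sparse.nnz, "explicit links stored out of", lam_dense.size, "entries")
```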

Potential limitations include:

  • Disparity rounding, though the resulting error is typically small, can introduce visible artifacts at rare but critical scene locations.
  • The approach assumes the availability of accurate per-pixel depth maps or equivalent geometry prior for graph construction, which may not be available in all acquisition scenarios.

6. Applications and Broader Impact

Graph-based scene representations are effective in multiview image compression and coding for scenarios where geometry-driven redundancy can be exploited, such as:

  • Interactive multi-view video systems.
  • Free-viewpoint video applications.
  • Systems requiring scalable or adaptive geometry transmission (e.g., variable-rate streaming).

Beyond compression, the explicit, scene-graph formalism can facilitate downstream tasks such as view synthesis, object tracking across views, or structure-from-motion pipelines requiring efficient scene abstractions.

In summary, graph-based scene representations formalize the joint encoding of scene geometry and appearance using sparse, hierarchical, and adaptive graphs whose links precisely track inter-view correspondences and discontinuities. This approach achieves notable gains in both compression efficiency and reconstruction accuracy, offers explicit rate-control via graph structure, and exhibits strong adaptability to scene complexity—establishing it as a compelling alternative to traditional depth-based methods, particularly in bandwidth- or quality-critical contexts (Maugey et al., 2013).

References

  1. Maugey et al., 2013.
