Content-Promoted Scene Layers (CPSL)

Updated 22 November 2025
  • Content-Promoted Scene Layers (CPSL) is a compact 2.5D video representation that partitions each frame into depth-ordered RGBA layers to achieve immersive free-viewpoint exploration.
  • It leverages per-frame depth data, learned saliency, and an edge–depth cache to maintain occlusion order and boundary continuity, ensuring smooth view synthesis.
  • CPSL demonstrates high efficiency with real-time playback at low bitrates, outperforming MPI and depth-based rendering baselines in PSNR, SSIM, and LPIPS on benchmark datasets.

Content-Promoted Scene Layers (CPSL) constitute a compact 2.5D video representation aimed at delivering the perceptual benefits of volumetric video—including free viewpoint exploration and realistic motion parallax—using lightweight, 2D-encodable assets. The approach decomposes each video frame into a small set of geometry-consistent, content-aware RGBA layers, augmented with soft alpha bands and a sparse edge–depth cache to jointly preserve occlusion ordering and boundary continuity. Guided by per-frame depth and learned saliency, CPSL supports interactive, parallax-corrected novel-view synthesis without the computational overhead of conventional 3D reconstruction or per-scene neural optimization. Temporal coherence is maintained through motion-guided propagation, and all asset streams are compatible with standard video codecs, enabling real-time playback and scalable streaming (Hu et al., 18 Nov 2025).

1. Motivation and Context

Volumetric video encodes real-world scenes so that free-viewpoint navigation and authentic parallax can be achieved during playback. Traditional explicit geometric representations such as point clouds and meshes directly store 3D coordinates and color attributes but demand dense multi-view capture, large storage bandwidth, and specialized decoders. Implicit neural-field approaches (e.g., NeRF and its dynamic variants) optimize continuous 4D radiance fields per scene, yielding high visual fidelity but imposing prohibitive costs for training, inference, and memory. Layered 2.5D methods such as multiplane images (MPI) and layered depth images (LDI) represent content using stacks of fronto-parallel RGBA planes. These are lightweight and stream-friendly, yet suffer from cracks, double edges, or blurred boundaries, especially under large viewpoint shifts or dynamic content, limitations that are especially pronounced in monocular video setups.

These trade-offs highlight the need for an efficient, real-time, and codec-compatible framework that attains volumetric cues (depth, occlusion, parallax) without the cost and complexity of full 3D modeling. CPSL addresses this by partitioning each video frame into a compact set of depth-ordered, semantically aligned layers that directly encode visually salient content (Hu et al., 18 Nov 2025).

2. Layered Scene Decomposition

Given an input frame of dimensions $H \times W$, CPSL partitions the image into $K$ depth-ordered layers:

$$\mathcal{L} = \{ L_{k} \}_{k=1}^{K}, \qquad L_{k} = (C_{k}, \alpha_{k}, D_{k}, \mathrm{EDC}_{k}),$$

where $C_{k}(x, y)$ holds the premultiplied RGB color, $\alpha_{k}(x, y)$ is a soft alpha mask, $D_{k}(x, y)$ is the per-pixel reference depth, and $\mathrm{EDC}_{k}$ is a sparse Edge–Depth Cache. Partitioning is determined by minimizing an energy function that integrates depth-fidelity, semantic-consistency, instance-affinity, and smoothness terms:

$$\mathcal{E}(S) = \sum_{x} \Bigl[ w_d(x)\,\rho\bigl(z(x) - \bar{z}_{S(x)}\bigr) + w_s(x)\,\phi\bigl(c(x), S(x)\bigr) + w_i(x)\,\psi\bigl(\iota(x), S(x)\bigr) \Bigr] + \lambda_b \sum_{(x, y) \in \mathcal{N}} \omega_{xy}\, \mathbf{1}\bigl[S(x) \neq S(y)\bigr].$$

After solving for the labeling S(x)S(x), salient object instances are promoted to dedicated layers, and background regions are aggregated based on depth variation and texture clustering.
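
As a rough illustration of the labeling objective, the sketch below evaluates a simplified version of $\mathcal{E}(S)$ for a candidate labeling, keeping only the depth-fidelity unary and a 4-neighbor smoothness term; the absolute-difference penalty, uniform edge weights, and omission of the semantic and instance terms are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def partition_energy(labels, depth, layer_mean_depth, w_d, lambda_b=1.0):
    """Toy evaluation of a CPSL-style labeling energy.

    labels           : (H, W) int array, layer index S(x) per pixel
    depth            : (H, W) float array, per-pixel depth z(x)
    layer_mean_depth : (K,) float array, mean depth of each layer
    w_d              : (H, W) depth-fidelity weights
    Semantic and instance terms are omitted; an absolute-difference
    penalty stands in for the robust penalty rho.
    """
    # Unary term: deviation of each pixel's depth from its assigned layer's mean depth
    unary = w_d * np.abs(depth - layer_mean_depth[labels])
    # Pairwise term: count label discontinuities over 4-neighborhoods
    horiz = (labels[:, 1:] != labels[:, :-1]).sum()
    vert = (labels[1:, :] != labels[:-1, :]).sum()
    return unary.sum() + lambda_b * (horiz + vert)
```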

Soft alpha bands are derived from hard masks by exponential feathering based on signed contour distance and local depth gradient:

$$\alpha_{k}(x) = \exp\!\left(- \left[ \frac{d(x, \partial L_{k})}{w(x)} \right]^{2} \right),$$

with $w(x)$ adapting to local geometric and uncertainty cues. At boundaries, the Edge–Depth Cache sparsely stores quantized depth offsets $D_{k+1}(x) - D_k(x)$ for robust occlusion handling during novel-view synthesis.
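
One simple reading of this feathering rule, sketched below, keeps pixels inside the hard mask fully opaque and applies a Gaussian falloff outside it, widening the band where the local depth gradient is large; the width parameters and the use of SciPy's distance transform are assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def soft_alpha(mask, base_width=3.0, depth_grad=None, grad_scale=2.0):
    """Feather a hard layer mask into a soft alpha band.

    mask       : (H, W) bool, hard layer membership
    base_width : nominal feather width w(x) in pixels (illustrative default)
    depth_grad : optional (H, W) depth-gradient magnitude used to widen the
                 band where geometry is uncertain (illustrative coupling)
    """
    # Distance from each outside pixel to the layer contour
    dist_out = distance_transform_edt(~mask)
    w = base_width if depth_grad is None else base_width + grad_scale * depth_grad
    # Opaque inside the mask, Gaussian falloff with contour distance outside
    return np.where(mask, 1.0, np.exp(-(dist_out / w) ** 2))
```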

3. Saliency-Guided Layer Formation

Saliency in CPSL integrates two principal components: content saliency, derived from learned salient-object detectors and normalized via a sigmoid to obtain $w_s(x)$, and depth saliency $w_d(x)$, based on the local inverse-depth variance $\varsigma_z(x)$. These are combined with user-controlled weights $\lambda_c$ and $\lambda_d$ to yield a composite guidance signal:

$$S(x) = \lambda_c\, w_s(x) + \lambda_d\, w_d(x).$$

This composite saliency influences both the unary terms of layer partitioning and downstream encoding strategies, concentrating resources on perceptually and geometrically critical regions.
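
A minimal sketch of this guidance signal, assuming the depth-saliency term reduces to the normalized local variance of inverse depth over a small window (the 3×3 window and the default weights are illustrative, not taken from the paper):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guidance_signal(content_sal_logits, depth, lambda_c=0.6, lambda_d=0.4, eps=1e-6):
    """Composite guidance map combining content and depth saliency.

    content_sal_logits : (H, W) raw scores from a salient-object detector
    depth              : (H, W) per-pixel depth map
    """
    w_s = 1.0 / (1.0 + np.exp(-content_sal_logits))       # sigmoid normalization
    # Depth saliency: local variance of inverse depth, normalized to [0, 1]
    inv_depth = 1.0 / (depth + eps)
    local_mean = uniform_filter(inv_depth, size=3)
    local_var = uniform_filter(inv_depth ** 2, size=3) - local_mean ** 2
    w_d = local_var / (local_var.max() + eps)
    return lambda_c * w_s + lambda_d * w_d
```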

4. View Synthesis and Occlusion Management

Novel-view synthesis proceeds by depth-weighted warping and front-to-back compositing. Each layer is re-projected from source to target camera pose as follows:

$$X_s = D_i(p)\,K_s^{-1}\,p, \qquad X_t = R_{t \leftarrow s}\,X_s + t_{t \leftarrow s}, \qquad p'_i = K_t\,X_t.$$

Colors and opacities are resampled using bilinear interpolation. The rendered pixel intensity at each location is given by:

$$I_t(x, y) = \sum_{i=1}^{K} \left[ \alpha_i^{w}(x, y)\,C_i^{w}(x, y) \prod_{j < i}\bigl(1-\alpha_j^{w}(x, y)\bigr) \right].$$

At occlusion boundaries where cracks or double edges may develop due to imperfect warping, the Edge–Depth Cache enables the construction of a dynamic pixel strip (DPS): a narrow transition band in which depth and color are adaptively blended between foreground and background for seamless occlusion continuity. The DPS blend factor $\gamma(x)$ is determined adaptively from geometric cues.
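
The sketch below illustrates the warp-and-composite pipeline of this section under simplifying assumptions: nearest-neighbor forward splatting stands in for the paper's bilinear resampling, and the Edge–Depth Cache / dynamic pixel strip repair is omitted, so cracks are left unfilled.

```python
import numpy as np

def warp_layer(color, alpha, depth, K_s, K_t, R_ts, t_ts):
    """Forward-warp one RGBA layer from the source to the target view.

    color (H, W, 3) and alpha (H, W) are float arrays; depth (H, W) is the
    layer's reference depth. K_s, K_t are 3x3 intrinsics; R_ts, t_ts give
    the target<-source pose. Nearest-neighbor splatting only (illustrative).
    """
    H, W = alpha.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], 0).reshape(3, -1).astype(np.float64)
    # Back-project source pixels with the layer depth, transform, re-project
    X_s = (np.linalg.inv(K_s) @ pix) * depth.reshape(1, -1)
    X_t = R_ts @ X_s + t_ts.reshape(3, 1)
    p_t = K_t @ X_t
    u = np.round(p_t[0] / p_t[2]).astype(int)
    v = np.round(p_t[1] / p_t[2]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (p_t[2] > 0)
    c_w = np.zeros_like(color)
    a_w = np.zeros_like(alpha)
    c_w[v[valid], u[valid]] = color.reshape(-1, 3)[valid]
    a_w[v[valid], u[valid]] = alpha.reshape(-1)[valid]
    return c_w, a_w

def composite(warped):
    """Front-to-back compositing of warped (color, alpha) layers,
    ordered front-most first, following the rendering equation above."""
    out = np.zeros_like(warped[0][0])
    transmittance = np.ones(warped[0][1].shape)   # product of (1 - alpha) so far
    for c_w, a_w in warped:
        out += transmittance[..., None] * a_w[..., None] * c_w
        transmittance *= (1.0 - a_w)
    return out
```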

5. Temporal Coherence and Codec Compatibility

CPSL structures video streams into Groups of Pictures (GOPs) with full decompositions in I-frames and propagated layers in P-frames. Motion-guided propagation employs optical flow (e.g., RAFT) to warp masks, colors, and alpha mattes forward in time, while a temporal confidence metric—based on Intersection over Union (IoU) and mask stability—governs refresh triggers for new I-frames.
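
A toy version of the refresh decision, assuming the temporal confidence metric reduces to the IoU between the flow-propagated layer mask and a re-estimated reference mask (the threshold and the single-mask simplification are illustrative, not the paper's exact criterion):

```python
import numpy as np

def needs_refresh(propagated_mask, reference_mask, iou_thresh=0.85):
    """Trigger a new I-frame decomposition when the propagated layer mask
    has drifted too far from the re-estimated mask. Masks are (H, W) bool."""
    inter = np.logical_and(propagated_mask, reference_mask).sum()
    union = np.logical_or(propagated_mask, reference_mask).sum()
    iou = inter / max(union, 1)
    return iou < iou_thresh
```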

For storage and streaming, each layer is encoded as an independent RGBA sequence using standard codecs such as H.265 or AV1. Given a total bitrate $R$, per-layer bit allocation is optimized for perceptual quality by minimizing a saliency-weighted sum of per-layer LPIPS distortions, subject to the total bitrate constraint:

$$\min_{\{ r_k \}} \sum_{k=1}^{K} w_k\,\mathrm{LPIPS}_k(r_k) \quad \text{s.t.} \quad \sum_{k=1}^{K} r_k \leq R,$$

where each $w_k$ increases with the cumulative saliency and boundary sharpness of its layer.
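
The constrained allocation above can be approximated greedily; the sketch below hands out small bitrate increments to whichever layer currently offers the largest weighted LPIPS reduction. The rate-distortion curves, step size, and greedy strategy are illustrative stand-ins, not the paper's optimizer.

```python
def allocate_bitrate(lpips_curves, weights, total_rate, step=0.05):
    """Greedy saliency-weighted bit allocation across layers.

    lpips_curves : list of callables r -> estimated LPIPS of layer k at rate r (Mbps)
    weights      : per-layer weights w_k
    total_rate   : total budget R in Mbps
    """
    K = len(weights)
    rates = [0.0] * K
    budget = total_rate
    while budget >= step:
        # Weighted distortion decrease for giving each layer one more step of rate
        gains = [weights[k] * (lpips_curves[k](rates[k]) - lpips_curves[k](rates[k] + step))
                 for k in range(K)]
        k_best = max(range(K), key=lambda k: gains[k])
        rates[k_best] += step
        budget -= step
    return rates
```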

6. Efficiency and Implementation

CPSL achieves real-time streaming of full-scene videos at 2.3 Mbps, compared to 18 Mbps for depth-based point clouds and 9.6 Mbps for adaptive MPI, a roughly 7× and 4× reduction in bitrate, respectively. Rendering cost is linear in the number of layers and pixels, $O(KHW)$, and practical implementations exceed 60 FPS on consumer GPUs. The RGBA and side-channel streams are fully compatible with existing 2D video infrastructure: no specialized 3D decoders are required, and supplemental metadata such as the EDC is negligible in size.

7. Benchmarks and Comparative Performance

On the DyCheck dynamic-scene benchmark with novel-view offsets from $0^{\circ}$ to $30^{\circ}$, CPSL yields:

  • PSNR: 29.6 dB (vs. 26.5 MPI, 24.7 DIBR)
  • SSIM: 0.94 (vs. 0.87 MPI, 0.82 DIBR)
  • LPIPS: 0.11 (vs. 0.18 MPI, 0.24 DIBR)
  • Boundary Crack: 0.05 (vs. 0.11 MPI, 0.14 DIBR)

Ablation studies show superior temporal stability and boundary coherence: a frame-wise depth baseline exhibits a boundary variance of 1.00, an F-SSIM of 0.910, and a flicker score of 0.128, whereas CPSL with full refinement achieves 0.47, 0.948, and 0.059, respectively. On the full-scene volumetric video dataset, CPSL attains a PSNR of 31.2 dB, SSIM of 0.956, and LPIPS of 0.083 at 2.3 Mbps, consistently outperforming depth-based and MPI baselines in both quality and efficiency.

The evidence supports that CPSL provides a practical, scalable path from 2D video to immersive 2.5D media, rivaling neural-field methods in perceptual fidelity while maintaining low computational and bandwidth requirements (Hu et al., 18 Nov 2025).
