Cuboid-Net: Cuboid Neural Architectures

Updated 3 December 2025
  • Cuboid-Net is a neural architecture that uses cuboid representations to model spatio-temporal dynamics for video super-resolution and 3D shape abstraction.
  • Its multi-branch design decomposes input data into orthogonal slices, employing convolutional, residual, and attention modules to extract and fuse features.
  • Experimental results demonstrate significant improvements in PSNR, SSIM, and geometric fitting metrics across video datasets and 3D benchmarks.

Cuboid-Net refers to distinct neural architectures that leverage the parametric cuboid structure for either video super-resolution or 3D shape abstraction and fitting. In each case, the central methodology is to encode, decompose, and reconstruct data by exploiting cuboidal or related volumetric primitives, either in spatial-temporal or geometric domains.

1. Cuboid Representation and Slicing in Video Super-Resolution

Cuboid-Net for space-time video super-resolution (Fu et al., 2024) models an input low-resolution, low-frame-rate video sequence as a cuboid tensor $V^{in} \in \mathbb{R}^{N \times H \times W}$, where $N$ is the number of frames, $H$ the height, and $W$ the width. The intensity at spatio-temporal index $(t, h, w)$ is $V^{in}_{t, h, w}$. This cuboidal formulation exposes both spatial and temporal data correlations.

A core innovation is slicing this cuboid along three orthogonal axes:

  • Temporal slices: $S^1_t = V^{in}[t,:,:]$ (an $H \times W$ image for each $t$)
  • Horizontal space-time slices: $S^2_w = V^{in}[:,:,w]$ (an $N \times H$ matrix for each $w$)
  • Vertical space-time slices: $S^3_h = V^{in}[:,h,:]$ (an $N \times W$ matrix for each $h$)

Each of these slice sets is fed into a dedicated branch of the network, enabling the extraction and integration of spatial and temporal features.
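The three slicing directions can be illustrated with a small NumPy sketch. The toy dimensions and variable names are assumptions made for illustration, and the vertical slice is assumed to fix the row index $h$. The final checks show that each slice set covers the whole cuboid.

```python
import numpy as np

# Toy low-resolution clip: N frames of H x W grayscale pixels (hypothetical sizes).
N, H, W = 4, 6, 8
video = np.arange(N * H * W, dtype=np.float32).reshape(N, H, W)

# Temporal slices: one H x W image per frame index t.
temporal = [video[t, :, :] for t in range(N)]
# Horizontal space-time slices: one N x H matrix per column index w.
horizontal = [video[:, :, w] for w in range(W)]
# Vertical space-time slices: one N x W matrix per row index h (assumed indexing).
vertical = [video[:, h, :] for h in range(H)]

# Each slice set covers the whole cuboid, so stacking it along the sliced
# axis recovers the original (N, H, W) tensor without loss.
assert np.array_equal(np.stack(temporal, axis=0), video)
assert np.array_equal(np.stack(horizontal, axis=2), video)
assert np.array_equal(np.stack(vertical, axis=1), video)
```

Because every branch sees the full cuboid in a different orientation, a fusion step has to bring the branch outputs back to a common axis order before combining them.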

2. Multi-Branch Network Architectures

Multi-Branch Hybrid Feature Extraction (MBFE)

Within each branch, a sequence of multi-feature blocks (MFBs) operates on the slices. Processing steps:

  • Bicubic upsampling by the spatial scale factor, for spatial super-resolution
  • Shallow feature extraction: two consecutive conv + ReLU operations
  • Deep feature extraction: a cascade of residual-dense blocks (ResDB), each combining densely connected conv + ReLU layers with a channel-reducing conv
  • Prediction and fusion of feature outputs through additional conv layers, finalized with a residual connection to the bicubically upsampled input.
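A residual-dense block can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' layer configuration: every convolution is replaced by a 1×1 convolution (a per-pixel linear map) for brevity, the residual skip is inferred from the block's name, and the channel counts, growth rate, and function names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """Per-pixel linear map. x: (C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def res_dense_block(x, dense_weights, fuse_weight):
    """Dense conv + ReLU layers, channel reduction, then a residual skip."""
    feats = [x]
    for w in dense_weights:
        # Each layer sees the block input and every earlier layer output.
        stacked = np.concatenate(feats, axis=0)
        feats.append(np.maximum(conv1x1(stacked, w), 0.0))
    # Reduce the concatenated channels back to the input width.
    fused = conv1x1(np.concatenate(feats, axis=0), fuse_weight)
    return x + fused

rng = np.random.default_rng(0)
channels, growth, layers = 4, 2, 3  # hypothetical sizes
x = rng.standard_normal((channels, 5, 5))
dense_weights = [rng.standard_normal((growth, channels + i * growth)) * 0.1
                 for i in range(layers)]
fuse_weight = rng.standard_normal((channels, channels + layers * growth)) * 0.1
y = res_dense_block(x, dense_weights, fuse_weight)
```

The input width of each dense layer grows by `growth` channels per layer, which is why the final channel-reducing step is needed before the skip connection can be added.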

Multi-Branch Reconstruction (MBR)

Outputs from all three branches are reconstructed using identical convolutional blocks:

  • A stack of conv + ReLU layers yields a feature tensor
  • A transposed convolution performs time-space upsampling, followed by Leaky ReLU
  • A final convolution produces the output of each branch; the outputs from all branches are concatenated and fused by a further convolution, yielding the first-stage space-time super-resolved video.

Quality Enhancement Modules

A two-stage enhancement pipeline further refines the output:

  • First-stage Quality Enhancement (QE): Per-frame residual network (SRCNN-inspired) with skip connection
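  • Second-stage Cross-Frame Quality Enhancement (CFQE): for interpolated frames, a multi-layer convolutional network with interleaved CBAM (Convolutional Block Attention Module) blocks addresses motion artifacts. CBAM applies channel and spatial attention:
    • Channel: $M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))$
    • Spatial: $M_s(F) = \sigma(\mathrm{Conv}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]))$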
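The channel and spatial attention steps can be sketched in NumPy using the standard CBAM formulation. The tensor sizes, reduction ratio, 7×7 kernel, and random weights below are assumptions chosen for illustration; nothing here reflects the CFQE network's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """feat: (C, H, W). w1: (C//r, C) and w2: (C, C//r) form the shared MLP."""
    avg = feat.mean(axis=(1, 2))                      # (C,) global average pool
    mx = feat.max(axis=(1, 2))                        # (C,) global max pool
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)      # shared two-layer MLP
    scale = sigmoid(mlp(avg) + mlp(mx))               # (C,) weights in (0, 1)
    return feat * scale[:, None, None]

def spatial_attention(feat, kernel):
    """feat: (C, H, W). kernel: (2, k, k) with odd k, applied to [avg; max] maps."""
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    _, height, width = feat.shape
    logits = np.empty((height, width))
    for i in range(height):
        for j in range(width):
            logits[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return feat * sigmoid(logits)[None, :, :]

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 5, 5))     # hypothetical feature map
w1 = rng.standard_normal((2, 8))          # reduction ratio r = 4 (assumed)
w2 = rng.standard_normal((8, 2))
kernel = rng.standard_normal((2, 7, 7))   # 7x7 kernel (assumed)
out = spatial_attention(channel_attention(feat, w1, w2), kernel)
```

Both steps only rescale the features by factors between 0 and 1, so the module reweights existing activations rather than generating new ones.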

3. Cuboid Primitive Fitting to 3D Data

In geometric modeling, Cuboid-Net denotes a paradigm for 3D shape fitting and abstraction using volumetric cuboids (Kluger et al., 2021; Kobsik et al., 3 Feb 2025).

Parametric Cuboid Representation

The canonical parameterization describes each cuboid by three quantities: a center, an axis-aligned scale, and a rotation. The eight corner points are constructed by offsetting the center by the rotated scale vector, with the offset signs running through all sign combinations.
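The corner construction can be sketched as follows. Two conventions are assumed here because the text does not fix them: the scale is treated as a half-extent per axis, and the rotation is a 3×3 matrix mapping the cuboid frame to the world frame. The function name is hypothetical.

```python
import itertools
import numpy as np

def cuboid_corners(center, half_extent, rotation):
    """Return the 8 corners, shape (8, 3), of an oriented cuboid.

    Assumes `half_extent` is half the side length along each local axis and
    `rotation` is a 3x3 matrix from the cuboid frame to the world frame.
    """
    # All 8 sign combinations (+/-1, +/-1, +/-1).
    signs = np.array(list(itertools.product((-1.0, 1.0), repeat=3)))
    local = signs * np.asarray(half_extent, dtype=float)   # corners in cuboid frame
    # Row-vector form of R @ p, followed by translation to the center.
    return local @ np.asarray(rotation, dtype=float).T + np.asarray(center, dtype=float)

corners = cuboid_corners(center=[1.0, 2.0, 3.0],
                         half_extent=[0.5, 1.0, 1.5],
                         rotation=np.eye(3))
```

With the identity rotation the corners span `center - half_extent` to `center + half_extent`, which gives a quick sanity check on the chosen convention.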

Network-Guided Primitive Fitting

  • RGB/depth images are encoded into 3D feature maps using a BTS depth CNN (DenseNet-161 encoder + multi-scale decoder) (Kluger et al., 2021).
  • Sampling-weight networks produce weighted point selections for RANSAC; each iteration fits cuboid hypotheses and scores them with an occlusion-aware inlier count. Ambiguities caused by occlusion are resolved by a custom distance function that incorporates a binary visibility term.

Fine-to-Coarse Cuboid Abstraction in 3D Shape Modeling

Learning-based abstraction (Kobsik et al., 3 Feb 2025) starts from a large set of surface points:

  • Local PointNet encoders aggregate $k$NN-patch features around a set of sampled centers
  • Global shape features and cuboid latents are integrated using Vision Transformers
  • Each latent predicts a cuboid parameter vector, in which the rotation is expressed as a unit quaternion, together with a primitive existence probability
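Since the rotation is predicted as a unit quaternion, a decoder has to convert it to a rotation matrix before placing the cuboid. A minimal sketch of the standard conversion follows; the $(w, x, y, z)$ component ordering and the function name are assumptions, as the text does not specify them.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix.

    The input is normalized first, so an unnormalized prediction is accepted.
    """
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

# A 90-degree rotation about the z-axis maps the x-axis onto the y-axis.
R = quat_to_rotmat([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
```

Normalizing inside the conversion keeps the output a valid rotation even when the network's raw four-vector is not exactly unit length.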

The fine-to-coarse training schedule progressively prunes redundant primitives using an abstraction loss, which applies a binary mask derived from a ranking of the primitives.
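A ranking-based binary mask of this kind can be sketched as follows. This is only an illustration: ranking by existence probability is an assumption, the shrinking budget in `schedule` is hypothetical, and the way the mask enters the abstraction loss is not reproduced here.

```python
def keep_mask(existence_probs, num_keep):
    """Binary mask that keeps the `num_keep` highest-ranked primitives."""
    order = sorted(range(len(existence_probs)),
                   key=lambda i: existence_probs[i], reverse=True)
    kept = set(order[:num_keep])
    return [1 if i in kept else 0 for i in range(len(existence_probs))]

probs = [0.9, 0.2, 0.75, 0.4, 0.05]   # hypothetical existence probabilities
schedule = [5, 3, 2]                  # hypothetical budget, fine to coarse
masks = [keep_mask(probs, k) for k in schedule]
# masks == [[1, 1, 1, 1, 1], [1, 0, 1, 1, 0], [1, 0, 1, 0, 0]]
```

As the budget shrinks, the mask retains fewer primitives, which mirrors the fine-to-coarse direction of the schedule.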

4. Loss Functions and Training Strategies

Video Super-Resolution

Cuboid-Net is trained with a mean-squared error loss over all output frames, covering both the spatially super-resolved frames and the interpolated ones. Alternatively, the loss can be split into separate terms over the spatial super-resolution (SSR) and temporal super-resolution (TSR) targets.
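The two ways of writing the loss can be compared in a short NumPy sketch. The frame layout, the `is_interpolated` flag, and the `tsr_weight` balancing factor are assumptions introduced for illustration; the source does not confirm a weighting term.

```python
import numpy as np

def mse(pred, target):
    """Mean-squared error over every pixel of every frame."""
    return float(np.mean((pred - target) ** 2))

def split_loss(pred, target, is_interpolated, tsr_weight=1.0):
    """Separate MSE terms for input-aligned (SSR) and interpolated (TSR) frames."""
    is_interpolated = np.asarray(is_interpolated, dtype=bool)
    ssr = mse(pred[~is_interpolated], target[~is_interpolated])
    tsr = mse(pred[is_interpolated], target[is_interpolated])
    return ssr + tsr_weight * tsr, ssr, tsr

# Four output frames of 2x2 pixels; frames 1 and 3 are interpolated.
pred = np.zeros((4, 2, 2))
target = np.zeros((4, 2, 2))
target[[0, 2]] = 1.0   # error of 1 on the SSR frames
target[[1, 3]] = 2.0   # error of 2 on the TSR frames
flags = [False, True, False, True]

overall = mse(pred, target)
total, ssr, tsr = split_loss(pred, target, flags)
```

When both groups contain the same number of frames, the overall MSE is the average of the two terms, so the split form mainly adds the option of weighting them differently.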

Training details:

  • Adam optimizer, with the initial learning rate halved at a fixed epoch interval
  • Fixed batch size and crop size (time × H × W)

3D Shape Abstraction

The objective combines reconstruction losses (surface and volume) with the abstraction loss. The surface and volume terms are explicit bidirectional Chamfer-style losses (Kobsik et al., 3 Feb 2025).
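A bidirectional Chamfer distance between two point sets can be sketched in NumPy. This version averages squared nearest-neighbor distances in both directions, which is one common convention; the exact form and weighting used by Kobsik et al. may differ.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (n, 3) and b (m, 3)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise squared distances, shape (n, m).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Nearest neighbor in b for each point of a, and vice versa.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

surface = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
shifted = surface + np.array([0.0, 0.5, 0.0])
cd = chamfer_distance(surface, shifted)
```

The two directions penalize different failures: one term grows when the prediction misses part of the target, the other when the prediction places geometry where the target has none.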

Optimization:

  • AdamW with a cosine-annealing learning-rate schedule
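The standard cosine-annealing rule (without restarts) can be sketched as below. The base rate, minimum rate, and epoch count are placeholder values rather than the paper's settings.

```python
import math

def cosine_annealed_lr(base_lr, epoch, total_epochs, min_lr=0.0):
    """Cosine decay from `base_lr` at epoch 0 to `min_lr` at `total_epochs`."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cos)

# Placeholder values: decay from 1e-3 to 0 over 100 epochs.
start = cosine_annealed_lr(1e-3, 0, 100)
middle = cosine_annealed_lr(1e-3, 50, 100)
end = cosine_annealed_lr(1e-3, 100, 100)
```

The rate falls slowly at first, fastest around the midpoint, and slowly again near the end, reaching half of the base rate exactly halfway through.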

5. Experimental Results

Space-Time Video Super-Resolution

Cuboid-Net is reported to exceed all tested baselines on the standard datasets:

  • Vimeo-90K (test): ST-SR PSNR (dB) and SSIM, with TMNet as the reference baseline
  • Vid4: ST-SR PSNR (dB) and SSIM
  • SSR-only: PSNR (dB) compared against BasicVSR, despite the lower frame-rate input
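For reference, PSNR follows directly from the mean-squared error between a reconstructed frame and its ground truth. The sketch below assumes pixel values scaled to $[0, 1]$, hence `max_value=1.0`; evaluation protocols differ on details such as color channels and border cropping, which are not modeled here.

```python
import numpy as np

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)   # uniform error of 0.1 gives MSE = 0.01
value = psnr(pred, target)
```

Because the scale is logarithmic, halving the MSE raises PSNR by about 3 dB regardless of the starting quality.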

Ablations indicate that adding ResDB blocks improves PSNR, with diminishing returns beyond a certain depth; that adding convolutional layers in the RB yields a further PSNR gain; and that QE and CFQE each improve frame quality incrementally.

Measured on a 2080 Ti GPU, the model is reported to be smaller (in parameter count) and faster (in per-clip runtime) than most two-stage pipelines.

3D Cuboid Abstraction and Fitting

Robustness on NYU Depth v2 is demonstrated:

  • RGB input: AUC@10 cm (%) and mean occlusion-aware distance (cm), both compared against prior work
  • Depth input: AUC@10 cm (%)
  • Cuboid-Net abstracts complex indoor scenes into interpretable cuboids, outperforming prior superquadric-based approaches (Kluger et al., 2021)

ShapeNet and DFAUST results (Kobsik et al., 3 Feb 2025) show improved compactness and fidelity across categories (planes, chairs, tables, humans):

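| Class | Method | Num ↓ | CD ↓ | IoU (%) ↑ |
|-------|--------|-------|-------|-----------|
| plane | Ours   | 6.03  | 0.026 | 56.0      |
| chair | Ours   | 8.37  | 0.036 | 54.9      |
| table | Ours   | 5.67  | 0.035 | 45.1      |
| human | Ours   | 6.02  | 0.032 | 58.8      |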

6. Downstream Applications and Extensions

Cuboid-based abstractions produced by Cuboid-Net can be directly repurposed for:

  • Shape co-segmentation (semantic part labeling via primitive assignment)
  • Shape clustering and structural retrieval using cuboid-parameter vectors
  • Partial symmetry detection through pairwise ICP alignment of primitive-enclosed point sets

A plausible implication is that compact, accurate cuboid abstractions may facilitate interpretable analysis, efficient data compression, and improved downstream geometric reasoning (Kobsik et al., 3 Feb 2025).

7. Significance, Limitations, and Interpretation

Cuboid-Net’s explicit cuboidal decomposition, whether for video or geometric modeling, enables single-network, end-to-end architectures for joint spatial-temporal reasoning or geometric abstraction. By fusing multi-directional information at both the feature-extraction and reconstruction stages, the video model is reported to outperform prior methods in accuracy and efficiency, while the 3D models are reported to improve compactness and fidelity.

A notable limitation is that the cuboid representation, while interpretable and structurally compact, may struggle with non-cuboidal or highly irregular objects which require alternative or hybrid primitive sets. In video, motion artifacts remain a challenge for interpolated frames, partially mitigated by attention-driven enhancement.

Collectively, Cuboid-Net advances the state of the art in both video super-resolution and 3D shape abstraction, serving as an architectural reference point for interpretable, multi-branch, cuboid-centric neural modeling (Fu et al., 2024, Kluger et al., 2021, Kobsik et al., 3 Feb 2025).
