
Cuboid-Net: Cuboid Neural Architectures

Updated 3 December 2025
  • Cuboid-Net denotes neural architectures that use cuboid representations to model spatio-temporal dynamics for video super-resolution and to perform 3D shape abstraction.
  • Its multi-branch design decomposes input data into orthogonal slices, employing convolutional, residual, and attention modules to extract and fuse features.
  • Experimental results demonstrate significant improvements in PSNR, SSIM, and geometric fitting metrics across video datasets and 3D benchmarks.

Cuboid-Net refers to distinct neural architectures that leverage the parametric cuboid structure for either video super-resolution or 3D shape abstraction and fitting. In each case, the central methodology is to encode, decompose, and reconstruct data by exploiting cuboidal or related volumetric primitives, in either the spatio-temporal or the geometric domain.

1. Cuboid Representation and Slicing in Video Super-Resolution

Cuboid-Net for space-time video super-resolution (Fu et al., 24 Jul 2024) models an input low-resolution, low-frame-rate video sequence as a cuboid tensor $V^{in} \in \mathbb{R}^{N \times H \times W}$, where $N$ is the number of frames, $H$ the height, and $W$ the width. The intensity at spatio-temporal index $(t, h, w)$ is $V^{in}_{t,h,w}$. This cuboidal formulation exposes both spatial and temporal data correlations.

A core innovation is slicing this cuboid along three orthogonal axes:

  • Temporal slices: $S^1_t = V^{in}[t,:,:]$ (an $H \times W$ image for each $t$)
  • Horizontal spatial–time slices: $S^2_w = V^{in}[:,:,w]$ (an $N \times H$ matrix for each $w$)
  • Vertical spatial–time slices: $S^3_h = V^{in}[:,h,:]$ (an $N \times W$ matrix for each $h$)

Each of these slice sets is fed into a dedicated branch of the network, enabling the extraction and integration of spatial and temporal features.
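
Concretely, all three slice sets fall out of plain tensor indexing. The following minimal sketch (PyTorch; names and sizes are illustrative, not from the paper) demonstrates the decomposition:

```python
import torch

# Illustrative low-resolution, low-frame-rate input: N frames of size H x W
N, H, W = 4, 64, 64
video = torch.rand(N, H, W)  # V^in in R^{N x H x W}

# Branch 1 -- temporal slices S^1_t: one H x W image per time index t
temporal_slices = [video[t, :, :] for t in range(N)]   # each (H, W)

# Branch 2 -- slices S^2_w at fixed width index: one N x H matrix per w
width_slices = [video[:, :, w] for w in range(W)]      # each (N, H)

# Branch 3 -- slices S^3_h at fixed height index: one N x W matrix per h
height_slices = [video[:, h, :] for h in range(H)]     # each (N, W)

assert temporal_slices[0].shape == (H, W)
assert width_slices[0].shape == (N, H)
assert height_slices[0].shape == (N, W)
```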

2. Multi-Branch Network Architectures

Multi-Branch Hybrid Feature Extraction (MBFE)

Within each branch, a sequence of multi-feature blocks (MFBs) operates on the slices. Processing steps:

  • Bicubic upsampling (e.g., scale factor $s=4$ spatially) for spatial super-resolution
  • Shallow feature extraction: two consecutive 2D conv + ReLU operations
  • Deep feature extraction: a cascade of $R$ residual-dense blocks (ResDBs), defined by

$$F^{(l)} = F^{(l-1)} + R\bigl(D(F^{(l-1)})\bigr)$$

with $D(\cdot)$ denoting densely connected 2D conv + ReLU layers and $R(\cdot)$ channel reduction via a $1 \times 1$ conv (see the ResDB sketch after this list)

  • Prediction and fusion of feature outputs through additional conv layers, finalized with a residual connection to the bicubic-upsampled input.
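
The residual-dense block above can be sketched in a few lines of PyTorch. The layer count, growth rate, and channel width below are illustrative assumptions rather than the paper's exact hyperparameters:

```python
import torch
import torch.nn as nn

class ResDB(nn.Module):
    """Residual-dense block: F_out = F_in + R(D(F_in)).
    A minimal sketch; hyperparameters are illustrative assumptions."""
    def __init__(self, channels: int, growth: int = 32, n_layers: int = 4):
        super().__init__()
        self.dense_layers = nn.ModuleList()
        in_ch = channels
        for _ in range(n_layers):
            # D(.): densely connected 2D conv + ReLU; each layer sees all
            # previous feature maps via channel concatenation.
            self.dense_layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            ))
            in_ch += growth
        # R(.): 1x1 conv reduces concatenated channels back to `channels`.
        self.reduce = nn.Conv2d(in_ch, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for layer in self.dense_layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.reduce(torch.cat(feats, dim=1))  # residual connection

block = ResDB(channels=64)
out = block(torch.rand(1, 64, 32, 32))  # output has the same shape as the input
```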

Multi-Branch Reconstruction (MBR)

Outputs from all three branches, $V'_m$, are reconstructed using identical 3D convolutional blocks (a sketch follows the list):

  • A stack of $K$ 3D conv + ReLU layers yields a feature tensor $F_{3D}$
  • A 3D transposed convolution performs space-time upsampling, followed by a Leaky ReLU
  • A final 3D convolution produces $V''_m$; the outputs of all branches are concatenated and fused by a 3D conv, yielding the first-stage space-time super-resolved video $\hat V_{ST}^{(0)} \in \mathbb{R}^{(2N-1) \times (sH) \times (sW)}$.
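
A minimal PyTorch sketch of one such branch is given below; the kernel sizes, strides, and channel widths are assumptions chosen so that an $N$-frame input yields $2N-1$ frames at $4\times$ spatial resolution:

```python
import torch
import torch.nn as nn

class ReconstructionBranch(nn.Module):
    """One MBR branch: K 3D conv + ReLU layers, a transposed 3D conv for
    space-time upsampling with Leaky ReLU, and a final 3D conv. Kernel
    sizes, strides, and channel widths are illustrative assumptions."""
    def __init__(self, channels: int = 32, K: int = 3):
        super().__init__()
        layers = [nn.Conv3d(1, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(K - 1):
            layers += [nn.Conv3d(channels, channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
        self.feat = nn.Sequential(*layers)               # yields F_3D
        # Chosen so that N frames -> 2N-1 frames and H, W -> 4H, 4W:
        # out = (in - 1) * stride - 2 * padding + kernel
        self.up = nn.Sequential(
            nn.ConvTranspose3d(channels, channels, kernel_size=(3, 6, 6),
                               stride=(2, 4, 4), padding=(1, 1, 1)),
            nn.LeakyReLU(0.2, inplace=True),
        )
        self.out = nn.Conv3d(channels, 1, 3, padding=1)  # yields V''_m

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.out(self.up(self.feat(v)))

branch = ReconstructionBranch()
v = torch.rand(1, 1, 4, 32, 32)      # (batch, channel, N, H, W)
print(branch(v).shape)               # torch.Size([1, 1, 7, 128, 128])
```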

Quality Enhancement Modules

A two-stage enhancement pipeline further refines the output:

  • First-stage Quality Enhancement (QE): a per-frame residual network (SRCNN-inspired) with a skip connection
  • Second-stage Cross-Frame Quality Enhancement (CFQE): for interpolated frames, a 7-layer 2D conv net with interleaved CBAM (Convolutional Block Attention Module) attention blocks addresses motion artifacts. CBAM applies channel and spatial attention (a minimal sketch follows):
    • Channel: $M_c(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))$
    • Spatial: $M_s(F) = \sigma(\mathrm{Conv}_{7 \times 7}([\mathrm{AvgPool}_c(F); \mathrm{MaxPool}_c(F)]))$
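
The two attention maps translate directly into PyTorch; the sketch below is a generic CBAM implementation (the reduction ratio of the shared MLP is an assumption):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, matching the two
    formulas above. The reduction ratio is an illustrative assumption."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP for M_c(F) = sigma(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # 7x7 conv for M_s(F) over [AvgPool_c(F); MaxPool_c(F)]
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Channel attention over pooled spatial statistics
        avg = self.mlp(f.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(f.amax(dim=(2, 3), keepdim=True))
        f = f * torch.sigmoid(avg + mx)
        # Spatial attention over channel-wise average/max maps
        s = torch.cat([f.mean(dim=1, keepdim=True),
                       f.amax(dim=1, keepdim=True)], dim=1)
        return f * torch.sigmoid(self.spatial(s))

attn = CBAM(64)
y = attn(torch.rand(2, 64, 32, 32))  # shape is preserved
```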

3. Cuboid Primitive Fitting to 3D Data

In geometric modeling, Cuboid-Net denotes a paradigm for 3D shape fitting and abstraction using volumetric cuboids (Kluger et al., 2021, Kobsik et al., 3 Feb 2025).

Parametric Cuboid Representation

The canonical parameterization is:

$$\theta = (c, s, R)$$

where $c \in \mathbb{R}^3$ (center), $s = (w, h, d) \in \mathbb{R}^3$ (axis-aligned scale), and $R \in SO(3)$ (rotation). The eight corner points are constructed by

$$X_i(\theta) = c + R\,\bigl(\tfrac{1}{2}\,\mathrm{diag}(s)\,\varepsilon_i\bigr)$$

with $\varepsilon_i \in \{-1, +1\}^3$ running through all eight sign combinations.
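
This corner construction is nearly a one-liner in NumPy; a minimal sketch:

```python
import numpy as np
from itertools import product

def cuboid_corners(c: np.ndarray, s: np.ndarray, R: np.ndarray) -> np.ndarray:
    """Return the 8 corners X_i = c + R @ (0.5 * diag(s) @ eps_i),
    with eps_i ranging over {-1, +1}^3."""
    eps = np.array(list(product([-1.0, 1.0], repeat=3)))  # (8, 3) sign patterns
    return c + (eps * (0.5 * s)) @ R.T                    # (8, 3) corner points

# Example: unit cube centered at the origin, no rotation
corners = cuboid_corners(np.zeros(3), np.ones(3), np.eye(3))
print(corners)  # all +-0.5 sign combinations
```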

Network-Guided Primitive Fitting

  • RGB/depth images are encoded into 3D feature maps using a BTS depth CNN (DenseNet-161 encoder + multi-scale decoder) (Kluger et al., 2021).
  • Sampling-weight networks produce weighted point selections for RANSAC; each iteration fits cuboid hypotheses and scores them with an occlusion-aware inlier count, resolving occlusion ambiguities via a custom distance function:

$$D(P,h) = I(P,h)\,\min_{f \in \mathrm{faces}(h)} \rho(P, f)$$

where $I(P,h)$ is a binary visibility indicator and $\rho(P,f)$ the distance from point $P$ to face $f$.
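
As an illustration of the distance term only (the paper's full occlusion-aware scoring is more involved), the sketch below computes the minimum point-to-face distance of a cuboid in its canonical frame, gated by a visibility flag; all names are hypothetical:

```python
import numpy as np

def occlusion_aware_distance(p, c, s, R, visible=1.0):
    """Sketch of D(P, h) = I(P, h) * min_f rho(P, f): the distance from
    point p to the nearest of the six faces of cuboid h = (c, s, R),
    gated by a binary visibility indicator I."""
    q = R.T @ (p - c)                    # point in the cuboid's canonical frame
    half = 0.5 * np.asarray(s)
    dists = []
    for axis in range(3):                # two faces per axis, at +/- half[axis]
        for sign in (-1.0, 1.0):
            target = q.copy()
            target[axis] = sign * half[axis]  # project onto the face plane
            others = [a for a in range(3) if a != axis]
            target[others] = np.clip(q[others], -half[others], half[others])
            dists.append(np.linalg.norm(q - target))  # rho(P, f)
    return visible * min(dists)

# Unit cube at the origin: the nearest face to (1, 0, 0) lies at x = 0.5
print(occlusion_aware_distance(np.array([1.0, 0.0, 0.0]),
                               np.zeros(3), np.ones(3), np.eye(3)))  # 0.5
```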

Fine-to-Coarse Cuboid Abstraction in 3D Shape Modeling

Learning-based abstraction (Kobsik et al., 3 Feb 2025) starts from a large set of surface points:

  • Local PointNet encoders aggregate k-nearest-neighbor (kNN) patch features for $N$ sampled centers
  • Global shape features and cuboid latents are integrated using Vision Transformers
  • Each latent predicts a cuboid parameter vector $p_m = [r_m \in \mathbb{R}^4,\, t_m \in \mathbb{R}^3,\, s_m \in \mathbb{R}^3,\, \gamma_m \in [0,1]]$, where $r_m$ is a unit quaternion and $\gamma_m$ a primitive existence probability (see the head sketch below)
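
A plausible form of such a prediction head is sketched below; the latent dimension, the softplus on scales, and the layer layout are assumptions, with only the output parameterization (unit quaternion, translation, scale, existence probability) taken from the description above:

```python
import torch
import torch.nn as nn

class CuboidHead(nn.Module):
    """Hypothetical head mapping one cuboid latent to
    p_m = [r_m, t_m, s_m, gamma_m]; dimensions and activations are
    illustrative assumptions."""
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 4 + 3 + 3 + 1)

    def forward(self, z: torch.Tensor):
        r, t, s, g = self.fc(z).split([4, 3, 3, 1], dim=-1)
        r = nn.functional.normalize(r, dim=-1)  # unit quaternion rotation r_m
        s = nn.functional.softplus(s)           # positive scales s_m
        return r, t, s, torch.sigmoid(g)        # existence prob. gamma_m in [0, 1]

head = CuboidHead()
r, t, s, gamma = head(torch.rand(8, 256))       # eight cuboid latents
```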

The fine-to-coarse training schedule progressively prunes redundant primitives using an abstraction loss:

$$\mathcal{L}_{\rm abs} = \sum_{m=1}^{M} \left[-t_{m} \log \gamma_{m} - (1-t_{m}) \log(1-\gamma_{m})\right]$$

where $t_m$ is a binary mask derived from ranking the primitives by $\gamma_m$ (distinct from the translation $t_m$ above).
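
In PyTorch, $\mathcal{L}_{\rm abs}$ amounts to a binary cross-entropy against a rank-derived target; a minimal sketch (the fine-to-coarse schedule that shrinks the number of kept primitives is omitted):

```python
import torch
import torch.nn.functional as F

def abstraction_loss(gamma: torch.Tensor, keep: int) -> torch.Tensor:
    """L_abs as binary cross-entropy between existence probabilities gamma
    (shape [M]) and a binary target t that keeps the `keep` highest-ranked
    primitives."""
    t = torch.zeros_like(gamma)
    t[gamma.topk(keep).indices] = 1.0       # rank primitives by gamma
    return F.binary_cross_entropy(gamma, t, reduction="sum")

gamma = torch.tensor([0.9, 0.2, 0.7, 0.4])
print(abstraction_loss(gamma, keep=2))      # penalizes the two low-ranked gammas
```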

4. Loss Functions and Training Strategies

Video Super-Resolution

Cuboid-Net employs a mean-squared error loss over all output frames (spatial and interpolated):

$$L_{\rm total} = \frac{1}{|\Omega|} \sum_{t=1}^{2N-1} \left\| \hat{I}_t^{(2)} - I_t^{GT} \right\|^2_2$$

Alternatively, the loss can be split into $L_{\rm spatial}$ and $L_{\rm temporal}$ terms over the spatial (SSR) and temporal (TSR) super-resolution targets, weighted by $\lambda_1 = \lambda_2 = 1$.

Training details:

  • Adam optimizer ($\beta_1 = 0.5$, $\beta_2 = 0.99$), initial learning rate $1 \times 10^{-4}$, halved every 60 epochs
  • Batch size 8, crops of $32 \times 32$ pixels over 4 frames (a configuration sketch follows)
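
These choices map directly onto standard PyTorch components; a configuration sketch with a stand-in model (halving every 60 epochs corresponds to `StepLR` with `gamma=0.5`):

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)    # stand-in for Cuboid-Net
criterion = torch.nn.MSELoss()                 # L_total over all output frames

# Adam with the stated betas; LR 1e-4, halved every 60 epochs via StepLR
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.5, 0.99))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.5)

for epoch in range(120):
    # ... forward pass on 32x32 crops, criterion(pred, gt), backward, step ...
    scheduler.step()
```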

3D Shape Abstraction

The objective combines reconstruction (surface and volume) and abstraction losses:

$$\mathcal{L} = \mathcal{L}_{\rm rec} + \lambda_{\rm abs}\, \mathcal{L}_{\rm abs}, \qquad \lambda_{\rm abs} = 10^{-3}$$

where

$$\mathcal{L}_{\rm rec} = \lambda_{\rm vol}\, \mathcal{L}_{\rm vol} + \lambda_{\rm surf}\, \mathcal{L}_{\rm surf}$$

with explicit bidirectional Chamfer-style losses for surface and volume (Kobsik et al., 3 Feb 2025).

Optimization:

  • AdamW optimizer, learning rate $1 \times 10^{-3}$, cosine-annealing schedule, 1,000 epochs, batch size 16 (sketched below)
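
Again as a configuration sketch with a stand-in module:

```python
import torch

model = torch.nn.Linear(256, 11)   # stand-in for the abstraction network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for epoch in range(1000):
    # ... L = L_rec + 1e-3 * L_abs on batches of 16; backward; optimizer.step() ...
    scheduler.step()
```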

5. Experimental Results

Space-Time Video Super-Resolution

Cuboid-Net outperforms the tested space-time super-resolution baselines on standard datasets:

  • Vimeo-90K (test): ST-SR PSNR = 31.08 dB, SSIM = 0.931 (TMNet: 30.92 dB, 0.928)
  • Vid4: ST-SR PSNR = 29.69 dB, SSIM = 0.882
  • SSR-only: 32.81 dB, close to BasicVSR's 33.02 dB despite Cuboid-Net receiving lower frame-rate input

Ablations indicate that increasing the number of ResDB blocks ($R$) improves PSNR, with diminishing returns beyond $R=7$; increasing the number of 3D conv layers ($K$) in the reconstruction blocks yields +0.33 dB; QE and CFQE each add incremental gains in frame quality.

The model has 18.1 M parameters and a runtime of 13.6 s per clip on an RTX 2080 Ti, making it smaller and faster than most two-stage pipelines.

3D Cuboid Abstraction and Fitting

Robustness on NYU Depth v2 is demonstrated:

  • RGB input: AUC@10 cm = 18.9% (vs. 4.3% for the prior method), mean occlusion-aware $L_2$ distance = 34.5 cm (vs. 65.9 cm)
  • Depth input: AUC@10 cm = 49.1%
  • Cuboid-Net abstracts complex indoor scenes into interpretable cuboids, outperforming prior superquadric-based approaches (Kluger et al., 2021)

ShapeNet and DFAUST results (Kobsik et al., 3 Feb 2025) show improved compactness and fidelity across categories (planes, chairs, tables, humans):

| Class | Method | Num. cuboids ↓ | CD ↓ | IoU (%) ↑ |
|-------|--------|----------------|------|-----------|
| plane | Ours   | 6.03           | 0.026 | 56.0     |
| chair | Ours   | 8.37           | 0.036 | 54.9     |
| table | Ours   | 5.67           | 0.035 | 45.1     |
| human | Ours   | 6.02           | 0.032 | 58.8     |

6. Downstream Applications and Extensions

Cuboid-based abstractions produced by Cuboid-Net can be directly repurposed for:

  • Shape co-segmentation (semantic part labeling via primitive assignment)
  • Shape clustering and structural retrieval using cuboid-parameter vectors
  • Partial symmetry detection through pairwise ICP alignment of primitive-enclosed point sets

A plausible implication is that compact, accurate cuboid abstractions may facilitate interpretable analysis, efficient data compression, and improved downstream geometric reasoning (Kobsik et al., 3 Feb 2025).

7. Significance, Limitations, and Interpretation

Cuboid-Net’s explicit cuboidal decomposition—whether for video or geometric modeling—enables single-network, end-to-end architectures capable of joint spatial-temporal reasoning and geometric abstraction. By fusing multi-directional information at both feature extraction and reconstruction stages, these frameworks outperform prior methods in accuracy, compactness, and operational efficiency.

A notable limitation is that the cuboid representation, while interpretable and structurally compact, may struggle with non-cuboidal or highly irregular objects, which require alternative or hybrid primitive sets. In video, motion artifacts remain a challenge for interpolated frames, partially mitigated by attention-driven enhancement.

Collectively, Cuboid-Net advances the state of the art in both video super-resolution and 3D shape abstraction, serving as an architectural reference point for interpretable, multi-branch, cuboid-centric neural modeling (Fu et al., 24 Jul 2024, Kluger et al., 2021, Kobsik et al., 3 Feb 2025).
