Cuboid-Net: Cuboid Neural Architectures

Updated 3 December 2025

Cuboid-Net denotes a family of neural architectures that use cuboid representations, either to model spatio-temporal dynamics for video super-resolution or to abstract 3D shapes.
The video variant uses a multi-branch design that decomposes the input into orthogonal slices and employs convolutional, residual, and attention modules to extract and fuse features.
Reported experiments show gains in PSNR and SSIM on video datasets and in geometric fitting metrics on 3D benchmarks.

Cuboid-Net refers to distinct neural architectures that leverage the parametric cuboid structure for either video super-resolution or 3D shape abstraction and fitting. In each case, the central methodology is to encode, decompose, and reconstruct data by exploiting cuboidal or related volumetric primitives, either in spatial-temporal or geometric domains.

1. Cuboid Representation and Slicing in Video Super-Resolution

Cuboid-Net for space-time video super-resolution (Fu et al., 24 Jul 2024) models an input low-resolution, low-frame-rate video sequence as a cuboid tensor $V^{in} \in \mathbb{R}^{N \times H \times W}$, where $N$ is the number of frames, $H$ the height, and $W$ the width. The intensity at spatio-temporal index $(t, h, w)$ is $V^{in}_{t, h, w}$. This cuboidal formulation exposes both spatial and temporal data correlations.

A core innovation is slicing this cuboid along three orthogonal axes:

Temporal slices: $S^1_t = V^{in}[t,:,:]$ (an $H \times W$ image for each $t$)
Horizontal space-time slices: $S^2_w = V^{in}[:,:,w]$ (an $N \times H$ matrix for each $w$)
Vertical space-time slices: $S^3_h = V^{in}[:,h,:]$ (an $N \times W$ matrix for each $h$)

Each of these slice sets is fed into a dedicated branch of the network, enabling the extraction and integration of spatial and temporal features.
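The three slice sets can be illustrated with plain array indexing. This is a minimal NumPy sketch, assuming the video is stored as an `(N, H, W)` array; the function name `slice_cuboid` is hypothetical and not taken from the paper.

```python
import numpy as np

def slice_cuboid(video):
    """Split an (N, H, W) video cuboid into its three orthogonal slice sets."""
    n, h, w = video.shape
    temporal = [video[t, :, :] for t in range(n)]    # N slices, each H x W
    horizontal = [video[:, :, j] for j in range(w)]  # W slices, each N x H
    vertical = [video[:, i, :] for i in range(h)]    # H slices, each N x W
    return temporal, horizontal, vertical

# Toy cuboid with N=4 frames, H=6 rows, W=8 columns.
video = np.arange(4 * 6 * 8, dtype=np.float32).reshape(4, 6, 8)
temporal, horizontal, vertical = slice_cuboid(video)
print(len(temporal), temporal[0].shape)      # 4 (6, 8)
print(len(horizontal), horizontal[0].shape)  # 8 (4, 6)
print(len(vertical), vertical[0].shape)      # 6 (4, 8)
```

Each list would then feed one branch of the network; every voxel of the cuboid appears exactly once in each of the three slice sets.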

2. Multi-Branch Network Architectures

Multi-Branch Hybrid Feature Extraction (MBFE)

Within each branch, a sequence of multi-feature blocks (MFBs) operates on the slices. Processing steps:

Bicubic upsampling (e.g., spatial scale factor $s=4$) for spatial super-resolution
Shallow feature extraction: two consecutive 2D conv + ReLU operations
Deep feature extraction: a cascade of $R$ residual-dense blocks (ResDB), each defined by

$F^{(l)} = F^{(l-1)} + R(D(F^{(l-1)}))$

where $D(\cdot)$ denotes densely connected 2D conv + ReLU layers and $R(\cdot)$ denotes channel reduction via a $1 \times 1$ conv (the operator $R(\cdot)$ is distinct from the block count $R$)

Prediction and fusion of the feature outputs through additional conv layers, finalized with a residual connection to the bicubic-upsampled input.
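The residual-dense update $F^{(l)} = F^{(l-1)} + R(D(F^{(l-1)}))$ can be sketched in NumPy as follows. To keep the example self-contained, the spatial 2D convolutions inside $D(\cdot)$ are replaced by per-pixel channel mixing (that is, $1 \times 1$ convolutions); this substitution, the growth rate `G`, the layer count `L`, and all function names are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def conv1x1(x, weight):
    """Per-pixel channel mixing: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def res_dense_block(f, dense_weights, reduce_weight):
    """Compute F + R(D(F)) with dense connectivity between layers."""
    feats = [f]
    for w in dense_weights:
        # Each dense layer sees the concatenation of all earlier feature maps.
        inp = np.concatenate(feats, axis=0)
        feats.append(np.maximum(conv1x1(inp, w), 0.0))  # conv + ReLU
    dense_out = np.concatenate(feats, axis=0)           # D(F)
    reduced = conv1x1(dense_out, reduce_weight)         # R(.): back to C channels
    return f + reduced                                  # residual connection

rng = np.random.default_rng(0)
C, G, L = 8, 4, 3  # channels, growth rate, dense layers (all hypothetical)
f = rng.standard_normal((C, 5, 5))
dense_weights = [0.1 * rng.standard_normal((G, C + i * G)) for i in range(L)]
reduce_weight = 0.1 * rng.standard_normal((C, C + L * G))
out = res_dense_block(f, dense_weights, reduce_weight)
print(out.shape)  # (8, 5, 5)
```

Because the block preserves the feature shape, any number of such blocks can be cascaded.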

Multi-Branch Reconstruction (MBR)

Outputs from all three branches, $V'_m$, are reconstructed using identical 3D convolutional blocks:

Stack of $K$ 3DConv + ReLU layers yields a feature tensor $F_{3D}$
3D transposed convolution for space-time upsampling, followed by Leaky ReLU
Final 3D convolution produces $V''_m$; outputs from all branches are concatenated and fused by a 3D conv, yielding the first-stage space-time super-resolved video $\hat V_{ST}^{(0)} \in \mathbb{R}^{(2N-1) \times (sH) \times (sW)}$.
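As a quick worked example of the output dimensions, the sketch below computes the $(2N-1) \times (sH) \times (sW)$ shape for a given input. The helper names are hypothetical, and the frame layout (odd output indices holding the interpolated frames) is an assumption consistent with inserting one new frame between each pair of input frames.

```python
def st_sr_output_shape(n_frames, height, width, scale=4):
    """Shape of the first-stage output: (2N-1) x (sH) x (sW)."""
    return (2 * n_frames - 1, scale * height, scale * width)

def interpolated_frame_indices(n_frames):
    """Assumed layout: odd output indices are the in-between frames."""
    return list(range(1, 2 * n_frames - 1, 2))

print(st_sr_output_shape(4, 32, 32))  # (7, 128, 128)
print(interpolated_frame_indices(4))  # [1, 3, 5]
```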

Quality Enhancement Modules

A two-stage enhancement pipeline further refines the output:

First-stage Quality Enhancement (QE): a per-frame residual network (SRCNN-inspired) with a skip connection
Second-stage Cross-Frame Quality Enhancement (CFQE): for interpolated frames, a 7-layer 2D conv net with interleaved Convolutional Block Attention Modules (CBAM) addresses motion artifacts. CBAM applies channel and spatial attention:
- Channel: $M_c(F) = \sigma(\text{MLP}(\text{AvgPool}(F)) + \text{MLP}(\text{MaxPool}(F)))$
- Spatial: $M_s(F) = \sigma(\text{Conv}_{7 \times 7}([\text{AvgPool}_c(F); \text{MaxPool}_c(F)]))$
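The two CBAM attention maps can be sketched in NumPy as below. This is an illustrative implementation of the formulas above for a single `(C, H, W)` feature map, assuming a bias-free two-layer MLP, zero padding for the $7 \times 7$ convolution, and the usual CBAM ordering (channel attention first, then spatial attention); all weights and function names are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f, w1, w2):
    """M_c(F): shared MLP on avg- and max-pooled channel descriptors."""
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)
    avg = f.mean(axis=(1, 2))  # (C,)
    mx = f.max(axis=(1, 2))    # (C,)
    return sigmoid(mlp(avg) + mlp(mx))

def spatial_attention(f, kernel):
    """M_s(F): k x k conv over [avg-pool; max-pool] taken along channels."""
    pooled = np.stack([f.mean(axis=0), f.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = f.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)

def cbam(f, w1, w2, kernel):
    """Apply channel attention, then spatial attention, to f of shape (C, H, W)."""
    f = f * channel_attention(f, w1, w2)[:, None, None]
    return f * spatial_attention(f, kernel)[None, :, :]

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 6, 6))
w1 = rng.standard_normal((2, 4))         # reduction to 2 hidden units
w2 = rng.standard_normal((4, 2))
kernel = rng.standard_normal((2, 7, 7))  # 7x7 kernel over the 2 pooled maps
out = cbam(f, w1, w2, kernel)
print(out.shape)  # (4, 6, 6)
```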

3. Cuboid Primitive Fitting to 3D Data

In geometric modeling, Cuboid-Net denotes a paradigm for 3D shape fitting and abstraction using volumetric cuboids (Kluger et al., 2021, Kobsik et al., 3 Feb 2025).

Parametric Cuboid Representation

The canonical parameterization is:

$\theta = (c, s, R)$

where $c \in \mathbb{R}^3$ is the center, $s = (w, h, d) \in \mathbb{R}^3$ holds the side lengths along the cuboid's local axes, and $R \in SO(3)$ is the rotation. The eight corner points are constructed by

$X_i(\theta) = c + R\,\bigl(\tfrac{1}{2} \, \mathrm{diag}(s)\, \varepsilon_i\bigr)$

with $\varepsilon_i \in \{-1, +1\}^3$, $i = 1, \dots, 8$, running through all sign combinations.
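The corner construction translates directly into a few lines of NumPy. The function name `cuboid_corners` is hypothetical; the formula is the one given above.

```python
import itertools
import numpy as np

def cuboid_corners(center, size, rotation):
    """Return the 8 corners X_i = c + R (0.5 * diag(s) * eps_i) as an (8, 3) array."""
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))  # eps_i
    local = 0.5 * signs * np.asarray(size, dtype=float)               # (8, 3)
    # Row-vector form of R @ x is x @ R.T.
    return np.asarray(center, dtype=float) + local @ np.asarray(rotation).T

# Axis-aligned box of size 2 x 4 x 6 centered at (1, 2, 3).
corners = cuboid_corners([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], np.eye(3))
print(corners.min(axis=0), corners.max(axis=0))  # [0. 0. 0.] [2. 4. 6.]
```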

Network-Guided Primitive Fitting

RGB/depth images are encoded into 3D feature maps using a BTS depth CNN (DenseNet-161 encoder + multi-scale decoder) (Kluger et al., 2021).
Sampling-weight networks predict per-point sampling weights that guide RANSAC. Each iteration fits cuboid hypotheses and scores them with an occlusion-aware inlier count; ambiguities caused by occlusion are handled by a custom distance function:

$D(P,h) = I(P,h) \min_{f \in \mathrm{faces}(h)} \rho(P, f)$

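where $P$ is a 3D point, $h$ a cuboid hypothesis, $\rho(P, f)$ the distance from $P$ to face $f$, and $I(P,h)$ a binary visibility indicator.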
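The geometric part of this distance, the minimum over the six faces, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the point is expressed in the cuboid's local frame and the distance to each rectangular face is computed by clamping. How the visibility indicator is computed is not described here, so it is passed in as a plain flag; all function names are hypothetical.

```python
import numpy as np

def min_face_distance(point, center, size, rotation):
    """Smallest Euclidean distance from a point to any of the six cuboid faces."""
    q = np.asarray(rotation).T @ (np.asarray(point, float) - np.asarray(center, float))
    half = 0.5 * np.asarray(size, dtype=float)
    best = np.inf
    for axis in range(3):
        for sign in (-1.0, 1.0):
            # Closest point on this face: clamp in-plane coords, fix the normal coord.
            closest = np.clip(q, -half, half)
            closest[axis] = sign * half[axis]
            best = min(best, float(np.linalg.norm(q - closest)))
    return best

def occlusion_aware_distance(point, center, size, rotation, visible):
    """D(P, h) = I(P, h) * min_f rho(P, f), with I supplied by the caller."""
    return float(bool(visible)) * min_face_distance(point, center, size, rotation)

unit = dict(center=[0.0, 0.0, 0.0], size=[1.0, 1.0, 1.0], rotation=np.eye(3))
print(min_face_distance([2.0, 0.0, 0.0], **unit))  # 1.5
print(min_face_distance([0.0, 0.0, 0.0], **unit))  # 0.5
```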

Fine-to-Coarse Cuboid Abstraction in 3D Shape Modeling

Learning-based abstraction (Kobsik et al., 3 Feb 2025) starts from a large set of surface points:

Local PointNet encoders aggregate $K$-NN patch features for $N$ sampled centers
Global shape features and cuboid latents are integrated using Vision Transformers
Each latent predicts a cuboid parameter vector $p_{m} = [r_{m} \in \mathbb{R}^{4}, t_{m} \in \mathbb{R}^{3}, s_{m} \in \mathbb{R}^{3}, \gamma_{m} \in [0,1]]$, where $r_{m}$ is a unit quaternion and $\gamma_{m}$ is a primitive existence probability
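A predicted parameter vector can be unpacked into a rotation matrix, translation, size, and existence probability as sketched below. The 11-dimensional layout follows the order given above; the quaternion component order `(w, x, y, z)`, the explicit normalization step, and the function names are assumptions.

```python
import numpy as np

def quaternion_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    q = np.asarray(q, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)  # normalize to a unit quaternion
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def decode_cuboid(p):
    """Split p_m = [r (4), t (3), s (3), gamma (1)] into named parts."""
    p = np.asarray(p, dtype=float)
    return {
        "rotation": quaternion_to_rotation(p[0:4]),
        "translation": p[4:7],
        "size": p[7:10],
        "existence": float(p[10]),
    }

# 90-degree rotation about z, translated to (1, 2, 3), size (0.5, 0.5, 2).
half = np.sqrt(0.5)
cuboid = decode_cuboid([half, 0, 0, half, 1, 2, 3, 0.5, 0.5, 2.0, 0.9])
print(np.round(cuboid["rotation"], 3))
```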

The fine-to-coarse training schedule progressively prunes redundant primitives using an abstraction loss:

$\mathcal{L}_{\rm abs} = \sum_{m=1}^{M} \left[-t_{m} \log \gamma_{m} - (1-t_{m}) \log(1-\gamma_{m})\right]$

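where $t_{m} \in \{0, 1\}$ here denotes a binary target mask derived from ranking the primitives by $\gamma_{m}$ (not to be confused with the translation $t_{m}$ in the parameter vector above).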
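The abstraction loss is a binary cross-entropy between the existence probabilities and a ranking-based mask. The sketch below assumes the mask keeps the `keep` highest-ranked primitives; the exact rule used to build the mask is not stated here, so `keep` and the function name are hypothetical.

```python
import numpy as np

def abstraction_loss(gamma, keep, eps=1e-7):
    """Binary cross-entropy between gamma and a top-`keep` ranking mask."""
    gamma = np.clip(np.asarray(gamma, dtype=float), eps, 1.0 - eps)
    order = np.argsort(-gamma)           # indices sorted by descending gamma
    target = np.zeros_like(gamma)
    target[order[:keep]] = 1.0           # assumed rule: keep the top-ranked primitives
    loss = np.sum(-target * np.log(gamma) - (1.0 - target) * np.log(1.0 - gamma))
    return float(loss), target

loss, target = abstraction_loss([0.9, 0.2, 0.6, 0.1], keep=2)
print(target)  # [1. 0. 1. 0.]
```

Minimizing this loss pushes the kept primitives toward $\gamma_m = 1$ and the pruned ones toward $\gamma_m = 0$.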

4. Loss Functions and Training Strategies

Video Super-Resolution

Cuboid-Net employs a mean-squared error loss over all output frames (spatial and interpolated):

$L_{\rm total} = \frac{1}{|\Omega|} \sum_{t=1}^{2N-1} \| \hat{I}_t^{(2)} - I_t^{GT} \|^2_2$

Alternatively, the loss can be split into a spatial term $L_{\rm spatial}$ and a temporal term $L_{\rm temporal}$, computed over the SSR and TSR targets respectively and weighted by $\lambda_1 = \lambda_2 = 1$.
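The relationship between the total loss and its spatial and temporal parts can be sketched as below. Several points are assumptions made for illustration: $|\Omega|$ is taken to be the total number of output pixels, frames at even output indices are treated as the SSR targets, and frames at odd indices as the TSR (interpolated) targets. The function name is hypothetical.

```python
import numpy as np

def st_sr_losses(pred, target, lambda1=1.0, lambda2=1.0):
    """Return (total, spatial, temporal) squared-error losses for (2N-1, H, W) videos."""
    sq = (np.asarray(pred, float) - np.asarray(target, float)) ** 2
    omega = sq.size                      # assumed normalizer |Omega|
    spatial = sq[0::2].sum() / omega     # frames at original time stamps
    temporal = sq[1::2].sum() / omega    # interpolated frames
    return lambda1 * spatial + lambda2 * temporal, spatial, temporal

rng = np.random.default_rng(0)
target = rng.random((7, 8, 8))           # 2N-1 = 7 frames for N = 4
pred = target + 0.05 * rng.standard_normal(target.shape)
total, spatial, temporal = st_sr_losses(pred, target)
print(np.isclose(total, np.mean((pred - target) ** 2)))  # True
```

Under these assumptions and with $\lambda_1 = \lambda_2 = 1$, the split form reduces to the single mean-squared error over all frames.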

Training details:

Adam optimizer ($\beta_1=0.5$, $\beta_2=0.99$), initial LR $=1\mathrm{e}{-4}$, halved every $60$ epochs
Batch size $8$, crop size $4 \times 32 \times 32$ (time $\times$ H $\times$ W)
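The step schedule described above (initial rate $10^{-4}$, halved every 60 epochs) corresponds to the following small helper; the function name and the zero-based epoch indexing are assumptions.

```python
def step_lr(epoch, base_lr=1e-4, step=60, factor=0.5):
    """Learning rate after halving `base_lr` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)

for epoch in (0, 59, 60, 120):
    print(epoch, step_lr(epoch))
```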

3D Shape Abstraction

The objective combines reconstruction (surface and volume) and abstraction losses:

$\mathcal{L} = \mathcal{L}_{\rm rec} + \lambda_{\rm abs} \mathcal{L}_{\rm abs}, \quad \lambda_{\rm abs}=10^{-3}$

where

$\mathcal{L}_{\rm rec} = \lambda_{\rm vol}\, \mathcal{L}_{\rm vol} + \lambda_{\rm surf}\, \mathcal{L}_{\rm surf}$

with explicit bidirectional Chamfer-style losses for surface and volume (Kobsik et al., 3 Feb 2025).
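A bidirectional Chamfer-style distance and the way the loss terms combine can be sketched as follows. The squared-Euclidean form of the Chamfer distance and the default weights `lambda_vol = lambda_surf = 1.0` are assumptions (the weights are not given here); only $\lambda_{\rm abs} = 10^{-3}$ comes from the text. Function names are hypothetical.

```python
import numpy as np

def chamfer_distance(a, b):
    """Bidirectional Chamfer distance between point sets a (n, 3) and b (m, 3)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (n, m) squared distances
    # Mean nearest-neighbor distance from a to b, plus from b to a.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def total_loss(l_vol, l_surf, l_abs, lambda_vol=1.0, lambda_surf=1.0, lambda_abs=1e-3):
    """L = (lambda_vol * L_vol + lambda_surf * L_surf) + lambda_abs * L_abs."""
    return lambda_vol * l_vol + lambda_surf * l_surf + lambda_abs * l_abs

a = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
b = [[0.0, 0.0, 0.0]]
print(chamfer_distance(a, b))  # 0.5
```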

Optimization:

AdamW, LR $1 \times 10^{-3}$, cosine-annealing schedule, $1{,}000$ epochs, batch size $16$
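A standard cosine-annealing schedule over the stated 1,000 epochs looks like the sketch below. The minimum learning rate of zero and the absence of warm-up or restarts are assumptions, since the text only names the schedule type.

```python
import math

def cosine_lr(epoch, total_epochs=1000, base_lr=1e-3, min_lr=0.0):
    """Cosine-annealed learning rate decaying from base_lr to min_lr."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 250, 500, 1000):
    print(epoch, cosine_lr(epoch))
```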

5. Experimental Results

Space-Time Video Super-Resolution

Reported results on standard datasets, where Cuboid-Net exceeds the tested space-time baselines:

Vimeo-90K (test): ST-SR PSNR $=31.08$ dB, SSIM $=0.931$ (TMNet: $30.92$ dB, $0.928$)
Vid4: ST-SR PSNR $=29.69$ dB, SSIM $=0.882$
SSR-only: $32.81$ dB, close to BasicVSR ($33.02$ dB) despite the lower frame-rate input
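For reference, the PSNR values quoted above follow the usual definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below assumes intensities normalized to $[0, 1]$ (so `peak=1.0`); the evaluation protocol of the paper (color space, cropping) is not specified here.

```python
import numpy as np

def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB between two arrays of equal shape."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

target = np.full((4, 4), 0.5)
pred = target + 0.1            # uniform error of 0.1, so MSE = 0.01
print(round(psnr(pred, target), 2))  # 20.0
```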

Ablations indicate that increasing the number of ResDB blocks ($R$) improves PSNR, with diminishing returns beyond $R=7$; increasing the number of 3DConv layers ($K$) in the reconstruction blocks gives $+0.33$ dB; and QE and CFQE each improve frame quality incrementally.

The model has $18.1$M parameters and a runtime of $13.6$ s per clip on a 2080 Ti, making it smaller and faster than most two-stage pipelines.

3D Cuboid Abstraction and Fitting

Results on NYU Depth v2:

RGB input: AUC@10 cm $=18.9\%$ (vs. $4.3\%$ for prior work), mean occlusion-aware $L_2 = 34.5$ cm (vs. $65.9$ cm)
Depth input: AUC@10 cm $=49.1\%$
Cuboid-Net abstracts complex indoor scenes into interpretable cuboids, outperforming prior superquadric-based approaches (Kluger et al., 2021)

ShapeNet and DFAUST results (Kobsik et al., 3 Feb 2025) show improved compactness and fidelity across categories (planes, chairs, tables, humans):

Class	Method	Num. primitives ↓	CD ↓	IoU (%) ↑
plane	Ours	6.03	0.026	56.0
chair	Ours	8.37	0.036	54.9
table	Ours	5.67	0.035	45.1
human	Ours	6.02	0.032	58.8
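The IoU column measures volumetric overlap between the union of predicted cuboids and the target shape. One common way to estimate it is on sampled query points, as sketched below; the sampling-based protocol, the function names, and the toy data are assumptions, not the benchmark's evaluation code.

```python
import numpy as np

def points_in_cuboid(points, center, size, rotation):
    """Boolean mask of points lying inside an oriented cuboid."""
    local = (np.asarray(points, float) - np.asarray(center, float)) @ np.asarray(rotation)
    return np.all(np.abs(local) <= 0.5 * np.asarray(size, float), axis=1)

def abstraction_iou(points, gt_inside, cuboids):
    """IoU between ground-truth occupancy and the union of cuboids, on sample points."""
    pred = np.zeros(len(points), dtype=bool)
    for center, size, rotation in cuboids:
        pred |= points_in_cuboid(points, center, size, rotation)
    gt = np.asarray(gt_inside, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(np.logical_and(pred, gt).sum() / union)

points = np.array([[0.0, 0, 0], [0.4, 0, 0], [0.9, 0, 0], [2.0, 0, 0]])
gt_inside = [True, True, True, False]
cuboids = [([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], np.eye(3))]
print(round(abstraction_iou(points, gt_inside, cuboids), 3))  # 0.667
```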

6. Downstream Applications and Extensions

Cuboid-based abstractions produced by Cuboid-Net can be directly repurposed for:

Shape co-segmentation (semantic part labeling via primitive assignment)
Shape clustering and structural retrieval using cuboid-parameter vectors
Partial symmetry detection through pairwise ICP alignment of primitive-enclosed point sets

A plausible implication is that compact, accurate cuboid abstractions may facilitate interpretable analysis, efficient data compression, and improved downstream geometric reasoning (Kobsik et al., 3 Feb 2025).

7. Significance, Limitations, and Interpretation

Cuboid-Net’s explicit cuboidal decomposition, whether for video or geometric modeling, supports end-to-end architectures for joint spatio-temporal reasoning or geometric abstraction. In the video model, multi-directional information is fused at both the feature extraction and reconstruction stages. The cited works report gains over prior methods in accuracy, compactness, and efficiency.

A notable limitation is that the cuboid representation, while interpretable and structurally compact, may struggle with non-cuboidal or highly irregular objects, which require alternative or hybrid primitive sets. In video, motion artifacts remain a challenge for interpolated frames and are only partially mitigated by attention-driven enhancement.

Collectively, Cuboid-Net advances the state of the art in both video super-resolution and 3D shape abstraction, serving as an architectural reference point for interpretable, multi-branch, cuboid-centric neural modeling (Fu et al., 24 Jul 2024, Kluger et al., 2021, Kobsik et al., 3 Feb 2025).

References

Fu et al. (2024). Cuboid-Net: A Multi-Branch Convolutional Neural Network for Joint Space-Time Video Super Resolution.

Kluger et al. (2021). Cuboids Revisited: Learning Robust 3D Shape Fitting to Single RGB Images.

Kobsik et al. (2025). Learning Fine-to-Coarse Cuboid Shape Abstraction.
