Cuboid-Net: Cuboid Neural Architectures
- Cuboid-Net denotes neural architectures that use cuboid representations to model spatio-temporal or geometric structure for video super-resolution and 3D shape abstraction.
- Its multi-branch design decomposes input data into orthogonal slices, employing convolutional, residual, and attention modules to extract and fuse features.
- Experimental results demonstrate significant improvements in PSNR, SSIM, and geometric fitting metrics across video datasets and 3D benchmarks.
Cuboid-Net refers to distinct neural architectures that leverage the parametric cuboid structure for either video super-resolution or 3D shape abstraction and fitting. In each case, the central methodology is to encode, decompose, and reconstruct data by exploiting cuboidal or related volumetric primitives, either in spatial-temporal or geometric domains.
1. Cuboid Representation and Slicing in Video Super-Resolution
Cuboid-Net for space-time video super-resolution (Fu et al., 24 Jul 2024) models an input low-resolution, low-frame-rate video sequence as a cuboid tensor $V \in \mathbb{R}^{T \times H \times W}$, where $T$ is the number of frames, $H$ the height, and $W$ the width. The intensity at spatio-temporal index $(t, h, w)$ is $V(t, h, w)$. This cuboidal formulation exposes both spatial and temporal data correlations.
A core innovation is slicing this cuboid along three orthogonal axes:
- Temporal slices: $V(t, :, :)$ (an $H \times W$ image for each $t$)
- Horizontal spatial–time slices: $V(:, h, :)$ (a $T \times W$ matrix for each $h$)
- Vertical spatial–time slices: $V(:, :, w)$ (a $T \times H$ matrix for each $w$)
Each of these slice sets is fed into a dedicated branch of the network, enabling the extraction and integration of spatial and temporal features.
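As a minimal illustration (not the authors' code), the three slice sets can be obtained by direct indexing of a $(T, H, W)$ array; the tensor name and dimensions below are placeholders.

```python
import numpy as np

# Illustrative low-resolution, low-frame-rate clip: T frames of H x W intensities.
T, H, W = 4, 64, 96
video = np.random.rand(T, H, W).astype(np.float32)

# Temporal slices: one H x W image per time step t.
temporal_slices = [video[t, :, :] for t in range(T)]    # each (H, W)

# Horizontal spatial-time slices: one T x W matrix per row h.
horizontal_slices = [video[:, h, :] for h in range(H)]  # each (T, W)

# Vertical spatial-time slices: one T x H matrix per column w.
vertical_slices = [video[:, :, w] for w in range(W)]    # each (T, H)
```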
2. Multi-Branch Network Architectures
Multi-Branch Hybrid Feature Extraction (MBFE)
Within each branch, a sequence of multi-feature blocks (MFBs) operates on the slices. Processing steps:
- Bicubic upsampling by the target spatial scale factor for spatial super-resolution
- Shallow feature extraction: two consecutive 2D conv + ReLU operations
- Deep feature extraction: a cascade of residual-dense blocks (ResDB), each defined by $F_d = F_{d-1} + f_{1\times 1}\big(\mathcal{D}(F_{d-1})\big)$, where $\mathcal{D}$ denotes densely connected 2D conv + ReLU layers and $f_{1\times 1}$ is a $1 \times 1$ conv for channel reduction (see the sketch after this list)
- Prediction and fusion of feature outputs through additional conv layers, finalized with a residual connection to the bicubic-upsampled input
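A compact PyTorch sketch of a residual-dense block of the kind described above; the layer count, channel width, and growth rate are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResDB(nn.Module):
    """Residual-dense block: densely connected 2D conv + ReLU layers,
    1x1 channel reduction, and a residual connection."""
    def __init__(self, channels=64, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                nn.ReLU(inplace=True))
            for i in range(num_layers))
        # 1x1 conv reduces the concatenated features back to `channels`.
        self.fuse = nn.Conv2d(channels + num_layers * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))  # residual connection
```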
Multi-Branch Reconstruction (MBR)
Outputs from all three branches are reconstructed using identical 3D convolutional blocks (sketched below):
- A stack of 3D conv + ReLU layers yields a deep feature tensor
- A 3D transposed convolution performs joint time–space upsampling, followed by a Leaky ReLU
- A final 3D convolution produces the per-branch reconstruction; the outputs of all branches are concatenated and fused by a 3D conv, yielding the first-stage space-time super-resolved video
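A hedged PyTorch sketch of one such per-branch reconstruction block; kernel sizes, depths, and upsampling factors are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReconstructionBranch(nn.Module):
    """3D conv + ReLU stack, transposed 3D conv for joint time-space
    upsampling, LeakyReLU, and a final 3D conv (per-branch output)."""
    def __init__(self, in_ch=64, mid_ch=64, out_ch=3, t_scale=2, s_scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True))
        # Upsample time by t_scale and space by s_scale in one transposed conv.
        self.up = nn.ConvTranspose3d(mid_ch, mid_ch,
                                     kernel_size=(t_scale, s_scale, s_scale),
                                     stride=(t_scale, s_scale, s_scale))
        self.act = nn.LeakyReLU(0.1, inplace=True)
        self.tail = nn.Conv3d(mid_ch, out_ch, 3, padding=1)

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.tail(self.act(self.up(self.body(x))))
```

The three branch outputs would then be concatenated along the channel axis and fused by a further 3D convolution, as described above.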
Quality Enhancement Modules
A two-stage enhancement pipeline further refines the output:
- First-stage Quality Enhancement (QE): Per-frame residual network (SRCNN-inspired) with skip connection
- Second-stage Cross-Frame Quality Enhancement (CFQE): for interpolated frames, a 7-layer 2D conv net with interleaved CBAM (Convolutional Block Attention Module) attention modules addresses motion artifacts. CBAM applies channel and spatial attention (see the sketch below):
- Channel: $M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$
- Spatial: $M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)])\big)$
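The CBAM equations above follow the standard formulation of Woo et al. (2018); the PyTorch sketch below uses the usual reduction ratio and spatial kernel size, which are not necessarily the values used in CFQE.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention (standard CBAM)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Shared MLP applied to avg- and max-pooled channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False))
        # Conv over the concatenated channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2, bias=False)

    def forward(self, x):
        # Channel attention M_c(F)
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention M_s(F)
        avg_map = torch.mean(x, dim=1, keepdim=True)
        max_map = torch.amax(x, dim=1, keepdim=True)
        attn = torch.sigmoid(self.spatial(torch.cat([avg_map, max_map], dim=1)))
        return x * attn
```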
3. Cuboid Primitive Fitting to 3D Data
In geometric modeling, Cuboid-Net denotes a paradigm for 3D shape fitting and abstraction using volumetric cuboids (Kluger et al., 2021, Kobsik et al., 3 Feb 2025).
Parametric Cuboid Representation
The canonical parameterization is $\theta = (\mathbf{c}, \mathbf{s}, \mathbf{R})$, where $\mathbf{c} \in \mathbb{R}^3$ is the center, $\mathbf{s} = (s_x, s_y, s_z)$ the axis-aligned scale, and $\mathbf{R} \in SO(3)$ the rotation. Eight corner points are constructed by $\mathbf{p}_k = \mathbf{c} + \mathbf{R}\big(\tfrac{1}{2}\,\boldsymbol{\epsilon}_k \odot \mathbf{s}\big)$, with $\boldsymbol{\epsilon}_k \in \{-1, +1\}^3$ running through all sign combinations.
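A minimal sketch of the corner construction, assuming $\mathbf{s}$ stores full edge lengths (half-extents would drop the factor $1/2$):

```python
import itertools
import numpy as np

def cuboid_corners(center, scale, R):
    """Eight corners p_k = c + R (0.5 * eps_k * s), eps_k in {-1, +1}^3."""
    center = np.asarray(center, dtype=float)
    scale = np.asarray(scale, dtype=float)
    R = np.asarray(R, dtype=float)
    corners = [center + R @ (0.5 * np.array(eps) * scale)
               for eps in itertools.product((-1.0, 1.0), repeat=3)]
    return np.stack(corners)  # (8, 3)

# Example: unit cube centered at the origin, no rotation.
print(cuboid_corners([0, 0, 0], [1, 1, 1], np.eye(3)))
```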
Network-Guided Primitive Fitting
- RGB/depth images are encoded into 3D feature maps using a BTS depth CNN (DenseNet-161 encoder + multi-scale decoder) (Kluger et al., 2021).
- Sampling-weight networks produce weighted point selections for RANSAC; iterations fit cuboid hypotheses and score them with an occlusion-aware inlier count, resolving ambiguities from occlusions via a custom distance function that incorporates a binary visibility indicator per cuboid face (sketched below).
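The sketch below illustrates the general idea of an occlusion-aware point-to-cuboid distance, restricting the measurement to visible face planes via a binary indicator; it is a simplified stand-in, not the exact distance function of Kluger et al. (2021).

```python
import numpy as np

def occlusion_aware_distance(points, center, scale, R, visible):
    """Distance from each point to the nearest *visible* face plane of a
    cuboid (center c, full extents s, rotation R). `visible` is a binary
    (6,) indicator over the faces ordered -x, +x, -y, +y, -z, +z."""
    local = (np.asarray(points, float) - np.asarray(center, float)) @ np.asarray(R, float)
    half = np.asarray(scale, float) / 2.0
    lower = np.abs(local + half)   # distances to the -x, -y, -z face planes
    upper = np.abs(local - half)   # distances to the +x, +y, +z face planes
    face_dist = np.stack([lower[:, 0], upper[:, 0],
                          lower[:, 1], upper[:, 1],
                          lower[:, 2], upper[:, 2]], axis=1)  # (N, 6)
    face_dist = np.where(np.asarray(visible)[None, :] > 0, face_dist, np.inf)
    return face_dist.min(axis=1)

pts = np.random.rand(5, 3)
vis = np.array([1, 1, 1, 0, 0, 1])  # two faces marked as occluded
print(occlusion_aware_distance(pts, [0.5, 0.5, 0.5], [1, 1, 1], np.eye(3), vis))
```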
Fine-to-Coarse Cuboid Abstraction in 3D Shape Modeling
Learning-based abstraction (Kobsik et al., 3 Feb 2025) starts from a large set of surface points:
- Local PointNet encoders aggregate nearest-neighbor patch features around sampled center points
- Global shape features and cuboid latents are integrated using Vision Transformers
- Each latent predicts a cuboid parameter vector comprising translation, scale, and rotation (represented as a unit quaternion), together with a primitive existence probability $p_k$
The fine-to-coarse training schedule progressively prunes redundant primitives using an abstraction loss in which a binary mask, obtained by ranking primitives by their existence probability $p_k$, determines which primitives are retained (see the sketch below).
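A hedged sketch of the pruning mask used during the fine-to-coarse schedule; the loss that consumes this mask and the schedule for shrinking `keep_k` are simplified assumptions rather than the published formulation.

```python
import torch

def abstraction_mask(existence_prob, keep_k):
    """Keep the `keep_k` primitives with the highest existence probability p_k;
    mask out (prune) the rest."""
    order = torch.argsort(existence_prob, descending=True)
    mask = torch.zeros_like(existence_prob)
    mask[order[:keep_k]] = 1.0
    return mask

p = torch.tensor([0.9, 0.2, 0.75, 0.05, 0.6])
mask = abstraction_mask(p, keep_k=3)          # tensor([1., 0., 1., 0., 1.])
# One plausible abstraction penalty: push the pruned primitives' p_k toward zero.
loss_abs = ((1.0 - mask) * p).sum()
```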
4. Loss Functions and Training Strategies
Video Super-Resolution
Cuboid-Net employs a mean-squared error loss over all output frames (both spatially super-resolved and temporally interpolated), $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \| \hat{I}_i - I_i^{\mathrm{GT}} \|_2^2$. Alternatively, the loss can be split into $\mathcal{L}_{\mathrm{SSR}}$ and $\mathcal{L}_{\mathrm{TSR}}$ terms over the spatial-SR and temporal-SR targets.
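A minimal PyTorch sketch of this loss, splitting the per-frame MSE into SSR and TSR terms via a boolean mask over interpolated frames; the relative weighting is an assumption.

```python
import torch

def st_sr_loss(pred, gt, interp_mask, weight_tsr=1.0):
    """pred, gt: (T, C, H, W); interp_mask: boolean (T,) marking interpolated
    frames. Returns the combined loss plus its SSR and TSR components."""
    per_frame = ((pred - gt) ** 2).flatten(1).mean(dim=1)  # (T,)
    loss_ssr = per_frame[~interp_mask].mean()              # spatial-SR frames
    loss_tsr = per_frame[interp_mask].mean()               # interpolated frames
    return loss_ssr + weight_tsr * loss_tsr, loss_ssr, loss_tsr

pred, gt = torch.rand(4, 3, 8, 8), torch.rand(4, 3, 8, 8)
mask = torch.tensor([False, True, False, True])  # frames 1 and 3 interpolated
total, ssr, tsr = st_sr_loss(pred, gt, mask)
```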
Training details:
- Adam optimizer; the initial learning rate is halved every $60$ epochs (see the configuration sketch below)
- Batch size $8$; training crops are sampled jointly along the time, height, and width dimensions
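The schedule above maps onto a standard Adam + StepLR setup; the initial learning rate below is a placeholder, not the reported value.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder LR
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.5)

for epoch in range(120):
    # ... training loop over batches of size 8 with (T, H, W) crops ...
    scheduler.step()  # halves the learning rate every 60 epochs
```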
3D Shape Abstraction
The objective combines reconstruction (surface and volume) and abstraction losses, $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \mathcal{L}_{\mathrm{abs}}$, where $\mathcal{L}_{\mathrm{rec}}$ consists of explicit bidirectional Chamfer-style losses for surface and volume (Kobsik et al., 3 Feb 2025).
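A minimal sketch of a bidirectional Chamfer-style surface term between sampled shape points and points sampled from the predicted cuboids; the exact surface and volume losses in the paper differ in their sampling and weighting.

```python
import torch

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a: (N, 3) and b: (M, 3),
    using squared Euclidean nearest-neighbor distances."""
    d2 = torch.cdist(a, b) ** 2                     # (N, M) pairwise distances
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()

shape_pts = torch.rand(1024, 3)    # points sampled from the target surface
cuboid_pts = torch.rand(2048, 3)   # points sampled from the predicted cuboids
print(chamfer_distance(shape_pts, cuboid_pts))
```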
Optimization:
- AdamW optimizer with a cosine-annealing learning-rate schedule, batch size $16$
5. Experimental Results
Space-Time Video Super-Resolution
Cuboid-Net compares favorably with the tested baselines on standard datasets:
- Vimeo-90K (test): higher ST-SR PSNR and SSIM than TMNet ($30.92$ dB, $0.928$)
- Vid4: the best ST-SR PSNR and SSIM among the compared methods
- SSR-only: $32.81$ dB, close to BasicVSR's $33.02$ dB despite operating on lower-frame-rate input
Ablations indicate that adding ResDB blocks improves PSNR with diminishing returns beyond a moderate depth, that additional 3D conv layers in the reconstruction blocks yield further PSNR gains, and that QE and CFQE each improve frame quality incrementally.
The model has $18.1$ M parameters and a runtime of $13.6$ s per clip on a 2080 Ti GPU, making it smaller and faster than most two-stage pipelines.
3D Cuboid Abstraction and Fitting
Robustness on NYU Depth v2 is demonstrated:
- RGB input: higher AUC@$10$ cm than the prior method ($4.3$%) and lower mean occlusion-aware distance than the prior $65.9$ cm
- Depth input: AUC@$10$ cm improves further
- Cuboid-Net abstracts complex indoor scenes into interpretable cuboids, outperforming prior superquadric-based approaches (Kluger et al., 2021)
ShapeNet and DFAUST results (Kobsik et al., 3 Feb 2025) show improved compactness and fidelity across categories (planes, chairs, tables, humans):
| Class | Method | # Cuboids ↓ | CD ↓ | IoU (%) ↑ |
|---|---|---|---|---|
| plane | Ours | 6.03 | 0.026 | 56.0 |
| chair | Ours | 8.37 | 0.036 | 54.9 |
| table | Ours | 5.67 | 0.035 | 45.1 |
| human | Ours | 6.02 | 0.032 | 58.8 |
6. Downstream Applications and Extensions
Cuboid-based abstractions produced by Cuboid-Net can be directly repurposed for:
- Shape co-segmentation (semantic part labeling via primitive assignment)
- Shape clustering and structural retrieval using cuboid-parameter vectors
- Partial symmetry detection through pairwise ICP alignment of primitive-enclosed point sets
A plausible implication is that compact, accurate cuboid abstractions may facilitate interpretable analysis, efficient data compression, and improved downstream geometric reasoning (Kobsik et al., 3 Feb 2025).
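As an illustration of the clustering use case, a shape can be described by the concatenated parameter vectors of its predicted cuboids and clustered with an off-the-shelf algorithm; the descriptor layout and cluster count below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
num_shapes, num_cuboids, params_per_cuboid = 200, 8, 11  # 3 center + 3 scale + 4 quat + 1 existence
# Stand-in descriptors; in practice these come from the abstraction network.
descriptors = rng.normal(size=(num_shapes, num_cuboids * params_per_cuboid))

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(descriptors)
print(labels[:10])  # cluster assignment per shape
```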
7. Significance, Limitations, and Interpretation
Cuboid-Net’s explicit cuboidal decomposition—whether for video or geometric modeling—enables single-network, end-to-end architectures capable of joint spatial-temporal reasoning and geometric abstraction. By fusing multi-directional information at both feature extraction and reconstruction stages, these frameworks outperform prior methods in accuracy, compactness, and operational efficiency.
A notable limitation is that the cuboid representation, while interpretable and structurally compact, may struggle with non-cuboidal or highly irregular objects, which require alternative or hybrid primitive sets. In video, motion artifacts remain a challenge for interpolated frames and are only partially mitigated by attention-driven enhancement.
Collectively, Cuboid-Net advances the state of the art in both video super-resolution and 3D shape abstraction, serving as an architectural reference point for interpretable, multi-branch, cuboid-centric neural modeling (Fu et al., 24 Jul 2024, Kluger et al., 2021, Kobsik et al., 3 Feb 2025).