DCT-Based Mask Representation

Updated 20 October 2025

DCT-based mask representation is a technique that transforms spatial masks into the frequency domain using the discrete cosine transform (DCT), which concentrates energy in the low-frequency coefficients.
It achieves high compression efficiency by retaining only key coefficients, which reduces storage and computational complexity while preserving critical mask details.
The approach is applied across instance segmentation, video tracking, compressed sensing, and CNN compression, demonstrating improved performance and reduced redundancy.

A DCT-based mask representation is a technique that leverages the discrete cosine transform (DCT) to encode spatial, temporal, or spatiotemporal masks (typically dense binary or continuous-valued support regions) as compact frequency-domain vectors. By exploiting the energy compaction property of the DCT—where signal energy is largely concentrated in the low-frequency components—these approaches enable efficient, low-redundancy storage, manipulation, and transmission of mask-like structures across computer vision, signal processing, compression, and machine learning systems.

1. Mathematical Foundations and Core Principles

At its core, a DCT-based mask representation transforms a mask $M(x, y)$ (or higher-dimensional annotation) from the spatial domain into the frequency domain. For a 2D mask of size $K \times K$, the forward DCT (type-II) is

$$M_{DCT}(u,v) = \frac{2}{K} C(u)C(v) \sum_{x=0}^{K-1}\sum_{y=0}^{K-1} M(x,y) \cos\left(\frac{(2x+1)u\pi}{2K}\right)\cos\left(\frac{(2y+1)v\pi}{2K}\right)$$

where $C(w) = 1/\sqrt{2}$ if $w=0$ and $C(w) = 1$ otherwise. This yields a $K\times K$ matrix of frequency coefficients, and with this normalization the transform is orthonormal. Only the coefficients in the low-frequency region (e.g., the first $N$ in zig-zag scanning order) are retained, because energy compaction places most of the mask's signal energy in that small subset.
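The encoding step described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the DCT-Mask implementation: the helper names `dct_matrix`, `zigzag_indices`, and `encode_mask` are hypothetical, the DCT is built as an explicit orthonormal matrix that matches the normalization in the formula, and the JPEG-style zig-zag is only one possible low-frequency ordering.

```python
import numpy as np

def dct_matrix(K):
    """Orthonormal DCT-II matrix: A[u, x] = sqrt(2/K) * C(u) * cos((2x+1)u*pi / (2K))."""
    u = np.arange(K)[:, None]   # frequency index (rows)
    x = np.arange(K)[None, :]   # spatial index (columns)
    A = np.sqrt(2.0 / K) * np.cos((2 * x + 1) * u * np.pi / (2 * K))
    A[0, :] /= np.sqrt(2.0)     # C(0) = 1/sqrt(2)
    return A

def zigzag_indices(K):
    """(u, v) pairs ordered by anti-diagonal, alternating direction (JPEG-style)."""
    pairs = [(u, v) for u in range(K) for v in range(K)]
    return sorted(pairs, key=lambda p: (p[0] + p[1],
                                        p[0] if (p[0] + p[1]) % 2 else p[1]))

def encode_mask(mask, n_coeffs):
    """Return the first n_coeffs zig-zag DCT coefficients of a square K x K mask."""
    K = mask.shape[0]
    A = dct_matrix(K)
    coeffs = A @ mask.astype(float) @ A.T   # separable 2D DCT: rows, then columns
    order = zigzag_indices(K)[:n_coeffs]
    return np.array([coeffs[u, v] for u, v in order])

# Usage: a 16 x 16 mask with a filled square, encoded as a 20-dimensional vector.
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0
vector = encode_mask(mask, 20)
```

Writing the 2D transform as `A @ mask @ A.T` makes the separability explicit; a library routine such as a fast DCT would normally replace the dense matrix product for large $K$.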

In 3D applications (video, volumetric segmentation), masks $x[m,n,p]$ are encoded using the 3D-DCT:

$$T_x[u,v,w] = a(u)a(v)a(w)\sum_{m=0}^{N-1}\sum_{n=0}^{N-1}\sum_{p=0}^{N-1}x[m,n,p]\cos\left[\frac{\pi(2m+1)u}{2N}\right]\cos\left[\frac{\pi(2n+1)v}{2N}\right]\cos\left[\frac{\pi(2p+1)w}{2N}\right]$$

where $a(k)$ is the same per-axis normalization factor as in the 1D and 2D-DCT; in the notation above, $a(k) = \sqrt{2/N}\,C(k)$.
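Because the 3D-DCT is separable, it can be computed by applying a 1D DCT along each axis in turn. The sketch below assumes the orthonormal normalization and uses hypothetical helper names (`dct_matrix`, `dct3`, `idct3`); it also allows a different length per axis, which generalizes the cubic $N \times N \times N$ case in the formula.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix of size N x N."""
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    A = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    A[0, :] /= np.sqrt(2.0)
    return A

def _apply_along_axes(x, inverse):
    out = np.asarray(x, dtype=float)
    for axis, n in enumerate(out.shape):
        A = dct_matrix(n)
        B = A.T if inverse else A
        # Contract B's second index with `axis`, then move the new axis back in place.
        out = np.moveaxis(np.tensordot(B, out, axes=(1, axis)), 0, axis)
    return out

def dct3(x):
    """Separable 3D DCT-II: one 1D transform per axis."""
    return _apply_along_axes(x, inverse=False)

def idct3(T):
    """Inverse of dct3 (the transpose, since each 1D matrix is orthogonal)."""
    return _apply_along_axes(T, inverse=True)

# Usage: a small synthetic volumetric mask.
volume = np.zeros((8, 8, 8))
volume[2:6, 2:6, 2:6] = 1.0
coeffs = dct3(volume)
restored = idct3(coeffs)
```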

Crucially, the DCT is invertible; spatial masks are reconstructed via the inverse DCT (IDCT), with the caveat that truncating frequency coefficients yields approximate, compressed reconstructions.
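The difference between exact inversion and truncated reconstruction can be checked numerically. The following is a sketch under assumptions: the function name `truncated_reconstruction` is hypothetical, coefficients are ranked by $u+v$ as a simple stand-in for a zig-zag order, and the 0.5 threshold for binarizing is a common choice rather than something the text prescribes.

```python
import numpy as np

def dct_matrix(K):
    """Orthonormal DCT-II matrix of size K x K."""
    u = np.arange(K)[:, None]
    x = np.arange(K)[None, :]
    A = np.sqrt(2.0 / K) * np.cos((2 * x + 1) * u * np.pi / (2 * K))
    A[0, :] /= np.sqrt(2.0)
    return A

def truncated_reconstruction(mask, n_coeffs):
    """Keep the n_coeffs lowest-frequency coefficients (ranked by u + v), then invert."""
    K = mask.shape[0]
    A = dct_matrix(K)
    coeffs = A @ mask.astype(float) @ A.T
    u, v = np.indices((K, K))
    order = np.argsort((u + v).ravel(), kind="stable")
    keep = np.zeros(K * K, dtype=bool)
    keep[order[:n_coeffs]] = True
    truncated = np.where(keep.reshape(K, K), coeffs, 0.0)
    return A.T @ truncated @ A   # inverse DCT of the truncated coefficients

# Usage: a synthetic disk mask on a 32 x 32 grid.
yy, xx = np.mgrid[:32, :32]
disk = ((yy - 16) ** 2 + (xx - 16) ** 2 <= 100).astype(float)

exact = truncated_reconstruction(disk, 32 * 32)       # all coefficients: lossless
approx = truncated_reconstruction(disk, 64) > 0.5     # 64 coefficients, thresholded
```

Since the transform is orthonormal, the squared reconstruction error equals the energy of the discarded coefficients, so keeping more coefficients in the same order can never increase the error.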

2. Algorithms and Architectural Integration

DCT-based mask representation has been incorporated into diverse frameworks spanning segmentation, video tracking, compressed sensing, deep neural network compression, and advanced compression architectures.

Instance Segmentation: In DCT-Mask (Shen et al., 2020), high-resolution masks (typically $128\times 128$ ) are DCT-encoded and only the leading $N$ coefficients are predicted by the neural network, replacing the cumbersome per-pixel binary grid paradigm. Decoding is performed via IDCT, possibly followed by resizing or thresholding.
Patchwise Refinement: PatchDCT (Wen et al., 2023) addresses the global nature of DCT basis functions—where an adjustment in one coefficient affects the entire reconstructed mask—by dividing the spatial mask into non-overlapping patches, DCT-encoding each, and selectively refining “mixed” (i.e., boundary) patches via targeted regression.
Compressed Sensing: In MRI reconstruction, a binary mask $M$ applied in the DCT domain selects a subset of coefficients, $y = M \odot \mathrm{DCT}(x)$, as the compressed measurements. This exploits the clustering of energy in the low-frequency corner of the coefficient matrix to improve CS recovery (Hot et al., 2015).
CNN Activation Compression: DCT-CM (Shi et al., 2021) applies 1D-DCT along the channel dimension of neural activations, followed by a mask that zeros out high-frequency components, boosting sparsity and compression without substantial accuracy loss.
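The patchwise idea can be illustrated with a small NumPy sketch. It is only loosely modeled on the description above and is not the PatchDCT network, which predicts patch classes and regresses coefficient vectors with learned heads. The function names, the integer label convention, and the $u+v$ coefficient ranking are all assumptions made for illustration.

```python
import numpy as np

def dct_matrix(K):
    """Orthonormal DCT-II matrix of size K x K."""
    u = np.arange(K)[:, None]
    x = np.arange(K)[None, :]
    A = np.sqrt(2.0 / K) * np.cos((2 * x + 1) * u * np.pi / (2 * K))
    A[0, :] /= np.sqrt(2.0)
    return A

def split_patches(mask, p):
    """Split a square K x K mask (K divisible by p) into non-overlapping p x p patches."""
    g = mask.shape[0] // p
    return mask.reshape(g, p, g, p).swapaxes(1, 2).reshape(g * g, p, p)

def classify_patches(patches):
    """Label each binary patch: 0 = background, 1 = foreground, 2 = mixed (boundary)."""
    labels = np.full(len(patches), 2)
    labels[patches.max(axis=(1, 2)) == 0] = 0
    labels[patches.min(axis=(1, 2)) == 1] = 1
    return labels

def encode_mixed(patches, labels, n_p):
    """DCT-encode only the mixed patches, keeping n_p low-frequency coefficients each."""
    p = patches.shape[-1]
    A = dct_matrix(p)
    coeffs = A @ patches[labels == 2].astype(float) @ A.T   # batched 2D DCT
    u, v = np.indices((p, p))
    order = np.argsort((u + v).ravel(), kind="stable")[:n_p]
    return coeffs.reshape(-1, p * p)[:, order]

# Usage: a 16 x 16 mask whose boundary falls inside the right-hand patches.
mask = np.zeros((16, 16))
mask[:, :12] = 1.0
patches = split_patches(mask, 8)
labels = classify_patches(patches)
vectors = encode_mixed(patches, labels, n_p=6)
```

Uniform patches need no coefficients at all, so the representation budget is spent only where the boundary lies, and a change to one patch's coefficients cannot disturb the rest of the mask.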

3. Efficiency, Compression, and Compactness

DCT-based approaches substantially reduce storage and computational cost because mask energy concentrates in a few low-frequency coefficients, leaving most of the remaining coefficients close to zero.

Approach	Main Compression Mechanism	Output Dimensionality	Key Result
DCT-Mask	Truncate DCT vector	$N \ll K^2$	$+1.3\%$ AP on COCO
PatchDCT	Patchwise DCT + local refinement	Local $N_p$ per patch	$+2.0\%$ AP (COCO)
DCT-CM	1D-DCT + mask on channel dim.	Masked vector	$2.9\times$ compression
TextDCT	DCT vector for text shapes	$N \ll K^2$	$>85\%$ F-measure

In this literature, "mask" carries two senses. Besides the spatial support being encoded, it often denotes a binary or soft mask over the coefficients that selects which of them are kept, implementing either hard truncation (as in DCT-Mask and PatchDCT) or soft regularization (as in compressed sensing or CNN compression). Practical implementations consistently retain only tens to hundreds of coefficients for masks of thousands of pixels with minimal loss in boundary fidelity (IoU drops of 1–2% for more than 90% reduction in representation dimensionality).
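The arithmetic behind these dimensionality figures is straightforward to reproduce. In the sketch below, `dimensionality_reduction` and `retained_energy` are hypothetical helper names, the choice of 300 coefficients for a $128 \times 128$ mask is an illustrative value rather than one stated in this article, and coefficients are ranked by $u+v$ as a simple low-frequency ordering.

```python
import numpy as np

def dct_matrix(K):
    """Orthonormal DCT-II matrix of size K x K."""
    u = np.arange(K)[:, None]
    x = np.arange(K)[None, :]
    A = np.sqrt(2.0 / K) * np.cos((2 * x + 1) * u * np.pi / (2 * K))
    A[0, :] /= np.sqrt(2.0)
    return A

def dimensionality_reduction(K, n_coeffs):
    """Fraction of the K*K grid entries that the truncated vector no longer stores."""
    return 1.0 - n_coeffs / (K * K)

def retained_energy(mask, n_coeffs):
    """Share of total squared-coefficient energy in the n_coeffs lowest frequencies.

    Assumes a non-empty mask; an all-zero mask has zero total energy.
    """
    K = mask.shape[0]
    A = dct_matrix(K)
    energy = (A @ mask.astype(float) @ A.T) ** 2
    u, v = np.indices((K, K))
    order = np.argsort((u + v).ravel(), kind="stable")[:n_coeffs]
    return energy.ravel()[order].sum() / energy.sum()

# Illustrative numbers: 300 coefficients for a 128 x 128 mask (16384 pixels)
# correspond to a reduction of 1 - 300/16384, i.e. about 98.2%.
reduction = dimensionality_reduction(128, 300)

# Energy compaction on a synthetic disk mask.
yy, xx = np.mgrid[:64, :64]
disk = ((yy - 32) ** 2 + (xx - 32) ** 2 <= 400).astype(float)
fractions = [retained_energy(disk, n) for n in (1, 16, 64, 256)]
```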

4. Applications Across Domains

DCT-based mask representations have been adopted in:

Instance and Semantic Segmentation: DCT-Mask and PatchDCT encode object masks as DCT vectors, with significant AP gains on COCO, LVIS, and Cityscapes over binary grid approaches (Shen et al., 2020, Wen et al., 2023).
Text Detection: TextDCT utilizes DCT to compactly encode arbitrary-shape text masks for real-time scene text detection, achieving state-of-the-art accuracy/FPS trade-offs (Su et al., 2022).
Video and 3D Medical Imaging: 3D-DCT enables efficient coding of segmentation masks and volumetric annotations, with high PSNR and substantial compression ratios reported in CT scan applications (Martin-Rodriguez et al., 2023).
Steganography: DCT-masked insertion of hidden images in color subband components (HH, after DWT) enables imperceptible and secure multi-image hiding; masks describe the coefficient selection indices (Bhattacharya et al., 2012).
Lidar Mapping: DCT maps for lidar encoding offer a dense, differentiable, memory-efficient alternative to grid-based or GP-maps, with the DCT basis defining spatial mask-like permeability fields (Schaefer et al., 2019).
Compressed Sensing and MRI: Masked DCT selection in frequency space yields higher-quality and more compressible acquisition strategies for MRI, outperforming traditional center-oriented DFT masks in some regimes (Hot et al., 2015).
Face Recognition: Zonal masks applied to 2D-DCT allow selection of the most informative coefficients for robust, illumination-invariant recognition (Faundez-Zanuy, 2022).

5. Limitations, Trade-offs, and Implementation Constraints

Global Support: Changes to any DCT coefficient globally affect the spatial reconstruction. This is mitigated in PatchDCT by localizing DCT operations and refinement (Wen et al., 2023).
Complexity of High-Resolution or Volumetric Data: 3D-DCT incurs higher computational cost than 2D-DCT; separable algorithms and precomputation are necessary for practical deployment (Martin-Rodriguez et al., 2023, Li et al., 2012).
Coefficient Selection/Mask Design: The optimal number and pattern (e.g., zig-zag, square, sector) of coefficients—which maximizes information retention and task-specific fidelity—is nontrivial and often data- or task-dependent (Shen et al., 2020, Faundez-Zanuy, 2022).
Quantization and Scaling: Careful design of quantization matrices/cubes is needed to avoid artifacts and manage dynamic range, especially in 3D applications or lossy compression (Martin-Rodriguez et al., 2023).
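The coefficient-selection patterns mentioned above can be expressed as boolean zones over the $(u, v)$ coefficient grid. This sketch assumes particular definitions: "triangle" approximates a zig-zag cut by keeping anti-diagonals with $u + v$ below a threshold, "square" keeps the top-left block, and "sector" is interpreted as a quarter-circle around the DC term. The function names and the `size` parameter are hypothetical.

```python
import numpy as np

def zonal_mask(K, pattern, size):
    """Boolean K x K zone selecting low-frequency DCT coefficients."""
    u, v = np.indices((K, K))
    if pattern == "triangle":    # zig-zag-like cut: anti-diagonals with u + v < size
        return u + v < size
    if pattern == "square":      # top-left size x size block
        return (u < size) & (v < size)
    if pattern == "sector":      # quarter-circle of radius `size` around the DC term
        return u ** 2 + v ** 2 < size ** 2
    raise ValueError(f"unknown pattern: {pattern}")

def apply_zone(coeffs, zone):
    """Zero every coefficient outside the selected zone."""
    return np.where(zone, coeffs, 0.0)

# Usage: compare how many coefficients each pattern keeps on an 8 x 8 grid.
counts = {name: int(zonal_mask(8, name, 3).sum())
          for name in ("triangle", "square", "sector")}
```

Comparing patterns at an equal coefficient budget, rather than an equal `size`, is the fairer way to judge which zone retains the most task-relevant information.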

6. Advanced Considerations: Differentiability, Optimization, and Hardware

DCT-based mask representations possess several desirable optimization and hardware properties:

Differentiability: The DCT and its continuous extensions are differentiable, enabling gradient-based optimization (e.g., in DCT map encoding for SLAM or lidar reflectivity mapping) (Schaefer et al., 2019).
Arithmetic Efficiency and Multiplierless Approximations: Current research focuses on low-complexity approximate DCTs for resource-constrained environments, producing efficient hardware-friendly mask encoding without a significant loss in coding gain or representation quality (Silveira et al., 2022).
Integration with Convolutional Operations: DCT basis transformations can sometimes be absorbed into convolutional weight matrices to eliminate explicit inverse transforms during inference (e.g., $w^* = WA^T$ in DCT-CM (Shi et al., 2021)), enabling efficient end-to-end deployment on hardware accelerators.
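The weight-folding identity $w^* = WA^T$ can be verified numerically for the simplest case. The sketch below assumes a $1 \times 1$ convolution (a per-pixel matrix multiply over channels), an orthonormal channel-wise DCT matrix $A$, and a hard mask that zeros the upper half of the channel coefficients; these choices, the variable names, and the tensor sizes are illustrative assumptions and may differ from the DCT-CM configuration.

```python
import numpy as np

def dct_matrix(C):
    """Orthonormal DCT-II matrix of size C x C (applied along the channel axis)."""
    k = np.arange(C)[:, None]
    c = np.arange(C)[None, :]
    A = np.sqrt(2.0 / C) * np.cos((2 * c + 1) * k * np.pi / (2 * C))
    A[0, :] /= np.sqrt(2.0)
    return A

rng = np.random.default_rng(0)
c_in, c_out, height, width = 16, 8, 4, 4
x = rng.standard_normal((c_in, height, width))   # activations, channels first
weight = rng.standard_normal((c_out, c_in))      # 1x1 convolution weights W

A = dct_matrix(c_in)
coeffs = np.einsum("kc,chw->khw", A, x)          # 1D DCT along the channel axis
coeffs[c_in // 2:] = 0.0                         # mask out high-frequency coefficients

# Path 1: explicit inverse DCT, then the 1x1 convolution.
x_hat = np.einsum("ck,khw->chw", A.T, coeffs)
y_explicit = np.einsum("oc,chw->ohw", weight, x_hat)

# Path 2: fold the inverse transform into the weights, w* = W A^T.
w_star = weight @ A.T
y_folded = np.einsum("ok,khw->ohw", w_star, coeffs)
```

Both paths give the same output because the inverse DCT is linear, so the folded weights can be computed once offline and the compressed coefficients consumed directly at inference time.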

7. Outlook and Research Directions

Recent developments underscore the versatility of DCT-based mask representations for scalable, high-fidelity, and computationally tractable modeling of spatial, temporal, and volumetric regions of interest. Notably, multi-stage frameworks such as PatchDCT (Wen et al., 2023) have demonstrated substantial improvements in boundary localization, while generalized DCT-masked compressed sensing and CNN activations compression promise further cross-domain impact.

A plausible implication is that continued refinement of coefficient selection criteria, integration with other basis transforms (e.g., wavelets, learned basis), and hardware co-design will yield further advances in both accuracy and efficiency for mask-centric vision and signal processing tasks.

Key references: (Shen et al., 2020, Wen et al., 2023, Su et al., 2022, Schaefer et al., 2019, Li et al., 2012, Martin-Rodriguez et al., 2023, Shi et al., 2021, Hot et al., 2015, Amin-Naji et al., 2017, Faundez-Zanuy, 2022, Silveira et al., 2022, Bhattacharya et al., 2012).