- The paper demonstrates that integrating classical transforms, especially WHT, into ResNet-50 improves accuracy by over 13 percentage points compared to the baseline.
- It employs custom TensorFlow/Keras layers at multiple stages to enable efficient frequency-domain feature extraction while maintaining low memory overhead.
- The WHT model notably achieves 79.3% accuracy with drastic energy savings, highlighting its suitability for resource-constrained deep learning applications.
The paper systematically examines the integration of classical signal processing transformations—Fast Fourier Transform (FFT), Discrete Cosine Transform (DCT), and Walsh-Hadamard Transform (WHT)—into the ResNet-50 architecture, with a focus on assessing computational efficiency, energy consumption, and classification accuracy on the CIFAR-100 benchmark.
Motivation and Context
The study is situated within the context of increasingly stringent computational and energy constraints inherent in modern deep learning applications, particularly in large-scale models and embedded AI systems. The authors emphasize that reducing the computational and energy footprint of CNNs has critical implications for the sustainability and deployability of AI solutions, especially as energy becomes a primary operational bottleneck.
Implementation Details
The methodology is characterized by the systematic implementation of each transformation as a custom TensorFlow/Keras layer. The transforms are incorporated at three distinct locations in the ResNet-50 pipeline: after the input layer, following early convolutional layers, and at both early and later convolutional stages, enabling both direct and hierarchical feature transformation in the frequency domain.
FFT Layer
Implemented using TensorFlow's tf.signal.fft2d, the FFT is applied on the last two axes (height, width) after casting the input to complex values. Only the magnitude is propagated to subsequent layers, discarding the phase information. Empirically, this information loss impacts learning capacity, particularly for natural images, where phase is critical for reconstructing spatial structure.
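The effect of keeping only the magnitude can be sketched framework-agnostically in NumPy (the paper's layer uses `tf.signal.fft2d`; the function name `fft_magnitude` here is illustrative, not from the paper). The sketch also shows why discarding phase loses spatial information: a circular shift changes only the phase, so shifted inputs become indistinguishable.

```python
import numpy as np

def fft_magnitude(x):
    """Sketch of the described FFT layer: 2-D FFT over the last two
    (spatial) axes, propagating only the magnitude and discarding phase."""
    spectrum = np.fft.fft2(x, axes=(-2, -1))  # complex-valued spectrum
    return np.abs(spectrum)                   # magnitude only

# A batch of 4 single-channel 8x8 "images"
batch = np.random.randn(4, 8, 8)
out = fft_magnitude(batch)

# A circular shift alters phase but not magnitude, so two spatially
# different inputs map to identical layer outputs.
shifted = np.roll(batch, shift=2, axis=-1)
```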
DCT Layer
The custom 2D DCT layer leverages sequential applications of tf.signal.dct on the width and height axes. This approach preserves orthogonality and yields compact representations, concentrating energy in the low-frequency coefficients, an effect exploited in compression scenarios. Tests confirmed energy preservation via Parseval's theorem, indicating mathematical correctness.
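The sequential 1-D construction and the Parseval-style energy check can be sketched with SciPy's DCT (the paper uses `tf.signal.dct`; `scipy.fft.dct` with `norm='ortho'` is the orthonormal equivalent used here for illustration):

```python
import numpy as np
from scipy.fft import dct

def dct2d(x):
    """2-D DCT built from two sequential 1-D passes (width, then
    height), mirroring the layer described in the paper. With
    norm='ortho', each pass is orthonormal."""
    x = dct(x, axis=-1, norm='ortho')  # along width
    x = dct(x, axis=-2, norm='ortho')  # along height
    return x

x = np.random.randn(2, 16, 16)
y = dct2d(x)

# Parseval check: an orthonormal transform preserves total energy.
assert np.allclose(np.sum(x**2), np.sum(y**2))
```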
WHT Layer
As there is no direct WHT support in TensorFlow, the authors implemented a recursive Hadamard matrix construction using Sylvester's method, normalized by matrix dimensions, and applied via batched matrix multiplication. The transform is executed separately across the spatial axes, and implemented efficiently via batch reshaping and broadcasting. Inverse transforms are implemented analogously for correctness assessment.
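The steps described above (Sylvester construction, dimension-based normalization, batched matrix multiplication across the spatial axes) can be sketched in NumPy as follows; function names are illustrative, assuming power-of-two spatial dimensions:

```python
import numpy as np

def hadamard(n):
    """Sylvester's recursive construction of a Hadamard matrix,
    normalized so the result is orthonormal. n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def wht2d(x):
    """2-D WHT over the spatial axes of a (batch, H, W) array,
    applied as batched matrix multiplications; NumPy broadcasting
    handles the batch axis."""
    Hh = hadamard(x.shape[-2])
    Hw = hadamard(x.shape[-1])
    return Hh @ x @ Hw

x = np.random.randn(3, 8, 8)
y = wht2d(x)

# The normalized Hadamard matrix is symmetric and orthogonal, so the
# transform is its own inverse: applying it twice recovers the input,
# which is how correctness can be assessed.
assert np.allclose(wht2d(y), x)
```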
Model Variants
Three model architectures are evaluated for each transformation:
- Transform after Input: Acts as preprocessing, transforming images before any convolution.
- Transform after Early Convolutions: Intended to allow initial spatial feature learning, followed by frequency-domain processing.
- Transforms at Both Early and Late Stages: Seeks to combine effects, capturing both low-level and high-level frequency features.
All models are trained on CIFAR-100; full fine-tuning is conducted due to the incompatibility of pre-trained weights with transformed representations.
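The three placements can be expressed as function compositions. The sketch below uses hypothetical placeholder stages (`early_convs`, `late_convs` are stand-ins, not the paper's ResNet-50 blocks) purely to make the wiring of each variant concrete:

```python
import numpy as np

def wht2d(x):
    """Orthonormal 2-D WHT over the last two axes (Sylvester construction)."""
    def H(n):
        M = np.array([[1.0]])
        while M.shape[0] < n:
            M = np.block([[M, M], [M, -M]])
        return M / np.sqrt(n)
    return H(x.shape[-2]) @ x @ H(x.shape[-1])

# Hypothetical stand-ins for convolutional stages; in the paper these
# are ResNet-50 blocks.
early_convs = lambda x: np.maximum(x, 0)  # placeholder early stage
late_convs = lambda x: np.maximum(x, 0)   # placeholder late stage

def variant_after_input(x):   # transform acts as preprocessing
    return late_convs(early_convs(wht2d(x)))

def variant_after_early(x):   # spatial features first, then frequency domain
    return late_convs(wht2d(early_convs(x)))

def variant_dual(x):          # transforms at both early and late stages
    return wht2d(late_convs(wht2d(early_convs(x))))

x = np.random.randn(2, 8, 8)
outs = [f(x) for f in (variant_after_input, variant_after_early, variant_dual)]
```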
Experimental Environment
Experiments are conducted exclusively on an NVIDIA A100 GPU, a choice driven by the significant memory requirements of ResNet-50 and the high-resolution transformations. Resource utilization (GPU power, memory, and time) is monitored using NVML, enabling accurate efficiency measurements.
Empirical Results
Accuracy
Comparative accuracy for each configuration is summarized below:
| Model | Test Accuracy (%) | Energy (kJ) |
|---|---|---|
| ResNet-50 (Baseline) | 66.1 | 25,606 |
| FFT (best variant) | 61.1 | 101,254 |
| DCT (best variant) | 62.6 | 474 |
| WHT (best variant) | 79.3 | 26.4 |
- The WHT-augmented model (with transformations at both early and late layers) achieved 79.3% test accuracy, representing a substantial gain over baseline ResNet-50 and all other transformed model variants.
- DCT and FFT models failed to match baseline accuracy and, in many configurations (especially FFT at input), exhibited unstable convergence or outright failure to learn.
Computational Efficiency
- Energy Consumption: The WHT model consumed only ~39 kJ per inference on average, an extreme reduction relative to all other variants (including a >600x reduction compared to the baseline in some configurations).
- Memory Utilization and Training Time: WHT layers incur negligible additional memory overhead and, due to their O(n log n) computational complexity, do not hinder throughput.
- Overfitting Mitigation: WHT integration (especially with dual placement) led to improved generalization and less overfitting, as measured by reduced train-test gap.
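The O(n log n) figure corresponds to the fast Walsh-Hadamard transform's butterfly structure: log2(n) stages of n/2 add/subtract pairs, versus O(n^2) for a dense matrix multiply. A minimal sketch (function names illustrative):

```python
import numpy as np

def fwht(v):
    """Fast Walsh-Hadamard transform of a length-2^k vector:
    log2(n) butterfly stages of n/2 add/subtract pairs each,
    i.e. O(n log n) operations total."""
    x = v.astype(float).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal scaling

def hadamard(n):
    """Dense Sylvester-construction reference for comparison."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

v = np.random.randn(16)
# The butterfly result agrees with the dense matrix-multiply version.
assert np.allclose(fwht(v), hadamard(16) @ v)
```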
Analysis and Interpretation
The WHT outperforms both FFT and DCT when tightly integrated into modern CNNs for vision, offering superior energy efficiency and a significant accuracy boost—even outperforming the conventional spatial-domain ResNet-50 on CIFAR-100 by more than 13 percentage points. This result is especially notable given the general trend that dimensionality-reducing or frequency-domain transformations tend to harm accuracy on “natural” image data, unless coupled with meticulously tuned architectures.
The merits of WHT likely arise from its ability to capture both global (low-frequency) and local (high-frequency) features in a computationally inexpensive manner. Unlike FFT, the WHT does not introduce complex numbers nor does it discard phase information, which may account for superior information retention and feature locality in deep CNNs. The lack of additional memory consumption and the efficient implementation also make it highly compelling for edge and embedded applications.
Practical Implications
The demonstrated combination of accuracy improvement and drastic power reduction highlights the suitability of WHT-augmented CNNs for deployment in resource-constrained scenarios, such as on-device inference, robotics, autonomous vehicles, and IoT applications, where power, speed, and model size are primary constraints.
In practical terms, integrating a WHT layer can be accomplished as a simple wrapper at the model’s input pipeline (using the provided recursive Hadamard construction), requiring minimal modification to existing training and inference code. For TensorFlow-based pipelines, the transformation can be encapsulated in a custom tf.keras.layers.Layer as per the provided algorithms.
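A minimal sketch of such a wrapper, assuming TensorFlow is installed; the class name `WHT2D` and the einsum-based spatial application are illustrative choices, not the paper's exact code:

```python
import numpy as np
import tensorflow as tf

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's recursive
    construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

class WHT2D(tf.keras.layers.Layer):
    """Hypothetical wrapper layer: 2-D WHT over the spatial axes of a
    (batch, height, width, channels) tensor."""
    def build(self, input_shape):
        self.Hh = tf.constant(hadamard(int(input_shape[1])), dtype=tf.float32)
        self.Hw = tf.constant(hadamard(int(input_shape[2])), dtype=tf.float32)

    def call(self, x):
        x = tf.einsum('ij,bjwc->biwc', self.Hh, x)      # transform height axis
        return tf.einsum('ij,bhjc->bhic', self.Hw, x)   # transform width axis

# Drop-in use at the front of an input pipeline:
x = tf.random.normal((2, 8, 8, 3))
y = WHT2D()(x)
```

Because the layer is stateless (no trainable weights), it adds no parameters to the model and can be prepended to an existing architecture without altering the training loop.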
Limitations
Several constraints remain:
- Generality: While results are compelling for CIFAR-100 and ResNet-50, transferability to high-resolution, large-scale datasets (e.g., ImageNet) or to architectures optimized for different properties (ViT, MobileNet, etc.) remains to be systematically demonstrated.
- Data Modality: All experiments are on image data; efficacy for audio, text, and multimodal inputs is speculative but plausible given the generality of the WHT.
- Specialization: Substantial accuracy improvements are also likely due to the alignment of WHT’s transform properties with the chosen dataset and architecture. Hyperparameter tuning, transformation placement, and domain adaptation require further study for general adoption.
Future Directions
Potential avenues for extending this line of research include:
- Application to Lighter Models: Porting WHT integration to compact architectures (e.g., MobileNet, EfficientNet) for ultra-low-power scenarios.
- Extension to Other Data Modalities: Evaluation of WHT layers in audio spectrogram analysis, graph neural networks, and time-series forecasting.
- Investigation of Fast Walsh-Hadamard Transform (FWHT): Assessing further optimization for real-time and on-device deployment.
- Joint Optimization: Learning to adapt transform parameters (e.g., block size, domain adaptation) jointly with the network via gradient-based optimization.
Theoretical Implications
The results provide empirical evidence that classic orthogonal transforms can be more tightly coupled with modern deep learning pipelines beyond traditional preprocessing, challenging the prevailing assumption that such transforms merely substitute for learnable convolutions. The demonstrated performance differential between WHT and DCT/FFT motivates further theoretical work on analyzing what aspects of transformation orthogonality, sparsity, and numerical properties most benefit data-starved or efficiency-oriented deep learning regimes.
Conclusion
The integration of the Walsh-Hadamard Transform at both early and late stages of a ResNet-50 architecture constitutes a highly effective mechanism for improving both the accuracy and computational efficiency of CNNs on complex vision benchmarks. These findings have immediate practical applicability for the design of power- and resource-aware deep learning systems and open several avenues for further research into domain-informed network design.