- The paper demonstrates that integrating classical transforms, especially WHT, into ResNet-50 improves accuracy by over 13 percentage points compared to the baseline.
- It employs custom TensorFlow/Keras layers at multiple stages to enable efficient frequency-domain feature extraction while maintaining low memory overhead.
- The WHT model notably achieves 79.3% accuracy with drastic energy savings, highlighting its suitability for resource-constrained deep learning applications.
The paper systematically examines the integration of classical signal processing transformations—Fast Fourier Transform (FFT), Discrete Cosine Transform (DCT), and Walsh-Hadamard Transform (WHT)—into the ResNet-50 architecture, with a focus on assessing computational efficiency, energy consumption, and classification accuracy on the CIFAR-100 benchmark.
Motivation and Context
The study is situated within the context of increasingly stringent computational and energy constraints inherent in modern deep learning applications, particularly in large-scale models and embedded AI systems. The authors emphasize that reducing the computational and energy footprint of CNNs has critical implications for the sustainability and deployability of AI solutions, especially as energy becomes a primary operational bottleneck.
Implementation Details
The methodology is characterized by the systematic implementation of each transformation as a custom TensorFlow/Keras layer. The transforms are incorporated at three distinct locations in the ResNet-50 pipeline: after the input layer, following early convolutional layers, and at both early and later convolutional stages, enabling both direct and hierarchical feature transformation in the frequency domain.
FFT Layer
Implemented using TensorFlow's tf.signal.fft2d, the FFT is applied on the last two axes (height, width) after casting the input to complex values. Only the magnitude is propagated to subsequent layers, discarding the phase information. Empirically, this information loss impacts learning capacity, particularly for natural images, where phase is critical for reconstructing spatial structure.
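The effect of keeping only the magnitude can be sketched framework-agnostically in NumPy (the paper's layer uses `tf.signal.fft2d`; the function name `fft_magnitude` here is illustrative, not from the paper). The sketch also shows why discarding phase loses spatial information: a circular shift changes only the phase, so shifted inputs become indistinguishable.

```python
import numpy as np

def fft_magnitude(x):
    """Sketch of the described FFT layer: 2-D FFT over the last two
    (spatial) axes, propagating only the magnitude and discarding phase."""
    spectrum = np.fft.fft2(x, axes=(-2, -1))  # complex-valued spectrum
    return np.abs(spectrum)                   # magnitude only

# A batch of 4 single-channel 8x8 "images"
batch = np.random.randn(4, 8, 8)
out = fft_magnitude(batch)

# A circular shift alters phase but not magnitude, so two spatially
# different inputs map to identical layer outputs.
shifted = np.roll(batch, shift=2, axis=-1)
```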
DCT Layer
The custom 2D DCT layer leverages sequential applications of tf.signal.dct on the width and height axes. This approach preserves orthogonality and yields compact representations, concentrating energy in the low-frequency coefficients, an effect exploited in compression scenarios. Tests confirmed energy preservation via Parseval's theorem, indicating mathematical correctness.
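The sequential 1-D construction and the Parseval-style energy check can be sketched with SciPy's DCT (the paper uses `tf.signal.dct`; `scipy.fft.dct` with `norm='ortho'` is the orthonormal equivalent used here for illustration):

```python
import numpy as np
from scipy.fft import dct

def dct2d(x):
    """2-D DCT built from two sequential 1-D passes (width, then
    height), mirroring the layer described in the paper. With
    norm='ortho', each pass is orthonormal."""
    x = dct(x, axis=-1, norm='ortho')  # along width
    x = dct(x, axis=-2, norm='ortho')  # along height
    return x

x = np.random.randn(2, 16, 16)
y = dct2d(x)

# Parseval check: an orthonormal transform preserves total energy.
assert np.allclose(np.sum(x**2), np.sum(y**2))
```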
WHT Layer
As there is no direct WHT support in TensorFlow, the authors implemented a recursive Hadamard matrix construction using Sylvester's method, normalized by matrix dimensions, and applied via batched matrix multiplication. The transform is executed separately across the spatial axes, and implemented efficiently via batch reshaping and broadcasting. Inverse transforms are implemented analogously for correctness assessment.
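The steps described above (Sylvester construction, dimension-based normalization, batched matrix multiplication across the spatial axes) can be sketched in NumPy as follows; function names are illustrative, assuming power-of-two spatial dimensions:

```python
import numpy as np

def hadamard(n):
    """Sylvester's recursive construction of a Hadamard matrix,
    normalized so the result is orthonormal. n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def wht2d(x):
    """2-D WHT over the spatial axes of a (batch, H, W) array,
    applied as batched matrix multiplications; NumPy broadcasting
    handles the batch axis."""
    Hh = hadamard(x.shape[-2])
    Hw = hadamard(x.shape[-1])
    return Hh @ x @ Hw

x = np.random.randn(3, 8, 8)
y = wht2d(x)

# The normalized Hadamard matrix is symmetric and orthogonal, so the
# transform is its own inverse: applying it twice recovers the input,
# which is how correctness can be assessed.
assert np.allclose(wht2d(y), x)
```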
Model Variants
Three model architectures are evaluated for each transformation:
- Transform after Input: Acts as preprocessing, transforming images before any convolution.
- Transform after Early Convolutions: Intended to allow initial spatial feature learning, followed by frequency-domain processing.
- Transforms at Both Early and Late Stages: Seeks to combine effects, capturing both low-level and high-level frequency features.
All models are trained on CIFAR-100; full fine-tuning is conducted due to the incompatibility of pre-trained weights with transformed representations.
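The three placements can be expressed as function compositions. The sketch below uses hypothetical placeholder stages (`early_convs`, `late_convs` are stand-ins, not the paper's ResNet-50 blocks) purely to make the wiring of each variant concrete:

```python
import numpy as np

def wht2d(x):
    """Orthonormal 2-D WHT over the last two axes (Sylvester construction)."""
    def H(n):
        M = np.array([[1.0]])
        while M.shape[0] < n:
            M = np.block([[M, M], [M, -M]])
        return M / np.sqrt(n)
    return H(x.shape[-2]) @ x @ H(x.shape[-1])

# Hypothetical stand-ins for convolutional stages; in the paper these
# are ResNet-50 blocks.
early_convs = lambda x: np.maximum(x, 0)  # placeholder early stage
late_convs = lambda x: np.maximum(x, 0)   # placeholder late stage

def variant_after_input(x):   # transform acts as preprocessing
    return late_convs(early_convs(wht2d(x)))

def variant_after_early(x):   # spatial features first, then frequency domain
    return late_convs(wht2d(early_convs(x)))

def variant_dual(x):          # transforms at both early and late stages
    return wht2d(late_convs(wht2d(early_convs(x))))

x = np.random.randn(2, 8, 8)
outs = [f(x) for f in (variant_after_input, variant_after_early, variant_dual)]
```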
Experimental Environment
Experiments are conducted exclusively on an NVIDIA A100 GPU, a choice driven by the significant memory requirements of ResNet-50 and the high-resolution transformations. Resource utilization (GPU power, memory, and time) is monitored using NVML, enabling accurate efficiency measurements.
Empirical Results
Accuracy
Comparative accuracy for each configuration is summarized below:
| Model | Test Accuracy (%) | Energy (kJ) |
|---|---|---|
| ResNet-50 (Baseline) | 66.1 | 25,606 |
| FFT (best variant) | 61.1 | 101,254 |
| DCT (best variant) | 62.6 | 474 |
| WHT (best variant) | 79.3 | 26.4 |
- The WHT-augmented model (with transformations at both early and late layers) achieved 79.3% test accuracy, representing a substantial gain over baseline ResNet-50 and all other transformed model variants.
- DCT and FFT models failed to match baseline accuracy and, in many configurations (especially FFT at input), exhibited unstable convergence or outright failure to learn.
Computational Efficiency
- Energy Consumption: The WHT model consumed only ~39 kJ per inference on average, an extreme reduction relative to all other variants (including a >600x reduction compared to the baseline in some configurations).
- Memory Utilization and Training Time: WHT layers incur negligible additional memory overhead and, due to their O(n log n) computational complexity, do not hinder throughput.
- Overfitting Mitigation: WHT integration (especially with dual placement) led to improved generalization and less overfitting, as measured by reduced train-test gap.
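The O(n log n) figure corresponds to the fast Walsh-Hadamard transform's butterfly structure: log2(n) stages of n/2 add/subtract pairs, versus O(n^2) for a dense matrix multiply. A minimal sketch (function names illustrative):

```python
import numpy as np

def fwht(v):
    """Fast Walsh-Hadamard transform of a length-2^k vector:
    log2(n) butterfly stages of n/2 add/subtract pairs each,
    i.e. O(n log n) operations total."""
    x = v.astype(float).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal scaling

def hadamard(n):
    """Dense Sylvester-construction reference for comparison."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

v = np.random.randn(16)
# The butterfly result agrees with the dense matrix-multiply version.
assert np.allclose(fwht(v), hadamard(16) @ v)
```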
Analysis and Interpretation
The WHT outperforms both FFT and DCT when tightly integrated into modern CNNs for vision, offering superior energy efficiency and a significant accuracy boost—even outperforming the conventional spatial-domain ResNet-50 on CIFAR-100 by more than 13 percentage points. This result is especially notable given the general trend that dimensionality-reducing or frequency-domain transformations tend to harm accuracy on “natural” image data, unless coupled with meticulously tuned architectures.
The merits of WHT likely arise from its ability to capture both global (low-frequency) and local (high-frequency) features in a computationally inexpensive manner. Unlike FFT, the WHT does not introduce complex numbers nor does it discard phase information, which may account for superior information retention and feature locality in deep CNNs. The lack of additional memory consumption and the efficient implementation also make it highly compelling for edge and embedded applications.
Practical Implications
The demonstrated combination of accuracy improvement and drastic power reduction highlights the suitability of WHT-augmented CNNs for deployment in resource-constrained scenarios, such as on-device inference, robotics, autonomous vehicles, and IoT applications, where power, speed, and model size are primary constraints.
In practical terms, integrating a WHT layer can be accomplished as a simple wrapper at the model’s input pipeline (using the provided recursive Hadamard construction), requiring minimal modification to existing training and inference code. For TensorFlow-based pipelines, the transformation can be encapsulated in a custom tf.keras.layers.Layer as per the provided algorithms.
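A minimal sketch of such a wrapper, assuming TensorFlow is installed; the class name `WHT2D` and the einsum-based spatial application are illustrative choices, not the paper's exact code:

```python
import numpy as np
import tensorflow as tf

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester's recursive
    construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

class WHT2D(tf.keras.layers.Layer):
    """Hypothetical wrapper layer: 2-D WHT over the spatial axes of a
    (batch, height, width, channels) tensor."""
    def build(self, input_shape):
        self.Hh = tf.constant(hadamard(int(input_shape[1])), dtype=tf.float32)
        self.Hw = tf.constant(hadamard(int(input_shape[2])), dtype=tf.float32)

    def call(self, x):
        x = tf.einsum('ij,bjwc->biwc', self.Hh, x)      # transform height axis
        return tf.einsum('ij,bhjc->bhic', self.Hw, x)   # transform width axis

# Drop-in use at the front of an input pipeline:
x = tf.random.normal((2, 8, 8, 3))
y = WHT2D()(x)
```

Because the layer is stateless (no trainable weights), it adds no parameters to the model and can be prepended to an existing architecture without altering the training loop.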
Limitations
Several constraints remain:
- Generality: While results are compelling for CIFAR-100 and ResNet-50, transferability to high-resolution, large-scale datasets (e.g., ImageNet) or to architectures optimized for different properties (ViT, MobileNet, etc.) remains to be systematically demonstrated.
- Data Modality: All experiments are on image data; efficacy for audio, text, and multimodal inputs is speculative but plausible given the generality of the WHT.
- Specialization: Substantial accuracy improvements are also likely due to the alignment of WHT’s transform properties with the chosen dataset and architecture. Hyperparameter tuning, transformation placement, and domain adaptation require further study for general adoption.
Future Directions
Potential avenues for extending this line of research include:
- Application to Lighter Models: Porting WHT integration to compact architectures (e.g., MobileNet, EfficientNet) for ultra-low-power scenarios.
- Extension to Other Data Modalities: Evaluation of WHT layers in audio spectrogram analysis, graph neural networks, and time-series forecasting.
- Investigation of Fast Walsh-Hadamard Transform (FWHT): Assessing further optimization for real-time and on-device deployment.
- Joint Optimization: Learning to adapt transform parameters (e.g., block size, domain adaptation) jointly with the network via gradient-based optimization.
Theoretical Implications
The results provide empirical evidence that classic orthogonal transforms can be more tightly coupled with modern deep learning pipelines beyond traditional preprocessing, challenging the prevailing assumption that such transforms merely substitute for learnable convolutions. The demonstrated performance differential between WHT and DCT/FFT motivates further theoretical work on analyzing what aspects of transformation orthogonality, sparsity, and numerical properties most benefit data-starved or efficiency-oriented deep learning regimes.
Conclusion
The integration of the Walsh-Hadamard Transform at both early and late stages of a ResNet-50 architecture constitutes a highly effective mechanism for improving both the accuracy and computational efficiency of CNNs on complex vision benchmarks. These findings have immediate practical applicability for the design of power- and resource-aware deep learning systems and open several avenues for further research into domain-informed network design.