
Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection

Published 17 Dec 2020 in stat.ML, cs.AI, and cs.LG | (2012.09390v1)

Abstract: Recent works within machine learning have been tackling inputs of ever-increasing size, with cybersecurity presenting sequence classification problems of particularly extreme lengths. In the case of Windows executable malware detection, inputs may exceed $100$ MB, which corresponds to a time series with $T=100,000,000$ steps. To date, the closest approach to handling such a task is MalConv, a convolutional neural network capable of processing up to $T=2,000,000$ steps. The $\mathcal{O}(T)$ memory of CNNs has prevented further application of CNNs to malware. In this work, we develop a new approach to temporal max pooling that makes the required memory invariant to the sequence length $T$. This makes MalConv $116\times$ more memory efficient, and up to $25.8\times$ faster to train on its original dataset, while removing the input length restrictions to MalConv. We re-invest these gains into improving the MalConv architecture by developing a new Global Channel Gating design, giving us an attention mechanism capable of learning feature interactions across 100 million time steps in an efficient manner, a capability lacked by the original MalConv CNN. Our implementation can be found at https://github.com/NeuromorphicComputationResearchProgram/MalConv2

Citations (46)

Summary


The paper "Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection" addresses the challenge of detecting malware within Windows executables, whose byte sequences may contain up to 100 million time steps. Previous approaches, such as MalConv, a convolutional neural network (CNN) designed for malware detection, were limited by substantial memory requirements that restricted them to sequences of at most 2 million steps.

Key Contributions and Numerical Results

The paper develops a new methodology for temporal max pooling that renders the memory cost invariant to the sequence length $T$, allowing convolutional architectures like MalConv to efficiently process sequences of over 100 million time steps. This advancement significantly improves computational efficiency, making the updated MalConv architecture 116 times more memory efficient and up to 25.8 times faster to train compared to its original implementation.
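The paper's exact fixed-memory pooling is not reproduced in this summary, but the core idea can be sketched: slide overlapping chunks over the raw sequence, convolve each chunk, and fold the activations into a running maximum, so that memory depends on the chunk size rather than on $T$. The function names, single-channel convolution, and chunking scheme below are illustrative assumptions, not the authors' implementation:

```python
def conv1d_valid(x, kernel):
    """Naive single-channel 1-D 'valid' convolution."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def streaming_global_max(x, kernel, chunk=1024):
    """Global max over conv1d_valid(x, kernel) without ever
    materializing the full activation sequence: process input
    chunks and keep only a running maximum, so peak memory is
    O(chunk), constant with respect to len(x)."""
    k = len(kernel)
    best = float("-inf")
    # Advance by chunk - (k - 1) so consecutive input chunks
    # overlap by k - 1 bytes and no convolution window is skipped.
    step = chunk - (k - 1)
    for start in range(0, len(x) - k + 1, step):
        window = x[start:start + chunk]
        for v in conv1d_valid(window, kernel):
            best = max(best, v)
    return best
```

Because max pooling is idempotent, recomputing boundary positions costs nothing in correctness; the result matches a full convolution followed by a global max, while activations for only one chunk are ever held in memory.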

Moreover, the research introduces a novel Global Channel Gating (GCG) design that adds an attention mechanism to MalConv. GCG enables the model to learn feature interactions efficiently across very long sequences, a capability absent from the original architecture, and exploits the sparse gradients inherent in temporal max pooling to remain computationally tractable.
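The paper's GCG is described here only at a high level; as an illustrative sketch of the general pattern, a gating mechanism can pool a global context vector over time, project it, and use the resulting sigmoid gates to scale every time step's channels. The mean pooling, weight shape, and function names below are assumptions for illustration, not the paper's exact design:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def global_channel_gate(features, w_gate):
    """features: T time steps, each a list of C channel values.
    w_gate: C x C projection from the global context to gate logits.
    Returns features with every channel c scaled by a gate in (0, 1)
    computed from a context pooled over the whole sequence."""
    T = len(features)
    C = len(features[0])
    # Global context: mean over time, one value per channel.
    context = [sum(step[c] for step in features) / T for c in range(C)]
    # One sigmoid gate per channel, derived from the global context.
    gates = [sigmoid(sum(w_gate[c][j] * context[j] for j in range(C)))
             for c in range(C)]
    # Gate every time step's channels with the same global gates.
    return [[step[c] * gates[c] for c in range(C)] for step in features]
```

The key property this sketch shares with the summary's description is that each channel's activation at every time step is modulated by information aggregated across the entire sequence, letting distant positions influence one another without pairwise attention cost.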

Implications and Future Directions

The implications of this research are significant for cybersecurity, particularly malware detection. By removing the input length limitations of earlier models, the approach closes a notable evasion vector: malicious payloads positioned beyond the model's sequence processing threshold. The enhanced capacity to discern feature correlations across an entire file may also harden detection against adversarial attacks, thereby improving robustness.

From a theoretical standpoint, the paper extends the exploration of sequence classification tasks into sequences of unprecedented length, hinting at applications beyond cybersecurity. Domains such as genomics—where Genome Wide Association Studies (GWAS) confront similar sequence processing challenges—could benefit from this advancement.

Looking ahead, the approach outlined in this study could stimulate further exploration of efficient architecture designs for long-sequence data beyond the malware domain. Additionally, integrating more sophisticated attention mechanisms, possibly by extending or hybridizing with Transformer models, might enhance representational capacity without sacrificing the computational benefits, advancing deep learning's application to extensive data sets.

Overall, this paper contributes both practical advancements and theoretical insights, promising significant improvements in handling lengthy sequences in neural networks, serving as a solid foundation for future research in sequence-based classification problems.
