Holographic Global Convolutional Networks for Long-Range Prediction Tasks in Malware Detection (2403.17978v1)
Abstract: Malware detection is an interesting and valuable domain to work in because it has significant real-world impact and unique machine-learning challenges. We investigate existing long-range techniques and benchmarks and find that they are poorly suited to this problem area. In this paper, we introduce Holographic Global Convolutional Networks (HGConv), which utilize the properties of Holographic Reduced Representations (HRR) to encode and decode features from sequence elements. Unlike other global convolutional methods, our method requires no intricate kernel computation or crafted kernel design; HGConv kernels are defined as simple parameters learned through backpropagation. The proposed method achieves new SOTA results on the Microsoft Malware Classification Challenge, Drebin, and EMBER malware benchmarks. With log-linear complexity in sequence length, our empirical results show that HGConv runs substantially faster than competing methods and scales far more efficiently, even at sequence lengths $\geq 100{,}000$.
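Because the abstract leans on two mechanisms, HRR binding/unbinding of sequence features and a global convolution whose kernel is a plain learned parameter applied via FFTs, a minimal PyTorch sketch of both is given below. It illustrates the general idea only, under assumed shapes and hyperparameters; the names `hrr_bind`, `hrr_unbind`, and `GlobalConv` are hypothetical and this is not the authors' HGConv implementation.

```python
# Minimal sketch (not the paper's exact layer) of: (1) HRR binding/unbinding via
# FFTs, and (2) a global convolution whose kernel is a directly learned parameter,
# applied along the sequence axis in O(L log L).
import torch
import torch.nn as nn
import torch.fft as fft


def hrr_bind(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Circular convolution: the HRR binding operator (Plate, 1995)."""
    return fft.irfft(fft.rfft(a, dim=-1) * fft.rfft(b, dim=-1), n=a.shape[-1], dim=-1)


def hrr_unbind(s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
    """Circular correlation: approximately recovers b from s = hrr_bind(a, b)."""
    return fft.irfft(torch.conj(fft.rfft(a, dim=-1)) * fft.rfft(s, dim=-1), n=s.shape[-1], dim=-1)


class GlobalConv(nn.Module):
    """Toy global convolution over the sequence axis; the kernel is a raw learned
    parameter (no implicit kernel network or hand-crafted decay). Assumes the
    kernel length equals the input sequence length."""

    def __init__(self, seq_len: int, dim: int):
        super().__init__()
        self.kernel = nn.Parameter(torch.randn(seq_len, dim) / seq_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim). Zero-pad to 2L so the FFT product is a linear
        # (non-circular) convolution, then crop back to the original length.
        L = x.shape[1]
        Xf = fft.rfft(x, n=2 * L, dim=1)
        Kf = fft.rfft(self.kernel, n=2 * L, dim=0)
        return fft.irfft(Xf * Kf.unsqueeze(0), n=2 * L, dim=1)[:, :L]


if __name__ == "__main__":
    x = torch.randn(2, 1024, 64)
    print(GlobalConv(1024, 64)(x).shape)  # torch.Size([2, 1024, 64])
```

Applying the kernel in the Fourier domain is what yields the log-linear scaling claimed in the abstract; the kernel here is simply a parameter updated by backpropagation, in the spirit of the "simple parameters learned through backpropagation" description above.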
- OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv, 2022.
- Recasting self-attention with holographic reduced representations. arXiv preprint arXiv:2305.19534.
- Towards generalization in subitizing with neuro-symbolic loss using holographic reduced representations.
- Deploying convolutional networks on untrusted platforms using 2D holographic reduced representations. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S., editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 367–393. PMLR.
- EMBER: An open dataset for training static PE malware machine learning models. arXiv preprint arXiv:1804.04637.
- Drebin: Effective and explainable detection of Android malware in your pocket. In NDSS, volume 14, pages 23–26.
- Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods, 18(10):1196–1203.
- Layer normalization. arXiv preprint arXiv:1607.06450.
- Understanding uses and misuses of similarity hashing functions for malware detection and family clustering in actual scenarios. Forensic Science International: Digital Investigation, 38:301220.
- mvHash-B - A new approach for similarity preserving hashing. In Proceedings of the 2013 Seventh International Conference on IT Security Incident Management and IT Forensics, IMF ’13, pages 33–44, Washington, DC, USA. IEEE Computer Society.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
- Rethinking attention with Performers. arXiv preprint arXiv:2009.14794.
- The Nucleotide Transformer: Building and evaluating robust foundation models for human genomics. bioRxiv, 2023.
- Dao, T. (2023). Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
- Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
- Language modeling with gated convolutional networks. In International Conference on Machine Learning, pages 933–941. PMLR.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Spiking structured state space model for monaural speech enhancement. arXiv preprint arXiv:2309.03641.
- PyLZJD: An Easy to Use Tool for Machine Learning. In Chris Calloway, David Lippa, Dillon Niederhut, and David Shupe, editors, Proceedings of the 18th Python in Science Conference, pages 101–106.
- A framework for few-shot language model evaluation.
- Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
- HiPPO: Recurrent memory with optimal polynomial projections. Advances in Neural Information Processing Systems, 33:1474–1487.
- Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.
- Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.
- Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.
- AlphaFold 2. Fourteenth Critical Assessment of Techniques for Protein Structure Prediction; DeepMind: London, UK.
- Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR.
- FNet: Mixing tokens with Fourier transforms. arXiv preprint arXiv:2105.03824.
- What makes convolutional models great on long sequence modeling? arXiv preprint arXiv:2210.09298.
- Expediting MRSH-v2 approximate matching with hierarchical Bloom filter trees. In 9th EAI International Conference on Digital Forensics and Cyber Crime (ICDF2C 2017), Prague, Czechia. Springer.
- Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv, 2022:500902.
- Learning long-range spatial dependencies with horizontal gated recurrent units. Advances in Neural Information Processing Systems, 31.
- Structured state space models for in-context reinforcement learning. arXiv preprint arXiv:2303.03982.
- Luna: Linear unified nested attention. Advances in Neural Information Processing Systems, 34:2441–2453.
- Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.
- UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
- MIMONets: Multiple-input-multiple-output neural networks exploiting computation in superposition. Advances in Neural Information Processing Systems (NeurIPS), 36.
- Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. arXiv preprint arXiv:2306.15794.
- Bringing UMAP closer to the speed of light with GPU acceleration. Proceedings of the AAAI Conference on Artificial Intelligence, 35(1):418–426.
- TLSH – a locality sensitive hash. In 2013 Fourth Cybercrime and Trustworthy Computing Workshop, pages 7–13. IEEE.
- Microsoft Malware Classification Challenge (BIG 2015).
- RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048.
- YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.
- Plate, T. A. (1995). Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3):623–641.
- Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866.
- The ACL Anthology Network corpus. Language Resources and Evaluation, 47(4):919–944.
- Classifying sequences of extreme length with constant memory applied to malware detection. Proceedings of the AAAI Conference on Artificial Intelligence, 35(11):9386–9394.
- An alternative to NCD for large sequences, Lempel-Ziv Jaccard distance. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 1007–1015, New York, NY, USA. Association for Computing Machinery.
- Malware classification and class imbalance via stochastic hashed LZJD. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 111–120.
- Lempel-Ziv Jaccard distance, an effective alternative to ssdeep and sdhash. Digital Investigation, 24:34–49.
- A survey of machine learning methods and challenges for Windows malware classification. In NeurIPS 2020 Workshop: ML Retrospectives, Surveys & Meta-Analyses (ML-RSA).
- A New Burrows Wheeler Transform Markov Distance. In The Thirty-Fourth AAAI Conference on Artificial Intelligence.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
- AI and the everything in the whole wide world benchmark. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
- DNArch: Learning convolutional neural architectures by backpropagation. arXiv preprint arXiv:2302.05400.
- Roussev, V. (2009). Building a better similarity trap with statistically improbable features. In Proceedings of the 42nd Hawaii International Conference on System Sciences, HICSS ’09, pages 1–10, Washington, DC, USA. IEEE Computer Society.
- Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
- Lempel-Ziv networks. In Antorán, J., Blaas, A., Feng, F., Ghalebikesabi, S., Mason, I., Pradier, M. F., Rohde, D., Ruiz, F. J. R., and Schein, A., editors, Proceedings on “I Can’t Believe It’s Not Better! - Understanding Deep Learning Through Empirical Falsification” at NeurIPS 2022 Workshops, volume 187 of Proceedings of Machine Learning Research, pages 1–11. PMLR.
- Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006.
- Team, M. N. (2023). Introducing MPT-30B: Raising the bar for open-source foundation models. Accessed: 2023-06-22.
- Attention is all you need. Advances in neural information processing systems, 30.
- Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
- F2s2: Fast forensic similarity search through indexing piecewise hash signatures. Digital Investigation, 10(4):361–371.
- DeepSpeed-VisualChat: Multi-round multi-image interleave chat via multi-modal causal attention. arXiv preprint arXiv:2309.14327.
- Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297.
- ECG synthesis via diffusion-based state space augmented transformer. Sensors, 23(19):8328.
- H-transformer-1d: Fast one-dimensional hierarchical attention for sequences. arXiv preprint arXiv:2107.11906.