MalConv Architecture: Byte-Level Malware Detection

Updated 16 November 2025
  • MalConv architecture is a deep neural model that detects malware from raw byte sequences without relying on manual feature engineering.
  • It employs gated 1-D convolution with parallel base and gate filters and uses constant-memory max-pooling to efficiently process extremely long files.
  • Content-dependent channel gating enhances contextual feature interactions and raises accuracy, while constant-memory pooling yields large reductions in GPU memory and compute time.

MalConv refers to a family of neural network architectures developed to detect malware directly from raw byte sequences of executable files, without domain-specific preprocessing or manual feature engineering. These models are specifically designed to address the extreme length and structureless nature of binary executables, processing inputs of up to hundreds of megabytes (hundreds of millions of bytes) with linear computational and memory requirements. The architectural innovations center on byte-level embedding, gated convolutional feature extraction, efficient temporal pooling, and, in later variants, content-dependent channel attention mechanisms.

1. Problem Definition and Input Representation

MalConv architectures treat an executable file as a long sequence of raw bytes $x = [x_1, x_2, \ldots, x_N]^T$, where each $x_i \in \{0, 1, \ldots, 255\}$ and $N$ can be as large as $2 \cdot 10^8$ (for files up to $\sim$271 MB). No domain-specific structure from the PE file format, assembly, or other syntactic levels is used.

Each possible byte value, plus an EOF symbol in later variants (forming a 257-symbol alphabet), is mapped to a low-dimensional real-valued vector via an embedding matrix $E \in \mathbb{R}^{K \times d}$, with $K = 256$ or $257$ and $d = 8$. For input sequence $x$, the embedding produces one vector $e_i = E[x_i] \in \mathbb{R}^{d}$ per byte, and the model input is the matrix $X = [e_1, e_2, \ldots, e_N]^T \in \mathbb{R}^{N \times d}$.
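The lookup described above can be sketched in NumPy. This is a minimal illustration, not the authors' code: the random matrix stands in for a learned embedding, and the choice of index 256 for the EOF symbol and the decision to append it at the end of the file are assumptions made here.

```python
import numpy as np

K, d = 257, 8          # alphabet size (256 byte values + EOF) and embedding width
EOF = 256              # assumed index for the EOF symbol (hypothetical choice)

rng = np.random.default_rng(0)
E = rng.normal(size=(K, d)).astype(np.float32)   # stand-in for a learned embedding matrix

def embed(raw: bytes) -> np.ndarray:
    """Map raw bytes (plus a trailing EOF symbol) to an (N + 1, d) matrix."""
    ids = np.frombuffer(raw, dtype=np.uint8).astype(np.int64)
    ids = np.append(ids, EOF)
    return E[ids]      # row lookup: X[i] = E[x_i]

X = embed(b"MZ\x90\x00")
print(X.shape)         # (5, 8): four bytes plus EOF, eight dimensions each
```

Because the embedding is a plain row lookup, its cost is linear in the file length and no byte value is treated as numerically "closer" to another.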

2. Gated 1-D Convolutional Feature Extraction

A single layer of 1-D convolutional filters is applied to the byte embeddings, with two parallel convolutions: a "base" conv and a "gate" conv. In the original MalConv (Raff et al., 2017), both convolutions use $F = 128$ filters, kernel width $W = 500$, stride $s = 500$, and zero padding to ensure each output covers non-overlapping 500-byte regions. MalConv2 (Raff et al., 2020) increases the number of filters to $C = 256$, shrinks the kernel width to $256$, and uses a finer stride of $S = 64$.
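To make the effect of these settings concrete, the number of convolution outputs for a given file size can be computed directly. The helper below is a hypothetical sketch; it assumes the tail of the file is zero-padded up to a full window, which may differ from the exact padding used in the papers.

```python
def num_windows(n_bytes: int, kernel: int, stride: int) -> int:
    """Number of convolution outputs when the tail is zero-padded to a full window."""
    if n_bytes <= kernel:
        return 1
    # One window at offset 0, plus ceil((n_bytes - kernel) / stride) further windows.
    return 1 + -(-(n_bytes - kernel) // stride)

N = 2_000_000  # a 2 MB file
print(num_windows(N, kernel=500, stride=500))  # MalConv settings:  4000
print(num_windows(N, kernel=256, stride=64))   # MalConv2 settings: 31247
```

Under these assumptions the finer stride of MalConv2 produces roughly eight times as many time steps for the same file, which is part of why the pooling stage needs to be memory-efficient.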

For input embedding $X \in \mathbb{R}^{N \times d}$:

  • The base convolution outputs $A = \mathrm{Conv1D}_{\text{base}}(X) \in \mathbb{R}^{T \times F}$.
  • The gate convolution outputs $B = \mathrm{Conv1D}_{\text{gate}}(X) \in \mathbb{R}^{T \times F}$.

Where $T$ is the number of convolutional windows. The gating mechanism combines these via pointwise multiplication of the base conv output with a sigmoid nonlinearity applied to the gate conv:

$H = A \odot \sigma(B)$

No additional nonlinearity is applied.
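A gated 1-D convolution of this kind can be sketched in NumPy as two matrix products over the same set of windows. The function name, the weight layout `(filters, width, d)`, and the toy sizes are assumptions for illustration; no padding is applied here.

```python
import numpy as np

def gated_conv1d(X, W_base, b_base, W_gate, b_gate, stride):
    """X: (N, d); W_*: (F, width, d); b_*: (F,). Returns H of shape (T, F)."""
    F, width, d = W_base.shape
    N = X.shape[0]
    T = (N - width) // stride + 1                     # "valid" windows only, no padding
    # Gather the T windows and flatten each to width * d values.
    idx = np.arange(T)[:, None] * stride + np.arange(width)[None, :]
    windows = X[idx].reshape(T, width * d)
    A = windows @ W_base.reshape(F, -1).T + b_base    # base convolution
    B = windows @ W_gate.reshape(F, -1).T + b_gate    # gate convolution
    return A * (1.0 / (1.0 + np.exp(-B)))             # H = A * sigmoid(B)

rng = np.random.default_rng(0)
N, d, F, width, stride = 64, 8, 4, 16, 16             # toy sizes, far smaller than MalConv's
X = rng.normal(size=(N, d))
W_base, W_gate = rng.normal(size=(2, F, width, d)) * 0.1
b_base, b_gate = np.zeros(F), np.zeros(F)
H = gated_conv1d(X, W_base, b_base, W_gate, b_gate, stride)
print(H.shape)   # (4, 4)
```

The sigmoid gate lies in (0, 1), so it can only attenuate the base response; the base path itself stays linear, consistent with the statement that no further nonlinearity is applied.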

Architecture   Filters (F or C)   Kernel Width (W)   Stride (s or S)
MalConv        128                500                500
MalConv2       256                256                64

3. Temporal Aggregation and Constant-Memory Max-Pooling

Temporal aggregation is achieved via global max-pooling, collapsing the T-step convolutional output per channel to a fixed-size vector:

$z_c = \max_{1 \le t \le T} H_{t,c}, \quad c = 1, \ldots, C$

Max-pooling is favored over average-pooling because malware-relevant signatures may be highly localized to specific byte regions and would be diluted by averaging.
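A toy calculation illustrates the dilution argument. The activation values and the position of the spike are invented for this example.

```python
# One channel's activations over T windows: near zero everywhere except one
# window that matches a (hypothetical) malicious byte pattern.
T = 10_000
activations = [0.01] * T
activations[7_321] = 5.0

max_pooled = max(activations)
mean_pooled = sum(activations) / T

print(max_pooled)             # the spike survives unchanged
print(round(mean_pooled, 4))  # the spike is averaged away to roughly the background level
```

With max-pooling the single strong response determines the pooled feature regardless of file length, whereas the mean shrinks toward the background as the file grows.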

MalConv2 introduces a fixed-memory temporal max-pooling algorithm that scans the convolutional output in overlapping chunks of fixed length and step. For each filter/channel, the maximum activation (and its index) is tracked per chunk and then compared globally, reducing activation storage from an amount proportional to $T$ to an amount independent of it, where $T$ is the reduced sequence length post-convolution. The two-phase algorithm (chunk-wise winner finding without gradients, followed by backpropagation through a small tensor holding only the winning windows) enables training on sequences exceeding 100 million bytes using under 2.2 GB of GPU memory.
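The first phase (winner finding) can be sketched in NumPy as a running maximum over chunks. This is an illustrative sketch, not the authors' implementation: `chunked_argmax`, `conv_fn`, and the chunk size are hypothetical, and NumPy has no gradients, so the second phase (recomputing only the winning windows with gradients enabled in a deep learning framework) is indicated only in the comments.

```python
import numpy as np

def chunked_argmax(X, conv_fn, width, stride, chunk_windows):
    """Scan X (N, d) chunk by chunk, keeping only the running max per channel.

    conv_fn maps a slice of X to activations of shape (windows, C).
    Returns (best_val, best_idx), each of shape (C,); best_idx is a window index.
    """
    T = (X.shape[0] - width) // stride + 1
    best_val, best_idx = None, None
    for t0 in range(0, T, chunk_windows):
        t1 = min(t0 + chunk_windows, T)
        # Consecutive chunks share (width - stride) bytes when width > stride,
        # so no window is split across a chunk boundary.
        lo, hi = t0 * stride, (t1 - 1) * stride + width
        H = conv_fn(X[lo:hi])                  # (t1 - t0, C); discarded after this step
        val, idx = H.max(axis=0), H.argmax(axis=0) + t0
        if best_val is None:
            best_val, best_idx = val, idx
        else:
            better = val > best_val
            best_val = np.where(better, val, best_val)
            best_idx = np.where(better, idx, best_idx)
    # Phase 2 (not shown): recompute only the windows at best_idx with gradients on.
    return best_val, best_idx

rng = np.random.default_rng(0)
d, C, width, stride = 8, 4, 16, 4
Wt = rng.normal(size=(width * d, C))

def conv_fn(x):
    """Toy linear convolution used only for this demonstration."""
    T = (x.shape[0] - width) // stride + 1
    idx = np.arange(T)[:, None] * stride + np.arange(width)[None, :]
    return x[idx].reshape(T, -1) @ Wt

X = rng.normal(size=(1000, d))
val, idx = chunked_argmax(X, conv_fn, width, stride, chunk_windows=32)
full = conv_fn(X)                              # full activation map, for comparison only
print(np.allclose(val, full.max(axis=0)))
```

Only one chunk of activations plus two length-`C` vectors are alive at any time, so peak activation memory does not grow with the file length.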

4. Attention and Channel Gating

While the original MalConv lacked inter-channel or spatial attention, MalConv2 augments the model with Global Channel Gating (GCG). This mechanism computes a content-dependent gate for each time step from a global context vector and applies it to that step's channels, allowing the network to model contextual interactions across distant regions.

For convolutional output $H \in \mathbb{R}^{T \times C}$ with rows $h_t$:

  • Compute a global context vector $g \in \mathbb{R}^{C}$ via a small "context" sub-network (e.g., 1x1 conv or fully-connected layer over pooled H).
  • Project the context once, then gate each time step $t = 1, \ldots, T$:

$q = \tanh(W^\top g)$

$\alpha_t = \sigma(h_t^\top q)$

$\tilde{h}_t = \alpha_t \, h_t$

In equation form:

$\mathrm{GCG}_W(h_t, g) = h_t \cdot \sigma\left(h_t^\top \tanh(W^\top g)\right)$

Here $W \in \mathbb{R}^{C \times C}$ is the learned gating matrix, so the mechanism adds $O(C^2)$ parameters regardless of sequence length. It functions as an attention gate across the entire sequence, with computation scaling linearly in input length.
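The gating step can be sketched in NumPy under explicit assumptions: the gate is taken to have the form $h_t \cdot \sigma(h_t^\top \tanh(W^\top g))$, and the context vector $g$ is produced by a single fully-connected layer over max-pooled $H$. The names `W_ctx` and `W_gate` and the toy sizes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def global_channel_gating(H, W_ctx, W_gate):
    """H: (T, C) conv features; W_ctx, W_gate: (C, C). Returns gated features (T, C)."""
    g = H.max(axis=0) @ W_ctx          # assumed context network: FC layer over pooled H
    q = np.tanh(W_gate.T @ g)          # projected context, shape (C,)
    alpha = sigmoid(H @ q)             # one gate value per time step, shape (T,)
    return H * alpha[:, None]          # scale every channel of step t by alpha[t]

rng = np.random.default_rng(0)
T, C = 50, 6
H = rng.normal(size=(T, C))
W_ctx = rng.normal(size=(C, C)) * 0.1
W_gate = rng.normal(size=(C, C)) * 0.1
G = global_channel_gating(H, W_ctx, W_gate)
print(G.shape)   # (50, 6)
```

The context is computed once per file and each time step needs only a dot product with it, so the cost is one pass over the sequence rather than the pairwise comparisons of full self-attention.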

5. Classification and Training Procedure

After pooling (and optional GCG gating), the fixed-size feature vector $z \in \mathbb{R}^{C}$ is processed by a classification head:

  • Dropout with rate $p$ is applied to $z$, giving $\tilde{z}$.
  • The resulting vector is projected via a single fully-connected layer to a scalar logit:

$\ell = w^\top \tilde{z} + b$

  • The predicted label is $\hat{y} = \sigma(\ell)$, trained with binary cross-entropy loss:

$\mathcal{L} = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]$
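The head and its loss can be sketched as follows. The dropout rate of 0.5 is an assumption (the rate is not recoverable from the text above), and the function names are hypothetical. The loss is computed from the logit in a numerically stable form that is algebraically equal to the binary cross-entropy of the sigmoid output.

```python
import numpy as np

def head_forward(z, w, b, p_drop=0.5, rng=None):
    """z: (C,) pooled features. Returns the logit; dropout is active only if rng is given."""
    if rng is not None:                        # training mode: inverted dropout
        mask = rng.random(z.shape) >= p_drop   # p_drop = 0.5 is an assumed value
        z = z * mask / (1.0 - p_drop)
    return float(z @ w + b)

def bce_with_logit(logit, y):
    """Binary cross-entropy from the logit, avoiding overflow for large |logit|."""
    # Equals -[y * log(sigmoid(l)) + (1 - y) * log(1 - sigmoid(l))].
    return max(logit, 0.0) - logit * y + np.log1p(np.exp(-abs(logit)))

rng = np.random.default_rng(0)
z = rng.normal(size=128)
w = rng.normal(size=128) * 0.1
logit = head_forward(z, w, b=0.0)              # inference: dropout disabled
loss = bce_with_logit(logit, y=1)
print(loss >= 0.0)
```

Working from the logit avoids taking the logarithm of a probability that has rounded to 0 or 1.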

MalConv omits Batch Normalization due to empirically observed slow or hindered convergence, attributed to highly non-Gaussian, multi-modal pre-activation distributions. Instead, DeCov regularization is employed on the dropout-masked pooled vector.
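DeCov penalizes the off-diagonal entries of the batch covariance of a layer's activations, $\tfrac{1}{2}\left(\lVert C \rVert_F^2 - \lVert \mathrm{diag}(C) \rVert_2^2\right)$. A minimal NumPy sketch of that penalty follows; the weight given to this term in the total loss is not shown because it is not stated here.

```python
import numpy as np

def decov_loss(Z):
    """Z: (batch, C) activations. Returns 0.5 * (||Cov||_F^2 - ||diag(Cov)||_2^2)."""
    Zc = Z - Z.mean(axis=0, keepdims=True)
    cov = Zc.T @ Zc / Z.shape[0]                 # (C, C) batch covariance
    off_diag = cov - np.diag(np.diag(cov))       # zero out the per-feature variances
    return 0.5 * float(np.sum(off_diag ** 2))

uncorrelated = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
correlated = np.array([[1.0, 1.0], [-1.0, -1.0]])
print(decov_loss(uncorrelated))   # 0.0
print(decov_loss(correlated))     # 1.0
```

The penalty is zero when features are pairwise uncorrelated across the batch and grows as features become redundant, which encourages the pooled channels to carry distinct information.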

Optimization is conducted via stochastic gradient descent with Nesterov momentum 0.9, initial learning rate 0.01 (decayed by 0.95 per epoch), mini-batch size 256 (data-parallel across GPUs), and optional L2 weight decay. Training on 400K files converges in 3–5 epochs; larger datasets (up to 2M files) further improve generalization.
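The learning-rate schedule described above amounts to a per-epoch exponential decay. The sketch below assumes epochs are counted from zero and that the first epoch uses the undecayed rate.

```python
def learning_rate(epoch: int, base_lr: float = 0.01, decay: float = 0.95) -> float:
    """Learning rate used during `epoch` (0-indexed) under per-epoch exponential decay."""
    return base_lr * decay ** epoch

for epoch in range(5):
    print(epoch, round(learning_rate(epoch), 6))
```

Over the 3 to 5 epochs in which training is reported to converge, the rate under this schedule falls only to about 81% of its initial value, so the decay acts as a mild annealing rather than a sharp cutoff.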

6. Empirical Performance and Resource Requirements

Performance on the Ember2018 dataset (up to 271 MB files) shows:

  • MalConv (original, 2 MB files): Accuracy 91.27%, AUC 97.19%
  • MalConv (full-file, fixed-pool): Accuracy 91.14%, AUC 97.29%
  • MalConv2 with GCG: Accuracy 93.29%, AUC 98.04%

Memory and computational optimizations yield a 116× reduction in GPU memory usage (from 128 GB to ~1 GB for MalConv with fixed-pool) and a 25.8× speedup per byte (training 21 hr/epoch reduced to 1 hr/epoch). MalConv2 with GCG, which incurs higher compute for attention, reports 4 hr/epoch and ≈2.2 GB RAM.

7. Architectural Summary and Implementation

The end-to-end flow comprises:

  1. Embedding: Raw bytes $x \in \{0, \ldots, 255\}^N$ mapped to $X \in \mathbb{R}^{N \times d}$.
  2. Gated 1D Conv Block: Parallel convolutions with gating:

$A = \mathrm{Conv1D}_{\text{base}}(X)$

$B = \mathrm{Conv1D}_{\text{gate}}(X)$

$H = A \odot \sigma(B)$

  3. (Optional, MalConv2) GCG: Gated attention across channel-time using a global context vector.
  4. Global max-pooling (constant-memory in MalConv2): $z_c = \max_{t} H_{t,c}$
  5. Dropout, fully connected layer, and classification via sigmoid output.
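The flow above can be tied together in a compact inference-time sketch. It omits the optional GCG step and dropout (which is inactive at inference), uses random weights in place of trained ones, and uses toy sizes; the `params` dictionary layout is an assumption made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def malconv_forward(raw, params, width, stride):
    """Score one file: embedding -> gated conv -> global max-pool -> linear -> sigmoid."""
    ids = np.frombuffer(raw, dtype=np.uint8).astype(np.int64)
    X = params["E"][ids]                                   # (N, d) byte embeddings
    T = (len(ids) - width) // stride + 1                   # requires len(raw) >= width
    idx = np.arange(T)[:, None] * stride + np.arange(width)[None, :]
    win = X[idx].reshape(T, -1)                            # (T, width * d) windows
    A = win @ params["W_base"] + params["b_base"]          # base convolution, (T, F)
    B = win @ params["W_gate"] + params["b_gate"]          # gate convolution, (T, F)
    H = A * sigmoid(B)                                     # gated features
    z = H.max(axis=0)                                      # global max-pool, (F,)
    return float(sigmoid(z @ params["w_out"] + params["b_out"]))

rng = np.random.default_rng(0)
d, F, width, stride = 8, 16, 32, 32                        # toy sizes
params = {
    "E": rng.normal(size=(256, d)),
    "W_base": rng.normal(size=(width * d, F)) * 0.05,
    "b_base": np.zeros(F),
    "W_gate": rng.normal(size=(width * d, F)) * 0.05,
    "b_gate": np.zeros(F),
    "w_out": rng.normal(size=F) * 0.1,
    "b_out": 0.0,
}
score = malconv_forward(bytes(range(256)) * 4, params, width, stride)
print(0.0 < score < 1.0)   # True
```

Every stage before pooling is a function of a local window, which is what makes it possible to evaluate the network chunk by chunk on very long files.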

The MalConv family of models demonstrates the feasibility of end-to-end learnable malware detection at byte-level granularity using architectures optimized for extremely long sequences, introducing effective memory and compute reductions to process realistic executable sizes at scale. The adoption of gating and channel attention mechanisms allows for content-dependent feature interactions and efficient global context integration, yielding improved detection performance on large, diverse datasets (Raff et al., 2017, Raff et al., 2020).
