HD Spam Filtering Accelerator

Updated 30 December 2025

HD Spam Filtering Accelerator is a system that uses hyperdimensional n-gram encoding to represent spam-related textual features efficiently.
It employs random bipolar vectors with cyclic permutations to capture order and frequency, offering significant speed and memory improvements over traditional methods.
Empirical results show competitive filtering accuracy alongside substantial computational and memory savings, making it ideal for large-scale spam detection.

N-gram HD (Hyperdimensional) Encoders are methods for representing the distributional statistics of contiguous character or token sequences (n-grams) in fixed-length, high-dimensional vectors. Such encoders leverage principles from hyperdimensional computing and vector-symbolic architectures to yield compact representations that capture order, frequency, and context, providing scalable alternatives to traditional n-gram histograms. Recent advances, exemplified by HyperEmbed and ZEN 2.0, demonstrate that n-gram HD encoders support efficient memory and computation trade-offs while achieving competitive or state-of-the-art results across domains, languages, and tasks (Alonso et al., 2020, Song et al., 2021).

1. Foundations of Hyperdimensional N-gram Encoding

Hyperdimensional computing (HDC) leverages random vectors in very high-dimensional spaces ( $D=10^3$ – $10^5$ ) where base vectors are generated i.i.d. and possess near-orthogonality properties. In n-gram encoding, each symbol $S\in\Sigma$ is assigned a random bipolar vector $v_S\in\{+1,-1\}^D$ stored in an item memory. Positional information is encoded via cyclic permutations $\rho^j(v_S)$ where $j$ indexes the position within an n-gram. Binding is performed as element-wise multiplication:

$v_w = \bigodot_{j=1}^{n} \rho^j(v_{S_j}),\quad w=S_1S_2\cdots S_n$

All observed n-grams in a document are bundled via weighted summation, resulting in a single D-dimensional vector $V$ :

$V = \sum_{w\in\Sigma^n} c(w)\,v_w$

Normalization (e.g., $l_2$ ) is typical prior to downstream task application (Alonso et al., 2020).

2. Algorithmic Implementations and Pseudocode

The standard sliding-window algorithm processes text to construct HD representations for all character n-grams. For document $D$ , alphabet $\Sigma$ , n-gram length $n$ , and HD dimension $D$ :

Initialize: For each $S\in\Sigma$ , generate random $v_S$ .
Process: For each position $i$ , bind and permute vectors for the n-gram $w=D[i...i+n-1]$ .
Bundle: Accumulate the resulting $h$ into $V$ .
Normalize: $\widehat V = V / \|V\|_2$ .

Time complexity for document encoding is $O(|D| n D)$ (Alonso et al., 2020):

for S ∈ Σ:
    v_S ← random bipolar vector in {+1,−1}^D
V ← 0
for each n-gram w = D[i…i+n−1]:
    h ← v_{D[i]}
    for j = 2 to n:
        h ← h ⊙ ρ^{j}(v_{D[i + j − 1]})
    V ← V + h
return V / ||V||_2

3. Trade-offs: Dimensionality, Fidelity, and Resource Efficiency

Classical n-gram histograms require $a^n$ counters for alphabet size $a$ . HD encoders collapse all n-gram statistics into a single $D$ -dimensional vector— $D$ is independent of $n$ . Increasing $D$ improves representational fidelity and classification accuracy, while reducing $D$ offers substantial savings in memory and computation.

Empirically, accuracy (micro-averaged $F_1$ ) grows with $D$ until a saturation point ( $D^*$ ): $\sim$ 512 for small corpora, $\sim$ 4096 for large. Beyond $D^*$ , additional gains are negligible or negative. Speed-up and memory reduction scale approximately as $\text{dim}_{\text{orig}}/D$ (Alonso et al., 2020). In practical terms, for AskUbuntu with MLP ( $n=3$ , $D=512$ ), $F_1 = 0.91$ vs. baseline $0.92$, with $4.62\times$ faster training, $3.84\times$ faster testing, and $6.18\times$ smaller memory footprint.

4. Integration in Deep Architectures: ZEN 2.0

ZEN 2.0 advances n-gram HD encoding by marrying external n-gram features with Transformer-based models. Extraction proceeds via PMI thresholds on large corpora (Chinese, Arabic), yielding lexicons ( $|\mathcal{V}_{ng}|$ = 261K for Chinese, 194K for Arabic). For each n-gram $g$ , a learnable embedding $\mathbf{e}_g$ is provided and contextualized via a 6-layer Transformer (no positional encoding). At every main encoder layer, token states $\mathbf{v}_i^{(l)}$ are fused by summation with weighted n-gram representations:

$\mathbf{v}_i^{(l)*} = \mathbf{v}_i^{(l)} + \sum_{k=1}^{K_i} p_{i,k} \boldsymbol\mu_{i,k}^{(l)}$

where $p_{i,k}$ is proportional to $c(g_{i,k})$ .

Pre-training objectives mirror BERT (MLM, NSP), with optional whole n-gram masking (WNM). No additional n-gram prediction head is required; the n-gram encoder is learned via the masked language modeling loss. Relative positional encoding and fusion architecture apply across domains and languages (Song et al., 2021).

5. Experimental Results and Empirical Performance

HyperEmbed was validated on three small intent-classification corpora (Chatbot, AskUbuntu, WebApplication) and a large-scale news corpus (20NewsGroups). HD embeddings ( $n=2$ –4, $D=2^5$ – $2^{14}$ ) were supplied to multiple classifiers: Ridge, KNN, MLP, PA, RF, LSVC, SGD, NC, BNB. Benchmarked metrics include $F_1$ , training/test time, and memory.

As $D$ increases, F1 approaches baseline attained with traditional n-gram features.
For 20NewsGroups ( $D=2048$ , $n=2$ –3): up to $90\%$ of baseline $F_1$ retained, with $50$– $200\times$ speed-up and $100\times$ memory reduction.
Linear classifiers (Ridge, MLP, PA, SGD, LSVC) show best trade-offs, while KNN and tree-based classifiers typically lose accuracy due to the distributed nature of the HD embeddings (Alonso et al., 2020).

ZEN 2.0 training on 8.4B-token Chinese and 7.3B-token Arabic datasets established new state-of-the-art performance over existing BERT and AraBERT baselines across 10 Chinese and multiple Arabic tasks. Task-specific gains in $F_1$ and accuracy range from $+0.1$ to $+4.5$ points (Song et al., 2021).

Model	Corpora Size	n-gram Vocab	F1/Acc Gain
HyperEmbed	Small/Large	–	ca. $90\%$ baseline, substantial efficiency (see above)
ZEN 2.0(L)	8.4B/7.3B	261K/194K	$+0.1$ – $+4.5$ over BERT/AraBERT

6. Practical Guidelines for Deployment and Tuning

Optimal configuration depends on corpus scale and resource constraints.

For small corpora, select $n=2$ –$4$; for large, $n=2$ –$3$.
Sweep $D$ from $2^5$ to $2^{14}$ , choose the smallest $D$ achieving at least $95$– $98\%$ of baseline performance.
Prefer linear models and shallow MLP for HD representations.
Consider binarization of the HD vectors and classifier for maximal speed and memory efficiency in extreme settings.
Domain and language adaptation in ZEN 2.0 is architecture-neutral; PMI/frequency thresholds and n-gram lexicons are tuned per language, supporting broad coverage without rearchitecting (Alonso et al., 2020, Song et al., 2021).

7. Context, Significance, and Directions

N-gram HD encoders combine scalable representational capacity with memory and computational efficiency, enabling large-vocabulary or long-span n-gram features to be collapsed into manageable fixed-length vectors. Their use in modern NLP architectures, particularly in ZEN 2.0, demonstrates concrete accuracy gains and efficiency for diverse languages and domains.

This suggests the feasibility of high-performance, resource-conscious NLP pipeline designs, and opens potential for further research intersecting distributed representations, symbolic reasoning, and large-scale neural architectures. Continued investigation may address optimal permutation and binding schemes, fusion strategies, and generalization of HD encoding to non-linguistic sequence domains.