
HD Spam Filtering Accelerator

Updated 30 December 2025
  • HD Spam Filtering Accelerator is a system that uses hyperdimensional n-gram encoding to represent spam-related textual features efficiently.
  • It employs random bipolar vectors with cyclic permutations to capture order and frequency, offering significant speed and memory improvements over traditional methods.
  • Empirical results show competitive filtering accuracy alongside substantial computational and memory savings, making it ideal for large-scale spam detection.

N-gram HD (Hyperdimensional) Encoders are methods for representing the distributional statistics of contiguous character or token sequences (n-grams) in fixed-length, high-dimensional vectors. Such encoders leverage principles from hyperdimensional computing and vector-symbolic architectures to yield compact representations that capture order, frequency, and context, providing scalable alternatives to traditional n-gram histograms. Recent advances, exemplified by HyperEmbed and ZEN 2.0, demonstrate that n-gram HD encoders support efficient memory and computation trade-offs while achieving competitive or state-of-the-art results across domains, languages, and tasks (Alonso et al., 2020, Song et al., 2021).

1. Foundations of Hyperdimensional N-gram Encoding

Hyperdimensional computing (HDC) leverages random vectors in very high-dimensional spaces ($D = 10^3$–$10^5$) where base vectors are generated i.i.d. and possess near-orthogonality properties. In n-gram encoding, each symbol $S \in \Sigma$ is assigned a random bipolar vector $v_S \in \{+1,-1\}^D$ stored in an item memory. Positional information is encoded via cyclic permutations $\rho^j(v_S)$, where $j$ indexes the position within an n-gram. Binding is performed as element-wise multiplication:

$$v_w = \bigodot_{j=1}^{n} \rho^j(v_{S_j}), \quad w = S_1 S_2 \cdots S_n$$

All observed n-grams in a document are bundled via summation weighted by their occurrence counts $c(w)$, resulting in a single $D$-dimensional vector $V$:

$$V = \sum_{w \in \Sigma^n} c(w)\, v_w$$

Normalization (e.g., $\ell_2$) is typical prior to downstream task application (Alonso et al., 2020).
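
A minimal NumPy sketch of these two properties (the dimensionality $D = 10{,}000$ and the use of `np.roll` for the cyclic permutation $\rho$ are illustrative assumptions): random bipolar vectors are nearly orthogonal, and permutation-based binding keeps reordered n-grams such as "ab" and "ba" distinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000                                   # assumed HD dimensionality

# Item memory: one random bipolar vector per symbol
v_a = rng.choice([-1, +1], size=D)
v_b = rng.choice([-1, +1], size=D)

def rho(v, j):
    """Cyclic permutation rho^j, implemented as a circular shift by j positions."""
    return np.roll(v, j)

# Bind a bigram by element-wise multiplication of position-permuted symbol vectors
v_ab = rho(v_a, 1) * rho(v_b, 2)             # encodes "ab"
v_ba = rho(v_b, 1) * rho(v_a, 2)             # encodes "ba"

cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(round(cos(v_a, v_b), 3))               # ~0: near-orthogonality of random vectors
print(round(cos(v_ab, v_ba), 3))             # ~0: order is captured by the permutation
```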

2. Algorithmic Implementations and Pseudocode

The standard sliding-window algorithm processes a text to construct an HD representation bundling all of its character n-grams. For a document $T$ over alphabet $\Sigma$, with n-gram length $n$ and HD dimension $D$:

  • Initialize: For each $S \in \Sigma$, generate a random bipolar vector $v_S$.
  • Process: For each position $i$, permute and bind the symbol vectors of the n-gram $w = T[i \ldots i+n-1]$.
  • Bundle: Accumulate the resulting n-gram vector $h$ into $V$.
  • Normalize: $\widehat{V} = V / \|V\|_2$.

Time complexity for document encoding is $O(|T|\, n\, D)$ (Alonso et al., 2020):

for S ∈ Σ:
    v_S ← random bipolar vector in {+1, −1}^D      # item memory
V ← 0
for each n-gram w = T[i … i+n−1] in T:
    h ← (1, …, 1)                                  # identity for element-wise product
    for j = 1 to n:
        h ← h ⊙ ρ^j(v_{T[i+j−1]})                  # bind permuted symbol vectors
    V ← V + h                                      # bundle
return V / ||V||_2                                 # ℓ2-normalize
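
A runnable Python sketch of this encoder is shown below (NumPy; the helper names `make_item_memory` and `encode_ngrams`, the printable-ASCII alphabet, and the parameter values are illustrative assumptions rather than the reference implementation):

```python
import string
import numpy as np

def make_item_memory(alphabet, D, seed=0):
    """Item memory: one random bipolar vector per symbol of the alphabet."""
    rng = np.random.default_rng(seed)
    return {c: rng.choice([-1, +1], size=D) for c in alphabet}

def encode_ngrams(text, item, n=3, D=512):
    """Bundle all character n-grams of `text` into a single D-dimensional HD vector."""
    V = np.zeros(D)
    for i in range(len(text) - n + 1):           # sliding window over the document
        h = np.ones(D)
        for j in range(1, n + 1):                # v_w = prod_j rho^j(v_{S_j})
            h = h * np.roll(item[text[i + j - 1]], j)
        V += h                                   # bundling by summation
    norm = np.linalg.norm(V)
    return V / norm if norm else V               # l2-normalization

item = make_item_memory(string.printable, D=512, seed=42)
v = encode_ngrams("free offer, click now!!!", item, n=3, D=512)
print(v.shape)                                   # (512,)
```

The resulting vectors can then be fed directly to any of the classifiers benchmarked in Section 5.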

3. Trade-offs: Dimensionality, Fidelity, and Resource Efficiency

Classical n-gram histograms require $a^n$ counters for alphabet size $a$. HD encoders collapse all n-gram statistics into a single $D$-dimensional vector, where $D$ is independent of $n$. Increasing $D$ improves representational fidelity and classification accuracy, while reducing $D$ offers substantial savings in memory and computation.
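
For instance, with a byte-level alphabet ($a = 256$) and $n = 3$, a full histogram would need $256^3 \approx 1.7 \times 10^7$ counters, whereas the HD encoding remains a single vector of, say, $D = 512$ components regardless of $n$.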

Empirically, accuracy (micro-averaged $F_1$) grows with $D$ until a saturation point $D^*$: roughly $512$ for small corpora and $4096$ for large ones. Beyond $D^*$, additional gains are negligible or negative. Speed-up and memory reduction scale approximately as $\text{dim}_{\text{orig}}/D$ (Alonso et al., 2020). In practical terms, for AskUbuntu with an MLP ($n=3$, $D=512$), $F_1 = 0.91$ vs. a baseline of $0.92$, with $4.62\times$ faster training, $3.84\times$ faster testing, and a $6.18\times$ smaller memory footprint.

4. Integration in Deep Architectures: ZEN 2.0

ZEN 2.0 advances n-gram HD encoding by marrying external n-gram features with Transformer-based models. Extraction proceeds via PMI thresholds on large corpora (Chinese, Arabic), yielding lexicons ($|\mathcal{V}_{ng}| = 261$K for Chinese, $194$K for Arabic). For each n-gram $g$, a learnable embedding $\mathbf{e}_g$ is provided and contextualized via a 6-layer Transformer (no positional encoding). At every main encoder layer, token states $\mathbf{v}_i^{(l)}$ are fused by summation with weighted n-gram representations:

$$\mathbf{v}_i^{(l)*} = \mathbf{v}_i^{(l)} + \sum_{k=1}^{K_i} p_{i,k}\, \boldsymbol{\mu}_{i,k}^{(l)}$$

where $\boldsymbol{\mu}_{i,k}^{(l)}$ is the representation of the $k$-th n-gram associated with token $i$ and $p_{i,k}$ is proportional to its count $c(g_{i,k})$.
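
A schematic NumPy sketch of this fusion step follows; the per-token normalization of $p_{i,k}$, the binary matching matrix, and the function name `fuse_ngram_states` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def fuse_ngram_states(V, M, counts, match):
    """
    V:      (T, H) token hidden states v_i^(l) at one encoder layer
    M:      (G, H) contextualized n-gram representations mu^(l)
    counts: (G,)   corpus counts c(g) of the extracted n-grams
    match:  (T, G) binary matrix with match[i, k] = 1 if n-gram k covers token i
    """
    weights = (match * counts).astype(float)        # count-weighted matches, (T, G)
    denom = weights.sum(axis=1, keepdims=True)
    p = np.divide(weights, denom,
                  out=np.zeros_like(weights), where=denom > 0)
    return V + p @ M                                # v_i^* = v_i + sum_k p_{i,k} mu_{i,k}
```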

Pre-training objectives mirror BERT (MLM, NSP), with optional whole n-gram masking (WNM). No additional n-gram prediction head is required; the n-gram encoder is learned through the masked language modeling loss. The relative positional encoding and the n-gram fusion architecture carry over unchanged across domains and languages (Song et al., 2021).
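
As a rough illustration of whole n-gram masking, the sketch below masks every token inside a randomly sampled subset of matched n-gram spans; the 15% selection rate, the span format, and the helper name are assumptions, not the exact ZEN 2.0 procedure.

```python
import random

def whole_ngram_mask(tokens, ngram_spans, mask_token="[MASK]", rate=0.15, seed=0):
    """Mask all tokens inside a sampled subset of matched n-gram spans."""
    rng = random.Random(seed)
    tokens = list(tokens)
    for start, end in ngram_spans:               # (start, end) token-index ranges from the lexicon matcher
        if rng.random() < rate:
            for i in range(start, end):          # mask the whole n-gram, not isolated tokens
                tokens[i] = mask_token
    return tokens

toks = ["spam", "filter", "acc", "##eler", "##ator", "is", "fast"]
spans = [(0, 2), (2, 5)]                         # e.g. lexicon matches for "spam filter" and "accelerator"
print(whole_ngram_mask(toks, spans, seed=1))
```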

5. Experimental Results and Empirical Performance

HyperEmbed was validated on three small intent-classification corpora (Chatbot, AskUbuntu, WebApplication) and a large-scale news corpus (20NewsGroups). HD embeddings ($n=2$–$4$, $D=2^5$–$2^{14}$) were supplied to multiple classifiers: Ridge, KNN, MLP, PA, RF, LSVC, SGD, NC, BNB. Benchmarked metrics include $F_1$, training/test time, and memory.

  • As $D$ increases, $F_1$ approaches the baseline attained with traditional n-gram features.
  • For 20NewsGroups ($D=2048$, $n=2$–$3$): up to $90\%$ of baseline $F_1$ is retained, with a $50$–$200\times$ speed-up and a $100\times$ memory reduction.
  • Linear classifiers (Ridge, PA, SGD, LSVC) and the MLP show the best trade-offs, while KNN and tree-based classifiers typically lose accuracy due to the distributed nature of the HD embeddings (Alonso et al., 2020).

ZEN 2.0 training on 8.4B-token Chinese and 7.3B-token Arabic datasets established new state-of-the-art performance over existing BERT and AraBERT baselines across 10 Chinese and multiple Arabic tasks. Task-specific gains in $F_1$ and accuracy range from $+0.1$ to $+4.5$ points (Song et al., 2021).

Model          Corpora Size        n-gram Vocab    F1/Acc Gain
HyperEmbed     Small/Large         –               ca. 90% of baseline F1, substantial efficiency gains (see above)
ZEN 2.0 (L)    8.4B/7.3B tokens    261K/194K       +0.1 to +4.5 points over BERT/AraBERT

6. Practical Guidelines for Deployment and Tuning

Optimal configuration depends on corpus scale and resource constraints.

  • For small corpora, select $n=2$–$4$; for large corpora, $n=2$–$3$.
  • Sweep $D$ from $2^5$ to $2^{14}$ and choose the smallest $D$ achieving at least $95$–$98\%$ of baseline performance (see the sketch after this list).
  • Prefer linear models and shallow MLP for HD representations.
  • Consider binarizing the HD vectors (e.g., taking the element-wise sign of $V$) and the classifier for maximal speed and memory efficiency in extreme settings.
  • Domain and language adaptation in ZEN 2.0 is architecture-neutral; PMI/frequency thresholds and n-gram lexicons are tuned per language, supporting broad coverage without rearchitecting (Alonso et al., 2020, Song et al., 2021).
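
A minimal sketch of the $D$-sweep above, reusing `make_item_memory` and `encode_ngrams` from the Section 2 sketch (the Ridge classifier, 5-fold cross-validation, and the 95% threshold are illustrative assumptions):

```python
import string
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

def select_dimension(texts, labels, baseline_f1, n=3, threshold=0.95, seed=42):
    """Return the smallest D in 2^5..2^14 whose micro-F1 reaches threshold * baseline_f1."""
    labels = np.asarray(labels)
    for exp in range(5, 15):                                  # D = 2^5 ... 2^14
        D = 2 ** exp
        item = make_item_memory(string.printable, D=D, seed=seed)
        X = np.stack([encode_ngrams(t, item, n=n, D=D) for t in texts])
        f1 = cross_val_score(RidgeClassifier(), X, labels,
                             scoring="f1_micro", cv=5).mean()
        if f1 >= threshold * baseline_f1:
            return D, f1
    return D, f1                                              # largest D tried if none qualifies
```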

7. Context, Significance, and Directions

N-gram HD encoders combine scalable representational capacity with memory and computational efficiency, enabling large-vocabulary or long-span n-gram features to be collapsed into manageable fixed-length vectors. Their use in modern NLP architectures, particularly in ZEN 2.0, demonstrates concrete accuracy gains and efficiency for diverse languages and domains.

This suggests the feasibility of high-performance, resource-conscious NLP pipeline designs, and opens potential for further research intersecting distributed representations, symbolic reasoning, and large-scale neural architectures. Continued investigation may address optimal permutation and binding schemes, fusion strategies, and generalization of HD encoding to non-linguistic sequence domains.
