SEANet-Style Vector-Quantizer

Updated 4 February 2026
  • The paper introduces BrainTokMix, a SEANet-style vector-quantizer that achieves end-to-end causal compression of high-dimensional MEG data with a 17× compression ratio.
  • It employs a four-stage residual vector quantization using 16,384-codeword codebooks and causal convolutions to tokenize spatiotemporal neural signals efficiently.
  • Empirical results show high fidelity with MAE of 0.203 and PCC of 0.944, demonstrating stable, long-context generation and effective Transformer integration.

A SEANet-style vector-quantizer, exemplified by the BrainTokMix architecture, is a multi-stage residual vector quantization (RVQ) system designed to encode high-dimensional, multichannel time series such as magnetoencephalography (MEG) recordings into discrete token streams suitable for autoregressive sequence modeling. Originating as a simplified variant of the SEANet codec used in prior work (notably BrainOmni), BrainTokMix achieves causality, computational efficiency, and direct end-to-end compression of spatiotemporal neural data. Its innovations address encoder/decoder structure, codebook interaction, and seamless integration with large-scale decoder-only Transformers for next-token prediction and generative modeling of neurophysiological signals (Csaky, 28 Jan 2026).

1. Architecture and Structural Modifications

The BrainTokMix quantizer operates on MEG segments $x \in \mathbb{R}^{C \times L_w}$, where $C = 68$ (source-space channels) and $L_w = 1024$ (10.24 s at 100 Hz). It employs a strictly causal SEANet encoder composed exclusively of convolutional layers and downsampling (by a factor of 4), yielding output $z \in \mathbb{R}^{T_w \times n_{\text{dim}}}$ with $n_{\text{dim}} = 4096$ and $T_w = L_w/4 = 256$. The channels are reshaped into $n_{\text{neuro}} = 4$ parallel "neuro-streams" (each of dimension $d = 1024$).

Key structural differences from the original SEANet include:

  • Removal of per-sensor attention and sensor embedding modules.
  • Elimination of LSTM temporal or sensor-wise bottlenecks.
  • Exclusively multichannel, causal convolutions for channel mixing in both encoder and decoder.

This design yields a model that is approximately 3× faster to train, simplifies the architecture substantially, and maintains end-to-end causality for real-time applications (Csaky, 28 Jan 2026).
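The causality property above can be illustrated with a left-padded 1-D convolution. The following is a minimal numpy sketch of the principle, not the BrainTokMix layers themselves; signal and kernel sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_conv1d(x, w):
    """Causal 1-D convolution: output[t] depends only on x[:t + 1].

    Left-pads the input with k - 1 zeros so no future sample can leak
    into the current output (the property a real-time codec needs).
    """
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])
    return np.array([xp[t:t + k] @ w[::-1] for t in range(len(x))])

x = rng.normal(size=32)
w = rng.normal(size=5)
y = causal_conv1d(x, w)

# Perturbing a *future* sample leaves all earlier outputs unchanged.
x2 = x.copy()
x2[20] += 10.0
y2 = causal_conv1d(x2, w)
assert np.allclose(y[:20], y2[:20])
```

Stacking such layers (with strided variants for the 4× downsampling) preserves strict causality end to end.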

2. Mathematical Formulation and Quantization Process

Quantization is performed via a four-stage ($Q = 4$) residual vector quantizer. For a latent $z_{h,t} \in \mathbb{R}^d$, the quantization proceeds as follows:

  • For each stage $q = 1,\dots,Q$, a codebook $E^{(q)} \in \mathbb{R}^{K \times d}$ ($K = 16{,}384$) is used.
  • At each stage, the closest codeword $e^{(q)}_{k^{(q)}}$ is assigned by minimizing the squared Euclidean distance to the current residual $r^{(q-1)}$.
  • The quantized representation is

$$\tilde{z}_{h,t} = \sum_{q=1}^{Q} e^{(q)}_{k^{(q)}}$$

and the quantized tensor is passed through the causal SEANet decoder.
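The stage-wise assignment can be sketched in a few lines of numpy. This uses random codebooks and toy sizes, not the trained BrainTokMix codebooks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, Q = 8, 64, 4                       # toy latent dim, codebook size, stages
codebooks = rng.normal(size=(Q, K, d))   # one codebook E^(q) per stage

def rvq_quantize(z, codebooks):
    """Residual VQ: each stage quantizes what the previous stages left over."""
    r = z.copy()                 # r^(0) = z
    z_tilde = np.zeros_like(z)
    indices = []
    for E in codebooks:
        k = int(np.argmin(((r - E) ** 2).sum(axis=1)))  # nearest codeword
        z_tilde += E[k]          # accumulate the quantized representation
        r = r - E[k]             # residual passed to the next stage
        indices.append(k)
    return z_tilde, indices

z = rng.normal(size=d)
z_tilde, idx = rvq_quantize(z, codebooks)
# The quantized latent is exactly the sum of the chosen codewords.
assert np.allclose(z_tilde, sum(codebooks[q][idx[q]] for q in range(Q)))
```

Each latent thus contributes $Q$ discrete indices, one per codebook.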

Training minimizes a composite loss:

  • $L_{\text{rec}} = \|x - \hat{x}\|_1$
  • $L_{\text{freq}} = \exp(-\mathrm{pcc}(x, \hat{x})) + L_{\text{amp}} + 0.5\,L_{\phi}$
    • where $\mathrm{pcc}$ is the channel-averaged Pearson correlation, and $L_{\text{amp}}$ and $L_{\phi}$ are $\ell_1$ norms on the FFT magnitude and phase errors, respectively.
  • $L_{\text{com}} = \sum_{q,h,t} \|\mathrm{sg}[z_{h,t}] - e^{(q)}_{k^{(q)}}\|^2$ is a commitment loss using the stop-gradient operator $\mathrm{sg}$.

Codebook updates use the straight-through estimator: the decoder receives $z_{h,t} + \mathrm{sg}[\tilde{z}_{h,t} - z_{h,t}]$, which equals $\tilde{z}_{h,t}$ in the forward pass while letting gradients flow to $z_{h,t}$ unchanged in the backward pass. Codebooks are updated using the gradients of $L_{\text{com}}$ and the reconstruction/frequency losses (Csaky, 28 Jan 2026).
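The standard straight-through surrogate can be written out directly. Below, `sg` is a numerical stand-in for the stop-gradient (in an autograd framework it would be `detach` / `stop_gradient`); the latent values are toy numbers:

```python
import numpy as np

def sg(x):
    """Stop-gradient: identity in the forward pass; in an autograd
    framework this would block gradients in the backward pass."""
    return x

z = np.array([0.30, -1.20, 0.70])        # encoder latent (toy values)
z_tilde = np.array([0.25, -1.00, 0.50])  # quantized latent (toy values)

# Forward value equals z_tilde exactly, but because the bare z term
# survives, d(decoder_in)/dz = 1 in the backward pass.
decoder_in = z + sg(z_tilde - z)
assert np.allclose(decoder_in, z_tilde)
```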

3. Data Preprocessing and Discrete Tokenization

The end-to-end pipeline for converting MEG to tokens comprises:

  • Per-session preprocessing: interference rejection (Maxwell or gradient-based), causal line-noise notch, causal 1–50 Hz bandpass, resampling to 100 Hz, bad-channel interpolation, and projection to a standard anatomical space (fsaverage, $C = 68$ regions of interest).
  • Standardization (channel-wise median/IQR scaling) and robust clipping to $[-10, 10]$.
  • Segmentation into "good" segments (≥ 60 s, $\sigma < 1.5$), with sessions retained only if ≥ 80% of windows are "good".

For each segment, non-overlapping windows of 10.24 s are tokenized. Each window is encoded, quantized, and the RVQ indices $c_{t,h,q} \in \{0,\dots,K-1\}$ are collected. For $T_w = 256$, $n_{\text{neuro}} = 4$, $Q = 4$, this yields $L = 4096$ tokens per window, corresponding to a tokenization rate of 400 tokens/s, a compression ratio of approximately 17× relative to the raw flattened stream (Csaky, 28 Jan 2026).
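The token-budget arithmetic follows directly from the values quoted above:

```python
# Token-budget arithmetic from the stated pipeline values.
T_w, n_neuro, Q = 256, 4, 4
window_s = 10.24

tokens_per_window = T_w * n_neuro * Q            # tokens per 10.24 s window
tokens_per_sec = tokens_per_window / window_s

# Raw flattened stream: 68 source-space channels at 100 Hz.
raw_samples_per_sec = 68 * 100
compression = raw_samples_per_sec / tokens_per_sec

assert tokens_per_window == 4096
assert abs(tokens_per_sec - 400.0) < 1e-9
assert abs(compression - 17.0) < 1e-9
```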

4. Integration with Transformer Architectures

Tokenized outputs are flattened from their initial $[t, h, q]$ grid structure to a 1D sequence:

$$i = \big((t-1)\, n_{\text{neuro}} + (h-1)\big)\, Q + q, \qquad i = 1,\dots,L,$$

with token $y_i \equiv c_{t,h,q}$.
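This flattening and its inverse can be sketched as follows (1-based indices, as in the formula; purely illustrative helper names):

```python
T_w, n_neuro, Q = 256, 4, 4

def flat_index(t, h, q):
    """Map a 1-based (t, h, q) grid position to the 1-based flat index i."""
    return ((t - 1) * n_neuro + (h - 1)) * Q + q

def unflatten(i):
    """Invert flat_index: recover the 1-based (t, h, q) position."""
    i0 = i - 1
    q = i0 % Q + 1
    h = (i0 // Q) % n_neuro + 1
    t = i0 // (Q * n_neuro) + 1
    return t, h, q

assert flat_index(1, 1, 1) == 1
assert flat_index(T_w, n_neuro, Q) == T_w * n_neuro * Q   # L = 4096
assert all(unflatten(flat_index(t, h, q)) == (t, h, q)
           for t in (1, 7, 256) for h in (1, 4) for q in (1, 4))
```

So the $Q$ stage indices of one latent are adjacent in the sequence, followed by the next neuro-stream, then the next time step.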

Transformer integration details:

  • Separate embedding tables $E^{(q)} \in \mathbb{R}^{K \times d_{\text{model}}}$ ($d_{\text{model}} = 1200$) per RVQ stage.
  • Multimodal rotary position embeddings (MRoPE) across the axes $(t, h, q)$ to enable tri-axial attention.
  • Decoder head tying: the output softmax head for stage $q$ is weight-tied to the embedding table for stage $q+1$ (cyclically).
  • Vocabulary size: $Q \cdot K = 65{,}536$.

The model is trained on the CamCAN and OMEGA datasets (approximately $6 \times 10^8$ tokens) using next-token cross-entropy and the AdamW optimizer. For generation, 1-minute context windows (24,576 tokens) are used, and sampling rolls out up to 4 minutes, with overlap-add reconstruction from decoded MEG windows (Csaky, 28 Jan 2026).
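The quoted numbers are mutually consistent: the "1-minute" context is six whole 4096-token windows, i.e. 61.44 s. A quick check (not from the paper's code):

```python
# Consistency check on the stated generation setup.
tokens_per_window = 4096          # from the tokenization pipeline
window_s = 10.24
context_tokens = 24_576           # quoted "1-minute" context

n_windows = context_tokens // tokens_per_window
assert n_windows == 6
assert n_windows * tokens_per_window == context_tokens
assert abs(n_windows * window_s - 61.44) < 1e-9   # ~1 minute of MEG

assert 4 * 16_384 == 65_536       # vocabulary size Q * K
```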

5. Empirical Performance and Ablation Results

On held-out MOUS data, the quantizer achieves:

  • Mean absolute error (MAE): 0.203
  • Pearson correlation coefficient (PCC): 0.944
  • FFT amplitude error: 0.0835
  • Codebook usage perplexity: $\approx 15{,}518$

Power spectral density (PSD) and spatial covariance reconstructions closely approximate true MEG signals, with mild attenuation above 40 Hz. Long-horizon rollouts (4 min beyond a 1 min prompt) exhibit persistent on-manifold stability, with the out-of-envelope rate (OER) remaining under 10–20%. Conditional specificity is demonstrated by significant prefix-divergence gaps under prompt-swap controls; e.g., the covariance distance gap vs. prompt-swap after 4 min ($\tau = 235.5$ s) is $\Delta = 0.088$ [0.063, 0.173], and vs. real-real $\Delta = 0.098$ [0.046, 0.135]. Shortening the context window (61.44 s to 30.72 s) increases OER and reduces the prefix-swap gap on all metrics, indicating reduced generative fidelity with shorter context (Csaky, 28 Jan 2026).

6. Algorithmic Workflow and Pseudocode

The core RVQ algorithm runs as follows:

  • Forward pass (per stage $q = 1,\dots,Q$, starting from the residual $r = z_{h,t}$):
    • Find $k = \arg\min_{k'} \|r - E^{(q)}_{k'}\|^2$.
    • Add $e = E^{(q)}_k$ to the accumulated quantized latent.
    • Update the residual: $r \leftarrow r - e$.
  • Losses are computed as:

$$\text{loss}_{\text{rec}} = L_1(x, \hat{x}) + \exp(-\mathrm{pcc}(x, \hat{x}))$$

$$\text{loss}_{\text{freq}} = L_1(\mathrm{FFT}_{\text{mag}}(x), \mathrm{FFT}_{\text{mag}}(\hat{x})) + 0.5\, L_1(\mathrm{FFT}_{\text{phase}}(x), \mathrm{FFT}_{\text{phase}}(\hat{x}))$$

$$\text{loss}_{\text{com}} = \beta \sum_{h,t,q} \|\mathrm{sg}(z_{h,t}) - E^{(q)}_{k^{(q)}}\|^2$$

$$\text{loss}_{\text{total}} = \text{loss}_{\text{rec}} + \text{loss}_{\text{freq}} + \text{loss}_{\text{com}}$$

  • Backward pass: gradients propagate with the straight-through estimator; codebooks receive updates from both $L_{\text{com}}$ and the reconstruction/frequency components.

A compact pseudocode representation is presented in the original work and directly transcribes the forward/backward workflow (Csaky, 28 Jan 2026).
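The loss terms above can be transcribed into a minimal numpy sketch. Shapes are toy-sized, the data is random, and details such as the exact pcc averaging, FFT handling, and the commitment weight $\beta$ are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(1)
C, L = 4, 64                     # toy channels x samples (real model: 68 x 1024)
beta = 0.25                      # commitment weight (illustrative value)

x = rng.normal(size=(C, L))                 # "true" signal
x_hat = x + 0.1 * rng.normal(size=(C, L))   # pretend reconstruction

def l1(a, b):
    return np.abs(a - b).mean()

def pcc(a, b):
    """Channel-averaged Pearson correlation."""
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    num = (a_c * b_c).sum(axis=1)
    den = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
    return (num / den).mean()

Fx, Fxh = np.fft.rfft(x, axis=1), np.fft.rfft(x_hat, axis=1)
loss_rec = l1(x, x_hat) + np.exp(-pcc(x, x_hat))
loss_freq = l1(np.abs(Fx), np.abs(Fxh)) + 0.5 * l1(np.angle(Fx), np.angle(Fxh))

# Commitment term for one toy latent against its chosen codeword per stage
# (sg() is omitted here since plain numpy carries no gradients anyway).
z = rng.normal(size=8)
picked = [rng.normal(size=8) for _ in range(4)]       # chosen codewords e^(q)
loss_com = beta * sum(((z - e) ** 2).sum() for e in picked)

loss_total = loss_rec + loss_freq + loss_com
assert loss_total > 0.0
```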

7. Significance and Application

SEANet-style vector-quantization, as realized in BrainTokMix, provides an efficient, scalable, and causal method for transforming continuous, high-dimensional neurophysiological data into discretized token streams. These representations facilitate language-model-scale autoregressive modeling and long-context generation for neuroscientific signals. The system achieves substantial compression (17×), preserves signal fidelity, and enables stable, prompt-specific long-horizon generation with strong generalization across datasets. Empirical ablations underscore the importance of extended context length for maintaining spatiotemporal specificity.

A plausible implication is that SEANet-style quantization, particularly with simplified channel mixing and multi-stage RVQ bottlenecks, defines a new standard for integrating modern autoregressive architectures with biomedical time series. This approach bridges advances in neural signal processing and sequence modeling, supporting both fundamental research and practical applications in computational neuroscience (Csaky, 28 Jan 2026).
