- The paper presents an attention-free, frequency-domain classifier that processes raw UTF-8 bytes to achieve competitive accuracy with minimal parameters.
- Its methodology combines FFT-based encoders, recurrent oscillator banks, and a six-parameter PhaseHarmonics module; in ablations, PhaseHarmonics alone accounts for a +2.6% accuracy improvement.
- Empirical evaluations demonstrate superior parameter efficiency and long-context performance, highlighting suitability for resource-constrained deployments.
Oscillator-Based Byte-Level Text Classification Without Tokenization or Attention
Introduction and Motivation
The paper introduces Kathleen, a frequency-domain architecture for text classification that operates directly on raw UTF-8 bytes and uses neither tokenization nor attention. The authors posit that frequency processing inspired by oscillator dynamics can match or surpass tokenized deep architectures on standard language classification benchmarks, despite a drastic reduction in parameter count and computational cost. The key motivation is to address the scalability and preprocessing constraints imposed by Transformers: quadratic complexity in sequence length, tokenizer dependency, and large resource requirements. The approach relies on bioresonance-inspired oscillatory kernels and compact, learnable signal processing components to achieve efficient byte-level classification.
Core Architectural Components
Kathleen is constructed from several novel frequency-domain modules:
- FFT-Rotate Wavetable Encoder: This component encodes all 256 possible byte values using a single learnable vector of 256 floats. For every input byte, an FFT-based phase rotation constructs its embedding dynamically, replacing the typical 65K-parameter embedding table and outperforming it (+0.6% accuracy). This achieves maximal parameter sharing across the byte alphabet.
- Recurrent Oscillator Banks: A set of causal, temporally recurrent convolutional kernels is initialized as damped sinusoids with varying decay constants. These kernels act as resonant filters that amplify informative frequency patterns over the byte sequence, providing an O(L) alternative to attention for contextual feature extraction.
- PhaseHarmonics Nonlinearity: The most critical component, PhaseHarmonics applies a six-way sinusoidal projection with learnable frequency-specific phase offsets, concatenating these projections with the identity and then projecting back to the original space. Despite containing only 6 parameters, ablation studies demonstrate this block yields the largest performance gain of any architectural choice (+2.6% accuracy).
- Continuous Phase Shifting: Multiple learned phase shifts are applied in the Fourier domain, concatenating representations across shifted bases. This step gives further "views" of the sequence's frequency structure (+1.3% accuracy).
- PowerLawGate: A learnable power-law nonlinearity compresses the dynamic range of oscillator outputs. It yields a nontrivial gain (+0.9%) only on frequency-domain features and has no effect on tokenized or standard embeddings, which shows that a component's usefulness depends on the representation it operates on.
- DualPooling: Combines attention-weighted pooling and max pooling for global sequence aggregation, which is empirically essential for retaining sparse but informative features present in short texts.
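The components listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give exact formulas, so the function names, the harmonic numbers 1 to 6, the sign-preserving form of the power law, and the softmax scoring vector are all hypothetical choices. In particular, the sketch assumes that the "FFT-based phase rotation" is the Fourier shift theorem applied to the shared 256-float vector, and that the six PhaseHarmonics parameters are the phase offsets, with the output projection counted separately.

```python
import numpy as np

N = 256  # size of the byte alphabet and (assumed) wavetable length


def fft_rotate_embed(wavetable, byte_ids):
    """Embed byte b as the wavetable circularly shifted by b samples.

    Assumption: the phase rotation is the Fourier shift theorem applied
    to one shared vector of 256 floats.
    """
    spectrum = np.fft.fft(wavetable)                 # (N,)
    k = np.arange(N)                                 # frequency bin indices
    b = np.asarray(byte_ids)[:, None]                # (L, 1)
    rotation = np.exp(-2j * np.pi * k * b / N)       # (L, N) phase ramps
    return np.fft.ifft(spectrum * rotation, axis=-1).real  # (L, N)


def damped_sinusoid_kernels(freqs, decays, length):
    """Initial kernels exp(-decay * t) * cos(2*pi*freq * t), one row per oscillator."""
    t = np.arange(length)
    return np.exp(-decays[:, None] * t) * np.cos(2 * np.pi * freqs[:, None] * t)


def causal_conv(x, kernel):
    """Causal filtering: y[t] depends only on x[0..t]; cost is linear in len(x)."""
    return np.convolve(x, kernel)[: len(x)]


def phase_harmonics(x, phases, proj):
    """x: (L, D), phases: (6,), proj: (7*D, D) -> (L, D).

    Assumption: harmonic numbers 1..6 are fixed; only the 6 phases are learned.
    """
    k = np.arange(1, 7)
    harmonics = np.sin(k * x[..., None] + phases)                 # (L, D, 6)
    stacked = np.concatenate([x[..., None], harmonics], axis=-1)  # identity + 6 views
    return stacked.reshape(x.shape[0], -1) @ proj                 # back to (L, D)


def power_law_gate(x, alpha):
    """Sign-preserving power law; alpha < 1 compresses dynamic range (assumed form)."""
    return np.sign(x) * np.abs(x) ** alpha


def dual_pool(x, score_vec):
    """Concatenate a softmax-weighted average over positions with a max over positions."""
    scores = x @ score_vec                       # (L,)
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    return np.concatenate([weights @ x, x.max(axis=0)])  # (2*D,)
```

Under the shift-theorem assumption, the embedding of an integer byte value b equals np.roll(wavetable, b), so all 256 embeddings are rotations of a single learned vector. That is one concrete way to read "maximal parameter sharing across the byte alphabet".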
Empirical Evaluation
Kathleen was comprehensively evaluated on standard text classification benchmarks (IMDB, AG News, SST-2) using only task-specific data for pretraining and classification. The training regimen comprises initial masked language modeling applied to perception layers, followed by supervised label finetuning. All experiments were conducted under controlled seeds and on standard hardware (single NVIDIA T4 GPU).
Key Results
- Accuracy: Kathleen-Clean (733K params) obtains 88.6% on IMDB, 92.3% on AG News, and 83.3% on SST-2. It outperforms a tokenized oscillator-based counterpart with 16× more parameters by +1.6% (IMDB) and +2.1% (AG News).
- Efficiency: Measured as accuracy per million parameters, Kathleen is 87× more parameter-efficient than BERT-base and 16× more parameter-efficient than tokenized Kathleen.
- Long-context operation: Thanks to its O(L) complexity, Kathleen scales linearly and continues to improve when given long input sequences (e.g., >4K bytes), whereas Transformer baselines run out of memory at 1-2K bytes.
- PhaseHarmonics dominance: Component ablations establish that removing the 6-parameter PhaseHarmonics module costs more accuracy (−2.6%) than removing a 560K-parameter bio-inspired block (−0.2%).
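The 16× efficiency figure can be roughly reproduced from the numbers reported above. The sketch below is a back-of-the-envelope check, not the paper's own calculation. It assumes that "16× more parameters" means exactly 16 times 733K, and that the +1.6% IMDB margin is in percentage points, so the tokenized counterpart scores 87.0%. The function name is hypothetical. The BERT-base comparison is not reproduced because its accuracy and parameter count are not stated here.

```python
def accuracy_per_million_params(accuracy_pct, n_params):
    """Efficiency metric named in the text: accuracy divided by millions of parameters."""
    return accuracy_pct / (n_params / 1e6)


kathleen_params = 733_000
kathleen_imdb = 88.6

# Assumptions: exactly 16x the parameters, and a 1.6-point lower IMDB accuracy.
tokenized_params = 16 * kathleen_params
tokenized_imdb = kathleen_imdb - 1.6

kathleen_eff = accuracy_per_million_params(kathleen_imdb, kathleen_params)
tokenized_eff = accuracy_per_million_params(tokenized_imdb, tokenized_params)
ratio = kathleen_eff / tokenized_eff

print(f"Kathleen:  {kathleen_eff:.1f} accuracy points per million parameters")
print(f"Tokenized: {tokenized_eff:.1f} accuracy points per million parameters")
print(f"Ratio:     {ratio:.1f}x")
```

Under these assumptions the ratio comes out slightly above 16, because the byte-level model is both 16 times smaller and marginally more accurate.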
Ablation and Component Analysis
Systematic ablation on both tokenized and byte-level variants demonstrates:
- On tokens, cognitive and resonance-inspired modules contribute negligibly, while gating and lightweight convolution matter most.
- On bytes, frequency-driven modules dominate. Removing PowerLawGate or the FFT-Rotate encoder reduces performance, indicating that these components contribute meaningfully only in spectral settings.
- The cognitive "Phantasy" architecture, which occupies 31% of the parameter budget, yields only a +0.2% improvement and was therefore pruned.
Analysis of Carrier Cancellation
Sinusoidal carrier-based encodings performed at chance until their combination with mean pooling was diagnosed as a destructive interference mechanism. The corrective shift to Fourier-based non-carrier encodings rectified the issue, highlighting the importance of signal representation and aggregation interactions in oscillator-driven models.
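A toy example shows how mean pooling can cancel a sinusoidal carrier. This is only an illustration of the general interference effect, assuming a simple amplitude-modulated cosine carrier. The paper's actual encoding is not specified here, and the function names are hypothetical.

```python
import math


def carrier_encode(amplitudes, cycles_per_step):
    """Modulate each per-position amplitude onto a cosine carrier (assumed encoding)."""
    return [
        a * math.cos(2 * math.pi * cycles_per_step * t)
        for t, a in enumerate(amplitudes)
    ]


def mean_pool(values):
    return sum(values) / len(values)


# Two clearly different inputs: constant amplitude 1.0 versus 5.0.
# 64 steps at 1/8 cycle per step is exactly 8 full carrier cycles.
doc_a = carrier_encode([1.0] * 64, cycles_per_step=0.125)
doc_b = carrier_encode([5.0] * 64, cycles_per_step=0.125)

# Positive and negative half-cycles cancel, so both means are ~0 and
# the two inputs become indistinguishable after mean pooling.
print(mean_pool(doc_a), mean_pool(doc_b))

# Max pooling keeps the amplitude information that the mean destroys.
print(max(doc_a), max(doc_b))
```

A classifier that only sees the mean-pooled value receives essentially the same input for both sequences, which is consistent with chance-level accuracy.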
Practical and Theoretical Implications
Kathleen reframes efficient sequence modeling by demonstrating that explicit, learnable frequency processing on bytes is sufficient for high-quality NLP, without recourse to discrete token vocabularies or attention. This design enables:
- Deployment on constrained devices: The compact model fits within microcontroller and mobile deployment constraints.
- Long-context, streaming, and language-agnostic processing: O(L) complexity and byte-level operation permit efficient modeling over arbitrarily long sequences and across arbitrary languages without retraining the encoding pipeline.
- Explicit inductive bias for frequency-based structure: Unlike standard neural models where useful representations emerge implicitly, Kathleen capitalizes on explicit spectral and oscillatory biases with minimal parameter overhead.
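The language-agnostic property follows from how UTF-8 works: any string, in any script, becomes a sequence of integers in the range 0 to 255. The short sketch below illustrates this with a hypothetical helper name. It shows only the input side and is not part of the paper's pipeline.

```python
def to_byte_ids(text):
    """Return the UTF-8 bytes of a string as a list of integers in [0, 255]."""
    return list(text.encode("utf-8"))


samples = ["hello", "héllo", "こんにちは"]
for s in samples:
    ids = to_byte_ids(s)
    # Non-ASCII characters expand to several bytes, so sequences get longer,
    # but the alphabet stays fixed at 256 symbols with no vocabulary to retrain.
    print(repr(s), len(s), "characters ->", len(ids), "bytes")
```

The trade-off is sequence length: scripts outside ASCII use two to four bytes per character, which is where linear-time processing helps.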
The context-dependent utility observed for blocks like PowerLawGate further motivates factorized architecture search: components cannot be evaluated in isolation but must be validated together with the input feature space they operate on.
Limitations and Future Directions
Despite strong efficiency and competitive classification accuracy at the byte level, several limitations remain:
- A consistent 8% accuracy gap persists compared to large, pretrained Transformers, driven by both the absence of large-scale pretraining and the lack of explicit subword/semantic compositionality.
- Performance on short texts lags relative to longer sequences, correlated with the oscillator design's need for sufficient context to exploit spectral structure.
- The architecture has not been extended to generation, translation, or structured prediction; its properties for language modeling remain to be established.
Priority avenues for future work include deeper oscillator models, exploiting the scalable context processing for document-level tasks, edge-device deployment, multilingual benchmarks, and adaptation to autoregressive generation.
Conclusion
Kathleen establishes the viability of attention-free, tokenization-free byte-level text classification via learnable oscillatory frequency processing. It achieves strong empirical results with dramatic parameter efficiency—anchored by the discovery that minimal sinusoidal nonlinearities drive the majority of performance gain, while complex cognitive architecture contributes minimally. These findings recalibrate priorities for efficient sequence model design in NLP, underscoring the promise of explicit signal processing inductive bias for text understanding.