FR-KAN: Efficient Text Classification Heads
- FR-KAN heads are text classification modules that replace standard MLP heads with Fourier series-based activations, increasing expressive capacity.
- FR-KAN achieves significant accuracy gains (up to +18 pp on some datasets) over matched-parameter MLPs while maintaining similar or lower computational costs.
- These heads offer rapid convergence, improved interpretability, and are well-suited for low-resource scenarios and frozen-encoder pipelines.
A text classification "head" is the module attached on top of a neural representation extractor (e.g., pre-trained transformer, convolutional network, or classical embedding pipeline) that produces final class logits for a downstream text classification task. Fourier Kolmogorov–Arnold Networks (FR-KAN) represent a recent innovation in this space, providing a flexible and computationally efficient alternative to conventional Multi-Layer Perceptron (MLP) heads, particularly effective in low-resource and frozen-encoder scenarios. FR-KAN and related Kolmogorov–Arnold (KAN) formulations are designed to leverage the rich contextual representations of LLMs with more expressive modeling capacity than standard affine layers, while maintaining practical efficiency.
1. Mathematical Foundation of KAN and FR-KAN Heads
The Kolmogorov–Arnold representation theorem asserts that any continuous function $f : [0,1]^n \to \mathbb{R}$ can be written as a finite sum of univariate functions composed with sums over the inputs:

$$f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right),$$

where each $\phi_{q,p}$ is a (learnable) univariate function and the $\Phi_q$ are linear or nonlinear outer transformations. In the original KAN, the $\phi_{q,p}$ are realized with cubic B-splines; in FR-KAN, these are replaced by a truncated Fourier series, providing adaptive, globally smooth nonlinearity:

$$\phi(x) = \sum_{k=1}^{G} \big( a_k \cos(kx) + b_k \sin(kx) \big),$$

with coefficients $a_k, b_k$ learned during training. The FR-KAN head thus implements

$$y_j = \sum_{i=1}^{d} \sum_{k=1}^{G} \big( a_{ijk} \cos(k x_i) + b_{ijk} \sin(k x_i) \big),$$

where $x_1, \dots, x_d$ are extracted embedding features and $y_j$ supplies the logit for class $j$ before the softmax (Imran et al., 2024).
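A minimal PyTorch sketch of such a head follows; the class name `FourierKANHead` and the initialization scale are illustrative choices, not the reference implementation's API. The tensors `a` and `b` hold exactly the $2GdC$ coefficients of the expansion above.

```python
import torch
import torch.nn as nn


class FourierKANHead(nn.Module):
    """Maps d-dim embeddings to C class logits via per-feature truncated Fourier series."""

    def __init__(self, in_dim: int, num_classes: int, grid_size: int = 5):
        super().__init__()
        self.grid_size = grid_size
        # a[k, i, j] / b[k, i, j]: cosine / sine coefficient for frequency k+1,
        # input feature i, class logit j -- the 2*G*d*C coefficients above.
        scale = 1.0 / (in_dim * grid_size) ** 0.5  # illustrative init, not prescribed
        self.a = nn.Parameter(scale * torch.randn(grid_size, in_dim, num_classes))
        self.b = nn.Parameter(scale * torch.randn(grid_size, in_dim, num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim); frequencies k = 1..G
        k = torch.arange(1, self.grid_size + 1, device=x.device, dtype=x.dtype)
        kx = k.view(1, -1, 1) * x.unsqueeze(1)  # (batch, G, in_dim)
        # Sum a_{ijk} cos(k x_i) + b_{ijk} sin(k x_i) over features i and frequencies k.
        logits = torch.einsum("bgi,gic->bc", torch.cos(kx), self.a)
        logits = logits + torch.einsum("bgi,gic->bc", torch.sin(kx), self.b)
        return logits  # (batch, num_classes)
```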
2. Integration with Representation Extractors
FR-KAN heads are typically deployed in a "feature adapter" regime: the upstream model (e.g., BERT, RoBERTa, ELECTRA, DistilBERT, XLNet, BART) is kept frozen, and only the classification head is trained. The architecture, sketched in code after the list below, is:
- Input text → Tokenizer + Transformer backbone → final hidden representation
- Final hidden representation → FR-KAN layer (single block)
- Output logits → Softmax
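A hedged sketch of this frozen-encoder pipeline, reusing the `FourierKANHead` class from the Section 1 sketch; the DistilBERT checkpoint, CLS-token pooling, and four-class setup are illustrative assumptions:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.requires_grad_(False)  # freeze the backbone: only the head will train
encoder.eval()

texts = ["markets rallied after the announcement", "astronomers spot a new exoplanet"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state[:, 0]  # CLS-token pooling, (batch, d)

head = FourierKANHead(in_dim=hidden.size(-1), num_classes=4, grid_size=5)
logits = head(hidden)           # (batch, num_classes)
probs = logits.softmax(dim=-1)  # class probabilities
```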
For static word-embedding baselines (e.g., fastText or randomly initialized embeddings), FR-KAN can be attached on pooled sentence vectors, convolutions, or other compressive mappings. KAConvText applies convolutional feature extraction before the KAN/FR-KAN head for sentence-level classification in Burmese (Thu et al., 9 Jul 2025).
The FR-KAN head maintains a parameter count smaller than or comparable to that of 2-layer MLPs; e.g., DistilBERT + FR-KAN (G=5) requires ~30.7K head parameters for AGNews, 107.5K for DBpedia, and 15.4K for IMDb (Imran et al., 2024).
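These counts follow directly from the Fourier parameterization: with $d$ input features, $C$ classes, and grid size $G$, the head stores $2GdC$ coefficients (one cosine and one sine set). A quick check for DistilBERT ($d = 768$, $G = 5$):

$$2 \cdot G \cdot d \cdot C = 2 \cdot 5 \cdot 768 \cdot C = \begin{cases} 30{,}720 \approx 30.7\text{K} & (C = 4,\ \text{AGNews}) \\ 107{,}520 \approx 107.5\text{K} & (C = 14,\ \text{DBpedia}) \\ 15{,}360 \approx 15.4\text{K} & (C = 2,\ \text{IMDb}) \end{cases}$$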
3. Empirical Performance and Comparative Results
Substantial improvements are observed when replacing standard MLP heads with FR-KAN in downstream text classification:
- Across seven transformer backbones and seven datasets, FR-KAN achieves a mean accuracy of 0.672 (MLP: 0.582, +9.0 pp) and F1 of 0.669 (MLP: 0.559, +11.0 pp).
- Largest per-backbone gains: RoBERTa (+42 pp Acc, +53 pp F1), ELECTRA (+13 pp Acc, +14 pp F1); weakest case: XLNet, where FR-KAN trails the MLP (–5 pp Acc, –3 pp F1).
- By dataset: FR-KAN increases performance by +9 pp (AGNews), +18 pp (DBpedia/Papluca), +7 pp (IMDb), +1 pp (SST-5), +15 pp (TREC-50), +4 pp (YELP-Full) over matched-parameter MLPs (Imran et al., 2024).
- In Burmese news classification, EfficientKAN (spline-based) with fastText embeddings achieves F1 = 0.928, slightly outperforming MLP (F1 = 0.908) and the grid-based FasterKAN variant (F1 = 0.927); FourierKAN performs best on periodic features but is generally less accurate on typical text (Aung et al., 26 Nov 2025).
- KAConvText on Burmese sentence tasks: the MLP head retains a slight edge in accuracy/F1 (e.g., 91.23%/0.9109 on hate speech), with the KAN head close behind (89.16%/0.8931) but substantially more interpretable (Thu et al., 9 Jul 2025).
Convergence studies show FR-KAN stabilizes classification loss and accuracy within ~4 epochs, compared to >20 epochs for MLP/KAN (Imran et al., 2024).
4. Efficiency, Convergence, and Practical Deployment
FR-KAN heads demonstrate computational efficiency along several axes:
- Parameter count: comparable to or lower than that of conventional MLPs at matched capacity.
- Convergence speed: FR-KAN typically reaches optimal loss/accuracy within a quarter of the epochs required by MLP or (spline-based) KAN, reflecting the suitability of Fourier-basis univariate activations for modeling transformer-sourced embeddings (Imran et al., 2024).
- FLOPs and latency: for static-embedding pipelines, the MLP head has the lowest overhead (0.26 ms/sample), but FasterKAN (0.58 ms/sample, ~1,720 samples/sec, 1.03M params) yields near-maximal F1 at lower latency than EfficientKAN, outperforming it in compute-bound settings (Aung et al., 26 Nov 2025).
- Environmental cost: Single-layer, rapidly trained heads imply fewer GPU-hours and reduced carbon footprint compared to larger, slower-converging alternatives.
Empirical results support the deployment of FR-KAN/FasterKAN when computational resources or low-latency requirements are paramount, with EfficientKAN favored where maximal accuracy is critical and time resources are less constrained.
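For reproducing latency comparisons of this kind, a simple micro-benchmark sketch is shown below; the harness is illustrative, and absolute numbers will vary with hardware, batch size, and implementation details.

```python
import time
import torch

def ms_per_sample(head: torch.nn.Module, in_dim: int, batch: int = 64, iters: int = 200) -> float:
    """Average forward-pass latency of a classification head, in ms per sample."""
    x = torch.randn(batch, in_dim)
    with torch.no_grad():
        for _ in range(10):  # warm-up passes before timing
            head(x)
        start = time.perf_counter()
        for _ in range(iters):
            head(x)
        elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (iters * batch)

print(f"FR-KAN head: {ms_per_sample(FourierKANHead(768, 4, grid_size=5), 768):.3f} ms/sample")
```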
5. Interpretability and Feature Attribution
KAN-based heads, including FR-KAN, provide interpretable mappings of input features to logits:
- Each class logit $y_j$ is a sum of univariate splines or Fourier expansions $\phi_{i,j}(x_i)$, so practitioners can plot and analyze the individual functions.
- Visualization of these activations exposes sharp thresholds, plateaus, or saturating regions, facilitating audit of which embedding dimensions influence each class, and whether those relationships are monotone or more complex (Thu et al., 9 Jul 2025).
- This is in contrast to standard MLPs, where only linear weights are available for inspection.
This makes FR-KAN heads advantageous in domains requiring transparency or feature auditability, such as low-resource settings where model trust is paramount.
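A sketch of this visualization workflow, reconstructing a learned activation $\phi_{i,j}$ from the coefficients of the `FourierKANHead` sketch above; the input range and plotting choices are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import torch

def plot_activation(head, feature_i: int, class_j: int) -> None:
    """Plot the learned univariate activation phi_{i,j} over an assumed input range."""
    xs = torch.linspace(-3.0, 3.0, 400)  # illustrative embedding-value range
    k = torch.arange(1, head.grid_size + 1, dtype=xs.dtype)
    kx = xs.unsqueeze(1) * k             # (400, G)
    with torch.no_grad():
        phi = (torch.cos(kx) * head.a[:, feature_i, class_j]
               + torch.sin(kx) * head.b[:, feature_i, class_j]).sum(dim=1)
    plt.plot(xs.numpy(), phi.numpy())
    plt.xlabel(f"embedding feature x_{feature_i}")
    plt.ylabel(f"contribution to class-{class_j} logit")
    plt.show()
```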
6. Architectural and Hyperparameter Considerations
Typical implementation guidelines:
- Grid size for Fourier coefficients: small values such as $G = 5$ or $G = 6$ (with $G = 6$, each $\phi$ uses 6 cosine and 6 sine coefficients per class per input coordinate) (Imran et al., 2024).
- Training: cross-entropy loss, Adam optimizer (learning rate ~2e-5 for transformers, 1e-3 for static embedders), batch sizes of 32–64, 5–15 epochs.
- Dropout is commonly omitted in FR-KAN heads; regularization can instead be imposed on spline weights (EfficientKAN) via L1 penalties.
- KAN variants (EfficientKAN, FasterKAN, FourierKAN) offer a continuum of expressive power and efficiency. EfficientKAN maximizes accuracy on static text, FasterKAN delivers comparable F1 with lower inference time, FourierKAN is suitable when periodic features are plausible (Aung et al., 26 Nov 2025).
Pseudocode for FR-KAN integration is provided in (Imran et al., 2024), facilitating fast adaptation in PyTorch workflows with minimal changes to legacy code.
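That pseudocode is not reproduced here; as one hedged illustration of what head-only training can look like under the guidelines above, with a hypothetical `embed_batches` iterator standing in for data loading:

```python
import torch

head = FourierKANHead(in_dim=768, num_classes=4, grid_size=5)
optimizer = torch.optim.Adam(head.parameters(), lr=2e-5)  # ~1e-3 for static embedders
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(5):                        # 5-15 epochs per the guidelines above
    for embeddings, labels in embed_batches:  # hypothetical iterator of (tensor, tensor)
        logits = head(embeddings)
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```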
7. Recommendations and Limitations
- For high-accuracy head-only fine-tuning with static embeddings (fastText, TF-IDF), EfficientKAN is optimal; FasterKAN should be considered where throughput is essential (Aung et al., 26 Nov 2025).
- For transformer-based pipelines, FR-KAN heads outperform or match MLPs in both metrics and convergence speed (Imran et al., 2024).
- For interpretability needs, KAN-based heads (including FR-KAN) enable direct visualization and audit of decision rules.
- FourierKAN is generally less effective on non-periodic feature spaces and more parameter-heavy than an MLP; its inference costs are justified primarily where nonlinear decomposability and rapid adaptation are advantageous.
- Where absolute minimal latency is needed and embedding features are well-aligned with the downstream task, a plain linear or shallow MLP head remains a reasonable baseline (Aung et al., 26 Nov 2025).
FR-KAN thus defines a lightweight, expressive alternative head for text classification in modern NLP systems, grounded in Kolmogorov–Arnold functional decomposition and exhibiting empirical superiority in a range of scenarios, especially in head-only adaptation regimes (Imran et al., 2024, Aung et al., 26 Nov 2025, Thu et al., 9 Jul 2025).