
KAN-Based Audiobox Aesthetics

Updated 8 December 2025
  • KAN-based audiobox aesthetics leverage the Kolmogorov–Arnold theorem to design neural networks with edge-wise nonlinearities, enhancing audio aesthetics prediction accuracy.
  • The system integrates a pretrained WavLM encoder, Group-Rational KAN layers, and ensemble strategies to generate precise predictions for multiple AES axes.
  • Training incorporates semi-supervised iterative pseudo-labeling and hypernetwork-driven few-shot adaptation to achieve parameter efficiency and real-time audio processing.

Kolmogorov–Arnold Network (KAN)-based audiobox aesthetics refers to the application of KAN architectures for predicting audio aesthetics scores (AES) and for representing or reconstructing audio signals. The Kolmogorov–Arnold Network leverages the Kolmogorov–Arnold representation theorem to structure neural networks with learnable edge-wise nonlinearities, providing a versatile and parameter-efficient implicit neural representation (INR) for audio. This approach has been established within the AudioMOS Challenge 2025, specifically through the T12 system, which integrates KAN-based predictors and ensemble methodologies to deliver superior correlations in audio aesthetics prediction (Yamamoto et al., 5 Dec 2025). Recent studies have further extended KAN utility for audio representation and few-shot adaptation via hypernetwork-based architectures such as FewSound (Marszałek et al., 4 Mar 2025).

1. Foundations: Kolmogorov–Arnold Networks for Audio Processing

KANs replace traditional MLP architectures with neural networks formulated according to the Kolmogorov–Arnold representation theorem. Any continuous function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is structured as

$$f(x_1, \ldots, x_d) = \sum_{q=0}^{2d} \Phi_q \left( \sum_{p=1}^{d} \phi_{q,p}(x_p) \right),$$

where $\phi_{q,p}$ and $\Phi_q$ are univariate rational or spline-based functions parameterized by learnable weights. In the context of audio, KAN offers a pure activation graph in which all nonlinearity is placed on edges, while neurons simply sum their incoming activations. These design choices facilitate compact neural audio representations and flexible function approximation, critical for modeling perceptually relevant audio features and reconstructions (Marszałek et al., 4 Mar 2025).
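
To make the edge-wise formulation concrete, here is a minimal sketch of a single KAN layer in PyTorch. The class name and the Gaussian radial-basis parameterization are illustrative assumptions for brevity; the cited papers use B-splines or rational functions for the edge nonlinearities.

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """Minimal KAN layer: one learnable univariate function per edge.

    Each edge (i -> j) applies phi_{j,i}(x_i), modeled here as a linear
    combination of K fixed Gaussian radial basis functions; neurons just
    sum their incoming edge activations (no neuron-level nonlinearity).
    Illustrative sketch only -- real KANs typically use B-splines or,
    in GR-KAN, rational functions.
    """

    def __init__(self, d_in: int, d_out: int, num_basis: int = 8,
                 grid_range=(-2.0, 2.0)):
        super().__init__()
        centers = torch.linspace(*grid_range, num_basis)
        self.register_buffer("centers", centers)        # (K,)
        self.gamma = (num_basis / (grid_range[1] - grid_range[0])) ** 2
        # One coefficient vector per edge: (d_out, d_in, K).
        self.coeff = nn.Parameter(0.1 * torch.randn(d_out, d_in, num_basis))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, d_in)
        # Evaluate the K basis functions at every input coordinate.
        basis = torch.exp(-self.gamma * (x.unsqueeze(-1) - self.centers) ** 2)
        # phi_{j,i}(x_i) for all edges, then sum over incoming edges i.
        return torch.einsum("bik,oik->bo", basis, self.coeff)

# Usage: a two-layer KAN mapping 4 features to 1 score.
net = nn.Sequential(KANLayer(4, 16), KANLayer(16, 1))
print(net(torch.randn(8, 4)).shape)  # torch.Size([8, 1])
```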

2. KAN-Based Audiobox Aesthetics in AudioMOS 2025: System Architecture

The T12 system for AudioMOS Challenge 2025 employs a KAN-based "audiobox aesthetics" predictor structured as follows:

  • Base encoder: Pretrained WavLM with 12 Transformer layers, hidden size 768. Layer outputs $H^{(\ell)} \in \mathbb{R}^{T \times 768}$ are pooled into AES axis-specific embeddings using learned layer-wise and temporal weights $\alpha^{(\ell)}_{\text{axis}}$:

$$A_{\text{axis}} = \sum_{\ell=1}^{12} \alpha^{(\ell)}_{\text{axis}} \cdot \operatorname{mean}_{t=1..T} H^{(\ell)}_t$$

where each AES axis $\in \{\text{PQ}, \text{PC}, \text{CE}, \text{CU}\}$ receives a unique embedding.

  • KAN predictor: Every dense+activation block in the baseline MLP is replaced with Group-Rational KAN (GR-KAN) layers: input features are split into $G$ groups, each group processed by a rational net $R_q$, reducing parameter count while maintaining expressivity. The rational net has degree $(N, M)$:

$$R(z) = \frac{a_0 + a_1 z + \cdots + a_N z^N}{1 + b_1 z + \cdots + b_M z^M}$$

Final outputs are produced via a linear projection after two GR-KAN layers, yielding a scalar AES prediction per axis.

  • Multi-head structure: The architecture supports prediction for all four AES axes in parallel, with shared feature extraction and axis-specific KAN parameters.

This modular design facilitates the replacement of conventional aesthetic scoring MLPs by structurally richer KAN modules, enabling precise regression on perceptually complex targets (Yamamoto et al., 5 Dec 2025).
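
A minimal sketch of the pooling and predictor head described above, in PyTorch. All class names, hidden sizes, and the absolute value guarding the rational denominator (a common safeguard for rational activations) are illustrative assumptions, not details confirmed by the T12 paper:

```python
import torch
import torch.nn as nn

class GroupedRational(nn.Module):
    """Rational activation R(z) = P(z)/Q(z) with one coefficient set per
    channel group. The abs() keeping the denominator positive is an
    assumption, not a confirmed detail of the T12 system."""
    def __init__(self, channels: int, groups: int = 4, N: int = 5, M: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.a = nn.Parameter(0.1 * torch.randn(groups, N + 1))  # numerator
        self.b = nn.Parameter(0.1 * torch.randn(groups, M))      # denominator

    def forward(self, z: torch.Tensor) -> torch.Tensor:          # z: (B, C)
        B, C = z.shape
        z = z.view(B, self.groups, C // self.groups)
        num = sum(self.a[:, n, None] * z**n for n in range(self.a.shape[1]))
        den = 1 + torch.abs(sum(self.b[:, m - 1, None] * z**m
                                for m in range(1, self.b.shape[1] + 1)))
        return (num / den).view(B, C)

class AxisHead(nn.Module):
    """Two GR-KAN blocks, then a linear projection to one AES scalar."""
    def __init__(self, d_in: int = 768, d_hid: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hid), GroupedRational(d_hid),
            nn.Linear(d_hid, d_hid), GroupedRational(d_hid),
            nn.Linear(d_hid, 1),
        )

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        return self.net(a)

axes = ["PQ", "PC", "CE", "CU"]
layer_w = torch.zeros(4, 12).softmax(-1)     # alpha per axis (learned in practice)
H = torch.randn(8, 12, 200, 768)             # WavLM outputs: (B, layer, T, 768)
A = torch.einsum("al,blh->bah", layer_w, H.mean(2))  # pooled A_axis: (B, 4, 768)
heads = nn.ModuleDict({ax: AxisHead() for ax in axes})
scores = {ax: heads[ax](A[:, i]).squeeze(-1) for i, ax in enumerate(axes)}
```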

3. Training Methodologies and Data Utilization

T12's KAN-based audiobox aesthetics model is trained using both labeled and pseudo-labeled data:

  • Labeled sets: AMC25 (2700 train, 250 dev samples) + PAM (1000 samples).
  • Unlabeled sets: VMC2022 (≈8,459 samples), pseudo-labeled via Iterative Pseudo-Labeling (IPL; a code sketch follows the steps below):

    1. Teacher model (baseline) generates pseudo-labels on VMC22.
    2. Student (KAN-based) model trains jointly on AMC25+PAM (human labels) and VMC22 (pseudo-labels), optimizing

$$L = \sum_{\text{axis } i} \mathrm{MSE}(\hat{y}_i, y_i)$$

    3. Teacher is replaced if the student shows lower dev MSE.
    4. Iteration stops when there is no further improvement, or after five updates.
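
A minimal sketch of this loop, assuming hypothetical `train_model`, `evaluate_mse`, and `predict` helpers in place of the real training stack:

```python
def iterative_pseudo_labeling(teacher, make_student, labeled, unlabeled, dev,
                              max_updates=5):
    """IPL as described above. `teacher` starts as the baseline model;
    `make_student` returns a fresh KAN-based predictor. All helpers
    (train_model, evaluate_mse, model.predict) are hypothetical stand-ins."""
    best_mse = evaluate_mse(teacher, dev)
    for _ in range(max_updates):                 # at most five teacher updates
        # Step 1: teacher pseudo-labels the unlabeled pool (VMC22).
        pseudo = [(x, teacher.predict(x)) for x in unlabeled]
        # Step 2: student trains jointly on human + pseudo labels (MSE loss).
        student = make_student()
        train_model(student, labeled + pseudo)
        mse = evaluate_mse(student, dev)
        if mse < best_mse:                       # Step 3: promote the student.
            teacher, best_mse = student, mse
        else:                                    # Step 4: no improvement, stop.
            break
    return teacher
```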

Training uses AdamW (schedule-free), lr = 1e-4, batch size 40, for 10 epochs, typically consuming 6–8 GPU-hours on NVIDIA A100 (80GB). Four independently seeded runs are stacked in the final ensemble.
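
If "AdamW (schedule-free)" refers to Meta's `schedulefree` package (an assumption; the paper wording above is not explicit), the optimizer setup would look roughly like:

```python
import schedulefree
import torch

model = torch.nn.Linear(768, 1)   # stand-in for the KAN-based predictor
opt = schedulefree.AdamWScheduleFree(model.parameters(), lr=1e-4)

opt.train()                       # schedule-free optimizers need explicit
for _ in range(10):               # train/eval mode switching; 10 epochs
    x, y = torch.randn(40, 768), torch.randn(40, 1)   # batch size 40
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    opt.step()
opt.eval()
```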

KAN models benefit from semi-supervised IPL, which stabilizes training and enhances generalization compared to standard fine-tuning procedures (Yamamoto et al., 5 Dec 2025).

4. Perceptual Metrics and Comparative Outcomes

KAN-based systems are evaluated on multiple metrics:

  • AudioMOS AES axes: PQ (Production Quality), PC (Production Complexity), CE (Content Enjoyment), CU (Content Usefulness).

  • Metrics for audio representation: MSE, Log-Spectral Distance (LSD), PESQ, SI-SNR, STOI, CDPAM.
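
As an illustration of one of these metrics, a common formulation of Log-Spectral Distance is shown below; the papers' exact FFT settings are not stated, so the ones here are assumptions:

```python
import numpy as np

def log_spectral_distance(ref: np.ndarray, est: np.ndarray,
                          n_fft: int = 512, hop: int = 128,
                          eps: float = 1e-8) -> float:
    """LSD in dB: RMS over frequency of the log-magnitude spectrogram
    difference, averaged over frames. One common variant -- FFT size,
    hop, and log base here are assumptions, not the papers' settings."""
    def spec(x):
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))
    S_ref, S_est = spec(ref), spec(est)
    d = 20 * np.log10((S_ref + eps) / (S_est + eps))
    return float(np.mean(np.sqrt(np.mean(d**2, axis=-1))))

# Example: LSD between a tone and a slightly noisy copy of it.
x = np.sin(2 * np.pi * 440 * np.arange(22050) / 22050)
print(log_spectral_distance(x, x + 0.01 * np.random.randn(x.size)))
```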

Performance figures in the AES prediction framework:

| Model Type | Utterance SRCC | Dev MSE | System SRCC (PQ/PC/CE/CU) |
|---|---|---|---|
| Baseline MLP | 0.899 | 0.617 | 0.866 / 0.934 / 0.841 / 0.810 |
| KAN (single run) | 0.903 | 0.783 | |
| KAN (ensemble FT) | ≈0.909 | ≈0.58 | |
| KAN (ensemble IPL) | ≈0.914 | ≈0.55 | |
| T12 Ensemble | 0.832–0.911 | | 0.916 / 0.938 / 0.946 / 0.924 |
| VERSA regressor | 0.894 | 0.648 | |

KAN ensemble stacking with VERSA regressor achieves the highest average SRCC across axes and levels on AMC25. Replacing every MLP by GR-KAN yielded SRCC improvements of ~1–3 points. IPL training further stabilized variance and improved metrics. The hybrid ensemble (KAN + VERSA) compensates for single-model outliers (Yamamoto et al., 5 Dec 2025).
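
The combination rule behind the stacked ensemble is not spelled out above, so the least-squares stacking below is an assumption; it illustrates how four seeded KAN runs and a VERSA regressor could be blended on dev data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-model AES predictions on a 250-sample dev set.
preds = {f"kan_seed{i}": rng.random(250) for i in range(4)}
preds["versa"] = rng.random(250)
targets = rng.random(250)

P = np.stack(list(preds.values()), axis=1)        # (250, 5) member predictions
w, *_ = np.linalg.lstsq(P, targets, rcond=None)   # stacking weights fit on dev
ensemble = P @ w                                  # blended AES prediction
```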

For audio INR, single-signal KAN (33K params) scores:

  • MSE = $6.7 \times 10^{-4}$ (±$2.0 \times 10^{-3}$)
  • LSD = 1.29 dB (±0.22)
  • SI-SNR = 20.50 (±6.94)
  • PESQ = 3.57 (±0.42)
  • STOI = 0.99 (±0.02)

FewSound, a hypernetwork-based extension, further reduces MSE by 33%, boosts SI-SNR by 60.9%, and increases PESQ by 8.7% compared to HyperSound (Marszałek et al., 4 Mar 2025).

5. Hypernetwork Extensions and KAN Scalability

FewSound implements meta-learning updates over KAN parameters using a learned hypernetwork $H$. Given support data $x_S$, an audio encoder $E$ maps it to $z_E$, while the universal weights $\theta$ are embedded via $G$. The hypernetwork predicts additive updates:

$$\Delta \theta = H([z_E; G(\theta)]), \qquad \theta' = \theta + \Delta\theta$$

Adaptation enables few-shot audio representation and rapid transfer between tasks. Evaluation confirms KAN's suitability for scalable and real-time deployment: a 1.5 s KAN model requires 33,000 weights (≈130 kB), with ≈220M FLOPs/s at 22 kHz, within feasible compute for CPU or DSP operation. Longer signals scale linearly in memory; overlapping windowing and pipelined layer evaluation enable bounded latency (Marszałek et al., 4 Mar 2025).
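
A minimal sketch of this additive update, with module sizes and names as illustrative assumptions rather than FewSound's actual configuration:

```python
import torch
import torch.nn as nn

class HyperAdapter(nn.Module):
    """Predicts an additive update to flattened target weights theta from a
    support-set embedding, following Delta-theta = H([z_E; G(theta)])."""
    def __init__(self, n_params: int, d_audio: int = 128, d_emb: int = 64):
        super().__init__()
        self.G = nn.Linear(n_params, d_emb)      # weight embedding G(theta)
        self.H = nn.Sequential(                  # hypernetwork H
            nn.Linear(d_audio + d_emb, 256), nn.ReLU(),
            nn.Linear(256, n_params),
        )

    def forward(self, z_e: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
        delta = self.H(torch.cat([z_e, self.G(theta)], dim=-1))
        return theta + delta                     # theta' = theta + Delta-theta

n_params = 33_000                                # 1.5 s KAN weight count
adapter = HyperAdapter(n_params)
z_e = torch.randn(1, 128)                        # encoder output E(x_S)
theta = torch.randn(1, n_params)                 # flattened KAN weights
theta_new = adapter(z_e, theta)
```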

KAN's parameter efficiency and strong perceptual metrics (LSD ≈ 1.3 dB, PESQ ≈ 3.6) render it well suited for Audiobox modules and general audio toolchains.

6. Ablation Studies and Model Comparisons

Experiments have identified optimal settings for KAN audio models:

  • Positional encoding length $L=16$ for 1.5 s signals
  • Spline order $O=3$ yields the highest PESQ within computational budgets
  • Knot grid size $G=17$ further enhances fidelity at linear parameter growth
  • Model width/depth scaling improves accuracy with increased compute costs
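
Collected as a configuration object, these ablation-selected settings might look like the hypothetical sketch below; field names, and the width/depth defaults in particular, are illustrative:

```python
from dataclasses import dataclass

@dataclass
class KANAudioConfig:
    """Ablation-selected settings for a 1.5 s audio KAN (names hypothetical)."""
    pos_encoding_len: int = 16  # L = 16 positional-encoding frequencies
    spline_order: int = 3       # O = 3 gave the best PESQ per compute budget
    grid_size: int = 17         # G = 17 knots; parameters grow linearly in G
    width: int = 64             # width/depth trade accuracy for compute
    depth: int = 3              # (illustrative values, not from the papers)

cfg = KANAudioConfig()
```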

Comparative analysis shows KAN outperforming SIREN, NeRF, FINER, WIRE, RFF on LSD and PESQ. In hypernetwork scenarios, FewSound’s KAN adaptation demonstrates enhanced performance and robustness compared to NeRF, particularly under adaptation and meta-learning demands (Marszałek et al., 4 Mar 2025).

A plausible implication is that KAN architectures — especially with hypernetwork adaptation — constitute an emergent backbone for high-fidelity, efficient, and flexible audio analytics and generative systems.

7. Significance and Impact

KAN-based audiobox aesthetics, as realized in the AudioMOS Challenge and in related INR work, establishes a new state of the art for AES prediction and audio reconstruction. The architectural shift toward rational and spline parameterizations yields more expressive, parameter-efficient models. Ensemble stacking with diverse architectures (KAN and metric regression) ensures peak correlation and stability. IPL strategies allow effective utilization of large unlabeled corpora, crucial for real-world audio tasks where labeled data is limited.

This suggests KAN’s modularity and compatibility with meta-learning and hypernetwork frameworks will continue to drive advances in audio signal representation, restoration, and perceptual quality analysis, especially as deployment constraints and data diversity intensify in future audio toolchains.

(Yamamoto et al., 5 Dec 2025, Marszałek et al., 4 Mar 2025)
