
LyricsAENet: Compact Lyric Autoencoding

Updated 12 December 2025
  • LyricsAENet is an autoencoder-based model that compresses dense, LLM-derived lyric representations into compact vectors preserving semantic, syntactic, and sequential information.
  • Its tied-weights architecture, employing SELU activations and a pure MSE loss, minimizes overfitting while effectively reconstructing high-dimensional embeddings.
  • Empirical evaluations show that integrating LyricsAENet into HitMusicLyricNet yields a 9% lower MAE and 20% reduced MSE compared to stylometric baselines.

LyricsAENet is an autoencoder-based model for compressing dense lyric representations derived from LLMs, specifically designed to enhance multimodal music popularity prediction. Integrated within the HitMusicLyricNet framework, LyricsAENet transforms high-dimensional, LLM-generated lyric embeddings into compact vectors that preserve semantic, syntactic, and sequential information essential for downstream music analytics. Empirical benchmarks on the SpotGenTrack dataset demonstrate that LyricsAENet-driven lyric features yield substantial improvements—9% lower MAE and 20% lower MSE—relative to stylometric baselines when predicting track popularity, emphasizing the unique value of deep lyric semantics in this domain (Choudhary et al., 5 Dec 2025).

1. Architectural Design and Workflow

LyricsAENet operates as a specialized compressor within the broader HitMusicLyricNet stack. Its pipeline begins with tokenization of raw song lyrics using subword tokenization methods, where punctuation is preserved and treated as individual tokens. Tokens are then passed through a pre-trained, frozen LLM (such as BERT-Large, Llama 3, or a comparable embedding service), yielding the last hidden-layer states $H \in \mathbb{R}^{T \times D_{\mathrm{LLM}}}$, where $T$ is the token count and $D_{\mathrm{LLM}}$ is the embedding dimension (e.g., 1024 for BERT-Large). A pooling operation (mean or max) over $T$ condenses $H$ to a single vector $Y \in \mathbb{R}^{D_{\mathrm{LLM}}}$.

This pooled embedding serves as the input to a tied-weights autoencoder. The encoder comprises fully connected layers that sequentially reduce dimensionality:

$$D_{\mathrm{LLM}} \rightarrow D_{\mathrm{LLM}}/2 \rightarrow D_{\mathrm{LLM}}/4 \rightarrow D_{\mathrm{LLM}}/8 \rightarrow D_b$$

with all activations using the Scaled Exponential Linear Unit (SELU), and $D_b$ (the bottleneck dimension) chosen as either $D_{\mathrm{LLM}}/12$ or $D_{\mathrm{LLM}}/16$. The decoder mirrors the encoder, employing tied weights (i.e., each decoder weight matrix is the transpose of its encoder counterpart), significantly reducing parameterization and overfitting risk.
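A minimal PyTorch sketch of this tied-weights design, under the assumption that decoder biases are separate trainable parameters (class and variable names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    """Tied-weights autoencoder sketch: the decoder reuses the transposed
    encoder matrices, roughly halving the weight count. Layer sizes follow
    the D -> D/2 -> D/4 -> D/8 -> D_b schedule from the text."""

    def __init__(self, d: int, d_b: int):
        super().__init__()
        dims = [d, d // 2, d // 4, d // 8, d_b]
        self.enc = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)
        )
        # Decoder biases are independent parameters; weights are tied.
        self.dec_bias = nn.ParameterList(
            nn.Parameter(torch.zeros(dims[i])) for i in range(len(dims) - 1)
        )

    def encode(self, y):
        z = y
        for layer in self.enc:
            z = F.selu(layer(z))    # SELU at every encoder layer
        return z

    def decode(self, z):
        for i in reversed(range(len(self.enc))):
            # Tied weights: decoder matrix is the encoder matrix transposed.
            z = F.selu(F.linear(z, self.enc[i].weight.t(), self.dec_bias[i]))
        return z

    def forward(self, y):
        return self.decode(self.encode(y))
```

For example, `TiedAutoencoder(1024, 64)` realizes the $D_{\mathrm{LLM}}/16$ bottleneck for BERT-Large-sized inputs.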

2. Mathematical Underpinnings

Let $Y \in \mathbb{R}^D$ (with $D = D_{\mathrm{LLM}}$) be the pooled lyric embedding. The encoder's operation at layer $i$ is

$$Z^{(i)} = \sigma\left(W^{(i)} Z^{(i-1)} + b^{(i)}\right), \quad Z^{(0)} = Y$$

where $\sigma = \mathrm{SELU}$.

At the bottleneck, write $Z^{(L)} = Z_b \in \mathbb{R}^{D_b}$.

The decoder computes

$$\bar{Z}^{(L-1)} = \sigma\left((W^{(L)})^{\top} Z^{(L)} + b^{(L)}\right), \quad \bar{Y} = \sigma\left((W^{(1)})^{\top} \bar{Z}^{(1)} + b^{(1)}\right)$$

Training initially minimized a composite directional reconstruction loss

$$L(Y, \bar{Y}) = \alpha_1 \lVert Y - \bar{Y} \rVert_2^2 + \alpha_2 \left(1 - \cos(Y, \bar{Y})\right)$$

with $\alpha_1 = 0.5$, $\alpha_2 = 0.1$, and $\cos(Y, \bar{Y}) = \frac{Y^{\top} \bar{Y}}{\lVert Y \rVert\, \lVert \bar{Y} \rVert}$. A pure MSE loss ($\alpha_2 = 0$) ultimately yielded better downstream effectiveness:

$$L_{\mathrm{MSE}}(Y, \bar{Y}) = \lVert Y - \bar{Y} \rVert_2^2$$

3. Training Protocol and Hyperparameters

LyricsAENet was trained exclusively on the cleaned Spotify–Genius (SPD*) dataset, which encompasses 74,206 tracks across English and four other major European languages. Preprocessing eliminated sequences outside the 100–7,000 character bounds. The LLM remains frozen throughout, with lyric embeddings extracted a single time—removing any risk of LLM drift.
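The character-length filter described above amounts to a simple predicate over raw lyric strings; this sketch uses illustrative stand-in data and a hypothetical helper name:

```python
# Length filter from SPD* preprocessing: keep lyrics whose character
# count falls within the 100-7,000 bound stated in the text.
def within_bounds(lyrics: str, lo: int = 100, hi: int = 7000) -> bool:
    return lo <= len(lyrics) <= hi

# Illustrative stand-ins: one in-range track, one too short, one too long.
tracks = ["la " * 100, "short", "x" * 10_000]
kept = [t for t in tracks if within_bounds(t)]
```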

Training specifics include:

  • Optimizer: Adam
  • Learning rate: $10^{-3}$ (Adam default)
  • Batch size: 128
  • Dropout: none (weight tying confers sufficient regularization)
  • Epochs: 50–100 (or until convergence)
  • Activation: SELU
  • Final reconstruction MSE: $\approx 10^{-5}$
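The listed settings translate into a conventional reconstruction loop. The sketch below uses a plain untied autoencoder as a stand-in model and random tensors in place of the pooled lyric embeddings; dimensions and epoch count are scaled down for illustration:

```python
import torch
import torch.nn as nn

# Training-loop sketch with the listed settings: Adam at lr 1e-3,
# batch size 128, SELU activations, pure-MSE reconstruction, no dropout.
D, D_b = 256, 16
model = nn.Sequential(
    nn.Linear(D, D // 2), nn.SELU(),
    nn.Linear(D // 2, D_b), nn.SELU(),   # bottleneck
    nn.Linear(D_b, D // 2), nn.SELU(),
    nn.Linear(D // 2, D),
)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

data = torch.randn(512, D)               # stand-in pooled embeddings
loader = torch.utils.data.DataLoader(data, batch_size=128, shuffle=True)

for epoch in range(3):                   # the paper trains for 50-100 epochs
    for batch in loader:
        optim.zero_grad()
        # Pure-MSE objective: squared L2 reconstruction error per sample.
        loss = ((batch - model(batch)) ** 2).sum(dim=-1).mean()
        loss.backward()
        optim.step()
```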

Comparative experiments with SiLU and GELU activations, as well as directional cosine loss terms, produced no substantive improvements over SELU and pure MSE.

4. Downstream Integration and Fusion

Post-training, LyricsAENet encodes each song's lyrics as a fixed vector $z_l \in \mathbb{R}^{D_b}$. In parallel, AudioAENet compresses audio feature vectors to $z_a \in \mathbb{R}^{d_a}$. The combined MusicFuseNet module concatenates

$$h = \left[z_a;\; z_l;\; x_{\mathrm{HL}};\; x_{\mathrm{MD}}\right] \in \mathbb{R}^{d_a + D_b + d_{\mathrm{HL}} + d_{\mathrm{MD}}}$$

where $x_{\mathrm{HL}}$ denotes 12 high-level Spotify audio descriptors and $x_{\mathrm{MD}}$ encodes social metadata (including artist followers, popularity, and market presence).

The composite vector $h$ passes through a three-layer MLP with widths decreasing by factors $(1, \tfrac12, \tfrac13)$, ReLU activations, and a terminal sigmoid output, yielding a predicted popularity score

$$\hat{y} = \sigma\left(U^{(3)}\,\phi\left(U^{(2)}\,\phi\left(U^{(1)} h + c^{(1)}\right) + c^{(2)}\right) + c^{(3)}\right)$$

During end-to-end popularity training, LyricsAENet (and AudioAENet) parameters are frozen.
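A sketch of the fusion and prediction head follows. The branch dimensions $d_a$, $D_b$, $d_{\mathrm{MD}}$ are illustrative assumptions ($d_{\mathrm{HL}} = 12$ matches the 12 Spotify descriptors), and the layer widths reflect one plausible reading of the $(1, \tfrac12, \tfrac13)$ schedule:

```python
import torch
import torch.nn as nn

# Illustrative branch dimensions; only d_HL = 12 is fixed by the text.
d_a, D_b, d_HL, d_MD = 64, 64, 12, 8
z_a, z_l = torch.randn(1, d_a), torch.randn(1, D_b)
x_HL, x_MD = torch.randn(1, d_HL), torch.randn(1, d_MD)

h = torch.cat([z_a, z_l, x_HL, x_MD], dim=-1)   # fused feature vector
d = h.shape[-1]

# Three linear maps U^(1), U^(2), U^(3) with ReLU between layers and a
# sigmoid on the scalar output, so y_hat lies in (0, 1).
head = nn.Sequential(
    nn.Linear(d, d // 2), nn.ReLU(),
    nn.Linear(d // 2, d // 3), nn.ReLU(),
    nn.Linear(d // 3, 1), nn.Sigmoid(),
)
y_hat = head(h)   # predicted popularity score
```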

5. Empirical Evaluation and Ablation Findings

On the SpotGenTrack test set, the multimodal HitMusicLyricNet incorporating LyricsAENet achieved a mean absolute error (MAE) of 0.0772 and a 20% lower mean squared error (MSE) relative to the strongest baseline, which uses only stylometric lyric features (MAE 0.0862). Removing the LyricsAENet branch raises MAE back to 0.0852, erasing nearly all of the gain. Additional ablations (reported in the appendix) show that removing lyric embeddings increases MAE by 10.4%, while combining lyrics with metadata alone still outperforms audio-only models. SHAP/LIME analyses provided confidence intervals for feature importances, confirming the robustness of the lyric-based improvements; no formal $p$-values were reported (Choudhary et al., 5 Dec 2025).

Model Configuration                       MAE      Relative ΔMAE
HitMusicNet (stylometry baseline)         0.0862   0
HitMusicLyricNet (full, LyricsAENet)      0.0772   −9.3%
HitMusicLyricNet (no LyricsAENet)         0.0852   +10.4%

6. Interpretation, Advantages, and Open Challenges

LyricsAENet's value lies in encoding dense, LLM-derived lyric semantics unavailable to shallow statistics. Its architecture—pooling over contextualized token embeddings and bottleneck compression via tied-weights autoencoding—allows it to distill the "meaning" of lyrics while minimizing noise and parameter redundancy. This results in richer, more predictive text-derived features for music analytics.

Several limitations remain. Generic LLMs lack domain-specific musical awareness, and their high-dimensional embeddings—while informative—risk overfitting without further regularization. The compressed lyric vectors themselves are difficult to interpret directly. Recommendations for future investigation include pre-training LLMs on music-specific corpora, experimenting with segment-level representations (e.g., chorus/hook detection), and developing interpretable or disentangled embeddings that align with known lyric attributes or song structure (Choudhary et al., 5 Dec 2025).

A plausible implication is that LLM-based lyric features, when compacted through robust autoencoding, serve as an indispensable modality for modern multimodal music analytics, but that interpretation and domain adaptation remain unresolved challenges.
