
LyricsAENet: Compact Lyric Autoencoding

Updated 12 December 2025
  • LyricsAENet is an autoencoder-based model that compresses dense, LLM-derived lyric representations into compact vectors preserving semantic, syntactic, and sequential information.
  • Its tied-weights architecture, employing SELU activations and a pure MSE loss, minimizes overfitting while effectively reconstructing high-dimensional embeddings.
  • Empirical evaluations show that integrating LyricsAENet into HitMusicLyricNet yields a 9% lower MAE and 20% reduced MSE compared to stylometric baselines.

LyricsAENet is an autoencoder-based model for compressing dense lyric representations derived from LLMs, specifically designed to enhance multimodal music popularity prediction. Integrated within the HitMusicLyricNet framework, LyricsAENet transforms high-dimensional, LLM-generated lyric embeddings into compact vectors that preserve semantic, syntactic, and sequential information essential for downstream music analytics. Empirical benchmarks on the SpotGenTrack dataset demonstrate that LyricsAENet-driven lyric features yield substantial improvements—9% lower MAE and 20% lower MSE—relative to stylometric baselines when predicting track popularity, emphasizing the unique value of deep lyric semantics in this domain (Choudhary et al., 5 Dec 2025).

1. Architectural Design and Workflow

LyricsAENet operates as a specialized compressor within the broader HitMusicLyricNet stack. Its pipeline begins with tokenization of raw song lyrics using subword tokenization methods, where punctuation is preserved and treated as individual tokens. Tokens are then passed through a pre-trained, frozen LLM (such as BERT-Large, Llama 3, or a comparable embedding service), yielding the last hidden-layer states $H \in \mathbb{R}^{T \times D_{\mathrm{LLM}}}$, where $T$ is the token count and $D_{\mathrm{LLM}}$ is the embedding dimension (e.g., 1024 for BERT-Large). A pooling operation (mean or max) over $T$ condenses $H$ to a single vector $Y \in \mathbb{R}^{D_{\mathrm{LLM}}}$.

This pooled embedding serves as the input to a tied-weights autoencoder. The encoder comprises fully connected layers that sequentially reduce dimensionality:

$$D_{\mathrm{LLM}} \rightarrow D_{\mathrm{LLM}}/2 \rightarrow D_{\mathrm{LLM}}/4 \rightarrow D_{\mathrm{LLM}}/8 \rightarrow D_b$$

with all activations using the Scaled Exponential Linear Unit (SELU), and $D_b$ (the bottleneck dimension) chosen as either $D_{\mathrm{LLM}}/12$ or $D_{\mathrm{LLM}}/16$. The decoder mirrors the encoder, employing tied weights (i.e., each decoder weight matrix is the transpose of its encoder counterpart), significantly reducing parameterization and overfitting risk.
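A minimal PyTorch sketch of this tied-weights design, under the assumption that decoder biases are separate trainable parameters (class and variable names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    """Tied-weights autoencoder sketch: the decoder reuses the transposed
    encoder matrices, roughly halving the weight count. Layer sizes follow
    the D -> D/2 -> D/4 -> D/8 -> D_b schedule from the text."""

    def __init__(self, d: int, d_b: int):
        super().__init__()
        dims = [d, d // 2, d // 4, d // 8, d_b]
        self.enc = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)
        )
        # Decoder biases are independent parameters; weights are tied.
        self.dec_bias = nn.ParameterList(
            nn.Parameter(torch.zeros(dims[i])) for i in range(len(dims) - 1)
        )

    def encode(self, y):
        z = y
        for layer in self.enc:
            z = F.selu(layer(z))    # SELU at every encoder layer
        return z

    def decode(self, z):
        for i in reversed(range(len(self.enc))):
            # Tied weights: decoder matrix is the encoder matrix transposed.
            z = F.selu(F.linear(z, self.enc[i].weight.t(), self.dec_bias[i]))
        return z

    def forward(self, y):
        return self.decode(self.encode(y))
```

For example, `TiedAutoencoder(1024, 64)` realizes the $D_{\mathrm{LLM}}/16$ bottleneck for BERT-Large-sized inputs.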

2. Mathematical Underpinnings

Let $Y \in \mathbb{R}^D$ (with $D = D_{\mathrm{LLM}}$) be the pooled lyric embedding. The encoder's operation at layer $i$ is

$$Z^{(i)} = \sigma\left(W^{(i)} Z^{(i-1)} + b^{(i)}\right), \quad Z^{(0)} = Y$$

where $\sigma = \mathrm{SELU}$.

At the bottleneck, write $Z^{(L)} = Z_b \in \mathbb{R}^{D_b}$.

The decoder computes

$$\bar{Z}^{(L-1)} = \sigma\left((W^{(L)})^{\top} Z^{(L)} + b^{(L)}\right), \quad \bar{Y} = \sigma\left((W^{(1)})^{\top} \bar{Z}^{(1)} + b^{(1)}\right)$$

Training initially minimized a composite directional reconstruction loss

$$L(Y, \bar{Y}) = \alpha_1 \lVert Y - \bar{Y} \rVert_2^2 + \alpha_2 \left(1 - \cos(Y, \bar{Y})\right)$$

with $\alpha_1 = 0.5$, $\alpha_2 = 0.1$, and $\cos(Y, \bar{Y}) = \frac{Y^{\top} \bar{Y}}{\lVert Y \rVert\, \lVert \bar{Y} \rVert}$. A pure MSE loss ($\alpha_2 = 0$) ultimately yielded better downstream effectiveness:

$$L_{\mathrm{MSE}}(Y, \bar{Y}) = \lVert Y - \bar{Y} \rVert_2^2$$

3. Training Protocol and Hyperparameters

LyricsAENet was trained exclusively on the cleaned Spotify–Genius (SPD*) dataset, which encompasses 74,206 tracks across English and four other major European languages. Preprocessing eliminated sequences outside the 100–7,000 character bounds. The LLM remains frozen throughout, with lyric embeddings extracted a single time—removing any risk of LLM drift.
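The character-length filter described above amounts to a simple predicate over raw lyric strings; this sketch uses illustrative stand-in data and a hypothetical helper name:

```python
# Length filter from SPD* preprocessing: keep lyrics whose character
# count falls within the 100-7,000 bound stated in the text.
def within_bounds(lyrics: str, lo: int = 100, hi: int = 7000) -> bool:
    return lo <= len(lyrics) <= hi

# Illustrative stand-ins: one in-range track, one too short, one too long.
tracks = ["la " * 100, "short", "x" * 10_000]
kept = [t for t in tracks if within_bounds(t)]
```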

Training specifics include:

  • Optimizer: Adam
  • Learning rate: $10^{-3}$ (Adam default)
  • Batch size: 128
  • Dropout: none (weight tying confers sufficient regularization)
  • Epochs: 50–100 (or until convergence)
  • Activation: SELU
  • Final reconstruction MSE: $\approx 10^{-5}$
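The listed settings translate into a conventional reconstruction loop. The sketch below uses a plain untied autoencoder as a stand-in model and random tensors in place of the pooled lyric embeddings; dimensions and epoch count are scaled down for illustration:

```python
import torch
import torch.nn as nn

# Training-loop sketch with the listed settings: Adam at lr 1e-3,
# batch size 128, SELU activations, pure-MSE reconstruction, no dropout.
D, D_b = 256, 16
model = nn.Sequential(
    nn.Linear(D, D // 2), nn.SELU(),
    nn.Linear(D // 2, D_b), nn.SELU(),   # bottleneck
    nn.Linear(D_b, D // 2), nn.SELU(),
    nn.Linear(D // 2, D),
)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

data = torch.randn(512, D)               # stand-in pooled embeddings
loader = torch.utils.data.DataLoader(data, batch_size=128, shuffle=True)

for epoch in range(3):                   # the paper trains for 50-100 epochs
    for batch in loader:
        optim.zero_grad()
        # Pure-MSE objective: squared L2 reconstruction error per sample.
        loss = ((batch - model(batch)) ** 2).sum(dim=-1).mean()
        loss.backward()
        optim.step()
```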

Comparative experiments with SiLU and GELU activations, as well as directional cosine loss terms, produced no substantive improvements over SELU and pure MSE.

4. Downstream Integration and Fusion

Post-training, LyricsAENet encodes each song's lyrics as a fixed vector $z_l \in \mathbb{R}^{D_b}$. In parallel, AudioAENet compresses audio feature vectors to $z_a \in \mathbb{R}^{d_a}$. The combined MusicFuseNet module concatenates

$$h = \left[z_a;\; z_l;\; x_{\mathrm{HL}};\; x_{\mathrm{MD}}\right] \in \mathbb{R}^{d_a + D_b + d_{\mathrm{HL}} + d_{\mathrm{MD}}}$$

where $x_{\mathrm{HL}}$ denotes 12 high-level Spotify audio descriptors and $x_{\mathrm{MD}}$ encodes social metadata (including artist followers, popularity, and market presence).

The composite vector $h$ passes through a three-layer MLP with widths decreasing by factors $(1, \tfrac12, \tfrac13)$, ReLU activations, and a terminal sigmoid output, yielding a predicted popularity score

$$\hat{y} = \sigma\left(U^{(3)}\,\phi\left(U^{(2)}\,\phi\left(U^{(1)} h + c^{(1)}\right) + c^{(2)}\right) + c^{(3)}\right)$$

During end-to-end popularity training, LyricsAENet (and AudioAENet) parameters are frozen.
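A sketch of the fusion and prediction head follows. The branch dimensions $d_a$, $D_b$, $d_{\mathrm{MD}}$ are illustrative assumptions ($d_{\mathrm{HL}} = 12$ matches the 12 Spotify descriptors), and the layer widths reflect one plausible reading of the $(1, \tfrac12, \tfrac13)$ schedule:

```python
import torch
import torch.nn as nn

# Illustrative branch dimensions; only d_HL = 12 is fixed by the text.
d_a, D_b, d_HL, d_MD = 64, 64, 12, 8
z_a, z_l = torch.randn(1, d_a), torch.randn(1, D_b)
x_HL, x_MD = torch.randn(1, d_HL), torch.randn(1, d_MD)

h = torch.cat([z_a, z_l, x_HL, x_MD], dim=-1)   # fused feature vector
d = h.shape[-1]

# Three linear maps U^(1), U^(2), U^(3) with ReLU between layers and a
# sigmoid on the scalar output, so y_hat lies in (0, 1).
head = nn.Sequential(
    nn.Linear(d, d // 2), nn.ReLU(),
    nn.Linear(d // 2, d // 3), nn.ReLU(),
    nn.Linear(d // 3, 1), nn.Sigmoid(),
)
y_hat = head(h)   # predicted popularity score
```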

5. Empirical Evaluation and Ablation Findings

On the SpotGenTrack test set, the multimodal HitMusicLyricNet incorporating LyricsAENet achieved a mean absolute error (MAE) of 0.0772 and a 20% lower mean squared error (MSE) relative to the strongest baseline, which uses only stylometric lyric features (MAE 0.0862). Removing the LyricsAENet branch raises MAE back to 0.0852, erasing nearly all of the gain. Additional ablations (reported in the appendix) show that removing lyric embeddings increases MAE by 10.4%, while combining lyrics with metadata alone still outperforms audio-only models. SHAP/LIME analyses provided confidence intervals for feature importances, confirming the robustness of the lyric-based improvements; no formal $p$-values were reported (Choudhary et al., 5 Dec 2025).

Model Configuration                       MAE      Relative ΔMAE
HitMusicNet (stylometry baseline)         0.0862   0
HitMusicLyricNet (full, LyricsAENet)      0.0772   −9.3%
HitMusicLyricNet (no LyricsAENet)         0.0852   +10.4%

6. Interpretation, Advantages, and Open Challenges

LyricsAENet's value lies in encoding dense, LLM-derived lyric semantics unavailable to shallow statistics. Its architecture—pooling over contextualized token embeddings and bottleneck compression via tied-weights autoencoding—allows it to distill the "meaning" of lyrics while minimizing noise and parameter redundancy. This results in richer, more predictive text-derived features for music analytics.

Several limitations remain. Generic LLMs lack domain-specific musical awareness, and their high-dimensional embeddings—while informative—risk overfitting without further regularization. The compressed lyric vectors themselves are difficult to interpret directly. Recommendations for future investigation include pre-training LLMs on music-specific corpora, experimenting with segment-level representations (e.g., chorus/hook detection), and developing interpretable or disentangled embeddings that align with known lyric attributes or song structure (Choudhary et al., 5 Dec 2025).

A plausible implication is that LLM-based lyric features, when compacted through robust autoencoding, serve as an indispensable modality for modern multimodal music analytics, but that interpretation and domain adaptation remain unresolved challenges.
