HitMusicLyricNet: Neural Music Prediction

Updated 12 December 2025

HitMusicLyricNet is a multimodal neural architecture that fuses dense lyric embeddings, audio features, and social metadata to predict music popularity.
The system employs LyricsAENet, an autoencoder pipeline that compresses high-dimensional lyric embeddings from large language models to capture semantic and syntactic patterns.
Empirical evaluations on the SpotGenTrack dataset demonstrate significant error reductions, confirming the value of integrating learned lyric codes in popularity prediction.

HitMusicLyricNet is a multimodal neural architecture for music popularity prediction, distinguished by its integration of dense lyric embeddings with audio and social metadata. Developed to address the under-explored role of lyric content in predicting track popularity, HitMusicLyricNet leverages a dedicated autoencoder pipeline (LyricsAENet) to extract and compress high-dimensional lyric features derived from LLMs. Empirical studies on the SpotGenTrack dataset demonstrate that this approach achieves substantial improvements in mean absolute error (MAE) and mean squared error (MSE) over baseline methods lacking learned lyric representations, underscoring the predictive value embedded in neural lyric codes (Choudhary et al., 5 Dec 2025).

1. Lyric Embedding Pipeline and Feature Extraction

The pipeline begins with comprehensive preprocessing of the SpotGenTrack dataset (over 100,000 tracks), including removal of tracks with lyrics shorter than 100 or longer than 7000 characters and restriction to English, Spanish, Portuguese, French, and German entries. Lyrics are tokenized using the native tokenizer of the selected LLM (e.g., WordPiece for BERT). A forward pass through a frozen LLM such as BERT Large (hidden size $d=1024$ ), LLaMA 3 variants, or OpenAI text embedding models produces the last hidden-layer states $H \in \mathbb{R}^{T \times d}$ , with $T$ tokens. Mean-pooling (or, in ablations, max-pooling plus CLS token) is applied across the token dimension to yield an aggregated embedding $Y \in \mathbb{R}^d$ .

2. LyricsAENet Architecture and Loss Formulation

LyricsAENet, the lyric feature compressor, is a tied-weights autoencoder with three-layer feed-forward encoder and symmetric decoder. The encoder applies Scaled Exponential Linear Unit (SELU) activations throughout and projects the input lyric embedding to sequentially lower dimensions: $\begin{aligned} h^{(1)} &= \operatorname{SELU}(W_e^{(1)}\,Y + b_e^{(1)}) &\Bigl[\frac{d}{2}\Bigr] \ h^{(2)} &= \operatorname{SELU}(W_e^{(2)}\,h^{(1)} + b_e^{(2)}) &\Bigl[\frac{d}{4}\Bigr] \ z &= \operatorname{SELU}(W_e^{(3)}\,h^{(2)} + b_e^{(3)}) &\Bigl[\frac{d}{8}\ \text{or}\ \frac{d}{16}\Bigr] \end{aligned}$ The decoder reconstructs the original embedding via layerwise transposed weights (tied weights), generating $\bar{Y} \in \mathbb{R}^d$ .

To ensure both magnitude and directional consistency with the original LLM embedding, LyricsAENet is trained with a composite objective: $L(Y, \bar{Y}) = \alpha_1 \cdot \frac{1}{d}\|Y - \bar{Y}\|_2^2 + \alpha_2 \cdot \left(1 - \frac{Y \cdot \bar{Y}}{\|Y\| \|\bar{Y}\|}\right)$ with $\alpha_1 = 0.5$ and $\alpha_2 = 0.1$ . This loss simultaneously minimizes mean squared reconstruction error and cosine distance, an approach that follows direction-preserving compression literature.

3. Training Regimen and Optimization

LyricsAENet is trained on approximately 74,000 lyric embeddings with the Adam optimizer and a learning rate of $10^{-3}$ . Training uses a batch size of 256 and three encoder/decoder layers with SELU activation throughout. No dropout is applied inside the autoencoder; weight tying and the directional loss act as implicit regularizers. All LLM parameters remain frozen during LyricsAENet training. Convergence is rapid (MSE $<10^{-5}$ in $\sim$ 20 epochs on an NVIDIA A100 GPU). Comparative tests using SiLU and GELU activations found SELU to yield superior training stability and lowest downstream MAE.

4. Multimodal Fusion in HitMusicLyricNet

The compressed lyric code $z^{(\ell)}$ produced by LyricsAENet is concatenated with a compressed audio code $z^{(a)} \in \mathbb{R}^{209/5}$ , high-level audio features $h^{(a)} \in \mathbb{R}^{12}$ , and social metadata vector $m \in \mathbb{R}^3$ . This fusion vector $v$ is input to a three-layer fully connected predictor (with widths $D$ , $D/2$ , $D/3$ ), ReLU activation, and final sigmoid nonlinearity: $\hat{y} = \sigma\left(W_3\ \phi(W_2\ \phi(W_1 v + b_1) + b_2) + b_3\right)$ The prediction head is trained by minimizing the mean squared error between $\hat{y}$ and the target popularity score (scaled to $[0,1]$ ), with a 0.2 dropout rate applied to the fusion layers.

5. Empirical Results and Contribution

Evaluations on the cleaned SpotGenTrack data compare HitMusicLyricNet against versions lacking learned lyric embeddings (HitMusicNet). Inclusion of the LyricsAENet lyric codes reduces test MAE from 0.0862 to 0.0772 (approximately 9% relative gain) and test MSE from 0.0115 to 0.0091 (roughly 20% improvement). Removing LyricsAENet from the modality fusion nearly erases the gains (MAE rises to 0.0852), confirming the principal contribution from the LLM-driven lyric embedding pipeline. Modality ablations reveal that omitting only the lyric code produces a 10.4% increase in test MAE, and single-modality tests indicate lyric embeddings approach the predictive power of low-level audio features ( $\text{MAE}\approx0.12$ ). LIME/SHAP feature importances and associated confidence intervals provide further analysis, though no formal hypothesis testing is reported.

Empirical Comparison Table

Model Variant	Test MAE	Test MSE	Relative MAE Gain
HitMusicNet (no lyrics)	0.0862	0.0115	—
HitMusicLyricNet (with lyrics)	0.0772	0.0091	9%
No LyricsAENet	0.0852	Not stated	—

6. Analysis, Limitations, and Future Directions

The integration of LyricsAENet enables the model to capture deep semantic, syntactic, and sequential patterns in lyrics that handcrafted stylometric features omit. The combination of pretrained LLM representations and directional autoencoding injects rich lyrical signals into the popularity prediction task.

Several limitations are noted:

LLMs employed are general-purpose and lack music-specific pretraining, potentially missing genre or stylistic nuances.
Despite regularization, the high dimensionality of the compressed codes may lead to overfitting risks.
The intrinsic quality of lyric embeddings is validated only via end-to-end downstream performance, not through direct linguistic assessment.
Embedding vectors remain a black box with respect to interpretability.

Suggested future work includes building music-aware LLMs, modeling localized (segment-level) lyric structure, and developing more interpretable and explainable lyric feature extraction schemes (Choudhary et al., 5 Dec 2025). A plausible implication is that domain-specific LLMs or fine-grained lyric context encoding could further advance predictive performance while improving transparency.

PDF Markdown Chat (Pro)

References (1)

Lyrics Matter: Exploiting the Power of Learnt Representations for Music Popularity Prediction (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to HitMusicLyricNet.