WeMusic-Agent-M1: Agentic Music Recommender

Updated 20 December 2025

The paper introduces a dual-mode framework that alternates between internal token generation and external API calls for precise music recommendations.
It leverages a 32B-parameter Qwen2.5-based Transformer with continued pretraining on 50B music tokens to improve factual accuracy and reduce hallucinations.
Empirical results show significant gains with +12.3% relevance, +18.7% personalization, and +31.5% diversity over baseline models.

WeMusic-Agent-M1 is a 32-billion-parameter, agent-enabled conversational recommender system designed to optimize music recommendation through large-scale internalized domain knowledge and dynamic integration of external music search tools. Built on the Qwen2.5-32B Transformer backbone, it extends the WeMusic-Base dialogue model via continued pretraining on 50 billion tokens of music-centric data and incorporates an “agentic boundary” mechanism that determines—at inference time—when to respond directly versus when to call external APIs. WeMusic-Agent-M1 achieves significant improvements in relevance, personalization, and diversity of recommendations across benchmarks derived from real-world dialog data (Bi et al., 18 Dec 2025).

1. Foundation and Model Architecture

WeMusic-Agent-M1 employs a decoder-only Transformer architecture inherited directly from Qwen2.5-32B. Central to its innovation is a dual-mode operational framework:

Internalized-only mode: The model directly generates tokens using parameterized internal knowledge.
Agentic mode: The model can emit control tokens (e.g., <tool_call>) that trigger the invocation of external music search or recommendation APIs. Successive text generation is conditioned on structured tool responses.

A lightweight gating head, attached to the shared 32B decoder stack, predicts at each decoding step whether to remain in internalized mode or branch into a tool-calling routine. The entire end-to-end process is visually summarized in the originating paper’s architecture diagram (Bi et al., 18 Dec 2025).

2. Knowledge Internalization via Continued Pretraining

The model is specialized toward music via a two-stage continual pretraining process on 50 billion music-domain tokens:

Data Sources: The corpus includes filtered open-web music articles, encyclopedia entries, WeChat Listen community comments, artist interviews, and song/playlist metadata from user logs.
Stage 1: Combines generic and music-specific text. The CPT loss function is a sum:

$L_{\text{CPT}} = L_{\text{Music}} + \lambda L_{\text{General}}$

where

$L_{\text{Music}} = -E_{x\in D_{\text{music}}} \left[ \sum_{t=1}^T w_t \cdot \log p_\theta(x_t|x_{<t}) \right]$

$w_t = \frac{1}{-\log p_{\text{QRM}}(x_t|x_{<t})}$

and $L_{\text{General}}$ regularizes the model via KL divergence with the Base Reference Model (BRM), a snapshot of Qwen2.5-32B before CPT.

Stage 2: Upweights song title and comment tokens to enhance recall and reduce hallucination, thereby strengthening factual accuracy for track and artist entities.

The dual-reference model framework (QRM for quality token weighting, BRM for knowledge stability) is designed to maximize music knowledge coverage while minimizing catastrophic forgetting of general language abilities. A plausible implication is that this architecture could generalize to other domain-specialty LLMs that need tool-calling capabilities without knowledge loss (Bi et al., 18 Dec 2025).

3. Agentic Boundary Learning and Training Procedure

Agentic boundary learning governs the decision process for internal vs. external tool-based recommendations:

Specialized model training: WeMusic-Base-Dist is trained as an internal-only recommender, while M_agent^zero is prepared for tool-calling via GRPO (Generative RL with Policy Optimization).
Trajectory sampling: 50,000 real user queries are processed by M_internal. If the predicted relevance fails to reach a threshold, the queries are flagged as “out-of-knowledge,” and M_agent^zero generates corresponding tool-calling trajectories.
Boundary learning: The model receives mixed batches: “in-knowledge” rollouts (from M_internal) and “tool-call” rollouts (from M_agent^zero), and is trained using curriculum SFT (single-turn followed by multi-turn dialogues) plus controllable RL via GRPO.

Hybrid list-wise rewards for recommendations are computed as:

$R_{\text{hybrid}}^{\text{list}}(H,\mathcal{A}_i) = R_{\text{Format}}(\mathcal{A}_i) + \frac{K}{N_{\max}} \sum_{j=1}^K \left[ I(R_{\text{Format}}(M_{i,j})>0)\cdot R_{\text{Factuality}}(M_{i,j}) \cdot (\text{Norm}(R_{\text{Relevance}}) + \text{Norm}(R_{\text{Personalization}})) \right]$

For agentic tool-calling rollouts, a discount factor $\gamma=0.8$ is applied:

$R_{\text{hybrid}}^{\text{agentic}} = \begin{cases} R_{\text{hybrid}}^{\text{list}}(H,\mathcal{A}_i), & \text{internalic} \ \gamma\, R_{\text{hybrid}}^{\text{list}}(H,\mathcal{A}_i), & \text{agentic} \end{cases}$

This configuration incentivizes maximizing internalized knowledge utility while reserving API invocation for genuine out-of-knowledge requests (Bi et al., 18 Dec 2025).

4. Tool Integration and Control Logic

At inference, WeMusic-Agent-M1 integrates external APIs as structured JSON function calls. Three plugin tool types are registered:

Tool Type	Functional Role	Invocation Trigger
Precise music search API	Exact title/artist lookup	`<tool_call>` upon low recall
Fuzzy search API	Contextual/genre/emotion-driven retrieval	As above
General web search	Long-tail, emerging or unindexed content	As above

The model’s gating head produces a binary softmax at each decoding step, dictating whether text generation will invoke a tool. On tool call, generation pauses for the API’s response (structured top- $k$ song IDs and metadata), then resumes conditioned on this input (Bi et al., 18 Dec 2025).

5. Benchmarking: WeMusic-Bench Design and Evaluation Metrics

Comprehensive evaluation uses the WeMusic-Bench, constructed from WeChat Listen logs:

Scope: 1,300 multi-turn dialog queries in Chinese; 140,000+ tracks; queries spanning genre, mood, language, era, etc.
Query complexity: Single-constraint, bi-constraint, and multi-constraint (e.g., “a fast Japanese punk song from the 1990s with female vocals”).
Metrics:
- Relevance ( $R$ ): Judged by DeepSeek-V3; per-song and text triple scores ( $S_1$ , $S_2$ , $S_3$ ).
$R_{\text{Relevance}} = \lambda_1 S_1 + \lambda_2 S_2 + \lambda_3 S_3$ - Personalization: Production-grade ranking for alignment with past user preferences. - Diversity: Semantic difference among $K = 5$ completions per dialog,

$\text{Diversity} = \frac{1}{L} \sum_{i=1}^L \left[ \frac{\sum_{p<q} d_{\text{sem}}(A_{i,p}, A_{i,q})}{C(K, 2)} \right]$

where $d_{\text{sem}}=1$ if both responses pass factuality and relevance checks and suggest different songs.

Results:

Hit@5: 0.93 (WeMusic-Agent-M1), outperforming WeMusic-Base-Dist (0.85) and Gemini-2.5-Pro (0.62)
Relevance: 0.77 (vs. 0.68 and 0.47)
Personalization: 0.74 (vs. 0.71 and 0.54)
Diversity: 0.667, over double that of Qwen2.5-32B-Instruct and other LLMs

Absolute improvements over WeMusic-Base-Dist are $+12.3\%$ relevance, $+18.7\%$ personalization, $+31.5\%$ diversity (Bi et al., 18 Dec 2025).

6. Empirical Findings, Limitations, and Future Directions

Key empirical findings include:

Knowledge internalization using token-soft-scored CPT outperforms standard next-token approaches and hard filtered RHO-1 on MusicSimpleQA (+4–5% entity recall).
The BRM component is crucial to avoid catastrophic forgetting of general language and QA capability.
Agentic boundary learning robustly expands coverage; both WeMusic-Agent-Zero and M1 improve Hit@5 and relevance, especially on previously out-of-knowledge queries.
Controllable RL tuning enables rapid convergence, with tool call rates stabilizing around 25%, indicating efficient boundary formation.

Limitations are primarily the need for frequently updated tool indexes to capture niche or new releases. Production deployment bottlenecks include synchronous API latency, with ongoing work aimed at asynchronous and cached architectures. Extension to multimodal asset recommendation (including cover art or rich music embeddings) is flagged as immediate follow-up (Bi et al., 18 Dec 2025).

7. Relationship to Prior Agentic Music Systems

WeMusic-Agent-M1’s design draws conceptual lineage from earlier LLM-powered task orchestration frameworks in music, such as MusicAgent (Yu et al., 2023), which introduced autonomous workflows for music understanding and tool coordination but did not combine internalized expert knowledge with selective tool invocation. Key advances in WeMusic-Agent-M1 relative to its forebears are:

Deeply specialized music pretraining at massive scale (50B tokens).
Learned, RL-finetuned agentic boundary gating for context-sensitive tool calling.
Comprehensive, real-world dialog benchmark (WeMusic-Bench) tailored for conversational music recommendation.
Demonstrated state-of-the-art metrics across relevance, personalization, and diversity in Chinese-language conversational settings.

This suggests that hybrid, boundary-aware agentic LLMs can deliver step-change improvements for domain-specific CRS tasks, justifying further investment in both domain-adaptive pretraining and fine-grained agentic control protocols (Bi et al., 18 Dec 2025, Yu et al., 2023).