ArtistMus: Artist-Centric Music Informatics
- ArtistMus is a framework comprising benchmarks, datasets, and models for artist-centric music retrieval and cross-modal audio-visual generation.
- It leverages retrieval-augmented QA to improve factual accuracy, and metadata-supervised embeddings and graph-based similarity to strengthen artist retrieval and recommendation.
- The framework also raises ethical and legal considerations around artist-signature inducibility in generative audio models and copyright governance.
In contemporary music informatics literature, ArtistMus denotes a constellation of benchmarks, models, and methodologies centered on artist-centric music information retrieval (MIR), cross-modal generation (notably visual art-to-music), and retrieval-augmented LLM evaluation. The name appears in leading resources for music question answering, cross-modal generation, and the construction and evaluation of artist-level MIR datasets. The following sections synthesize the technical architecture, dataset construction, evaluation regimes, and broader implications of notable “ArtistMus” frameworks and datasets, referencing the most recent research outputs.
1. ArtistMus for Retrieval-Augmented Music Question Answering
The "ArtistMus" benchmark (Kwon et al., 5 Dec 2025) defines a globally diverse, artist-centered evaluation resource for retrieval-augmented music question answering (MQA). Its core objective is to address factual and contextual deficits in LLMs regarding music artists, leveraging a purpose-built retrieval corpus (MusWikiDB) and a carefully constructed QA suite for systematic evaluation and improvement of retrieval-augmented generation (RAG) in music.
- MusWikiDB Corpus: Comprises 3.2 million passages from 144,389 music-related Wikipedia pages, focusing on artists, genres, instruments, and related music topics, with hierarchical topic labels (e.g., biography, career, discography, artistry, collaborations).
- ArtistMus Benchmark: Contains 1,000 multiple-choice questions (500 factual, 500 contextual) spanning 500 artists with metadata such as country, debut year, and genre. Diverse geographic (163 regions) and genre/career representation is achieved via stratified sampling.
- RAG Pipeline: Sparse BM25 retrieval (optionally reranked with a BGE bi-encoder) supplies passages that are passed to the LLM as context at inference time. Dense retrieval (Contriever) underperforms sparse BM25 for entity-centric music QA. A minimal retrieve-then-generate sketch appears after this list.
- Results: RAG enhances factual accuracy for open-source LLMs by up to +56.8 percentage points (Qwen 3 8B: 35.0% → 91.8%), achieving parity with proprietary APIs (GPT-4o/Gemini factual ≈92%). Contextual reasoning likewise improves.
- Limitations: Restriction to multiple-choice templates, Wikipedia-sourced coverage, and a single-pass crawl introduce coverage and reasoning limits. Extensions under consideration include open-ended QA, richer metadata sources, and continual updating.
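The retrieval-then-generation pipeline described above can be illustrated with a minimal sketch, assuming the rank_bm25 package; the toy corpus, query handling, and ask_llm helper are hypothetical placeholders rather than the benchmark's actual MusWikiDB index or multiple-choice prompting templates.

```python
# Minimal retrieve-then-generate sketch for entity-centric music QA.
# The toy corpus and `ask_llm` helper are hypothetical stand-ins; the benchmark
# itself retrieves from MusWikiDB and evaluates with multiple-choice prompts.
from rank_bm25 import BM25Okapi

corpus = [
    "Bjork debuted with the Sugarcubes before releasing her 1993 solo album Debut.",
    "Fela Kuti pioneered Afrobeat, blending highlife, jazz, and funk.",
    "Ryuichi Sakamoto co-founded Yellow Magic Orchestra in 1978.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def ask_llm(prompt: str) -> str:
    # Placeholder: swap in an open-source model (e.g., Qwen 3 8B) or an API call.
    raise NotImplementedError

def answer(question: str, k: int = 2) -> str:
    # 1) Sparse retrieval: BM25-score every passage against the question tokens.
    passages = bm25.get_top_n(question.lower().split(), corpus, n=k)
    # (An optional reranking step, e.g., with a BGE model, could reorder `passages`.)
    # 2) Context-passing: prepend the retrieved passages to the prompt.
    prompt = "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}\nAnswer:"
    # 3) Generation: the LLM answers conditioned on the retrieved context.
    return ask_llm(prompt)
```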
2. Cross-Modal Music Generation: Art2Music as “ArtistMus”
Art2Music (“ArtistMus” in this context) (Hong et al., 27 Nov 2025) is a multimodal framework for aligning perceptual feeling across visual art, text, and music; it eschews explicit emotion labels in favor of semantic matching and cross-modal fusion.
- Dataset (ArtiCaps): Constructed via semantic alignment of 80k paintings (ArtEmis) and 4.7k 10-second music clips (MusicCaps), using cosine similarity between TinyBERT-encoded art commentaries and musician-written audio captions. Composite prompts fuse high-emotion lexicon extractions from both modalities, and overall sentiment distributions are balanced (≈50/15/35% positive/neutral/negative).
- Model Architecture:
- Encoding: OpenCLIP ViT-H/14 for images and its matching text encoder for prompts; feature vectors are concatenated and fused by a gated residual projector that adaptively weights the modalities.
- Decoder: A 4-layer bidirectional LSTM with positional embeddings projects the fused latent to high-resolution Mel-spectrograms (896 frames × 80 bins).
- Loss: A frequency-weighted L1 loss accentuates high-frequency spectral fidelity; per-bin weights scale linearly from 1.0 (lowest frequency) to 1.5 (highest frequency). A PyTorch sketch of this loss appears after this section's list.
- Vocoder: A fine-tuned HiFi-GAN reconstructs high-quality audio from the Mel features.
- Evaluation: Uses Fréchet Audio Distance (FAD) for perceptual naturalness, Mel-Cepstral Distortion (MCD), Log-Spectral Distance (LSD), and cosine similarity for feeling-alignment. Ablation highlights the necessity of both image and text inputs for optimal spectral and semantic fidelity.
- Human Study: A Gemini-based LLM study provides structured rationales and alignment scores (0–10) for cross-modal matches; it observes strong alignment overall but also highlights the limited granularity of emotional fit.
- Scalability & Applications: Art2Music achieves robust alignment with only 50k training samples and is resource-efficient (single consumer GPU). Its use cases include automated soundtrack generation for galleries, personalized ambient soundscapes, and on-the-fly AR music experiences.
3. Artist-Centric Representation Learning and Similarity
A foundational concept across MIR is the use of artist identity and metadata for representation learning, retrieval, and classification.
- Metadata-Supervised Embeddings: Early work (Park et al., 2017, Lee et al., 2019) demonstrates that deep feature learning supervised by artist labels (alone or in joint models incorporating albums and tracks) captures broad stylistic and production signatures more robustly than supervision from expensive semantic labels (e.g., genre or mood), and yields features transferable to genre classification and content-based retrieval.
- Graph-Based Similarity: GATSY (Francesco et al., 2023) reframes artist similarity as graph attention over editorial or user-provided adjacency, producing clustered embeddings that remain robust to weak input features. Triplet/contrastive losses over the graph enable both unsupervised and genre-supervised retrieval and support flexible, heterogeneous artist networks; a minimal triplet-loss sketch appears after this list.
- Artist Playlists and Recommendation: Replacing track IDs with artist IDs in sequence models improves playlist title generation, mitigating data sparsity and yielding better title quality and diversity (Kim et al., 2023). Artist-aware graph methods, such as the random-walk-with-restart paradigm in CAMA (Agnihotri et al., 2019), enable personalized, artist-centric music recommendations, which are especially beneficial to users with focused artist affinities.
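As a concrete reading of the triplet objective mentioned for GATSY, the sketch below trains an embedding so that graph-adjacent artists land closer together than non-adjacent ones; the two-layer MLP and random features are illustrative stand-ins for a graph-attention encoder, not GATSY's actual architecture.

```python
# Triplet loss over artist embeddings: anchor/positive are adjacent artists in
# the similarity graph, negative is a non-adjacent artist. The MLP encoder and
# random features are illustrative stand-ins, not GATSY's implementation.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
triplet = nn.TripletMarginLoss(margin=1.0)

anchor_x, positive_x, negative_x = (torch.randn(16, 128) for _ in range(3))
loss = triplet(encoder(anchor_x), encoder(positive_x), encoder(negative_x))
loss.backward()  # pulls similar artists together, pushes dissimilar ones apart
```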
4. Artist-Signature Inducibility in Generative Audio Models
Recent research demonstrates that, even without explicit artist naming, text-to-audio (TTA) models can be steered to produce music that closely resembles known artists' sonic signatures (Coelho, 21 Nov 2025).
- Latent Space Navigation: TTA models (e.g., Udio, Suno) embed prompt text and audio into a shared latent space, with artist-conditioned microregions corresponding to identifiable musical styles. By curating “descriptor constellations” from public taxonomies (e.g., genres, moods, and production techniques from Bandcamp and RateYourMusic), prompt engineers can locate these microregions and reproducibly elicit outputs whose timbre, structure, or mood correlates with target artists.
- Auditing and Ethics: An explicit protocol constructs and permutes descriptor prompts, generates multiple outputs, extracts features (e.g., Mel-spectrogram comparisons), and classifies outputs by degree of proximity (sonic fingerprinting, compositional resonance, aesthetic aura); a spectrogram-comparison sketch appears after this list. Empirical findings show that artist-conditioned outputs remain locally stable under varied prompt orderings, substantiating the presence of the targeted artists' material in training data.
- Legal/Governance Implications: The capacity to reproduce proprietary “sonic signatures” without explicit consent foregrounds the need for dataset disclosure, output provenance, and royalty frameworks. Policy recommendations include mandatory artist-level dataset registries and attribution metadata for generated content.
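One way to realize the feature-extraction step of the audit protocol is to compare time-averaged Mel-spectrogram profiles between a generated clip and a reference recording, as in the sketch below; the file paths, librosa settings, and cosine-similarity summary are illustrative choices, not the paper's exact procedure.

```python
# Sketch of the audit's feature-extraction step: compare a generated clip and a
# reference recording via time-averaged Mel-spectrogram profiles. File paths,
# settings, and the cosine summary are illustrative assumptions.
import librosa
import numpy as np

def mel_profile(path: str, sr: int = 22050, n_mels: int = 80) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max).mean(axis=1)  # per-bin profile

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

similarity = cosine(mel_profile("generated_clip.wav"), mel_profile("reference_track.wav"))
# Consistently high similarity across prompt permutations suggests a stable
# artist-conditioned region of the model's latent space.
```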
5. Multimodal Dataset Resources and Genre Classification
ArtistMus-style evaluation is further supported by large-scale, multimodal datasets designed for diverse artist- and album-level MIR tasks.
- Music4All A+A (Geiger et al., 18 Sep 2025): Links 6,741 artists and 19,511 albums with high-dimensional image, text, and interaction data. Genre classification experiments show that visual representations (album covers, artist photos) provide stronger signals than text for artist- and album-level genres, and that state-of-the-art multimodal fusion architectures effective in other domains (e.g., movies) may not transfer directly; SBNet's modality-invariant design achieves the highest sample-averaged F1 (29–33%), outperforming both image-only and text-only baselines (the metric is illustrated after this list).
- Evaluation Paradigms: Missing-modality scenarios show which architectural choices remain resilient to incomplete data, and cross-domain comparisons highlight the greater label complexity and ambiguity of music genres at the artist level.
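For reference, sample-averaged F1 in the multilabel genre setting treats each artist or album as a sample, computes F1 over its predicted versus true label set, and averages across samples; the scikit-learn sketch below uses hypothetical toy labels, not Music4All A+A data.

```python
# Sample-averaged F1 for multilabel genre classification.
# The toy multi-hot matrices are hypothetical, not Music4All A+A data.
import numpy as np
from sklearn.metrics import f1_score

# rows: artists/albums, columns: genre labels (multi-hot)
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 1],
                   [1, 1, 0, 0]])

print(f1_score(y_true, y_pred, average="samples"))  # F1 per sample, then averaged
```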
6. ArtistMus in Cross-Modal Generation and Artistic Trends
The relationship between artists (as both creators and data entities) and algorithmic music generation appears in cross-modal and artistic trend studies.
- Art2Mus and Art2Music: Both employ visual encoders (ImageBind, OpenCLIP) to steer diffusion models or sequence decoders for art-to-music synthesis, revealing substantial challenges in semantic alignment between complex artworks and musical composition (Rinaldi et al., 7 Oct 2024, Hong et al., 27 Nov 2025). Image-only guidance lags behind text conditioning for producing stylistically supportive music, owing to the abstractness of visual art and the limited direct mapping from visual to musical domains.
- AI in Artistic Practice: Recent surveys (Pons et al., 12 Aug 2025) catalog AI’s wide uptake in music co-composition, sound design, and live or installation settings, with artist identity foundational to model supervision, hybrid human–AI practices, and recommendation systems. Notions of creative agency, style transfer, and commodification of artist signatures are especially prominent in the context of TTA models and interactive creative platforms.
7. Limitations, Ethical Considerations, and Future Directions
- Coverage Gaps: Current “ArtistMus” benchmarks are bounded by their Wikipedia/Last.fm-sourced metadata and are therefore less effective for emerging, niche, or under-documented artists. Multiple-choice QA formats narrow the range of reasoning that can be assessed.
- Attribution and Consent: The latent inducibility of artist signatures in TTA models without consent challenges current copyright frameworks and calls for new governance models—policy shifts, output credentialing, and dataset provenance standards.
- Technological Expansion: Future directions include open-ended QA, integration of richer and more dynamic metadata (AllMusic, MusicBrainz), multi-hop reasoning, and continual corpus updating to accommodate evolving artist landscapes.
ArtistMus, whether as benchmark, dataset, or generative pipeline, encapsulates the current state of artist-centric research in music information retrieval and creative AI. It integrates scalable retrieval, robust representation learning, and cross-modal generation, with pronounced attention to the ethical and legal challenges that arise from the algorithmic mediation and reconstitution of artist identity across modalities and systems (Hong et al., 27 Nov 2025, Kwon et al., 5 Dec 2025, Coelho, 21 Nov 2025, Geiger et al., 18 Sep 2025, Rinaldi et al., 7 Oct 2024, Kim et al., 2023, Francesco et al., 2023, Lee et al., 2019, Park et al., 2017, Agnihotri et al., 2019, Arakelyan et al., 2018).