
MIRFLEX: Modular Music Feature Extraction

Updated 30 June 2025
  • MIRFLEX is an extensible, modular library for music information retrieval that standardizes the extraction of diverse audio features.
  • Its modular design allows integration of independent, state-of-the-art extraction models, enabling reproducible benchmarking and evaluation.
  • The toolkit supports applications like generative music, recommendation, and captioning by providing robust musical and audio attribute annotations.

MIRFLEX is an extensible, modular library designed for music information retrieval (MIR) research, providing a unified toolkit for extracting a diverse range of concrete musical features from audio signals. Developed to address the need for standardized, high-quality feature extraction in both MIR benchmarking and downstream applications, MIRFLEX integrates a suite of state-of-the-art and best-available open-source models to deliver musical and audio attribute annotations used in applications such as music generation, recommendation, playlisting, and multi-modal captioning.

1. System Architecture and Extensibility

MIRFLEX is architected as a modular system composed of discrete, independently implemented feature extraction modules. Each module encapsulates the complete workflow—input pre-processing, model inference (including any neural network weights or rule-based logic), and post-processing—required for a particular MIR task. This design principle supports:

  • Ease of Extension: Developers can integrate new or improved feature extraction models by adding modules without altering the core system, facilitated by an open call for contributions.
  • Standardization: All features are exported in standardized formats (as either latent neural representations or post-processed categorical/continuous labels), ensuring interoperability and seamless downstream consumption.
  • Benchmarking Utility: Modules can be evaluated or replaced independently, supporting robust scientific comparison across MIR methods.

A simplified data flow diagram is as follows:

```
 [Input Audio File]
          |
[Feature Extraction Modules]
  |   |   |   |   |
Key Chord Tempo Vocals Instrument
  \   |   |   |   /
   [Central Feature Repository]
          |
  [Export/Integration Layer]
          |
     [MIR Applications]
```
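
The paper does not prescribe a specific programming interface, so the following is a minimal sketch of how such a modular pipeline could be organized. The `FeatureExtractor` base class, the `KeyDetector` placeholder, and the dictionary-based export format are hypothetical, chosen only to illustrate independent modules feeding a central repository in a standardized form.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class FeatureExtractor(ABC):
    """Hypothetical base class: each module owns the pre-processing,
    model inference, and post-processing for one MIR task."""

    name: str = "feature"

    @abstractmethod
    def extract(self, audio_path: str) -> Dict[str, Any]:
        """Return a standardized record: categorical/continuous labels
        and, optionally, latent representations."""


class KeyDetector(FeatureExtractor):
    name = "key"

    def extract(self, audio_path: str) -> Dict[str, Any]:
        # Placeholder logic; a real module would load the audio, run the
        # key-detection CNN, and post-process its logits into a label.
        return {"label": "C major", "confidence": 0.72}


def run_pipeline(audio_path: str, modules: List[FeatureExtractor]) -> Dict[str, Dict[str, Any]]:
    """Central-repository step: collect each module's standardized output."""
    return {m.name: m.extract(audio_path) for m in modules}


print(run_pipeline("song.wav", [KeyDetector()]))
# {'key': {'label': 'C major', 'confidence': 0.72}}
```

Because every module returns the same kind of standardized record, new extractors can be added or swapped without touching the rest of the pipeline, which is the property the architecture above is designed to preserve.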

2. Extracted Features and Underlying Models

MIRFLEX covers a comprehensive set of musical and audio features, with each supported by a state-of-the-art extraction model:

| Feature | Integrated Model | Key Metrics |
|---|---|---|
| Key Detection | CNN with directional filters [Schreiber & Müller] | 67.9% accuracy (GiantSteps) |
| Chord Detection | Bidirectional Transformer [Park et al., 2019] | WCSR 83.9 |
| Downbeat/Tempo | BeatNet (CRNN + particle filtering) [Heydari et al.] | F1 80.64% (GTZAN) |
| Vocals/Gender | EfficientNet (Essentia library) | – |
| Instrument | CNN-based classifier (Essentia) | F1 up to 0.98 |
| Mood/Theme/Genre | Frequency-aware CNN (Essentia) | Mood acc. 15.46% |

A typical extraction workflow involves transforming audio to a time-frequency representation—such as the Constant-Q Transform (CQT):

$$
X_{\mathrm{CQT}}(t, f) = \sum_{n=0}^{N-1} x(n)\, w(n - t)\, e^{-j 2\pi f n / f_s}
$$

where $x(n)$ is the discrete audio signal, $w(\cdot)$ is a window function, and $f_s$ is the audio sampling rate.

Neural models process such representations to yield probabilities or labels, e.g., identifying the predominant key, tracking chord changes, extracting instrument classes, or detecting vocal/gender presence.
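
As an illustration of this front end (not code from the MIRFLEX repository), a constant-Q representation can be computed with librosa; the file name and parameter choices below are assumptions.

```python
import librosa
import numpy as np

# Assumed example file; any mono audio file works here.
y, sr = librosa.load("song.wav", sr=22050, mono=True)

# Constant-Q transform: 84 bins spanning 7 octaves at 12 bins per octave.
C = librosa.cqt(y, sr=sr, hop_length=512, n_bins=84, bins_per_octave=12)

# Log-magnitude spectrograms like this are typical inputs to key and chord classifiers.
C_db = librosa.amplitude_to_db(np.abs(C), ref=np.max)
print(C_db.shape)  # (n_bins, n_frames)
```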

3. Applications in Research and Industry

MIRFLEX’s extracted features form the basis for numerous downstream applications across MIR and music technology:

  • Generative Music: Latent or symbolic features (key, chords, structure) provide conditioning signals for symbolic or neural music generation (e.g., controlling harmonization or genre style).
  • Recommendation and Personalization: Playlist and station algorithms can leverage descriptors such as instrument presence, mood, and genre for personalized user experiences.
  • Music Information Retrieval and Tagging: Features support automated tagging, similarity search, structural decomposition, and musicological analyses.
  • Music Captioning and Multimodal AI: Integrated in systems like SonicVerse (2506.15154), MIRFLEX enables detailed, feature-aware music captioning by providing ground-truth auxiliary targets for multi-task models.

A plausible implication is that MIRFLEX's modular design expedites prototyping and facilitates reproducible benchmarking by standardizing the feature extraction pipeline.

4. Benchmarking, Standardization, and Model Evaluation

The library is explicitly designed for benchmarking, offering standardized extraction and allowing the research community to compare or replace models systematically. Key reported metrics from integrated models include:

  • Key Detection Accuracy: 67.9% (GiantSteps dataset)
  • Chord Detection WCSR: 83.9
  • Instrument F1 Score: up to 0.98 (Philharmonica+UIowa)
  • Mood Detection Accuracy: 15.46% (Frequency-aware CNN)
  • Downbeat Detection F-measure: 80.64% (GTZAN)

Researchers can substitute modules, apply models to shared datasets, and report results using the same feature set, promoting objective and reproducible MIR advancements.
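
A minimal sketch of what such module-level comparison could look like is shown below, reusing the hypothetical `FeatureExtractor` interface from Section 1; the dataset pairs and the accuracy metric are stand-ins, not MIRFLEX APIs, and task-specific metrics such as WCSR or F1 would replace plain accuracy in practice.

```python
def benchmark(module, dataset):
    """Score one feature extraction module against ground-truth labels.

    `dataset` is assumed to yield (audio_path, ground_truth_label) pairs.
    """
    correct = 0
    total = 0
    for audio_path, truth in dataset:
        prediction = module.extract(audio_path)["label"]
        correct += int(prediction == truth)
        total += 1
    return correct / max(total, 1)


# Swapping implementations of the same feature while keeping the rest of the
# pipeline fixed keeps reported numbers directly comparable, e.g. (hypothetical):
# for module in [KeyDetectorV1(), KeyDetectorV2()]:
#     print(module.name, benchmark(module, giantsteps_pairs))
```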

5. Integration in Multi-Task Music AI Systems

MIRFLEX plays a central role in contemporary multi-task and multi-modal AI pipelines. In the SonicVerse system (2506.15154), for example:

  • MIRFLEX is used to annotate audio datasets with concrete, multifaceted music features (key, chord, instrument, genre, mood, vocals, etc.), producing triplet data: (audio, MIRFLEX features, text caption).
  • During model training, auxiliary heads are tasked with predicting MIRFLEX features, and their outputs are projected as language tokens fed into an LLM (e.g., Mistral-7B).
  • This architecture enhances the technical informativeness of generated music captions and improves alignment between audio, technical attributes, and text.

Empirical results in SonicVerse indicate that adding MIRFLEX-derived features yields improvements in both natural-language metrics (e.g., BLEU, ROUGE, BERTScore) and music-specific accuracy (matching ground-truth features in the generated captions).
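
The exact SonicVerse architecture is specified in the cited paper; the PyTorch sketch below only illustrates the general pattern of an auxiliary head that predicts one MIRFLEX feature and projects its output into the language model's embedding space. All layer sizes, class counts, and names are assumptions.

```python
import torch
import torch.nn as nn


class FeatureHead(nn.Module):
    """Auxiliary head: predicts one MIRFLEX feature (e.g., a key class)
    from a shared audio embedding and projects it to LLM token space."""

    def __init__(self, audio_dim=512, n_classes=24, llm_dim=4096):
        super().__init__()
        self.classifier = nn.Linear(audio_dim, n_classes)  # feature prediction
        self.to_llm = nn.Linear(n_classes, llm_dim)         # projection to token embedding

    def forward(self, audio_emb):
        logits = self.classifier(audio_emb)          # supervised against MIRFLEX labels
        token = self.to_llm(logits.softmax(dim=-1))  # pseudo-token prepended to the LLM input
        return logits, token


head = FeatureHead()
audio_emb = torch.randn(1, 512)       # stand-in for an audio encoder output
logits, token = head(audio_emb)
print(logits.shape, token.shape)      # torch.Size([1, 24]) torch.Size([1, 4096])
```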

6. Open Source Collaboration and Future Directions

The MIRFLEX project is open-source and actively seeks contributions for new or improved feature extractors. Future directions highlighted by the authors include:

  • Broader Feature Coverage: Expansion to novel and more nuanced musical attributes.
  • Community Collaboration: A collaborative, continually evolving ecosystem for MIR feature extraction.
  • Enhanced Benchmarking: Ongoing integration and comparative evaluation of new methodologies.
  • Support for Emerging Applications: Facilitating research in music captioning, music-to-text, multimedia generation, and music understanding.

A plausible implication is that as MIRFLEX extends both its scope and model accuracy, it will play a foundational role in the next generation of music technology research, providing high-quality, reproducible, and extensible feature extraction for the global MIR community.

7. Summary and Impact

MIRFLEX represents a significant step toward unifying music feature extraction and benchmarking. Its modular, open, and extensible design permits rapid integration of state-of-the-art models and ensures robust interoperability for a broad set of MIR and music technology tasks. By serving as both a benchmarking standard and an enabler for advanced applications—from generative music to multi-modal captioning—MIRFLEX addresses key challenges in reproducibility, feature coverage, and community collaboration in music information retrieval research.

Open-Source Repository: https://github.com/AMAAI-Lab/megamusicaps
