MuseAgent: Multimodal Music Intelligence
- MuseAgent is an agentic multimodal system that integrates perceptual modules, structured symbolic representations, and an LLM orchestration loop for interactive music understanding.
- It employs measure-wise optical music recognition and time-aligned automatic music transcription to enable precise score analysis and performance feedback.
- The system improves on prior multimodal LLMs by fusing audio, image, and text modalities within iterative, retrieval-augmented workflows for complex music reasoning.
MuseAgent is an agentic multimodal system for grounded music understanding that integrates perceptual front ends, structured symbolic representations, and an LLM-based orchestration loop. It operates at the intersection of multimodal reasoning, optical music recognition (OMR), automatic music transcription (AMT), and retrieval-augmented generation (RAG), enabling fine-grained interactive analysis and synthesis of music scores and performance audio. MuseAgent-1 improves substantially on prior multimodal LLMs by tightly coupling specialist modules with an LLM controller, systematically storing intermediate artifacts, and supporting iterative agentic workflows for complex music reasoning tasks (Zhao et al., 17 Jan 2026).
1. System Architecture and Perceptual Modules
MuseAgent-1 comprises a central LLM controller surrounded by dedicated perceptual modules and a retrieval/memory subsystem.
- Measure-wise OMR (M-OMR): Segmenting each score image into measures, MuseAgent applies a ResNet encoder and LSTM decoder to produce an ABC-notation token sequence for each measure. This supports precise symbolic representation and direct grounding in musical structure.
- AMT + Alignment: The module processes performance audio via CQT spectrograms, followed by CNN–BiLSTM networks to extract time-aligned MusicXML/MIDI note events. A hierarchical HMM aligns transcription outputs to score locations, resulting in JSON artifacts containing expressive parameters, note–measure correspondences, and performance features.
- Music Retrieval: Both explicit and implicit retrieval operations fetch relevant ABC, MusicXML, or MIDI documents. These external symbolic documents augment agent reasoning for multi-turn tasks.
- Memory Bank: Stores intermediate OMR/AMT results, retrieved files, and prior dialogue turns, supporting extended context and enabling multi-step reasoning across modalities.
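The measure-wise segmentation step that feeds M-OMR can be sketched with a naive barline heuristic; this is an illustrative simplification (column-darkness thresholding on a binarized staff strip), not MuseAgent's actual segmentation model, and the `dark_thresh`/`min_width` parameters are assumptions:

```python
import numpy as np

def segment_measures(score_img, dark_thresh=0.5, min_width=20):
    """Split a binarized staff-line strip into measure images at barlines.

    score_img: 2-D array with values in [0, 1], where 1 = ink. A column
    whose mean darkness exceeds dark_thresh is treated as a barline.
    """
    col_darkness = score_img.mean(axis=0)
    is_bar = col_darkness > dark_thresh
    cuts = [0]
    for x in range(1, score_img.shape[1]):
        # register a cut at the left edge of each barline run,
        # skipping spurious cuts narrower than min_width
        if is_bar[x] and not is_bar[x - 1] and x - cuts[-1] >= min_width:
            cuts.append(x)
    cuts.append(score_img.shape[1])
    return [score_img[:, a:b] for a, b in zip(cuts[:-1], cuts[1:])]
```

Each returned crop would then be passed independently through the ResNet–LSTM pipeline to obtain its ABC token sequence.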
The LLM orchestrates module invocation, fusing output embeddings via early fusion and cross-attention mechanisms before generating answers or further plans.
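The orchestration loop can be illustrated with a minimal tool-dispatch sketch. The tool names, plan format, and return payloads below are hypothetical stand-ins for MuseAgent's actual module interfaces; the point is the pattern of executing an LLM-proposed plan step by step while writing each artifact to the memory bank:

```python
# Hedged sketch: tool names and plan format are illustrative, not MuseAgent's API.
TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("m_omr")
def m_omr(args, memory):
    # placeholder: the real module returns per-measure ABC token sequences
    return {"abc": ["|: G2 A2 :|"], "measure": args.get("measure", 1)}

@tool("amt_align")
def amt_align(args, memory):
    # placeholder: the real module returns time-aligned note events
    return {"events": [{"pitch": 67, "onset": 0.0, "measure": 1}]}

def run_agent(plan, memory=None):
    """Execute an LLM-proposed plan: a list of (tool_name, args) steps.
    Each step's artifact is stored in the memory bank for later turns."""
    memory = memory if memory is not None else {}
    for step, (name, args) in enumerate(plan):
        memory[f"step_{step}:{name}"] = TOOLS[name](args, memory)
    return memory
```

Because artifacts persist in `memory`, a later dialogue turn can reuse a prior transcription instead of re-invoking the perceptual module.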
2. Structured Symbolic and Audio Representations
A defining feature is the pipeline for systematic symbolic grounding:
- Score Representation: Each segmented score image yields an ABC token sequence, which is embedded via learned embedding matrices. The resulting score embedding provides a symbol-level representation consumable by the LLM.
- Audio Representation: The AMT front end produces frame-wise onset and activation probabilities, which are binarized and merged to form a set of note events. Alignment via the hierarchical HMM computes mappings between note onsets and score positions, encapsulated in structured JSON with micro-timing and dynamics features.
This approach enables MuseAgent to reason over both static (score) and dynamic (audio) modalities with rich symbolic grounding.
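The binarize-and-merge step for one pitch can be sketched as follows; the hop size, threshold, and single-pitch framing are assumptions for illustration, not the paper's exact post-processing:

```python
def frames_to_events(onset_p, act_p, frame_hop=0.032, thresh=0.5):
    """Merge frame-wise probabilities for one pitch into note events.

    A note starts where the onset probability crosses thresh and lasts
    while the activation stays above thresh. Times are in seconds.
    """
    events = []
    t, n = 0, len(onset_p)
    while t < n:
        if onset_p[t] >= thresh:
            end = t + 1
            while end < n and act_p[end] >= thresh:
                end += 1
            events.append({"onset": t * frame_hop,
                           "offset": end * frame_hop})
            t = end
        else:
            t += 1
    return events
```

The resulting event list is what the H-HMM aligner then maps onto score positions.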
3. Multimodal Fusion and Interactive Reasoning
MuseAgent implements retrieval-augmented generation (RAG) with cross-modal fusion:
- Fusion Pipeline: Score embeddings, audio alignment embeddings, and retrieved document embeddings are concatenated and processed by Transformer layers. Cross-attention blocks allow targeted information exchange; for example, alignment between score and audio features is mediated by learned attention patterns.
- Agentic Planning: The LLM controller plans multi-step tool invocation sequences, dynamically selecting perceptual modules based on user queries and prior memory. This enables workflows such as measure-specific content extraction, performance–score discrepancy analysis, and adaptive module re-use across dialogue turns.
A plausible implication is that the structured fusion pipeline materially improves reasoning fidelity over monolithic vision–LLMs, especially for tasks demanding precise symbolic analysis (Zhao et al., 17 Jan 2026).
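The cross-modal exchange can be sketched as single-head scaled dot-product cross-attention; this is a generic attention computation for illustration (projection matrices and multi-head structure omitted), not the paper's exact fusion layer:

```python
import numpy as np

def cross_attention(q_feats, kv_feats):
    """Single-head scaled dot-product cross-attention (NumPy sketch).

    q_feats:  (n_q, d) queries, e.g. score-token embeddings.
    kv_feats: (n_k, d) keys/values, e.g. aligned audio-event embeddings.
    Returns (n_q, d) audio-informed score features.
    """
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)       # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ kv_feats
```

Each output row is a convex combination of audio-event features, weighted by how strongly the corresponding score token attends to them.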
4. Benchmarking and Quantitative Evaluation
MuseBench provides a rigorous testbed spanning 28 tasks across text, image, and audio modalities:
| Modality | MuseAgent w/ GPT-4.1 | Leading Baseline(s) | Random Baseline |
|---|---|---|---|
| Text (theory) | 86.7 | 85.5 (GPT-4o) | 25.0 |
| Image (sheet) | 74.1 | 68.1 (NotaGPT) | 25.0 |
| Audio (perf.) | 79.1 | 55.9 (GPT-4o) | 50.0 |
Closed-set ABC conversion yields Levenshtein distances of 18.39 for MuseAgent (M-OMR), versus 59.47 (NotaGPT-7B) and 147.47 (LLaVA-13B). Open-set score understanding shows semantic scores of 19.15 (MuseAgent M-OMR), compared to 16.45 (GPT-4o) and 18.37 (Gemini).
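The closed-set metric above is standard Levenshtein (edit) distance between predicted and reference ABC strings, which can be computed with the usual dynamic program:

```python
def levenshtein(a, b):
    """Edit distance between two token strings, e.g. predicted vs.
    reference ABC notation (lower is better)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```

For example, `levenshtein("G2A2|", "G2Bc|")` is 2 (two substitutions), so a reported distance of 18.39 corresponds to roughly eighteen character-level edits per score on average.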
This demonstrates that the agentic architecture with specialist modules and structured grounding leads to stronger performance, particularly in image and audio analysis, where monolithic models suffer from high error rates and limited perceptual fidelity.
5. Qualitative Analysis and Example Workflows
Case studies illustrate the depth of MuseAgent’s reasoning:
- Sheet Music Interpretation: For queries such as identifying passing tones in a given measure, the agent invokes M-OMR on the target image, translates results into ABC tokens, and synthesizes a symbolic analysis; for example, correctly identifying “The D♮ on the second half of beat 2 in measure 12 is a passing tone between C and E♭.”
- Performance-Level Analysis: For tasks like noting omissions or deviations, MuseAgent invokes AMT and score–audio alignment, providing detailed feedback such as “All six notes of the G minor chord at beat 1 of measure 8 are present; however, the staccato on the high G is slightly delayed.”
Qualitative analysis shows that M-OMR sharply reduces errors (time-signature misreads, missing accidentals) relative to vision-only LLMs, and that AMT plus alignment yields micro-timed feedback on expressive nuances that baseline MLLMs cannot provide.
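The timing feedback in the second workflow reduces to comparing aligned performance onsets against score-notated onsets; the field names and tolerance below are assumptions about the alignment JSON, chosen for illustration:

```python
def timing_feedback(aligned_notes, tol=0.05):
    """Flag expressive timing deviations from the notated score.

    aligned_notes: dicts with 'score_onset' and 'perf_onset' in seconds,
    i.e. the note-to-score correspondences the H-HMM alignment produces.
    Notes whose absolute deviation exceeds tol seconds are reported.
    """
    reports = []
    for n in aligned_notes:
        dev = n["perf_onset"] - n["score_onset"]
        if abs(dev) > tol:
            kind = "delayed" if dev > 0 else "rushed"
            reports.append((n.get("pitch"), kind, round(dev, 3)))
    return reports
```

A report such as `("G5", "delayed", 0.08)` is the structured counterpart of the natural-language feedback "the staccato on the high G is slightly delayed."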
6. Comparison with Related Agentic Paradigms
MuseAgent-1 is architecturally analogous to WeaveMuse (Karystinaios, 14 Sep 2025), which employs a manager–specialist multi-agent design for multimodal music understanding, symbolic composition, and audio synthesis. Both systems feature constraint-aware decoding, policy-driven tool orchestration, and memory-based state continuity. Unlike MUSE for productivity tasks (Yang et al., 9 Oct 2025), which centers on experience-driven self-evolution via hierarchical memory, MuseAgent focuses on domain-specific perceptual grounding for music analysis. All three agents embody principles of modularity, memory augmentation, and iterative planning–execution loops, yet MuseAgent’s unique strength lies in its grounding to both structured score representations and time-aligned audio features.
7. Limitations and Future Directions
MuseAgent’s current instantiation relies on supervised segmentation for measure-wise processing and assumes clean score images and audio inputs. Retrieval and memory mechanisms depend on robust indexing and embedding schemes. Memory banks enable multi-turn reasoning but may require context-window management to avoid bloat and latency. Future research directions include expanding module generality (e.g., for jazz or non-Western scores), improving unsupervised alignment across modalities, and integrating more advanced memory consolidation strategies, as seen in broader agentic frameworks (Yang et al., 9 Oct 2025).
A plausible implication is that continued refinement of perceptual modules, symbolic–audio alignment, and agentic orchestration will further bridge the gap between human musical reasoning and autonomous machine understanding.
MuseAgent represents a convergence of agentic orchestration, perceptual grounding, and multimodal symbolic reasoning for interactive music understanding. By fusing measure-wise OMR, time-aligned transcription, and structured memory with LLM-controlled planning, MuseAgent advances both the breadth and fidelity of computational music intelligence (Zhao et al., 17 Jan 2026).