I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition (2407.18058v1)

Published 25 Jul 2024 in cs.SD, cs.CL, cs.IR, cs.LG, and eess.AS

Abstract: Music two-tower multimodal systems integrate audio and text modalities into a joint audio-text space, enabling direct comparison between songs and their corresponding labels. These systems enable new approaches for classification and retrieval, leveraging both modalities. Despite the promising results they have shown for zero-shot classification and retrieval tasks, closer inspection of the embeddings is needed. This paper evaluates the inherent zero-shot properties of joint audio-text spaces for the case study of instrument recognition. We present an evaluation and analysis of two-tower systems for zero-shot instrument recognition and a detailed analysis of the properties of the pre-joint and joint embedding spaces. Our findings suggest that audio encoders alone demonstrate good quality, while challenges remain within the text encoder or joint space projection. Specifically, two-tower systems exhibit sensitivity towards specific words, favoring generic prompts over musically informed ones. Despite the large size of textual encoders, they do not yet leverage additional textual context or infer instruments accurately from their descriptions. Lastly, a novel approach for quantifying the semantic meaningfulness of the textual space leveraging an instrument ontology is proposed. This method reveals deficiencies in the systems' understanding of instruments and provides evidence of the need for fine-tuning text encoders on musical data.

Summary

  • The paper evaluates two-tower multimodal systems for zero-shot instrument recognition, finding that audio encoders perform well while the text encoders and joint-space projection lag behind.
  • Crucial findings reveal that text encoders struggle to effectively capture semantic relationships among instruments and leverage contextual information.
  • Future directions suggest freezing audio encoders and focusing on improving text representation through techniques like text augmentation or mapping text to the audio domain.

Evaluation of Two-Tower Multimodal Systems for Instrument Recognition

The paper "I can listen but cannot read: An evaluation of two-tower multimodal systems for instrument recognition" investigates the efficacy of two-tower multimodal systems in zero-shot music classification tasks, specifically focusing on instrument recognition. The research literature has explored the field of Music Information Retrieval (MIR), highlighting its potential applications in tasks like genre, instrument, and emotion recognition. However, the authors recognize the persistent issue of limited annotated datasets and the rigidity of these systems when required to infer beyond predefined classes. Within this context, the application of Zero-Shot Learning (ZSL) emerges as a prospective solution, enabling models to predict new classes absent of labeled examples.

Summary of Objectives and Methodology

The paper centers on evaluating existing two-tower multimodal systems and examining the embedding properties within the audio-text joint space. Alongside the LPMusicCaps-MTT dataset, the study uses instrument classification on the TinySOL dataset (2,913 audio clips across 14 instruments) as a testbed to scrutinize the semantic properties of these models. The models in focus are MusCALL, Music/Speech CLAP, and Music CLAP, each with distinct training data and architectural designs for audio and text encoding within the joint space.
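
Concretely, zero-shot classification in such a joint space reduces to a nearest-prompt search: audio and label prompts are embedded into the shared space and compared by cosine similarity. The following minimal sketch illustrates this; `encode_audio` and `encode_text` are hypothetical stand-ins for a system's audio and text towers, stubbed here with random vectors so the snippet runs on its own.

```python
# Minimal sketch of zero-shot classification in a two-tower joint space.
# The two encoders below are placeholders; substitute the real towers of
# whichever system (e.g., a CLAP variant) is being evaluated.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512  # assumed joint-space dimensionality

def encode_audio(clips: list[str]) -> np.ndarray:
    # Placeholder: a real system returns one joint-space vector per clip.
    return rng.normal(size=(len(clips), EMBED_DIM))

def encode_text(prompts: list[str]) -> np.ndarray:
    # Placeholder: a real system returns one joint-space vector per prompt.
    return rng.normal(size=(len(prompts), EMBED_DIM))

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

labels = ["violin", "flute", "trumpet", "cello"]
prompts = [f"a recording of a {label}" for label in labels]

audio_emb = l2_normalize(encode_audio(["clip_0001.wav", "clip_0002.wav"]))
text_emb = l2_normalize(encode_text(prompts))

# Cosine similarity between every clip and every label prompt;
# the predicted instrument is the most similar prompt.
similarity = audio_emb @ text_emb.T          # shape: (n_clips, n_labels)
predictions = [labels[i] for i in similarity.argmax(axis=1)]
print(predictions)
```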

The evaluation comprises several experiments: assessing the context dependency of prompts, examining embedding distributions, and quantifying semantic meaningfulness with a domain-specific ontology. By dissecting the instances where models fail to leverage context or to distinguish between audio and textual input, the research provides a comprehensive view of the mechanics of two-tower systems.
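
The prompt context-dependency experiment can be pictured as re-running the same classification under templates of increasing specificity and comparing accuracy. The sketch below builds on the previous one (reusing `encode_text`, `l2_normalize`, `audio_emb`, and `labels`); the templates are illustrative, not the paper's exact prompt set.

```python
# Prompt-sensitivity check: wrap the same labels in different templates
# and recompute zero-shot accuracy per template.
templates = [
    "{}",                                  # bare label
    "a {}",                                # generic prompt
    "a recording of a {} being played",    # musically informed prompt
]

def zero_shot_accuracy(audio_emb, true_labels, labels, template):
    text_emb = l2_normalize(encode_text([template.format(l) for l in labels]))
    pred = (audio_emb @ text_emb.T).argmax(axis=1)
    return float(np.mean([labels[p] == t for p, t in zip(pred, true_labels)]))

true_labels = ["violin", "flute"]  # ground truth for the two example clips
for template in templates:
    acc = zero_shot_accuracy(audio_emb, true_labels, labels, template)
    print(f"{template!r}: accuracy = {acc:.2f}")
```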

Key Findings and Implications

Crucial findings reveal that, despite promising zero-shot classification results, the two-tower systems show a notable disparity in how well they map and leverage audio versus text embeddings. The audio encoders of the considered models perform commendably, indicating that the remaining challenges likely reside in the text encoder or in the projection of textual data into the joint embedding space. MusCALL, for instance, scores lowest overall, primarily because of poor text-embedding alignment, despite acceptable results in audio-only scenarios.
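
One common way to check that the audio tower alone is informative (an illustration of the general technique, not necessarily the paper's exact procedure) is to fit a lightweight linear probe on the pre-joint audio embeddings: high probe accuracy combined with poor zero-shot results localizes the problem in the text side or the projection. A self-contained sketch with synthetic stand-in data:

```python
# Probing sketch: train a linear classifier on (stand-in) pre-joint audio
# embeddings. All data below is synthetic; plug in real embeddings and
# instrument labels to run the actual check.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 512))      # stand-in: pre-joint audio embeddings
y = rng.integers(0, 14, size=600)    # stand-in: labels for 14 instruments

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"linear-probe accuracy: {probe.score(X_te, y_te):.2f}")
```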

Furthermore, the examination of how context influences model performance revealed that the CLAP models depend on specific instrument labels rather than using additional textual context effectively. Notably, performance often regressed when models were given musically informed prompts, underscoring their difficulty with the contextual nuances of instrument descriptions. Through the proposed semantic-meaningfulness assessment based on an instrument ontology, the paper identifies significant deficiencies in the text encoders' ability to capture the relationships among instruments and instrument families.
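
The ontology-based check can be pictured as follows: if the text space is semantically meaningful, labels that sit close together in an instrument taxonomy should also have similar text embeddings. The sketch below is a hypothetical illustration of that idea; the toy taxonomy, the random stand-in embeddings, and the use of a Spearman correlation are assumptions, not the paper's exact metric.

```python
# Correlate text-embedding similarity with distance in a (toy) instrument
# taxonomy. A meaningful text space should show a strong NEGATIVE
# correlation: closer in the ontology -> more similar embeddings.
import itertools
import numpy as np
from scipy.stats import spearmanr

# Hypothetical mini-taxonomy: child -> parent.
parent = {
    "violin": "strings", "cello": "strings",
    "flute": "woodwinds", "clarinet": "woodwinds",
    "trumpet": "brass",
    "strings": "instrument", "woodwinds": "instrument", "brass": "instrument",
}

def path_to_root(node):
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b):
    # Hops from a to b through their lowest common ancestor.
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(n for n in pa if n in set(pb))
    return pa.index(common) + pb.index(common)

labels = ["violin", "cello", "flute", "clarinet", "trumpet"]
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(labels), 512))  # stand-in: label text embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

pairs = list(itertools.combinations(range(len(labels)), 2))
cos_sim = [float(emb[i] @ emb[j]) for i, j in pairs]
onto_dist = [tree_distance(labels[i], labels[j]) for i, j in pairs]

rho, _ = spearmanr(cos_sim, onto_dist)
print(f"Spearman correlation: {rho:.2f}")
```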

Future Prospects and Challenges

The evidence derived from this study provides a foundation for future advances in multimodal learning systems for music classification. To overcome the identified limitations, the paper suggests freezing the audio encoders while exploring alternatives such as mapping textual data to the audio domain or fine-tuning methods suited to music-related semantics. The authors highlight text augmentation and context modulation as avenues for improving text representation within two-tower systems. Moreover, constructing specialized music-terminology similarity corpora is identified as an essential step toward better model validation and fine-tuning.

In conclusion, while this research illuminates the challenges that currently obstruct audio-text integration in two-tower multimodal systems, it also lays the groundwork for methods to address them and to make MIR systems more adaptable and accurate at classifying musical instruments, and potentially other musical properties, in the future.
