
MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos (1606.06259v2)

Published 20 Jun 2016 in cs.CL and cs.MM

Abstract: People are sharing their opinions, stories and reviews through online video sharing websites every day. Studying sentiment and subjectivity in these opinion videos is experiencing a growing attention from academia and industry. While sentiment analysis has been successful for text, it is an understudied research question for videos and multimedia content. The biggest setbacks for studies in this direction are lack of a proper dataset, methodology, baselines and statistical analysis of how information from different modality sources relate to each other. This paper introduces to the scientific community the first opinion-level annotated corpus of sentiment and subjectivity analysis in online videos called Multimodal Opinion-level Sentiment Intensity dataset (MOSI). The dataset is rigorously annotated with labels for subjectivity, sentiment intensity, per-frame and per-opinion annotated visual features, and per-milliseconds annotated audio features. Furthermore, we present baselines for future studies in this direction as well as a new multimodal fusion approach that jointly models spoken words and visual gestures.

An Overview of MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos

The paper "MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos" addresses the understudied area of sentiment analysis within the field of multimedia content, specifically focusing on online video platforms like YouTube. Traditional sentiment analysis has largely been centered around text, but this work extends the frontier by introducing a comprehensive, multimodal dataset that integrates audio, visual, and textual elements for a nuanced examination of opinion videos.

Key Contributions

  1. Dataset Introduction: The authors introduce the Multimodal Opinion-level Sentiment Intensity (MOSI) dataset, a carefully annotated corpus for sentiment and subjectivity analysis. It contains 2199 opinion segments drawn from 93 videos and provides not only sentiment intensity annotations but also per-frame and per-opinion visual features and per-millisecond audio features. The dataset fills a significant gap: the lack of an appropriate corpus for multimodal sentiment analysis. (A sketch of how such a segment record might be structured appears after this list.)
  2. Challenges Identified in Video Sentiment Analysis:
    • High-tempo Dynamics: Video bloggers frequently shift topics and sentiments, complicating the identification of distinct opinions within a single utterance.
    • Subtle Sentiment Intensity: Beyond simple polarity classification, the dataset enables estimation of nuanced sentiment strengths, reflecting more complex affective states.
    • Multimodal Information Integration: The research underscores the necessity of moving beyond sole reliance on text to incorporate non-verbal cues like facial expressions and gestures, which provide critical context for sentiment inference.
  3. Methodology: The authors describe a careful annotation process that aligns the textual, audio, and visual streams. They employed Amazon Mechanical Turk for sentiment intensity annotation and developed mechanisms for precise segmentation of subjective content, achieving a Krippendorff's alpha of 0.68, indicating substantial inter-annotator agreement (a minimal computation sketch follows this list).
  4. Multimodal Analysis: Employing the dataset, the authors conduct experiments identifying various interaction patterns between spoken language and visual gestures. Their approach reveals four distinctive types of interactions: neutral, emphasizer, positive, and negative patterns, suggesting that sentiment interpretation greatly benefits from these multimodal signals.
  5. Baseline Models: The paper provides baselines using SVR and DNN models on verbal, visual, and combined features, culminating in a multimodal dictionary approach that jointly models spoken words and facial gestures. This method outperformed plain feature concatenation in predicting sentiment intensity, achieving a mean absolute error of 1.1 and a correlation of 0.53 (a hedged baseline sketch appears after this list).
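
To make the multi-level annotation in item 1 concrete, here is a minimal sketch of how one MOSI-style opinion segment might be represented in code. The field names, shapes, and values are hypothetical illustrations, not the dataset's actual schema or distribution format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OpinionSegment:
    """One annotated opinion segment in a MOSI-style corpus (hypothetical schema)."""
    video_id: str                 # which source video the segment comes from
    start_sec: float              # segment boundaries inside the video
    end_sec: float
    transcript: str               # spoken words for the segment
    sentiment_intensity: float    # annotated intensity label for the whole segment
    visual_features: np.ndarray   # per-frame visual descriptors, shape (n_frames, d_v)
    audio_features: np.ndarray    # finely sampled acoustic descriptors, shape (n_steps, d_a)

# Toy example with placeholder values (not real MOSI data):
seg = OpinionSegment(
    video_id="video_001",
    start_sec=12.4,
    end_sec=17.9,
    transcript="I really loved the first half of the movie",
    sentiment_intensity=2.0,
    visual_features=np.zeros((165, 40)),
    audio_features=np.zeros((5500, 13)),
)
print(seg.video_id, seg.sentiment_intensity)
```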
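
The agreement figure cited in item 3 is Krippendorff's alpha for interval-scaled ratings. The following is a minimal, self-contained sketch of that computation for interval data; the toy ratings matrix is invented for illustration and is not the MOSI annotation data.

```python
import numpy as np

def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for interval data.

    ratings: 2D array of shape (annotators, items); np.nan marks a missing rating.
    Returns 1 - observed/expected disagreement (a shared 1/n factor cancels).
    """
    ratings = np.asarray(ratings, dtype=float)
    # Keep only items rated by at least two annotators (pairable values).
    units = [col[~np.isnan(col)] for col in ratings.T]
    units = [u for u in units if len(u) >= 2]

    # Observed disagreement: squared differences within each item,
    # each item weighted by 1 / (m_u - 1), with m_u ratings for that item.
    obs_sum = 0.0
    for u in units:
        m = len(u)
        obs_sum += ((u[:, None] - u[None, :]) ** 2).sum() / (m - 1)

    # Expected disagreement: squared differences across all pairable values.
    vals = np.concatenate(units)
    n = len(vals)
    exp_sum = ((vals[:, None] - vals[None, :]) ** 2).sum() / (n - 1)

    return 1.0 - obs_sum / exp_sum

# Toy example: 3 annotators rating 5 opinion segments on a -3..+3 intensity scale.
toy = np.array([
    [2.0, 1.0, -1.0, 0.0,    3.0],
    [2.0, 2.0, -2.0, 0.0,    3.0],
    [3.0, 1.0, -1.0, np.nan, 2.0],
])
print(f"alpha = {krippendorff_alpha_interval(toy):.2f}")
```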
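
The simplest baseline family in item 5, an SVR regressing sentiment intensity from concatenated verbal and visual features and evaluated with mean absolute error and correlation, can be sketched as below. This is not the authors' pipeline: the feature matrices, dimensionalities, and split are random placeholders standing in for per-segment text and visual descriptors, so the printed scores are meaningless and only the protocol is illustrated.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Placeholder features: one row per opinion segment.
n_segments = 2199                                  # segment count reported for MOSI
text_feats = rng.normal(size=(n_segments, 300))    # e.g., averaged word embeddings
visual_feats = rng.normal(size=(n_segments, 40))   # e.g., per-segment facial descriptors
labels = rng.uniform(-3, 3, size=n_segments)       # sentiment intensity targets

# Early fusion: simple feature concatenation across modalities.
X = np.hstack([text_feats, visual_feats])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

print("MAE:", mean_absolute_error(y_te, pred))
print("Pearson r:", pearsonr(y_te, pred)[0])
```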

Implications and Future Directions

The introduction of MOSI marks a critical advancement in the field of multimodal sentiment analysis, setting a benchmark for future research. The integration of multimodal cues opens avenues for enhancing sentiment analysis algorithms, potentially improving their applicability in fields like automated video summarization, sentiment-aware recommendation systems, and human-computer interaction.

Extending this framework to richer visual and auditory features, along with cross-domain sentiment analysis, represents an intriguing opportunity for future inquiry. The results also highlight the potential of advanced neural architectures, such as transformers or attention-based models, to further untangle the complex interdependencies between multimodal cues.

Overall, the MOSI dataset presents a significant resource for the academic community, offering a robust platform for exploring the intricate dynamics of sentiment and subjectivity in multimedia content. Researchers now have the empirical foundation to refine multimodal sentiment analysis techniques, propelling the development of more sophisticated applications in natural language processing and beyond.

Authors (4)
  1. Amir Zadeh (36 papers)
  2. Rowan Zellers (25 papers)
  3. Eli Pincus (1 paper)
  4. Louis-Philippe Morency (123 papers)
Citations (406)