Overview of MAC: Mining Activity Concepts for Language-based Temporal Localization
The paper "MAC: Mining Activity Concepts for Language-based Temporal Localization" addresses the challenging problem of language-based temporal localization in untrimmed videos, which enables the identification of specific activities based on natural language queries without relying on pre-defined categories. Previous approaches have primarily focused on determining correlations between video features extracted through sliding windows and language queries, but these methods often overlook the rich semantic cues inherent in both modalities.
Introduction
The authors propose the Activity Concepts based Localizer (ACL), enhanced with actionness scores. ACL mines verb-object (VO) pairs from the language query as semantic activity concepts and uses the probability distributions produced by activity classifiers as visual activity concepts. By aligning these two kinds of activity concepts, the method strengthens the visual-semantic correlation encoding and yields more accurate localization.
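As a concrete illustration of the semantic side, the query "A person opens the refrigerator" contains the VO pair (open, refrigerator). The sketch below extracts such pairs with an off-the-shelf dependency parser (spaCy, chosen purely for illustration; the paper's exact extraction pipeline may differ, so treat the tool and the rule as assumptions).

```python
# Illustrative sketch: mining verb-object (VO) pairs from a query with spaCy.
# Assumption: a direct-object dependency whose head is a verb counts as a VO pair.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_vo_pairs(query: str):
    """Return (verb_lemma, object_lemma) pairs found in the query."""
    doc = nlp(query)
    pairs = []
    for token in doc:
        # e.g. "A person opens the refrigerator" -> ("open", "refrigerator")
        if token.dep_ == "dobj" and token.head.pos_ == "VERB":
            pairs.append((token.head.lemma_, token.lemma_))
    return pairs

print(extract_vo_pairs("A person opens the refrigerator and pours a glass of milk."))
# e.g. [('open', 'refrigerator'), ('pour', 'glass')]
```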
Methodology
The proposed ACL framework consists of several components:
- Video Pre-processing: Videos are split into short clip units, and each unit is passed through a 3D CNN (C3D) that yields both a general visual feature and a probability distribution over activity labels; sliding windows over these units serve as localization proposals (see the sketch after this list).
- Language Embedding: Queries are encoded with skip-thought vectors and augmented with the VO pairs mined from the query as semantic activity concepts.
- Activity Concept Mining: The semantic activity concepts (VO pairs) and the visual activity concepts (classifier label distributions) are encoded and matched alongside the ordinary features, strengthening the correlation-based alignment between video and language.
- Actionness Confidence Scoring: Each sliding-window proposal is scored by its likelihood of containing a meaningful action, and this score re-weights the alignment score at test time, improving proposal selection.
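To make the video side concrete, the sketch below pools pre-extracted per-unit C3D outputs into sliding-window proposals. The unit length, window sizes, stride, and feature dimensions are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of sliding-window proposal construction, assuming per-unit
# C3D outputs are already extracted: `unit_feats` (fc features) and
# `unit_label_dists` (softmax over activity labels). All sizes are illustrative.
import numpy as np

def build_proposals(unit_feats, unit_label_dists,
                    window_sizes=(64, 128, 256), stride=32, unit_len=16):
    """Pool per-unit features into sliding-window proposals.

    unit_feats:       [num_units, feat_dim]   visual feature per 16-frame unit
    unit_label_dists: [num_units, num_labels] activity-label distribution per unit
    Returns a list of (start_frame, end_frame, pooled_feat, pooled_dist).
    """
    num_units = unit_feats.shape[0]
    proposals = []
    for w in window_sizes:
        units_per_window = w // unit_len
        step = max(stride // unit_len, 1)
        for start in range(0, num_units - units_per_window + 1, step):
            end = start + units_per_window
            pooled_feat = unit_feats[start:end].mean(axis=0)
            pooled_dist = unit_label_dists[start:end].mean(axis=0)
            proposals.append((start * unit_len, end * unit_len, pooled_feat, pooled_dist))
    return proposals

# Example with random stand-ins for C3D outputs (4096-d fc features,
# 487 Sports-1M activity labels):
feats = np.random.randn(200, 4096).astype(np.float32)
dists = np.random.rand(200, 487).astype(np.float32)
dists /= dists.sum(axis=1, keepdims=True)
proposals = build_proposals(feats, dists)
```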
The experiments demonstrate that exploiting both the semantic activity concepts (VO pairs) and the visual activity concepts contributes significantly to localization performance.
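The sketch below shows, in PyTorch, how these pieces might fit together in an ACL-style proposal scorer. It is a simplified reading of the description above, not the paper's exact architecture: the layer sizes, the fusion operator, and the single linear head per branch are assumptions made for illustration.

```python
# Simplified ACL-style scorer: one branch fuses visual features with the query
# embedding, a parallel branch fuses visual activity concepts (label
# distributions) with semantic activity concepts (VO-pair embeddings), and the
# combined alignment score is re-weighted by actionness at test time.
# Dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class ACLScorer(nn.Module):
    def __init__(self, vis_dim=4096, cls_dim=487, sent_dim=4800, vo_dim=300, hidden=1000):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)    # C3D visual feature
        self.cls_proj = nn.Linear(cls_dim, hidden)    # activity-label distribution
        self.sent_proj = nn.Linear(sent_dim, hidden)  # skip-thought query embedding
        self.vo_proj = nn.Linear(vo_dim, hidden)      # VO-pair embedding
        # Each branch predicts an alignment score plus start/end offsets.
        self.head_feat = nn.Linear(3 * hidden, 3)
        self.head_concept = nn.Linear(3 * hidden, 3)

    @staticmethod
    def fuse(a, b):
        # Multi-modal fusion: element-wise product concatenated with both inputs.
        return torch.cat([a * b, a, b], dim=-1)

    def forward(self, vis_feat, cls_dist, sent_emb, vo_emb, actionness):
        v = torch.relu(self.vis_proj(vis_feat))
        c = torch.relu(self.cls_proj(cls_dist))
        s = torch.relu(self.sent_proj(sent_emb))
        o = torch.relu(self.vo_proj(vo_emb))
        feat_out = self.head_feat(self.fuse(v, s))     # feature branch
        conc_out = self.head_concept(self.fuse(c, o))  # concept branch
        align = feat_out[:, 0] + conc_out[:, 0]
        offsets = feat_out[:, 1:] + conc_out[:, 1:]
        # Test-time re-weighting of each proposal by its actionness confidence.
        return align * actionness, offsets

# Score 8 candidate windows for one query (random stand-in tensors):
scorer = ACLScorer()
scores, offsets = scorer(
    torch.randn(8, 4096),                        # visual features per window
    torch.softmax(torch.randn(8, 487), dim=-1),  # label distributions per window
    torch.randn(1, 4800).expand(8, -1),          # query embedding, shared
    torch.randn(1, 300).expand(8, -1),           # VO-pair embedding, shared
    torch.rand(8),                               # actionness per window
)
best = scores.argmax().item()
```

The two parallel fusion branches correspond to the feature-level and concept-level correlations described above, and the multiplication by actionness mirrors the test-time re-weighting of proposal scores.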
Results and Implications
In experiments on two datasets, Charades-STA and TACoS, ACL outperformed state-of-the-art methods, improving on prior techniques such as CTRL by more than 5%. Additionally, replacing activity labels derived from Sports-1M with labels from Kinetics yields further gains, underscoring the value of diverse activity label sets for concept mining.
From a theoretical standpoint, this research enriches the understanding of multimodal interaction and semantic mining in video-language tasks. Practically, it paves the way for enhancing AI applications in video content analysis, including automated indexing and retrieval systems, personalized content curation, and real-time action detection in security surveillance or entertainment settings.
Future Directions
Future research may expand the range of activity descriptions and integrate semantic structures beyond VO pairs. More dynamic and fine-grained modeling of video events could further refine temporal proposals, and improvements in computational efficiency and in generalization across diverse datasets remain open directions.
Overall, this paper offers valuable insights into language-based activity localization through concept mining, illustrating how visual and semantic features can be jointly exploited to strengthen video understanding.