
Category-Based Deep CCA for Fine-Grained Venue Discovery from Multimodal Data (1805.02997v1)

Published 8 May 2018 in cs.CV

Abstract: In this work, travel destinations and business locations are treated as venues. Discovering a venue from a photo is important for context-aware applications. Unfortunately, few efforts have addressed complicated real-world images such as user-generated venue photos. Our goal is fine-grained venue discovery from heterogeneous social multimodal data. To this end, we propose a novel deep learning model, Category-based Deep Canonical Correlation Analysis (C-DCCA). Given a photo as input, this model performs (i) exact venue search (find the venue where the photo was taken) and (ii) group venue search (find relevant venues with the same category as that of the photo), using the cross-modal correlation between the input photo and the textual descriptions of venues. In this model, data from different modalities are projected into the same space via deep networks. Pairwise correlation (between data of different modalities from the same venue), used for exact venue search, and category-based correlation (between data of different modalities from different venues of the same category), used for group venue search, are jointly optimized. Because a single photo cannot fully reflect the rich textual description of a venue, the number of photos per venue is increased during training to capture more aspects of each venue. We build a new venue-aware multimodal dataset by integrating Wikipedia featured articles with Foursquare venue photos. Experimental results on this dataset confirm the feasibility of the proposed method. Moreover, evaluation on another publicly available dataset confirms that the proposed method outperforms state-of-the-art approaches for cross-modal retrieval between images and text.

Citations (95)

Summary

Category-Based Deep CCA for Fine-Grained Venue Discovery from Multimodal Data: A Summary

In this paper, the authors Yi Yu, Suhua Tang, Kiyoharu Aizawa, and Akiko Aizawa present a comprehensive approach to fine-grained venue discovery from multimodal data. They introduce Category-based Deep Canonical Correlation Analysis (C-DCCA), a novel model that exploits the contextual link between user-generated images and the textual descriptions of venues, thereby enhancing venue discovery for context-aware applications. The model performs both exact venue search, which locates the venue where a given photo was taken, and group venue search, which identifies venues of the same category as the photo.

The core contribution of the paper is the development of the C-DCCA, which extends the capabilities of the standard Deep Canonical Correlation Analysis (DCCA) by integrating category-based correlation into the correlation maximization framework. This approach leverages the disparate yet related modalities of image and text to project venue data into a shared embedding space. In this space, both pairwise correlation (between modalities of the same venue) and category-based correlation (between modalities from different venues within the same category) are concurrently optimized. This dual constraint effectively enhances the model’s robustness in identifying venues from complex multimodal data.
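
To make the joint objective concrete, the following is a minimal PyTorch sketch of the idea: maximize cross-modal agreement for matched venue pairs (the pairwise term) and for same-category venue pairs (the category-based term). The cosine-similarity surrogate, the weight alpha, and the name joint_correlation_loss are illustrative assumptions, not the authors' formulation; the paper optimizes canonical correlations in the DCCA style rather than this simplified loss.

```python
import torch
import torch.nn.functional as F

def joint_correlation_loss(img_emb, txt_emb, categories, alpha=0.5):
    # img_emb, txt_emb: (N, d) image/text embeddings already projected
    # into the shared space by the two deep branches.
    # categories: (N,) integer category label for each venue.
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    sim = img @ txt.t()  # (N, N) cross-modal cosine similarities

    # Pairwise term: each photo should match the text of its own venue
    # (the diagonal of the similarity matrix).
    pairwise = sim.diag().mean()

    # Category-based term: each photo should also match the texts of
    # other venues that share its category (off-diagonal cells).
    same_cat = categories.unsqueeze(0) == categories.unsqueeze(1)
    eye = torch.eye(len(categories), dtype=torch.bool, device=sim.device)
    mask = same_cat & ~eye
    category = sim[mask].mean() if mask.any() else sim.new_zeros(())

    # Jointly maximize both correlations (minimize the negative sum).
    return -(alpha * pairwise + (1.0 - alpha) * category)
```

The alpha hyperparameter simply balances the two terms here; how the paper weights them is not specified in this summary.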

In terms of methodology, the authors build a dataset combining Wikipedia articles and Foursquare venue photos, reflecting diverse visual and textual features. This dataset serves as a robust foundation for the C-DCCA model to capture the multidimensional facets of venues. Experimental evaluations demonstrate that the C-DCCA surpasses existing models like CCA, KCCA, and traditional DCCA, especially in the context of cross-modal retrieval tasks between images and texts. Notably, incorporating more user-generated images in the training phase significantly enhances the correlation with text descriptors, thereby improving retrieval precision.
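
Once both modalities live in the shared space, the two search modes reduce to nearest-neighbor ranking. The sketch below (hypothetical helper names, not the authors' code) illustrates exact venue search as top-k cosine ranking over venue text embeddings, and group venue search as the same ranking restricted to the category of the best-matching venue, used here as a proxy for the photo's category.

```python
import torch
import torch.nn.functional as F

def exact_venue_search(photo_emb, venue_text_embs, k=5):
    # photo_emb: (d,) embedding of the query photo in the shared space.
    # venue_text_embs: (num_venues, d) embeddings of venue descriptions.
    q = F.normalize(photo_emb, dim=0)
    v = F.normalize(venue_text_embs, dim=1)
    scores = v @ q                          # cosine similarity to each venue
    return torch.topk(scores, k).indices    # indices of the k best venues

def group_venue_search(photo_emb, venue_text_embs, venue_categories, k=5):
    # Rank venues by similarity, keeping only those in the same category
    # as the best-matching venue.
    q = F.normalize(photo_emb, dim=0)
    v = F.normalize(venue_text_embs, dim=1)
    scores = v @ q
    best_cat = venue_categories[scores.argmax()]
    scores = scores.masked_fill(venue_categories != best_cat, float('-inf'))
    return torch.topk(scores, k).indices
```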

The practical implications of this research are manifold. For context-aware applications in domains such as tourism and local business recommendation systems, accurately identifying venues from user-generated content can greatly improve user experience and service personalization. Theoretically, this work underscores the potential of integrating category information into the correlation learning between different data modalities, setting a precedent for future explorations in multi-task learning scenarios in AI and deep learning.

Moving forward, the methodologies proposed in this research could be expanded to address other multimodal learning challenges, extending beyond venue discovery to areas like event detection and multimedia content analysis. Furthermore, the integration of additional contextual data sources, such as user reviews or interaction history, may present opportunities to refine the model’s predictive accuracy and applicability in real-world settings. Overall, this paper adds substantial value to the field of multimodal data processing and its application in fine-grained recognition tasks.
