
COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval

Published 22 May 2018 in cs.CL and cs.CV | arXiv:1805.08661v2

Abstract: This paper contributes to cross-lingual image annotation and retrieval in terms of data and baseline methods. We propose COCO-CN, a novel dataset enriching MS-COCO with manually written Chinese sentences and tags. For more effective annotation acquisition, we develop a recommendation-assisted collective annotation system, automatically providing an annotator with several tags and sentences deemed to be relevant with respect to the pictorial content. Having 20,342 images annotated with 27,218 Chinese sentences and 70,993 tags, COCO-CN is currently the largest Chinese-English dataset that provides a unified and challenging platform for cross-lingual image tagging, captioning and retrieval. We develop conceptually simple yet effective methods per task for learning from cross-lingual resources. Extensive experiments on the three tasks justify the viability of the proposed dataset and methods. Data and code are publicly available at https://github.com/li-xirong/coco-cn

Citations (136)

Summary

  • The paper introduces the COCO-CN dataset, integrating over 20K images with bilingual annotations to reduce semantic gaps in cross-lingual tasks.
  • The paper employs a cascading MLP for tagging and a sequential learning strategy for captioning, achieving significant improvements on both tasks, with the captioning gains reflected in higher BLEU and CIDEr scores.
  • The paper advances cross-modal retrieval by enhancing the W2VV model with soft attention and contrastive loss, outperforming traditional methods in aligning Chinese queries with English image descriptions.

COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval

The paper "COCO-CN for Cross-Lingual Image Tagging, Captioning and Retrieval" presents a sophisticated approach towards enhancing cross-lingual multimedia content description. It introduces a novel dataset, COCO-CN, which extends the well-known MS-COCO dataset by integrating Chinese language annotations. This integration includes manually written Chinese sentences and tags, creating the most comprehensive Chinese-English dataset focusing on image tagging, captioning, and retrieval.

COCO-CN Dataset and Annotation System

COCO-CN is a substantial addition to the pool of bilingual datasets, comprising 20,342 images annotated with 27,218 Chinese sentences and 70,993 tags. The authors developed a recommendation-assisted collective annotation system that supports annotators by suggesting tags and sentences deemed relevant to the pictorial content. Because the annotations are written by humans rather than produced by machine translation alone, the dataset narrows the semantic discrepancy inherent in cross-lingual tasks and achieves higher quality than automatically translated alternatives.
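The summary does not spell out how the recommendation step works. One plausible realization is nearest-neighbour retrieval over image features: candidate tags for a new image are pooled from visually similar images that have already been annotated. The sketch below is a hypothetical illustration of that idea; the function names, feature representation, and ranking scheme are assumptions, not the authors' implementation.

    # Hypothetical sketch: recommend candidate tags for a new image by pooling
    # the tags of its visually nearest, already-annotated neighbours.
    # Feature extraction and the actual COCO-CN pipeline are not shown here.
    from collections import Counter

    import numpy as np


    def recommend_tags(query_feat, bank_feats, bank_tags, k=5, n_tags=10):
        """query_feat: (d,) image feature; bank_feats: (N, d) features of
        annotated images; bank_tags: list of N tag lists (Chinese tags)."""
        # Cosine similarity between the query and every annotated image.
        q = query_feat / (np.linalg.norm(query_feat) + 1e-12)
        b = bank_feats / (np.linalg.norm(bank_feats, axis=1, keepdims=True) + 1e-12)
        sims = b @ q
        # Pool tags from the top-k most similar images, ranked by frequency.
        top = np.argsort(-sims)[:k]
        counts = Counter(tag for i in top for tag in bank_tags[i])
        return [tag for tag, _ in counts.most_common(n_tags)]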

Methods and Experimental Evaluation

The research uses COCO-CN to explore robust models for cross-lingual image tagging, captioning, and retrieval:

  • Cross-Lingual Image Tagging: The authors propose a cascading MLP that learns through sequential training on cross-lingual resources. It outperforms simpler models trained on monolingual resources alone, such as an MLP trained solely on COCO-CN (a hedged sketch follows this list).
  • Cross-Lingual Image Captioning: A sequential learning strategy, in which the captioning model is first trained on machine-translated text and then refined on manually written sentences, yields significant improvements in BLEU and CIDEr scores, showing that hybrid training strategies can effectively combine machine-generated and human-labeled data (a training-schedule sketch follows this list).
  • Cross-Lingual Image Retrieval: An enhanced W2VV model with a soft attention mechanism and a contrastive loss matches Chinese queries against English image descriptions and outperforms classical methods by relying on multimodal embeddings (a sketch of the encoder and loss follows this list).
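For tagging, the following is a minimal PyTorch-style sketch of one way to realize a cascading MLP: an English-tag predictor (pretrainable on MS-COCO, which has abundant English labels) whose scores are concatenated with the visual feature and fed to a Chinese-tag predictor (trainable on the smaller COCO-CN). The wiring, layer sizes, and tag vocabularies are illustrative assumptions rather than the paper's exact architecture.

    # Minimal PyTorch sketch of a cascading MLP for Chinese tag prediction.
    # The wiring (English-tag scores concatenated with the visual feature and
    # fed to a second MLP) and all layer sizes are illustrative assumptions.
    import torch
    import torch.nn as nn


    class CascadingMLP(nn.Module):
        def __init__(self, feat_dim=2048, hidden=1024, n_en_tags=512, n_zh_tags=512):
            super().__init__()
            # Stage 1: predicts English tags from the visual feature.
            self.en_mlp = nn.Sequential(
                nn.Linear(feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_en_tags),
            )
            # Stage 2: predicts Chinese tags from the visual feature plus the
            # English-tag scores.
            self.zh_mlp = nn.Sequential(
                nn.Linear(feat_dim + n_en_tags, hidden), nn.ReLU(),
                nn.Linear(hidden, n_zh_tags),
            )

        def forward(self, feat):
            en_scores = torch.sigmoid(self.en_mlp(feat))
            zh_logits = self.zh_mlp(torch.cat([feat, en_scores], dim=-1))
            return torch.sigmoid(zh_logits)  # multi-label tag probabilities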
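For captioning, the sequential learning strategy amounts to a two-stage training schedule: pretrain on the large but noisy machine-translated captions, then fine-tune on the smaller set of manually written COCO-CN captions. The sketch below assumes a generic caption model that returns a loss; the optimizers, learning rates, and epoch counts are placeholders, not the paper's configuration.

    # Sketch of the sequential-learning schedule for captioning: pretrain on
    # machine-translated Chinese captions, then fine-tune on manually written
    # COCO-CN captions. Model, loaders, and hyperparameters are placeholders.
    import torch


    def train_epochs(model, loader, optimizer, epochs):
        model.train()
        for _ in range(epochs):
            for feats, captions in loader:
                optimizer.zero_grad()
                loss = model(feats, captions)  # assumed to return a caption loss
                loss.backward()
                optimizer.step()


    def sequential_learning(model, mt_loader, manual_loader):
        # Stage 1: large but noisy machine-translated data.
        opt = torch.optim.Adam(model.parameters(), lr=1e-4)
        train_epochs(model, mt_loader, opt, epochs=20)
        # Stage 2: smaller but clean human-written data, lower learning rate.
        opt = torch.optim.Adam(model.parameters(), lr=2e-5)
        train_epochs(model, manual_loader, opt, epochs=10)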
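For retrieval, a W2VV-style sentence encoder with soft attention can map a Chinese query into the visual feature space and be trained with a margin-based contrastive (triplet) loss over in-batch negatives. The sketch below illustrates this idea under assumed dimensions and a simple additive attention; it is not the authors' exact model.

    # Sketch of a W2VV-style sentence encoder with soft attention, trained with
    # a margin-based contrastive (triplet) loss to embed Chinese queries into a
    # visual feature space. Dimensions and the attention form are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class AttentiveSentenceEncoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=500, feat_dim=2048):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            self.att = nn.Linear(emb_dim, 1)          # soft attention scores
            self.proj = nn.Linear(emb_dim, feat_dim)  # map to visual space

        def forward(self, token_ids):                 # (B, T) padded token ids
            e = self.emb(token_ids)                   # (B, T, emb_dim)
            mask = (token_ids != 0).unsqueeze(-1)
            scores = self.att(e).masked_fill(~mask, float('-inf'))
            alpha = torch.softmax(scores, dim=1)      # attention weights
            pooled = (alpha * e).sum(dim=1)           # attention-weighted pooling
            return F.normalize(self.proj(pooled), dim=-1)


    def contrastive_loss(sent_vecs, img_vecs, margin=0.2):
        """Hinge-based triplet loss over in-batch negatives (both directions)."""
        img_vecs = F.normalize(img_vecs, dim=-1)
        sims = sent_vecs @ img_vecs.t()               # (B, B) similarity matrix
        pos = sims.diag().unsqueeze(1)
        cost_s = (margin + sims - pos).clamp(min=0)   # sentence -> wrong image
        cost_i = (margin + sims - pos.t()).clamp(min=0)
        mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
        return (cost_s.masked_fill(mask, 0).sum() +
                cost_i.masked_fill(mask, 0).sum()) / sims.size(0)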

Implications and Future Directions

COCO-CN advances research on cross-lingual image tasks by providing a well-balanced, semantically rich dataset that overcomes the limitations of existing datasets, which are either monolingual or restricted in size and variety. Its manual annotation process ensures a level of quality that could be pivotal for training and evaluating advanced deep learning models in multilingual environments.

Future research could explore integrating more diverse languages into similar frameworks, thus broadening the application scope of these models. Additionally, future work could investigate tighter integration of multimodal data by leveraging advances in attention mechanisms and representation learning.

Conclusion

COCO-CN is a notable contribution to the cross-lingual description of multimedia content. The dataset and the methods presented alongside it pave the way for further investigation into model training paradigms that cater to diverse languages and cultures, ultimately enhancing the accessibility and comprehension of digital multimedia in a global context.
