Multilingual Universal Sentence Encoder for Semantic Retrieval
This paper introduces two pre-trained multilingual sentence encoding models, one based on the Transformer architecture and one on a CNN. Both embed text from 16 languages into a single shared semantic space, targeting semantic retrieval (SR), bitext retrieval (BR), and retrieval question answering (ReQA). Through multi-task dual-encoder training with translation-based bridge tasks, the models learn unified cross-lingual representations and achieve performance competitive with the state of the art.
Model Overview
Both models, Transformer and CNN, are trained within a shared multi-task dual-encoder framework whose tasks include multi-feature question-answer prediction, translation ranking, and natural language inference (NLI). A single SentencePiece vocabulary shared across all 16 languages provides uniform subword tokenization.
- Transformer Model: Uses the encoder component of the Transformer architecture. Bi-directional self-attention computes context-aware token representations, which are averaged to produce sentence-level embeddings (see the sketch after this list).
- CNN Model: Employs convolutional layers with varying filter widths, trading some accuracy relative to the Transformer for faster inference.
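The two core mechanics described above, averaging token representations into a sentence embedding and training the dual encoder with a translation-ranking objective, can be sketched as follows. The shapes, dimensions, and function names here are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of mean pooling and the in-batch dot-product ranking loss.
# Shapes and names are illustrative assumptions, not the paper's configuration.
import tensorflow as tf

def mean_pool(token_embeddings, mask):
    """Average context-aware token vectors into one sentence embedding.

    token_embeddings: [batch, seq_len, dim] output of the Transformer/CNN encoder.
    mask:             [batch, seq_len] with 1 for real tokens, 0 for padding.
    """
    mask = tf.cast(mask, token_embeddings.dtype)[:, :, tf.newaxis]
    summed = tf.reduce_sum(token_embeddings * mask, axis=1)
    counts = tf.maximum(tf.reduce_sum(mask, axis=1), 1.0)
    return summed / counts

def in_batch_ranking_loss(source_emb, target_emb):
    """Translation-ranking objective: each source sentence should score its own
    translation higher than every other target in the batch."""
    source_emb = tf.nn.l2_normalize(source_emb, axis=-1)
    target_emb = tf.nn.l2_normalize(target_emb, axis=-1)
    scores = tf.matmul(source_emb, target_emb, transpose_b=True)  # [batch, batch]
    labels = tf.range(tf.shape(scores)[0])                        # diagonal is correct
    return tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=scores))

# Toy usage with random "encoder outputs".
tokens = tf.random.normal([4, 12, 512])
mask = tf.ones([4, 12])
src = mean_pool(tokens, mask)
tgt = mean_pool(tf.random.normal([4, 12, 512]), mask)
print(in_batch_ranking_loss(src, tgt).numpy())
```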
Training Corpus and Configuration
The training data consists of mined question-answer pairs, mined translation pairs, and the SNLI corpus, balanced across languages with the help of translated data. The CNN model accepts input sequences of up to 256 tokens, while the Transformer is limited to 100 tokens for efficiency. Hyperparameters and architectural choices, such as the number of CNN and Transformer layers, are tuned to maximize performance across tasks.
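To make the preprocessing concrete, the sketch below tokenizes text into a shared subword vocabulary with SentencePiece and truncates to the per-architecture limits quoted above. The model file name is a placeholder; the paper's own SentencePiece model is only shipped inside the released TF Hub modules.

```python
# A sketch of shared subword tokenization with SentencePiece; the model file
# name is a placeholder, not an artifact distributed with the paper.
import sentencepiece as spm

MAX_LEN = {"transformer": 100, "cnn": 256}  # per-architecture input limits

sp = spm.SentencePieceProcessor(model_file="multilingual.model")  # hypothetical file

def tokenize(text: str, arch: str = "transformer"):
    """Tokenize into subword ids shared across all 16 languages, then truncate."""
    ids = sp.encode(text, out_type=int)
    return ids[: MAX_LEN[arch]]

print(tokenize("Where can I find good coffee near the station?"))
print(tokenize("¿Dónde puedo encontrar buen café cerca de la estación?", arch="cnn"))
```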
Experimentation on Retrieval Tasks
The paper conducts extensive experiments to evaluate the models' capabilities across retrieval tasks:
- Semantic Retrieval (SR): Using the Quora and AskUbuntu question datasets, the models must identify semantically similar sentences. Both variants perform competitively, with the Transformer model (USE_Trans) typically surpassing the CNN variant; a retrieval sketch follows this list.
- Bitext Retrieval (BR): On the United Nations Parallel Corpus, the models achieve strong precision but slightly lag behind the current state of the art, partly due to vocabulary limitations affecting languages with large character sets.
- Retrieval Question Answering (ReQA): Evaluated on the SQuAD dataset, the Transformer QA model with context (USE_QA Trans+Cxt) outperforms BM25 in paragraph retrieval, highlighting the models' effectiveness in contextual sentence retrieval.
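The following sketch shows how the released encoders support this kind of retrieval: embed a candidate pool, embed the query, and rank candidates by dot product. The TF Hub handle is the publicly listed multilingual USE module; the version number and example sentences are assumptions, not the paper's evaluation setup.

```python
# A sketch of nearest-neighbour semantic retrieval with the released encoder.
# The TF Hub handle/version and the toy candidate pool are assumptions.
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the SentencePiece ops the model needs

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

candidates = [
    "How do I reset my password?",
    "What is the capital of France?",
    "My login credentials no longer work.",
]
candidate_emb = encoder(candidates).numpy()            # [num_candidates, 512]

query_emb = encoder(["I forgot my password"]).numpy()  # [1, 512]
scores = query_emb @ candidate_emb.T                   # dot-product similarity
for idx in np.argsort(-scores[0]):
    print(f"{scores[0, idx]:.3f}  {candidates[idx]}")
```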
Cross-lingual Retrieval
Extending SR and ReQA to the cross-lingual setting, where queries and candidates are in different languages, demonstrates the models' robustness across languages. Despite the inherent difficulty of cross-lingual retrieval, performance remains close to that of the monolingual setting, underlining the models' multilingual versatility.
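Because all 16 languages share one embedding space, the same retrieval code works when the query and candidates are in different languages. A minimal self-contained sketch, where the example sentences and module version are illustrative assumptions:

```python
# Cross-lingual retrieval: an English query scored against non-English
# candidates in the same shared embedding space. Handle/version are assumptions.
import numpy as np
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers ops required by the model

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

query = encoder(["How do I renew my passport?"]).numpy()
candidates = [
    "¿Cómo renuevo mi pasaporte?",         # Spanish: "How do I renew my passport?"
    "Wie ist das Wetter heute?",           # German: "What is the weather today?"
    "Comment renouveler mon passeport ?",  # French: "How to renew my passport?"
]
scores = (query @ encoder(candidates).numpy().T)[0]
best = int(np.argmax(scores))
print(f"best match: {candidates[best]} (score {scores[best]:.3f})")
```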
Transfer Learning and Benchmarks
Performance on the SentEval transfer tasks shows the models are competitive with monolingual English encoders. In particular, the Transformer-based model matches or exceeds strong baselines on several tasks, indicating that multilingual training does not come at a large cost to English transfer performance.
From a computational perspective, the models offer favorable inference times and memory footprints, making them suitable for real-world applications where computational resources are limited.
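A rough way to check this on a given machine is to time the two released variants side by side. The handles below are the public CNN-based and Transformer-based ("-large") modules on TF Hub; version numbers and the batch size are assumptions.

```python
# Rough per-batch encoding latency for the CNN vs. Transformer variants.
# Handles/versions and batch size are assumptions, not the paper's benchmark setup.
import time
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers ops required by the models

sentences = ["An example sentence to embed."] * 64

for handle in [
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3",        # CNN
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3",  # Transformer
]:
    model = hub.load(handle)
    model(sentences)  # warm-up call
    start = time.perf_counter()
    model(sentences)
    print(handle.split("/")[-2], f"{time.perf_counter() - start:.3f}s per batch of 64")
```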
Conclusion and Implications
The advances in multilingual sentence encoding presented here have both theoretical and practical implications. Embedding many languages into a shared space supports cross-linguistic research, interoperability, and extensibility in AI systems. Future work may explore improving language-specific embeddings and expanding the set of supported languages to broaden multilingual interaction and understanding. The models are publicly available on TensorFlow Hub, reflecting a commitment to cross-linguistic accessibility and innovation.