Some Like It Small: Czech Semantic Embedding Models for Industry Applications (2311.13921v1)
Abstract: This article focuses on the development and evaluation of Small-sized Czech sentence embedding models. Small models are important components of real-time industry applications in resource-constrained environments. Given the limited availability of labeled Czech data, alternative approaches are investigated, including pre-training, knowledge distillation, and unsupervised contrastive fine-tuning. Comprehensive intrinsic and extrinsic analyses show that our models are competitive with significantly larger counterparts while being approximately 8 times smaller and 5 times faster than conventional Base-sized models. To promote cooperation and reproducibility, both the models and the evaluation pipeline are made publicly available. Finally, this article presents practical applications of the developed sentence embedding models in Seznam.cz, the Czech search engine, where they have replaced their predecessors and improved the search experience in, for instance, organic search, featured snippets, and image search.
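The abstract names knowledge distillation as one of the techniques used to train the Small-sized models. The sketch below illustrates the general idea under assumed choices not taken from the paper: a frozen large "teacher" encoder, a small "student" encoder trained to regress onto the teacher's mean-pooled sentence embeddings, a PyTorch/Hugging Face setup, and placeholder model names. It is a minimal illustration, not the authors' actual training recipe.

```python
# Minimal sketch of sentence-embedding knowledge distillation: a small "student"
# encoder is trained to reproduce the embeddings of a larger frozen "teacher".
# Model names are placeholders, not the paper's actual checkpoints.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

TEACHER_NAME = "teacher-base-model"   # placeholder: a large Czech/multilingual encoder
STUDENT_NAME = "student-small-model"  # placeholder: the Small-sized encoder being trained

teacher_tok = AutoTokenizer.from_pretrained(TEACHER_NAME)
student_tok = AutoTokenizer.from_pretrained(STUDENT_NAME)
teacher = AutoModel.from_pretrained(TEACHER_NAME).eval()
student = AutoModel.from_pretrained(STUDENT_NAME).train()

def embed(model, tokenizer, sentences):
    """Mean-pool the last hidden states into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = model(**batch).last_hidden_state            # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)      # (batch, tokens, 1)
    return (out * mask).sum(1) / mask.sum(1).clamp(min=1)

optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

def distillation_step(sentences):
    """One step: student embeddings regress onto frozen teacher embeddings."""
    with torch.no_grad():
        target = embed(teacher, teacher_tok, sentences)
    pred = embed(student, student_tok, sentences)
    loss = F.mse_loss(pred, target)   # assumes matching embedding dimensions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage on a toy batch of unlabeled Czech sentences:
print(distillation_step(["Praha je hlavní město České republiky.",
                         "Vyhledávání obrázků na Seznamu."]))
```

Because the objective needs only raw sentences and a teacher model, this style of training sidesteps the scarcity of labeled Czech data noted in the abstract; the unsupervised contrastive fine-tuning mentioned there would similarly operate on unlabeled text.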