
Sõnajaht: Definition Embeddings and Semantic Search for Reverse Dictionary Creation (2404.19430v1)

Published 30 Apr 2024 in cs.CL

Abstract: We present an information retrieval based reverse dictionary system using modern pre-trained LLMs and approximate nearest neighbors search algorithms. The proposed approach is applied to an existing Estonian language lexicon resource, Sõnaveeb (word web), with the purpose of enhancing and enriching it by introducing cross-lingual reverse dictionary functionality powered by semantic search. The performance of the system is evaluated using both an existing labeled English dataset of words and definitions, extended to also contain Estonian and Russian translations, and a novel unlabeled evaluation approach that extracts the evaluation data from the lexicon resource itself using synonymy relations. Evaluation results indicate that the information retrieval based semantic search approach is feasible without any model training, producing a median rank of 1 in the monolingual setting and a median rank of 2 in the cross-lingual setting under the unlabeled evaluation approach, with models trained for cross-lingual retrieval whose training data includes Estonian showing superior performance on our particular task.
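As a concrete illustration, the retrieval pipeline the abstract describes can be sketched as follows: embed every definition in the lexicon, then answer a free-form description with nearest-neighbor search over those embeddings. The toy lexicon, the bag-of-words `embed` function, and the exact cosine search below are illustrative assumptions only; the paper itself uses pretrained multilingual sentence encoders and approximate nearest-neighbor indexing over real definition embeddings.

```python
import numpy as np

# Toy lexicon standing in for Sõnaveeb entries (word -> definition).
LEXICON = {
    "dog": "a domesticated animal that barks",
    "cat": "a small furry animal that purrs",
    "umbrella": "a device that protects you from rain",
}

# Vocabulary over all definition tokens, for the toy embedding below.
VOCAB = sorted({w for d in LEXICON.values() for w in d.split()})

def embed(text):
    """Toy embedding: L2-normalized term-count vector.

    A stand-in for a pretrained multilingual sentence encoder."""
    v = np.array([text.split().count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

# Build the definition index once, up front.
words = list(LEXICON)
matrix = np.stack([embed(LEXICON[w]) for w in words])

def reverse_lookup(description, k=2):
    """Return the k lexicon words whose definitions best match."""
    sims = matrix @ embed(description)  # cosine similarity (vectors are unit-norm)
    return [words[i] for i in np.argsort(-sims)[:k]]

print(reverse_lookup("an animal that barks"))  # 'dog' should rank first
```

In the full system, the exact similarity scan over `matrix` would be replaced by an approximate nearest-neighbor index so that lookups stay fast over a lexicon-scale set of definitions.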

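The median-rank metric reported in the abstract can be sketched as below: for each (definition, target word) evaluation pair, record the position of the target word in the system's ranked output, then take the median over all pairs. The ranked lists here are hypothetical stand-ins for real system output, used only to show how the metric is computed.

```python
import statistics

# Hypothetical ranked retrieval output: target word -> ranked candidates.
ranked_results = {
    "dog": ["dog", "wolf", "cat"],        # target found at rank 1
    "cat": ["kitten", "cat", "dog"],      # target found at rank 2
    "umbrella": ["umbrella", "parasol"],  # target found at rank 1
}

# 1-based rank of each target word within its own result list.
ranks = [hits.index(target) + 1 for target, hits in ranked_results.items()]

# Median rank over all evaluation pairs; a value of 1 means the
# target word is the top result for at least half of the queries.
print(statistics.median(ranks))
```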

