Siamese BERT-based Model for Web Search Relevance Ranking Evaluated on a New Czech Dataset (2112.01810v1)

Published 3 Dec 2021 in cs.IR and cs.CL

Abstract: Web search engines focus on serving highly relevant results within hundreds of milliseconds. Pre-trained transformer models such as BERT are therefore hard to use in this scenario due to their high computational demands. We present our real-time approach to the document ranking problem leveraging a BERT-based siamese architecture. The model is already deployed in a commercial search engine, where it improves production performance by more than 3%. For further research and evaluation, we release DaReCzech, a unique dataset of 1.6 million Czech user query-document pairs with manually assigned relevance levels. We also release Small-E-Czech, an Electra-small language model pre-trained on a large Czech corpus. We believe this data will support the endeavours of both the search relevance and multilingual-focused research communities.
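As a rough illustration of why a siamese (bi-encoder) architecture suits real-time ranking, the sketch below encodes the query and each document independently with the same shared encoder and scores relevance by embedding similarity. It is a minimal sketch using PyTorch and Hugging Face `transformers`; the checkpoint id, mean pooling, and cosine similarity are illustrative assumptions, not the deployed system's exact configuration.

```python
# Minimal siamese (bi-encoder) relevance scorer: query and document
# are encoded by the SAME encoder, and relevance is the similarity of
# the two pooled embeddings. All specifics here are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "Seznam/small-e-czech"  # assumed Hugging Face id for Small-E-Czech

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
encoder = AutoModel.from_pretrained(CHECKPOINT)

def embed(texts):
    """Encode a batch of texts and mean-pool the token embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    summed = (hidden * mask).sum(dim=1)
    return summed / mask.sum(dim=1).clamp(min=1)      # (B, H)

query_emb = embed(["levné letenky do Prahy"])         # "cheap flights to Prague"
doc_embs = embed(["Akční letenky Praha a dalších 100 destinací",
                  "Recepty na tradiční bábovku"])

# Documents are encoded independently of the query, so their embeddings
# can be precomputed offline; at query time only one forward pass (the
# query) plus cheap similarity computations are needed.
scores = F.cosine_similarity(query_emb, doc_embs)
print(scores)  # higher score = more relevant document
```

This decoupling is what makes the approach feasible within a latency budget of hundreds of milliseconds: the expensive transformer pass over each document happens at indexing time, not at query time.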

Authors (4)
  1. Matěj Kocián
  2. Jakub Náplava
  3. Daniel Štancl
  4. Vladimír Kadlec
Citations (15)