Taxonomy-guided Semantic Indexing for Academic Paper Search

Published 25 Oct 2024 in cs.IR and cs.AI (arXiv:2410.19218v1)

Abstract: Academic paper search is essential for efficient literature discovery and scientific advancement. While dense retrieval has advanced various ad-hoc search tasks, it often struggles to match the underlying academic concepts between queries and documents, which is critical for paper search. To enable effective academic concept matching, we propose the Taxonomy-guided Semantic Indexing (TaxoIndex) framework. TaxoIndex extracts key concepts from papers and organizes them into a semantic index guided by an academic taxonomy, then leverages this index as foundational knowledge to identify academic concepts and link queries with documents. As a plug-and-play framework, TaxoIndex can be flexibly employed to enhance existing dense retrievers. Extensive experiments show that TaxoIndex yields significant improvements, even with highly limited training data, and greatly enhances interpretability.
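The abstract describes a plug-and-play design: a taxonomy-guided concept index whose matching signal is combined with an existing dense retriever's score. The sketch below is an illustrative toy, not the authors' implementation; the taxonomy entries, the string-matching concept extractor, and the interpolation weight `alpha` are all hypothetical stand-ins for the paper's learned components.

```python
from math import sqrt

# Toy academic taxonomy (hypothetical): concept -> parent topic.
TAXONOMY = {
    "dense retrieval": "information retrieval",
    "semantic indexing": "information retrieval",
    "graph neural network": "machine learning",
}

def extract_concepts(text):
    """Find taxonomy concepts mentioned in the text (a crude stand-in
    for the paper's concept extraction from papers and queries)."""
    text = text.lower()
    return {c for c in TAXONOMY if c in text}

def cosine(u, v):
    """Cosine similarity, as a generic dense-retriever score."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def concept_overlap(query, doc):
    """Jaccard overlap of indexed concepts: the semantic index links a
    query and a document through shared academic concepts."""
    q, d = extract_concepts(query), extract_concepts(doc)
    if not q and not d:
        return 0.0
    return len(q & d) / len(q | d)

def taxoindex_score(q_emb, d_emb, query, doc, alpha=0.5):
    """Plug-and-play combination: interpolate the dense score with the
    concept-matching score (alpha is a hypothetical mixing weight)."""
    return (1 - alpha) * cosine(q_emb, d_emb) + alpha * concept_overlap(query, doc)

# Usage: a document sharing an academic concept with the query is
# boosted beyond what the dense embeddings alone would give.
score = taxoindex_score(
    [1.0, 0.0], [1.0, 0.0],
    "dense retrieval for academic papers",
    "semantic indexing and dense retrieval",
)
```

The concept-overlap term also makes the final score interpretable: one can report exactly which taxonomy concepts a query and document share, which is the interpretability benefit the abstract highlights.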
