
De-DSI: Decentralised Differentiable Search Index (2404.12237v2)

Published 18 Apr 2024 in cs.IR, cs.AI, and cs.DC

Abstract: This study introduces De-DSI, a novel framework that fuses LLMs with genuine decentralization for information retrieval, applying the differentiable search index (DSI) concept in a decentralized setting. Focused on efficiently connecting novel user queries with document identifiers without direct document access, De-DSI operates solely on query-docid pairs. To enhance scalability, an ensemble of DSI models is introduced, where the dataset is partitioned into smaller shards and each model is trained on one shard. This approach maintains accuracy by reducing the amount of data each model needs to handle, and it facilitates scalability by aggregating the outcomes of multiple models. The aggregation uses beam search to identify the top docids from each model and applies a softmax function to normalize their scores, then selects the documents with the highest scores for retrieval. The decentralized implementation demonstrates that retrieval success is comparable to centralized methods, with the added benefit that computational load can be distributed across the network. This setup also allows for the retrieval of multimedia items through magnet links, eliminating the need for platforms or intermediaries.
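
The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each shard model has already run beam search and returned a list of (docid, raw score) pairs, and it assumes that a docid returned by several models keeps its highest normalized score. The names `softmax` and `aggregate` and the sample docids are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# One beam per shard model: (docid, raw beam score such as a log-probability).
Beam = List[Tuple[str, float]]


def softmax(scores: List[float]) -> List[float]:
    """Normalize raw scores into a distribution that sums to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(beams: List[Beam], top_k: int = 3) -> List[Tuple[str, float]]:
    """Merge per-model beams and return the top_k docids by normalized score.

    Assumption (not stated in the abstract): when several models return the
    same docid, the highest normalized score is kept.
    """
    best: Dict[str, float] = {}
    for beam in beams:
        if not beam:
            continue  # a model may return no candidates
        probs = softmax([score for _, score in beam])
        for (docid, _), p in zip(beam, probs):
            if p > best.get(docid, 0.0):
                best[docid] = p
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical beams from two shard models.
beams = [
    [("magnet:a", 0.0), ("magnet:b", 0.0)],
    [("magnet:c", 2.0), ("magnet:a", 0.0)],
]
result = aggregate(beams, top_k=2)
```

Normalizing each beam separately makes scores from different models comparable before merging, which matters because raw beam scores from independently trained models are not on a shared scale.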

Authors (3)
  1. Petru Neague
  2. Marcel Gregoriadis
  3. Johan Pouwelse