De-DSI: Decentralised Differentiable Search Index (2404.12237v2)
Abstract: This study introduces De-DSI, a novel framework that fuses large language models (LLMs) with genuine decentralization for information retrieval, specifically by applying the differentiable search index (DSI) concept in a decentralized setting. De-DSI focuses on efficiently connecting novel user queries with document identifiers (docids) without direct access to the documents, and it operates solely on query-docid pairs. To enhance scalability, an ensemble of DSI models is introduced: the dataset is partitioned into smaller shards, each used to train an individual model. This approach maintains accuracy by reducing the amount of data each model must handle, and it supports scalability by aggregating the outcomes of multiple models. The aggregation uses beam search to identify top docids, applies a softmax function to normalize their scores, and selects the documents with the highest scores for retrieval. The decentralized implementation demonstrates retrieval success comparable to that of centralized methods, with the added benefit that computational complexity can be distributed across the network. This setup also allows multimedia items to be retrieved through magnet links, eliminating the need for platforms or intermediaries.
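The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`softmax`, `aggregate`), the beam format, the example docids and scores, and the rule of keeping the higher score when two models propose the same docid are all assumptions made for illustration. The sketch assumes each shard model has already run beam search and returned a short list of candidate docids with raw scores.

```python
import math
from typing import Dict, List, Tuple

# One model's beam: (docid, raw beam-search score), e.g. a sequence log-probability.
Beam = List[Tuple[str, float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one model's beam scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(beams: List[Beam], top_k: int = 3) -> List[Tuple[str, float]]:
    """Merge per-model beams into one list ranked by normalized score."""
    best: Dict[str, float] = {}
    for beam in beams:
        if not beam:
            continue  # a model that returned no candidates contributes nothing
        probs = softmax([score for _, score in beam])
        for (docid, _), p in zip(beam, probs):
            # Assumption: if two models propose the same docid, keep the higher score.
            best[docid] = max(p, best.get(docid, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical beams from two shard models answering the same query.
shard_a = [("doc-17", -0.2), ("doc-03", -2.5)]
shard_b = [("doc-42", -1.0), ("doc-88", -1.1)]
results = aggregate([shard_a, shard_b], top_k=2)
```

In this example `doc-17` ranks first: the first model strongly prefers it over its other candidate, while the second model's two candidates are nearly tied, so neither receives a high normalized score. Normalizing within each beam makes scores from independently trained models comparable before they are merged.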
Authors: Petru Neague, Marcel Gregoriadis, Johan Pouwelse