The ROOTS Search Tool: Data Transparency for LLMs

Published 27 Feb 2023 in cs.CL and cs.AI (arXiv:2302.14035v1)

Abstract: ROOTS is a 1.6TB multilingual text corpus developed for the training of BLOOM, currently the largest LLM explicitly accompanied by commensurate data governance efforts. In continuation of these efforts, we present the ROOTS Search Tool: a search engine over the entire ROOTS corpus offering both fuzzy and exact search capabilities. ROOTS is the largest corpus to date that can be investigated this way. The ROOTS Search Tool is open-sourced and available on Hugging Face Spaces. We describe our implementation and the possible use cases of our tool.
