TabLib: A Dataset of 627M Tables with Context (2310.07875v1)

Published 11 Oct 2023 in cs.CL, cs.AI, cs.DB, and cs.LG

Abstract: It is well established that large, diverse datasets play a pivotal role in the performance of modern AI systems for text and image modalities. However, there are no datasets for tabular data of comparable size and diversity to those available for text and images. Thus we present "TabLib", a compilation of 627 million tables totaling 69 TiB, along with 867B tokens of context. TabLib was extracted from numerous file formats, including CSV, HTML, SQLite, PDF, Excel, and others, sourced from GitHub and Common Crawl. The size and diversity of TabLib offer considerable promise in the table modality, reminiscent of the original promise of foundational datasets for text and images, such as The Pile and LAION.


HackerNews

TabLib: A Dataset Of 627M Tables With Context (2023) (1 point, 0 comments)

Reddit

[R] TabLib: A Dataset Of 627 Million Tables With Context (15 points, 1 comment)
"TabLib: A Dataset Of 627 Million Tables With Context", Eggert et al 2023 (69TB + 0.87t tokens descriptions) (12 points, 0 comments)
