TabLib: A Dataset of 627M Tables with Context (2310.07875v1)

Published 11 Oct 2023 in cs.CL, cs.AI, cs.DB, and cs.LG

Abstract: It is well established that large, diverse datasets play a pivotal role in the performance of modern AI systems for text and image modalities. However, there are no datasets for tabular data of comparable size and diversity to those available for text and images. Thus we present "TabLib", a compilation of 627 million tables totaling 69 TiB, along with 867B tokens of context. TabLib was extracted from numerous file formats, including CSV, HTML, SQLite, PDF, Excel, and others, sourced from GitHub and Common Crawl. The size and diversity of TabLib offer considerable promise in the table modality, reminiscent of the original promise of foundational datasets for text and images, such as The Pile and LAION.
