Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach (2301.03560v2)

Published 9 Jan 2023 in cs.IR

Abstract: Most deployed data discovery systems, such as Google Datasets and open data portals, support only keyword search. Keyword search is geared towards general audiences but limits the types of queries the systems can answer. We propose a new system that lets users write natural language questions directly. A major barrier to using such a learned data discovery system is that it needs expensive-to-collect training data, which limits its utility. In this paper, we introduce a self-supervised approach to assemble training datasets and train learned discovery systems without human intervention. This requires addressing several challenges, including the design of self-supervised strategies for data discovery, table representation strategies to feed to the models, and relevance models that work well with the synthetically generated questions. We combine all the above contributions into a system, Solo, that solves the problem end to end. The evaluation results demonstrate that the new techniques outperform state-of-the-art approaches on well-known benchmarks. All in all, the technique is a stepping stone towards building learned discovery systems. The code is open-sourced at https://github.com/TheDataStation/solo

Authors (2)
  1. Qiming Wang (23 papers)
  2. Raul Castro Fernandez (27 papers)
Citations (5)
