NameGuess: Column Name Expansion for Tabular Data

Published 19 Oct 2023 in cs.CL, cs.DB, and cs.LG (arXiv:2310.13196v1)

Abstract: Recent advances in LLMs have revolutionized many sectors, including the database industry. A common challenge when dealing with large volumes of tabular data is the pervasive use of abbreviated column names, which can hurt performance on data search, access, and understanding tasks. To address this issue, we introduce a new task, NameGuess, which frames the expansion of column names (as used in database schemas) as a natural language generation problem. We create a training dataset of 384K abbreviated-expanded column pairs using a new data fabrication method, along with a human-annotated evaluation benchmark of 9.2K examples from real-world tables. To tackle the polysemy and ambiguity inherent in NameGuess, we enhance auto-regressive LLMs by conditioning on table content and column header names, yielding a fine-tuned model (with 2.7B parameters) that matches human performance. Furthermore, we conduct a comprehensive analysis of multiple LLMs to validate the effectiveness of table content in NameGuess and identify promising future opportunities. Code is available at https://github.com/amazon-science/nameguess.
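The abstract describes conditioning an auto-regressive LLM on table content and column headers to expand abbreviated names. A minimal sketch of how such a prompt might be assembled is shown below; the template, sample table, and function name are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical sketch of the NameGuess input setup: serialize a few sample
# rows plus all column headers, then ask the model to expand each
# abbreviated header. The prompt wording is an assumption for illustration.

def build_nameguess_prompt(headers, rows, n_sample_rows=3):
    """Serialize a table (headers + a few sample rows) into a prompt
    asking an LLM to expand abbreviated column names."""
    sample = rows[:n_sample_rows]
    table_str = " | ".join(headers) + "\n"
    table_str += "\n".join(
        " | ".join(str(v) for v in row) for row in sample
    )
    return (
        "Expand the abbreviated column names of the following table.\n"
        f"Table:\n{table_str}\n"
        f"Abbreviated columns: {', '.join(headers)}\n"
        "Expanded columns:"
    )

# Toy example table with abbreviated headers (hypothetical data).
headers = ["cust_id", "dob", "acct_bal"]
rows = [[101, "1990-05-01", 2500.75], [102, "1985-11-23", 310.00]]
prompt = build_nameguess_prompt(headers, rows)
print(prompt)
```

Including the sample rows gives the model the table-content signal the paper finds effective for disambiguating polysemous abbreviations (e.g., whether "dob" means "date of birth" given date-like values in that column).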
