
Wiki-TabNER: Advancing Table Interpretation Through Named Entity Recognition (2403.04577v1)

Published 7 Mar 2024 in cs.AI and cs.CL

Abstract: Web tables contain a large amount of valuable knowledge and have inspired tabular LLMs aimed at tackling table interpretation (TI) tasks. In this paper, we analyse a widely used benchmark dataset for evaluation of TI tasks, particularly focusing on the entity linking task. Our analysis reveals that this dataset is overly simplified, potentially reducing its effectiveness for thorough evaluation and failing to accurately represent tables as they appear in the real world. To overcome this drawback, we construct and annotate a new, more challenging dataset. In addition to introducing the new dataset, we also introduce a novel problem aimed at addressing the entity linking task: named entity recognition within cells. Finally, we propose a prompting framework for evaluating the newly developed LLMs on this novel TI task. We conduct experiments on prompting LLMs under various settings, where we use both random and similarity-based selection to choose the examples presented to the models. Our ablation study helps us gain insights into the impact of the few-shot examples. Additionally, we perform a qualitative analysis to gain insights into the challenges encountered by the models and to understand the limitations of the proposed dataset.
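The few-shot prompting setup described in the abstract (random vs. similarity-based selection of examples for in-cell NER) can be pictured with a short sketch. This is a minimal illustration and not the authors' framework: the example pool layout (dicts with "table" and "entities" keys), the row-wise table serialization, and the use of a generic sentence-transformers encoder for similarity scoring are all assumptions made for this sketch.

```python
import random
from typing import Dict, List

import numpy as np
from sentence_transformers import SentenceTransformer  # assumption: any sentence encoder could be used


def serialize_table(table: List[List[str]]) -> str:
    """Flatten a table row-wise so it can be embedded or placed in a prompt (illustrative format)."""
    return "\n".join(" | ".join(row) for row in table)


def select_random(pool: List[Dict], k: int, seed: int = 0) -> List[Dict]:
    """Random selection: sample k annotated example tables from the pool."""
    rng = random.Random(seed)
    return rng.sample(pool, k)


def select_similar(pool: List[Dict], query_table: List[List[str]], k: int,
                   encoder: SentenceTransformer) -> List[Dict]:
    """Similarity-based selection: rank pool tables by cosine similarity to the query table."""
    query_vec = encoder.encode(serialize_table(query_table), normalize_embeddings=True)
    pool_vecs = encoder.encode([serialize_table(ex["table"]) for ex in pool],
                               normalize_embeddings=True)
    scores = pool_vecs @ query_vec  # cosine similarity, since embeddings are L2-normalized
    return [pool[i] for i in np.argsort(-scores)[:k]]


def build_prompt(examples: List[Dict], query_table: List[List[str]]) -> str:
    """Assemble a few-shot prompt: each shot shows a table and its per-cell entity annotations."""
    parts = ["Label every named entity inside each table cell with its span and type.\n"]
    for ex in examples:
        parts.append("Table:\n" + serialize_table(ex["table"]))
        parts.append("Entities:\n" + str(ex["entities"]) + "\n")
    parts.append("Table:\n" + serialize_table(query_table))
    parts.append("Entities:")
    return "\n".join(parts)


# Hypothetical usage:
# encoder = SentenceTransformer("all-MiniLM-L6-v2")
# shots = select_similar(annotated_pool, new_table, k=3, encoder=encoder)
# prompt = build_prompt(shots, new_table)  # send to an LLM of choice
```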

Authors (5)
  1. Aneta Koleva (6 papers)
  2. Martin Ringsquandl (14 papers)
  3. Ahmed Hatem (3 papers)
  4. Thomas Runkler (34 papers)
  5. Volker Tresp (158 papers)
Citations (1)