Papers

Topics

Authors

Recent

View all

Detailed Answer

Quick Answer

Concise responses based on abstracts only

Detailed Answer

Well-researched responses based on abstracts and relevant paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses

Gemini 2.5 Flash

Gemini 2.5 Flash 75 tok/s

Gemini 2.5 Pro 42 tok/s Pro

GPT-5 Medium 31 tok/s Pro

GPT-5 High 24 tok/s Pro

GPT-4o 98 tok/s Pro

Kimi K2 226 tok/s Pro

GPT OSS 120B 447 tok/s Pro

Claude Sonnet 4 38 tok/s Pro

2000 character limit reached

AutoDDG: Automated Dataset Description Generation using Large Language Models (2502.01050v2)

Published 3 Feb 2025 in cs.DB

Abstract: The proliferation of datasets across open data portals and enterprise data lakes presents an opportunity for deriving data-driven insights. However, widely-used dataset search systems rely on keyword searches over dataset metadata, including descriptions, to facilitate discovery. When these descriptions are incomplete, missing, or inconsistent with dataset contents, findability is severely hindered. In this paper, we address the problem of automatic dataset description generation: how to generate informative descriptions that enhance dataset discovery and support relevance assessment. We introduce AutoDDG, a framework for automated dataset description generation tailored for tabular data. To derive descriptions that are comprehensive, accurate, readable and concise, AutoDDG adopts a data-driven approach to summarize the contents of a dataset, and leverages LLMs to both enrich the summaries with semantic information and to derive human-readable descriptions. An important challenge for this problem is how to evaluate the effectiveness of methods for data description generation and the quality of the descriptions. We propose a multi-pronged evaluation strategy that: (1) measures the improvement in dataset retrieval within a dataset search engine, (2) compares generated descriptions to existing ones (when available), and (3) evaluates intrinsic quality metrics such as readability, faithfulness to the data, and conciseness. Additionally, we introduce two new benchmarks to support this evaluation. Our experimental results, using these benchmarks, demonstrate that AutoDDG generates high-quality, accurate descriptions and significantly improves dataset retrieval performance across diverse use cases.

Collections

Summary

We haven't generated a summary for this paper yet.

Summarize Now

Paper Prompts

Explore 10 Community Prompts

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Generate Now

AutoDDG: Automated Dataset Description Generation using Large Language Models (2502.01050v2)

Collections

Summary

Paper Prompts

Follow-up Questions

Related Papers

Authors (6)