
LLM-Extracted Metadata in Enterprise Catalogs

Updated 13 December 2025
  • LLM-extracted metadata is the process of using large language models to automatically generate descriptive and searchable metadata for enterprise data catalogs.
  • The approach integrates preprocessing, semantic retrieval, and generative few-shot prompting to achieve high accuracy, with over 80% ROUGE-1 F1 and ~90% acceptance by stewards.
  • This methodology enhances catalog usability by enriching business-facing descriptions, reducing manual effort, and improving overall data ecosystem discoverability.

LLM-extracted metadata in enterprise data catalogs refers to the process of automatically generating descriptive and searchable textual metadata for tables, columns, and other assets using LLMs. The paradigm facilitates scalable enrichment of business-facing descriptions, leveraging retrieval-augmented few-shot prompting and fine-tuned generative models. Rigorous evaluation demonstrates that such approaches achieve high accuracy, factual grounding, and very low toxicity, with over 80% ROUGE-1 F1 and nearly 90% acceptance by data stewards. This methodology addresses the chronic sparsity in enterprise catalogs, improving searchability, usability, and overall data ecosystem discoverability (Singh et al., 12 Mar 2025).

1. Architectural Pipeline for Metadata Generation

The end-to-end pipeline comprises three tightly integrated stages:

  • Preprocessing and Expansion: Inputs for columns include column_name, table_name, schema_name, database_name, and column_comment. For tables, additional context such as column_descriptions and business context is gathered. Abbreviations are expanded using a curated dictionary, with disambiguation to prevent erroneous mappings in ambiguous schemas.
  • Semantic Retrieval for Few-Shot Examples: All catalog columns (>100,000) are embedded with BAAI/bge-large-en-v1.5 and indexed in FAISS for approximate nearest-neighbor retrieval. For each target column, the top 100 neighbors by cosine similarity are retrieved and re-ranked with Longest Common Subsequence (LCS) scoring so that the selected exemplars cover the target's tokens, with preference given to same-table or same-schema matches (see the retrieval sketch after this list).
  • Generative LLM for Description Generation: Both pretrained (e.g., Llama2-13B-chat, GPT-3.5-turbo, GPT-4 for tables) and fine-tuned (Llama2-7B-chat with QLoRA on 35,000+ pairs) models are supported. Prompt enrichment involves instructing the LLM as a "data steward" to output a JSON description, incorporating cleaned comments, glossary matches, and retrieved few-shot exemplars. For tables, relevant columns are appended, and property-specific sub-questions are posed, with responses concatenated. Post-processing ensures schema validity and is followed by human steward review.
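
A minimal sketch of the retrieval stage follows, assuming the `sentence-transformers` and `faiss` packages. The BAAI/bge-large-en-v1.5 encoder, top-100 cosine neighborhood, LCS scoring, and same-table preference come from the description above; the record layout, helper names, and the exact scoring rule are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the semantic-retrieval stage: embed catalog columns, search FAISS
# for the top-100 cosine neighbors, then re-rank by token-level LCS coverage
# with a preference for same-table matches. Field names and the scoring
# helper are illustrative assumptions.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")

def column_text(col: dict) -> str:
    """Flatten catalog metadata into a single string for embedding."""
    return " | ".join([col["column_name"], col["table_name"], col["schema_name"],
                       col["database_name"], col.get("column_comment", "")])

def build_index(columns: list[dict]) -> faiss.Index:
    """Embed every catalog column and index the unit vectors (inner product == cosine)."""
    vecs = encoder.encode([column_text(c) for c in columns], normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def lcs_len(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming longest-common-subsequence length over tokens."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def retrieve_examples(target: dict, columns: list[dict], index: faiss.Index,
                      k: int = 100, n_shots: int = 5) -> list[dict]:
    """Return the n_shots best exemplars among the top-k cosine neighbors."""
    q = encoder.encode([column_text(target)], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    target_tokens = target["column_name"].lower().split("_")

    def score(cand: dict) -> float:
        coverage = lcs_len(cand["column_name"].lower().split("_"), target_tokens)
        same_table = 1.0 if cand["table_name"] == target["table_name"] else 0.0
        return coverage + same_table  # crude proxy for the paper's selection rule

    return sorted((columns[i] for i in ids[0]), key=score, reverse=True)[:n_shots]
```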

2. Few-Shot Prompt Engineering and Control

High-fidelity generation is strongly reliant on prompt structure:

  • Example Injection: Few-shot exemplars are provided as distinct user–assistant pairs instead of concatenated blocks, enabling the model to learn both structure and style.
  • Template Structure:
    • System: "You are a data steward…generate a JSON with key ‘Generated Description’."
    • User: Presents few-shot exemplars (column_name, table_name, schema, database, description).
    • User: Supplies the target column data with cleaned metadata and requests the description.

Strict control over prompt format, example injection, and inclusion of expanded or glossary terms reduces hallucination and directs the model toward paraphrasing over verbatim copying.
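
As a concrete illustration of this layout, the sketch below assembles OpenAI-style chat messages with each few-shot exemplar as a distinct user–assistant pair. The system wording and the "Generated Description" key follow the template quoted above; the field formatting and the builder function are assumptions.

```python
import json

def build_prompt(target: dict, exemplars: list[dict]) -> list[dict]:
    """Assemble chat messages: system role, then each few-shot exemplar as a
    distinct user/assistant pair, then the target column as the final user turn."""
    messages = [{
        "role": "system",
        "content": ("You are a data steward. Given a column's metadata, "
                    "generate a JSON with key 'Generated Description'."),
    }]
    for ex in exemplars:
        messages.append({
            "role": "user",
            "content": (f"column_name: {ex['column_name']}\n"
                        f"table_name: {ex['table_name']}\n"
                        f"schema: {ex['schema_name']}\n"
                        f"database: {ex['database_name']}"),
        })
        messages.append({
            "role": "assistant",
            "content": json.dumps({"Generated Description": ex["description"]}),
        })
    messages.append({
        "role": "user",
        "content": (f"column_name: {target['column_name']}\n"
                    f"table_name: {target['table_name']}\n"
                    f"schema: {target['schema_name']}\n"
                    f"database: {target['database_name']}\n"
                    f"comment: {target.get('cleaned_comment', '')}\n"
                    "Generate the JSON description."),
    })
    return messages
```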

3. Quantitative Evaluation and Metric Definition

Performance is measured through multiple metrics:

  • ROUGE-1 F1 (unigram overlap, computed in the sketch following this list):

$$P_{\mathrm{R1}} = \frac{|\mathrm{Overlap\_Unigrams}|}{|\mathrm{Unigrams}_{\mathrm{gen}}|}, \qquad R_{\mathrm{R1}} = \frac{|\mathrm{Overlap\_Unigrams}|}{|\mathrm{Unigrams}_{\mathrm{ref}}|}$$

$$F_1^{\mathrm{ROUGE1}} = \frac{2 \cdot P_{\mathrm{R1}} \cdot R_{\mathrm{R1}}}{P_{\mathrm{R1}} + R_{\mathrm{R1}}}$$

  • BERTScore F1: Measures contextual overlap via embedding similarity (mean precision, recall, F1 over aligned tokens).
  • Acceptance Accuracy (by Stewards): Proportion of generated descriptions that stewards labeled as requiring "no or minor edits", out of all reviewed instances.
  • Factual Grounding (AlignScore): Quantifies entailment/alignment between input metadata and generated description.
  • Toxicity: Detoxify scores on generated text, all averaging below 0.001 for this application.
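
For concreteness, the sketch below computes ROUGE-1 F1 exactly as defined above, using simple whitespace tokenization; production evaluation would typically use a library such as `rouge-score`, which adds stemming and smarter tokenization.

```python
from collections import Counter

def rouge1_f1(generated: str, reference: str) -> float:
    """ROUGE-1 F1 per the formulas above: precision over generated unigrams,
    recall over reference unigrams, harmonic mean of the two."""
    gen, ref = Counter(generated.lower().split()), Counter(reference.lower().split())
    overlap = sum((gen & ref).values())      # clipped unigram overlap
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy example: a generated column description vs. a steward-written reference.
print(round(rouge1_f1("unique identifier of the customer account",
                      "unique identifier for a customer account"), 2))
```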

Results:

  • Column description BERTScore F1: fine-tuned Llama2-7B: 0.74, GPT-3.5-turbo: 0.67.
  • ROUGE-1 F1 for columns: Llama2-7B and GPT-3.5-turbo: 0.81.
  • Acceptance: 88% of columns and 87% of tables required no or minor edits.
  • Table BERTScore F1: GPT-4: 0.50.
  • AlignScore for exact examples: Both GPT-3.5 and fine-tuned Llama2-7B > 0.65.
  • Zero-shot baselines underperform retrieval-augmented setups by 10–20 ROUGE-1 points.

4. Experimental Data Sources and Stewardship

  • Enterprise Context: Evaluation performed on proprietary catalogs with ~500,000 columns and custom glossaries.
  • Fine-Tuning Regime: Llama2-7B-chat fine-tuned with QLoRA on 35,000 metadata–description pairs, 90:10 train/validation split, single epoch (a configuration sketch follows this list).
  • Retrieval Indexing: 100% coverage, top-100 nearest neighbors re-ranked per column.
  • Human Review: 892 columns and 31 tables with labeled model outputs; high acceptance rates.
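
A hedged configuration sketch of this fine-tuning regime, using transformers, peft, and bitsandbytes: the base model, 4-bit quantization (QLoRA), single epoch, and 90:10 split come from the text, whereas the LoRA rank, alpha, dropout, and target modules are illustrative assumptions.

```python
# Sketch of a QLoRA setup for Llama2-7B-chat. Hyperparameters marked below
# are assumptions, not values reported in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(base)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb_config,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,   # illustrative hyperparameters
    target_modules=["q_proj", "v_proj"],      # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training itself (e.g. a supervised fine-tuning loop over the ~35k
# metadata/description pairs, 90:10 split, one epoch) is omitted here.
```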

A key finding is that retrieval-augmented few-shot prompting on GPT-3.5 nearly matches or exceeds fine-tuned models in factual grounding, with effective paraphrasing and low copying rates.

5. Challenges, Limitations, and Enterprise Recommendations

  • Hallucination Risk: Absence of observed values or over-generalization can result in plausible but misleading descriptions. Factual grounding and post-generation review remain essential.
  • Abbreviation Ambiguity: Recurrent schema abbreviations require continuous dictionary curation.
  • Context Window Constraints: Large tables push token limits; selective column inclusion or newer LLMs with extended context windows help.
  • Human-in-the-Loop Workflow: Automated descriptions function as initial drafts ("jumpstarts") requiring steward curation before acceptance.
  • Future Enhancements: Domain-specific fine-tuning, advanced RAG architectures, ingestion of metadata and sampled values, and entity-relationship diagram integration are recommended for further reducing hallucination.

6. Impact on Catalog Usability and Scalability

LLM-derived descriptions enable:

  • Immediate enrichment of catalog metadata at enterprise scale, drastically reducing manual effort.
  • Increased discoverability and improved search, as users can query on high-quality, business-facing text.
  • Foundation for self-documenting data ecosystems supporting BI, analytics, and governance.
  • Empirically demonstrated performance: ROUGE-1 F1 above 0.80, nearly 90% steward acceptance, minimal toxicity, and robust factual alignment.

The ability to scale descriptive metadata generation—using advanced retrieval-augmented prompting and steward validation—establishes a practical blueprint for comprehensive and reliable enterprise catalog enrichment (Singh et al., 12 Mar 2025).
