Goby Benchmark for Enterprise LLM Integration
- Goby Benchmark is a dataset and evaluation framework designed to measure LLM performance on enterprise data integration tasks with realistic, production-grade data.
- It incorporates over 4 million rows from 1,200 source tables, capturing heterogeneous schemas and proprietary label spaces typical of business operations.
- The framework employs hierarchical annotation and runtime class-learning to improve semantic column type annotation and bridge performance gaps in LLM applications.
The Goby Benchmark is a dataset and evaluation framework designed to measure the performance of LLMs in enterprise data integration tasks, specifically assessing their ability to handle real-world, proprietary business data as opposed to public datasets. It addresses the critical discrepancy that arises when LLMs, predominantly trained on publicly available resources, are deployed in private organizational contexts with "dark" data, leading to substantial drops in performance. The Goby Benchmark comprises over four million rows across approximately 1,200 source tables from authentic enterprise workloads in event promotion and marketing, offering a realistic testbed for semantic column type annotation and related data integration problems.
1. Construction and Properties of the Goby Benchmark
The Goby Benchmark is assembled from production-grade enterprise data, using professionally developed wrappers to convert diverse web pages and APIs into structured relational tables (a minimal wrapper sketch appears at the end of this section). Whereas tables in public benchmarks such as VizNet and Wikipedia average a few hundred rows, Goby tables average 3,400 rows and 20 columns and carry complex, semantically rich labels common in business settings (e.g., event location, organizer details). The resulting dataset captures essential enterprise characteristics: large scale, heterogeneous schemas, a proprietary label space, and realistic distributions, ensuring high fidelity to real-world integration scenarios.
| Benchmark | Source | Avg. Rows/Table | Domain Coverage |
|---|---|---|---|
| Goby | Enterprise | 3,400 | Event, Marketing |
| VizNet | Public | 400 | Web, Wikipedia |
Such benchmark realism is fundamental for evaluating LLMs in contexts where public data distributions are not representative of private enterprise datasets.
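To make the wrapper-based construction concrete, the following is a minimal, hypothetical sketch of how a nested event-promotion API payload might be flattened into relational rows. All field and column names here are illustrative assumptions, not the benchmark's actual schema.

```python
# Minimal sketch of a Goby-style wrapper: flatten a (hypothetical)
# event-promotion API payload into relational rows. The schema and
# field names below are illustrative only.
from typing import Any

# Hypothetical target schema for an "events" table.
COLUMNS = ["event_name", "venue_city", "organizer", "start_date"]

def wrap_api_payload(payload: dict[str, Any]) -> list[dict[str, str]]:
    """Convert a nested API response into flat rows matching COLUMNS."""
    rows = []
    for event in payload.get("events", []):
        rows.append({
            "event_name": event.get("title", ""),
            "venue_city": event.get("venue", {}).get("city", ""),
            "organizer": event.get("organizer", {}).get("name", ""),
            "start_date": event.get("start", ""),
        })
    return rows

# Example usage with a toy payload.
payload = {"events": [{"title": "Jazz Night",
                       "venue": {"city": "Seattle"},
                       "organizer": {"name": "Bluebird Promotions"},
                       "start": "2024-06-01"}]}
print(wrap_api_payload(payload))
```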
2. Challenges in Enterprise Data Integration Using LLMs
Current LLMs perform strongly on publicly available benchmarks but degrade markedly when applied to enterprise data. This gap is attributed to several factors:
- Distribution Shift: Enterprise data distributions differ significantly from the LLM pretraining corpus, both in schema structure and semantic label sets.
- Proprietary Labeling: Unique labels and column types in enterprise contexts are not present in public resources, leading to misclassification and semantic drift.
- Operational Semantics: Business operations impose different requirements and semantic interpretations, requiring models to adapt to non-standard practices.
Empirically, on semantic column type annotation tasks, LLMs show an average F1 score reduction of 0.18 when moving from public datasets (F1 = 0.89) to Goby (F1 = 0.71), with precision dropping by 14.1 percentage points and recall by 18.8 points. This reveals an inherent limitation of public-data-centric model development (Kayali et al., 29 Dec 2024).
3. Hierarchical Annotation and Ontology Synthesis
To address enterprise-specific annotation challenges, the benchmark introduces hierarchical annotation and ontology synthesis:
- Hierarchical Annotation: Column type labels are organized into a semantic hierarchy (ontology). Construction proceeds by aggregating enterprise table label dictionaries, then lifting these into parent-child structures that resolve granularity mismatches, grouping related labels under common superclasses.
- Ontology Synthesis: The expanded data dictionary is mapped onto a universal ontology encompassing all ground-truth classes with 100% coverage, enabling LLMs to access label context during inference.
Hierarchical supervision provides two advantages: it reduces ambiguity in label assignment, and it allows global table semantics to be encoded (e.g., via tree serialization, which feeds the entire label hierarchy to the model as context).
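As a concrete illustration of tree serialization, the sketch below renders a small, hypothetical label ontology as an indented tree and prepends it to the annotation prompt; Goby's actual ontology is far larger and its labels differ.

```python
# Minimal sketch of tree serialization: a column-type ontology is
# rendered as an indented tree and placed in the prompt so the model
# sees the full label space. The hierarchy below is hypothetical.
ONTOLOGY = {
    "event": {
        "event_name": {},
        "event_date": {"start_date": {}, "end_date": {}},
    },
    "organizer": {"organizer_name": {}, "organizer_email": {}},
    "location": {"venue_city": {}, "venue_address": {}},
}

def serialize_tree(node: dict, depth: int = 0) -> str:
    """Depth-first, indentation-based serialization of the label hierarchy."""
    lines = []
    for label, children in node.items():
        lines.append("  " * depth + "- " + label)
        if children:
            lines.append(serialize_tree(children, depth + 1))
    return "\n".join(lines)

prompt_context = (
    "Valid column types (parent/child hierarchy):\n"
    + serialize_tree(ONTOLOGY)
    + "\nAssign the most specific matching type to each column."
)
print(prompt_context)
```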
4. Runtime Class-Learning and Dynamic Adaptation
Beyond static annotation, the Goby framework implements runtime class-learning, enabling dynamic expansion of the label set during model deployment:
- Sampling: For columns whose values fall outside the known label space, the framework samples data, detects out-of-vocabulary classes, and iteratively augments the label dictionary.
- Quiescence: Label expansion continues until a full pass detects no further novel classes (see the sketch below).
This process ensures models adapt to evolving enterprise data, a requirement in environments where schemas and business practices change regularly.
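A minimal sketch of this sample-detect-augment loop follows. The LLM call is stubbed out with a toy `propose_label` function, and all labels are hypothetical; the point is the loop structure and its quiescence condition.

```python
# Minimal sketch of runtime class-learning with a stubbed classifier.
import random

# Known column-type labels (hypothetical starting dictionary).
label_dictionary = {"event_name", "venue_city", "start_date"}

def propose_label(values: list[str]) -> str:
    """Stub standing in for an LLM that names the semantic type of sampled values."""
    return "ticket_tier" if any(v.startswith("VIP") for v in values) else "event_name"

def learn_classes(columns: dict[str, list[str]], sample_size: int = 5) -> set[str]:
    """Expand label_dictionary until a full pass detects no novel classes."""
    while True:
        new_classes = set()
        for values in columns.values():
            sample = random.sample(values, min(sample_size, len(values)))
            label = propose_label(sample)
            if label not in label_dictionary:
                new_classes.add(label)
        if not new_classes:  # quiescence: no out-of-vocabulary classes found
            return label_dictionary
        label_dictionary.update(new_classes)

columns = {"col_7": ["VIP Front Row", "VIP Balcony", "General Admission"]}
print(learn_classes(columns))  # adds 'ticket_tier' to the dictionary
```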
5. Evaluation Metrics and Performance Gains
Standard classification metrics are used for assessment: precision, recall, and the F1-score, defined as

$$F_1 = \frac{2PR}{P + R},$$

where $P$ is precision and $R$ is recall.
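A minimal sketch of these metrics applied to per-column type predictions is shown below; the labels and the convention of treating "unknown" as abstention are illustrative assumptions.

```python
# Gold and predicted column types for a toy table (hypothetical labels).
gold = {"c1": "event_name", "c2": "venue_city", "c3": "start_date"}
pred = {"c1": "event_name", "c2": "organizer", "c3": "unknown"}

# Assumed convention: "unknown" counts as abstaining from a prediction.
answered = [c for c in gold if pred.get(c, "unknown") != "unknown"]
tp = sum(pred[c] == gold[c] for c in answered)

precision = tp / len(answered) if answered else 0.0  # correct / predictions made
recall = tp / len(gold) if gold else 0.0             # correct / all gold columns
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")  # P=0.50  R=0.33  F1=0.40
```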
Key findings after employing hierarchical annotation, runtime class-learning, and ontology synthesis include:
- Out-of-the-box LLM F1-score on Goby: 0.71
- Post-intervention F1-score: 0.85
- Precision and recall losses due to enterprise shift: 14.1 and 18.8 percentage points, respectively, substantially mitigated by the proposed techniques
Serializing the ontology as a tree in the LLM's context significantly narrows the public/enterprise performance gap; performance on Goby becomes comparable to that on VizNet, indicating the efficacy of the method (Kayali et al., 29 Dec 2024).
6. Applications and Broader Significance
The Goby Benchmark enables rigorous evaluation and development of data integration technologies suitable for enterprise deployment. Its scope includes:
- Semantic column type annotation
- Entity matching
- Candidate key identification
- Join-column prediction
By providing a challenging, diverse dataset, Goby acts as a catalyst for advancing LLM robustness to data distribution shifts, ultimately facilitating LLM adoption in production data systems.
The benchmark is publicly available at https://goby-benchmark.github.io/.
7. Implications for Research and Industry Practice
The Goby results highlight the need to revise benchmarking strategies for LLM-based integration tools: public benchmarks overestimate real-world effectiveness. The demonstrated avenues—hierarchical annotation, adaptive class learning, and ontology synthesis—offer actionable interventions for bridging the data gap. A plausible implication is that future enterprise systems incorporating these techniques will substantially improve data integration reliability, scalability, and semantic accuracy.
In summary, the Goby Benchmark establishes a new standard for enterprise data integration evaluation, revealing systematic weaknesses in current LLM approaches and proposing principled methodologies for remediation. It is expected to guide further research into adaptive, ontology-driven enterprise data systems and inform academic and industrial best practices for LLM deployment in proprietary data environments.