LDT-Bench: Benchmarking Data Lakes & Generative AI

Updated 18 October 2025
  • LDT-Bench is a benchmarking framework that evaluates data lakes with mixed structured and unstructured data using dynamic scale factors and detailed metadata integration.
  • It also assesses generative models for code synthesis and imaginative video tasks, focusing on long-distance semantic dependencies and cross-language consistency.
  • The framework employs reproducible metrics for query execution, metadata generation, and generated output quality, providing actionable insights for system optimization.

LDT-Bench refers to a family of benchmarking methodologies and datasets developed for advanced evaluation in data management and generation systems, spanning two primary domains: (1) benchmarking data lakes that hold mixed structured and unstructured data and (2) benchmarking generative models for code and imaginative video tasks involving long-distance semantic dependencies. LDT-Bench (a name sometimes also used in reference to “DLBench”) is associated with rigorous, large-scale, and reproducible evaluation protocols that address critical gaps in benchmarking for systems requiring integration, creative alignment, or abstract semantic reasoning.

1. Data Models and Benchmark Construction

Data Lake Benchmarking (DLBench)

DLBench’s data model is expressly constructed to reflect the heterogeneity of real-world data lakes (Sawadogo et al., 2021). It includes:

  • Textual Documents: Long scientific articles in both English and French, with document lengths ranging from a few pages to tens of pages. Configurations scale up to 50,000 documents and approximately 62 GB of raw data. Associated metadata fields such as publication year, language, and domain are generated for each item.
  • Tabular Data: Raw CSV files extracted from open Canadian government datasets, comprising up to 5,000 files or about 1.4 GB in the largest configuration. Metadata for tables (e.g., year) supports integration with textual document metadata.
  • Scale Factor (SF): A parameterized scaling mechanism governs data set size, allowing systematic variation from moderate (10,000 texts, 1,000 tables at SF = 1) to large-scale benchmarks (50,000 texts, 5,000 tables at SF = 5).

This dual-format, metadata-driven approach is intended to capture integration, retrieval, and analytic patterns characteristic of operational data lakes.
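
As a concrete illustration of the scale-factor mechanism, the following minimal sketch maps SF to the corpus sizes quoted above. The function name and the assumption of strictly linear scaling between the documented endpoints are ours, not part of the benchmark's released tooling.

```python
def dlbench_config(sf: int) -> dict:
    """Return the corpus sizes implied by scale factor SF (documented range 1-5)."""
    if not 1 <= sf <= 5:
        raise ValueError("DLBench configurations are reported for SF = 1..5")
    return {
        "num_documents": 10_000 * sf,  # English/French scientific articles
        "num_tables": 1_000 * sf,      # raw CSV files from open government data
        "metadata_fields": ["year", "language", "domain"],  # generated per item
    }

print(dlbench_config(1))  # 10,000 documents, 1,000 tables
print(dlbench_config(5))  # 50,000 documents, 5,000 tables
```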

Long-Distance Semantic Benchmarks for Generation Tasks

In the generative AI domain, LDT-Bench is designed for evaluating model capabilities under semantically challenging scenarios (Farchi et al., 28 Oct 2024, Wu et al., 16 Oct 2025). Key structural components include:

  • Code Tasks: Benchmarks are automatically generated using a graph-based engine, with seeds elaborated into language-independent descriptions and then into multiple target language implementations (COBOL, Java, Python, C++). Each node represents an artifact type (description, code, natural language summary), with directed, labeled edges representing generation transitions (strong/trusted vs. tested/targeted LLM paths).
  • Imaginative Video Tasks: LDT-Bench for video generation comprises 2,839 concept pairs engineered to maximize semantic distance (object–action, action–action). Object sets are curated from ImageNet-1K and COCO (1,938 objects), while action sets are selected from ActivityNet, UCF101, and Kinetics-600 (901 actions). Pairs are algorithmically chosen for maximal embedding-space distance (T5/CLIP), producing prompts such as “a panda piloting a helicopter” that stress-test generalization.
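
The pair-construction step for the video benchmark can be summarized with a short sketch. This is not the released curation code: the paper specifies a T5/CLIP text encoder and embedding-space distance, but the snippet below uses random stand-in vectors purely to show the selection logic, and the prompt template is illustrative.

```python
# Sketch: rank object-action pairs by distance in text-embedding space and
# keep the most distant (most "imaginative") combinations.
import itertools
import numpy as np

def select_long_distance_pairs(object_embs, action_embs, top_k=3):
    """object_embs/action_embs map concept name -> embedding vector
    (in LDT-Bench these would come from a T5/CLIP text encoder)."""
    scored = []
    for (obj, eo), (act, ea) in itertools.product(object_embs.items(), action_embs.items()):
        d = float(np.linalg.norm(np.asarray(eo) - np.asarray(ea)))
        scored.append((d, obj, act))
    scored.sort(reverse=True)  # largest embedding distance first
    return scored[:top_k]

# Toy demo with random stand-in embeddings; the real pipeline encodes
# ImageNet/COCO object names and ActivityNet/UCF101/Kinetics action labels.
rng = np.random.default_rng(0)
objects = {name: rng.normal(size=16) for name in ["panda", "teapot", "violin"]}
actions = {name: rng.normal(size=16) for name in ["piloting a helicopter", "melting", "sprinting"]}
for dist, obj, act in select_long_distance_pairs(objects, actions):
    print(f"{dist:.2f}  a {obj} {act}")
```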

2. Workload and Task Design

The DLBench workload is partitioned into retrieval and analytic tasks (Sawadogo et al., 2021):

  • Retrieval: Category filtering, term-based search (by metadata and content), identification of similar or joinable data instances.
  • Textual Analysis: Document scoring (e.g., ElasticSearch ranking), snippet extraction, keyword aggregation (stopword removal), and unsupervised mining (PCA/KMeans clustering) grouped by metadata.
  • Tabular Analytics: SQL-like simple and complex (join/aggregation) queries; tuple mining over numeric or categorical content.

Individual queries (Q1a–Q10b) are instantiated to ensure complete coverage and to stress the system under test systematically.
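
To make the analytic side of the workload concrete, here is a hedged sketch of one unsupervised text-mining task of the kind listed above (stopword removal, PCA, KMeans); the exact definitions of Q1a–Q10b and the preprocessing choices in DLBench may differ.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_documents(docs, n_clusters=2, n_components=2):
    """TF-IDF (with stopword removal) -> PCA -> KMeans; returns a label per document."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
    reduced = PCA(n_components=n_components).fit_transform(tfidf.toarray())
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reduced)

docs = [
    "metadata integration and retrieval in data lakes",
    "clustering scientific articles by topic",
    "tabular analytics with SQL joins and aggregations",
    "keyword extraction from long documents",
]
print(cluster_documents(docs))  # e.g. [0 1 0 1]; cluster labels are arbitrary
```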

For code and generative tasks (Farchi et al., 28 Oct 2024), the benchmark workload is dynamically grown using seed topics, LLM chains, and graph traversal, producing cross-language variants and mutual conversion paths for self-consistency validation.
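
The growth procedure can be pictured with the sketch below. The class and function names are illustrative stand-ins rather than the engine described in Farchi et al. (28 Oct 2024), and `fake_generate` replaces the actual LLM calls that would translate one artifact type into another.

```python
from dataclasses import dataclass, field

@dataclass
class ArtifactGraph:
    nodes: dict = field(default_factory=dict)   # artifact name -> artifact text
    edges: list = field(default_factory=list)   # (src, dst, path label)

    def derive(self, src: str, dst: str, label: str, generate):
        """Create artifact `dst` from `src` via an LLM path labeled trusted or tested."""
        self.nodes[dst] = generate(self.nodes[src], target=dst)
        self.edges.append((src, dst, label))

def fake_generate(source_text: str, target: str) -> str:
    """Stand-in for an LLM call."""
    return f"<{target} derived from: {source_text[:40]}...>"

g = ArtifactGraph(nodes={"description": "Sort a list of records by a key field."})
for lang in ["COBOL", "Java", "Python", "C++"]:
    g.derive("description", f"{lang}_code", label="tested", generate=fake_generate)
g.derive("Python_code", "nl_summary", label="trusted", generate=fake_generate)
```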

For video, the prompt set is explicitly crafted for long-distance semantic relationships; automatic QA protocols (ImageryQA) systematically probe for object/action presence, visual alignment, and anomaly detection (Wu et al., 16 Oct 2025).
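
A minimal sketch of such structured probes is shown below. The question wording is hypothetical (the benchmark's actual templates may differ), and `ask_vlm` in the usage comment is a placeholder for the multimodal-LLM call over the generated video.

```python
def imagery_qa_questions(obj: str, action: str) -> dict:
    """Build ImageryQA-style probes for a generated video of an object-action pair."""
    return {
        "ElementQA": [f"Is a {obj} visible in the video?",
                      f"Is the {obj} performing the action '{action}'?"],
        "AlignQA": ["Rate the visual quality and aesthetics of the video from 1 to 5."],
        "AnomalyQA": ["Are there spatial or temporal artifacts such as flicker, "
                      "warping, or duplicated objects?"],
    }

questions = imagery_qa_questions("panda", "piloting a helicopter")
# answers = {k: [ask_vlm(video, q) for q in qs] for k, qs in questions.items()}
```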

3. Performance Metrics and Evaluation Protocols

Data Lake Metrics

Three central metrics guide DLBench assessment:

  • Query Execution Time: Measures response time for each of 20 query instances, critically evaluating storage, retrieval, and metadata handling efficiency.
  • Metadata Generation Time: Quantifies the time needed to generate the complete metadata catalog, targeting integration overhead.
  • Metadata Size: Assesses the storage cost; in reference configurations, metadata can reach half the size of raw data, illuminating data-to-metadata scalability trade-offs.

Evaluation is systematic and multi-scale: for each scale factor, data generation, metadata integration, metric measurement, and query execution are performed, with each timing repeated 10 times and averaged to ensure statistical rigor.
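
A timing harness along these lines might look as follows. This is our sketch of the 10-repetition protocol, not DLBench's released driver, and `workload` in the usage comment is a hypothetical mapping from query IDs to callables.

```python
import statistics
import time

def time_query(run_query, repetitions: int = 10):
    """Run a query callable `repetitions` times; return (mean, stdev) in seconds."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

# Usage sketch: results = {qid: time_query(fn) for qid, fn in workload.items()}  # Q1a..Q10b
```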

Generation Task Metrics

  • Code Benchmarking: Uses correctness/similarity measures based on graph cycles (equality, symmetry, transitivity) between pairs or sets of generated artifacts. “LLM as a Judge” (LaaJ) supplies automated, scale-based scores (e.g., a 1–7 usefulness rating; $S(\mathrm{LaaJ}_l) = \sum I(\mathrm{LaaJ}_l, S^i_{Pr1}, S^j_{Pr2})$ over all sample pairs and languages).
  • Video Generation: ImageryQA is the evaluation protocol, comprising:
    • ElementQA: Object/action detection in video using structured prompts for multimodal LLMs
    • AlignQA: Aesthetics/image quality assessment
    • AnomalyQA: Detection of spatial or temporal artifacts
  • Semantic Distance Measurement: The key input to adaptive search/reward in generative video is:

$$\bar{\mathcal{D}}_{\mathrm{sem}}(\mathbf{p}) = \frac{1}{|E|} \sum_{(i,j) \in E} \bigl\|\phi(p_i) - \phi(p_j)\bigr\|_2$$

where the embeddings $\phi(p)$ are taken from a text encoder.

  • Guided Search Parameterization: The number of candidates generated at each step in video synthesis and reward scaling is adaptively set by:

$$N_t = N_\text{base} \cdot \bigl(1 + \lambda \cdot \bar{\mathcal{D}}_{\mathrm{sem}}(\mathbf{p})\bigr)$$

and

$$R_\text{AIR}(\hat{\mathbf{x}}_0) = \bigl(\alpha\,\mathrm{MQ} + \beta\,\mathrm{TA} + \gamma\,\mathrm{VQ} + \omega\,R_\text{any}\bigr) \cdot \bar{\mathcal{D}}_{\mathrm{sem}}(\hat{\mathbf{x}}_0)$$

where MQ, TA, VQ, and $R_\text{any}$ denote motion, temporal alignment, visual quality, and an extensible reward term, respectively (Wu et al., 16 Oct 2025).
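
The three formulas above translate directly into code. The sketch below assumes the embeddings are supplied as arrays from the text encoder and uses placeholder coefficients of 1.0; the coefficient values and rounding behavior used by ImagerySearch may differ (Wu et al., 16 Oct 2025).

```python
import numpy as np

def mean_semantic_distance(embeddings, edges):
    """D_sem: mean pairwise distance ||phi(p_i) - phi(p_j)|| over the edge set E."""
    phi = np.asarray(embeddings, dtype=float)
    return float(np.mean([np.linalg.norm(phi[i] - phi[j]) for i, j in edges]))

def candidates_per_step(n_base: int, lam: float, d_sem: float) -> int:
    """N_t = N_base * (1 + lambda * D_sem)."""
    return int(round(n_base * (1 + lam * d_sem)))

def adaptive_imagery_reward(mq, ta, vq, r_any, d_sem,
                            alpha=1.0, beta=1.0, gamma=1.0, omega=1.0):
    """R_AIR = (alpha*MQ + beta*TA + gamma*VQ + omega*R_any) * D_sem."""
    return (alpha * mq + beta * ta + gamma * vq + omega * r_any) * d_sem
```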

4. Comparative and Analytical Criteria

DLBench and LDT-Bench emphasize technology-agnostic, data-centric comparison strategies.

  • Technology Agnosticism: Designed without presupposing storage or system implementation details, enabling cross-architecture evaluations (e.g., cloud-native, on-premise, hybrid data lakes).
  • Multidimensional Integration: Simultaneous attention to latency, resource cost, and integration overhead enables differential diagnosis of performance bottlenecks, revealing trade-offs such as faster retrieval vs. metadata bloat.
  • Workload Breadth: The inclusion of both filtering and complex analytics (text mining, clustering, tuple mining) ensures broad operational coverage.
  • Real-world Relevance: Emphasis on tasks and query patterns common in operational and research data lakes supports the generality of the benchmarks.
  • Scale-Aware Analysis: Explicit variation of SF exposes non-linear scaling effects and supports reproducibility in both academic research and industry deployments.

In code and video benchmarks, self-consistency, symmetry, and discriminative capability of the LLM-based judgment are systematically quantified using graph-based claims and scoring.
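
The kind of claim being quantified can be sketched as follows; the threshold of 6 on the 1–7 usefulness scale and the tolerance are illustrative choices rather than the definitions used in the papers.

```python
def check_symmetry(score, a, b, tol: float = 1.0) -> bool:
    """A consistent judge should rate (a, b) and (b, a) similarly."""
    return abs(score(a, b) - score(b, a)) <= tol

def check_transitivity(score, a, b, c, tol: float = 1.0) -> bool:
    """If a ~ b and b ~ c both score highly, a ~ c should too."""
    if score(a, b) >= 6 and score(b, c) >= 6:
        return score(a, c) >= 6 - tol
    return True
```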

5. Experimental Findings and Impact

Data Lake Systems

In proof-of-concept studies (e.g., with the AUDAL system), DLBench’s metrics are instrumental in distinguishing metadata-heavy vs. query-heavy system designs; metadata size is often observed to be substantial, dictating storage provisioning and influencing query latency (Sawadogo et al., 2021). Aggregated query timings across scaled data volumes reveal integration bottlenecks and highlight potential optimization targets in metadata management.

Generative Task Benchmarks

For imaginative video generation, LDT-Bench experiments indicate baseline models (e.g., Wan2.1) attain ImageryQA scores in the 48–55% range on long-distance prompts, while the adaptive ImagerySearch method achieves 57.11%, outperforming static scaling approaches. On VBench (realistic video benchmarks), ImagerySearch reaches an average score of 83.48% with improvements in dynamic and subject consistency (Wu et al., 16 Oct 2025).

In code task benchmarks, self-consistency scores and transitivity/symmetry checks expose both strengths and weaknesses in LLM judgment fidelity. Modular architectural layers (prompt engineering, inference, postprocessing) in LaaJ contribute to robust scoring and regression testing; overfitting and proxy gaps necessitate episodic human-in-the-loop validation (Farchi et al., 28 Oct 2024).

6. Limitations, Challenges, and Future Directions

While these benchmarks address critical gaps, several limitations are noted:

  • Overfitting Risk: Static benchmarks risk gradual overfitting of LLMs or data lake optimizations. Mitigations include dynamic sample regeneration and access control over reference datasets.
  • Proxy Judgment Limitation: Automated judges (LaaJ, ImageryQA) are proxies and require regular human sampling to mitigate drift or semantic blind spots.
  • Scalability to Realistic Workloads: For code, the construction of benchmarks from small programs to sizable composites is non-trivial and requires adaptive claim/metric scaling. Video benchmarks require continued expansion to encode more diverse, real-world semantic relations.
  • Coverage: Ensuring that prompt pairs and task types are sufficiently representative of operational scenarios and emerging uses remains challenging.

Future work is expected to include deeper integration of structured data, refined semantic outlier detection in video, and continued advances in coverage and sample-regeneration techniques.

7. Overall Significance

LDT-Bench and related methodologies provide comprehensive, multidimensional, and reproducible frameworks for benchmarking data lakes with heterogeneous contents and for systematically assessing generative models under semantically demanding tasks. Their explicit focus on real-world scaling, metric rigor, and cross-technology applicability positions them as keystone protocols in both academic and industry research, informing system design, optimization, and comparative evaluation in data-centric and generative AI systems (Sawadogo et al., 2021, Farchi et al., 28 Oct 2024, Wu et al., 16 Oct 2025).
