DataForge: Interoperability & Synthetic Time Series
- DataForge is a framework that integrates DFML for declarative data format specification and automated file access, streamlining heterogeneous file management.
- It employs LLM-based methods to generate high-quality synthetic time series using periodic segmentation and functional embeddings, ensuring realistic data patterns.
- The system achieves universal file format interoperability and scalability with user-friendly tools like the DFML Editor and supports multimodal condition-based synthesis.
DataForge is a system framework that encompasses two complementary research threads: (1) the integration of structural data format description with auto-generation of file reading and writing programs using Data Format Markup Language (DFML), and (2) the generation of high-quality synthetic time series via LLM-based modeling of functional embeddings. These technologies collectively address heterogeneous file format interoperability, automated data access, and scalable data synthesis, thereby streamlining data management, sharing, and generation workflows in scientific and enterprise contexts.
1. Integration of Structured Data Format Descriptions
DataForge incorporates the methodology described in (Cheng et al., 2021), in which file format information is described using DFML—an XML-based Data Format Markup Language. DFML enables declarative, universal, and machine-readable representations of file structure, including:
- Data types and groupings
- Element positions (start/end, absolute/relative)
- Byte order and modes (text or binary)
- Repetition, intervals, and nested groups
The DFML schema provides elements such as `<dataformat>`, `<import>`, `<location>`, `<datatype>`, `<separator>`, and `<group>`, each with precise attributes to describe complex, hierarchical, or irregular file layouts.
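To make the schema concrete, the sketch below embeds a minimal, hypothetical DFML fragment and walks it with Python's standard `xml.etree.ElementTree`. The element names (`dataformat`, `group`, `datatype`) follow the list above, but every attribute name (`name`, `type`, `start`, `mode`, `byteorder`, `repeat`) is an illustrative assumption, not the official DFML specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical DFML fragment; attribute names are illustrative guesses,
# not the actual DFML schema from (Cheng et al., 2021).
DFML_DOC = """
<dataformat name="point-shapefile" mode="binary" byteorder="big">
  <group name="header" start="0">
    <datatype name="file_code" type="int32"/>
    <datatype name="file_length" type="int32"/>
  </group>
  <group name="records" repeat="*">
    <datatype name="x" type="float64" byteorder="little"/>
    <datatype name="y" type="float64" byteorder="little"/>
  </group>
</dataformat>
"""

def summarize(dfml_text):
    """Walk a DFML tree and list (group, field, type) triples."""
    root = ET.fromstring(dfml_text)
    entries = []
    for group in root.findall("group"):
        for dt in group.findall("datatype"):
            entries.append((group.get("name"), dt.get("name"), dt.get("type")))
    return entries

print(summarize(DFML_DOC))
```

A real DFML document would additionally encode positions, separators, and nesting; the point here is only that a declarative XML description is trivially machine-readable.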
2. Automated File Reader and Writer Program Generation
The auto-generation process begins by parsing a DFML document to extract a linearized sequence of data type entries, each annotated with structural descriptors (positions, types, rarefaction, repetition, byte order). This linearization, as outlined in the pseudocode of Fig. 7 in (Cheng et al., 2021), enables:
- The construction of code that reads files either sequentially or randomly:
  - Sequential reading: parses the file from start to finish according to the ordered data sequence, appropriate for exhaustive data extraction.
  - Random reading: accesses items by absolute position, directly seeking to specified offsets, enabling efficient partial file access.
- Code generation is parameterized not only by the linear data type sequence but also by generalizable programming rules for the target language (e.g., C#).
- The process is exemplified by case studies on the ESRI point shapefile (binary, fixed header, variable records) and SWMM input files (plain text, complex specification/content sections), demonstrating correctness across both modalities and file complexity classes.
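A hedged sketch of how a generated reader might consume such a linearized sequence, using Python's `struct` module: the `(field name, format code)` entry layout is an assumption for illustration only; the paper's generator emits C# code from a richer entry structure.

```python
import io
import struct

# Linearized data type sequence, as might be extracted from a DFML document.
# The tuple layout (field name, struct format code) is an illustrative
# assumption, not the paper's actual intermediate representation.
LINEARIZED = [
    ("file_code", ">i"),     # big-endian int32, per the assumed header
    ("x", "<d"),             # little-endian float64
    ("y", "<d"),
]

def read_sequential(stream, entries):
    """Sequential reading: consume fields in order from start to finish."""
    out = {}
    for name, fmt in entries:
        size = struct.calcsize(fmt)
        out[name] = struct.unpack(fmt, stream.read(size))[0]
    return out

def read_random(stream, offset, fmt):
    """Random reading: seek directly to an absolute offset for one field."""
    stream.seek(offset)
    return struct.unpack(fmt, stream.read(struct.calcsize(fmt)))[0]

# Tiny in-memory "file" matching the sequence above.
payload = struct.pack(">i", 9994) + struct.pack("<d", 1.5) + struct.pack("<d", -2.0)
print(read_sequential(io.BytesIO(payload), LINEARIZED))
print(read_random(io.BytesIO(payload), 4, "<d"))
```

The sequential path mirrors exhaustive extraction; the random path shows why absolute positions in the DFML description make partial access cheap.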
The DFML Editor tool is integral, offering a GUI for DFML authoring, element tree navigation, and live XML preview, substantially reducing the manual overhead of precise format annotation.
3. Synthetic Time Series Generation with LLMs
In the context of synthetic time series, DataForge leverages the SDForger framework described in (Rousseau et al., 21 May 2025). The architecture for time series generation comprises:
- Periodicity-aware segmentation: Raw multivariate time series are segmented based on statistical periodicity (via autocorrelation analysis).
- Basis decomposition and embedding: Each segment is projected onto learned basis functions (Functional Principal Components or FastICA), resulting in a tabular embedding independent of native signal length. For each channel $c$, $x_c(t) \approx \sum_{k=1}^{K} a_{c,k}\,\phi_k(t)$, where each coefficient $a_{c,k}$ is computed by projecting $x_c$ onto the basis function $\phi_k$.
- Textual encoding: Embedding rows are rendered as structured text via fill-in-the-blank templates with feature randomization (to minimize position bias), forming a prompt/answer pair amenable to LLM fine-tuning.
- Autoregressive LLM fine-tuning: An LLM (such as GPT-2) is conditioned on these textual embeddings and fine-tuned, typically with few samples, due to the compactness of the representation.
4. Inference, Decoding, and Utility Evaluation
During inference, the LLM is prompted to generate novel textual embeddings, which are parsed back into numeric form and decoded:
- Sampling: Multinomial sampling with temperature scaling balances sample diversity and fidelity.
- Outlier filtering: Norm-based filtering rejects anomalous generations to maintain global signal characteristics.
- Synthesis: Reconstruction of time series utilizes $\hat{x}(t) = \sum_{k=1}^{K} \hat{a}_k\,\phi_k(t)$, where $\hat{a}_k$ are the LLM-sampled coefficients.
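A minimal numpy sketch of the decoding side, under the same assumed basis-decomposition setup: sampled coefficient rows are norm-filtered against the training embeddings, then multiplied back through the basis functions. The threshold, shapes, and random data are illustrative, not SDForger's actual settings.

```python
import numpy as np

def filter_outliers(samples, train, k=3.0):
    """Norm-based filtering: drop sampled rows whose norm deviates from
    the training embeddings' norms by more than k standard deviations."""
    norms = np.linalg.norm(train, axis=1)
    mu, sd = norms.mean(), norms.std()
    keep = np.abs(np.linalg.norm(samples, axis=1) - mu) <= k * sd
    return samples[keep]

def decode(coeffs, basis):
    """Reconstruct segments: x_hat(t) = sum_k a_hat_k * phi_k(t)."""
    return coeffs @ basis

rng = np.random.default_rng(1)
basis = np.linalg.qr(rng.standard_normal((50, 3)))[0].T  # 3 orthonormal bases
train = rng.standard_normal((100, 3))                    # training embeddings
# 20 plausible sampled rows plus one gross outlier the filter should reject.
sampled = np.vstack([rng.standard_normal((20, 3)), 100 * np.ones((1, 3))])
kept = filter_outliers(sampled, train)
series = decode(kept, basis)
print(sampled.shape, kept.shape, series.shape)
```

Because decoding is a single matrix product against fixed basis functions, generation cost is dominated by LLM sampling, not reconstruction.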
Empirical results indicate that, especially with FastICA-based embeddings, SDForger achieves superior or competitive performance on similarity metrics (marginal distribution, autocorrelation, skewness, kurtosis, Euclidean distance, Dynamic Time Warping) and forecasting utility (as measured by downstream RMSE with Tiny Time Mixers) relative to VAE, GAN, or diffusion models. The framework’s design is dataset-agnostic, encompassing domains such as energy, transport, finance, and environmental data.
5. Textual Conditioning and Multimodal Extensions
A unique aspect of SDForger is its native support for textual conditioning:
- Since the LLM ingests structured text, extraneous conditioning information (e.g., channel ID, context descriptors) can be embedded seamlessly within the prompt without architectural alteration.
- This enables multimodal modeling, for example, by unifying time series and language for context-aware or guided synthetic data generation.
- The pipeline thus accommodates scenarios where temporal data are accompanied or constrained by semantic/natural language inputs.
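A sketch of how such conditioned, fill-in-the-blank encoding might look; the template wording, field names, and channel/context strings are all invented for illustration and are not SDForger's actual templates.

```python
import random

def encode_row(row, channel, context=None, rng=None):
    """Render one embedding row as fill-in-the-blank text.

    Feature order is shuffled to reduce position bias; optional context
    strings ride along in the same prompt with no architectural change.
    Template wording and field names are illustrative assumptions.
    """
    rng = rng or random.Random()
    fields = [f"coefficient {k} is {v:.3f}" for k, v in enumerate(row)]
    rng.shuffle(fields)
    prefix = f"channel {channel}"
    if context:
        prefix += f", {context}"
    return f"For {prefix}: " + ", ".join(fields) + "."

row = [0.42, -1.07, 3.14]
prompt = encode_row(row, channel="load_kw",
                    context="weekday winter demand", rng=random.Random(0))
print(prompt)
```

The point of the example is the last bullet above: because the model's input is plain text, semantic conditioning is just more words in the prompt.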
6. Implications for DataForge Architecture and Future Directions
Integrating DFML-based parsing and LLM-based synthetic data generation imparts distinctive advantages to DataForge:
- Universal format interoperability: DFML enables DataForge to ingest, interpret, and auto-generate accessors for diverse, evolving file layouts without per-format code rewrites.
- Rapid, accurate, and user-friendly data integration: The DFML Editor provides an accessible interface for data engineers or domain experts to describe and update data format schemas.
- Efficient and realistic data synthesis: Through SDForger, DataForge can generate synthetic data instances that match original data statistics and temporal dynamics, supporting simulation, augmentation, and privacy-preserving analytics.
- Open extensibility: The planned open-sourcing of SDForger is intended to enable broad, community-driven research advances.
Possible future refinements cited in (Rousseau et al., 21 May 2025) include exploring encoder-only LLMs with masked token prediction, adaptive embedding dimensionality strategies, and richer context modeling.
7. Summary of Research Contributions
DataForge, as synthesized from (Cheng et al., 2021) and (Rousseau et al., 21 May 2025), represents the convergence of declarative data format description, automated code generation, and LLM-driven synthetic time series modeling. Core contributions include:
- The formalization and practical deployment of DFML for heterogeneous data format management and automated program synthesis.
- A functional embedding and text-based approach to time series synthesis, leveraging LLMs for fidelity and multimodal conditioning.
- Systematic empirical validation showing competitive or superior performance and task relevance across diverse datasets and evaluation metrics.
- Architectural features that promote extensibility, format universality, and usability for a wide range of data science and scientific computing scenarios.