DataForge Framework for Synthetic Time Series

Updated 27 August 2025
  • DataForge Framework is a modular system for synthetic time series generation that transforms raw data into a compact textual embedding space.
  • It utilizes basis decompositions and lightweight LLM fine-tuning to ensure synthetic outputs retain key statistical and temporal properties.
  • The decoupled design offers computational advantages and robust performance in low-data regimes across diverse application domains.

DataForge—formally referred to as SDForger—constitutes a flexible and efficient framework for generating high-quality multivariate and univariate time series using LLMs. The approach is distinguished by a compact data representation scheme, enabling synthetic time series generation from limited samples and requiring only lightweight fine-tuning of any autoregressive LLM. By transforming time series data into a tabular embedding space, encoding this representation as structured text, and leveraging text-to-embedding decoding strategies, SDForger achieves superior fidelity and scalability relative to prior generative models in both similarity-based and downstream task evaluations.

1. Architectural Overview and Workflow

The SDForger framework is divided into three sequential phases that together enable synthetic data generation with strong retention of statistical and temporal properties:

  1. Preprocessing & Embedding: Multivariate or univariate time series are segmented using a periodicity-aware strategy. Each segment is projected into a lower-dimensional subspace via basis decompositions, such as Functional Principal Components (FPC) or Fast Independent Component Analysis (FastICA), yielding a tabular embedding matrix E.
  2. Textual Encoding & LLM Fine-Tuning: Each row of E (an embedding vector for a segment instance) is converted to a structured textual prompt using a fill-in-the-middle template. Feature order is randomized to mitigate positional biases. The resulting prompts and coefficient pairs constitute training data for fine-tuning an autoregressive LLM so it learns structural correlations among coefficients.
  3. Inference & Decoding: Synthetic data is produced by sampling new textual embeddings from the fine-tuned LLM, parsing the generated text back into embedding space, and reconstructing the time series via inversion of the original basis projection.

This modular design, where projection and sequence modeling are uncoupled, yields significant computational advantages, especially with small datasets and long sequences.
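
To make this decoupling concrete, the toy sketch below runs the three phases end to end on a synthetic sine wave; the LLM stage is replaced by a Gaussian sampler over embedding rows purely so the example is self-contained, whereas SDForger fine-tunes an autoregressive LLM on textual encodings of those rows.

```python
# Toy end-to-end sketch of the decoupled pipeline (not the SDForger codebase).
import numpy as np

rng = np.random.default_rng(0)

# Phase 1: segment a univariate series into fixed-length windows and project
# each window onto a PCA basis (a stand-in for FPC/FastICA).
series = np.sin(np.linspace(0, 40 * np.pi, 4000)) + 0.1 * rng.normal(size=4000)
segments = series.reshape(-1, 100)                      # I x T segment matrix
mean = segments.mean(axis=0)
_, _, Vt = np.linalg.svd(segments - mean, full_matrices=False)
basis = Vt[:8]                                          # k x T basis functions
E = (segments - mean) @ basis.T                         # I x k embedding table

# Phase 2 (placeholder): SDForger serializes rows of E as text and fine-tunes
# an autoregressive LLM; here a Gaussian fit over the rows stands in for it.
mu, cov = E.mean(axis=0), np.cov(E, rowvar=False)

# Phase 3: sample new embedding rows and invert the projection.
E_tilde = rng.multivariate_normal(mu, cov, size=32)     # sampled coefficients
synthetic = E_tilde @ basis + mean                      # 32 synthetic segments
print(synthetic.shape)                                  # (32, 100)
```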

2. Data Representation: Basis Embedding and Textual Transformation

The transformation of raw signals into tabular embeddings is foundational to the DataForge methodology:

  • For each channel c and instance i, the observed segment X^c_i is expanded as:

X^c_i \approx \sum_{j=1}^{k_c} e^c_{ij} b^c_j

where b^c_j are the basis functions, and e^c_{ij} are the embedding coefficients

e^c_{ij} = \langle X^c_i, b^c_j \rangle_{L^2} = \int_t X^c_i(t) \cdot b^c_j(t) \, dt

  • The full dataset becomes a matrix E \in \mathbb{R}^{I \times K}, with K = \sum_{c} k_c.
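
As a concrete illustration of this projection step, the sketch below builds one per-channel block of E with scikit-learn's FastICA on toy segments; the data and number of components are illustrative, and an FPC decomposition would follow the same fit/transform pattern.

```python
# Minimal sketch of the basis-embedding step for a single channel c using
# FastICA; per-channel blocks are then concatenated column-wise into E.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 120)
# I = 64 toy segment instances of length T = 120 for one channel.
X_c = np.array([np.sin(2 * np.pi * (3 * t + rng.uniform()))
                + 0.1 * rng.normal(size=t.size) for _ in range(64)])

ica = FastICA(n_components=6, max_iter=1000, random_state=0)
E_c = ica.fit_transform(X_c)            # I x k_c block of the embedding table
X_c_hat = ica.inverse_transform(E_c)    # approximate reconstruction

print(E_c.shape, float(np.mean((X_c - X_c_hat) ** 2)))  # (64, 6) and the MSE
```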

For LLM consumption, embedding vectors are encoded as text:

  • Each row is rendered into a fill-in-the-middle style prompt, e.g., "Input: value_2 is [blank], value_1 is [blank] [sep] Target: e_{i2} is [answer] e_{i1} is [answer]"
  • Feature order is randomly permuted per instance to prevent model overfitting on any static column arrangement, increasing robustness.
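
A minimal sketch of this encoding, assuming the [blank]/[sep] template shown above and a plain "value_j is <number>" target format for training; the exact serialization and rounding used by SDForger may differ.

```python
# Hypothetical rendering of one row of E into a fill-in-the-middle prompt with
# a freshly shuffled feature order; [blank] and [sep] follow the example above.
import numpy as np

def encode_row(row: np.ndarray, rng: np.random.Generator) -> str:
    order = rng.permutation(len(row))   # random feature order per instance
    inputs = ", ".join(f"value_{j + 1} is [blank]" for j in order)
    targets = " ".join(f"value_{j + 1} is {row[j]:.4f}" for j in order)
    return f"Input: {inputs} [sep] Target: {targets}"

rng = np.random.default_rng(0)
print(encode_row(np.array([0.82, -1.37, 0.05]), rng))
```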

3. Synthetic Generation: Inference, Decoding, and Property Retention

SDForger's generative phase consists of a sequence of systematic steps:

  • Sampling Embeddings:

The fine-tuned LLM receives prompts in the training template and generates new text completions. Sampling strategies are typically multinomial, with temperature adjustments to balance diversity and plausibility.

  • Decoding to Time Series:

Numerical embedding coefficients \tilde{e}^c_{ij} are parsed from output text and used to reconstruct synthetic signals via:

\tilde{X}^c_i = \sum_{j=1}^{k_c} \tilde{e}^c_{ij} b^c_j

This ensures preservation of key statistics—including variance, skewness, kurtosis—and temporal features such as autocorrelation and periodicity observed in the source data.

The result is synthetic time series that closely imitate the statistical and dynamical characteristics of the original observations.
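
The inference-and-decoding step can be sketched as below, assuming completions in the same "value_j is <number>" style; the example completion, the regex, and the toy basis are illustrative. In practice the completion would come from the fine-tuned LLM via multinomial sampling with a temperature setting (e.g., do_sample=True and temperature in Hugging Face transformers' generate()).

```python
# Parse coefficients from a sampled completion and invert the basis projection.
import re
import numpy as np

# Example completion; SDForger would obtain this from the fine-tuned LLM.
completion = "Target: value_2 is -0.4137 value_1 is 1.2055 value_3 is 0.0931"

# Recover the embedding row from "value_j is <float>" pairs.
pairs = re.findall(r"value_(\d+) is (-?\d+\.\d+)", completion)
e_tilde = np.zeros(3)
for j, val in pairs:
    e_tilde[int(j) - 1] = float(val)

# Reconstruct the synthetic segment with the same basis used for encoding
# (a toy 3 x T basis here): X_tilde = sum_j e_tilde_j * b_j.
t = np.linspace(0, 1, 100)
basis = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), np.ones_like(t)])
X_tilde = e_tilde @ basis
print(X_tilde.shape)                    # (100,)
```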

4. Comparative Evaluation and Performance Metrics

Comprehensive experimental benchmarks highlight SDForger's competitive advantages:

  • Feature-based Metrics:
    • Marginal Distribution Difference (MDD)
    • AutoCorrelation Difference (ACD)
    • Skewness Difference (SD)
    • Kurtosis Difference (KD)
  • Distance-based Metrics:
    • Euclidean Distance (ED)
    • Dynamic Time Warping (DTW)
    • Shapelet-based reconstruction error

Empirically, SDForger outperforms VAE-based, GAN, and diffusion model baselines in many scenarios. The framework is particularly effective in low-data regimes and for generating long synthetic sequences, due to the decoupling of segment length from the LLM’s complexity. Experimental results across domains—including energy, transport, finance, and natural phenomena—demonstrate the high similarity of SDForger’s outputs to real data and their utility in downstream forecasting tasks.
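
For reference, the sketch below computes two of the feature-based metrics listed above (skewness difference and autocorrelation difference) between a real and a synthetic sample; the lag range and the mean-absolute aggregation are assumptions rather than the paper's exact definitions.

```python
# Rough implementations of two feature-based similarity metrics.
import numpy as np
from scipy import stats

def autocorr(x: np.ndarray, lag: int) -> float:
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

def skewness_difference(real: np.ndarray, synth: np.ndarray) -> float:
    return abs(stats.skew(real) - stats.skew(synth))

def autocorrelation_difference(real, synth, max_lag: int = 20) -> float:
    lags = range(1, max_lag + 1)
    return float(np.mean([abs(autocorr(real, l) - autocorr(synth, l)) for l in lags]))

rng = np.random.default_rng(0)
grid = np.linspace(0, 20 * np.pi, 1000)
real = np.sin(grid) + 0.1 * rng.normal(size=grid.size)
synth = np.sin(grid) + 0.2 * rng.normal(size=grid.size)
print(skewness_difference(real, synth), autocorrelation_difference(real, synth))
```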

5. Innovations: Textual Conditioning and Multimodal Integration

A central innovation is the direct embedding of textual context into the generation process. SDForger's design allows users to infuse prompts with additional information (such as channel annotation or modality tags), thereby conditioning the sampling of synthetic data on arbitrary text. This architectural property facilitates:

  • Seamless integration of time series with complementary textual modalities (e.g., metadata, scenario descriptions).
  • Streamlined realization of multimodal generative modeling tasks, where both structured time series and unconstrained text play interconnected roles in synthesis or downstream applications.

This flexibility is a distinguishing factor, opening new research paths for context-aware and controllable time series generation.
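
As a purely hypothetical illustration, conditioning could amount to prepending context to the same fill-in-the-middle template so that sampling depends on it; the "Context:" field and the tag wording below are assumptions, not a documented SDForger format.

```python
# Hypothetical conditioned prompt: extra textual context precedes the template.
def conditioned_prompt(context: str, template_input: str) -> str:
    return f"Context: {context} [sep] Input: {template_input} [sep] Target:"

print(conditioned_prompt(
    "channel: household energy load, season: winter",
    "value_2 is [blank], value_1 is [blank]",
))
```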

6. Technical Specification: Formulas and Templating Mechanisms

SDForger’s operation is precisely specified in terms of basis extraction and reconstruction:

| Component | Formula/Description | Notes |
| --- | --- | --- |
| Embedding coefficient | e^c_{ij} = \langle X^c_i, b^c_j \rangle_{L^2} = \int_t X^c_i(t) \, b^c_j(t) \, dt | For all j = 1, \ldots, k_c |
| Embedding table | E = [E^1, \ldots, E^c] \in \mathbb{R}^{I \times K} | K = \sum_c k_c |
| Reconstruction | \tilde{X}^c_i = \sum_{j=1}^{k_c} \tilde{e}^c_{ij} b^c_j | Same bases used as in decomposition |
| Fill-in-the-middle prompt | Input: value_2 is [blank], value_1 is [blank] [sep] Target: ... | Feature order is randomized per sample |

A random permutation \pi is applied to the feature indices to mitigate positional modeling bias.

7. Prospective Developments and Open-Source Distribution

Future directions for SDForger include:

  • Investigation of non-autoregressive LLMs (e.g., encoder-only or mask-based architectures) to balance model capacity, expressiveness, and training efficiency.
  • Exploration of adaptive embedding strategies to optimize the trade-off between representational power and learning stability.
  • Expanded use of contextual and textual conditioning for more granular control over synthetic data characteristics, including integration of domain-specific knowledge or external control signals.

The source code for SDForger will be open-sourced, with expected distribution via public code repositories. This will support community experimentation, adaptation, and application of the framework in both research and industry settings.


SDForger (DataForge Framework) establishes a modular paradigm for synthetic time series generation, merging compact latent representations with the versatility of LLMs. Its methodical decoupling of statistical encoding and sequence generation, combined with the innovative use of textual conditioning, sets a foundation for further research at the intersection of time series modeling and language-driven generation.