SDForger Synthetic Time Series Framework
- SDForger is a framework that generates synthetic multivariate time series by transforming data into textualized embeddings for LLM-based synthesis.
- It employs fill-in-the-middle prompts and LoRA fine-tuning to efficiently learn from limited samples while maintaining the original data's statistical properties.
- The framework achieves high statistical fidelity as measured by metrics such as MDD, ACD, and DTW, and the synthetic data enhances downstream forecasting performance.
SDForger is a framework for generating synthetic multivariate time series utilizing LLMs with a compact tabular embedding that enables efficient, high-fidelity data synthesis. The central innovation is transforming both univariate and multivariate time series into textualized embeddings, making it possible to fine-tune autoregressive LLMs—even with limited computational resources and few real instances—and conditionally generate new samples that preserve the target data’s statistical and temporal structure (Rousseau et al., 21 May 2025).
1. Problem Formulation and Evaluation Metrics
The setting assumes an observed multivariate time series dataset $X = \{X_i\}_{i=1}^{I}$ with $C$ channels and sequences of length $T$, i.e., $X_i \in \mathbb{R}^{C \times T}$. The generative task is to create new instances $\tilde{X}_j \in \mathbb{R}^{C \times T}$ matching the joint distribution of $X$.
Distributional similarity is quantified primarily via:
- Marginal Distribution Difference (MDD): the discrepancy between the empirical CDFs $\hat{F}_{X}$ and $\hat{F}_{\tilde{X}}$ of real and synthetic values.
- Autocorrelation Difference (ACD): the discrepancy between the autocorrelation functions $\rho_{X}$ and $\rho_{\tilde{X}}$ of real and synthetic series.
Additional distance-based metrics employed include Euclidean Distance (ED), Dynamic Time Warping (DTW), and shapelet-based reconstruction error (SHR).
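As a concrete illustration, the following NumPy sketch computes MDD as the mean absolute gap between empirical CDFs and ACD as the mean absolute gap between autocorrelation functions; the exact aggregation and lag range used in the paper are assumptions here.

```python
import numpy as np

def empirical_cdf(samples, grid):
    """Empirical CDF of `samples` evaluated on `grid`."""
    return np.searchsorted(np.sort(samples), grid, side="right") / len(samples)

def mdd(real, synth, n_grid=200):
    """Mean absolute difference between empirical CDFs (assumed MDD form)."""
    grid = np.linspace(min(real.min(), synth.min()),
                       max(real.max(), synth.max()), n_grid)
    return np.abs(empirical_cdf(real.ravel(), grid)
                  - empirical_cdf(synth.ravel(), grid)).mean()

def autocorr(x, max_lag):
    """Autocorrelation of a 1-D series at lags 1..max_lag."""
    x = x - x.mean()
    denom = (x * x).sum()
    return np.array([(x[:-lag] * x[lag:]).sum() / denom
                     for lag in range(1, max_lag + 1)])

def acd(real, synth, max_lag=20):
    """Mean absolute difference between autocorrelation functions (assumed ACD form)."""
    return np.abs(autocorr(real, max_lag) - autocorr(synth, max_lag)).mean()
```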
Forecasting accuracy is evaluated by training predictive models—such as Tiny Time Mixer (TTM)—on generated data, real data, or combinations. Performance is measured with standard error metrics (RMSE, MAPE, MASE). This multi-faceted evaluation benchmarks both statistical similarity and utility for downstream forecasting (Rousseau et al., 21 May 2025).
2. Embedding Construction and Data Representation
Each channel $X_i^c$ is modeled as a real-valued function over the time interval $[0, T]$. Key steps in constructing embeddings are:
- Functional Basis Projection: $J$ basis functions $\{b_j^c\}_{j=1}^{J}$ (usually via Functional PCA or FastICA) are selected per channel. Embedding coefficients
$$e_{ij}^{c} = \int X_i^{c}(t)\, b_j^{c}(t)\, dt$$
are concatenated over channels to form an embedding vector $e_i \in \mathbb{R}^{K}$ with $K = C \cdot J$.
- Normalization: Each sequence is mean-standardized at each timestamp ($X_i^c(t) \mapsto (X_i^c(t) - \mu^c(t))/\sigma^c(t)$), where $\mu^c(t)$ and $\sigma^c(t)$ are computed across all instances.
- Tokenization: Each real-valued coefficient is mapped to a fixed-precision decimal string (e.g., “0.1234”) and tokenized using the LLM’s vocabulary, relying solely on the model's native BPE or byte-level encoding.
No further quantization is introduced beyond the parameterization defined by the basis decomposition and LLM tokenizer.
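A minimal sketch of the embedding step, using scikit-learn PCA on the sampled curves as a stand-in for functional PCA; the basis count, normalization constant, and decimal precision are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def channel_embeddings(X_c, n_basis=4):
    """X_c: (I, T) array of one channel across I instances.
    Returns (I, n_basis) coefficients, the fitted basis model, and the stats."""
    # Per-timestamp standardization across instances.
    mu, sigma = X_c.mean(axis=0), X_c.std(axis=0) + 1e-8
    Z = (X_c - mu) / sigma
    # PCA on the sampled curves approximates a functional-PCA projection.
    pca = PCA(n_components=n_basis)
    coeffs = pca.fit_transform(Z)          # e_{ij}^c
    return coeffs, pca, (mu, sigma)

def textualize(coeffs, precision=4):
    """Render each coefficient as a fixed-precision decimal string
    (the precision is an assumption; SDForger's exact format may differ)."""
    return [[f"{v:.{precision}f}" for v in row] for row in coeffs]
```

Concatenating the per-channel coefficient blocks row-wise yields the embedding table of width $K = C \cdot J$ referenced below.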
3. LLM Fine-Tuning and Prompt Engineering
- Prompt Structure: Each embedding vector $e_i$ is permuted via a random permutation $\pi$ and converted into a fill-in-the-middle text prompt
$$P_i^{FT} = s_{\pi(1)} \oplus s_{\pi(2)} \oplus \cdots \oplus s_{\pi(K)},$$
where each $s_k$ is a textualized feature–value pair and $\oplus$ denotes string concatenation. Randomization over $\pi$ mitigates position bias.
- Optimization Objective: LLM parameters $\theta$ are optimized to minimize cross-entropy over the combined prompt+target token sequence $w_{1:N}$:
$$\mathcal{L}(\theta) = -\sum_{n=1}^{N} \log p_{\theta}(w_n \mid w_{<n}).$$
- Low-Rank Adapters (LoRA): For efficient adaptation, LoRA adapters are inserted within each Transformer block:
$$W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k).$$
Only $A$ and $B$ are tuned, with $W_0$ frozen.
This workflow enables SDForger to adapt general-purpose LLMs with a minimal number of embedding instances, often only a few dozen ($\sim 30$) rows, for high-quality time series synthesis (Rousseau et al., 21 May 2025).
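The sketch below illustrates how an embedding row might be rendered as a permuted, fill-in-the-middle style prompt and how LoRA adapters could be attached with Hugging Face peft; the template wording, feature names, base model, and LoRA hyperparameters are assumptions, not the paper's exact configuration.

```python
import random
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

def fim_prompt(values, feature_names):
    """Build a fill-in-the-middle style prompt from one embedding row.
    The exact template is an assumption; feature order is randomly permuted."""
    order = list(range(len(values)))
    random.shuffle(order)                      # mitigates position bias
    parts = [f"{feature_names[k]} is {values[k]}" for k in order]
    return ", ".join(parts)

# Example: one row with two hypothetical features.
prompt = fim_prompt(["0.1234", "-0.5678"], ["e_1_ch1", "e_2_ch1"])

# Illustrative base model; SDForger can adapt any autoregressive LLM.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"],   # GPT-2 attention projection
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)       # only the adapter weights A, B train
model.print_trainable_parameters()
```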
4. Synthetic Sequence Sampling and Inverse Mapping
- Autoregressive Sampling: At inference, SDForger populates a template with $K$ blanks and prompts the LLM to fill them stepwise in token space, sampling via top-$k$ (fixed number of highest-probability tokens) or nucleus (top-$p$ cumulative probability) selection.
- Inverse Decoding: Generated text is parsed to extract feature–value pairs $(k, \tilde{e}_k)$, which are mapped back to the real domain by reconstructing each channel through the corresponding basis expansion:
$$\tilde{X}^{c}(t) = \sum_{j=1}^{J} \tilde{e}_{j}^{c}\, b_{j}^{c}(t).$$
Concatenation across all channels yields a synthetic instance $\tilde{X} \in \mathbb{R}^{C \times T}$.
- Post-processing Filters: Outputs are filtered to remove NaNs, duplicates, and $\ell_2$-norm outliers before acceptance.
This process yields synthetic sequences with statistics and dynamics aligned to the original distribution.
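A hedged sketch of the inverse mapping and acceptance filters, continuing the PCA-based embedding example from Section 2; the parsing format and the outlier rule are assumptions.

```python
import re
import numpy as np

def parse_embedding(text, feature_names):
    """Extract 'feature is value' pairs from generated text (format assumed)."""
    coeffs = np.full(len(feature_names), np.nan)
    for i, name in enumerate(feature_names):
        m = re.search(rf"{re.escape(name)} is (-?\d+\.\d+)", text)
        if m:
            coeffs[i] = float(m.group(1))
    return coeffs

def decode(coeffs_per_channel, pca_per_channel, stats_per_channel):
    """Map coefficients back to the time domain via the basis expansion."""
    channels = []
    for e_c, pca, (mu, sigma) in zip(coeffs_per_channel, pca_per_channel,
                                     stats_per_channel):
        z = pca.inverse_transform(e_c.reshape(1, -1))[0]   # sum_j e_j^c b_j^c(t)
        channels.append(z * sigma + mu)                    # undo standardization
    return np.stack(channels)                              # shape (C, T)

def accept(coeffs, table, z_thresh=3.0):
    """Reject NaNs, duplicates, and l2-norm outliers relative to the real table."""
    if np.isnan(coeffs).any() or any(np.allclose(coeffs, row) for row in table):
        return False
    norms = np.linalg.norm(table, axis=1)
    return abs(np.linalg.norm(coeffs) - norms.mean()) <= z_thresh * norms.std()
```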
5. Statistical Fidelity and Downstream Use
SDForger’s fidelity is established by:
- Feature-based metrics: MDD (marginal), skewness, kurtosis, and ACD (autocorrelation differences).
- Distance-based metrics: Euclidean distance, DTW, and SHR via shift-invariant dictionary learning.
- Multivariate Structure: Cross-channel and time-lagged dependencies are assessed by comparing cross-covariances such as $\mathrm{Cov}\!\left(X^{c}_{t}, X^{c'}_{t+\tau}\right)$ between real and synthetic data (see the sketch below).
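A minimal sketch of the lagged cross-covariance comparison; the chosen lags and the averaging over channel pairs are assumptions.

```python
import numpy as np

def lagged_cross_cov(x_c, x_cp, lag):
    """Cov(X^c_t, X^{c'}_{t+lag}) estimated over a single sequence."""
    a = x_c[:-lag] if lag > 0 else x_c
    b = x_cp[lag:] if lag > 0 else x_cp
    return np.cov(a, b)[0, 1]

def cross_cov_gap(real, synth, lags=(0, 1, 5)):
    """Mean absolute gap in lagged cross-covariances over channel pairs."""
    C = real.shape[0]
    gaps = [abs(lagged_cross_cov(real[c], real[d], L)
                - lagged_cross_cov(synth[c], synth[d], L))
            for c in range(C) for d in range(C) for L in lags]
    return float(np.mean(gaps))
```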
For downstream forecasting, TTM is evaluated in four regimes: (1) zero-shot, (2) trained on real data only, (3) trained on synthetic data only, or (4) trained on real+synthetic data, with accuracy quantified on held-out test sets (RMSE, MAPE, MASE). Results show that synthetic data from SDForger alone often matches the utility of real data, with mixed training yielding further gains.
6. Comparative Perspective and Algorithmic Features
Compared to GANs and VAEs, SDForger:
- Leverages LLM adaptation via textual prompts, not from-scratch training.
- Handles long sequences, since the embedding size is agnostic to the sequence length $T$.
- Supports few-shot adaptation with minimal data.
- Enables textual conditioning for semantic or channel-specific control (e.g., “Condition: data is temperature”).
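For example, textual conditioning can be implemented by prefixing the prompt with a condition string; the snippet below is illustrative and reuses the hypothetical `fim_prompt` helper from the sketch in Section 3.

```python
# Hypothetical conditioned prompt; the exact conditioning syntax is an assumption.
condition = "Condition: data is temperature. "
prompt = condition + fim_prompt(["0.1234", "-0.5678"], ["e_1_ch1", "e_2_ch1"])
```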
Algorithmic summaries:
- Fine-tuning:
```python
# Input: real time series X, basis {b^c_j}, LLM θ₀.
# 1. Segment into windows X_i.
# 2. Compute embeddings e_{ij}^c = ∫ X_i^c(t) b_j^c(t) dt.
# 3. Form embedding table E ∈ ℝ^{I×K}.
# 4. Construct fill-in-the-middle prompts P_i^{FT}.
# 5. Fine-tune θ on {P_i^{FT}}, minimizing cross-entropy.
# Output: fine-tuned θ*.
```
- Generation:
```python
# Input: fine-tuned θ*, desired sample count Ũ.
# while |generated| < Ũ:
#   a. Sample a K-blanks template P^{INF}.
#   b. Autoregressively sample tokens via top-k / top-p.
#   c. Parse text → ẽ ∈ ℝ^K.
#   d. Filter outputs (NaNs, duplicates, ℓ2 outliers).
#   e. Decode ẽ → X̃ via the basis sum.
#   f. Append X̃ to the outputs.
# Output: {X̃_j}, j = 1, …, Ũ.
```
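A more concrete, hedged version of the generation loop using the standard transformers sampling interface; the template text, stopping rule, and helper functions (`parse_embedding`, `accept`) come from the earlier sketches and are assumptions rather than the paper's exact implementation.

```python
import numpy as np

def generate_embeddings(model, tok, template, feature_names, n_samples,
                        top_k=50, top_p=0.95, max_new_tokens=128):
    """Assumed inference loop: prompt with a blanks template, sample with
    top-k / nucleus sampling, then parse and filter until enough rows are kept."""
    accepted = []
    inputs = tok(template, return_tensors="pt")
    while len(accepted) < n_samples:
        out = model.generate(**inputs, do_sample=True, top_k=top_k, top_p=top_p,
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tok.eos_token_id)
        text = tok.decode(out[0], skip_special_tokens=True)
        coeffs = parse_embedding(text, feature_names)  # sketch from Section 4
        if not np.isnan(coeffs).any():  # full filtering would also apply accept()
            accepted.append(coeffs)
    return np.stack(accepted)
```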
7. Constraints, Extensions, and Outlook
Limitations include the need for a moderate embedding dimension $K$ relative to the sample size to prevent overfitting and unstable synthesis. Excessive $K$ slows LLM convergence and degrades sample quality.
Potential extensions:
- Encoder-only LLMs (e.g., BERT) with masking for imputation.
- Adaptive, data-driven basis selection (choosing the number of basis functions per channel).
- Joint time series–text pretraining for richer multimodal generation (e.g., incorporating event annotations).
- Feeding synthetic data back into LLM pretraining to enhance in-context forecasting ability (Rousseau et al., 21 May 2025).
SDForger establishes a workflow for high-fidelity, few-shot, multimodally conditioned synthetic time series generation, leveraging the infrastructure and text reasoning of modern LLMs.