SDForger Synthetic Time Series Framework

Updated 26 December 2025
  • SDForger is a framework that generates synthetic multivariate time series by transforming data into textualized embeddings for LLM-based synthesis.
  • It employs fill-in-the-middle prompts and LoRA fine-tuning to efficiently learn from limited samples while maintaining the original data's statistical properties.
  • The framework achieves high statistical fidelity using metrics like MDD, ACD, and DTW, enabling enhanced downstream forecasting performance.

SDForger is a framework for generating synthetic multivariate time series utilizing LLMs with a compact tabular embedding that enables efficient, high-fidelity data synthesis. The central innovation is transforming both univariate and multivariate time series into textualized embeddings, making it possible to fine-tune autoregressive LLMs—even with limited computational resources and few real instances—and conditionally generate new samples that preserve the target data’s statistical and temporal structure (Rousseau et al., 21 May 2025).

1. Problem Formulation and Evaluation Metrics

The objective scenario assumes observed multivariate time series

$$X = \{ X_i \}_{i=1}^{I}, \quad X_i \in \mathbb{R}^{C \times L}$$

with $C$ channels and sequences of length $L$. The generative task is to create $\tilde{I}$ new instances

$$\tilde X = \{ \tilde X_j \}_{j=1}^{\tilde I}, \quad \tilde X_j \in \mathbb{R}^{C \times L}$$

matching the joint distribution of $X$.

Distributional similarity is quantified primarily via:

  • Marginal Distribution Difference (MDD):

$$\text{MDD} = \frac{1}{CL} \sum_{c=1}^{C} \sum_{t=1}^{L} \left| F_{X^c}(x_t) - F_{\tilde X^c}(\tilde x_t) \right|$$

where $F$ denotes the empirical CDFs.

  • Autocorrelation Difference (ACD):

$$\text{ACD} = \frac{1}{C} \sum_{c=1}^{C} \sum_{\tau=1}^{\tau_{\max}} \left| \rho_{X^c}(\tau) - \rho_{\tilde X^c}(\tau) \right|$$

with $\rho(\tau)$ the autocorrelation at lag $\tau$.

Additional distance-based metrics employed include Euclidean Distance (ED), Dynamic Time Warping (DTW), and shapelet-based reconstruction error (SHR).

Forecasting accuracy is evaluated by training predictive models, such as Tiny Time Mixer (TTM), on generated data, real data, or combinations of the two. Performance is measured with standard error metrics:

$$\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\hat y_i - y_i)^2 }, \quad \mathrm{MAPE} = \frac{100}{N} \sum_{i=1}^{N} \left| \frac{\hat y_i - y_i}{y_i} \right|$$

This multi-faceted evaluation benchmarks both statistical similarity and utility for downstream forecasting (Rousseau et al., 21 May 2025).
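To make these metrics concrete, the following Python sketch computes MDD, ACD, RMSE, and MAPE for NumPy arrays X and X_tilde of shape (instances, channels, length). The estimator details (CDF evaluation grid, lag range) and function names are our own illustrative choices, not SDForger's reference implementation.

import numpy as np

def mdd(X, X_tilde, n_grid=100):
    # Marginal Distribution Difference: mean gap between empirical CDFs,
    # evaluated on a shared grid per channel and timestamp.
    I, C, L = X.shape
    gaps = []
    for c in range(C):
        for t in range(L):
            real, synth = np.sort(X[:, c, t]), np.sort(X_tilde[:, c, t])
            grid = np.linspace(min(real[0], synth[0]), max(real[-1], synth[-1]), n_grid)
            F_real = np.searchsorted(real, grid, side="right") / real.size
            F_synth = np.searchsorted(synth, grid, side="right") / synth.size
            gaps.append(np.mean(np.abs(F_real - F_synth)))
    return float(np.mean(gaps))

def acd(X, X_tilde, max_lag=10):
    # Autocorrelation Difference: per-channel gap between mean autocorrelations.
    def acf(x, lag):
        x = x - x.mean()
        return np.sum(x[:-lag] * x[lag:]) / np.sum(x * x)
    C = X.shape[1]
    total = 0.0
    for c in range(C):
        rho_r = np.mean([[acf(x, l) for l in range(1, max_lag + 1)] for x in X[:, c]], axis=0)
        rho_s = np.mean([[acf(x, l) for l in range(1, max_lag + 1)] for x in X_tilde[:, c]], axis=0)
        total += np.sum(np.abs(rho_r - rho_s))
    return float(total / C)

def rmse(y_hat, y):
    return float(np.sqrt(np.mean((np.asarray(y_hat) - np.asarray(y)) ** 2)))

def mape(y_hat, y):
    y_hat, y = np.asarray(y_hat), np.asarray(y)
    return float(100.0 * np.mean(np.abs((y_hat - y) / y)))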

2. Embedding Construction and Data Representation

Each channel $c$ is modeled as a real-valued function $X_i^c(t)$ over $[0, L]$. Key steps in constructing embeddings are:

  • Functional Basis Projection: Basis functions $\{b_j^c(t)\}_{j=1}^{k_c}$ (usually obtained via functional PCA or FastICA) are selected per channel. Embedding coefficients

$$e_{ij}^c = \int_0^L X_i^c(t)\, b_j^c(t)\, dt, \quad j = 1, \dots, k_c$$

are concatenated over channels to form an embedding vector

$$E_i = \big(e_{i1}^1, \ldots, e_{ik_1}^1;\; e_{i1}^2, \ldots, e_{ik_2}^2;\; \ldots\big) \in \mathbb{R}^{K}$$

with $K = \sum_{c=1}^{C} k_c$.

  • Normalization: Each sequence is mean-standardized at each timestamp, $X_i^c(t) \gets (X_i^c(t) - \mu_t)/\sigma_t$, where $\mu_t, \sigma_t$ are computed across all instances.
  • Tokenization: Each real-valued coefficient is mapped to a decimal string in $[a, b]$ (e.g., “0.1234”) and tokenized using the LLM’s vocabulary, relying solely on the model's native BPE or byte-level encoding.

No further quantization is introduced beyond the parameterization defined by the basis decomposition and LLM tokenizer.
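A minimal sketch of the embedding step for discretely sampled series, using per-channel PCA as a stand-in for the functional basis projection (SDForger itself uses functional PCA or FastICA); the array shapes, the fixed k per channel, and the four-decimal formatting are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA

def build_embedding_table(X, k_per_channel=5, decimals=4):
    # X: real series of shape (I, C, L). Returns numeric and textual embeddings.
    I, C, L = X.shape
    mu = X.mean(axis=0, keepdims=True)           # per-timestamp mean over instances
    sigma = X.std(axis=0, keepdims=True) + 1e-8  # per-timestamp std over instances
    Xn = (X - mu) / sigma                        # normalization step

    bases, coeffs = [], []
    for c in range(C):
        pca = PCA(n_components=k_per_channel).fit(Xn[:, c, :])
        bases.append(pca)                        # discrete analogue of {b_j^c(t)}
        coeffs.append(pca.transform(Xn[:, c, :]))  # (I, k_c) projection coefficients
    E = np.concatenate(coeffs, axis=1)           # embedding table, shape (I, K)

    # Textualization: each coefficient becomes a fixed-precision decimal string
    # that the LLM's native tokenizer will split into tokens.
    E_text = [[f"{v:.{decimals}f}" for v in row] for row in E]
    return E, E_text, bases, (mu, sigma)

The returned bases and (mu, sigma) statistics are reused by the inverse decoding sketch in Section 4.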

3. LLM Fine-Tuning and Prompt Engineering

  • Prompt Structure: Each embedding vector $E_i$ is permuted via a random permutation $\pi$ and converted into a fill-in-the-middle text prompt (a code sketch of this construction follows the list below):

$$\text{Input: } \bigcirc_{k=1}^{K} \big[\text{value}_{\pi(k)} \text{ is [blank],}\big]\; [\text{sep}] \quad \text{Target: } \bigcirc_{k=1}^{K} \big[\, e_{i,\pi(k)} \text{ [answer]}\, \big]$$

where $\bigcirc$ denotes string concatenation. Randomization over $\pi$ mitigates position bias.

  • Optimization Objective: LLM parameters $\theta$ are optimized to minimize cross-entropy over the combined prompt+target token sequence $(\tau_1, \dots, \tau_T)$:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\!\left( \tau_t \mid \tau_{<t} \right)$$

  • LoRA Adaptation: Fine-tuning is parameter-efficient, restricting weight updates to low-rank factors (a setup sketch appears after the next paragraph):

$$W \leftarrow W_0 + \Delta W, \quad \Delta W = AB, \quad A \in \mathbb{R}^{d \times r},\; B \in \mathbb{R}^{r \times d},\; r \ll d$$

Only $A$ and $B$ are tuned, with $W_0$ frozen.
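The fill-in-the-middle construction above can be sketched as follows; the slot labels, separator strings, and permutation handling are illustrative and may differ from SDForger's exact templates.

import random

def make_fim_prompt(coeff_strings, rng=random):
    # coeff_strings: one embedding row serialized as decimal strings,
    # e.g. ["0.1234", "-0.0567", "0.8901"].
    order = list(range(len(coeff_strings)))
    rng.shuffle(order)                     # random permutation pi mitigates position bias
    input_text = " ".join(f"value{k} is [blank]," for k in order) + " [sep]"
    target_text = " ".join(f"{coeff_strings[k]} [answer]" for k in order)
    return input_text, target_text

# Example output (one possible permutation):
#   Input : "value1 is [blank], value2 is [blank], value0 is [blank], [sep]"
#   Target: "-0.0567 [answer] 0.8901 [answer] 0.1234 [answer]"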

This workflow enables SDForger to adapt general-purpose LLMs with a minimal number of embedding instances, often as low as $I = 15$–$30$ rows, for high-quality time series synthesis (Rousseau et al., 21 May 2025).
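As an example of such parameter-efficient adaptation, the sketch below wires a small causal LLM to LoRA via Hugging Face PEFT; the base model, rank, and target modules are illustrative choices rather than SDForger's documented configuration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "gpt2"                                   # illustrative autoregressive base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"],   # attention projection in GPT-2
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)            # W0 frozen; only A, B are trainable
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4)

def training_step(input_text, target_text):
    # Cross-entropy over the concatenated prompt+target token sequence.
    tokens = tokenizer(input_text + " " + target_text, return_tensors="pt")
    out = model(**tokens, labels=tokens["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()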

4. Synthetic Sequence Sampling and Inverse Mapping

  • Autoregressive Sampling: At inference, SDForger populates a template with $K$ blanks and prompts the LLM to fill them stepwise in token space, sampling via top-$k$ (the $k$ highest-probability tokens) or nucleus (top-$p$ cumulative probability) selection.
  • Inverse Decoding: Generated text $\mathcal{G}$ is parsed to extract feature–value pairs $\{ (\pi(k), \tilde a_{\pi(k)}) \}$, which are mapped back to the real domain by reconstructing the original channels with the corresponding basis expansion:

$$\tilde X^c(t) = \sum_{j=1}^{k_c} \tilde e^c_j\, b^c_j(t)$$

Concatenation across all channels yields a synthetic $\tilde X \in \mathbb{R}^{C \times L}$.

  • Post-processing Filters: Outputs are filtered to remove NaNs, duplicates, or $\ell_2$-norm outliers before acceptance.

This process yields synthetic sequences with statistics and dynamics aligned to the original distribution.
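Continuing the PCA-based proxy from Section 2, a sketch of parsing and inverse decoding follows; for brevity it assumes the inference template lists the K slots in canonical order so values can be read off sequentially, and the regex and outlier threshold are illustrative.

import re
import numpy as np

def parse_generated(text, K):
    # Extract the K decimal values emitted before each [answer] marker.
    values = re.findall(r"(-?\d+\.\d+)\s*\[answer\]", text)
    if len(values) != K:
        return None                                  # reject malformed generations
    return np.array([float(v) for v in values])

def decode_embedding(e_tilde, bases, norm, k_per_channel):
    # Map a coefficient vector back to a (C, L) series channel by channel.
    mu, sigma = norm
    channels, start = [], 0
    for c, pca in enumerate(bases):
        coeffs = e_tilde[start:start + k_per_channel]
        xn = pca.inverse_transform(coeffs[None, :])[0]       # normalized-space series
        channels.append(xn * sigma[0, c, :] + mu[0, c, :])   # undo standardization
        start += k_per_channel
    return np.stack(channels)

def accept(e_tilde, E_real, z=4.0):
    # Post-processing filter: drop NaNs and l2-norm outliers vs. the real table.
    if e_tilde is None or np.any(np.isnan(e_tilde)):
        return False
    norms = np.linalg.norm(E_real, axis=1)
    return abs(np.linalg.norm(e_tilde) - norms.mean()) <= z * norms.std()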

5. Statistical Fidelity and Downstream Use

SDForger’s fidelity is established by:

  • Feature-based metrics: MDD (marginal), skewness, kurtosis, and ACD (autocorrelation differences).
  • Distance-based metrics: Euclidean distance, DTW, and SHR via shift-invariant dictionary learning.
  • Multivariate Structure: Cross-covariances between channels and time-lagged dependencies are computed:

$$\operatorname{Cov}\!\left( \tilde X^{c}(t), \tilde X^{c'}(t+\tau) \right) \approx \operatorname{Cov}\!\left( X^{c}(t), X^{c'}(t+\tau) \right)$$

For downstream forecasting, TTM is evaluated in four regimes: (1) zero-shot, (2) trained on real data only, (3) trained on synthetic data only, or (4) trained on real plus synthetic data, with accuracy quantified on held-out test sets (RMSE, MAPE, MASE). Results show that synthetic data from SDForger alone often matches the utility of real data, with mixed training yielding further gains.
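The cross-covariance check above can be estimated as in the short sketch below (arrays of shape (instances, channels, length); the function name and lag handling are ours):

import numpy as np

def lagged_cross_cov(X, c1, c2, tau):
    # Average Cov(X^{c1}(t), X^{c2}(t + tau)) over instances and valid timestamps.
    a = X[:, c1, :X.shape[2] - tau]
    b = X[:, c2, tau:]
    a = a - a.mean()
    b = b - b.mean()
    return float(np.mean(a * b))

# Fidelity check for channels 0 and 1 at lag 3 (smaller gap is better):
# gap = abs(lagged_cross_cov(X, 0, 1, 3) - lagged_cross_cov(X_tilde, 0, 1, 3))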

6. Comparative Perspective and Algorithmic Features

Compared to GANs and VAEs, SDForger:

  • Leverages LLM adaptation via textual prompts, not from-scratch training.
  • Handles long sequences in $O(K)$ time, agnostic to sequence length $L$.
  • Supports few-shot adaptation with minimal data.
  • Enables textual conditioning for semantic or channel-specific control (e.g., “Condition: data is temperature”).

Algorithmic summaries:

  • Fine-tuning:

# Input: Real time series X, basis {b^c_j}, LLM θ₀.
# 1. Segment into windows X_i.
# 2. Compute embeddings e_{ij}^c = ∫ X_i^c(t) b_j^c(t) dt.
# 3. Form embedding table E∈ℝ^{I×K}.
# 4. Construct prompts P_i^{FT} (fill-in-the-middle).
# 5. Fine-tune θ on {P_i^{FT}}, minimizing cross-entropy.
# Output: Fine-tuned θ*.

  • Generation:

# Input: θ*, desired sample count Ĩ.
# while |generated| < Ĩ:
#     a. Sample K-blanks template P^{INF}.
#     b. Autoregressively sample tokens via top-k/top-p.
#     c. Parse text → \tilde e ∈ ℝ^K.
#     d. Filter outputs (NaNs, duplicates, l2 outliers).
#     e. Decode \tilde e → \tilde X via basis sum.
#     f. Append \tilde X to outputs.
# Output: { \tilde X_j }_{j=1}^{Ĩ}.

7. Constraints, Extensions, and Outlook

Limitations include the need for a moderate embedding dimension $K$ (e.g., $K \leq 25$) relative to sample size to prevent overfitting and unstable synthesis. Excessive $K$ slows LLM convergence and degrades sample quality.

Potential extensions:

  • Encoder-only LLMs (e.g., BERT) with masking for imputation.
  • Adaptive data-driven basis selection ($k_c$ per channel).
  • Joint time series–text pretraining for richer multimodal generation (e.g., incorporating event annotations).
  • Feeding synthetic data back into LLM pretraining to enhance in-context forecasting ability (Rousseau et al., 21 May 2025).

SDForger establishes a workflow for high-fidelity, few-shot, multimodally conditioned synthetic time series generation, leveraging the infrastructure and text reasoning of modern LLMs.

References

  1. Rousseau et al., 21 May 2025.