SDForger Synthetic Time Series Framework
- SDForger is a framework that generates synthetic multivariate time series by transforming data into textualized embeddings for LLM-based synthesis.
- It employs fill-in-the-middle prompts and LoRA fine-tuning to efficiently learn from limited samples while maintaining the original data's statistical properties.
- The framework achieves high statistical fidelity as measured by metrics such as MDD, ACD, and DTW, and the synthetic data enhances downstream forecasting performance.
SDForger is a framework for generating synthetic multivariate time series utilizing LLMs with a compact tabular embedding that enables efficient, high-fidelity data synthesis. The central innovation is transforming both univariate and multivariate time series into textualized embeddings, making it possible to fine-tune autoregressive LLMs—even with limited computational resources and few real instances—and conditionally generate new samples that preserve the target data’s statistical and temporal structure (Rousseau et al., 21 May 2025).
1. Problem Formulation and Evaluation Metrics
The setting assumes an observed multivariate time series dataset $X = \{X_i\}_{i=1}^{I}$ with $C$ channels and sequences of length $T$, i.e., $X_i \in \mathbb{R}^{C \times T}$. The generative task is to create new instances $\tilde{X}_j \in \mathbb{R}^{C \times T}$ matching the joint distribution of $X$.
Distributional similarity is quantified primarily via:
- Marginal Distribution Difference (MDD): the discrepancy between the empirical CDFs $\hat{F}_{X}$ and $\hat{F}_{\tilde{X}}$ of real and synthetic values.
- Autocorrelation Difference (ACD): the discrepancy between the autocorrelation functions $\rho_{X}$ and $\rho_{\tilde{X}}$ of real and synthetic series.
Additional distance-based metrics employed include Euclidean Distance (ED), Dynamic Time Warping (DTW), and shapelet-based reconstruction error (SHR).
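As a concrete illustration, the following NumPy sketch computes MDD as the mean absolute gap between empirical CDFs and ACD as the mean absolute gap between autocorrelation functions; the exact aggregation and lag range used in the paper are assumptions here.

```python
import numpy as np

def empirical_cdf(samples, grid):
    """Empirical CDF of `samples` evaluated on `grid`."""
    return np.searchsorted(np.sort(samples), grid, side="right") / len(samples)

def mdd(real, synth, n_grid=200):
    """Mean absolute difference between empirical CDFs (assumed MDD form)."""
    grid = np.linspace(min(real.min(), synth.min()),
                       max(real.max(), synth.max()), n_grid)
    return np.abs(empirical_cdf(real.ravel(), grid)
                  - empirical_cdf(synth.ravel(), grid)).mean()

def autocorr(x, max_lag):
    """Autocorrelation of a 1-D series at lags 1..max_lag."""
    x = x - x.mean()
    denom = (x * x).sum()
    return np.array([(x[:-lag] * x[lag:]).sum() / denom
                     for lag in range(1, max_lag + 1)])

def acd(real, synth, max_lag=20):
    """Mean absolute difference between autocorrelation functions (assumed ACD form)."""
    return np.abs(autocorr(real, max_lag) - autocorr(synth, max_lag)).mean()
```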
Forecasting accuracy is evaluated by training predictive models—such as Tiny Time Mixer (TTM)—on generated data, real data, or combinations. Performance is measured with standard error metrics (RMSE, MAPE, MASE). This multi-faceted evaluation benchmarks both statistical similarity and utility for downstream forecasting (Rousseau et al., 21 May 2025).
2. Embedding Construction and Data Representation
Each channel $X_i^c$ is modeled as a real-valued function over the time interval $[0, T]$. Key steps in constructing embeddings are:
- Functional Basis Projection: $J$ basis functions $\{b_j^c\}_{j=1}^{J}$ (usually via Functional PCA or FastICA) are selected per channel. Embedding coefficients
$$e_{ij}^{c} = \int X_i^{c}(t)\, b_j^{c}(t)\, dt$$
are concatenated over channels to form an embedding vector $e_i \in \mathbb{R}^{K}$ with $K = C \cdot J$.
- Normalization: Each sequence is mean-standardized at each timestamp ($X_i^c(t) \mapsto (X_i^c(t) - \mu^c(t))/\sigma^c(t)$), where $\mu^c(t)$ and $\sigma^c(t)$ are computed across all instances.
- Tokenization: Each real-valued coefficient is mapped to a fixed-precision decimal string (e.g., “0.1234”) and tokenized using the LLM’s vocabulary, relying solely on the model's native BPE or byte-level encoding.
No further quantization is introduced beyond the parameterization defined by the basis decomposition and LLM tokenizer.
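A minimal sketch of the embedding step, using scikit-learn PCA on the sampled curves as a stand-in for functional PCA; the basis count, normalization constant, and decimal precision are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def channel_embeddings(X_c, n_basis=4):
    """X_c: (I, T) array of one channel across I instances.
    Returns (I, n_basis) coefficients, the fitted basis model, and the stats."""
    # Per-timestamp standardization across instances.
    mu, sigma = X_c.mean(axis=0), X_c.std(axis=0) + 1e-8
    Z = (X_c - mu) / sigma
    # PCA on the sampled curves approximates a functional-PCA projection.
    pca = PCA(n_components=n_basis)
    coeffs = pca.fit_transform(Z)          # e_{ij}^c
    return coeffs, pca, (mu, sigma)

def textualize(coeffs, precision=4):
    """Render each coefficient as a fixed-precision decimal string
    (the precision is an assumption; SDForger's exact format may differ)."""
    return [[f"{v:.{precision}f}" for v in row] for row in coeffs]
```

Concatenating the per-channel coefficient blocks row-wise yields the embedding table of width $K = C \cdot J$ referenced below.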
3. LLM Fine-Tuning and Prompt Engineering
- Prompt Structure: Each embedding vector $e_i$ is permuted via a random permutation $\pi$ and converted into a fill-in-the-middle text prompt
$$P_i^{FT} = s_{\pi(1)} \oplus s_{\pi(2)} \oplus \cdots \oplus s_{\pi(K)},$$
where each $s_k$ is a textualized feature–value pair and $\oplus$ denotes string concatenation. Randomization over $\pi$ mitigates position bias.
- Optimization Objective: LLM parameters $\theta$ are optimized to minimize cross-entropy over the combined prompt+target token sequence $w_{1:N}$:
$$\mathcal{L}(\theta) = -\sum_{n=1}^{N} \log p_{\theta}(w_n \mid w_{<n}).$$
- Low-Rank Adapters (LoRA): For efficient adaptation, LoRA adapters are inserted within each Transformer block:
$$W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k).$$
Only $A$ and $B$ are tuned, with $W_0$ frozen.
This workflow enables SDForger to adapt general-purpose LLMs with a minimal number of embedding instances, often only a few dozen ($\sim 30$) rows, for high-quality time series synthesis (Rousseau et al., 21 May 2025).
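The sketch below illustrates how an embedding row might be rendered as a permuted, fill-in-the-middle style prompt and how LoRA adapters could be attached with Hugging Face peft; the template wording, feature names, base model, and LoRA hyperparameters are assumptions, not the paper's exact configuration.

```python
import random
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

def fim_prompt(values, feature_names):
    """Build a fill-in-the-middle style prompt from one embedding row.
    The exact template is an assumption; feature order is randomly permuted."""
    order = list(range(len(values)))
    random.shuffle(order)                      # mitigates position bias
    parts = [f"{feature_names[k]} is {values[k]}" for k in order]
    return ", ".join(parts)

# Example: one row with two hypothetical features.
prompt = fim_prompt(["0.1234", "-0.5678"], ["e_1_ch1", "e_2_ch1"])

# Illustrative base model; SDForger can adapt any autoregressive LLM.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"],   # GPT-2 attention projection
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)       # only the adapter weights A, B train
model.print_trainable_parameters()
```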
4. Synthetic Sequence Sampling and Inverse Mapping
- Autoregressive Sampling: At inference, SDForger populates a template with $K$ blanks and prompts the LLM to fill them stepwise in token space, sampling via top-$k$ (fixed number of highest-probability tokens) or nucleus (top-$p$ cumulative probability) selection.
- Inverse Decoding: Generated text is parsed to extract feature–value pairs $(k, \tilde{e}_k)$, which are mapped back to the real domain by reconstructing each channel through the corresponding basis expansion:
$$\tilde{X}^{c}(t) = \sum_{j=1}^{J} \tilde{e}_{j}^{c}\, b_{j}^{c}(t).$$
Concatenation across all channels yields a synthetic instance $\tilde{X} \in \mathbb{R}^{C \times T}$.
- Post-processing Filters: Outputs are filtered to remove NaNs, duplicates, and $\ell_2$-norm outliers before acceptance.
This process yields synthetic sequences with statistics and dynamics aligned to the original distribution.
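A hedged sketch of the inverse mapping and acceptance filters, continuing the PCA-based embedding example from Section 2; the parsing format and the outlier rule are assumptions.

```python
import re
import numpy as np

def parse_embedding(text, feature_names):
    """Extract 'feature is value' pairs from generated text (format assumed)."""
    coeffs = np.full(len(feature_names), np.nan)
    for i, name in enumerate(feature_names):
        m = re.search(rf"{re.escape(name)} is (-?\d+\.\d+)", text)
        if m:
            coeffs[i] = float(m.group(1))
    return coeffs

def decode(coeffs_per_channel, pca_per_channel, stats_per_channel):
    """Map coefficients back to the time domain via the basis expansion."""
    channels = []
    for e_c, pca, (mu, sigma) in zip(coeffs_per_channel, pca_per_channel,
                                     stats_per_channel):
        z = pca.inverse_transform(e_c.reshape(1, -1))[0]   # sum_j e_j^c b_j^c(t)
        channels.append(z * sigma + mu)                    # undo standardization
    return np.stack(channels)                              # shape (C, T)

def accept(coeffs, table, z_thresh=3.0):
    """Reject NaNs, duplicates, and l2-norm outliers relative to the real table."""
    if np.isnan(coeffs).any() or any(np.allclose(coeffs, row) for row in table):
        return False
    norms = np.linalg.norm(table, axis=1)
    return abs(np.linalg.norm(coeffs) - norms.mean()) <= z_thresh * norms.std()
```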
5. Statistical Fidelity and Downstream Use
SDForger’s fidelity is established by:
- Feature-based metrics: MDD (marginal), skewness, kurtosis, and ACD (autocorrelation differences).
- Distance-based metrics: Euclidean distance, DTW, and SHR via shift-invariant dictionary learning.
- Multivariate Structure: Cross-channel and time-lagged dependencies are assessed by comparing cross-covariances such as $\mathrm{Cov}\!\left(X^{c}_{t}, X^{c'}_{t+\tau}\right)$ between real and synthetic data (see the sketch below).
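A minimal sketch of the lagged cross-covariance comparison; the chosen lags and the averaging over channel pairs are assumptions.

```python
import numpy as np

def lagged_cross_cov(x_c, x_cp, lag):
    """Cov(X^c_t, X^{c'}_{t+lag}) estimated over a single sequence."""
    a = x_c[:-lag] if lag > 0 else x_c
    b = x_cp[lag:] if lag > 0 else x_cp
    return np.cov(a, b)[0, 1]

def cross_cov_gap(real, synth, lags=(0, 1, 5)):
    """Mean absolute gap in lagged cross-covariances over channel pairs."""
    C = real.shape[0]
    gaps = [abs(lagged_cross_cov(real[c], real[d], L)
                - lagged_cross_cov(synth[c], synth[d], L))
            for c in range(C) for d in range(C) for L in lags]
    return float(np.mean(gaps))
```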
For downstream forecasting, TTM is evaluated in four regimes: (1) zero-shot, (2) trained on real data only, (3) trained on synthetic data only, or (4) trained on real+synthetic data, with accuracy quantified on held-out test sets (RMSE, MAPE, MASE). Results show that synthetic data from SDForger alone often matches the utility of real data, with mixed training yielding further gains.
6. Comparative Perspective and Algorithmic Features
Compared to GANs and VAEs, SDForger:
- Leverages LLM adaptation via textual prompts, not from-scratch training.
- Handles long sequences, since the embedding size is agnostic to the sequence length $T$.
- Supports few-shot adaptation with minimal data.
- Enables textual conditioning for semantic or channel-specific control (e.g., “Condition: data is temperature”).
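For example, textual conditioning can be implemented by prefixing the prompt with a condition string; the snippet below is illustrative and reuses the hypothetical `fim_prompt` helper from the sketch in Section 3.

```python
# Hypothetical conditioned prompt; the exact conditioning syntax is an assumption.
condition = "Condition: data is temperature. "
prompt = condition + fim_prompt(["0.1234", "-0.5678"], ["e_1_ch1", "e_2_ch1"])
```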
Algorithmic summaries:
- Fine-tuning:
```python
# Input: real time series X, basis {b^c_j}, LLM θ₀.
# 1. Segment into windows X_i.
# 2. Compute embeddings e_{ij}^c = ∫ X_i^c(t) b_j^c(t) dt.
# 3. Form embedding table E ∈ ℝ^{I×K}.
# 4. Construct fill-in-the-middle prompts P_i^{FT}.
# 5. Fine-tune θ on {P_i^{FT}}, minimizing cross-entropy.
# Output: fine-tuned θ*.
```
- Generation:
```python
# Input: fine-tuned θ*, desired sample count Ũ.
# while |generated| < Ũ:
#   a. Sample a K-blanks template P^{INF}.
#   b. Autoregressively sample tokens via top-k / top-p.
#   c. Parse text → ẽ ∈ ℝ^K.
#   d. Filter outputs (NaNs, duplicates, ℓ2 outliers).
#   e. Decode ẽ → X̃ via the basis sum.
#   f. Append X̃ to the outputs.
# Output: {X̃_j}, j = 1, …, Ũ.
```
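A more concrete, hedged version of the generation loop using the standard transformers sampling interface; the template text, stopping rule, and helper functions (`parse_embedding`, `accept`) come from the earlier sketches and are assumptions rather than the paper's exact implementation.

```python
import numpy as np

def generate_embeddings(model, tok, template, feature_names, n_samples,
                        top_k=50, top_p=0.95, max_new_tokens=128):
    """Assumed inference loop: prompt with a blanks template, sample with
    top-k / nucleus sampling, then parse and filter until enough rows are kept."""
    accepted = []
    inputs = tok(template, return_tensors="pt")
    while len(accepted) < n_samples:
        out = model.generate(**inputs, do_sample=True, top_k=top_k, top_p=top_p,
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tok.eos_token_id)
        text = tok.decode(out[0], skip_special_tokens=True)
        coeffs = parse_embedding(text, feature_names)  # sketch from Section 4
        if not np.isnan(coeffs).any():  # full filtering would also apply accept()
            accepted.append(coeffs)
    return np.stack(accepted)
```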
7. Constraints, Extensions, and Outlook
Limitations include the need for a moderate embedding dimension $K$ relative to the sample size to prevent overfitting and unstable synthesis. Excessive $K$ slows LLM convergence and degrades sample quality.
Potential extensions:
- Encoder-only LLMs (e.g., BERT) with masking for imputation.
- Adaptive, data-driven basis selection (choosing the number of basis functions per channel).
- Joint time series–text pretraining for richer multimodal generation (e.g., incorporating event annotations).
- Feeding synthetic data back into LLM pretraining to enhance in-context forecasting ability (Rousseau et al., 21 May 2025).
SDForger establishes a workflow for high-fidelity, few-shot, multimodally conditioned synthetic time series generation, leveraging the infrastructure and text reasoning of modern LLMs.