Duration-Based Autoregressive Synthesizer
- The synthesizer is a generative model that explicitly controls time intervals via parametric autoregressive recursions.
- It uses innovation densities such as gamma, exponential, and log-symmetric distributions to accurately capture empirical inter-event times.
- Applications span financial econometrics, queueing analysis, and neural TTS, using maximum likelihood and simulation for effective duration control.
A duration-based autoregressive synthesizer is a generative model whose fundamental building block is the stochastic evolution of interval durations, with each new duration specified by a parametric autoregressive recursion and sampled from a nonnegative distribution—often conditionally, by quantile or mean—whose form is chosen to fit the empirical inter-event times. This paradigm spans financial econometrics (ACD models, quantile log-symmetric forms), queueing analysis (GAS-based generalized gamma duration models), and modern deep generative models requiring explicit duration control (e.g., encoder-decoder autoregressive neural network architectures in TTS). The shared property is the direct modeling or synthesis of durations, either as time intervals between discrete events or as acoustic frame counts interpreted as time.
1. Core Mathematical Formulations
Duration-based autoregressive models specify each duration as a function of past durations and latent process parameters. In high-frequency financial series, the general ACD model is given by

$$x_i = \psi_i \varepsilon_i, \qquad \psi_i = \omega + \alpha x_{i-1} + \beta \psi_{i-1},$$

where $\varepsilon_i$ is an i.i.d. innovation from a nonnegative distribution, and $\psi_i = \mathbb{E}[x_i \mid \mathcal{F}_{i-1}]$ is the conditional mean duration (Yan, 2021). Quantile ACD models further extend this by focusing on specific conditional percentiles, modeling $q_i = Q_\tau(x_i \mid \mathcal{F}_{i-1})$, with $\mathcal{F}_{i-1}$ the information set of past durations and $\Theta$ the collection of parameters. The recursion for the quantile version (QLS-ACD) is typically

$$g(q_i) = \omega + \alpha\, g(x_{i-1}) + \beta\, g(q_{i-1}),$$

where $g$ is a monotonic link, e.g., $g(x) = \log x$ (Saulo et al., 2023).
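To make the recursion concrete, here is a minimal simulation sketch of an ACD(1,1) process with unit-mean gamma innovations; the parameter values and the function name `simulate_acd` are illustrative, not taken from the cited papers.

```python
# Minimal sketch: simulate an ACD(1,1) process, x_i = psi_i * eps_i,
# with unit-mean gamma innovations. All parameter values are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def simulate_acd(n, omega=0.1, alpha=0.1, beta=0.8, kappa=2.0):
    """psi_i = omega + alpha * x_{i-1} + beta * psi_{i-1}."""
    psi = np.empty(n)
    x = np.empty(n)
    psi[0] = omega / (1.0 - alpha - beta)       # unconditional mean as start-up
    # unit-mean gamma innovations: shape kappa, scale 1/kappa
    eps = rng.gamma(shape=kappa, scale=1.0 / kappa, size=n)
    x[0] = psi[0] * eps[0]
    for i in range(1, n):
        psi[i] = omega + alpha * x[i - 1] + beta * psi[i - 1]
        x[i] = psi[i] * eps[i]
    return x, psi

durations, cond_means = simulate_acd(10_000)
print(durations.mean(), cond_means.mean())      # both near omega/(1-alpha-beta)
```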
In deep neural generative settings, as in duration-controlled autoregressive TTS, each output token corresponds to a fixed unit duration (e.g., 20 ms per Encodec frame) and is generated by a Transformer decoder conditioned on a target duration via position-encoding mechanisms (Peng et al., 26 May 2025).
2. Distributional Choices and Innovation Densities
The choice of innovation density is determined by the empirical fit to observed durations. Exponential, Weibull, and especially gamma laws are standard:

$$f(\varepsilon; \kappa, \theta) = \frac{\varepsilon^{\kappa-1} e^{-\varepsilon/\theta}}{\Gamma(\kappa)\,\theta^{\kappa}}, \qquad \varepsilon > 0,$$

with $\kappa > 1$ corresponding to increasing hazard rates, and $\kappa$ controlling dispersion (Yan, 2021). For quantile-based models, log-symmetric distributions reparametrized by their $\tau$-quantile—such as the log-normal, with density-generator kernel $g(u) \propto e^{-u/2}$—enable direct conditional quantile modeling, crucial for applications requiring explicit risk measures.
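The quantile reparametrization can be checked numerically. A brief sketch, assuming a log-normal kernel (standard normal $Z$): the construction $X = q\, e^{\sigma(Z - z_\tau)}$ places the $\tau$-quantile of $X$ exactly at $q$.

```python
# Numerical check of the quantile reparametrization for a log-normal kernel:
# if X = q * exp(sigma * (Z - z_tau)) with Z ~ N(0,1), z_tau = Phi^{-1}(tau),
# then the tau-quantile of X equals q. Values of q, sigma, tau are arbitrary.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
q, sigma, tau = 2.0, 0.7, 0.25
z_tau = norm.ppf(tau)
x = q * np.exp(sigma * (rng.standard_normal(200_000) - z_tau))
print(np.quantile(x, tau))   # ~ 2.0
```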
Generalized gamma distributions support more flexible skew and tail behavior in queueing models. The GAS(1,1) model uses the score of the gamma density to update its log-scale parameter $f_t = \log \theta_t$:

$$s_t = \frac{\partial \log p(x_t \mid f_t)}{\partial f_t} = \frac{x_t}{e^{f_t}} - \kappa$$

and

$$f_{t+1} = \omega + \alpha s_t + \beta f_t,$$

so that durations can follow seasonal and autocorrelated patterns (Tomanová et al., 2020).
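A minimal sketch of this score-driven recursion follows; the unit score scaling $s_t = x_t e^{-f_t} - \kappa$ and the parameter values are illustrative assumptions, not the exact specification of Tomanová et al. (2020).

```python
# Minimal sketch of a score-driven (GAS(1,1)) gamma duration recursion.
# The log-scale f_t of a Gamma(kappa, exp(f_t)) density is updated by its
# own score; parameter values are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)

def simulate_gas_gamma(n, omega=0.05, alpha=0.08, beta=0.9, kappa=1.5):
    f = np.empty(n)                  # log-scale parameter f_t = log(theta_t)
    x = np.empty(n)
    f[0] = omega / (1.0 - beta)      # fixed point of the recursion at s_t = 0
    for t in range(n):
        theta = np.exp(f[t])
        x[t] = rng.gamma(shape=kappa, scale=theta)
        if t + 1 < n:
            score = x[t] / theta - kappa        # d log p(x_t | f_t) / d f_t
            f[t + 1] = omega + alpha * score + beta * f[t]
    return x, f
```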
3. Estimation, Recursion, and Simulation Algorithms
Model parameters (recursion coefficients $\omega, \alpha, \beta$ and distributional parameters such as $\kappa$ or $\sigma$) are estimated via maximum likelihood, with log-likelihood functions conforming to the chosen density,

$$\ell(\Theta) = \sum_i \log f\!\left(x_i \mid \psi_i(\Theta)\right),$$

combined with recursive updating of the conditional statistic (mean or quantile). Fitting is typically performed by quasi-Newton optimizers (e.g., BFGS), with standard errors obtained from numerical Hessians at the optimum (Saulo et al., 2023).
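As a sketch of such a fitting loop, the following estimates an ACD(1,1) with exponential innovations via BFGS, using a log-parametrization to keep coefficients positive; this is an illustrative setup, not the estimator of the cited papers.

```python
# Minimal sketch of ML estimation for an ACD(1,1) with exponential
# innovations. Positivity of (omega, alpha, beta) is enforced by optimizing
# over their logs; the start-up value psi[0] = mean(x) is an assumption.
import numpy as np
from scipy.optimize import minimize

def neg_loglik(params, x):
    omega, alpha, beta = np.exp(params)          # positivity via log-transform
    psi = np.empty_like(x)
    psi[0] = x.mean()                            # start-up value
    for i in range(1, len(x)):
        psi[i] = omega + alpha * x[i - 1] + beta * psi[i - 1]
    # exponential innovations: log f(x_i | psi_i) = -log psi_i - x_i / psi_i
    return np.sum(np.log(psi) + x / psi)

# durations: any positive array, e.g. from the ACD simulator in Section 1
# res = minimize(neg_loglik, x0=np.log([0.1, 0.1, 0.8]), args=(durations,),
#                method="BFGS")
# np.exp(res.x) recovers (omega, alpha, beta); res.hess_inv approximates the
# inverse Hessian used for standard errors.
```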
Simulation ("synthesis") proceeds by recursively evaluating the conditional expectation or quantile, drawing innovations from the specified law, and updating durations. For QLS-ACD, synthetic durations are generated by inversion: where is sampled from the kernel CDF and denotes the quantile constant (Saulo et al., 2023).
For neural TTS, the decoder generates each token autoregressively, with duration enforced by rotary position embeddings acting on the relative progress through the target utterance (Peng et al., 26 May 2025).
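To illustrate the underlying idea, here is a minimal sketch of a duration-aware rotary position scheme: angles are indexed by relative progress toward a target token count, so positions take fractional values and one angle schedule covers any target duration. The function name and the scaling scheme are illustrative assumptions, not the exact PM-RoPE of Peng et al. (2025).

```python
# Sketch of progress-scaled rotary angles: absolute step -> fractional
# position in [0, max_pos] by relative progress toward target_len, then
# standard RoPE frequencies. All names and constants are assumptions.
import numpy as np

def progress_rope_angles(step, target_len, dim, max_pos=2048, base=10000.0):
    """Return one rotary angle per frequency pair for the current step."""
    pos = max_pos * step / target_len            # fractional rotary position
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos * inv_freq

# e.g. halfway through a 1000-token target utterance:
angles = progress_rope_angles(step=500, target_len=1000, dim=64)
```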
4. Duration Control, Extrapolation, and Applications
Duration-based autoregressive synthesizers are uniquely suited for applications requiring direct control over temporal properties. In financial microstructure, quantile-based ACD enables modeling extreme event intervals and robust value-at-risk (VaR) estimation (Saulo et al., 2023). In queueing systems, GAS durations reproduce clustering observed in real-world arrivals, outperforming Poisson approximations and enabling more accurate performance measures (Tomanová et al., 2020).
Autoregressive neural synthesizers such as VoiceStar offer explicit duration control and extrapolation: at inference, the target duration is specified in acoustic tokens, and PM-RoPE position encodings ensure the network auto-terminates at the desired output length. Extrapolation far beyond the training context is achieved by fractional rotary angles, with empirical results showing robust intelligibility up to 50 s—far beyond previous AR/NAR models (Peng et al., 26 May 2025).
5. Model Diagnostics and Goodness-of-Fit
Diagnostics for duration-based synthesizers focus on evaluating residuals for exponentiality and independence; Cox–Snell residuals are commonly used:

$$r_i = -\log S\!\left(x_i \mid \hat\Theta\right),$$

with $S$ the fitted survival function. Adequacy is assessed by QQ-plots versus Exp(1), moment checks (mean $\approx 1$, variance $\approx 1$, skewness $\approx 2$), and autocorrelation tests (Saulo et al., 2023).
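A brief diagnostics sketch, shown for exponential innovations (where $-\log S(x_i \mid \psi_i) = x_i/\psi_i$); the helper names are illustrative.

```python
# Minimal sketch of Cox-Snell residual diagnostics: under a correct model,
# r_i = -log S(x_i | psi_i) behave like i.i.d. Exp(1) draws.
import numpy as np
from scipy.stats import probplot, expon

def cox_snell_exponential(x, psi_hat):
    """For exponential innovations: r_i = x_i / psi_i."""
    return x / psi_hat

def moment_checks(r):
    """Compare against Exp(1): mean ~ 1, variance ~ 1, skewness ~ 2."""
    m, v = r.mean(), r.var()
    skew = ((r - m) ** 3).mean() / v ** 1.5
    return m, v, skew

# r = cox_snell_exponential(durations, psi_hat)
# print(moment_checks(r))            # expect roughly (1, 1, 2)
# probplot(r, dist=expon)            # QQ-plot values against Exp(1)
```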
Model comparison typically relies on BIC, log-likelihood, and residual tests, with ACD(1,1)-Gamma models showing superior fit for financial durations:

| Innovation | Log-likelihood | BIC |
| --- | ---: | ---: |
| Exponential | −68,454 | 137,980 |
| Weibull | −20,289 | 89,435 |
| Gamma | **−12,733** | **73,465** |

Uniform residual tests reject the exponential and Weibull specifications but not the gamma, justifying the empirical choice (Yan, 2021).
6. Implementation, Limitations, and Future Directions
Implementation requires careful initialization and burn-in (e.g., discarding 500 points in QLS-ACD). Numerical inversion or rejection sampling may be needed for kernels lacking closed form, and discretization granularity in token-based models (e.g., 0.02 s per token in VoiceStar) bounds achievable duration accuracy (Saulo et al., 2023, Peng et al., 26 May 2025).
Current limitations include reliance on heuristic or external duration prediction in TTS, computational cost in large neural decoders, and potential loss in speaker similarity on extreme extrapolation (Peng et al., 26 May 2025). Plausible future directions are hierarchical/structured codecs and integrated duration predictors for generative tasks, as well as continued refinement of quantile-driven distributions for financial/queueing contexts.
7. Cross-Domain Significance and Unifying Properties
Across applications—from financial econometrics to deep speech synthesis—the duration-based autoregressive synthesizer provides a principled framework for modeling, predicting, and synthesizing time intervals between events, supporting both probabilistic inference and controlled generative workflows. The explicit modeling of duration, often conditionally and at arbitrary quantile levels, enables stress testing, long-horizon simulation, and risk assessment. Recent advances in neural autoregression and position encoding have further extended its utility to domains requiring precise, user-controlled durations and robust extrapolation (Saulo et al., 2023, Peng et al., 26 May 2025, Tomanová et al., 2020, Yan, 2021).