TarFlowLM: Continuous Latent Language Model

Updated 3 July 2025
  • TarFlowLM is a continuous latent variable language modeling framework that shifts from token-based autoregressive models to transformer-based normalizing flows.
  • It employs an encoder-decoder architecture with autoregressive priors and mixture-based coupling flows to achieve invertible density estimation and efficient sampling.
  • Its design supports bidirectional context integration, block-wise generation, and hierarchical multi-pass decoding, opening new avenues for language representation and editing.

TarFlowLM is a framework for language modeling that shifts the paradigm from discrete token-based autoregressive models to continuous latent variable modeling using transformer-based normalizing flows. It enables flexible, invertible modeling of language data in a continuous space, providing new generative and architectural capabilities while maintaining likelihood performance competitive with or superior to existing methods.

1. Continuous Latent Variable Modeling with Autoregressive Flows

TarFlowLM models language as a distribution over sequences of continuous latent variables $(z_1, \dots, z_T)$, with each $z_t \in \mathbb{R}^d$, rather than directly modeling token sequences $(x_1, \dots, x_T)$.

  • Encoder: $q(z_{1:T} \mid x_{1:T})$ maps each token $x_t$ to a Gaussian-distributed latent vector through a codebook.
  • Decoder: $p(x_{1:T} \mid z_{1:T})$ maps latent vectors back to token distributions, using a tied Bayesian parameterization that enables efficient density estimation and sampling.
  • Autoregressive Prior: $p(z_{1:T})$ is implemented via invertible normalizing flows parameterized by transformers, enabling tractable joint densities over latent sequences.

The training objective is the Evidence Lower Bound (ELBO) of a VAE, made tractable through the invertibility of the flows:

$$\mathcal{L}(x_{1:T}) = \mathbb{E}_{z_{1:T} \sim q(\cdot \mid x_{1:T})} \left[ \log p(x_{1:T} \mid z_{1:T}) + \log p(z_{1:T}) - \log q(z_{1:T} \mid x_{1:T}) \right]$$

The flow prior admits exact, tractable density evaluation and an invertible mapping between data and latent spaces, so the ELBO provides a tractable bound on the data likelihood.
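
As a rough illustration of how this single-sample objective could be evaluated, the sketch below uses a Gaussian codebook encoder, the tied Bayesian decoder described above, and a standard-normal stand-in for the transformer-flow prior; the names and sizes (`codebook`, `encode`, `prior_log_prob`, V, d, T) are illustrative assumptions, not the paper's implementation.

```python
import torch

# Toy sizes (assumptions for illustration only).
V, d, T = 50, 16, 8
codebook = torch.randn(V, d)   # one latent mean per vocabulary token
sigma = 1.0                    # shared encoder/decoder standard deviation

def encode(x):
    """q(z_{1:T} | x_{1:T}): a Gaussian centered at each token's codebook vector."""
    mu = codebook[x]                                    # (T, d)
    z = mu + sigma * torch.randn_like(mu)               # reparameterized sample
    log_q = torch.distributions.Normal(mu, sigma).log_prob(z).sum()
    return z, log_q

def decode_log_prob(x, z):
    """Tied Bayesian decoder: p(x_t = k | z_t) = N_k(z_t) / sum_j N_j(z_t)."""
    comp = torch.distributions.Normal(codebook.unsqueeze(0), sigma)
    logits = comp.log_prob(z.unsqueeze(1)).sum(-1)      # (T, V) Gaussian log-likelihoods
    return torch.log_softmax(logits, -1)[torch.arange(len(x)), x].sum()

def prior_log_prob(z):
    """Stand-in for the transformer-flow prior log p(z_{1:T}) (standard normal here)."""
    return torch.distributions.Normal(0.0, 1.0).log_prob(z).sum()

x = torch.randint(0, V, (T,))
z, log_q = encode(x)
elbo = decode_log_prob(x, z) + prior_log_prob(z) - log_q   # single-sample ELBO estimate
print(float(elbo))
```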

2. Flexible Architectural Capabilities

TarFlowLM’s architecture enables several capabilities that are not readily achievable in standard discrete autoregressive (AR) LLMs:

  • Global Bi-directional Context: By stacking flow layers with alternating AR directions (left-to-right, right-to-left), TarFlowLM allows the latent at each position to integrate information from both past and future tokens (see the sketch after this list).
  • Block-wise/Patched Generation: The model supports the generation and transformation of variable-sized blocks (patches) of latents, with dependencies handled both within and across blocks. Patch-wise modeling allows the joint generation of multiple tokens per step.
  • Hierarchical Multi-pass Generation: The stack of invertible flows naturally enables hierarchical, multi-pass generation. Intermediate latent representations can be decoded at any stage, producing outputs ranging from coarse to fine, with early layers performing global edits and later layers refining details.
  • Flexible Latent Vocabulary Size: The number of Gaussian components in mixture couplings (see Section 3) acts as an internal vocabulary size, tunable independently per layer and decoupled from the base data vocabulary.
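
As a minimal sketch (not the paper's architecture), the snippet below stacks causal autoregressive layers and flips the sequence between layers, so that after an even number of layers every position has been conditioned on both its left and right context; `CausalARFlowLayer` is a toy additive stand-in for the transformer-parameterized coupling layers.

```python
import torch
import torch.nn as nn

class CausalARFlowLayer(nn.Module):
    """Toy stand-in for one transformer-parameterized AR coupling layer:
    position t is transformed using only strictly-previous positions."""
    def __init__(self, d):
        super().__init__()
        self.net = nn.Linear(d, d)

    def forward(self, z):
        # Shift by one step so the update at t depends only on z_{<t}.
        ctx = torch.cat([torch.zeros_like(z[:, :1]), z[:, :-1]], dim=1)
        return z + self.net(ctx)   # additive coupling, invertible sequentially

class AlternatingFlowStack(nn.Module):
    """Alternate left-to-right and right-to-left passes by flipping the
    time axis on every other layer, giving bi-directional context overall."""
    def __init__(self, d, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(CausalARFlowLayer(d) for _ in range(n_layers))

    def forward(self, z):
        for i, layer in enumerate(self.layers):
            if i % 2 == 1:
                z = torch.flip(z, dims=[1])   # reverse time: right-to-left pass
            z = layer(z)
            if i % 2 == 1:
                z = torch.flip(z, dims=[1])   # restore original ordering
        return z

z = torch.randn(2, 8, 16)                     # (batch, T, d)
print(AlternatingFlowStack(16)(z).shape)
```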

3. Mixture-based Coupling Transformations

A primary innovation of TarFlowLM is the use of mixture-based coupling flows, which generalize affine normalizing flow transformations to capture the structured, multi-modal latent distributions introduced by language data.

  • Dimension-wise Mixture CDF Flow: Each latent dimension is modeled using a conditional mixture-of-Gaussians:

$$p(z_{t,i} \mid z_{<t}, z_{t,<i}) = \sum_{k=1}^{V} \pi[k]\, \mathcal{N}(z_{t,i}; m[k], \sigma^2[k])$$

The invertible transformation is:

$$u_{t,i} = \Phi^{-1}\left( F_{\text{mix-1}}(z_{t,i}; z_{<t}, z_{t,<i}) \right)$$

where $F_{\text{mix-1}}$ is the CDF of the mixture and $\Phi^{-1}$ is the inverse standard normal CDF.

  • Token-wise Mixture Rosenblatt Flow: Each token’s latent is modeled as a dd-dimensional mixture of Gaussians, with transformations performed via a sequential Rosenblatt transform.

$$p(z_t \mid z_{<t}) = \sum_{k=1}^{V} \pi_t(z_{<t})[k]\, \mathcal{N}(z_t; \mu_k, \sigma^2_k)$$

The transformation for the $i$-th dimension is:

$$u_i = \Phi^{-1}\left( F_i(z_i \mid z_{<i}) \right)$$

These mixture-coupling flows are crucial for modeling the highly structured, multi-modal densities of encoded text. Their invertibility and tractable Jacobians preserve the normalizing flow framework’s efficiency.
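
The sketch below implements the dimension-wise mixture CDF transform for a single scalar latent, with fixed mixture parameters in place of the transformer-predicted, context-dependent ones; the bisection routine is included only to illustrate that the map is invertible.

```python
import torch

def mixture_cdf(z, pi, comp):
    """F_mix(z) = sum_k pi[k] * Phi((z - m[k]) / sigma[k])."""
    return (pi * comp.cdf(z)).sum(-1)

def forward_flow(z, pi, comp):
    """u = Phi^{-1}(F_mix(z)): mixture CDF followed by the inverse standard
    normal CDF, so u is standard-normal if z follows the mixture."""
    return torch.special.ndtri(mixture_cdf(z, pi, comp))

def inverse_flow(u, pi, comp, lo=-20.0, hi=20.0, iters=60):
    """Invert the monotone map by bisection on F_mix(z) = Phi(u)."""
    target = torch.distributions.Normal(0.0, 1.0).cdf(u)
    lo, hi = torch.tensor(lo), torch.tensor(hi)
    for _ in range(iters):
        mid = (lo + hi) / 2
        below = mixture_cdf(mid, pi, comp) < target
        lo, hi = torch.where(below, mid, lo), torch.where(below, hi, mid)
    return (lo + hi) / 2

# Arbitrary stand-ins for the transformer-predicted mixture parameters (assumption).
V = 5
pi = torch.softmax(torch.randn(V), 0)
comp = torch.distributions.Normal(torch.randn(V), torch.rand(V) + 0.5)

z = torch.tensor(0.3)
u = forward_flow(z, pi, comp)
print(float(z), float(inverse_flow(u, pi, comp)))   # round trip recovers z
```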

4. Theoretical Relationship to Discrete AR LLMs

TarFlowLM generalizes conventional discrete autoregressive LMs. In the regime where the encoder and prior mixture components are tied to discrete token embeddings and their variances shrink to zero ($\sigma^2 \to 0$), the continuous framework recovers the standard cross-entropy training objective:

$$-\log \pi_t(z_{x_1}, \dots, z_{x_{t-1}})[x_t]$$

This establishes that conventional discrete AR LMs are a special case of TarFlowLM, reinforcing its theoretical foundation and interpretability.
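
A small numerical illustration of this limit, under simplifying assumptions (a single token position, fixed weights $\pi$ standing in for the context-dependent $\pi_t$, and encoder/prior components tied to a shared codebook with common variance $\sigma^2$): as $\sigma$ shrinks, the per-token negative ELBO approaches the cross-entropy $-\log \pi[x_t]$.

```python
import torch

torch.manual_seed(0)
V, d = 20, 2
codebook = torch.randn(V, d)                      # tied token embeddings (toy values)
log_pi = torch.log_softmax(torch.randn(V), 0)     # fixed stand-in for pi_t(.)

def neg_per_token_elbo(x, sigma):
    """Single-sample per-token negative ELBO with encoder and prior components
    tied to the same codebook and sharing variance sigma^2."""
    z = codebook[x] + sigma * torch.randn(d)                                  # z ~ q(z | x)
    log_N = torch.distributions.Normal(codebook, sigma).log_prob(z).sum(-1)   # all components
    log_q = log_N[x]                                       # encoder density at z
    log_prior = torch.logsumexp(log_pi + log_N, 0)         # log sum_k pi_k N_k(z)
    log_dec = log_N[x] - torch.logsumexp(log_N, 0)         # p(x | z) = N_x / sum_j N_j
    return -(log_dec + log_prior - log_q)

x = 7
for sigma in (1.0, 0.3, 0.05):
    gap = neg_per_token_elbo(x, sigma) - (-log_pi[x])      # distance to cross-entropy
    print(f"sigma={sigma}: |negELBO - CE| = {abs(gap.item()):.4f}")
```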

5. Empirical Results and Model Capabilities

Extensive experiments on standard benchmarks demonstrate the modeling power and flexibility of TarFlowLM:

| Model Type | Method | Text8 BPC (↓) | OpenWebText PPL (↓) |
|---|---|---|---|
| C/NF (continuous) | TarFlowLM Affine | ≤ 1.54 | ≤ 148.21 |
| C/NF | TarFlowLM Mix-1 | ≤ 1.37 | ≤ 27.11 |
| C/NF | TarFlowLM Mix-d | ≤ 1.30 | ≤ 22.64 |
| D/AR (discrete AR LM) | Transformer AR | 1.23 | 17.54 |
| D/Diffusion | MD4 | ≤ 1.37 | ≤ 22.13 |
  • Mixture-based coupling flows (Mix-1, Mix-d) provide substantial gains over affine flow baselines.
  • Block-wise generation and flexible patch sizes allow the model to generate two or more tokens per step while preserving coherence.
  • Hierarchical editing: Decoding intermediate latent states reveals a coarse-to-fine progression in linguistic quality.
  • Ablations confirm that mixture-based couplings are critical for high performance, particularly on multi-modal latent data distributions.

6. Methodological Formulas and Flow Construction

TarFlowLM relies on the following core mathematical constructions:

  • Change of Variables (for normalizing flows):

$$\log p(\mathbf{z}) = \log p_{\text{base}}(f(\mathbf{z})) + \log \left| \det J_f(\mathbf{z}) \right|$$

  • Autoregressive prior over latents:

$$p(z_{1:T}) = \prod_{t=1}^{T} p(z_t \mid z_{<t})$$

  • Decoder for token emission:

$$p(x_t = k \mid z_t) = \frac{\mathcal{N}_k(z_t)}{\sum_{j=1}^{V} \mathcal{N}_j(z_t)}$$

  • 1D Mixture Flow:

$$u_{t,i} = \Phi^{-1}\left( F_{\text{mix-1}}(z_{t,i}) \right)$$

$$\log \left| \frac{\partial u_{t,i}}{\partial z_{t,i}} \right| = \log p_{\text{mix-1}}(z_{t,i}) - \log \mathcal{N}(u_{t,i}; 0, 1)$$
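
As a quick numerical check of the last identity, the sketch below compares a central finite-difference estimate of $\log \left| \partial u_{t,i} / \partial z_{t,i} \right|$ against the analytic expression, using arbitrary stand-in mixture parameters in place of the transformer-predicted $\pi[k]$, $m[k]$, $\sigma[k]$.

```python
import torch

# Arbitrary stand-in mixture parameters (assumption, for illustration only).
V = 4
pi = torch.softmax(torch.randn(V), 0)
mean, std = torch.randn(V), torch.rand(V) + 0.5
comp = torch.distributions.Normal(mean, std)
std_normal = torch.distributions.Normal(0.0, 1.0)

def flow(z):
    """u = Phi^{-1}(F_mix(z)) for a scalar latent z."""
    return torch.special.ndtri((pi * comp.cdf(z)).sum())

z, h = torch.tensor(0.7), 1e-4
u = flow(z)
du_dz = (flow(z + h) - flow(z - h)) / (2 * h)                # central finite difference
analytic = torch.logsumexp(pi.log() + comp.log_prob(z), 0) - std_normal.log_prob(u)
print(float(du_dz.log()), float(analytic))                   # the two values should agree
```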

7. Significance and Future Directions

TarFlowLM defines a new class of autoregressive language models operating in an expressive, invertible continuous latent space, integrating transformer-based flows and mixture-based couplings for deep flexibility and bi-directional context integration. Its architecture supports modeling innovations (e.g., block-wise, hierarchical, multi-pass generation) and has direct theoretical connections to standard AR LMs.

A plausible implication is that this framework could enable new editing, sampling, and representation learning methods for language, leveraging the continuous, invertible structure of flows. Though not all directions are explored in the current work, the methodology suggests a wide, promising design space for future sequence modeling research.