Optimal layer depth for injecting the latent variable Z

Determine the Transformer block depth at which the latent random embedding Z should be injected in the Free Transformer (a decoder-only Transformer trained as a conditional variational autoencoder, with generation conditioned on Z), so as to balance the encoder’s capacity against the decoder’s ability to process the latent variables.

Background

The Free Transformer extends a decoder-only Transformer with a latent random variable Z that conditions generation, and is trained with a conditional VAE objective. To reduce overhead, Z is injected at a chosen layer (the middle layer in the experiments), so that half of the Transformer blocks are shared between the encoder and the decoder.
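The injection point can be pictured with a minimal Python sketch. It is not the paper's implementation: the function name `forward_with_injection`, the list-of-floats hidden state, the toy blocks, and the choice to add Z to the hidden state are all assumptions made for illustration. It only shows that the blocks before `inject_after` never see Z, while the blocks after it do.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def forward_with_injection(
    x: Vector,
    blocks: Sequence[Block],
    z: Vector,
    inject_after: int,
) -> Vector:
    """Run `blocks` in order, adding `z` after the first `inject_after` blocks.

    Additive injection is an assumption of this sketch, not a claim about
    how the Free Transformer combines Z with the hidden state.
    """
    if not 0 <= inject_after <= len(blocks):
        raise ValueError("inject_after must lie in [0, len(blocks)]")

    h = list(x)
    # Blocks before the injection point do not depend on Z.
    for block in blocks[:inject_after]:
        h = block(h)

    if len(z) != len(h):
        raise ValueError("z must match the hidden-state size at the injection point")
    h = [hi + zi for hi, zi in zip(h, z)]

    # Blocks after the injection point process the Z-conditioned state.
    for block in blocks[inject_after:]:
        h = block(h)
    return h


def double(h: Vector) -> Vector:
    """Toy stand-in for a Transformer block: doubles every entry."""
    return [2.0 * v for v in h]
```

With `inject_after = len(blocks) // 2`, this mirrors the middle-layer choice used in the experiments; other values of `inject_after` move the injection point earlier or later.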

The authors note a trade-off in where Z should be injected: injecting too early may reduce encoder capacity, while injecting too late may limit the decoder’s ability to use the latent variables. They explicitly state they did not investigate which depth is best, leaving the choice of injection depth unresolved.
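One way to frame the unresolved choice is as a sweep over candidate depths. The sketch below is hypothetical bookkeeping only, with assumed names (`InjectionSplit`, `candidate_splits`): for a model with a given number of blocks, it lists how many blocks sit before the injection point and how many remain to process Z. It does not train or evaluate anything, and it says nothing about which depth is best.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InjectionSplit:
    """Block budget on each side of a candidate injection depth (hypothetical helper)."""

    num_blocks: int    # total number of Transformer blocks
    inject_after: int  # Z is injected after this many blocks

    def __post_init__(self) -> None:
        if not 0 <= self.inject_after <= self.num_blocks:
            raise ValueError("inject_after must lie in [0, num_blocks]")

    @property
    def blocks_before_z(self) -> int:
        # Blocks that run before Z is available.
        return self.inject_after

    @property
    def blocks_after_z(self) -> int:
        # Blocks left to process the latent variables.
        return self.num_blocks - self.inject_after

    @property
    def shared_fraction(self) -> float:
        # Fraction of blocks before the injection point.
        return self.inject_after / self.num_blocks


def candidate_splits(num_blocks: int) -> List[InjectionSplit]:
    """Enumerate every interior injection depth for a model of `num_blocks` blocks."""
    return [InjectionSplit(num_blocks, k) for k in range(1, num_blocks)]
```

For example, `InjectionSplit(12, 6)` corresponds to a middle-layer injection in a 12-block model (the block count is chosen here only for illustration), with half of the blocks on each side.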

References

While we did not investigate what is the best depth to inject $Z$, doing it too early would reduce the encoder's capacity, and doing it too late would reduce the decoder's capacity to process the latent variables.

— The Free Transformer (2510.17558 - Fleuret, 20 Oct 2025) in Section 3.2 (Model structure)
