Optimal layer depth for injecting the latent variable Z
Determine the optimal Transformer block depth at which to inject the latent random embedding Z in the Free Transformer, a decoder-only Transformer trained as a conditional variational autoencoder whose generation is conditioned on Z. The chosen depth must balance two competing needs: injecting Z too early reduces the encoder's capacity, while injecting it too late reduces the decoder's ability to process the latent variables.
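To make the trade-off concrete, the following is a minimal toy sketch of injection depth as a tunable hyperparameter. It is not the paper's implementation. The residual `tanh` blocks stand in for Transformer blocks, and the names `make_block`, `forward_with_injection`, `z_proj`, and `inject_depth`, the toy sizes, and the choice to add a linear projection of Z to the hidden stream are all assumptions made for illustration.

```python
import numpy as np


def make_block(rng, dim):
    """Toy stand-in for a Transformer block: x -> x + tanh(x @ W)."""
    w = rng.normal(scale=dim ** -0.5, size=(dim, dim))
    return lambda x: x + np.tanh(x @ w)


def forward_with_injection(x, blocks, z, z_proj, inject_depth):
    """Run x through `blocks`, adding the projected latent z after `inject_depth` blocks.

    Returns (output, hidden state just before the injection).
    """
    if not 0 <= inject_depth <= len(blocks):
        raise ValueError("inject_depth must lie in [0, len(blocks)]")
    for block in blocks[:inject_depth]:   # blocks that never see z
        x = block(x)
    pre_injection = x
    x = x + z @ z_proj                    # hypothetical injection: projected z added to the stream
    for block in blocks[inject_depth:]:   # blocks that can process z
        x = block(x)
    return x, pre_injection


rng = np.random.default_rng(0)
num_blocks, seq_len, dim, z_dim = 8, 5, 16, 4   # toy sizes, chosen arbitrarily
blocks = [make_block(rng, dim) for _ in range(num_blocks)]
z_proj = rng.normal(scale=z_dim ** -0.5, size=(z_dim, dim))
x = rng.normal(size=(seq_len, dim))
z = np.eye(z_dim)[rng.integers(z_dim, size=seq_len)]  # one one-hot latent per position

# Sweep candidate depths: each choice trades z-free blocks for z-aware blocks.
for depth in (1, num_blocks // 2, num_blocks - 1):
    out, _ = forward_with_injection(x, blocks, z, z_proj, depth)
    print(f"depth={depth}: {depth} z-free blocks, "
          f"{num_blocks - depth} blocks after injection, output shape {out.shape}")
```

In this sketch the hidden state before the injection point does not depend on Z, so a larger `inject_depth` leaves more Z-free computation available before the latent is introduced and fewer blocks to process it afterwards. Answering the open question would require training a real model at each candidate depth and comparing the results, which this toy does not attempt.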
References
While we did not investigate what is the best depth to inject $Z$, doing it too early would reduce the encoder's capacity, and doing it too late would reduce the decoder's capacity to process the latent variables.
— The Free Transformer
(2510.17558 - Fleuret, 20 Oct 2025) in Section 3.2 (Model structure)