Scaling behavior of the Free Transformer at larger sizes

Investigate the behavior of the Free Transformer when scaled to larger parameter counts and trained on substantially larger datasets, assessing how performance and training dynamics change with scale.

Background

The paper reports that the Free Transformer outperforms baseline decoder-only Transformers on multiple benchmarks, particularly coding and reasoning tasks, using 1.5B- and 8B-parameter models trained on up to 1T tokens.

Despite these results, the authors explicitly state that the behavior at larger scales, in both model size and dataset size, remains to be investigated, indicating the need for systematic scaling studies.

References

Finally, the behavior in larger scales, both in parameter count and dataset size, remains to be investigated.

— The Free Transformer (2510.17558 - Fleuret, 20 Oct 2025) in Section 6 (Conclusion)
