Optimal LLM Architecture for Parameter and Computational Efficiency
Determine which deep learning architecture for large language models is optimal with respect to parameter efficiency and computational efficiency, in the setting of models trained for next-token prediction with cross-entropy loss.
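To make the two efficiency axes concrete, below is a minimal sketch (not from the cited paper) of the training objective the problem refers to, next-token prediction with cross-entropy loss, together with a rough parameter-count and per-token-FLOP estimate for a dense decoder-only transformer. The model configurations and the FLOP counting convention (about 2 FLOPs per weight for matrix multiplies, plus an attention term that grows with context length) are illustrative assumptions, not results from the referenced work.

```python
# Illustrative sketch only: configs and FLOP counting are assumptions,
# not taken from the cited paper.
import torch
import torch.nn.functional as F


def next_token_cross_entropy(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss for next-token prediction.

    logits: (batch, seq_len, vocab) -- model outputs at each position
    tokens: (batch, seq_len)        -- input token ids
    The prediction at position t is scored against the token at position t + 1.
    """
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)


def dense_transformer_costs(d_model, n_layers, vocab, seq_len, d_ff=None):
    """Rough parameter count and forward FLOPs per token for a decoder-only transformer."""
    d_ff = d_ff or 4 * d_model
    attn_proj = 4 * d_model * d_model          # Q, K, V, output projections
    mlp = 2 * d_model * d_ff                   # up- and down-projections
    non_embedding = n_layers * (attn_proj + mlp)
    params = non_embedding + vocab * d_model   # weights plus token embeddings
    # ~2 FLOPs per non-embedding weight, plus attention over the context window.
    flops_per_token = 2 * non_embedding + 2 * n_layers * seq_len * d_model
    return params, flops_per_token


if __name__ == "__main__":
    # Two hypothetical shapes with matched non-embedding parameter budgets:
    # at the same parameter count, per-token compute differs, so parameter
    # efficiency and computational efficiency need not favor the same design.
    configs = {
        "wide-shallow": (2048, 12, 32000, 8192),
        "narrow-deep": (1024, 48, 32000, 8192),
    }
    for name, cfg in configs.items():
        p, f = dense_transformer_costs(*cfg)
        print(f"{name}: ~{p / 1e6:.0f}M params, ~{f / 1e9:.2f}G FLOPs/token")
```

Under these counting assumptions the two shapes hold roughly the same number of parameters, yet the deeper, narrower model spends more FLOPs per token on attention over the same context, which is one way the question of "optimal" architecture splits into distinct parameter-efficiency and compute-efficiency criteria.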
References
Which architecture is optimal in terms of parameter efficiency or computational efficiency remains an intriguing open problem.
— Beyond the Black Box: A Statistical Model for LLM Reasoning and Inference
(2402.03175 - Dalal et al., 5 Feb 2024) in Section 6.3 (Implications of our model: Deep Learning Architecture)