Optimal LLM Architecture for Parameter and Computational Efficiency
Determine which deep learning architecture for large language models is optimal with respect to parameter efficiency and computational efficiency, in the setting of models trained for next-token prediction with cross-entropy loss.
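To make the two efficiency axes concrete, below is a minimal sketch (not from the cited paper) of the training objective the problem refers to, next-token prediction with cross-entropy loss, together with a rough parameter-count and per-token-FLOP estimate for a dense decoder-only transformer. The model configurations and the FLOP counting convention (about 2 FLOPs per weight for matrix multiplies, plus an attention term that grows with context length) are illustrative assumptions, not results from the referenced work.

```python
# Illustrative sketch only: configs and FLOP counting are assumptions,
# not taken from the cited paper.
import torch
import torch.nn.functional as F


def next_token_cross_entropy(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss for next-token prediction.

    logits: (batch, seq_len, vocab) -- model outputs at each position
    tokens: (batch, seq_len)        -- input token ids
    The prediction at position t is scored against the token at position t + 1.
    """
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)


def dense_transformer_costs(d_model, n_layers, vocab, seq_len, d_ff=None):
    """Rough parameter count and forward FLOPs per token for a decoder-only transformer."""
    d_ff = d_ff or 4 * d_model
    attn_proj = 4 * d_model * d_model          # Q, K, V, output projections
    mlp = 2 * d_model * d_ff                   # up- and down-projections
    non_embedding = n_layers * (attn_proj + mlp)
    params = non_embedding + vocab * d_model   # weights plus token embeddings
    # ~2 FLOPs per non-embedding weight, plus attention over the context window.
    flops_per_token = 2 * non_embedding + 2 * n_layers * seq_len * d_model
    return params, flops_per_token


if __name__ == "__main__":
    # Two hypothetical shapes with matched non-embedding parameter budgets:
    # at the same parameter count, per-token compute differs, so parameter
    # efficiency and computational efficiency need not favor the same design.
    configs = {
        "wide-shallow": (2048, 12, 32000, 8192),
        "narrow-deep": (1024, 48, 32000, 8192),
    }
    for name, cfg in configs.items():
        p, f = dense_transformer_costs(*cfg)
        print(f"{name}: ~{p / 1e6:.0f}M params, ~{f / 1e9:.2f}G FLOPs/token")
```

Under these counting assumptions the two shapes hold roughly the same number of parameters, yet the deeper, narrower model spends more FLOPs per token on attention over the same context, which is one way the question of "optimal" architecture splits into distinct parameter-efficiency and compute-efficiency criteria.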
References
Which architecture is optimal in terms of parameter efficiency or computational efficiency remains an intriguing open problem.
— Beyond the Black Box: A Statistical Model for LLM Reasoning and Inference
(2402.03175 - Dalal et al., 5 Feb 2024) in Section 6.3 (Implications of our model: Deep Learning Architecture)