
Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers (2312.04333v4)

Published 7 Dec 2023 in cs.CL

Abstract: This paper presents an in-depth analysis of LLMs, focusing on LLaMA, a prominent open-source foundational model in natural language processing. Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding in high-order tasks such as reasoning and computation. We examine the model horizontally, comparing different sizes, and vertically, assessing different layers. We unveil several key and uncommon findings based on the designed probing tasks: (1) Horizontally, enlarging model size does not automatically impart additional knowledge or computational prowess. It can, however, enhance reasoning abilities, especially in math problem solving, and helps reduce hallucinations, but only beyond certain size thresholds; (2) Vertically, the lower layers of LLaMA lack substantial arithmetic and factual knowledge but exhibit logical thinking, multilingual, and recognition abilities, while the top layers house most of the computational power and real-world knowledge.

Authors (7)
  1. Nuo Chen
  2. Ning Wu
  3. Shining Liang
  4. Ming Gong
  5. Linjun Shou
  6. Dongmei Zhang
  7. Jia Li

Summary

Analyzing LLaMA: Probing Size and Depth in LLMs

The paper "Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers" presents a meticulous investigation into the capabilities of the LLaMA series of LLMs, offering a nuanced examination of how performance varies with model size and network depth. By employing specifically designed multiple-choice probing tasks, the research explores LLaMA's intrinsic understanding across reasoning, computation, and knowledge retention.
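
A multiple-choice probing setup of this general kind can be sketched as follows: instead of judging free-form generations, each answer option is scored by the sum of token log-probabilities the model assigns to it given the question, and the highest-scoring option is taken as the model's answer. This is a minimal illustration under stated assumptions, not the paper's actual code; the names `option_logprob`, `pick_option`, and the toy stand-in model are hypothetical.

```python
import math

def option_logprob(prompt: str, option: str, logprob_fn) -> float:
    """Sum log-probabilities of the option's tokens given the prompt.

    `logprob_fn(context, token)` stands in for a real LM call returning
    log P(token | context); it is injected so the sketch stays self-contained.
    """
    context, total = prompt, 0.0
    for token in option.split():
        total += logprob_fn(context, token)
        context += " " + token
    return total

def pick_option(prompt: str, options: list[str], logprob_fn) -> str:
    """Multiple-choice probing: the model's 'answer' is the highest-scoring option."""
    return max(options, key=lambda opt: option_logprob(prompt, opt, logprob_fn))

# Toy stand-in model (purely illustrative): assigns high probability to
# tokens that already appear in the context, low probability otherwise.
def toy_logprob(context: str, token: str) -> float:
    return 0.0 if token in context.split() else math.log(0.1)

prompt = ("Paris is the capital of France. "
          "Question: What is the capital of France? Answer:")
print(pick_option(prompt, ["Paris", "Rome", "Berlin"], toy_logprob))  # Paris
```

With a real model, `logprob_fn` would be replaced by a forward pass over the concatenated prompt and option; the ranking logic stays the same.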

The exploration begins by debunking the common assumption that larger models inherently possess greater factual knowledge or computational proficiency. The analyses highlight that simply increasing model size does not significantly enhance these aspects. Instead, factual knowledge and basic arithmetic skills remain fairly consistent across varying model sizes. Particularly notable is the observation that increasing the parameter count does not substantially endow models with additional real-world knowledge when trained on the same volume of data.

The research further distinguishes between horizontal (size-wise) and vertical (layer-wise) analysis, revealing that reasoning and truthfulness improve with larger architectures. In mathematical problem solving, particularly in scenarios necessitating complex reasoning, performance jumps once a certain model-size threshold is reached. Smaller configurations, such as LLaMA 2-7B, perform comparably to slightly larger counterparts such as the 13B model on arithmetic and knowledge-retention tasks; a considerable gain appears only when scaling up to 70B parameters, which excels particularly in tasks requiring sophisticated reasoning and exhibits fewer hallucinations.

Further insights are garnered from probing individual model layers: LLaMA's lower layers hold limited computational and factual knowledge, while logical reasoning and recognition abilities are present across all layers, albeit to varying degrees. The uppermost layers harbor most of the computational power and real-world knowledge, whereas lower layers contribute logical and abstract thinking. Interestingly, optimal performance on some tasks is not always located in the final layer, suggesting that peak computational abilities may lie in preceding layers.
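
Layer-wise probing of this kind is typically done by training a lightweight classifier on each layer's hidden states and comparing accuracies across depth. The sketch below is a self-contained toy version: synthetic "hidden states" stand in for real LLaMA activations (the label signal is made stronger in deeper layers by construction), and a nearest-centroid probe is used in place of a learned linear probe. All names and numbers are illustrative assumptions, not the paper's setup.

```python
import random

random.seed(0)
DIM, N_LAYERS, N_PER_CLASS = 16, 8, 200

def synth_hidden(label: int, layer: int) -> list[float]:
    """Synthetic 'hidden state': the label signal in dimension 0 grows with
    depth, mimicking knowledge that is only decodable in upper layers."""
    signal = (layer / (N_LAYERS - 1)) * (1.0 if label else -1.0)
    return [signal + random.gauss(0, 0.5)] + [random.gauss(0, 1) for _ in range(DIM - 1)]

def centroid(vectors: list[list[float]]) -> list[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def probe_accuracy(layer: int) -> float:
    """Nearest-centroid probe: fit on half the data, evaluate on the rest."""
    data = [(synth_hidden(y, layer), y) for y in (0, 1) for _ in range(N_PER_CLASS)]
    random.shuffle(data)
    train, test = data[: len(data) // 2], data[len(data) // 2:]
    cents = {y: centroid([v for v, lab in train if lab == y]) for y in (0, 1)}
    def sq_dist(a, b):
        return sum((x - z) ** 2 for x, z in zip(a, b))
    hits = sum(1 for v, y in test if min(cents, key=lambda c: sq_dist(v, cents[c])) == y)
    return hits / len(test)

for layer in range(N_LAYERS):
    print(f"layer {layer}: probe accuracy {probe_accuracy(layer):.2f}")
```

With a real model, the synthetic generator would be replaced by per-layer hidden states extracted from a forward pass; the probe-per-layer comparison is what localizes a capability in depth.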

Extending the investigation to a multilingual context, the paper underscores the model's cross-lingual reasoning capacities. A divergent observation is made in non-English languages, where performance diminishes with increased network depth, indicating that lower layers may be responsible for storing language-general features.

The implications of these findings are significant for future LLM development and optimization. They suggest potential avenues for optimizing architecture by specializing layers for different tasks, thus informing the design of more efficient and capable LLMs. The comprehensive analysis of LLaMA across multiple dimensions offers valuable insights into the architecture and scalability of LLMs, providing a framework for future evaluations and enhancements in AI.
