
Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers (2312.04333v4)

Published 7 Dec 2023 in cs.CL

Abstract: This paper presents an in-depth analysis of LLMs, focusing on LLaMA, a prominent open-source foundational model in natural language processing. Instead of assessing LLaMA through its generative output, we design multiple-choice tasks to probe its intrinsic understanding in high-order tasks such as reasoning and computation. We examine the model horizontally, comparing different sizes, and vertically, assessing different layers. We unveil several key and uncommon findings based on the designed probing tasks: (1) Horizontally, enlarging model size does not automatically impart additional knowledge or computational prowess. It can, however, enhance reasoning abilities, especially in math problem solving, and helps reduce hallucinations, but only beyond certain size thresholds; (2) Vertically, the lower layers of LLaMA lack substantial arithmetic and factual knowledge but exhibit logical thinking, multilingual, and recognition abilities, while the top layers house most of the computational power and real-world knowledge.

Authors (7)
  1. Nuo Chen
  2. Ning Wu
  3. Shining Liang
  4. Ming Gong
  5. Linjun Shou
  6. Dongmei Zhang
  7. Jia Li

Summary

Analyzing LLaMA: Probing Size and Depth in LLMs

The paper "Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers" presents a meticulous investigation into the capabilities of the LLaMA series of LLMs, offering a nuanced examination of how performance varies with model size and network depth. By employing specifically designed multiple-choice probing tasks, the research explores LLaMA's intrinsic understanding across reasoning, computation, and knowledge retention.
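
A multiple-choice probing setup of this general kind can be sketched as follows: instead of judging free-form generations, each answer option is scored by the sum of token log-probabilities the model assigns to it given the question, and the highest-scoring option is taken as the model's answer. This is a minimal illustration under stated assumptions, not the paper's actual code; the names `option_logprob`, `pick_option`, and the toy stand-in model are hypothetical.

```python
import math

def option_logprob(prompt: str, option: str, logprob_fn) -> float:
    """Sum log-probabilities of the option's tokens given the prompt.

    `logprob_fn(context, token)` stands in for a real LM call returning
    log P(token | context); it is injected so the sketch stays self-contained.
    """
    context, total = prompt, 0.0
    for token in option.split():
        total += logprob_fn(context, token)
        context += " " + token
    return total

def pick_option(prompt: str, options: list[str], logprob_fn) -> str:
    """Multiple-choice probing: the model's 'answer' is the highest-scoring option."""
    return max(options, key=lambda opt: option_logprob(prompt, opt, logprob_fn))

# Toy stand-in model (purely illustrative): assigns high probability to
# tokens that already appear in the context, low probability otherwise.
def toy_logprob(context: str, token: str) -> float:
    return 0.0 if token in context.split() else math.log(0.1)

prompt = ("Paris is the capital of France. "
          "Question: What is the capital of France? Answer:")
print(pick_option(prompt, ["Paris", "Rome", "Berlin"], toy_logprob))  # Paris
```

With a real model, `logprob_fn` would be replaced by a forward pass over the concatenated prompt and option; the ranking logic stays the same.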

The exploration begins by debunking the common assumption that larger models inherently possess greater factual knowledge or computational proficiency. The analyses highlight that simply increasing model size does not significantly enhance these aspects. Instead, factual knowledge and basic arithmetic skills remain fairly consistent across varying model sizes. Particularly notable is the observation that increasing the parameter count does not substantially endow models with additional real-world knowledge when trained on the same volume of data.

The research further distinguishes between horizontal (size-wise) and vertical (layer-wise) analysis, revealing that reasoning and truthfulness improve with larger architectures. In mathematical problem solving, particularly in scenarios necessitating complex reasoning, performance jumps once a certain model-size threshold is reached. Smaller configurations, such as LLaMA 2-7B, perform comparably to slightly larger counterparts such as the 13B model on arithmetic and knowledge-retention tasks; a considerable gain appears only when scaling up to 70B parameters, which excels particularly in tasks requiring sophisticated reasoning and exhibits fewer hallucinations.

Further insights are garnered from probing individual model layers: LLaMA's lower layers hold limited computational and factual knowledge, while logical reasoning and recognition abilities are present across all layers, albeit to varying degrees. The uppermost layers harbor most of the computational power and real-world knowledge, whereas lower layers contribute logical and abstract thinking. Interestingly, optimal performance on some tasks is not always located in the final layer, suggesting that peak computational abilities may lie in preceding layers.
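
Layer-wise probing of this kind is typically done by training a lightweight classifier on each layer's hidden states and comparing accuracies across depth. The sketch below is a self-contained toy version: synthetic "hidden states" stand in for real LLaMA activations (the label signal is made stronger in deeper layers by construction), and a nearest-centroid probe is used in place of a learned linear probe. All names and numbers are illustrative assumptions, not the paper's setup.

```python
import random

random.seed(0)
DIM, N_LAYERS, N_PER_CLASS = 16, 8, 200

def synth_hidden(label: int, layer: int) -> list[float]:
    """Synthetic 'hidden state': the label signal in dimension 0 grows with
    depth, mimicking knowledge that is only decodable in upper layers."""
    signal = (layer / (N_LAYERS - 1)) * (1.0 if label else -1.0)
    return [signal + random.gauss(0, 0.5)] + [random.gauss(0, 1) for _ in range(DIM - 1)]

def centroid(vectors: list[list[float]]) -> list[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def probe_accuracy(layer: int) -> float:
    """Nearest-centroid probe: fit on half the data, evaluate on the rest."""
    data = [(synth_hidden(y, layer), y) for y in (0, 1) for _ in range(N_PER_CLASS)]
    random.shuffle(data)
    train, test = data[: len(data) // 2], data[len(data) // 2:]
    cents = {y: centroid([v for v, lab in train if lab == y]) for y in (0, 1)}
    def sq_dist(a, b):
        return sum((x - z) ** 2 for x, z in zip(a, b))
    hits = sum(1 for v, y in test if min(cents, key=lambda c: sq_dist(v, cents[c])) == y)
    return hits / len(test)

for layer in range(N_LAYERS):
    print(f"layer {layer}: probe accuracy {probe_accuracy(layer):.2f}")
```

With a real model, the synthetic generator would be replaced by per-layer hidden states extracted from a forward pass; the probe-per-layer comparison is what localizes a capability in depth.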

Extending the investigation to a multilingual context, the paper underscores the model's cross-lingual reasoning capacities. A divergent observation is made in non-English languages, where performance diminishes with increased network depth, indicating that lower layers may be responsible for storing language-general features.

The implications of these findings are significant for future LLM development and optimization. They suggest potential avenues for optimizing architecture by specializing layers for different tasks, thus informing the design of more efficient and capable LLMs. The comprehensive analysis of LLaMA across multiple dimensions offers valuable insights into the architecture and scalability of LLMs, providing a framework for future evaluations and enhancements in AI.
