
Measuring Social Norms of Large Language Models (2404.02491v4)

Published 3 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We present a new challenge to examine whether LLMs understand social norms. In contrast to existing datasets, our dataset requires a fundamental understanding of social norms to solve. It features the largest set of social norm skills, consisting of 402 skills and 12,383 questions covering a wide range of social norms, from opinions and arguments to culture and laws. We design the dataset according to the K-12 curriculum, which enables a direct comparison of the social understanding of LLMs to that of humans, specifically elementary students. While prior models achieve nearly random accuracy on our benchmark, recent LLMs such as GPT-3.5-Turbo and LLaMA2-Chat improve performance significantly, to only slightly below human performance. We then propose a multi-agent framework based on LLMs to further improve the models' ability to understand social norms, bringing them on par with humans. Given the increasing adoption of LLMs in real-world applications, our findings are particularly important and present a unique direction for future improvements.
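The abstract describes the multi-agent framework only at a high level. Below is a minimal, illustrative sketch of how multiple LLM "agents" might be aggregated on a single multiple-choice social-norm question; the `query_llm` stub, the personas, and the majority-vote rule are assumptions for illustration, not the authors' actual design.

```python
# Minimal sketch of a multi-agent answering loop for a multiple-choice
# social-norm question. The exact prompts, agent roles, and aggregation
# here are illustrative assumptions, not the paper's implementation.
from collections import Counter


def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call; should return an option letter."""
    raise NotImplementedError("Wire this to an actual LLM API.")


def multi_agent_answer(question: str, options: dict[str, str], n_agents: int = 3) -> str:
    """Query several persona-conditioned 'agents' and return the majority-vote answer."""
    personas = [
        "a teacher explaining social norms to elementary students",
        "a careful student reasoning step by step",
        "a reviewer double-checking the most socially appropriate choice",
    ]
    votes = []
    for persona in personas[:n_agents]:
        prompt = (
            f"You are {persona}.\n"
            f"Question: {question}\n"
            + "\n".join(f"{key}. {text}" for key, text in options.items())
            + "\nAnswer with the single letter of the best option."
        )
        # Keep only the first character of the reply as the chosen option.
        votes.append(query_llm(prompt).strip().upper()[:1])
    # Majority vote over the agents' answers.
    return Counter(votes).most_common(1)[0][0]
```

In practice the paper's framework likely involves richer agent interaction (for example, multiple discussion rounds); this sketch only shows the general shape of aggregating several LLM calls on one benchmark question.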

Authors (5)
  1. Ye Yuan (274 papers)
  2. Kexin Tang (3 papers)
  3. Jianhao Shen (18 papers)
  4. Ming Zhang (313 papers)
  5. Chenguang Wang (59 papers)
Citations (2)