
Navigating Prompt Complexity for Zero-Shot Classification: A Study of Large Language Models in Computational Social Science (2305.14310v3)

Published 23 May 2023 in cs.CL

Abstract: Instruction-tuned LLMs have exhibited impressive language understanding and the capacity to generate responses that follow specific prompts. However, due to the computational demands associated with training these models, their applications often adopt a zero-shot setting. In this paper, we evaluate the zero-shot performance of two publicly accessible LLMs, ChatGPT and OpenAssistant, on six Computational Social Science classification tasks, while also investigating the effects of various prompting strategies. Our experiments examine the impact of prompt complexity, including the effect of incorporating label definitions into the prompt, the use of synonyms for label names, and the influence of memories integrated during foundation model training. The findings indicate that in a zero-shot setting, current LLMs are unable to match the performance of smaller, fine-tuned baseline transformer models (such as BERT-large). Additionally, we find that different prompting strategies can significantly affect classification performance, with variations in accuracy and F1 scores exceeding 10%.
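The prompt-complexity strategies described above can be sketched as simple prompt construction. This is an illustrative example only: the task, labels, definitions, and template below are hypothetical and do not reproduce the paper's exact prompts.

```python
def build_prompt(text, labels, definitions=None, synonyms=None):
    """Build a zero-shot classification prompt of varying complexity.

    definitions: optional dict mapping label -> one-sentence definition
                 (adding these is one complexity level studied).
    synonyms:    optional dict mapping label -> alternative label name
                 (swapping label names is another strategy studied).
    """
    # Optionally replace label names with synonyms.
    if synonyms:
        if definitions:
            definitions = {synonyms.get(l, l): d for l, d in definitions.items()}
        labels = [synonyms.get(l, l) for l in labels]

    lines = [f"Classify the text into one of: {', '.join(labels)}."]
    # Optionally append a definition for each label.
    if definitions:
        lines += [f"- {l}: {definitions[l]}" for l in labels]
    lines += [f"Text: {text}", "Label:"]
    return "\n".join(lines)


# Hypothetical complaint-detection task (one of the CSS task types
# commonly studied; labels and definitions are made up here).
labels = ["complaint", "not complaint"]
defs = {
    "complaint": "The text expresses dissatisfaction with a product or service.",
    "not complaint": "The text does not express dissatisfaction.",
}

simple_prompt = build_prompt("My order arrived broken.", labels)
rich_prompt = build_prompt("My order arrived broken.", labels, definitions=defs)
```

The paper's finding is that such seemingly minor template changes (e.g. `simple_prompt` vs. `rich_prompt`) can shift accuracy and F1 by more than 10 points, so any zero-shot annotation pipeline should evaluate several template variants rather than a single fixed prompt.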

Authors (8)
  1. Yida Mu
  2. Ben P. Wu
  3. William Thorne
  4. Ambrose Robinson
  5. Nikolaos Aletras
  6. Carolina Scarton
  7. Kalina Bontcheva
  8. Xingyi Song
Citations (13)