
Detecting Statements in Text: A Domain-Agnostic Few-Shot Solution (2405.05705v1)

Published 9 May 2024 in cs.CL

Abstract: Many tasks related to Computational Social Science and Web Content Analysis involve classifying pieces of text based on the claims they contain. State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce. In light of this, we propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task. This methodology involves defining the classes as arbitrarily sophisticated taxonomies of claims, and using Natural Language Inference models to obtain the textual entailment between these claims and a corpus of interest. The performance of these models is then boosted by annotating a minimal sample of data points, dynamically sampled using the well-established statistical heuristic of Probabilistic Bisection. We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification, and depression-related symptom detection. This approach rivals traditional pre-train/fine-tune approaches while drastically reducing the need for data annotation.
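
To make the Probabilistic Bisection step concrete, here is a minimal sketch of the generic algorithm, not the paper's released implementation. It assumes the quantity being located is a decision threshold in [0, 1] on an entailment score, and that each annotation acts as a noisy answer to "is the threshold above x?". The names `probabilistic_bisection` and `make_noisy_oracle`, the grid size, and the assumed answer accuracy `p` are all hypothetical choices made for illustration.

```python
import random


def _median_index(posterior):
    """Return the first grid index at which cumulative mass reaches 0.5."""
    cumulative = 0.0
    for idx, weight in enumerate(posterior):
        cumulative += weight
        if cumulative >= 0.5:
            return idx
    return len(posterior) - 1


def probabilistic_bisection(oracle, p=0.8, steps=60, grid_size=1001):
    """Estimate a threshold in [0, 1] from noisy 'threshold > x?' answers.

    oracle(x) returns True when it claims the threshold lies above x.
    p is the assumed probability that an answer is correct (0.5 < p < 1).
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    posterior = [1.0 / grid_size] * grid_size  # uniform prior over the grid
    for _ in range(steps):
        # Query at the posterior median, the point that splits the mass in half.
        x = grid[_median_index(posterior)]
        says_above = oracle(x)
        # Bayesian update: candidates consistent with the answer are weighted
        # by p, inconsistent candidates by 1 - p.
        for j, candidate in enumerate(grid):
            consistent = (candidate > x) == says_above
            posterior[j] *= p if consistent else (1.0 - p)
        total = sum(posterior)
        posterior = [w / total for w in posterior]
    return grid[_median_index(posterior)]


def make_noisy_oracle(true_threshold, p_correct, seed=0):
    """Hypothetical annotator: answers correctly with probability p_correct."""
    rng = random.Random(seed)

    def oracle(x):
        truth = true_threshold > x
        return truth if rng.random() < p_correct else not truth

    return oracle
```

In this sketch each call to `oracle` stands in for one manual annotation, so `steps` is the annotation budget. Querying at the posterior median is what distinguishes the method from plain bisection: a single wrong answer only reweights the candidates rather than discarding the correct region.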

