Detecting Statements in Text: A Domain-Agnostic Few-Shot Solution (2405.05705v1)
Abstract: Many tasks related to Computational Social Science and Web Content Analysis involve classifying pieces of text based on the claims they contain. State-of-the-art approaches usually involve fine-tuning models on large annotated datasets, which are costly to produce. In light of this, we propose and release a qualitative and versatile few-shot learning methodology as a common paradigm for any claim-based textual classification task. This methodology involves defining the classes as arbitrarily sophisticated taxonomies of claims, and using Natural Language Inference models to obtain the textual entailment between these and a corpus of interest. The performance of these models is then boosted by annotating a minimal sample of data points, dynamically sampled using the well-established statistical heuristic of Probabilistic Bisection. We illustrate this methodology in the context of three tasks: climate change contrarianism detection, topic/stance classification, and depression-related symptom detection. This approach rivals traditional pre-train/fine-tune approaches while drastically reducing the need for data annotation.
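The entailment step described in the abstract can be prototyped with off-the-shelf tooling. Below is a minimal sketch, assuming the HuggingFace `transformers` zero-shot classification pipeline with a generic MNLI checkpoint (`facebook/bart-large-mnli`); the two-claim taxonomy and the model choice are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: score a corpus text against a small taxonomy of claims
# via NLI entailment (text = premise, each claim = hypothesis).
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Hypothetical two-claim taxonomy, for illustration only.
claim_taxonomy = [
    "Climate change is not caused by human activity.",
    "Climate change is driven by human greenhouse gas emissions.",
]

text = "CO2 concentrations have risen sharply since the industrial revolution."

# hypothesis_template="{}" feeds each claim verbatim as the NLI hypothesis.
result = nli(text, candidate_labels=claim_taxonomy,
             hypothesis_template="{}", multi_label=True)

for claim, score in zip(result["labels"], result["scores"]):
    print(f"{score:.3f}  {claim}")
```

The abstract also names Probabilistic Bisection as the heuristic for choosing which few data points to annotate. The sketch below is a generic, discretized version of that classical procedure (locating an unknown threshold in [0, 1] from noisy yes/no answers), offered as an assumption-laden illustration rather than the paper's exact sampling scheme; the function and parameter names are hypothetical.

```python
import numpy as np

def probabilistic_bisection(respond, p=0.8, grid_size=1000, n_queries=25):
    """Estimate an unknown threshold in [0, 1] from noisy binary feedback.

    respond(x) returns True if the threshold is reported to lie above x;
    each answer is assumed correct with probability p > 0.5.
    """
    grid = np.linspace(0.0, 1.0, grid_size)
    density = np.ones(grid_size) / grid_size        # uniform prior over the threshold

    for _ in range(n_queries):
        cdf = np.cumsum(density)
        query_idx = int(np.searchsorted(cdf, 0.5))  # query the posterior median
        answer = respond(grid[query_idx])

        # Up-weight the half-interval consistent with the (noisy) answer.
        weights = np.where(np.arange(grid_size) > query_idx, p, 1.0 - p)
        if not answer:
            weights = 1.0 - weights
        density *= weights
        density /= density.sum()                    # renormalize the posterior

    return grid[int(np.argmax(density))]            # posterior mode as the estimate
```

One plausible use in this setting is letting a handful of annotator yes/no judgments steer the choice of a decision threshold over the entailment scores produced above, which is how a minimal annotation budget can boost the zero-shot classifier.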