SQBC: Active Learning using LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions (2404.08078v1)
Abstract: Stance detection is an important task for many applications that analyse or support online political discussions. Common approaches include fine-tuning transformer-based models. However, these models require a large amount of labelled data, which might not be available. In this work, we present two different ways to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions. First, we show that augmenting a small fine-tuning dataset with synthetic data can improve the performance of the stance detection model. Second, we propose a new active learning method called SQBC based on the "Query-by-Committee" approach. The key idea is to use LLM-generated synthetic data as an oracle to identify the most informative unlabelled samples, which are then selected for manual labelling. Comprehensive experiments show that both ideas can improve the stance detection performance. Curiously, we observed that fine-tuning on actively selected samples can exceed the performance of using the full dataset.
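The abstract does not spell out how the synthetic data acts as an oracle, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes that unlabelled and synthetic comments have already been embedded as vectors, that each synthetic comment carries the stance label it was generated with, and that the k most similar synthetic comments form a "committee" whose vote entropy measures how informative an unlabelled sample is. The function names (`vote_entropy`, `select_for_labelling`), the cosine-similarity neighbourhood, and the parameters `k` and `budget` are all hypothetical.

```python
import numpy as np


def vote_entropy(votes, n_classes=2):
    """Shannon entropy (in bits) of the committee's label votes."""
    counts = np.bincount(votes, minlength=n_classes)
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty classes to avoid log2(0)
    return float(-(p * np.log2(p)).sum())


def select_for_labelling(unlabelled_emb, synthetic_emb, synthetic_labels,
                         k=2, budget=1):
    """Return indices of the `budget` most ambiguous unlabelled samples.

    Assumption: the k nearest synthetic samples (by cosine similarity)
    act as the committee, and their generated labels are the votes.
    """
    unlabelled_emb = np.asarray(unlabelled_emb, dtype=float)
    synthetic_emb = np.asarray(synthetic_emb, dtype=float)
    synthetic_labels = np.asarray(synthetic_labels, dtype=int)

    # Normalise rows so that the dot product equals cosine similarity.
    u = unlabelled_emb / np.linalg.norm(unlabelled_emb, axis=1, keepdims=True)
    s = synthetic_emb / np.linalg.norm(synthetic_emb, axis=1, keepdims=True)
    sims = u @ s.T

    # Indices of the k most similar synthetic samples per unlabelled sample.
    neighbours = np.argsort(-sims, axis=1)[:, :k]

    # High entropy means the committee disagrees about the stance.
    scores = np.array([vote_entropy(synthetic_labels[idx])
                       for idx in neighbours])

    # Highest disagreement first; stable sort keeps ties in input order.
    order = np.argsort(-scores, kind="stable")
    return order[:budget], scores
```

With toy 2-D embeddings, a sample lying between the "favour" and "against" synthetic clusters receives a split vote and is the one forwarded for manual labelling, while samples close to a single cluster receive a unanimous vote and are skipped.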
Authors: Stefan Sylvius Wagner, Maike Behrendt, Marc Ziegele, Stefan Harmeling