It HAS to be Subjective: Human Annotator Simulation via Zero-shot Density Estimation (2310.00486v1)
Abstract: Human annotator simulation (HAS) serves as a cost-effective substitute for human evaluation, such as data annotation and system assessment. Human perception and behaviour during human evaluation exhibit inherent variability due to diverse cognitive processes and subjective interpretations, which should be taken into account in modelling to better mimic the way people perceive and interact with the world. This paper introduces a novel meta-learning framework that treats HAS as a zero-shot density estimation problem, which incorporates human variability and allows for the efficient generation of human-like annotations for unlabelled test inputs. Under this framework, we propose two new model classes, conditional integer flows and conditional softmax flows, to account for ordinal and categorical annotations, respectively. The proposed method is evaluated on three real-world human evaluation tasks and shows superior capability and efficiency in predicting the aggregated behaviours of human annotators, matching the distribution of human annotations, and simulating inter-annotator disagreements.
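The three evaluation targets named in the abstract (aggregated behaviour, annotation distribution, and inter-annotator disagreement) can be illustrated with a minimal sketch for categorical labels. This is not the paper's evaluation code: the function names are hypothetical, majority voting is assumed as the aggregation rule, and Fleiss' kappa is used as one common agreement measure.

```python
from collections import Counter


def majority_vote(labels):
    """Aggregated behaviour: most frequent label (ties go to the first seen)."""
    return Counter(labels).most_common(1)[0][0]


def label_distribution(labels, categories):
    """Empirical distribution of one item's annotations over the categories."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of per-item label lists, all of equal length (>= 2 raters).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of raters who assigned category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cats)
    return (p_bar - p_exp) / (1 - p_exp)


if __name__ == "__main__":
    cats = ["a", "b"]
    simulated = [["a", "a", "b"], ["b", "b", "b"], ["a", "b", "b"]]
    print([majority_vote(item) for item in simulated])
    print(label_distribution(simulated[0], cats))
    print(fleiss_kappa(simulated, cats))
```

The same functions can be applied to real and simulated annotations for the same items, so that the majority labels, per-item distributions, and kappa values can be compared side by side.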