POV Learning: Individual Alignment of Multimodal Models using Human Perception (2405.04443v2)
Abstract: Aligning machine learning systems with human expectations is mostly attempted by training on manually vetted samples of human behavior, typically explicit feedback. This is done at the population level, since the context capturing the subjective Point-Of-View (POV) of a concrete person in a specific situation is not retained in the data. However, we argue that alignment at the individual level can considerably boost subjective predictive performance for the individual user interacting with the system. Since perception differs from person to person, the same situation is observed differently; consequently, the basis for decision making, the subsequent reasoning processes, and the observable reactions differ as well. We hypothesize that individual perception patterns can be used to improve alignment at the individual level. We test this by integrating perception information into machine learning systems and measuring their predictive performance with respect to individual subjective assessments. For our empirical study, we collect a novel data set of multimodal stimuli and corresponding eye-tracking sequences for the novel task of Perception-Guided Crossmodal Entailment and tackle it with our Perception-Guided Multimodal Transformer. Our findings suggest that exploiting individual perception signals for machine learning of subjective human assessments provides a valuable cue for individual alignment. It not only improves overall predictive performance from the point of view of the individual user but might also contribute to steering AI systems towards each person's individual expectations and values.
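The abstract only sketches the idea of feeding perception signals into a multimodal model, so the following minimal PyTorch sketch illustrates one plausible way this could look: an eye-tracking fixation sequence is encoded and fused with pooled image-text features before an entailment classifier. This is not the paper's Perception-Guided Multimodal Transformer; all module names, feature layouts, and dimensions here are hypothetical assumptions for illustration.

```python
# Hypothetical sketch (not the authors' implementation): fuse pooled
# image-text features with an encoding of a user's fixation sequence.
import torch
import torch.nn as nn

class PerceptionGuidedEntailmentHead(nn.Module):
    def __init__(self, vl_dim=768, gaze_feat_dim=4, gaze_hidden=128, num_labels=3):
        super().__init__()
        # Encode the fixation sequence (e.g. x, y, duration, pupil size per fixation).
        self.gaze_encoder = nn.LSTM(gaze_feat_dim, gaze_hidden, batch_first=True)
        # Classify the fused vision-language + perception representation.
        self.classifier = nn.Sequential(
            nn.Linear(vl_dim + gaze_hidden, 256),
            nn.ReLU(),
            nn.Linear(256, num_labels),
        )

    def forward(self, vl_pooled, gaze_seq):
        # vl_pooled: (batch, vl_dim) pooled output of a pretrained image-text encoder
        # gaze_seq:  (batch, num_fixations, gaze_feat_dim) per-user fixation features
        _, (h_n, _) = self.gaze_encoder(gaze_seq)         # h_n: (1, batch, gaze_hidden)
        gaze_repr = h_n.squeeze(0)                        # (batch, gaze_hidden)
        fused = torch.cat([vl_pooled, gaze_repr], dim=-1) # late fusion by concatenation
        return self.classifier(fused)                     # logits over entailment labels

# Toy usage with random tensors standing in for real encoder outputs and gaze data.
head = PerceptionGuidedEntailmentHead()
vl_pooled = torch.randn(2, 768)     # e.g. [CLS] embedding from a VL transformer
gaze_seq = torch.randn(2, 12, 4)    # 12 fixations with 4 features each
logits = head(vl_pooled, gaze_seq)  # shape (2, 3)
```

Late fusion by concatenation is only the simplest option; the paper's approach integrates the perception signal into the multimodal transformer itself.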