
Tailoring Vaccine Messaging with Common-Ground Opinions (2405.10861v2)

Published 17 May 2024 in cs.CL, cs.AI, and cs.CY

Abstract: One way to personalize chatbot interactions is by establishing common ground with the intended reader. A domain where establishing mutual understanding could be particularly impactful is vaccine concerns and misinformation. Vaccine interventions are forms of messaging which aim to answer concerns expressed about vaccination. Tailoring responses in this domain is difficult, since opinions often have seemingly little ideological overlap. We define the task of tailoring vaccine interventions to a Common-Ground Opinion (CGO). Tailoring responses to a CGO involves meaningfully improving the answer by relating it to an opinion or belief the reader holds. In this paper we introduce TAILOR-CGO, a dataset for evaluating how well responses are tailored to provided CGOs. We benchmark several major LLMs on this task, finding GPT-4-Turbo performs significantly better than others. We also build automatic evaluation metrics, including an efficient and accurate BERT model that outperforms finetuned LLMs, investigate how to successfully tailor vaccine messaging to CGOs, and provide actionable recommendations from this investigation. Code and model weights: https://github.com/rickardstureborg/tailor-cgo Dataset: https://huggingface.co/datasets/DukeNLP/tailor-cgo

Summary

  • The paper introduces a novel evaluation framework and the TAILOR-CGO dataset of 22,400 responses for assessing personalized vaccine messaging.
  • It benchmarks various LLMs and finds that GPT-4-Turbo outperforms others in integrating opinions authentically into vaccine responses.
  • It demonstrates that using conducive opinion topics enhances message quality, providing actionable insights for improving public health communication.

Tailoring Vaccine Messaging to Common-Ground Opinions: An Evaluation Framework and Dataset for LLMs

Introduction

The paper addresses a pertinent problem in vaccine communication: personalizing responses by tailoring them to common-ground opinions (CGOs), i.e., opinions or beliefs the intended reader already holds. Through a novel evaluation framework, the research investigates how well several LLMs can generate vaccine responses that build on these shared beliefs.

Task Definition and Dataset Creation

The paper introduces a task for generating vaccine-related responses tailored to specific CGOs. A successful response is defined by its ability to (a schematic rubric for these criteria is sketched after the list):

  1. Address the vaccine concern comprehensively.
  2. Integrate the provided opinion.
  3. Accept the opinion authentically.
  4. Link the opinion meaningfully to the concern.
  5. Strengthen the response by incorporating the opinion.
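
As a concrete illustration, the sketch below encodes these five criteria as a yes/no rubric over a (concern, CGO, response) triple. This is a minimal sketch, not the authors' evaluation code: the class and function names are hypothetical, and the exact rubric wording used in the paper may differ.

```python
from dataclasses import dataclass

# The five criteria above, phrased as yes/no rubric questions.
CRITERIA = [
    "Does the response address the vaccine concern comprehensively?",
    "Does the response integrate the provided opinion?",
    "Does the response accept the opinion authentically?",
    "Does the response link the opinion meaningfully to the concern?",
    "Is the response strengthened by incorporating the opinion?",
]

@dataclass
class TailoringInstance:
    concern: str   # the vaccine concern to be answered
    cgo: str       # the common-ground opinion the reader holds
    response: str  # the candidate tailored response

def build_rubric_prompt(inst: TailoringInstance) -> str:
    """Assemble a prompt asking an evaluator (human or LLM) to check
    each criterion for one (concern, CGO, response) triple."""
    questions = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(CRITERIA))
    return (
        f"Vaccine concern: {inst.concern}\n"
        f"Common-ground opinion: {inst.cgo}\n"
        f"Response: {inst.response}\n\n"
        f"Answer yes or no for each criterion:\n{questions}"
    )
```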

To support this task, the authors created the TAILOR-CGO dataset, comprising 22,400 responses generated by six different LLMs. The dataset enables evaluation of how well models tailor messages to diverse opinion types.
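
The dataset is hosted on the Hugging Face Hub under the repo id given in the abstract. A minimal loading sketch follows; the available splits and column names are not stated here, so the code prints the schema rather than assuming field names.

```python
from datasets import load_dataset

# Repo id taken from the paper's abstract. If the repo defines multiple
# configurations, pass the config name as a second argument.
ds = load_dataset("DukeNLP/tailor-cgo")
print(ds)  # lists available splits and their column names
```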

Model Evaluation and Findings

The paper benchmarks several LLMs, including Llama-2, Vicuna, WizardLM, GPT-3.5, GPT-4, and GPT-4-Turbo, and finds that GPT-4-Turbo produces the best-tailored responses. Performance varies considerably across models, and tailoring quality generally improves with model capability.
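
To make the generation side concrete, here is a hedged sketch of prompting one of the benchmarked models (GPT-4-Turbo, via the OpenAI client) to produce a tailored response. The prompt wording is a paraphrase of the task definition, not the authors' actual template.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_tailored_response(concern: str, cgo: str) -> str:
    # Prompt wording is illustrative, not the paper's exact template.
    prompt = (
        f"A reader has this vaccine concern: {concern}\n"
        f"The reader also holds this opinion: {cgo}\n"
        "Write a response that addresses the concern and meaningfully "
        "relates it to the reader's opinion, without misrepresenting it."
    )
    out = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content
```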

Annotation and Automatic Evaluation

Relative (pairwise) scoring was adopted over absolute scoring because it yielded higher inter-annotator agreement. The annotated data was then used to build automatic evaluators based on GPT-4-Turbo, BERT, and Llama-2. The fine-tuned BERT model performed best, showing that a small distilled model can provide accurate, cost-effective automatic evaluation.
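
Below is a minimal sketch of such a relative (pairwise) evaluator, assuming a standard BERT regression head trained with a margin ranking loss. The input template, base checkpoint, margin, and loss are assumptions for illustration, not the authors' exact training setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)  # scalar "tailoring" score

def score(concern: str, cgo: str, response: str) -> torch.Tensor:
    # Input template is illustrative, not the paper's exact format.
    text = f"Concern: {concern} Opinion: {cgo} Response: {response}"
    enc = tok(text, truncation=True, max_length=512, return_tensors="pt")
    return model(**enc).logits.squeeze(-1)

# One pairwise training step: the human-preferred response of a pair
# should receive the higher score (placeholders stand in for real data).
concern = "Are vaccine ingredients safe?"
cgo = "I trust advice from my family doctor."
better, worse = "<preferred response>", "<other response>"

loss = torch.nn.MarginRankingLoss(margin=0.1)(
    score(concern, cgo, better),
    score(concern, cgo, worse),
    torch.ones(1),
)
loss.backward()  # in practice, wrap this in an optimizer loop
```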

Analysis of Tailoring Strategies

The research explores which opinions are most effective for tailoring vaccine messages. Topics such as 'self-perception' yielded higher-quality responses, while polarized topics like 'religion' and 'race' were less effective. This nuanced analysis underscores the importance of strategic opinion selection for impactful public health messaging.
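
The comparison behind this finding amounts to aggregating tailoring scores by opinion topic, sketched below on toy data; the column names and values are assumptions, not the paper's actual records.

```python
import pandas as pd

# Toy scores; in practice these would come from the automatic evaluator.
df = pd.DataFrame({
    "cgo_topic": ["self-perception", "self-perception", "religion", "race"],
    "score": [0.82, 0.76, 0.55, 0.51],
})

# Mean tailoring score per opinion topic, best first.
print(df.groupby("cgo_topic")["score"].mean().sort_values(ascending=False))
```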

Practical Implications

The paper's findings have significant implications for public health professionals:

  1. Leveraging LLMs can enhance personalized communication in vaccine advocacy.
  2. Identifying and utilizing conducive opinion topics can improve message reception and engagement.
  3. Employing automatic evaluation metrics can streamline the assessment process, facilitating scalable deployment of tailored messaging.

Future Directions

Future research could explore:

  • Advanced methods for identifying audience-specific opinions.
  • More intricate tasks for LLMs to generate nuanced and persuasive vaccine communication.
  • Further optimization of automatic evaluators to enhance performance and reliability.

Conclusion

The paper provides a comprehensive framework for tailoring vaccine messaging using LLMs, highlighting the efficacy of current models and the importance of strategic opinion selection. It opens avenues for practical application in public health and lays the groundwork for future advancements in personalized AI-driven communication.

Overall, the research advances our understanding of the interplay between AI and human communication, offering valuable insights into how sophisticated models can be harnessed to address critical societal challenges.
