AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence (2404.11826v1)
Abstract: As the integration of LLMs into daily life increases, there is a clear gap in benchmarks for advising on subjective and personal dilemmas. To address this, we introduce AdvisorQA, the first benchmark for assessing LLMs' capability to offer advice on deeply personalized concerns, built from the LifeProTips subreddit forum. On this forum, users post advice-seeking questions and receive an average of 8.9 pieces of advice per query, with an average of 164.2 upvotes from hundreds of users, embodying a collective intelligence framework. We therefore construct a benchmark comprising daily-life questions, diverse corresponding responses, and majority-vote rankings used to train our helpfulness metric. Baseline experiments validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and human evaluation, and analyze phenomena beyond the trade-off between helpfulness and harmlessness. AdvisorQA marks a significant step toward QA systems that provide personalized, empathetic advice, showcasing LLMs' improved understanding of human subjectivity.
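The abstract mentions training a helpfulness metric from majority-vote (upvote-based) rankings of multiple responses per question. One standard way to turn a full ranking into a training objective is the Plackett-Luce listwise loss, a generalization of the Bradley-Terry pairwise model. The sketch below is an illustrative implementation of that loss, not the paper's actual code; the function name and the plain-Python setup are assumptions.

```python
import math

def plackett_luce_nll(scores):
    """Negative Plackett-Luce log-likelihood of an observed ranking.

    scores: list of metric scores for K candidate answers, where index 0
    is the most-upvoted answer and the list is ordered best-to-worst.
    At each step i, the model picks answer i from the remaining pool
    with probability softmax(scores[i:])[0].
    """
    nll = 0.0
    for i in range(len(scores)):
        # Numerically stable log-sum-exp over the remaining candidates.
        tail = scores[i:]
        m = max(tail)
        lse = m + math.log(sum(math.exp(s - m) for s in tail))
        # -log P(answer i chosen at step i) = logsumexp(tail) - scores[i]
        nll += lse - scores[i]
    return nll
```

As a sanity check, scores that agree with the upvote ordering yield a lower loss than scores that invert it, so minimizing this loss pushes a metric toward the community's majority vote.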