
Multi-Conditional Ranking with Large Language Models (2404.00211v3)

Published 30 Mar 2024 in cs.CL and cs.LG

Abstract: Utilizing LLMs to rank a set of items has become a common approach in recommendation and retrieval systems. Typically, these systems focus on ordering a substantial number of documents in a monotonic order based on a given query. However, real-world scenarios often present a different challenge: ranking a comparatively smaller set of items, but according to a variety of diverse and occasionally conflicting conditions. In this paper, we define and explore the task of multi-conditional ranking by introducing MCRank, a benchmark tailored for assessing multi-conditional ranking across various item types and conditions. Our analysis of LLMs using MCRank indicates a significant decrease in performance as the number and complexity of items and conditions grow. To overcome this limitation, we propose a novel decomposed reasoning method, consisting of EXtracting and Sorting the conditions, and then Iteratively Ranking the items (EXSIR). Our extensive experiments show that this decomposed reasoning method enhances LLMs' performance significantly, achieving up to a 14.4% improvement over existing LLMs. We also provide a detailed analysis of LLMs' performance across various condition categories, and examine the effectiveness of the decomposition step. Furthermore, we compare our method with existing approaches such as Chain-of-Thought and existing ranking models, demonstrating the superiority of our approach and the complexity of the MCR task. We released our dataset and code.


Summary

  • The paper defines multi-conditional ranking and introduces MCRank to benchmark LLMs on tasks with varied and conflicting conditions.
  • It proposes EXSIR, a decomposed reasoning method that prioritizes and applies conditions iteratively to boost ranking accuracy.
  • Experimental results reveal up to a 12% accuracy improvement with GPT-4, demonstrating EXSIR’s efficacy in complex ranking scenarios.

Multi-Conditional Ranking with LLMs: Introducing MCRank and EXSIR Method

Introduction

The ubiquity of recommendation and retrieval systems in digital platforms necessitates advanced methods for ranking a set of items. While significant progress has been made in ranking large document collections, the distinct challenge of ranking a smaller set of items based on multiple and potentially conflicting conditions has been less explored. This paper addresses this gap by defining the task of multi-conditional ranking (MCR), presenting MCRank—a benchmark tailored for evaluating MCR across various item types and conditions—and proposing a novel decomposed reasoning method, EXSIR, for enhancing LLMs' performance on MCR tasks.

MCRank Benchmark

MCRank is designed to rigorously test LLMs' abilities in multi-conditional ranking tasks. The benchmark includes diverse categories of conditions such as positional, locational, temporal, trait-based, and reasoning types, across scenarios involving one to three conditions and sets of 3, 5, or 7 items, classified into token-level and paragraph-level items. The crafted dataset allows for comprehensive evaluation of model capability in handling complex ranking tasks that are closer to real-world applications like recommendation systems, educational question ordering, and job application sorting.
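To make the task setup concrete, the sketch below shows what an MCRank-style instance might look like: a small item set, prioritized conditions drawn from different categories, and a gold ranking that must be matched exactly. The field names and values here are illustrative assumptions, not the dataset's actual schema.

```python
# Illustrative sketch of a multi-conditional ranking instance.
# Field names and values are hypothetical, not MCRank's real schema.
example = {
    "items": ["museum", "park", "cafe"],  # token-level items
    "conditions": [
        # (priority, condition category, natural-language condition)
        (1, "temporal", "rank by earliest opening time"),
        (2, "trait", "prefer free-admission venues"),
    ],
    "gold_ranking": ["park", "museum", "cafe"],
}

def is_correct(predicted, instance):
    """A prediction is counted as correct only if it reproduces the
    gold ranking under all conditions jointly."""
    return predicted == instance["gold_ranking"]
```

This all-or-nothing notion of correctness is what makes the task hard as conditions accumulate: satisfying each condition in isolation is not enough.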

EXSIR: A Decomposed Reasoning Method

This paper introduces EXSIR (EXtract and Sort the conditions, then Iteratively Rank the items), a decomposed reasoning method that significantly improves LLMs' performance on multi-conditional ranking tasks. The method first extracts the conditions and sorts them by priority, then iteratively applies the sorted conditions to rank the items. This approach is instrumental in overcoming the performance decline observed in LLMs, including GPT-4, ChatGPT, and Mistral, as the complexity of the ranking task increases.
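As a toy, LLM-free analogue of the EXSIR pipeline, the sketch below represents each extracted condition as a (priority, key function) pair and applies the sorted conditions one at a time via stable sorts, so the highest-priority condition dominates the final order. In the actual method each step is carried out by prompting an LLM; the key functions here are deterministic stand-ins.

```python
def exsir_rank(items, conditions):
    """Toy analogue of EXSIR: sort the conditions by priority, then
    iteratively rank the items under each condition.

    items      -- list of dicts describing the items
    conditions -- list of (priority, key_fn); higher priority wins
    """
    # Step 1: extract and sort the conditions (highest priority first).
    ordered = sorted(conditions, key=lambda c: c[0], reverse=True)

    # Step 2: iteratively rank. Applying conditions from lowest to
    # highest priority with stable sorts lets the highest-priority
    # condition decide the final order, while lower-priority
    # conditions break ties.
    ranking = list(items)
    for _, key_fn in reversed(ordered):
        ranking.sort(key=key_fn)
    return ranking

items = [
    {"name": "cafe", "opens": 9, "free": False},
    {"name": "park", "opens": 6, "free": True},
    {"name": "museum", "opens": 9, "free": True},
]
conditions = [
    (2, lambda x: x["opens"]),     # temporal condition, higher priority
    (1, lambda x: not x["free"]),  # trait condition, lower priority
]
# The park opens first; the museum and cafe tie on opening time, and
# the lower-priority free-admission condition breaks the tie.
ranking = [x["name"] for x in exsir_rank(items, conditions)]
```

The point of the decomposition is visible even in this toy: conditions are resolved in an explicit priority order rather than asked to be satisfied all at once, which is where undifferentiated prompting tends to break down.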

Experimental Results

The evaluation of LLMs on MCRank using EXSIR demonstrates notable improvements in performance across various settings, with GPT-4 showing up to a 12% accuracy enhancement. This highlights the effectiveness of the decomposed reasoning method in bolstering LLMs' capacity to handle intricate multi-conditional ranking tasks. Detailed analysis of performance across condition categories and the success of the decomposition step further underscores the robustness of the EXSIR method.
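The accuracy figures above presumably reflect exact-match scoring, where a predicted ranking counts only if the entire ordering matches the gold order. A minimal sketch of such a metric (the paper's exact scoring details may differ):

```python
def exact_match_accuracy(predictions, gold_rankings):
    """Fraction of instances whose predicted ranking matches the gold
    ranking exactly; a single misplaced item scores zero."""
    assert len(predictions) == len(gold_rankings)
    hits = sum(p == g for p, g in zip(predictions, gold_rankings))
    return hits / len(gold_rankings)
```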

Implications and Future Directions

The findings from this research have both practical and theoretical implications. Practically, the EXSIR method and the MCRank benchmark lay the groundwork for more sophisticated ranking systems that can navigate the complexities of multiple conditions. Theoretically, the paper adds to our understanding of decomposed reasoning in AI and its application in enhancing LLMs' performance.

Future research might explore extending the EXSIR method to other forms of decomposed reasoning tasks beyond ranking, assessing the viability of incorporating user interaction in ranking systems, and evaluating the potential of multi-agent systems where tasks are divided among specialized models for improved efficiency.

Conclusion

This paper presents a significant step forward in the domain of multi-conditional ranking, introducing the comprehensive MCRank benchmark and the EXSIR method. Experimentation demonstrates the enhanced capability of LLMs in accurately performing multi-conditional ranking tasks when leveraging decomposed reasoning. These contributions are expected to facilitate future advancements in the development of more effective and sophisticated recommendation and retrieval systems.