Emergent Mind

Multi-Conditional Ranking with Large Language Models

Published Mar 30, 2024 in cs.CL and cs.LG


Utilizing large language models (LLMs) to rank a set of items has become a common approach in recommendation and retrieval systems. Typically, these systems focus on ordering a substantial number of documents in a monotonic order based on a given query. However, real-world scenarios often present a different challenge: ranking a comparatively smaller set of items, but according to a variety of diverse and occasionally conflicting conditions. In this paper, we define and explore the task of multi-conditional ranking by introducing MCRank, a benchmark tailored for assessing multi-conditional ranking across various item types and conditions. Our analysis of LLMs using MCRank indicates a significant decrease in performance as the number and complexity of items and conditions grow. To overcome this limitation, we propose a novel decomposed reasoning method, consisting of EXtracting and Sorting the conditions, and then Iteratively Ranking the items (EXSIR). Our extensive experiments show that this decomposed reasoning method enhances LLMs' performance significantly, achieving up to a 12% improvement over existing LLMs. We also provide a detailed analysis of LLMs' performance across various condition categories and examine the effectiveness of the decomposition step. Furthermore, we compare our method with existing approaches such as Chain-of-Thought and an encoder-type ranking model, demonstrating the superiority of our approach and the complexity of the MCR task. Our dataset and code are released.
Figure: Comparison of EXSIR and zero-shot CoT on paragraph-level items, including ColBERT as a baseline.


  • This paper introduces the concept of multi-conditional ranking (MCR), provides a specialized benchmark named MCRank for evaluating LLMs on MCR tasks, and proposes a novel method, EXSIR, to enhance LLMs' performance.

  • MCRank tests LLMs' abilities to rank items based on multiple conditions across various scenarios, aiming to reflect real-world applications like recommendation and sorting systems.

  • The EXSIR method (EXtract and Sort the conditions, then Iteratively Rank the items) significantly boosts LLMs' accuracy on complex MCR tasks by decomposing the reasoning process.

  • Experimental results show that employing EXSIR leads to up to a 12% accuracy improvement in LLMs like GPT-4 on the MCRank benchmark, indicating a robust method for enhancing multi-conditional ranking tasks.


The ubiquity of recommendation and retrieval systems in digital platforms necessitates advanced methods for ranking a set of items. While significant progress has been made in ranking large document collections, the unique challenge of ranking a smaller set of items based on multiple and potentially conflicting conditions has been less explored. This paper addresses this gap by defining the task of multi-conditional ranking (MCR), presenting MCRank—a benchmark tailored for evaluating MCR across various item types and conditions—and proposing a novel decomposed reasoning method, EXSIR, for enhancing LLMs' performance on MCR tasks.

MCRank Benchmark

MCRank is designed to rigorously test LLMs' abilities in multi-conditional ranking tasks. The benchmark includes diverse categories of conditions, such as positional, locational, temporal, trait-based, and reasoning types, across scenarios involving one to three conditions and sets of 3, 5, or 7 items, classified into token-level and paragraph-level items. The dataset allows for a comprehensive evaluation of model capability in handling complex ranking tasks that are close to real-world applications such as recommendation systems, educational question ordering, and job application sorting.
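To make the task concrete, the snippet below sketches the shape a multi-conditional ranking instance might take. The field names, condition texts, and gold ranking are illustrative assumptions for this summary, not MCRank's actual schema:

```python
# Hypothetical MCR instance: a small item set, prioritized conditions,
# and a gold permutation of the items. (Illustrative only; field names
# are assumptions, not MCRank's real format.)
example = {
    "items": [
        "Job posting A (on-site, deadline June 1)",
        "Job posting B (remote, deadline May 1)",
        "Job posting C (remote, deadline June 1)",
    ],
    "conditions": [
        "First, prioritize postings with an earlier deadline.",   # temporal
        "Among equal deadlines, prefer remote positions.",        # trait-based
    ],
    "item_type": "paragraph-level",   # MCRank distinguishes token- vs. paragraph-level
    "gold_ranking": [1, 2, 0],        # indices of items in target order
}

# A valid gold ranking is a permutation over all items.
assert sorted(example["gold_ranking"]) == list(range(len(example["items"])))
```

A model is then judged on whether the permutation it produces matches the gold ranking implied by applying the conditions in priority order.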

EXSIR: A Decomposed Reasoning Method

This paper introduces EXSIR (EXtract and Sort the conditions, then Iteratively Rank the items), a decomposed reasoning method that significantly improves LLMs' accuracy on multi-conditional ranking tasks. The method first extracts the conditions and sorts them by priority, then iteratively applies the sorted conditions to rank the items. This approach is instrumental in overcoming the performance decline observed in LLMs, including GPT-4, ChatGPT, and Mistral, as the complexity of the ranking task increases.
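The decomposition can be sketched as a small control loop. In the sketch below, the two LLM calls (sorting the conditions and applying one condition to the current ranking) are replaced with deterministic toy functions so the flow runs end to end; this is an assumption for illustration, not the paper's actual prompts:

```python
def exsir_rank(items, prioritized_conditions, sort_conditions, apply_condition):
    """EXSIR control flow: (1) extract and sort conditions by priority,
    (2) iteratively re-rank the items, one condition at a time."""
    ordered = sort_conditions(prioritized_conditions)
    ranking = list(items)
    for condition in ordered:
        ranking = apply_condition(ranking, condition)
    return ranking

# Toy deterministic stand-ins for the two LLM calls. Conditions here are
# (priority, key_function) pairs; lower-priority conditions are applied
# first so the stable sort for the top-priority condition, applied last,
# dominates the final order.
def toy_sort(conditions):
    return [fn for _, fn in sorted(conditions, reverse=True)]

def toy_apply(ranking, key_fn):
    return sorted(ranking, key=key_fn)  # stable: earlier order kept on ties

items = ["pear", "fig", "banana"]
conditions = [(2, len), (1, str.lower)]  # priority 1 = alphabetical order
result = exsir_rank(items, conditions, toy_sort, toy_apply)
print(result)  # ['banana', 'fig', 'pear'] — the priority-1 condition wins
```

In the paper's setting, both `sort_conditions` and `apply_condition` would be prompts to an LLM such as GPT-4; the point of the sketch is only the structure: conditions are resolved one at a time rather than handed to the model all at once.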

Experimental Results

The evaluation of LLMs on MCRank using EXSIR demonstrates notable improvements in performance across various settings, with GPT-4 showing up to a 12% accuracy gain. This highlights the effectiveness of the decomposed reasoning method in bolstering LLMs' capacity to handle intricate multi-conditional ranking tasks. A detailed analysis of performance across condition categories, together with the measured success of the decomposition step itself, further underscores the robustness of the EXSIR method.

Implications and Future Directions

The findings from this research have both practical and theoretical implications. Practically, the EXSIR method and the MCRank benchmark lay the groundwork for developing more sophisticated ranking systems that can navigate the complexities of multiple conditions. Theoretically, the study adds to our understanding of decomposed reasoning in AI and its application in enhancing LLMs' performance.

Future research might explore extending the EXSIR method to other forms of decomposed reasoning tasks beyond ranking, assessing the viability of incorporating user interaction in ranking systems, and evaluating the potential of multi-agent systems where tasks are divided among specialized models for improved efficiency.


This paper presents a significant step forward in the domain of multi-conditional ranking, introducing the comprehensive MCRank benchmark and the EXSIR method. Experimentation demonstrates the enhanced capability of LLMs in accurately performing multi-conditional ranking tasks when leveraging decomposed reasoning. These contributions are expected to facilitate future advancements in the development of more effective and sophisticated recommendation and retrieval systems.

References

  1. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
  2. Cross-age reference coding for age-invariant face recognition and retrieval. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pp. 768–783. Springer
  3. PaLM: Scaling Language Modeling with Pathways
  4. TREC Complex Answer Retrieval Overview. In TREC
  5. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
  6. Successive Prompting for Decomposing Complex Questions
  7. T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
  8. Large Language Model based Multi-Agents: A Survey of Progress and Challenges
  9. Large language models are zero-shot rankers for recommender systems. In European Conference on Information Retrieval, pp. 364–381. Springer
  10. Mistral 7B
  11. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48
  12. ChatGPT: Jack of all trades, master of none. Information Fusion, 99:101861
  13. Can Language Models Understand Physical Concepts?
  14. MS MARCO: A Human-Generated Machine Reading Comprehension Dataset. 2016.
  15. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 188–197
  16. Multi-Stage Document Ranking with BERT
  17. GPT-4 Technical Report
  18. The art of socratic questioning: Recursive thinking with large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 4177–4199
  19. Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting
  20. SQuAD: 100,000+ Questions for Machine Comprehension of Text
  21. Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36
  22. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  23. Gemini: A Family of Highly Capable Multimodal Models
  24. LLaMA: Open and Efficient Foundation Language Models
  25. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837
  26. Large Language Models are Better Reasoners with Self-Verification
  27. A Survey on Large Language Models for Recommendation
  28. The Rise and Potential of Large Language Model Based Agents: A Survey
  29. MAVE: A Product Dataset for Multi-Source Attribute Value Extraction. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, pp. 1256–1265
  30. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36
  31. Evaluating recommender systems: survey and framework. ACM computing surveys, 55(8):1–38
  32. Large Language Models for Information Retrieval: A Survey
  33. Open-source Large Language Models are Strong Zero-shot Query Likelihood Models for Document Ranking
  34. A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models
