In-Context Ranking (ICR)

Updated 8 October 2025
  • In-context Ranking (ICR) is a framework where an item’s relevance is determined by both its features and the context provided through co-occurring items or demonstration examples.
  • ICR spans context-aware neural architectures such as FETA-Net and FATE-Net as well as demonstration engineering, which uses prompt-based control of LLMs to optimize trade-offs between relevance, fairness, and diversity.
  • ICR systems integrate contextual cues and multi-objective ranking strategies, achieving significant gains in metrics such as nDCG and fairness with minimal loss in precision.

In-context Ranking (ICR) refers to a broad family of ranking approaches in which the ranking function or decision for any given set of items is explicitly sensitive to, and conditioned by, the context in which the ranking is performed. In this paradigm—prominent in LLMs, information retrieval (IR), and preference learning—the model’s output for an item is not solely a function of that item’s features but is modulated by the features of other items in the set, the prompt context, or incorporated demonstration examples. Modern ICR systems span techniques leveraging prompt-based learning in LLMs, contextual citation or document networks, demonstration-based control of ranking properties, and scalable generative model architectures that optimize both for holistic reasoning and computational efficiency.

1. Contextual Mechanisms in Ranking Functions

ICR methodologies fundamentally reject context-independent scoring, instead representing the utility or relevance of an item as $U(x, C)$: a function of the item $x$ and its context $C$, where $C$ denotes the set of co-occurring alternatives (Pfannschmidt et al., 2018). This formulation captures phenomena such as the compromise, attraction, and similarity effects in preference learning. Two principal neural architectures embodying this view are FETA-Net (First Evaluate Then Aggregate) and FATE-Net (First Aggregate Then Evaluate), which model, respectively, pairwise item-context interactions and holistic aggregation of contextual cues before scoring. Both guarantee permutation invariance with respect to the item order in context.
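
To make context-dependent scoring concrete, the following minimal PyTorch sketch follows the FATE idea of aggregating context before evaluation. Layer sizes, names, and the mean-pooling aggregator are illustrative assumptions, not the published FETA-Net/FATE-Net architectures.

```python
# Minimal sketch of a FATE-style contextual scorer: embed the items, pool them into a
# permutation-invariant context summary, then score each item jointly with that summary.
import torch
import torch.nn as nn

class FATEScorer(nn.Module):
    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.score = nn.Sequential(nn.Linear(n_features + hidden, hidden),
                                   nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, items: torch.Tensor) -> torch.Tensor:
        # items: (batch, set_size, n_features)
        ctx = self.embed(items).mean(dim=1, keepdim=True)   # permutation-invariant aggregate over the set
        ctx = ctx.expand(-1, items.size(1), -1)              # broadcast the context summary to every item
        return self.score(torch.cat([items, ctx], dim=-1)).squeeze(-1)  # U(x, C) per item

scores = FATEScorer(n_features=8)(torch.randn(2, 5, 8))  # (2, 5) context-dependent utilities
```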

The field has expanded to include generative LLM-based ICR, where a task instruction, query, and large set of candidate documents are concatenated as input. The model’s contextual reasoning—rather than isolated feature scoring—determines relative relevance (Gupta et al., 6 Oct 2025). In re-ranking, contemporary methods exploit LLM-internal mechanisms such as attention maps as proxies for contextual relevance, without requiring score generation via language output (Chen et al., 3 Oct 2024).
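
For the attention-based variant, a hedged sketch is given below: each shortlisted candidate is scored by the attention mass flowing from the prompt's final token to that candidate's token span. The choice of layer, head averaging, and the helper signature are assumptions for illustration, not the exact procedure of Chen et al. (2024).

```python
# Hedged sketch: rank candidates by attention mass from the final prompt token to each
# candidate's token span, avoiding any generated ranking text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def attention_relevance(model_name: str, prompt: str,
                        doc_spans: list[tuple[int, int]]) -> list[float]:
    # doc_spans: (start, end) token indices of each candidate document inside the prompt.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        attn = model(**inputs, output_attentions=True).attentions  # per-layer (1, heads, seq, seq)
    last = attn[-1].mean(dim=1)[0]      # average heads in the last layer -> (seq, seq)
    query_row = last[-1]                # attention paid by the final token to every position
    return [query_row[s:e].sum().item() for s, e in doc_spans]     # one proxy score per candidate
```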

2. Demonstration Engineering and Prompt-based Control

A central mechanism for accomplishing fine-grained behavioral control in ICR lies in demonstration engineering: the deliberate selection and ordering of in-context examples to steer the output ranking behavior of LLMs (Sinhababu et al., 23 May 2025). The approach is realized in two stages: (1) an initial retrieval system (e.g., BM25, ColBERT) produces shortlists, and (2) an LLM is prompted with contextually similar demonstration rankings engineered to encode desired trade-offs—such as group fairness, polarity diversity, or topical diversity.

Demonstrations are constructed by retrieving a similar query from a query log (e.g., MS MARCO), redistributing its top candidate documents to reflect a specified distributional target (e.g., balanced gender/stance/topic), and presenting this as an ordered list in the LLM input prompt. A greedy KL-divergence minimization is used to select which document to place next in the demonstration example (given a categorical distribution $\tau(R_Q)$ for group attributes), ensuring the attribute distribution in the demonstration matches the target as closely as possible. The LLM then “learns” the target behavior in situ and outputs a permutation over candidates for the current query.
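
A minimal sketch of how such an engineered demonstration might be presented alongside the current query follows; the prompt wording and the `call_llm` helper are hypothetical, not the exact template of Sinhababu et al. (2025).

```python
# Illustrative prompt assembly: an engineered demonstration ranking for a similar query
# is shown first, then the current query and its shortlisted candidates, and the model
# is asked for a permutation over the candidates.
def build_demo_prompt(demo_query: str, demo_ranking: list[str],
                      query: str, candidates: list[str]) -> str:
    parts = ["Rank the candidate passages for the query, mirroring the example's ordering behavior.",
             f"Example query: {demo_query}", "Example ranking:"]
    parts += [f"{r}. {doc}" for r, doc in enumerate(demo_ranking, start=1)]
    parts += [f"Query: {query}", "Candidates:"]
    parts += [f"[{i}] {doc}" for i, doc in enumerate(candidates)]
    parts.append("Output a comma-separated permutation of candidate indices, most relevant first.")
    return "\n".join(parts)

def icr_rerank(demo_query, demo_ranking, query, candidates, call_llm):
    reply = call_llm(build_demo_prompt(demo_query, demo_ranking, query, candidates))  # e.g. "2,0,1"
    order = [int(t) for t in reply.split(",") if t.strip().isdigit()]
    return [candidates[i] for i in order if 0 <= i < len(candidates)]
```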

Crucially, ablation studies reveal that demonstration examples have a strongly causal effect on LLM ranking output: inverting, adversarially permuting, or degrading the demonstration order measurably reduces both auxiliary property fulfillment (such as fairness or diversity) and standard relevance metrics. This “model reprogramming” via demonstration selection affords dynamic, non-parametric control over ranking behavior.

3. Balancing Relevance, Fairness, and Diversity

Traditional IR models focus on optimizing for relevance, often measured by metrics such as nDCG. However, real-world search and recommendation systems frequently face requirements for diversity (avoiding homogeneous topical or group representations) or for fairness (ensuring balanced representation across sensitive attributes like gender or stance). The ICR framework supports these objectives by encoding desired attribute distributions in demonstration rankings, thereby guiding the LLM to output lists that satisfy these properties (Sinhababu et al., 23 May 2025).

Formally, for group fairness or diversity objectives, the demonstration engineering method defines a target distribution $\tau(R_Q)$ and uses a selection function

$$D_{p+1} = \arg\min_{D \in \mathcal{C}_{p+1}} \mathrm{KL}\!\left(\tau(R_Q),\; \tau(\langle D_1, \ldots, D_p \rangle \cup \{D\})\right),$$

where $\mathcal{C}_{p+1}$ is the set of available candidates from each attribute partition. The iterative selection greedily minimizes the discrepancy (KL divergence) between the evolving demonstration and the target distribution.
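
The greedy selection can be sketched as follows; attribute labels are represented as plain strings, and the helper names and smoothing constant are illustrative assumptions.

```python
# Sketch of the greedy KL-minimizing selection above: at each step, add the candidate
# whose attribute label brings the demonstration's empirical distribution closest to
# the target tau(R_Q).
import math
from collections import Counter

def kl(p: dict, q: dict, eps: float = 1e-9) -> float:
    # KL(p || q) over attribute values; eps guards against zero mass in q.
    return sum(pv * math.log(pv / (q.get(k, 0.0) + eps)) for k, pv in p.items() if pv > 0)

def attr_dist(labels: list[str]) -> dict:
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def greedy_demonstration(candidates: list[tuple[str, str]], target: dict, k: int) -> list[str]:
    # candidates: (doc_id, attribute_label) pairs from the shortlist; target: tau(R_Q).
    demo_ids, demo_labels, pool = [], [], list(candidates)
    for _ in range(min(k, len(pool))):
        best = min(pool, key=lambda c: kl(target, attr_dist(demo_labels + [c[1]])))
        demo_ids.append(best[0]); demo_labels.append(best[1]); pool.remove(best)
    return demo_ids  # ordered demonstration D_1, ..., D_k

# Example: build a 4-document demonstration targeting a 50/50 stance balance.
print(greedy_demonstration([("d1", "pro"), ("d2", "pro"), ("d3", "con"), ("d4", "con")],
                           target={"pro": 0.5, "con": 0.5}, k=4))
```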

Empirical results across TREC DL 2019/2020 (topical diversity), TREC Fairness, and Touche (group fairness/polarity diversity) demonstrate up to 19% relative gains in $\alpha$-nDCG and significant improvements in fairness/demographic metrics, with only minor losses in nDCG compared to relevance-only optimization. Post-hoc re-ranking and supervised learning (e.g., FA*IR, DELTR, MMR) are outperformed by ICR with demonstration engineering, especially when using powerful LLMs such as GPT-4o-mini.
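
For reference, the diversity metric reported above can be sketched as below: an illustrative implementation of $\alpha$-nDCG that uses the customary greedy approximation of the ideal ranking, not a replacement for the official evaluation tooling.

```python
# Sketch of alpha-nDCG@k: novel subtopic coverage earns full gain, repeated coverage of a
# subtopic is discounted by (1 - alpha), and the normalizer uses a greedily built ideal ranking.
import math

def alpha_dcg(ranking, subtopics, alpha=0.5, k=10):
    # ranking: doc ids in ranked order; subtopics: doc_id -> set of covered subtopic ids.
    seen, dcg = {}, 0.0
    for rank, doc in enumerate(ranking[:k], start=1):
        gain = sum((1 - alpha) ** seen.get(s, 0) for s in subtopics.get(doc, set()))
        dcg += gain / math.log2(rank + 1)
        for s in subtopics.get(doc, set()):
            seen[s] = seen.get(s, 0) + 1
    return dcg

def alpha_ndcg(ranking, subtopics, alpha=0.5, k=10):
    # Normalize by a greedily constructed ideal ordering (the standard approximation).
    pool, ideal = list(subtopics.keys()), []
    while pool and len(ideal) < k:
        best = max(pool, key=lambda d: alpha_dcg(ideal + [d], subtopics, alpha, k))
        ideal.append(best); pool.remove(best)
    denom = alpha_dcg(ideal, subtopics, alpha, k)
    return alpha_dcg(ranking, subtopics, alpha, k) / denom if denom > 0 else 0.0
```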

4. Evaluation Across Multiple Test Collections

ICR with demonstration engineering is evaluated on four test collections targeting distinct auxiliary properties:

| Dataset | Auxiliary Objective | Baseline Methods |
| --- | --- | --- |
| TREC DL 2019/2020 | Topical Diversity | MMR, Supervised SetEncoders |
| TREC Fairness 2022 | Group Fairness | FA*IR, DELTR |
| Touche | Polarity Diversity | Baseline re-ranking |

Findings confirm:

  • Improved $\alpha$-nDCG (diversity) performance versus both unsupervised diversification and listwise baselines.
  • Superior group fairness (AWRF, M1) with minimal loss in relevance compared to both post-hoc and learning-to-rank baselines.
  • Sensitive, causal dependence of output metrics upon demonstration example engineering, confirmed by declines in both property satisfaction and nDCG under adversarial or static demonstration ordering.

Ablation confirms that relevance can be maintained at high levels while promoting auxiliary objectives, and the process generalizes across multiple datasets without task-specific re-training.

5. Practical Implications and Adaptability

In-context demonstration engineering allows practitioners to control ranking policy on the fly by modifying prompt examples, without retraining or explicit supervision for each task or objective. This reduces engineering overhead: standard supervised pipelines require task-specific model training and tuning for each ranking scenario, whereas ICR as studied here enables “universal” model adaptation via prompt manipulation.

This approach is inherently flexible and is applicable to dynamic or rapidly changing ranking requirements. As objectives or operational constraints change (e.g., new fairness mandates, emerging needs for topical diversity), one can alter system behavior instantaneously by updating demonstration rankings. The causal dependence of ranking output on demonstration order supports real-time operationalization and facilitates rapid prototyping and deployment of new ranking policies.

Further, this paradigm aligns with broader trends in task definition via prompt-based LLM control: it implies that other complex objectives beyond ranking (e.g., summarization, dialogue) may be similarly engineered into LLMs via carefully constructed demonstrations.

6. Future Directions and Theoretical Boundaries

The demonstration engineering approach to ICR invites several lines of future research:

  • Formal robustness analysis: Under what circumstances can demonstration engineering fail, and what are the boundary conditions on LLM generalization given imperfect or adversarial examples?
  • Multi-objective demonstration design: How can one simultaneously optimize for multiple, possibly conflicting, ranking objectives using composite prompt examples?
  • Integration with explicit instruction: Hybrid systems incorporating both explicit instructions (“balance gender 50/50”) and demonstration examples could further refine LLM re-ranking control.
  • Extension beyond ranking: The (causal) influence of in-context examples on model output, as observed for ranking, may generalize to other task behavioral properties (e.g., calibration, abstention, query reformulation).
  • Scalability and automation: Automating demonstration selection and ordering (possibly with reinforcement learning or active learning) may optimize trade-offs at scale.

A plausible implication is that as LLMs scale and become more widely deployed in IR and recommendation, flexible, demonstration-based ICR offers both a potent practical tool for meeting complex, evolving objectives and a testbed for understanding the limitations and emergent properties of prompt-conditioned reasoning.

References Table

| Paper Title | arXiv ID | Relevance |
| --- | --- | --- |
| Modeling Ranking Properties with In-Context Learning (Sinhababu et al., 23 May 2025) | | Demonstration engineering for balancing ranking objectives |

In sum, in-context ranking reframes the ranking problem as a prompt-conditioned task, with demonstration selection serving as a primary mechanism to encode multi-objective trade-offs dynamically and efficiently—a paradigm validated across diverse IR benchmarks and auxiliary objectives.
