Vote-in-Context (ViC)

Updated 9 November 2025
  • Vote-in-Context (ViC) is a framework that integrates candidate content with external context to enable adaptive, zero-shot re-ranking and aggregation.
  • In retrieval systems, ViC serializes content and metadata for Vision-Language Models, achieving significant improvements in recall and fusion performance.
  • In social choice, ViC applies a menu transformation to incorporate public anchoring signals, leading to enhanced selection probability and welfare outcomes.

Vote-in-Context (ViC) refers to a class of mechanisms and algorithmic strategies in which candidate or alternative selection is influenced by external contextual information—whether by explicitly encoding content/meta-data into machine learning prompts or by formally modeling the effects of public signals on collective decisions within voting mechanisms. In modern retrieval and fusion systems, ViC leverages both candidate content and external rank signals to enable zero-shot, context-aware reranking; in the formal analysis of social choice, Vote-in-Context encapsulates schemes where voting rules are robustly adapted to incorporate information externalities without requiring fine-grained (cardinal) preference elicitation. Recent advances demonstrate the effectiveness of ViC designs for cross-modal retrieval tasks as well as their welfare-theoretic implications in expectation.

1. Fundamental Principles of Vote-in-Context

ViC unifies two conceptually related but technically distinct domains: multi-retriever fusion in machine learning and theoretical voting mechanisms responsive to informational context.

  • Fusion and Retrieval Perspective: Traditional fusion approaches (e.g., CombSUM, RRF) operate on candidate ranks or scores across multiple retrievers, neglecting the actual candidate content. Vote-in-Context reframes fusion as a zero-shot reasoning task for a frozen, off-the-shelf vision-language model (VLM). The paradigm serializes both content evidence (e.g., image grids, subtitles) and retriever metadata (list position, duplicate frequency) into the VLM's prompt, enabling the model to perform adaptive, in-context "voting" over candidate sets.
  • Voting and Social Choice Perspective: In social choice, ViC mechanisms formalize how external information—modeled as an "anchoring" point or public signal—shifts voter preferences, and devise aggregation procedures to simulate these shifts through menu transformations without direct access to cardinal utilities.

Both approaches maintain training-free operation and exploit the mechanism's ability to simulate the effect of context solely by processing appropriately serialized or transformed inputs.

2. Technical Methodologies

A. Machine Learning Implementation

  • Candidate List Construction: For a query $q$, the candidate sequence $C(q)$ is constructed by round-robin interleaving the top $k_{\max}$ items from each of $M$ retrievers, forming a list of length $K$. Each candidate's position in $C(q)$ provides an implicit rank signal; duplicate entries (occurring when a candidate appears in several retrievers' top-$k$) encode consensus (see the sketch after this list).
  • Prompt Engineering: The VLM prompt consists of a serialized query $Q(q)$ (text or "S-Grid") followed by the interleaved candidate representations $E(C_i)$. For videos, $E(C_i)$ is the S-Grid: a compact $s \times s$ grid of sampled frames with optional subtitles. No explicit numerical ranks are included; the model infers relative importance from order and repetition.
  • Inference: The VLM $g_\Theta$ is queried with $(Q(q), E(C_1), \ldots, E(C_K))$ and returns a permutation $\hat{\pi}$ over the candidates, which effects the reranking or fusion. The system is entirely training-free: no model weights are updated for the reranker/fuser.
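The round-robin interleaving described above can be sketched in a few lines of Python. The function name, the list-of-lists input format, and the use of candidate IDs are illustrative assumptions rather than the authors' reference implementation.

```python
# Sketch of ViC candidate-list construction (illustrative; names and the
# list-of-lists input format are assumptions, not the reference implementation).

def build_candidate_list(ranked_lists, k_max):
    """Round-robin interleave the top-k_max items from M retrievers.

    ranked_lists: M ranked lists of candidate IDs, best first.
    Duplicates are kept on purpose: a candidate surfacing in several
    retrievers' top-k acts as an implicit consensus ("vote") signal.
    """
    interleaved = []
    for rank in range(k_max):                  # rank 0 = best item per retriever
        for retriever_list in ranked_lists:
            if rank < len(retriever_list):
                interleaved.append(retriever_list[rank])
    return interleaved                          # length K <= M * k_max


# Example: two retrievers, top-3 each; "v2" appears twice (consensus).
C_q = build_candidate_list([["v1", "v2", "v3"], ["v2", "v4", "v5"]], k_max=3)
# C_q == ["v1", "v2", "v2", "v4", "v3", "v5"]
```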

B. Social Choice Modeling

  • Voters and Alternatives: Let $N = \{1, \ldots, n\}$ be the set of voters and $A = [m] = \{1, \ldots, m\}$ the set of alternatives.
  • Informational Context: Each voter $i$ with cardinal utility $u_i \in \Delta \subset \mathbb{R}^m_+$ (unit sum) receives an anchoring point $w \in \Delta$ and weight $\alpha \in [0,1]$, producing effective utility $v_i = (1-\alpha) u_i + \alpha w$.
  • Coarse Elicitation and Intermediary Mechanism: Rather than eliciting $u_i$, only a coarse report $r_i \in \mathcal{R}$ (e.g., Plurality) is submitted. The mechanism simulates the effect of anchoring via a menu transformation $\phi(r) = (r - \alpha w)/(1 - \alpha)$ and applies the same positional aggregation rule.
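A minimal Python sketch of the anchoring model and menu transformation, assuming a NumPy vector representation of utilities, reports, and the anchor (all names and example values are hypothetical):

```python
import numpy as np

def effective_utility(u_i, w, alpha):
    """Anchored utility v_i = (1 - alpha) * u_i + alpha * w."""
    return (1 - alpha) * np.asarray(u_i) + alpha * np.asarray(w)

def menu_transform(r, w, alpha):
    """Menu transformation phi(r) = (r - alpha * w) / (1 - alpha).

    By construction (1 - alpha) * phi(r) + alpha * w == r, which is how the
    intermediary mechanism simulates anchored aggregation from coarse reports
    without eliciting the cardinal u_i.
    """
    return (np.asarray(r) - alpha * np.asarray(w)) / (1 - alpha)

# Example: a Plurality-style report for alternative 0 (m = 3), anchored toward
# alternative 2 with weight alpha = 0.25.
r = np.array([1.0, 0.0, 0.0])   # coarse report
w = np.array([0.1, 0.2, 0.7])   # public anchoring signal
print(menu_transform(r, w, alpha=0.25))
```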

3. Mathematical Formalizations

Machine Learning: Serializing Video Content

S-Grid Mapping:

  • Frames: $t_i = \lfloor (i-1) \cdot F / (s^2 - 1) \rfloor,\ i = 1, \ldots, s^2$
  • Grid: $\mathrm{Grid}(v; s) = [\text{frame}_{t_1}, \ldots, \text{frame}_{t_{s^2}}]$
  • S-Grid: $\mathrm{S\text{-}Grid}(v) = (\mathrm{Grid}(v; s), a_v)$, where $a_v$ is an optional subtitle string
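The mapping above can be sketched as follows; the interpretation of $F$ as the last valid frame index and the helper names are assumptions for illustration.

```python
# Sketch of S-Grid construction. Treating F as the last valid frame index and
# the helper names below are assumptions for illustration.

def sgrid_frame_indices(F, s):
    """Uniform temporal sampling: t_i = floor((i - 1) * F / (s^2 - 1)), i = 1..s^2."""
    n = s * s
    return [(i - 1) * F // (n - 1) for i in range(1, n + 1)]

def s_grid(frames, s, subtitle=None):
    """Return (Grid(v; s), a_v): the s x s grid of sampled frames plus an optional subtitle."""
    F = len(frames) - 1                        # last frame index (assumption)
    grid = [frames[t] for t in sgrid_frame_indices(F, s)]
    return grid, subtitle

# Example: a 2x2 grid over a 100-frame video samples frames 0, 33, 66, 99.
print(sgrid_frame_indices(99, 2))              # -> [0, 33, 66, 99]
```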

Prompt Structure (Pseudo):

Query: "Q(q)"

Candidates (interleaved order; duplicates allowed):
  1) [ImageGrid_1], Subtitle: "…"
  ...
  K) [ImageGrid_K], Subtitle: "…"

Instruction:
  "Rank these K candidates by relevance… Items appearing multiple times indicate multiple retriever votes; earlier positions indicate higher original rank…"

Social Choice: Context-Aware Voting

  • Transformed Menu: $\mathcal{M} = \{\phi(r) : r \in \mathcal{R}\}$ with $\phi(r) = (r - \alpha w)/(1 - \alpha)$
  • Outcome Distribution: For Plurality, the probability of selecting the context-favored alternative $a^* = \arg\max_a w_a$ satisfies

$P(\mathrm{Outcome}(M) = a^*) \;\geq\; \Pr\big[\mathrm{Binom}(n, q_{e_{a^*}}) \geq n/2\big]$

  • Expected Welfare: With $Y = \mathrm{Outcome}(M)$ denoting the selected alternative, the expected social welfare is

$\mathbb{E}[W(M)] = n \,\langle v, \nu^{\alpha, w}(f) \rangle$

where $v = \mathbb{E}_\mu[u]$ and $\nu^{\alpha, w}(f)_a = P(\mathrm{Outcome}(M) = a)$.
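These quantities can be illustrated with a Monte Carlo sketch under strong simplifying assumptions: i.i.d. Dirichlet utilities, a single homogeneous anchor, and $q_{e_{a^*}}$ read as the per-voter probability of voting for $a^*$. This is an illustrative simulation, not the paper's derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_voters, w, alpha, n_trials=5_000):
    """Estimate P(Plurality outcome = a*) and expected social welfare when each
    voter votes for the argmax of the anchored utility v_i = (1-alpha)*u_i + alpha*w."""
    w = np.asarray(w)
    m = len(w)
    a_star = int(np.argmax(w))                     # context-favored alternative
    wins, welfare = 0, 0.0
    for _ in range(n_trials):
        u = rng.dirichlet(np.ones(m), size=n_voters)   # i.i.d. unit-sum utilities
        v = (1 - alpha) * u + alpha * w                # anchored utilities
        votes = np.bincount(np.argmax(v, axis=1), minlength=m)
        winner = int(np.argmax(votes))                 # Plurality outcome
        wins += (winner == a_star)
        welfare += u[:, winner].sum()                  # realized social welfare
    return wins / n_trials, welfare / n_trials

# Example: 51 voters, anchor favoring alternative 2, anchoring weight 0.3.
print(simulate(n_voters=51, w=[0.1, 0.2, 0.7], alpha=0.3))
```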

4. Empirical Results and Performance

Retrieval and Fusion

  • Datasets: MSR-VTT, DiDeMo, ActivityNet, VATEX; tasks: text-to-video (t2v), video-to-text (v2t)
  • Zero-Shot Protocol: No fine-tuning of the VLM or reranker/fuser
  • First-Stage Retrievers: CLIP4Clip, VAST, GRAM, InternVideo2-6B (all frozen)
  • Baselines: CombSUM/CombMNZ (score fusion), RRF (rank fusion); see the sketch after this list
  • Performance: On MSR-VTT,
    • Single-list reranking with ViC: CLIP4Clip t2v recall@1 improves from 34.4 to 64.2; up to +40 recall@1 over SOTA on some benchmarks
    • Ensemble fuser: CombSUM t2v recall@1 of 84.4 increases to 87.1 with ViC
    • VATEX v2t: 77.6 (baseline) to 99.6 (ViC)
  • Ablations:
    • Grid size: $2 \times 2$ or $3 \times 3$ is optimal; $1 \times 1$ under-represents temporality, while $4 \times 4$ is overly compressed
    • Metadata ("No Duplicates"): Removing duplicate candidate signal reduces recall@1, confirming importance of consensus encoding

Social Choice Outcomes

  • Probability and Welfare Guarantees: Under mild alignment between $w$ and the voters' mean utilities, introducing context strictly improves the probability of selecting the context-favored alternative and, under the same alignment conditions, increases expected social welfare.
  • Theoretical Bounds: The mechanism $M$ (via $\phi$) exactly simulates context-anchored aggregation using only ordinal reports; the probability of welfare loss is bounded by a Chernoff-type tail.

5. Limitations and Practical Constraints

  • VLM Computational Cost: ViC replaces arithmetic fusion with a full VLM forward pass over $K$ multimodal inputs; inference time grows linearly in $K$.
  • Effective Attention Window: Performance degrades as $K$ exceeds the VLM's context capacity.
  • Recall Bound in Two-Stage Systems: If a relevant candidate is absent from first-stage retrieval, ViC cannot recover it.
  • S-Grid Lossiness: Uniform temporal frame sampling may omit significant short events in videos.
  • Social Choice Generality: The formal model presumes homogeneous anchoring and i.i.d. utilities; handling heterogeneous or adversarial information remains open.

6. Broader Implications and Future Directions

  • Retrieval Systems: ViC demonstrates that prompt engineering and in-context serialization convert large, instruction-following VLMs into state-of-the-art, training-free rerankers and fusion modules for multimodal retrieval.
  • Extensibility: Extensions to additional modalities (audio, tabular) through analogous serialization strategies are suggested as a next step.
  • Voting Mechanisms: The intermediary mechanism approach allows ordinal voting systems to incorporate public information robustly, with formal guarantees, while maintaining practical feasibility (no need for cardinal elicitation).
  • Research Challenges: Future work may involve targeted prompt or parameter-efficient tuning for smaller VLMs, query-aware keyframe selection for S-Grids, and extending social choice analysis to multiple or conflicting contexts.

7. Summary Table of Key Mechanism Components

| Dimension | Retrieval ViC (VLMs) | Social Choice ViC |
| --- | --- | --- |
| Context source | Retriever list composition, metadata, candidate order | Public anchoring signal $(w, \alpha)$ |
| Mechanism type | Prompt engineering; in-context model-based reasoning | Menu transformation, context-aware aggregation |
| Output | Reranked/fused candidate list | Selected alternative; welfare guarantee |

Vote-in-Context provides a systematic, evidence-driven method of adapting aggregation to external contextual information, yielding substantial empirical performance gains in retrieval while establishing formal improvements in selection probability and welfare in voting scenarios (Eltahir et al., 3 Nov 2025, Chen et al., 11 Apr 2024).
