Vote-in-Context (ViC)
- Vote-in-Context (ViC) is a framework that integrates candidate content with external context to enable adaptive, zero-shot re-ranking and aggregation.
- In retrieval systems, ViC serializes content and metadata for Vision-Language Models, achieving significant improvements in recall and fusion performance.
- In social choice, ViC applies a menu transformation to incorporate public anchoring signals, improving the probability of selecting the context-favored alternative and yielding gains in expected welfare.
Vote-in-Context (ViC) refers to a class of mechanisms and algorithmic strategies in which candidate or alternative selection is influenced by external contextual information—whether by explicitly encoding content/meta-data into machine learning prompts or by formally modeling the effects of public signals on collective decisions within voting mechanisms. In modern retrieval and fusion systems, ViC leverages both candidate content and external rank signals to enable zero-shot, context-aware reranking; in the formal analysis of social choice, Vote-in-Context encapsulates schemes where voting rules are robustly adapted to incorporate information externalities without requiring fine-grained (cardinal) preference elicitation. Recent advances demonstrate the effectiveness of ViC designs for cross-modal retrieval tasks as well as their welfare-theoretic implications in expectation.
1. Fundamental Principles of Vote-in-Context
ViC unifies two conceptually related but technically distinct domains: multi-retriever fusion in machine learning and theoretical voting mechanisms responsive to informational context.
- Fusion and Retrieval Perspective: Traditional fusion approaches (e.g., CombSUM, RRF) operate only on candidate ranks or scores across multiple retrievers, neglecting the actual candidate content (a sketch of such a baseline appears below). Vote-in-Context reframes fusion as a zero-shot reasoning task for a frozen, off-the-shelf Vision-Language Model (VLM). The paradigm serializes both content evidence (e.g., image grids, subtitles) and retriever metadata (list position, duplicate frequency) into the VLM's prompt, enabling the model to perform adaptive, in-context "voting" over candidate sets.
- Voting and Social Choice Perspective: In social choice, ViC mechanisms formalize how external information—modeled as an "anchoring" point or public signal—shifts voter preferences, and devise aggregation procedures to simulate these shifts through menu transformations without direct access to cardinal utilities.
Both approaches maintain training-free operation and exploit the mechanism's ability to simulate the effect of context solely by processing appropriately serialized or transformed inputs.
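As a point of contrast with the ViC paradigm, the sketch below shows a conventional rank-only fusion baseline (Reciprocal Rank Fusion); the toy candidate ids and the smoothing constant k=60 are illustrative choices, not taken from the source.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Classic RRF: fuse multiple ranked lists using only rank positions.

    ranked_lists: list of candidate-id lists, best-first, one per retriever.
    k: smoothing constant (60 is a value commonly used in the literature).
    Note that the candidates' *content* never enters the computation --
    this is exactly the information ViC adds back in via the VLM prompt.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, candidate in enumerate(ranking, start=1):
            scores[candidate] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example with two retrievers returning video ids (illustrative only).
fused = reciprocal_rank_fusion([["v3", "v1", "v7"], ["v1", "v9", "v3"]])
print(fused)  # candidates ranked highly by both retrievers float to the top
```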
2. Technical Methodologies
A. Machine Learning Implementation
- Candidate List Construction: For a query q, the candidate sequence is constructed by round-robin interleaving the top-k items from each of the M first-stage retrievers, forming a list of length K (a sketch follows this list). Each candidate's position in the sequence provides an implicit rank signal; duplicate entries (occurring when a candidate appears in several retrievers' top-k) encode consensus.
- Prompt Engineering: The VLM prompt consists of a serialized query (text or an S-Grid) followed by the interleaved candidate representations. For videos, each candidate is represented by its S-Grid: a compact grid of uniformly sampled frames with an optional subtitle. No explicit numerical ranks are included; the model infers relative importance from order and repetition.
- Inference: The VLM is queried with the serialized prompt and returns a permutation over the K candidates, which effects the reranking or fusion. The system is entirely training-free: no model weights are updated for the reranker or fuser.
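A minimal sketch of the round-robin interleaving step described in the list above, assuming each retriever returns an ordered list of candidate ids; the function name and data layout are illustrative, not the authors' implementation.

```python
def interleave_round_robin(retriever_lists, top_k):
    """Round-robin interleave the top-k items of each retriever.

    retriever_lists: ranked candidate-id lists, one per retriever.
    Duplicates are deliberately kept: a candidate surfaced by several
    retrievers appears several times, which the VLM can read as an
    implicit vote count (consensus signal).
    """
    interleaved = []
    for position in range(top_k):
        for ranking in retriever_lists:
            if position < len(ranking):
                interleaved.append(ranking[position])
    return interleaved

# Example: three retrievers, top-2 each -> a list of length K = 6 with duplicates.
lists = [["v1", "v4"], ["v4", "v2"], ["v1", "v7"]]
print(interleave_round_robin(lists, top_k=2))  # ['v1', 'v4', 'v1', 'v4', 'v2', 'v7']
```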
B. Social Choice Modeling
- Voters and Alternatives: Let n denote the number of voters and m the number of alternatives.
- Informational Context: Each voter i holds a cardinal utility vector u_i (normalized to unit sum) and receives a public anchoring point together with an anchoring weight, producing an effective utility that blends u_i with the anchor.
- Coarse Elicitation and Intermediary Mechanism: Rather than eliciting the full cardinal utilities, only a coarse ordinal report (e.g., a Plurality vote) is submitted. The mechanism simulates the effect of anchoring via a menu transformation and applies the same positional aggregation rule to the transformed menu.
3. Mathematical Formalizations
Machine Learning: Serializing Video Content
S-Grid Mapping (sketched in code below):
- Frames: sample a fixed number of frames uniformly in time from the video
- Grid: tile the sampled frames into a single composite image grid
- S-Grid: the composite grid paired with an optional subtitle string extracted from the video
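A minimal sketch of how a video might be serialized into such a grid, assuming uniform temporal sampling and NumPy-based tiling; the 3x3 grid size and the (frames, subtitle) interface are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def build_s_grid(frames, subtitle=None, grid_side=3):
    """Tile uniformly sampled frames into a single grid image (an 'S-Grid').

    frames: array of shape (num_frames, H, W, 3).
    subtitle: optional subtitle string carried alongside the grid.
    grid_side: the grid is grid_side x grid_side; the concrete size used in
               the paper's ablations is not reproduced here.
    """
    num_cells = grid_side * grid_side
    # Uniform temporal sampling of exactly num_cells frames.
    idx = np.linspace(0, len(frames) - 1, num_cells).astype(int)
    sampled = frames[idx]
    h, w = sampled.shape[1], sampled.shape[2]
    grid = np.zeros((grid_side * h, grid_side * w, 3), dtype=sampled.dtype)
    for cell, frame in enumerate(sampled):
        r, c = divmod(cell, grid_side)
        grid[r * h:(r + 1) * h, c * w:(c + 1) * w] = frame
    return grid, subtitle

# Example: a dummy 40-frame clip of 64x64 RGB frames.
video = np.random.randint(0, 255, size=(40, 64, 64, 3), dtype=np.uint8)
s_grid, sub = build_s_grid(video, subtitle="…", grid_side=3)
print(s_grid.shape)  # (192, 192, 3)
```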
Prompt Structure (Pseudo):
```
Query: "Q(q)"
Candidates (interleaved order; duplicates allowed):
  1) [ImageGrid_1], Subtitle: "…"
  ...
  K) [ImageGrid_K], Subtitle: "…"
Instruction: "Rank these K candidates by relevance… Items appearing multiple
  times indicate multiple retriever votes; earlier positions indicate higher
  original rank…"
```
Social Choice: Context-Aware Voting
- Transformed Menu: the intermediary applies a menu transformation to the reported ballots, repositioning alternatives exactly as if each voter had anchored their utility at the public signal.
- Outcome Distribution: under Plurality, the probability of selecting the context-favored alternative is evaluated on the anchored profile and compared with the unanchored case.
- Expected Welfare: when voter utilities are drawn i.i.d., expected social welfare is expressed in terms of the mean utilities, the anchoring signal, and the anchoring weight.
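Since the explicit formulas are not preserved above, the LaTeX fragment below records one plausible instantiation of the anchoring model that is consistent with the surrounding description; the convex-combination form of the effective utility and the symbols alpha, z, and a* are assumptions, not reproduced from the source.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Assumed model (not the source's exact formulation): voter $i$ holds
% unit-sum utilities $u_i$ over the alternatives, $z$ is the public
% anchoring vector, and $\alpha \in [0,1]$ is the anchoring weight.
\[
  \tilde{u}_i = (1-\alpha)\,u_i + \alpha\,z,
  \qquad \sum_{a} u_i(a) = 1 .
\]
% The context-favored alternative is $a^{*} = \arg\max_a z(a)$; under
% Plurality over effective utilities, it wins when it collects the most
% top choices:
\[
  \Pr\bigl[a^{*}\ \text{wins}\bigr]
  = \Pr\Bigl[\,\#\{i : a^{*} = \arg\max_a \tilde{u}_i(a)\}
      > \#\{i : b = \arg\max_a \tilde{u}_i(a)\}\ \ \forall\, b \neq a^{*}\Bigr].
\]
\end{document}
```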
4. Empirical Results and Performance
Retrieval and Fusion
- Datasets: MSR-VTT, DiDeMo, ActivityNet, VATEX; tasks: text-to-video (t2v), video-to-text (v2t)
- Zero-Shot Protocol: No fine-tuning of the VLM or reranker/fuser
- First-Stage Retrievers: CLIP4Clip, VAST, GRAM, InternVideo2-6B (all frozen)
- Baselines: CombSUM/CombMNZ (score fusion), RRF (rank fusion)
- Performance:
- Single-list reranking on MSR-VTT: CLIP4Clip t2v recall@1 improves from 34.4 to 64.2 with ViC; up to +40 recall@1 over SOTA on some benchmarks
- Ensemble fusion on MSR-VTT: CombSUM t2v recall@1 of 84.4 increases to 87.1 with ViC
- VATEX v2t: 77.6 (baseline) to 99.6 (ViC)
- Ablations:
- Grid size: intermediate grid sizes are optimal; grids that are too small under-represent temporal structure, while grids that are too large over-compress individual frames
- Metadata ("No Duplicates"): removing the duplicate-candidate signal reduces recall@1, confirming the importance of consensus encoding
Social Choice Outcomes
- Probability and Welfare Guarantees: Under mild alignment between the anchoring signal and the voters' mean utilities, introducing context strictly improves the probability of selecting the context-favored alternative and, when appropriate, increases expected social welfare.
- Theoretical Bounds: The intermediary mechanism (via the menu transformation) exactly simulates context-anchored aggregation using only ordinal reports; the probability of a welfare loss is bounded by a Chernoff-type tail bound.
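A Monte Carlo sketch of the kind of check these guarantees invite, under the assumed convex-combination anchoring model from the LaTeX sketch above; the Dirichlet utility distribution, the anchoring weight, and the voter count are illustrative assumptions rather than the source's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def win_probability(n_voters=101, n_alts=4, alpha=0.3, trials=2000):
    """Estimate how often the context-favored alternative (index 0) wins
    Plurality, with and without anchoring, under an assumed i.i.d. model."""
    anchor = np.eye(n_alts)[0]             # anchor puts all mass on alternative 0
    wins_plain = wins_anchored = 0
    for _ in range(trials):
        u = rng.dirichlet(np.ones(n_alts), size=n_voters)   # unit-sum utilities
        u_eff = (1 - alpha) * u + alpha * anchor             # assumed anchoring model
        if np.bincount(u.argmax(axis=1), minlength=n_alts).argmax() == 0:
            wins_plain += 1
        if np.bincount(u_eff.argmax(axis=1), minlength=n_alts).argmax() == 0:
            wins_anchored += 1
    return wins_plain / trials, wins_anchored / trials

print(win_probability())  # the anchored win rate should dominate the plain one
```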
5. Limitations and Practical Constraints
- VLM Computational Cost: ViC substitutes arithmetic fusion with a full VLM forward pass over multimodal inputs; inference time grows linearly in the number of candidates K.
- Effective Attention Window: Performance degrades as the candidate list length K exceeds the VLM's effective context capacity.
- Recall Bound in Two-Stage Systems: If a relevant candidate is absent from first-stage retrieval, ViC cannot recover it.
- S-Grid Lossiness: Uniform temporal frame sampling may omit significant short events in videos.
- Social Choice Generality: The formal model presumes homogeneous anchoring and i.i.d. utilities; handling heterogeneous or adversarial information remains open.
6. Broader Implications and Future Directions
- Retrieval Systems: ViC demonstrates that prompt engineering and in-context serialization convert large, instruction-following VLMs into state-of-the-art, training-free rerankers and fusion modules for multimodal retrieval.
- Extensibility: Extensions to additional modalities (audio, tabular) through analogous serialization strategies are suggested as a next step.
- Voting Mechanisms: The intermediary mechanism approach allows ordinal voting systems to incorporate public information robustly, with formal guarantees, while maintaining practical feasibility (no need for cardinal elicitation).
- Research Challenges: Future work may involve targeted prompt or parameter-efficient tuning for smaller VLMs, query-aware keyframe selection for S-Grids, and extending social choice analysis to multiple or conflicting contexts.
7. Summary Table of Key Mechanism Components
| Dimension | Retrieval ViC (VLMs) | Social Choice ViC |
|---|---|---|
| Context source | Retriever list composition, metadata, candidate order | Public anchoring signal |
| Mechanism type | Prompt engineering; in-context model-based reasoning | Menu transformation, context-aware aggregation |
| Output | Reranked/fused candidate list | Selected alternative; welfare guarantee |
Vote-in-Context provides a systematic, evidence-driven method of adapting aggregation to external contextual information, yielding substantial empirical performance gains in retrieval while establishing formal improvements in selection probability and welfare in voting scenarios (Eltahir et al., 3 Nov 2025, Chen et al., 11 Apr 2024).