Vote-in-Context (ViC)

Updated 9 November 2025
  • Vote-in-Context (ViC) is a framework that integrates candidate content with external context to enable adaptive, zero-shot re-ranking and aggregation.
  • In retrieval systems, ViC serializes content and metadata for Vision-Language Models, achieving significant improvements in recall and fusion performance.
  • In social choice, ViC applies a menu transformation to incorporate public anchoring signals, leading to enhanced selection probability and welfare outcomes.

Vote-in-Context (ViC) refers to a class of mechanisms and algorithmic strategies in which candidate or alternative selection is influenced by external contextual information—whether by explicitly encoding content/meta-data into machine learning prompts or by formally modeling the effects of public signals on collective decisions within voting mechanisms. In modern retrieval and fusion systems, ViC leverages both candidate content and external rank signals to enable zero-shot, context-aware reranking; in the formal analysis of social choice, Vote-in-Context encapsulates schemes where voting rules are robustly adapted to incorporate information externalities without requiring fine-grained (cardinal) preference elicitation. Recent advances demonstrate the effectiveness of ViC designs for cross-modal retrieval tasks as well as their welfare-theoretic implications in expectation.

1. Fundamental Principles of Vote-in-Context

ViC unifies two conceptually related but technically distinct domains: multi-retriever fusion in machine learning and theoretical voting mechanisms responsive to informational context.

  • Fusion and Retrieval Perspective: Traditional fusion approaches (e.g., CombSUM, RRF) operate on candidate ranks or scores across multiple retrievers, neglecting the actual candidate content. Vote-in-Context reframes fusion as a zero-shot reasoning task for a frozen, off-the-shelf vision-language model (VLM). The paradigm serializes both content evidence (e.g., image grids, subtitles) and retriever metadata (list position, duplicate frequency) into the VLM's prompt, enabling the model to perform adaptive, in-context "voting" over candidate sets.
  • Voting and Social Choice Perspective: In social choice, ViC mechanisms formalize how external information—modeled as an "anchoring" point or public signal—shifts voter preferences, and devise aggregation procedures to simulate these shifts through menu transformations without direct access to cardinal utilities.

Both approaches maintain training-free operation and exploit the mechanism's ability to simulate the effect of context solely by processing appropriately serialized or transformed inputs.

2. Technical Methodologies

A. Machine Learning Implementation

  • Candidate List Construction: For a query $q$, the candidate sequence $C(q)$ is constructed by round-robin interleaving the top $k_{\max}$ items from each of $M$ retrievers, forming a list of length $K$. Each candidate's position in $C(q)$ provides an implicit rank signal; duplicate entries (occurring when a candidate appears in several retrievers' top-$k$) encode consensus (see the sketch after this list).
  • Prompt Engineering: The VLM prompt consists of a serialized query $Q(q)$ (text or "S-Grid") followed by the interleaved candidate representations $E(C_i)$. For videos, $E(C_i)$ is the S-Grid: a compact $s \times s$ grid of sampled frames with optional subtitles. No explicit numerical ranks are included; the model infers relative importance from order and repetition.
  • Inference: The VLM $g_\Theta$ is queried with $(Q(q), E(C_1), \ldots, E(C_K))$ and returns a permutation $\hat{\pi}$ over the candidates, which effects the reranking or fusion. The system is entirely training-free: no model weights are updated for the reranker/fuser.
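The round-robin interleaving described above can be sketched in a few lines of Python. The function name, the list-of-lists input format, and the use of candidate IDs are illustrative assumptions rather than the authors' reference implementation.

```python
# Sketch of ViC candidate-list construction (illustrative; names and the
# list-of-lists input format are assumptions, not the reference implementation).

def build_candidate_list(ranked_lists, k_max):
    """Round-robin interleave the top-k_max items from M retrievers.

    ranked_lists: M ranked lists of candidate IDs, best first.
    Duplicates are kept on purpose: a candidate surfacing in several
    retrievers' top-k acts as an implicit consensus ("vote") signal.
    """
    interleaved = []
    for rank in range(k_max):                  # rank 0 = best item per retriever
        for retriever_list in ranked_lists:
            if rank < len(retriever_list):
                interleaved.append(retriever_list[rank])
    return interleaved                          # length K <= M * k_max


# Example: two retrievers, top-3 each; "v2" appears twice (consensus).
C_q = build_candidate_list([["v1", "v2", "v3"], ["v2", "v4", "v5"]], k_max=3)
# C_q == ["v1", "v2", "v2", "v4", "v3", "v5"]
```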

B. Social Choice Modeling

  • Voters and Alternatives: Let $N = \{1, \ldots, n\}$ be the set of voters and $A = [m] = \{1, \ldots, m\}$ the set of alternatives.
  • Informational Context: Each voter $i$ with cardinal utility $u_i \in \Delta \subset \mathbb{R}^m_+$ (unit sum) receives an anchoring point $w \in \Delta$ and weight $\alpha \in [0,1]$, producing effective utility $v_i = (1-\alpha) u_i + \alpha w$.
  • Coarse Elicitation and Intermediary Mechanism: Rather than eliciting $u_i$, only a coarse report $r_i \in \mathcal{R}$ (e.g., Plurality) is submitted. The mechanism simulates the effect of anchoring via a menu transformation $\phi(r) = (r - \alpha w)/(1 - \alpha)$ and applies the same positional aggregation rule.
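A minimal Python sketch of the anchoring model and menu transformation, assuming a NumPy vector representation of utilities, reports, and the anchor (all names and example values are hypothetical):

```python
import numpy as np

def effective_utility(u_i, w, alpha):
    """Anchored utility v_i = (1 - alpha) * u_i + alpha * w."""
    return (1 - alpha) * np.asarray(u_i) + alpha * np.asarray(w)

def menu_transform(r, w, alpha):
    """Menu transformation phi(r) = (r - alpha * w) / (1 - alpha).

    By construction (1 - alpha) * phi(r) + alpha * w == r, which is how the
    intermediary mechanism simulates anchored aggregation from coarse reports
    without eliciting the cardinal u_i.
    """
    return (np.asarray(r) - alpha * np.asarray(w)) / (1 - alpha)

# Example: a Plurality-style report for alternative 0 (m = 3), anchored toward
# alternative 2 with weight alpha = 0.25.
r = np.array([1.0, 0.0, 0.0])   # coarse report
w = np.array([0.1, 0.2, 0.7])   # public anchoring signal
print(menu_transform(r, w, alpha=0.25))
```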

3. Mathematical Formalizations

Machine Learning: Serializing Video Content

S-Grid Mapping:

  • Frames: $t_i = \lfloor (i-1) \cdot F / (s^2 - 1) \rfloor,\ i = 1, \ldots, s^2$
  • Grid: $\mathrm{Grid}(v; s) = [\text{frame}_{t_1}, \ldots, \text{frame}_{t_{s^2}}]$
  • S-Grid: $\mathrm{S\text{-}Grid}(v) = (\mathrm{Grid}(v; s), a_v)$, where $a_v$ is an optional subtitle string
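The mapping above can be sketched as follows; the interpretation of $F$ as the last valid frame index and the helper names are assumptions for illustration.

```python
# Sketch of S-Grid construction. Treating F as the last valid frame index and
# the helper names below are assumptions for illustration.

def sgrid_frame_indices(F, s):
    """Uniform temporal sampling: t_i = floor((i - 1) * F / (s^2 - 1)), i = 1..s^2."""
    n = s * s
    return [(i - 1) * F // (n - 1) for i in range(1, n + 1)]

def s_grid(frames, s, subtitle=None):
    """Return (Grid(v; s), a_v): the s x s grid of sampled frames plus an optional subtitle."""
    F = len(frames) - 1                        # last frame index (assumption)
    grid = [frames[t] for t in sgrid_frame_indices(F, s)]
    return grid, subtitle

# Example: a 2x2 grid over a 100-frame video samples frames 0, 33, 66, 99.
print(sgrid_frame_indices(99, 2))              # -> [0, 33, 66, 99]
```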

Prompt Structure (Pseudo):

Query: "Q(q)"

Candidates (interleaved order; duplicates allowed):
  1) [ImageGrid_1], Subtitle: "…"
  ...
  K) [ImageGrid_K], Subtitle: "…"

Instruction:
  "Rank these K candidates by relevance… Items appearing multiple times indicate multiple retriever votes; earlier positions indicate higher original rank…"

Social Choice: Context-Aware Voting

  • Transformed Menu: $\mathcal{M} = \{\phi(r) : r \in \mathcal{R}\}$ with $\phi(r) = (r - \alpha w)/(1 - \alpha)$
  • Outcome Distribution: For Plurality, the probability of selecting the context-favored alternative $a^* = \arg\max_a w_a$ satisfies

$P(\mathrm{Outcome}(M) = a^*) \;\geq\; \Pr\big[\mathrm{Binom}(n, q_{e_{a^*}}) \geq n/2\big]$

  • Expected Welfare: With $Y = \mathrm{Outcome}(M)$ denoting the selected alternative, the expected social welfare is

$\mathbb{E}[W(M)] = n \,\langle v, \nu^{\alpha, w}(f) \rangle$

where $v = \mathbb{E}_\mu[u]$ and $\nu^{\alpha, w}(f)_a = P(\mathrm{Outcome}(M) = a)$.
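These quantities can be illustrated with a Monte Carlo sketch under strong simplifying assumptions: i.i.d. Dirichlet utilities, a single homogeneous anchor, and $q_{e_{a^*}}$ read as the per-voter probability of voting for $a^*$. This is an illustrative simulation, not the paper's derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_voters, w, alpha, n_trials=5_000):
    """Estimate P(Plurality outcome = a*) and expected social welfare when each
    voter votes for the argmax of the anchored utility v_i = (1-alpha)*u_i + alpha*w."""
    w = np.asarray(w)
    m = len(w)
    a_star = int(np.argmax(w))                     # context-favored alternative
    wins, welfare = 0, 0.0
    for _ in range(n_trials):
        u = rng.dirichlet(np.ones(m), size=n_voters)   # i.i.d. unit-sum utilities
        v = (1 - alpha) * u + alpha * w                # anchored utilities
        votes = np.bincount(np.argmax(v, axis=1), minlength=m)
        winner = int(np.argmax(votes))                 # Plurality outcome
        wins += (winner == a_star)
        welfare += u[:, winner].sum()                  # realized social welfare
    return wins / n_trials, welfare / n_trials

# Example: 51 voters, anchor favoring alternative 2, anchoring weight 0.3.
print(simulate(n_voters=51, w=[0.1, 0.2, 0.7], alpha=0.3))
```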

4. Empirical Results and Performance

Retrieval and Fusion

  • Datasets: MSR-VTT, DiDeMo, ActivityNet, VATEX; tasks: text-to-video (t2v), video-to-text (v2t)
  • Zero-Shot Protocol: No fine-tuning of the VLM or reranker/fuser
  • First-Stage Retrievers: CLIP4Clip, VAST, GRAM, InternVideo2-6B (all frozen)
  • Baselines: CombSUM/CombMNZ (score fusion), RRF (rank fusion); see the sketch after this list
  • Performance: On MSR-VTT,
    • Single-list reranking with ViC: CLIP4Clip t2v recall@1 improves from 34.4 to 64.2; up to +40 recall@1 over SOTA on some benchmarks
    • Ensemble fuser: CombSUM t2v recall@1 of 84.4 increases to 87.1 with ViC
    • VATEX v2t: 77.6 (baseline) to 99.6 (ViC)
  • Ablations:
    • Grid size: $2 \times 2$ or $3 \times 3$ is optimal; $1 \times 1$ under-represents temporality, while $4 \times 4$ is overly compressed
    • Metadata ("No Duplicates"): Removing duplicate candidate signal reduces recall@1, confirming importance of consensus encoding

Social Choice Outcomes

  • Probability and Welfare Guarantees: Under mild alignment between $w$ and the voters' mean utilities, introducing context strictly improves the probability of selecting the context-favored alternative and, under the same alignment conditions, increases expected social welfare.
  • Theoretical Bounds: The mechanism $M$ (via $\phi$) exactly simulates context-anchored aggregation using only ordinal reports; the probability of welfare loss is bounded by a Chernoff-type tail.

5. Limitations and Practical Constraints

  • VLM Computational Cost: ViC replaces arithmetic fusion with a full VLM forward pass over $K$ multimodal inputs; inference time grows linearly in $K$.
  • Effective Attention Window: Performance degrades as $K$ exceeds the VLM's context capacity.
  • Recall Bound in Two-Stage Systems: If a relevant candidate is absent from first-stage retrieval, ViC cannot recover it.
  • S-Grid Lossiness: Uniform temporal frame sampling may omit significant short events in videos.
  • Social Choice Generality: The formal model presumes homogeneous anchoring and i.i.d. utilities; handling heterogeneous or adversarial information remains open.

6. Broader Implications and Future Directions

  • Retrieval Systems: ViC demonstrates that prompt engineering and in-context serialization convert large, instruction-following VLMs into state-of-the-art, training-free rerankers and fusion modules for multimodal retrieval.
  • Extensibility: Extensions to additional modalities (audio, tabular) through analogous serialization strategies are suggested as a next step.
  • Voting Mechanisms: The intermediary mechanism approach allows ordinal voting systems to incorporate public information robustly, with formal guarantees, while maintaining practical feasibility (no need for cardinal elicitation).
  • Research Challenges: Future work may involve targeted prompt or parameter-efficient tuning for smaller VLMs, query-aware keyframe selection for S-Grids, and extending social choice analysis to multiple or conflicting contexts.

7. Summary Table of Key Mechanism Components

| Dimension | Retrieval ViC (VLMs) | Social Choice ViC |
| --- | --- | --- |
| Context source | Retriever list composition, metadata, candidate order | Public anchoring signal $(w, \alpha)$ |
| Mechanism type | Prompt engineering; in-context model-based reasoning | Menu transformation, context-aware aggregation |
| Output | Reranked/fused candidate list | Selected alternative; welfare guarantee |

Vote-in-Context provides a systematic, evidence-driven method of adapting aggregation to external contextual information, yielding substantial empirical performance gains in retrieval while establishing formal improvements in selection probability and welfare in voting scenarios (Eltahir et al., 3 Nov 2025, Chen et al., 11 Apr 2024).
