Iterative Information Collector (IIC)

Updated 8 February 2026
  • Iterative Information Collector (IIC) is a neural framework that recasts exemplar selection as a sequential decision process using reinforcement learning.
  • It employs a GRU-based state encoder atop a frozen dense retriever and uses stratified sampling to balance exploration and exploitation during retrieval.
  • Empirical results show IIC's robust transferability across diverse datasets and LLM families, significantly outperforming standard single-pass top-K retrieval methods.

The Iterative Information Collector (IIC), also referred to as the iterative retriever, is a neural framework that recasts in-context exemplar selection for LLMs as a sequential decision process driven by reinforcement learning. Unlike standard k-shot retrieval—where the top-K examples are selected in a single pass by similarity—IIC constructs the exemplar set iteratively, accounting for dependencies and interactions among exemplars. This approach is formalized as a Markov decision process that optimizes a retrieval policy for downstream LLM performance via log-probability feedback from the target model. IIC introduces a lightweight, trainable state encoder atop a frozen dense retriever and achieves superior retrieval for semantic parsing ICL tasks, with robust generalization across datasets and LLM families (2406.14739).

1. Formulation as a Combinatorial Optimization Problem

IIC addresses the challenge of selecting $K$ exemplars $(x_i, y_i)$ from a dataset $D$ to maximize the conditional likelihood $P_{LM}(y \mid x; (x',y')^{K})$, a combinatorial optimization problem that is NP-hard due to the exponential number of candidate sets:

$$R^{\star}(x) = \arg\max_{(x',y')^K \subset D} P_{LM}(y \mid x; (x',y')^{K}).$$

Traditional one-shot retrievers deploy a similarity heuristic:

$$R(x) = \operatorname{topK}_{(x',y') \in D} \, S(x, (x',y')).$$

IIC instead models retrieval as a Markov decision process (MDP), selecting exemplars sequentially. The state $s_t$ is a vector encoding the sequence of exemplars chosen so far, actions $a_t$ correspond to selecting a new exemplar $(x_i, y_i)$, and transitions are realized by a GRU-based state encoder. The retrieval policy $\pi_{\theta}(a_t \mid s_t)$ computes a query vector $q_t$, and candidate exemplars are scored by dot product with embeddings from a frozen text encoder $F_{enc}$. The MDP objective maximizes the expected cumulative reward $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{K} r(s_t, a_t)\right]$, where rewards are derived from LLM feedback (Section 2).
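The contrast between the one-shot top-K rule and the sequential MDP selection can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the `update_state` function and the use of the state itself as the query vector are hypothetical stand-ins for the GRU transition and the policy head $Q$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_candidates, K = 8, 50, 4
E = rng.normal(size=(n_candidates, d))   # frozen candidate embeddings F_enc(x', y')
x = rng.normal(size=d)                   # query embedding

# One-shot baseline: a single similarity pass picks the top-K candidates.
one_shot = np.argsort(E @ x)[::-1][:K]

def update_state(s, e):
    # Hypothetical placeholder for the GRU transition s_{t+1} = GRU(s_t, e).
    return 0.5 * s + 0.5 * e

# Sequential (MDP) selection: the query vector q_t is recomputed from the
# evolving state, so each pick can depend on what was chosen before.
s, chosen = x.copy(), []
for t in range(K):
    q = s                                 # stand-in for the policy head Q(s_t)
    scores = E @ q
    scores[chosen] = -np.inf              # forbid repeats
    a = int(np.argmax(scores))
    chosen.append(a)
    s = update_state(s, E[a])
```

The key structural difference is visible in the loop: the one-shot baseline scores every candidate against a fixed query, while the sequential policy rescoreses candidates against a state that absorbs each selection.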

2. Reinforcement Learning Framework and Reward Shaping

The IIC training procedure is grounded in reinforcement learning with LLM-driven reward shaping. For a given test query $x$ and gold output $y^{\star}$, the stepwise reward reflects the incremental gain in the LLM's log-probability of the correct output upon adding an exemplar:

$$r(s_t, a_t) \approx \log P_{LM}(y^{\star} \mid x; S_{t+1}) - \log P_{LM}(y^{\star} \mid x; S_t).$$

This decomposition provides a per-step, dense signal capturing the marginal utility of each chosen exemplar.
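Because the per-step rewards telescope, their sum equals the total log-probability improvement from the empty context to the full exemplar set. A minimal sketch, where `log_prob` is a toy stand-in for a forward pass of the frozen LLM (in IIC this would query the actual model):

```python
import math

def log_prob(y_star, x, exemplars):
    # Toy scorer so the example runs: more exemplars -> higher log-prob.
    # In IIC this would be log P_LM(y* | x; exemplars) from the LLM itself.
    return -1.0 / (1 + len(exemplars))

def stepwise_rewards(y_star, x, chosen):
    """Dense per-step rewards: marginal log-prob gain from each exemplar."""
    rewards, prefix = [], []
    prev = log_prob(y_star, x, prefix)
    for ex in chosen:
        prefix.append(ex)
        cur = log_prob(y_star, x, prefix)
        rewards.append(cur - prev)
        prev = cur
    return rewards

r = stepwise_rewards("gold", "query", ["e1", "e2", "e3"])
total = log_prob("gold", "query", ["e1", "e2", "e3"]) - log_prob("gold", "query", [])
assert math.isclose(sum(r), total)   # rewards telescope to the total gain
```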

Policy optimization is conducted via Proximal Policy Optimization (PPO) in an actor-critic setting:

  • The policy $\pi_\theta(a \mid s)$ (actor) determines the candidate probabilities.
  • The value function $V_\phi(s)$ (critic) estimates expected returns from a given state.
  • The advantage estimator $\hat{A}_t$ is computed via generalized advantage estimation (GAE):

$$\delta_t = r(s_t, a_t) + \gamma V_\phi(s_{t+1}) - V_\phi(s_t), \quad \hat{A}_t = \sum_{\ell=0}^{K-t} (\gamma \lambda)^{\ell} \delta_{t+\ell}.$$

Full optimization minimizes the PPO clipped surrogate loss, value head MSE loss, and includes an entropy bonus for exploration.
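The GAE recursion above fits in a few lines of Python; the `gamma` and `lam` defaults here are illustrative, not the paper's reported hyperparameters:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over a length-K episode.

    rewards: r(s_t, a_t) for t = 0..K-1.
    values:  V_phi(s_t) for t = 0..K (includes the terminal bootstrap value).
    """
    K = len(rewards)
    # TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(K)]
    # Backward pass accumulates A_t = delta_t + gamma * lam * A_{t+1}.
    advantages, acc = [0.0] * K, 0.0
    for t in reversed(range(K)):
        acc = deltas[t] + gamma * lam * acc
        advantages[t] = acc
    return advantages

adv = gae([1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.0])
```

With `gamma = lam = 1` and zero values, each advantage reduces to the undiscounted reward-to-go, which is a convenient sanity check on the recursion.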

3. State Encoding and Network Architecture

IIC's architecture augments a frozen dense retriever (Contriever; $d \approx 768$) with a lightweight, trainable state encoder:

  • The text encoder $F_{enc}(\cdot)$ is frozen and initialized from Contriever, mapping text to $\mathbb{R}^d$.
  • A GRU with hidden size $d$ encodes the history of chosen exemplars; each transition updates the state via $s_{t+1} = \text{GRU}(s_t, F_{enc}(x_t))$.
  • The policy head is a one-layer MLP $Q(s_t)$ ($\mathbb{R}^d \rightarrow \mathbb{R}^d$).
  • Value estimation uses a linear head: $V(s_t) = v^\top s_t + b$.

This design introduces approximately 4 million additional parameters atop Contriever's 110M.
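As a rough check on that parameter budget, a minimal numpy GRU cell (biases omitted, arbitrary initialization scale) mirrors the state update $s_{t+1} = \text{GRU}(s_t, F_{enc}(x_t))$; this is a sketch of the standard GRU equations, not the paper's exact module:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRUCell:
    """Minimal GRU cell (biases omitted) for the state update s' = GRU(s, e)."""
    def __init__(self, d, seed=0):
        rng = np.random.default_rng(seed)
        # Three gates (update z, reset r, candidate h), each with an input
        # matrix W and a recurrent matrix U of shape (d, d).
        self.Wz, self.Uz = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
        self.Wr, self.Ur = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
        self.Wh, self.Uh = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))

    def __call__(self, s, e):
        z = sigmoid(self.Wz @ e + self.Uz @ s)        # update gate
        r = sigmoid(self.Wr @ e + self.Ur @ s)        # reset gate
        h = np.tanh(self.Wh @ e + self.Uh @ (r * s))  # candidate state
        return (1 - z) * s + z * h

# Parameter budget at d = 768: six d-by-d GRU matrices plus the one-layer
# policy head (d -> d) give roughly 7 * 768^2 ~= 4.1M parameters, broadly
# consistent with the ~4M overhead quoted above.
d = 768
n_params = 6 * d * d + d * d
```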

4. Iterative Retrieval Procedure

At inference, IIC retrieves exemplars as follows (no gradients):

  1. Initialize the state $s_0$.
  2. For each $t$ in $1..K$:
    • Compute $q_t = Q(s_t)$.
    • Score all candidates via inner product: $q_t \cdot F_{enc}(c) / \beta$, where $\beta$ is a temperature hyperparameter.
    • Use "STRATIFIED_SAMPLE": select the top $K/N_s$ candidates, partition the remainder into $(N_s - 1)$ strata, draw equally from each, then renormalize and (optionally) sample from the resulting distribution.
    • Select action $a_t$ (greedily or by sampling), update the state $s_{t+1} = \text{GRU}(s_t, F_{enc}(a_t.x))$, and append $a_t$ to the retrieved set.

The stratified sampling mechanism balances exploitation of high-scoring exemplars and exploration of diverse candidates.
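One plausible reading of the stratified sampling step can be sketched as follows. The pool size, number of strata, and softmax renormalization here are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def stratified_sample(scores, n_strata=4, pool_size=20, rng=None):
    """Sketch of a stratified candidate draw: keep the top pool_size/n_strata
    candidates outright (exploitation), split the remainder by score rank into
    n_strata - 1 strata and draw equally from each (exploration), then sample
    one action from a softmax over the pooled candidates."""
    rng = rng if rng is not None else np.random.default_rng(0)
    order = np.argsort(scores)[::-1]          # candidate indices, best first
    per = pool_size // n_strata
    pool = list(order[:per])                  # top stratum kept deterministically
    for stratum in np.array_split(order[per:], n_strata - 1):
        k = min(per, len(stratum))
        pool.extend(rng.choice(stratum, size=k, replace=False))
    pool = np.array(pool)
    p = np.exp(scores[pool] - scores[pool].max())
    p /= p.sum()                              # renormalize over the pool only
    return int(rng.choice(pool, p=p))

action = stratified_sample(np.linspace(0.0, 1.0, 100))
```

Drawing from lower-scoring strata keeps diverse exemplars reachable even when a few candidates dominate the similarity scores, which is the exploration–exploitation balance described above.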

5. Generalization, Evaluation, and Empirical Results

IIC is trained with a smaller LLM (Llama-2-7b) as an environment simulator; at inference, the fixed retriever policy is deployed with different or larger LLMs (e.g., Llama-2-70b, CodeLlama-70b, Mistral-7b). Empirically, policies trained on one LLM achieve high transferability: in 75% of test-LM×dataset configurations, IIC surpasses strong baselines by at least 1 EM@1 point and remains competitive elsewhere.

Key evaluation settings include:

  • Datasets: SMCalFlow (dialogue-to-AMR), TreeDST (dialogue state tracking), MTOP-EN (multilingual parsing).
  • Baselines: BM25, Contriever, EPR (contrastive fine-tuning), CEIL (diversity via DPP).
  • Metrics: Exact Match @k (EM@1, EM@3), SMatch F1 (AMR-style partial match).

In all cases, IIC (or "ITERR") outperforms competing methods. For instance, on SMCalFlow with 10 exemplars, EM@1 rises from 44.0 (Contriever) to 54.1 (ITERR), and SMatch-F from 67.6 to 77.3. Similar gains are observed on TreeDST and MTOP.

Ablation studies reveal that EPR initialization, GRU-based state encoding, and stratified sampling are all critical: removing EPR initialization drops EM@1 by nearly 9 points, replacing the GRU with a Transformer decoder causes training instability, and omitting stratified sampling degrades retrieval quality.

6. Significance and Implications

IIC transforms k-shot in-context retrieval into a stateful, sequential decision paradigm that explicitly models exemplar interactions. Its reinforcement learning strategy, driven by incremental log-probability improvements in the LM, enables end-to-end retrieval policies that are robust to variations in both tasks and downstream LLMs. With minimal parameter overhead, it achieves substantial performance gains, establishing a new retrieval framework for downstream LM tasks where the choice and order of exemplars are inherently non-i.i.d. and interaction-dependent (2406.14739).

References

  1. arXiv:2406.14739
