Iterative Information Collector (IIC)
- Iterative Information Collector (IIC) is a neural framework that recasts exemplar selection as a sequential decision process using reinforcement learning.
- It employs a GRU-based state encoder atop a frozen dense retriever and uses stratified sampling to balance exploration and exploitation during retrieval.
- Empirical results show IIC’s robust transferability across diverse datasets and LLM families, significantly outperforming standard single-pass top-K retrieval methods.
The Iterative Information Collector (IIC), also referred to as the iterative retriever, is a neural framework that recasts in-context exemplar selection for LLMs as a sequential decision process driven by reinforcement learning. Unlike standard k-shot retrieval—where the top-K examples are selected in a single pass by similarity—IIC constructs the exemplar set iteratively, accounting for dependencies and interactions among exemplars. This approach is formalized as a Markov decision process that optimizes a retrieval policy for downstream LLM performance via log-probability feedback from the target model. IIC introduces a lightweight, trainable state encoder atop a frozen dense retriever, and achieves superior retrieval for semantic parsing ICL tasks, with robust generalization across datasets and LLM families (2406.14739).
1. Formulation as a Combinatorial Optimization Problem
IIC addresses the challenge of selecting $K$ exemplars from a dataset $\mathcal{D}$ to maximize the conditional likelihood $p_{\mathrm{LM}}(y \mid e_1, \dots, e_K, x)$ of the gold output, a combinatorial optimization problem that is NP-hard due to the exponential number of candidate sets:

$$E^* = \arg\max_{E \subseteq \mathcal{D},\, |E| = K} \; p_{\mathrm{LM}}(y \mid E, x)$$
Traditional one-shot retrievers deploy a similarity heuristic, scoring each candidate independently of the others:

$$E = \underset{e \in \mathcal{D}}{\operatorname{top-}K} \; \mathrm{sim}(x, e)$$
IIC instead models retrieval as a Markov decision process (MDP), selecting exemplars sequentially. The state $s_t$ is a vector encoding the sequence of exemplars chosen so far, actions correspond to selecting a new exemplar $e_{t+1} \in \mathcal{D}$, and transitions are realized by a GRU-based state encoder. The retrieval policy computes a query vector $q_t$ from $s_t$, and candidate exemplars are scored by dot product with embeddings from a frozen text encoder $f(\cdot)$. The MDP objective maximizes the expected cumulative reward $\mathbb{E}\left[\sum_{t=1}^{K} r_t\right]$, where rewards are derived from LLM feedback (Section 2).
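A toy numpy sketch of one such MDP step, with illustrative shapes only; `W_policy` and `gru_step` are hypothetical stand-ins for the trained policy head and GRU state encoder, not the paper's exact parameterization:

```python
import numpy as np

def select_step(state, W_policy, cand_emb, gru_step):
    """One MDP step: query from state, dot-product scoring, state update.

    state:    current state vector s_{t-1}
    W_policy: stand-in weight matrix for the policy head
    cand_emb: frozen embeddings f(e) of all candidate exemplars, one per row
    gru_step: stand-in for the recurrent transition s_t = GRU(s_{t-1}, f(e_t))
    """
    q = np.tanh(W_policy @ state)             # query vector from the policy
    scores = cand_emb @ q                     # dot-product scores over the pool
    a = int(np.argmax(scores))                # greedy action: best-scoring exemplar
    new_state = gru_step(state, cand_emb[a])  # recurrent state transition
    return a, new_state
```

The key contrast with one-shot retrieval is that `state` (and hence the query) changes after every selection, so later choices depend on earlier ones.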
2. Reinforcement Learning Framework and Reward Shaping
The IIC training procedure is grounded in reinforcement learning with LLM-driven reward shaping. For a given test query $x$ and gold output $y$, the stepwise reward $r_t$ reflects the incremental gain in the LLM's log-probability of the correct output upon adding an exemplar:

$$r_t = \log p_{\mathrm{LM}}(y \mid e_1, \dots, e_t, x) - \log p_{\mathrm{LM}}(y \mid e_1, \dots, e_{t-1}, x)$$
This decomposition provides a per-step, dense signal capturing the marginal utility of each chosen exemplar.
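As a minimal sketch, this decomposition can be computed from a log-probability oracle; `llm_logprob` is a hypothetical stand-in for querying the target LLM with a prompt built from the exemplars:

```python
def stepwise_rewards(exemplars, query, gold, llm_logprob):
    """Reward at step t = gain in log p(gold) from adding exemplar e_t.

    `llm_logprob(exemplars, query, gold)` is assumed to return the target
    LLM's log-probability of the gold output given the prompt.
    """
    rewards = []
    prev = llm_logprob([], query, gold)  # log-prob with an exemplar-free prompt
    for t in range(1, len(exemplars) + 1):
        cur = llm_logprob(exemplars[:t], query, gold)
        rewards.append(cur - prev)       # marginal utility of exemplar e_t
        prev = cur
    return rewards
```

Note that the rewards telescope: their sum equals the total log-probability improvement over the exemplar-free prompt, so the dense per-step signal stays consistent with the episode-level objective.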
Policy optimization is conducted via Proximal Policy Optimization (PPO) in an actor-critic setting:
- The policy $\pi_\theta$ (actor) determines the candidate selection probabilities.
- The value function $V(s_t)$ (critic) estimates expected returns from a given state.
- The advantage estimator $\hat{A}_t$ is computed via generalized advantage estimation (GAE):

$$\hat{A}_t = \sum_{l \ge 0} (\gamma \lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
Full optimization minimizes the PPO clipped surrogate loss, value head MSE loss, and includes an entropy bonus for exploration.
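A generic numpy sketch of two of these components, GAE and the PPO clipped surrogate; this illustrates the standard formulas, not the authors' exact implementation or hyperparameters:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one K-step episode.

    `values` has length K+1 (a bootstrap value is appended at the end).
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                  # discounted sum
        advantages[t] = running
    return advantages

def ppo_loss(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective (returned as a loss, to be minimized)."""
    ratio = np.exp(logp_new - logp_old)
    return -np.mean(np.minimum(ratio * adv,
                               np.clip(ratio, 1 - eps, 1 + eps) * adv))
```

In the full objective these are combined with the value head's MSE loss and an entropy bonus, as described above.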
3. State Encoding and Network Architecture
IIC's architecture augments a frozen dense retriever (Contriever) with a lightweight, trainable state encoder:
- The text encoder $f(\cdot)$ is frozen and initialized from Contriever, mapping text to dense embedding vectors.
- A trainable GRU encodes the history of chosen exemplars; each transition updates the state via $s_t = \mathrm{GRU}(s_{t-1}, f(e_t))$.
- The policy head is a one-layer MLP mapping the state $s_t$ to the query vector $q_t$.
- Value estimation uses a linear head: $V(s_t) = w^\top s_t + b$.
This design introduces approximately 4 million additional parameters atop the 110M of Contriever.
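As a rough consistency check, the ~4M figure is in line with sizing the trainable head to match Contriever's 768-dimensional embeddings; the hidden size used below is an assumption for illustration, not a value stated above:

```python
def gru_params(d_in, h):
    # PyTorch-style GRU parameter count: 3 gates, each with an input weight
    # matrix, a recurrent weight matrix, and two bias vectors.
    return 3 * (d_in * h + h * h + 2 * h)

def head_params(d=768, h=768):
    # Assumed sizes: hidden size h = embedding dim d = 768 (hypothetical).
    gru = gru_params(d, h)
    policy_mlp = h * d + d   # one-layer MLP mapping state -> query vector
    value_head = h + 1       # linear head producing a scalar value
    return gru + policy_mlp + value_head
```

Under these assumptions the trainable head comes to roughly 4.1 million parameters, consistent with the "approximately 4 million" reported above.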
4. Iterative Retrieval Procedure
At inference, IIC retrieves exemplars as follows (no gradients):
- Initialize the state $s_0$.
- For each $t$ in $1..K$:
  - Compute the query vector $q_t$ from the current state.
  - Score all candidates via inner product: $P(e) \propto \exp\!\left(q_t^\top f(e) / \tau\right)$, where $\tau$ is a temperature hyperparameter.
  - Use "STRATIFIED_SAMPLE": select the top-scoring candidates, partition the remainder into strata, draw equally from each stratum, then renormalize and (optionally) sample from the resulting distribution.
  - Select the exemplar $e_t$ (greedily or by sampling), update the state $s_t = \mathrm{GRU}(s_{t-1}, f(e_t))$, and append $e_t$ to the retrieved set.
The stratified sampling mechanism balances exploitation of high-scoring exemplars and exploration of diverse candidates.
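One possible reading of the STRATIFIED_SAMPLE step in numpy; the function name and split sizes (`n_top`, `n_strata`, `per_stratum`) are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def stratified_sample(scores, n_top=4, n_strata=4, per_stratum=1, rng=None):
    """Keep the top-scoring candidates, then draw extras from score strata.

    Returns the retained candidate indices and a renormalized softmax
    distribution over them, from which an action can be sampled.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    order = np.argsort(scores)[::-1]         # candidates, best first
    keep = list(order[:n_top])               # exploit: highest-scoring candidates
    rest = order[n_top:]
    strata = np.array_split(rest, n_strata)  # explore: score-ordered strata
    for stratum in strata:
        if len(stratum) > 0:
            keep.extend(rng.choice(stratum,
                                   size=min(per_stratum, len(stratum)),
                                   replace=False))
    keep = np.array(keep)
    probs = np.exp(scores[keep] - scores[keep].max())  # stable softmax
    return keep, probs / probs.sum()         # renormalized distribution
```

Drawing from every stratum guarantees that some probability mass always reaches lower-scoring, potentially diverse candidates, which is the exploration half of the trade-off described above.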
5. Generalization, Evaluation, and Empirical Results
IIC is trained with a smaller LLM (Llama-2-7b) as an environment simulator; at inference, the fixed retriever policy is deployed with different or larger LLMs (e.g., Llama-2-70b, CodeLlama-70b, Mistral-7b). Empirically, policies trained on one LLM achieve high transferability: in 75% of test-LM×dataset configurations, IIC surpasses strong baselines by at least 1 EM@1 point and remains competitive elsewhere.
Key evaluation settings include:
- Datasets: SMCalFlow (dialogue-to-AMR), TreeDST (dialogue state tracking), MTOP-EN (multilingual parsing).
- Baselines: BM25, Contriever, EPR (contrastive fine-tuning), CEIL (diversity via DPP).
- Metrics: Exact Match @k (EM@1, EM@3), SMatch F1 (AMR-style partial match).
In all cases, IIC (or "ITERR") outperforms competing methods. For instance, on SMCalFlow with 10 exemplars, EM@1 rises from 44.0 (Contriever) to 54.1 (ITERR), and SMatch F1 from 67.6 to 77.3. Similar gains are observed on TreeDST and MTOP.
Ablation studies reveal that EPR initialization, GRU-based state encoding, and stratified sampling are all critical: removing EPR initialization drops EM@1 by nearly 9 points, replacing the GRU with a Transformer decoder causes training instability, and omitting stratified sampling degrades retrieval quality.
6. Significance and Implications
IIC transforms k-shot in-context retrieval into a stateful, sequential decision paradigm that explicitly models exemplar interactions. Its reinforcement learning strategy, driven by incremental log-probability improvements in the LM, enables end-to-end retrieval policies that are robust to variations in both tasks and downstream LLMs. With minimal parameter overhead, it achieves substantial performance gains, establishing a new retrieval framework for downstream LM tasks where the choice and order of exemplars are inherently non-i.i.d. and interaction-dependent (2406.14739).