In-Context Generation Techniques

Updated 4 July 2026

In-context generation is a method where models leverage provided examples or retrieved evidence to dynamically generate outputs without updating parameters.
The approach distinguishes between skill recognition and skill learning by implicitly modeling data-generation functions through in-context examples.
It has practical applications in machine translation, data synthesis, image editing, and video imitation, while addressing challenges like semantic drift and bias.

to=arxiv_search.search 大发时时彩怎么无码不卡高清免费 code {"^{^{^{^{^{^{^{^3query3}}}}}}} generation\"^{^{^{^{^{^{^³}}}}}} OR title:\3^{^{^{^{^{^{^{^"In-Context}}}}}}} Generation\"","max_results":^{^{^{^{3all:\3query3^{^{^{^{,"sort_by":"submittedDate","sort_order":"descending"}}}}}}}}} საქმე to=arxiv_search.search 北京赛车前json code {"^{^{^{^{^{^{^{^3query3}}}}}}} AND (abs:generation OR ti:generation)","max_results":^{^{^{^{3all:\3query3^{^{^{^{,"sort_by":"submittedDate","sort_order":"descending"}}}}}}}}} to=arxiv_search.search 东臣ീയ code {"^{^{^{^{^{^{^{^3query3}}}}}}} OR title:\3^{^{^{^{&&&)\",\"(Sun et al., 2024)\",\"(Zhang et al., 2024)\",\"(Recasens et al., 11 Jun 2025)\",\"(Fang et al., 23 Feb 2025)\",\"(Lee et al., 31 May 2025)\",\"(Yang et al., 2023)\",\"(Kim et al., 2022)\",\"(&&&^{^{^{^{3all:\3query3^{^{^{^{&&&)\",\"(&&&}}}}}}}}}}}} OR title:\3^{^{^{^{&&&)\",\"(&&&^{^{^{^{3all:\33^{^{^{^{&&&)\"]","max_results":}}}}}}}}}}}} In-context generation denotes a class of generation procedures in which a model is conditioned on examples, retrieved evidence, or reference signals supplied in its context window and then produces a new output without parameter updates. In the data-generation perspective, a LLM can be viewed as implicitly modeling a family of data-generation functions PRESERVED_PLACEHOLDER_^{^{^{^{3query3^{^{^{^;}}}}}}} at inference, it may either select a previously learned function that explains the demonstrations (“skill recognition”) or adapt to the demonstrations to induce a new function (“skill learning”) (&&&^{^{^{^{3all:\3^{^{^{^&&&).}}}}}}} Recent work uses this pattern across black-box LLM prompting, synthetic tabular data generation, machine translation, question generation, image editing, image synthesis, and video imitation (&&&^{^{^{^{3all:\3^{^{^{^5&&&,}}}}}}} Fang et al., 23 Feb 2025, Zhang et al., 2024).

^{^{^{^{3all:\3^{^{^{^.}}}}}}} Conceptual and theoretical foundations

A systematic formulation treats pre-training as learning a mixture over latent data-generation functions,

PRESERVED_PLACEHOLDER_^{^{^{^{3all:\3^{^{^{^}}}}}}}

where each PRESERVED_PLACEHOLDER_^{^{^³}} OR title:\3^{^{^{^}}} defines a conditional distribution or, in classification language, a function $f_\theta$ from input to label (&&&^{^{^{^{3all:\3^{^{^{^&&&).}}}}}}} Under skill recognition, the model performs implicit Bayesian inference over $\theta$ using the demonstrations $D=\{(x_i,y_i)\}_{i=1}^n$ and concentrates on a latent concept $\theta^*$ that best explains them. Under skill learning, the model behaves as an on-the-fly learner, implicitly selecting

$f^*=\arg\min_{f\in\mathcal F}\sum_{i=1}^{n}\ell(f(x_i),y_i),$

and then applying $f^*(x_{\mathrm{test}})$ to the ^{^{^{^{^{^{^{^{3query3^{^{^{^{^{^{^{^}}}}}}}}}}}}}}} (&&&^{^{^{^{3all:\3^{^{^{^&&&).}}}}}}}

This perspective has been sharpened by analyses of linear self-attention. In retrieval-augmented generation, one recent formulation introduces a unified linear predictor $y=W_1x_1+W_2x_2$ , where PRESERVED_PLACEHOLDER_^{^{^{^{3all:\3query3^{^{^{^}}}}}}} is a ^{^{^{^{^{^{^{^3query3}}}}}}} feature and PRESERVED_PLACEHOLDER_^{^{^{^{3all:\3all:\3^{^{^{^}}}}}}} is a retrieval-derived feature, and shows that one linear self-attention layer can implement one gradient-descent step on the corresponding linearized RAG objective (&&&^{^{^{^{3all:\33^{^{^{^&&&).}}}}}}} The result is exact in the constructed linear regime: one forward update changes only the PRESERVED_PLACEHOLDER_^{^{^{^3all:\3}}} OR title:\3^{^{^{^}}} slot of the ^{^{^{^{^{^{^{^{3query3^{^{^{^{^{^{^{^}}}}}}}}}}}}}}} token by PRESERVED_PLACEHOLDER_^{^{^{^{3all:\33^{^{^{^}}}}}}} (&&&^{^{^{^{3all:\33^{^{^{^&&&).}}}}}}} The same work also shows the boundary of the analogy: the correspondence remains stable under controlled linear extensions, but becomes feature-distribution dependent under nonlinear architectures, especially under skewed or heavy-tailed feature distributions (&&&^{^{^{^{3all:\33^{^{^{^&&&).}}}}}}}

Taken together, these formulations treat in-context generation not as a special-purpose prompt trick but as forward-pass adaptation. This suggests that the central technical question is how context should be constructed, compressed, regularized, or audited so that the induced adaptation is useful rather than misleading.

^{^{^³}} OR title:\3^{^{^{^.}}} Context construction in LLMs

A major line of work replaces manually curated demonstrations with generated or optimized context. Self-Generated In-Context Learning (SG-ICL) uses the same autoregressive model as a demonstration generator: given a test instance and a candidate label, the model samples demonstrations with temperature PRESERVED_PLACEHOLDER_^{^{^{^{3all:\3max_results32^{^{^{^}}}}}}} uses PRESERVED_PLACEHOLDER_^{^{^{^{3all:\3^{^{^⁵}}}}}} generated pairs, and then predicts with an inference template over the generated pool (Kim et al., 2022). On SST-^{^{^³}} OR title:\3^{^{^{^,}}} SST-5, RTE, and CB, SG-ICL consistently improves over zero-shot learning, has markedly lower variance than randomly selected gold demonstrations, and is “generally worth approximately ^{^{^{^{3query3^{^{^{^.6}}}}}}} gold training samples”; equivalently, 8 self-generated demonstrations match roughly 5 gold demonstrations (Kim et al., 2022).

Auto-ICL generalizes this idea by having the model autonomously generate either demonstrations, instructions, or both, in a first stage and answer with them in a second stage (Yang et al., 2023). In generating mode, the reported average accuracy is 68.^{^{^{^{3all:\3^{^{^{^,}}}}}}} compared with 63.9 for Zero-Shot-CoT and 38.3 for Zero-Shot; in retrieving mode, Auto-ICL reaches 75.7, exceeding Few-Shot, APE, Instruction Induction, and Auto-CoT on the reported averages (Yang et al., 2023). The same study reports that instruction-only context is strongest in retrieving mode, whereas demonstration+instruction is best in generating mode (Yang et al., 2023).

ProGen adds feedback from a task-specific model. It iteratively grows a synthetic dataset, scores examples by a robust influence function using Reverse Cross-Entropy on a synthetic validation set, and feeds the top-PRESERVED_PLACEHOLDER_^{^{^{^{3all:\3^{^{^⁶}}}}}} “helpful” examples back as in-context demonstrations for the next generation round (&&&^{^{^³}} OR title:\3^{^{^{^8&&&).}}} On five text classification datasets, ProGen improves average zero-shot accuracy from 8^{^{^³}} OR title:\3^{^{^{^.94}}} to 86.5^{^{^{^{3all:\3^{^{^{^}}}}}}} for DistilBERT and from 75.56 to 8^{^{^{^{3query3^{^{^{^.99}}}}}}} for an LSTM, and matches or exceeds ZeroGen with only ^{^{^{^{3all:\3^{^{^{^%}}}}}}} of its synthetic dataset size (&&&^{^{^³}} OR title:\3^{^{^{^8&&&).}}}

A more explicit optimization of context appears in Li et al.’s two-stage framework for black-box LLMs. The method leaves the original prompt PRESERVED_PLACEHOLDER_^{^{^{^{3all:\3^{^{^⁷}}}}}} intact, learns a policy PRESERVED_PLACEHOLDER_^{^{^{^{3all:\3^{^{^⁸}}}}}} that generates a semantically aligned derived prompt PRESERVED_PLACEHOLDER_^{^{^{^{3all:\3query32^{^{^{^}}}}}}} queries a fixed response model PRESERVED_PLACEHOLDER_^{^{^³}} OR title:\3query3^{^{^{^}}} on PRESERVED_PLACEHOLDER_^{^{^³}} OR title:\3all:\3^{^{^{^,}}} and then wraps the PRESERVED_PLACEHOLDER_^{^{^³}} OR title:\3 OR title:\3^{^{^{^}}} pair into a single-shot in-context demonstration for the original prompt (&&&^{^{^{^{3all:\3^{^{^{^5&&&).}}}}}}} Training maximizes

PRESERVED_PLACEHOLDER_^{^{^³}} OR title:\33^{^{^{^}}}

implemented with a ReMax-style policy gradient while keeping the response model immutable (&&&^{^{^{^{3all:\3^{^{^{^5&&&).}}}}}}} The inference template explicitly asks the model to “emulate” the way the derived-prompt response answers its question while replying to the original prompt, thereby anchoring the final answer in PRESERVED_PLACEHOLDER_^{^{^³}} OR title:\3^{^{^⁴}} rather than replacing it (&&&^{^{^{^{3all:\3^{^{^{^5&&&).}}}}}}} On Vicuna Eval with GPT-4, “OURS vs. Original Prompt” wins 9^{^{^{^{3query3^{^{^{^{.^{^{^{^3query3}}}}}}}}}}} of the time and “OURS vs. BPO” 88.8%; on Self-Instruct Eval with GPT-4, the corresponding win rates are 76.^{^{^³}} OR title:\3^{^{^{^%}}} and 7^{^{^{^{3all:\3^{^{^{^.4%;}}}}}}} on GPT-3.5, the method maintains 7^{^{^{^{3query3^{^{^{^%+}}}}}}} win rate over the original prompt and 65%+ over BPO across benchmarks (&&&^{^{^{^{3all:\3^{^{^{^5&&&).}}}}}}}

3. Structured-data and task-specific generation

In structured-data generation, in-context generation is used both as a substitute for fine-tuning and as a target for prompt optimization. TabGen-ICL formulates tabular synthesis with a fixed LLM and an iterative residual-aware selector: at iteration PRESERVED_PLACEHOLDER_^{^{^³}} OR title:\3sort_by32^{^{^{^}}} it chooses a subset

PRESERVED_PLACEHOLDER_^{^{^³}} OR title:\3^{^{^⁶}}

where PRESERVED_PLACEHOLDER_^{^{^³}} OR title:\3^{^{^⁷}} is the set of generated rows so far and PRESERVED_PLACEHOLDER_^{^{^³}} OR title:\3^{^{^⁸}} alternates between Jensen–Shannon divergence and Kolmogorov–Smirnov distance (Fang et al., 23 Feb 2025). The selected examples are JSON-serialized into the prompt, and the loop progressively narrows the gap between generated and real distributions. Across five real-world tables, TabGen-ICL reduces the error rate by 3.5%–4^{^{^³}} OR title:\3^{^{^{^{.^{^{^³}}}}}} OR title:\3^{^{^{^%}}} on fidelity metrics relative to random selection (Fang et al., 23 Feb 2025).

The same dependence on context creates a fairness risk. In LLM-based tabular generation, few-shot prompts consist of PRESERVED_PLACEHOLDER_^{^{^³}} OR title:\3^{^{^⁹}} demonstrations $f_\theta$ ^{^{^{^{3query3^{^{^{^,}}}}}}} and a bias parameter

$f_\theta$ ^{^{^{^{3all:\3^{^{^{^}}}}}}}

controls the label imbalance for a protected subgroup in those demonstrations (Recasens et al., 11 Jun 2025). The reported empirical result is that even mild in-context bias leads to global statistical distortion: as $f_\theta$ ^{^{^³}} OR title:\3^{^{^{^}}} increases, the generated $f_\theta$ 3 tracks it nearly linearly, the effect strengthens with larger context size, and all tested models leak in-context imbalances (Recasens et al., 11 Jun 2025). In the adversarial setting, an attacker controlling a fraction $f_\theta$ 4 of the in-context records can drive downstream fairness violations; at $f_\theta$ 5, a Random Forest trained on the synthetic data exhibits $f_\theta$ 6, while utility degrades only modestly and fidelity remains high, with TVC $f_\theta$ 7 and JSD $f_\theta$ 8 in the reported example (Recasens et al., 11 Jun 2025).

For machine translation in low-resource settings, Demonstration Augmentation for Translation (DAT) generates a candidate pool $f_\theta$ 9 of source-side examples, filters it with an MMR objective balancing relevance and diversity, generates target sides zero-shot, and then translates the ^{^{^{^{^{^{^{^{3query3^{^{^{^{^{^{^{^}}}}}}}}}}}}}}} with the resulting few-shot prompt (Lee et al., 31 May 2025). In the reported setup, $\theta$ ^{^{^{^{3query3^{^{^{^,}}}}}}} $\theta$ ^{^{^{^{3all:\3^{^{^{^,}}}}}}} and $\theta$ ^{^{^³}} OR title:\3^{^{^{^}}} (Lee et al., 31 May 2025). On English to Nepali, Khmer, Pashto, Zulu, and Swahili with Llama-3.^{^{^{^{3all:\3^{^{^{^{-7^{^{^{^3query3}}}}}}}}}}} DAT outperforms zero-shot in 4 out of 5 languages and avoids the severe backfire observed for fixed human pairs, including a ^{^{^³}} OR title:\3all:\3^{^{^{^.6}}} COMET-point drop on Khmer for the few-shot fixed-pair baseline relative to zero-shot (Lee et al., 31 May 2025).

Task-specific generation also includes educational and commonsense settings. In automatic question generation from educational passages, GPT-4 with ICL and a Hybrid model combining ICL and retrieval both outperform baseline models; among automated metrics, ICL( $\theta$ 3) obtains ROUGE-L 55.95, METEOR 34.6^{^{^³}} OR title:\3^{^{^{^,}}} ChRF 6^{^{^{^{3query3^{^{^{^.48,}}}}}}} and BERTScore 75.9^{^{^³}} OR title:\3^{^{^{^,}}} while the Hybrid model is best on all reported human measures except Answerability (&&&4^{^{^³}} OR title:\3^{^{^{^&&&).}}} In commonsense generation, a two-step diversification wrapper first produces default outputs, then asks the model to generate sentences different from its previous outputs when diversity is low; on CommonGen with GPT3.5-turbo, this raises the harmonic mean of diversity and BERTScore from 39.6 for default ICL to 7^{^{^³}} OR title:\3^{^{^{^.7}}} for the proposed ICD selection, while substantially lowering self-BLEU-4 from 7^{^{^³}} OR title:\3^{^{^{^.4}}} to ^{^{^³}} OR title:\3all:\3^{^{^{^{.^{^{^{^{3query3^{^{^{^}}}}}}}}}}} (Zhang et al., 2024).

4. Visual and multimodal in-context generation

In image generation, one influential design principle is to separate contextual appearance from structural control. Context Diffusion augments a latent diffusion UNet with a visual-context encoder, a frozen CLIP text encoder, a ^{^{^{^{^{^{^{^3query3}}}}}}} encoder derived from ControlNet, and modified cross-attention that attends jointly to text embeddings $\theta$ 4 and visual-context embeddings $\theta$ 5 (&&&^{^{^³}} OR title:\3^{^{^{^&&&).}}} Prompt dropout replaces the text prompt with the empty string with probability $\theta$ 6, forcing the model to rely on visual context (&&&^{^{^³}} OR title:\3^{^{^{^&&&).}}} The reported user study shows especially large gains when only context is present: in-domain, the method wins 8^{^{^{^{3query3^{^{^{^{.^{^{^³}}}}}}}}}} OR title:\3^{^{^{^%}}} versus 4.5% for Prompt Diffusion under context-only conditioning; out-of-domain, the corresponding figures are 63.7% versus ^{^{^³}} OR title:\3 OR title:\3^{^{^{^.8%}}} (&&&^{^{^³}} OR title:\3^{^{^{^&&&).}}}

X-Prompt extends in-context generation to a purely auto-regressive vision-LLM by compressing each in-context example into a small set of learnable “X-Prompt” tokens through cross-attention (Sun et al., 2024). If $\theta$ 7 is an example sequence and $\theta$ 8 are the learnable compression tokens, the model forces information flow through $\theta$ 9 and blocks direct attention from raw example tokens to target tokens (Sun et al., 2024). This makes the total context length

$D=\{(x_i,y_i)\}_{i=1}^n$ ^{^{^{^{3query3^{^{^{^}}}}}}}

with a practical window of up to 5,^{^{^{^3all:\3}}} OR title:\3query3^{^{^{^}}} tokens (Sun et al., 2024). In zero-shot unseen tasks, the reported gains are large: low-light enhancement PSNR improves from 9.^{^{^{^{3all:\3^{^{^⁴}}}}}} to ^{^{^{^{3all:\3^{^{^{^{7.^{^{^{^{3query3query3}}}}}}}}}}}} derain PSNR from 7.9^{^{^³}} OR title:\3^{^{^{^}}} to ^{^{^{^{3all:\3^{^{^{^{8.^{^{^{^{3all:\3query3}}}}}}}}}}}} object addition $D=\{(x_i,y_i)\}_{i=1}^n$ ^{^{^{^{3all:\3^{^{^{^}}}}}}} from $D=\{(x_i,y_i)\}_{i=1}^n$ ^{^{^³}} OR title:\3^{^{^{^}}} to ^{^{^{^{3query3^{^{^{^{.^{^{^{^{3query3query3}}}}}}}}}}}} OR title:\3^{^{^{^,}}} and unseen depth-color palette RMSE from ^{^{^{^{3query3^{^{^{^.745}}}}}}} to ^{^{^{^{3query3^{^{^{^{.39^{^{^{^{3query3^{^{^{^}}}}}}}}}}}}}}} (Sun et al., 2024).

Video In-context Learning applies the same principle to tokenized video. A decoder-only LLaMA-style Transformer is trained self-supervised on ^{^{^{^{3all:\3^{^{^{^6-frame}}}}}}} clips tokenized by a pretrained VQ-GAN into 4,^{^{^{^{3query3^{^{^⁹⁶}}}}}} tokens plus bos/eos, with no explicit demo/^{^{^{^{^{^{^{^{3query3^{^{^{^{^{^{^{^}}}}}}}}}}}}}}} structure during training (Zhang et al., 2024). At inference, demonstration clips and ^{^{^{^{^{^{^{^{3query3^{^{^{^{^{^{^{^}}}}}}}}}}}}}}} frames are concatenated into one causal prefix, and the model autoregressively samples future frames (Zhang et al., 2024). With the ^{^{^{^{3all:\3^{^{^{^{.^{^{^{^3all:\3}}}}}}}}}}} model, in-class demonstrations raise probing accuracy from ^{^{^³}} OR title:\3^{^{^{^9.6%}}} to 36.7% and V-Acc by ^{^{^{^{3all:\3^{^{^{^.8}}}}}}} points, while PSNR and FID improve with model scale (Zhang et al., 2024). The paper characterizes the resulting behavior as zero-shot imitation from demonstration videos (Zhang et al., 2024).

Latent-space flow and diffusion transformers now provide a unified setting for in-context image generation and editing. FLUX.^{^{^{^{3all:\3^{^{^{^}}}}}}} Kontext uses simple sequence concatenation of text and image latents, offsets the “time” coordinate of each context image in factorized 3D RoPE, and trains a rectified-flow transformer with a conditional flow-matching objective in latent space (&&&^{^{^{^3all:\3}}} OR title:\3^{^{^{^&&&).}}} On KontextBench, a benchmark with ^{^{^{^{3all:\3^{^{^{^{,^{^{^{^3query3}}}}}}}}}}} OR title:\3^{^{^⁶}} image-prompt pairs across local editing, global editing, character reference, style reference, and text editing, FLUX.^{^{^{^{3all:\3^{^{^{^}}}}}}} Kontext[pro] and [max] rank at or near the top in human ELO evaluations; on five successive edits, AuraFace cosine similarity averages ^{^{^{^{3query3^{^{^{^{.9^{^{^{^3query3}}}}}}}}}}} for FLUX.^{^{^{^{3all:\3^{^{^{^}}}}}}} Kontext[pro], versus ^{^{^{^{3query3^{^{^{^.774}}}}}}} for Runway Gen-4 and ^{^{^{^{3query3^{^{^{^{.4^{^{^{^3all:\3}}}}}}}}}}} for GPT-4o-High (&&&^{^{^{^3all:\3}}} OR title:\3^{^{^{^&&&).}}} The reported inference time is 3–5 seconds for ^{^{^{^{3all:\3query3}}}} OR title:\3^{^{^{^{4×^{^{^{^{3all:\3query3}}}}}}}} OR title:\3^{^{^⁴}} images on a single A^{^{^{^{3all:\3query3query3^{^{^{^}}}}}}} GPU (&&&^{^{^{^3all:\3}}} OR title:\3^{^{^{^&&&).}}}

5. Efficiency, token management, and forward-only adaptation

As contextual generation scales, sequence length becomes the main systems bottleneck. In Diffusion Transformers, in-context generation concatenates noisy latent tokens $D=\{(x_i,y_i)\}_{i=1}^n$ 3 with a fixed reference sequence $D=\{(x_i,y_i)\}_{i=1}^n$ 4, giving self-attention cost $D=\{(x_i,y_i)\}_{i=1}^n$ 5 (&&&^{^{^{^{3all:\3all:\3^{^{^{^&&&).}}}}}}} ToPi addresses this with training-free token pruning. It first computes a layerwise Context Sensitivity Score

$D=\{(x_i,y_i)\}_{i=1}^n$ 6

on a calibration set, selects the top- $D=\{(x_i,y_i)\}_{i=1}^n$ 7 representative layers, and then scores each context token by a value-weighted attention influence metric (&&&^{^{^{^{3all:\3all:\3^{^{^{^&&&).}}}}}}} Pruning is updated only at anchor timesteps through a fidelity-constrained objective that preserves at least a fraction $D=\{(x_i,y_i)\}_{i=1}^n$ 8 of total influence (&&&^{^{^{^{3all:\3all:\3^{^{^{^&&&).}}}}}}} On Flux.^{^{^{^{3all:\3^{^{^{^-Kontext}}}}}}} and Qwen-Image-Edit, ToPi yields about ^{^{^{^{3all:\3^{^{^{^{.^{^{^³}}}}}}}}}} OR title:\3all:\3^{^{^{^{×–^{^{^{^{3all:\3^{^{^{^.33×}}}}}}}}}}} speedup, recovers within $D=\{(x_i,y_i)\}_{i=1}^n$ 9 dB of full-context PSNR, adds $\theta^*$ ^{^{^{^{3query3^{^{^{^}}}}}}} latency overhead, and removes over 5^{^{^{^{3query3^{^{^{^%}}}}}}} of reference tokens on average while preserving at least 85% of the context “information mass” (&&&^{^{^{^{3all:\3all:\3^{^{^{^&&&).}}}}}}}

Instructional image editing exposes a parallel efficiency–precision problem. In-Context Edit treats a pretrained DiT inpainting model as a black box by forming a side-by-side in-context image in which the source image occupies the left half and the target half is masked; the associated IC prompt describes the original image on the left and the instructed edit on the right (&&&^{^{^{^{3all:\3query3^{^{^{^&&&).}}}}}}} A training-free version already benefits from the IC prompt alone, improving CLIP-I from ^{^{^{^{3query3^{^{^{^{.68^{^{^{^{3all:\3^{^{^{^}}}}}}}}}}}}}}} to ^{^{^{^{3query3^{^{^{^.794}}}}}}} and GPT from ^{^{^{^{3query3^{^{^{^{.^{^{^{^3all:\3}}}}}}}}}}} to ^{^{^{^{3query3^{^{^{^{.^{^{^³}}}}}}}}}} OR title:\3^{^{^⁴}} in the reported ablation (&&&^{^{^{^{3all:\3query3^{^{^{^&&&).}}}}}}} The paper then adds a LoRA-MoE hybrid, in which the output of the frozen base layer is augmented by a sparse mixture of low-rank experts, and an early-filter inference-time scaling method that scores partial denoising trajectories with Qwen-VL-7^{^{^³}} OR title:\3^{^{^{^B}}} (&&&^{^{^{^{3all:\3query3^{^{^{^&&&).}}}}}}} The early filter improves SC by ^{^{^{^{3all:\3^{^{^{^9%}}}}}}} and overall VIE-Score by ^{^{^{^{3all:\3^{^{^{^6%}}}}}}} over single-seed outputs (&&&^{^{^{^{3all:\3query3^{^{^{^&&&).}}}}}}}

Forward-only adaptation also appears in retrieval-augmented generation. RAG-GD keeps the retriever and LLM backbone frozen, trains a base retrieval adapter $\theta^*$ ^{^{^{^{3all:\3^{^{^{^,}}}}}}} and then meta-trains a predictor $\theta^*$ ^{^{^³}} OR title:\3^{^{^{^}}} that maps a few-shot RAG support set to low-rank updates approximating what $\theta^*$ 3 steps of SGD would have done to the retrieval interface (&&&^{^{^{^{3all:\33^{^{^{^&&&).}}}}}}} At inference, the update is produced in one small forward pass rather than test-time backpropagation (&&&^{^{^{^{3all:\33^{^{^{^&&&).}}}}}}} On Qwen ^{^{^³}} OR title:\3^{^{^{^.5}}} B with E5 retrieval, the reported average EM/F^{^{^{^{3all:\3^{^{^{^}}}}}}} improves from 34.^{^{^{^{3all:\3^{^{^{^{6/4^{^{^³}}}}}}}}}} OR title:\3^{^{^{^.54}}} to 36.7^{^{^{^{3all:\3^{^{^{^{/45.^{^{^{^{3all:\3all:\3}}}}}}}}}}}} and the method approaches test-time gradient adaptation at much lower per-^{^{^{^{^{^{^{^{3query3^{^{^{^{^{^{^{^}}}}}}}}}}}}}}} cost (&&&^{^{^{^{3all:\33^{^{^{^&&&).}}}}}}}

6. Failure modes, safeguards, and open problems

A recurring concern is that contextual generation can improve surface quality while drifting semantically or statistically. Li et al.’s derived-prompt framework addresses semantic drift with two explicit safeguards: a KL penalty keeps the learned derived-prompt policy close to its reference initialization, and the final inference template always asks the model to answer the original prompt rather than the derived one (&&&^{^{^{^{3all:\3^{^{^{^5&&&).}}}}}}} In tabular generation, by contrast, the few-shot examples themselves may be the attack surface; prompt audit, balanced prompt design, fairness-guided exemplar selection, post-generation debiasing, and model-internal defenses are proposed as mitigation strategies for in-context bias propagation (Recasens et al., 11 Jun 2025).

Several modality-specific limitations remain. Context Diffusion reports that very fine-grained local edits can still fail and that, when visual context and text disagree, the model tends to favor the context image (&&&^{^{^³}} OR title:\3^{^{^{^&&&).}}} X-Prompt notes that the base Chameleon VQ-VAE compresses at ^{^{^{^{3all:\3^{^{^{^6×}}}}}}} and loses fine detail, and that generalization degrades across completely unrelated tasks (Sun et al., 2024). FLUX.^{^{^{^{3all:\3^{^{^{^}}}}}}} Kontext reports minor artifact accumulation and occasional instruction non-compliance after 6–7 edits in multi-turn workflows (&&&^{^{^{^3all:\3}}} OR title:\3^{^{^{^&&&).}}} In machine translation, progressive accumulation of synthetic demonstrations improves retrieval-based reuse but does not fully match dynamic on-the-fly generation (Lee et al., 31 May 2025). In educational question generation, example selection remains sensitive, and retrieved passages may be semantically related yet contextually irrelevant (&&&4^{^{^³}} OR title:\3^{^{^{^&&&).}}}

The broader research agenda remains explicitly open. The data-generation survey identifies the mechanistic origin of skill learning, the causal linkage between the pre-training function class and learnable in-context functions, the extension of the framework to chain-of-thought reasoning and self-critique, and unified probabilistic frameworks that cover both recognition and learning as central future directions (&&&^{^{^{^{3all:\3^{^{^{^&&&).}}}}}}} A plausible implication is that progress in in-context generation will depend less on any single prompting heuristic than on a joint theory of context selection, context compression, implicit optimization, and failure analysis across modalities.