
Weakly Supervised In-Context Learning

Updated 8 October 2025
  • WS-ICL is a paradigm that relaxes the need for precise labels by using noisy and indirect signals to guide model adaptation without weight updates.
  • It employs probabilistic modeling and gradient-alignment analysis of supportive pretraining data to synthesize effective training signals from heterogeneous sources.
  • The approach demonstrates robustness to label noise and imbalance, offering scalable, annotation-efficient learning across various applications.

Weakly Supervised In-Context Learning (WS-ICL) is an evolving paradigm that seeks to enable learning from examples or signals that are noisy, indirect, or drawn from sources whose label space is not aligned with the target task. It generalizes the core idea of in-context learning—model adaptation at inference time without weight updates—by relaxing the quality and granularity requirements on contextual supervision. This approach is motivated by the need to reduce annotation costs, scale machine learning to new domains, and handle scenarios where perfect or dense supervision is infeasible.

1. Probabilistic Modeling Techniques for Weak Indirect Supervision

A key technical foundation for WS-ICL is the use of probabilistic modeling to reconcile signals from indirect, heterogeneous, or mismatched supervision sources. The Probabilistic Label Relation Model (PLRM) exemplifies this approach (Zhang et al., 2021). In settings where indirect labeling functions (ILFs) produce outputs in spaces only loosely related to the target label set, PLRM accepts user-provided label relations as input, representing them as a label graph with pairwise relations (“exclusive,” “overlapping,” etc.).

PLRM is constructed as a factor graph with:

  • A latent binary vector \bar{Y} indicating assignment to “seen” labels.
  • Dependency functions that enforce pseudo-accuracy (alignment between ILF outputs and the expected target labels, considering the non-exclusive neighborhoods from the label graph) and encode fine-grained label relations.

The joint log-linear model is:

P_{\Theta}(y, \bar{Y}, \hat{y}) \propto \exp(\Theta^\top \cdot \text{Dep}(y, \bar{Y}, \hat{y})),

where \text{Dep}(\cdot) aggregates factor functions for all dependency types. Parameters are learned by minimizing the negative marginal log-likelihood. This probabilistic framework enables aggregation and “translation” of indirect or weak supervision signals into a common target label space, making it possible to synthesize usable training signals for downstream learning—even in the absence of direct labels for the target.

The modularity of this probabilistic scheme allows upstream synthesis of context examples for in-context learners, and the approach offers statistical efficiency (i.e., generalization bounds of similar order as traditional weak supervision) (Zhang et al., 2021).
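The log-linear structure above can be made concrete with a toy instantiation. The sketch below is illustrative only: the label spaces, the two dependency factors, and the helper names (`dep_features`, `marginal_nll`) are assumptions for exposition, not PLRM's actual factor design. It computes P_Θ ∝ exp(Θ·Dep) over a small discrete space and evaluates the negative marginal log-likelihood, marginalizing the latent vector \bar{Y}.

```python
import numpy as np
from itertools import product

# Toy instantiation (hypothetical) of the log-linear model
# P_Theta(y, Ybar, yhat) ∝ exp(Theta · Dep(y, Ybar, yhat)).
Y = [0, 1]                               # target label space
YHAT = [0, 1, 2]                         # ILF output space, loosely related to Y
YBAR = list(product([0, 1], repeat=2))   # latent binary "seen label" vectors

def dep_features(y, ybar, yhat):
    # Two toy dependency factors: pseudo-accuracy (ILF output agrees with
    # the target label) and a factor tied to the latent "seen" indicator.
    pseudo_acc = 1.0 if yhat == y else 0.0
    relation = float(ybar[y])
    return np.array([pseudo_acc, relation])

def marginal_nll(theta, y_obs, yhat_obs):
    # Negative marginal log-likelihood of an observed (y, yhat) pair,
    # marginalizing the latent vector Ybar over its discrete support.
    scores = {(y, b, h): theta @ dep_features(y, b, h)
              for y in Y for b in YBAR for h in YHAT}
    log_z = np.log(sum(np.exp(s) for s in scores.values()))
    marg = sum(np.exp(scores[(y_obs, b, yhat_obs)]) for b in YBAR)
    return log_z - np.log(marg)

theta = np.array([1.0, 0.5])
print(marginal_nll(theta, y_obs=1, yhat_obs=1))
```

In practice the discrete sums over \bar{Y} and \hat{y} are handled by factor-graph inference rather than brute-force enumeration, and Θ is fit by gradient descent on this objective.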

2. Impact of Pretraining Data and Supportive Examples

Emergent in-context learning abilities are substantially influenced by the nature and diversity of pretraining data (Han et al., 2023). Rather than arising from direct exposure to demonstration examples, in-context capabilities often develop when the pretraining corpus contains “supportive” instances—examples whose gradient directions align with the ICL objective (as measured by gradient similarity, e.g., \cos(\nabla_\theta L_{PT}, \nabla_\theta L_{ICL})).

Characteristics of supportive pretraining data include:

  • Higher mass over long-tail (rare) tokens, as quantified by a lower Zipf coefficient.
  • Higher requirement for long-range contextual reasoning as indicated by lower information gain scores (harder-to-use context).
  • Weak or incidental domain overlap with downstream tasks—domain proximity is not the main driving factor.

Iterative gradient-alignment approaches (e.g., ORCA-ICL) can identify these supportive samples. Perturbative continued pretraining on such data can bolster in-context learning abilities by up to 18% (Han et al., 2023). For WS-ICL, this supports a weak supervision regime where strategic curation or upweighting of challenging, contextually rich, or rare-token-heavy instances augments a model’s proficiency in leveraging weak in-context cues.
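The gradient-alignment scoring at the heart of this selection step can be sketched as follows. The gradient vectors here are synthetic placeholders standing in for per-example pretraining gradients and the ICL-objective gradient; the real ORCA-ICL procedure iterates this selection with continued pretraining.

```python
import numpy as np

# Illustrative sketch: rank pretraining examples by the cosine similarity
# between their loss gradient and the gradient of an ICL objective.
rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

g_icl = rng.normal(size=128)          # placeholder for ∇θ L_ICL
g_pt = rng.normal(size=(1000, 128))   # placeholder per-example ∇θ L_PT

scores = np.array([cosine(g, g_icl) for g in g_pt])
supportive = np.argsort(scores)[::-1][:50]   # top-50 most aligned examples
print(scores[supportive[0]])
```

The selected subset would then be upweighted or used for continued pretraining; in a real model the per-example gradients are typically approximated (e.g., from a subset of parameters) to keep this tractable.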

3. Label Relationships and ICL Limitations

In-context learning uses statistical properties of demonstration labels in the context window to “infer” input-output mappings, as articulated via joint probability formulations over the sequential prompt (Kossen et al., 2023). Empirically, ICL can perform nontrivial in-context “learning” on previously unseen tasks by leveraging input–label co-occurrences, but some notable limitations arise:

  • Persistent pretraining biases can remain, especially when demonstration labels conflict with prior data (e.g., label flips in sentiment analysis).
  • Demonstrations are not weighted uniformly—recency bias is observed, with examples closer to the query exerting greater influence.
  • The approach cannot fully override pretraining tendencies, and improvements in log-probabilities or certainty are bounded.

For WS-ICL, these findings imply that while weak, noisy, or indirect supervision can inform label relationships, the maximum achievable accuracy is capped when in-context signals are insufficient to overcome strong priors.
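The two limitations above—recency weighting and a prior that is diluted but never fully overridden—can be caricatured with a toy predictor. This is not the transformer mechanism, just an illustrative model: the prior enters as a fixed pseudo-demonstration, so flipping every demonstration label moves the prediction but cannot drive it to zero.

```python
import numpy as np

# Toy model (illustrative only) of recency bias plus a sticky pretraining prior.
def toy_icl_predict(demo_labels, prior=0.8, prior_weight=2.0, decay=0.7):
    # demo_labels: 0/1 labels ordered from farthest to closest to the query.
    # Demonstrations closer to the query get exponentially larger weights;
    # the prior acts as a fixed pseudo-demonstration that cannot be removed.
    w = decay ** np.arange(len(demo_labels) - 1, -1, -1)
    evidence = float(w @ np.asarray(demo_labels)) + prior_weight * prior
    total = float(w.sum()) + prior_weight
    return evidence / total  # probability assigned to label 1

# Flipping all demonstration labels to 0 lowers, but does not eliminate,
# the preference for the prior-favoured label 1.
print(toy_icl_predict([1, 1, 1, 1]), toy_icl_predict([0, 0, 0, 0]))
```

Under this caricature, the gap between the flipped-label prediction and zero is exactly the bounded residue of the pretraining prior that the empirical studies observe.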

4. Robustness, Label Imbalance, and Model Scale

Comparative analyses of ICL versus supervised learning indicate that in-context methods are more robust to label noise and imbalance, particularly as model scale increases (Wang et al., 2023). Key findings include:

  • ICL is less sensitive to label corruption than supervised learning. For example, with increasing label noise, the performance drop for ICL is about 11% versus 19% for supervised models.
  • ICL exhibits remarkable insensitivity to label imbalance—for all model sizes, changes in label distribution in the context have minimal performance impact.
  • The presence of gold (correct) labels is still critical for ICL, especially in large models, but overall, models with more parameters demonstrate greater robustness to perturbations.

This suggests that WS-ICL is especially attractive for scenarios with noisy or imbalanced weak supervision, provided that as many gold labels as possible are included in demonstrations and model scale is sufficiently large.
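A noise-robustness probe of this kind is easy to simulate. In the sketch below, a k-nearest-neighbour majority vote over the demonstration set stands in for the in-context predictor—an assumption for illustration, not a claim about how ICL works—and a fraction of demonstration labels is corrupted before prediction.

```python
import numpy as np

# Synthetic probe: how does demonstration-label noise affect a simple
# demonstration-driven predictor? Entirely illustrative.
rng = np.random.default_rng(1)

def make_data(n, rng):
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, 2)) + 3.0 * y[:, None]  # two separated clusters
    return X, y

def corrupt_labels(y, noise_rate, rng):
    y = y.copy()
    flip = rng.random(len(y)) < noise_rate
    y[flip] = 1 - y[flip]
    return y

def icl_accuracy(noise_rate, k=5):
    Xd, yd = make_data(200, rng)   # demonstrations
    Xq, yq = make_data(200, rng)   # queries
    yd = corrupt_labels(yd, noise_rate, rng)
    dists = np.linalg.norm(Xq[:, None, :] - Xd[None, :, :], axis=-1)
    nn = np.argsort(dists, axis=1)[:, :k]
    pred = (yd[nn].mean(axis=1) > 0.5).astype(int)
    return float((pred == yq).mean())

print(icl_accuracy(0.0), icl_accuracy(0.3))
```

The majority vote over k neighbours is what absorbs the label noise; a predictor that trusted a single demonstration would degrade much faster, which mirrors the intuition that aggregation over the context window buys ICL its robustness.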

5. Data Generation, Skill Recognition, and Skill Learning

The mechanism by which in-context learning operates can be conceptualized under a Bayesian data generation framework, wherein the model samples an output conditional on a latent concept/function inferred from the prompt (Mao et al., 3 Feb 2024):

p(\text{output} \mid \text{prompt}) = \int_{\text{concept}} p(\text{output} \mid \text{concept}, \text{prompt})\, p(\text{concept} \mid \text{prompt})\, d(\text{concept})

This perspective highlights two core phenomena:

  • Skill recognition: the model identifies which pre-trained data generation process applies, robustly leveraging weak signals.
  • Skill learning: the model can synthesize hypotheses outside its pre-trained repertoire when prompted with sufficient (even if noisy) weak supervision.

For WS-ICL, this means that the quality of weak supervision affects both the inferability of latent concepts and the capacity for on-the-fly adaptation. Demonstration quality and alignment with the intended task remain central issues.
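With a discrete concept set, the integral above reduces to a mixture, which makes the skill-recognition view easy to compute. The posterior and per-concept likelihoods below are made-up numbers for illustration.

```python
import numpy as np

# Discrete instance of p(output|prompt) = Σ_c p(output|c, prompt) p(c|prompt).
# All probabilities here are illustrative placeholders.
p_concept_given_prompt = np.array([0.7, 0.2, 0.1])  # posterior over 3 concepts
p_output_given_concept = np.array([                 # rows: concepts, cols: outputs
    [0.9, 0.1],
    [0.2, 0.8],
    [0.5, 0.5],
])

p_output = p_concept_given_prompt @ p_output_given_concept
print(p_output)  # marginal predictive distribution over outputs
```

Skill recognition corresponds to the posterior p(concept | prompt) concentrating on one row; noisier weak supervision flattens that posterior, and the prediction degrades toward the prior mixture.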

6. Automated, Contrastive, and Implicit Weak Supervision

Recent developments explore the automation of context selection and the introduction of weakly-supervised contrastive signals. Auto-ICL allows models to autonomously generate their own in-context demonstrations and instructions, obviating human curation (Yang et al., 2023). This supports WS-ICL systems that leverage their own generative and retrieval capacities to bootstrap weak supervision.

Contrastive in-context learning reinterprets the transformer’s self-attention as an implicit contrastive mechanism that aligns semantically similar, even poorly labeled, examples in the representation space (Miyanishi et al., 23 Aug 2024). In multimodal settings, careful management of format biases (fixed effects) and semantic losses (random effects) via analytical frameworks allows models to exploit weak or noisy supervision effectively.

Implicit methods such as I2CL transform the in-context demonstration set into a compressed activation-space vector, reducing sensitivity to order and content. The linear injection of this context vector into a model’s residual stream provides robust performance with lower computational overhead (Li et al., 23 May 2024).
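The I2CL idea can be sketched under simplifying assumptions: compress the demonstrations' hidden activations into one context vector and add a scaled copy of it to the query's residual-stream activations. The mean-pooling compressor and the single scalar `alpha` below are placeholders, not the paper's actual components.

```python
import numpy as np

# Sketch (with assumed components) of an I2CL-style context vector.
rng = np.random.default_rng(0)
d_model = 64

demo_activations = rng.normal(size=(8, d_model))  # one vector per demonstration

# Compress: a simple mean here, so demonstration order no longer matters.
context_vector = demo_activations.mean(axis=0)

def inject(residual, context_vector, alpha=0.1):
    # Linear injection into the residual stream at inference time.
    return residual + alpha * context_vector

query_residual = rng.normal(size=d_model)
out = inject(query_residual, context_vector)
print(out.shape)
```

Because the compression is a symmetric pooling operation, permuting the demonstrations leaves the context vector unchanged—one way to see the reduced order sensitivity the method reports.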

7. Applications, Theoretical Guarantees, and Extensions

Probabilistic WS-ICL methods have demonstrated empirical effectiveness in image and text classification, with improvements over baselines of 2%–9% (Zhang et al., 2021). In practical applications such as medical image segmentation, WS-ICL using weak prompts (e.g., bounding boxes or points) has achieved performance comparable to traditional dense-label ICL, but with orders of magnitude less annotation effort (Hu et al., 7 Oct 2025). Theoretical analysis, including tests for label distinguishability and generalization bounds, provides guidance on the statistical limits and requirements of these frameworks.

Key mathematical guarantees include:

\mathbb{E}[\|\hat{\theta} - \theta^*\|^2] \leq O(M \cdot (\log |D|)/|D|),\qquad \mathbb{E}[\ell(\hat{f}) - \ell(f^*)] \leq \chi + O(H \cdot \sqrt{(\log |D|)/|D|})

Empirical findings indicate that even when constructed with indirect, noisy, or weak context signals, probabilistic and retrieval-based WS-ICL can synthesize effective learning scenarios.


In summary, Weakly Supervised In-Context Learning leverages probabilistic modeling, supportive pretraining, robust data selection, and automated or implicit context construction to harness indirect or noisy labels. While theoretical and practical limitations persist, especially concerning alignment with strong pretraining biases and the quality of weak supervision, WS-ICL provides a promising avenue for scalable, annotation-efficient learning across modalities and tasks (Zhang et al., 2021, Han et al., 2023, Kossen et al., 2023, Wang et al., 2023, Mao et al., 3 Feb 2024, Yang et al., 2023, Miyanishi et al., 23 Aug 2024, Li et al., 23 May 2024, Hu et al., 7 Oct 2025).
