Cold-Start Sampling Strategy
- Cold-start sampling strategy is a technique for selecting informative data points from an unlabeled pool when labeled data is minimal or absent.
- It leverages methods such as embedding-based clustering, outlier detection, and proxy-task uncertainty to maximize diversity, coverage, and information gain.
- Empirical studies and theoretical analyses show that these strategies significantly outperform random sampling in tasks like medical image segmentation and recommendation systems.
Cold-start sampling strategy refers to the initial data selection policy or algorithmic procedure used when a learning system, model, or pipeline must make informative decisions with no or minimal labeled data. In contrast to warm-start approaches, which assume the availability of prior annotations, model checkpoints, or expert feedback, cold-start sampling must maximize information gain, diversity, or representativeness based on intrinsic data properties, unsupervised signals, or surrogate metrics. Recent advances span active learning, semi-supervised learning, recommendation systems, domain adaptation, Bayesian filtering, and high-dimensional sampling, emphasizing algorithmic rigor, statistical guarantees, and empirical validation.
1. Foundations: Motivation and Problem Definition
In annotation-driven tasks such as medical image segmentation, credit card fraud detection, class-imbalanced document classification, or recommender initialization, cold-start sampling strategies define which elements from an unlabeled pool should be manually labeled (or pseudo-labeled) to optimize downstream accuracy, uncertainty estimation, or preference modeling under budget constraints. This is critically important when:
- Labeling cost or expertise is prohibitive (e.g. 3D MRI scans or specialist-reviewed recommendations)
- The data manifold is high-dimensional, multimodal, or highly imbalanced
- The system must rapidly bootstrap classifiers, segmenters, or personalization engines
Formally, given an unlabeled pool U = {x_1, …, x_N} and a cold-start budget B, the goal is a selection algorithm which outputs a set of indices or examples S ⊂ U with |S| = B such that the initial model trained on the labeled set {(x_i, y_i) : x_i ∈ S} yields superior downstream performance compared to random selection, especially with respect to diversity, coverage, or informativeness (Levy et al., 26 Jan 2026, Mannix et al., 2023, Barata et al., 2021).
2. Core Methodological Families
a. Embedding-based Clustering and Manifold Partitioning
Modern strategies leverage pretrained foundation model embeddings (from self-supervised, transfer, or contrastive learning) to map each datum x_i to a high-level feature vector z_i = f_θ(x_i). Dimensionality reduction (t-SNE, PCA) may be employed for geometric interpretability, and unsupervised clustering (typically k-means, k-medoids, or greedy k-center) is performed to partition the pool into K clusters C_1, …, C_K (Levy et al., 26 Jan 2026, Yuan et al., 2024, Mannix et al., 2023). Cluster assignments are used to select medoid examples (cluster centers) for diversity and proportional sampling for coverage:
- Automatic selection of the cluster count K via silhouette-score maximization
- Cluster-proportional allocation: roughly b_j = B · |C_j| / N labels assigned to cluster C_j
- Intra-cluster farthest-point selection for maximizing spread
Example pseudocode (from Levy et al., 26 Jan 2026):

```text
1. Compute embeddings z_i = f_θ(x_i) for i = 1…N
2. Project z_i to t_i = tSNE(z_i)
3. For each k in K_set:
     Run k-means on {t_i} -> clusters {C_j^(k)}
     Compute silhouette score S(k)
4. Pick K̂ = argmax_k S(k)
5. Run k-means with k = K̂ for clusters {C_j}
6. S = ∅
7. For each cluster C_j:
     Add medoid m_j = argmin_{i∈C_j} Σ_{ℓ∈C_j} ‖t_i − t_ℓ‖₂ to S
8. Proportional allocation & farthest-point selection within clusters
```
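The medoid and farthest-point steps of the pseudocode above can be sketched in NumPy as follows. This is an illustrative re-implementation, not the authors' code: the bare-bones `kmeans` stands in for any clustering routine, and the function names are ours.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Bare-bones k-means (stand-in for sklearn KMeans or k-medoids)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def cold_start_select(X, labels, budget):
    """One medoid per cluster, then proportional farthest-point selection.
    Assumes budget >= number of clusters."""
    k = labels.max() + 1
    selected = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        D = np.linalg.norm(X[idx, None] - X[None, idx], axis=-1)
        selected.append(idx[D.sum(1).argmin()])   # medoid: min total distance
    # proportional allocation of the remaining budget across clusters
    sizes = np.bincount(labels, minlength=k)
    extra = np.maximum(np.round((budget - k) * sizes / sizes.sum()).astype(int), 0)
    for j in range(k):
        idx = [i for i in np.where(labels == j)[0] if i not in selected]
        for _ in range(min(extra[j], len(idx))):
            # farthest-point: maximize distance to everything selected so far
            d = np.array([min(np.linalg.norm(X[i] - X[s]) for s in selected)
                          for i in idx])
            pick = idx[int(d.argmax())]
            selected.append(pick)
            idx.remove(pick)
    return selected
```

On two well-separated blobs, the medoid step alone already guarantees one label per cluster, which is the diversity property the strategy targets.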
Empirical results on image segmentation and classification tasks show that such strategies outperform random sampling, with relative gains of up to 50% on early-round Dice score, Hausdorff distance, and class-coverage metrics (Levy et al., 26 Jan 2026, Yuan et al., 2024).
b. Outlier, Discriminative, and Semi-supervised Ranking
For imbalanced datasets, cold-start selection may begin with unsupervised outlier detection (Isolation Forest, density estimation) to prioritize samples least likely to stem from the majority class (Barata et al., 2021). Once limited labels are obtained, discriminative active learning between the labeled and unlabeled pools (ODAL) can capture underrepresented regions. Semi-supervised label-propagation mechanisms use graph-based entropy or propagated uncertainty as sampling criteria (Brangbour et al., 2022).
Example: ODAL as a Warm-Up (Barata et al., 2021)
- Cold stage: Isolation Forest on U for top outliers
- Warm stage: train an outlier detector on L; score U by its "outlierness" relative to L
- Transition to fully supervised uncertainty sampling (entropy, QBC, etc.)
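The cold and warm stages above can be sketched with scikit-learn's `IsolationForest`; this is a minimal illustration of the two-stage idea, with function names and query policy ours rather than the ODAL reference implementation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def cold_stage(U, n_queries, seed=0):
    """Cold stage: rank the unlabeled pool U by Isolation Forest outlierness
    and query the most anomalous points (least likely majority-class)."""
    iso = IsolationForest(random_state=seed).fit(U)
    scores = -iso.score_samples(U)            # negate: higher = more anomalous
    return np.argsort(scores)[::-1][:n_queries]

def warm_stage(U, L, n_queries, seed=0):
    """Warm stage: fit the detector on the labeled set L and query the
    unlabeled points that look most like outliers with respect to L."""
    iso = IsolationForest(random_state=seed).fit(L)
    scores = -iso.score_samples(U)
    return np.argsort(scores)[::-1][:n_queries]
```

On a pool of 100 inliers plus 3 distant points, the cold stage queries exactly the 3 distant points, which is the intended behavior for surfacing minority-class candidates.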
c. Proxy-Task Uncertainty and Self-Supervision
Certain cold-start frameworks build surrogate (pseudo) labels using task-specific transformations, e.g., windowing and thresholding to create proxy binary masks in medical segmentation, followed by proxy model training and ranking by uncertainty (MC dropout entropy or variance) (Nath et al., 2022). Early sample selection maximizes variance/reward under proxy predictions.
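The uncertainty-ranking step can be sketched as follows, assuming the proxy model's T stochastic forward passes (e.g., MC dropout) have already produced softmax outputs; the entropy-of-the-mean criterion shown here is one common choice, and the helper names are ours.

```python
import numpy as np

def predictive_entropy(probs):
    """probs: array of shape (T, N, C) holding T stochastic softmax passes.
    Returns the entropy of the mean predictive distribution, per sample."""
    p = probs.mean(axis=0)                        # (N, C) mean prediction
    return -(p * np.log(p + 1e-12)).sum(axis=1)   # (N,) predictive entropy

def rank_by_uncertainty(probs, budget):
    """Select the `budget` samples the proxy model is least certain about."""
    return np.argsort(predictive_entropy(probs))[::-1][:budget]
```

A sample with a near-uniform predictive distribution is ranked ahead of confidently predicted ones, matching the "maximize variance/reward under proxy predictions" criterion.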
d. Meta-Learning and Popularity-Based Cold-Start Recommendation
Meta-learning architectures specifically address user- or item-side cold-start in recommender systems. Popularity-Aware Meta-learning (PAM) dynamically partitions interaction batches into multiple popularity-bracketed meta-tasks with feature reweighting and cold-bucket data augmentation. Cross-task or contrastive enhancements can simulate embeddings unavailable due to sparsity (Luo et al., 2024, Lee et al., 2019).
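A simplified sketch of popularity-bracketed meta-task construction is below. It is not the PAM implementation: the quantile bracketing and names are ours, and the real method adds feature reweighting and cold-bucket augmentation on top of this partition.

```python
import numpy as np

def popularity_meta_tasks(interactions, n_buckets=3):
    """Partition (user, item) interactions into popularity-bracketed
    meta-tasks: bucket 0 holds the coldest (least-interacted) items.
    Simplified sketch of popularity-aware task construction."""
    items, counts = np.unique([i for _, i in interactions], return_counts=True)
    # quantile edges over per-item interaction counts define the brackets
    edges = np.quantile(counts, np.linspace(0, 1, n_buckets + 1)[1:-1])
    bucket_of = dict(zip(items, np.searchsorted(edges, counts, side="right")))
    tasks = [[] for _ in range(n_buckets)]
    for u, i in interactions:
        tasks[bucket_of[i]].append((u, i))
    return tasks
```

Each meta-task can then be trained with its own feature reweighting, so that the cold bucket's episodes resemble the deployment-time cold-start distribution.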
e. Policy Learning for Active Preference and User Simulation
In the absence of labeled preference pairs, self-supervised pre-training (e.g., one-component PCA) generates pseudo-labels with residual weighting; a subsequent uncertainty-based sampling policy refines the model through strategic oracle queries (Fayaz-Bakhsh et al., 7 Aug 2025). RL-based or contextual-bandit policies can optimize selection for augmentation (e.g., which users are emulated by LLM when creating cold-start items) (Subbaraman et al., 27 Nov 2025).
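The one-component-PCA pseudo-labeling idea can be sketched as follows. This is an illustrative reading of the approach, not the cited method's code: the sign-based binary pseudo-label and the residual-based weighting scheme here are our assumptions.

```python
import numpy as np

def pca1_pseudo_labels(X):
    """Self-supervised cold start: score items by their projection onto the
    first principal component, and weight each pseudo-label by how well a
    single component explains the item (small residual -> high weight)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]                              # 1-D preference proxy
    residual = np.linalg.norm(Xc - np.outer(scores, Vt[0]), axis=1)
    weights = 1.0 / (1.0 + residual)                 # downweight poorly-explained items
    pseudo = (scores > 0).astype(int)                # binary pseudo-preference
    return pseudo, weights
```

The resulting weighted pseudo-labels bootstrap an initial preference model, after which uncertainty-based oracle queries take over.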
3. Theoretical Guarantees and Statistical Properties
Mixing from Cold Start in Sampling Algorithms
For high-dimensional log-concave density sampling, theoretical analysis centers on the concept of warmness (Rényi divergence) and the design of annealing or multiscale Markov chains (e.g., cube-based decompositions, proximal samplers) that achieve rapid mixing from cold starts, i.e., arbitrary initial distributions possibly far from target equilibrium (Kook et al., 3 May 2025, Narayanan et al., 2022). Multiscale walks rely on isoperimetric inequalities sensitive to boundary structure, with polynomial mixing-time guarantees even when the start is only M-warm for M exponentially large in the dimension.
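The warmness notion used here is standard and worth stating explicitly:

```latex
% A starting distribution \mu_0 is M-warm with respect to the target \pi if
\mu_0(A) \le M\,\pi(A) \quad \text{for every measurable set } A,
% equivalently, warmness is the L^\infty (order-\infty R\'enyi) bound
M \;=\; \sup_A \frac{\mu_0(A)}{\pi(A)} \;=\; \Big\|\tfrac{d\mu_0}{d\pi}\Big\|_\infty.
% A "cold start" is one where M may be exponential in the dimension d
% (e.g. M = 2^{O(d)}), so useful mixing bounds must depend only
% polylogarithmically on M.
```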
Core result example (Kook et al., 3 May 2025): if the target π is log-concave with covariance Σ and bounded support diameter D, a warm start can be generated via accelerated variance annealing combined with proximal-sampler contraction, reducing the polynomial dimension dependence of the overall sampling complexity relative to prior annealing schemes when π is in isotropic position.
4. Impact and Quantitative Evaluation
Tables below summarize selected improvements (relative to random) in initial sample selection.
| Dataset | Random Dice | Cold-Start Dice | Random HD95 | Cold-Start HD95 |
|---|---|---|---|---|
| CheXmask-300 | 0.918 | 0.929 | 32.41 mm | 27.66 mm |
| Montgomery CXR | 0.928 | 0.950 | 14.22 mm | 9.38 mm |
| SynthStrip MRI | 0.801 | 0.807 | 9.43 mm | 8.69 mm |
On CIFAR-10, Cold PAWS (k-medoids + t-SNE) covers ~90% of all classes within the first 30 labels, compared to ~67% for random selection (Mannix et al., 2023). Cold-start strategies also yield recall gains for minority classes in class-imbalanced problems, and recall@5 improvements of up to 60% for cold items in recommender meta-learning (Luo et al., 2024).
5. Visualization and Interpretability
Cold-start strategies often facilitate representation-space visualization. Typical workflows use 2D t-SNE projection for:
- Display of all unlabeled points (gray)
- Cluster medoid/farthest samples (blue circles)
- New acquisitions after AL rounds (green squares)
- Test-set overlays for coverage verification (orange triangles)
- Cluster-boundary or convex-hull overlays
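A minimal matplotlib sketch of such an overlay plot (covering a subset of the layers above) is shown below; the projection `t` is assumed to be precomputed (e.g., by t-SNE), and all function and file names are ours.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # headless backend for scripted use
import matplotlib.pyplot as plt

def plot_selection(t, medoids, acquired, test_t, path="coldstart_tsne.png"):
    """Overlay a cold-start selection on a 2-D projection `t` of the pool:
    gray = unlabeled, blue circles = medoids, green squares = new
    acquisitions, orange triangles = test-set coverage check."""
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(t[:, 0], t[:, 1], c="lightgray", s=10, label="unlabeled pool")
    ax.scatter(t[medoids, 0], t[medoids, 1], c="tab:blue", s=60,
               label="cluster medoids")
    ax.scatter(t[acquired, 0], t[acquired, 1], c="tab:green", marker="s",
               s=60, label="AL acquisitions")
    ax.scatter(test_t[:, 0], test_t[:, 1], c="tab:orange", marker="^",
               s=30, label="test set")
    ax.legend()
    fig.savefig(path, dpi=150)
    return fig, ax
```

Inspecting the saved figure lets annotators verify visually that the selected points span all apparent modes before any labels are spent.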
This enables audit of diversity and mode coverage in selection.
6. Practical Recommendations and Guidelines
- Use foundation-model embeddings for geometric representation
- Automate cluster-count selection (silhouette, elbow method)
- Prefer medoid + proportional/farthest-point within clusters for diversity
- In imbalanced contexts, apply outlier-detection or element-wise propagated entropy ranking as soon as minimal labels are available
- For streaming/online recommendation, partition cold vs. warm tasks; supplement cold buckets with simulated embeddings from warm history
- In preference learning, initiate with self-supervised surrogate labels; refine with uncertainty sampling and explicit modeling of oracle noise
7. Limitations and Open Challenges
While embedding-based, cluster-guided, and meta-learned strategies substantially outperform random sampling in cold-start regimes, limitations remain:
- Dependence on representation quality and domain transfer of foundation models
- Sensitivity to cluster imbalance or poor feature space geometry
- Computational complexity of embedding, pairwise-distance, and clustering computations for very large unlabeled pools
- Lack of formal sample complexity bounds in semi-supervised and RL-based cold-start augmentation
Future directions include hybrid active learning schemes, continual adaptation for concept drift, tighter theoretical mixing time analyses, and domain-specific fine-tuning of unsupervised representations.
For a rigorous development of embedding-based clustering for medical image segmentation cold-start, see "From Cold Start to Active Learning: Embedding-Based Scan Selection for Medical Image Segmentation" (Levy et al., 26 Jan 2026). For unsupervised label selection with clustering and manifold learning in vision, consult "Cold PAWS: Unsupervised class discovery and addressing the cold-start problem for semi-supervised learning" (Mannix et al., 2023). For theoretical analysis of cold-start sampling in log-concave densities, examine "Faster logconcave sampling from a cold start in high dimension" (Kook et al., 3 May 2025).