SFT Cold Start: Multimodal & Meta Strategies
- SFT Cold Start is a challenge in initializing machine learning systems without sufficient historical data, demanding specialized strategies for robust performance.
- It leverages methods such as embedding models, side-information integration, and contrastive learning to mitigate data sparsity in recommender systems and serverless computing.
- Recent approaches integrate multimodal reasoning and meta-learning to enhance rapid adaptation and improve recommendations in cold start scenarios.
Cold start refers to the scenario in which a system—typically a recommender, serverless platform, or active learning framework—must perform its core function without access to sufficient historical data pertaining to the target entity, be it a user, item, function, or variable. In recommender systems, this often means giving recommendations to new users or about new items with few or no prior interactions. In serverless computing, cold start denotes the latency experienced when a function is invoked after a period of inactivity and its environment must be reinitialized. Across domains, the cold start phenomenon introduces marked operational challenges and motivates a wide array of specialized mitigations, spanning algorithmic, architectural, and statistical perspectives.
1. Cold Start in Recommender Systems
The cold start problem is foundational to recommender system research and manifests when users or items lack historical interaction data. Standard collaborative filtering (CF) algorithms, which rely on past behavior, perform poorly under such sparsity. Proposed mitigations can be grouped as follows:
Embedding and Hybrid Models
- Embedded Collaborative Filtering (ECF) (Zhou et al., 2017) leverages Word2Vec-based dimensionality reduction on implicit feedback, transforming items into lower-dimensional latent spaces. Items are mapped from sparse one-hot representations to dense vectors using Skip-Gram (SG) or Continuous Bag-of-Words (CBOW) models. Sessions are randomly sampled to improve co-occurrence modeling. Neighborhood CF strategies then operate over these learned embeddings: Item-Item-KNN aggregates similarities per session while User-Item-KNN averages item embeddings per user. ECF explicitly combines short-term and long-term session models using an interpolation coefficient.
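A minimal illustrative sketch of this pipeline follows; the session data, Skip-Gram hyperparameters, and the interpolation weight `alpha` are assumptions for demonstration, not values from the paper.

```python
# Sketch: embed items from implicit-feedback sessions with Skip-Gram, then score
# candidates by interpolating a short-term (last item) and long-term (session mean) view.
import numpy as np
from gensim.models import Word2Vec

sessions = [["i1", "i2", "i3"], ["i2", "i4"], ["i1", "i4", "i5"], ["i3", "i5"]]

# Skip-Gram (sg=1) embedding of items based on session co-occurrence.
w2v = Word2Vec(sentences=sessions, vector_size=16, window=3, sg=1, min_count=1, epochs=50, seed=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def score(session, candidate, alpha=0.5):
    """Interpolate short-term (last item) and long-term (whole-session) similarity."""
    cand = w2v.wv[candidate]
    short = cosine(w2v.wv[session[-1]], cand)
    long = cosine(np.mean([w2v.wv[i] for i in session], axis=0), cand)
    return alpha * short + (1 - alpha) * long

current = ["i1", "i2"]
candidates = [i for i in w2v.wv.index_to_key if i not in current]
print(sorted(candidates, key=lambda c: score(current, c), reverse=True)[:3])
```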
Matrix Factorization and Side Information
- Collective Matrix Factorization (CMF) (Cortes, 2018) simultaneously factorizes user-item interactions and side information matrices via joint latent representations. The offset formulation enables rapid computation of cold start entity factors using attribute matrices (no need to solve linear systems), making real-time recommendation tractable. For a new user (or item), the latent vector is formed as the product of its attribute vector with the learned attribute-factor matrix.
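The toy sketch below shows the offset-style cold-start scoring described above; the factor matrices here are random stand-ins for what a trained CMF model would learn.

```python
# Illustrative offset-style cold-start scoring: a new user's latent factor is a
# single matrix product of attributes and a learned projection, so no linear
# system has to be solved at request time.
import numpy as np

rng = np.random.default_rng(0)
k, n_items, n_user_attrs = 8, 100, 5

item_factors = rng.normal(size=(n_items, k))       # stand-in for learned item factors
attr_factors = rng.normal(size=(n_user_attrs, k))  # stand-in for learned attribute-factor matrix

def cold_user_factor(user_attrs):
    """Cold-start user factor as attributes times the attribute-factor matrix."""
    return user_attrs @ attr_factors

new_user_attrs = rng.normal(size=n_user_attrs)
scores = item_factors @ cold_user_factor(new_user_attrs)
print(np.argsort(-scores)[:10])  # top-10 recommended item indices
```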
Pure Cold Start and Optimization
- In situations with neither interaction nor auxiliary data (Meng et al., 2020), solutions aggregate latent signals over warm users. The objective is to select a subset of items that minimizes an aggregate favorite-loss (fav_loss) across the warm-user population.
Submodular greedy algorithms and graph-based search methods (IPGS) are shown to produce near-optimal recommendation sets under computational constraints.
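A hedged sketch of the greedy selection idea follows; the preference matrix and the coverage-style loss are illustrative stand-ins for the paper's fav_loss, not its exact definition.

```python
# Greedy selection of a fixed-size item set that minimizes an aggregate loss
# over warm users. The loss here is a simple "best selected item covers the
# user" surrogate; the preference matrix is synthetic.
import numpy as np

rng = np.random.default_rng(1)
pref = rng.random((500, 50))  # warm-user x item preference estimates

def aggregate_loss(selected):
    if not selected:
        return float(len(pref))
    return float(np.sum(1.0 - pref[:, selected].max(axis=1)))

def greedy_select(budget):
    selected = []
    for _ in range(budget):
        remaining = [j for j in range(pref.shape[1]) if j not in selected]
        best = min(remaining, key=lambda j: aggregate_loss(selected + [j]))
        selected.append(best)
    return selected

chosen = greedy_select(budget=5)
print(chosen, aggregate_loss(chosen))
```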
Contrastive Learning Approaches
- Contrastive Learning-based Cold-Start Recommendation (CLCRec) (Wei et al., 2021) formulates item representation learning as information-theoretic mutual dependence maximization between item content and collaborative signals. The framework aligns collaborative and content-based embeddings via contrastive losses that act as tractable lower bounds on this mutual dependence. Empirical results demonstrate significant improvements over BPR-optimized models, with hybrid training bridging warm/cold regimes.
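The sketch below illustrates the contrastive-alignment idea with a generic InfoNCE loss between content and collaborative item embeddings; the encoders, batch construction, and temperature are assumptions rather than CLCRec's exact objective.

```python
# InfoNCE-style alignment: pull an item's content embedding toward its
# collaborative embedding and push it away from other items in the batch.
import torch
import torch.nn.functional as F

def info_nce(content_emb, collab_emb, temperature=0.2):
    """content_emb, collab_emb: (batch, dim), aligned by item index."""
    c = F.normalize(content_emb, dim=-1)
    z = F.normalize(collab_emb, dim=-1)
    logits = c @ z.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(c.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

content = torch.randn(32, 64, requires_grad=True)  # e.g. output of a content encoder
collab = torch.randn(32, 64)                        # e.g. embeddings from CF training
loss = info_nce(content, collab)
loss.backward()
print(float(loss))
```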
Explainable Methods via Graph Reasoning
- GRECS (Frej et al., 11 Jun 2024) incorporates explicit KG path traversal and cold embedding assignment by "average translation" from neighboring entities, allowing recommendations and interpretable rationales even when few or no user-item interactions exist.
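A small sketch of the "average translation" idea in a TransE-style embedding space, with dummy entity and relation vectors standing in for a trained KG model:

```python
# Cold entity embedding as the average of translated neighbor embeddings,
# reading each known link as cold + relation ≈ neighbor (TransE-style).
import numpy as np

rng = np.random.default_rng(2)
entity_emb = {f"e{i}": rng.normal(size=32) for i in range(10)}
relation_emb = {"belongs_to": rng.normal(size=32), "located_in": rng.normal(size=32)}

# Known links of the cold entity: (relation, neighbor).
cold_links = [("belongs_to", "e3"), ("located_in", "e7")]

def average_translation(links):
    translated = [entity_emb[n] - relation_emb[r] for r, n in links]
    return np.mean(translated, axis=0)

cold_vec = average_translation(cold_links)
# Score candidates by proximity in the same embedding space.
scores = {e: -np.linalg.norm(cold_vec - v) for e, v in entity_emb.items()}
print(max(scores, key=scores.get))
```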
2. Architectural and Systems-Level Cold Start Mitigations
Cold start is also a prevalent issue in serverless computing and cloud architectures, where initialization after idleness can induce substantial latencies:
Provider-Side Dependency Optimization
- WarmSwap (Li et al., 13 Sep 2024) avoids per-function dependency initialization by live-migrating pre-initialized dependency images from a shared provider-side pool. A migration client receives process metadata and dependency pages on demand (via userfaultfd and a page server), accelerating loading for high-dependency functions by factors of $2.2\times$ and above, and reducing the optimization (image storage) footprint when a single image is shared among ten functions. Cache constraints are respected by maintaining only a fixed pool of shared images.
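As a rough illustration of the shared-pool idea only (the page-migration mechanics via userfaultfd are not modeled), the sketch below keeps a fixed-capacity cache of pre-initialized dependency images keyed by dependency set; the names and eviction policy are assumptions.

```python
# Toy provider-side pool: functions with the same dependency set share one
# pre-initialized image, and a fixed capacity bounds how many stay warm.
from collections import OrderedDict

class DependencyImagePool:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pool = OrderedDict()              # frozenset(deps) -> image handle

    def get(self, deps):
        key = frozenset(deps)
        if key in self.pool:
            self.pool.move_to_end(key)         # reuse the shared, initialized image
            return self.pool[key]
        image = f"initialized:{sorted(key)}"   # stand-in for expensive initialization
        self.pool[key] = image
        if len(self.pool) > self.capacity:     # respect the fixed cache budget
            self.pool.popitem(last=False)
        return image

pool = DependencyImagePool(capacity=2)
print(pool.get(["numpy", "pandas"]))
print(pool.get(["numpy", "pandas"]))           # second function shares the same image
```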
Profile-Guided Optimization
- SLIMSTART (Tariq et al., 27 Apr 2025) employs call-path and statistical runtime profiling to detect and transform inefficient library usage, converting eager global imports to deferred lazy loads via automated code rewriting and continuous CI/CD-adaptive monitoring. This yields substantial initialization speedups and memory-usage reductions across benchmark suites and production workloads.
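The following sketch shows the kind of eager-to-lazy import rewrite such a tool targets, using a hand-rolled lazy-module wrapper; it illustrates the pattern, not SLIMSTART's generated code.

```python
# Before: `import pandas as pd` at module scope, paid on every cold start.
# After: the heavy library is imported only on first attribute access inside
# the handler, so unused code paths no longer contribute to cold-start time.
import importlib

class LazyModule:
    """Defer importing `name` until an attribute is first accessed."""
    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)

pd = LazyModule("pandas")

def handler(event):
    # pandas is imported here only if this branch actually runs.
    if event.get("needs_dataframe"):
        return pd.DataFrame(event["rows"]).to_dict()
    return {"status": "ok"}

print(handler({"needs_dataframe": False}))
```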
OS Co-Design for Near-Warm Restores
- Spice (Holmes et al., 17 Sep 2025) co-designs snapshot/restore mechanisms with custom OS primitives and a Joint Image Format (JIF) for kernel/user state serialization. Dedicated metadata restore and host-side optimized prefetching eliminate syscall replay and bulk minor page faults, achieving sub-$5$ms restore latencies and substantial improvements over prior process-based and VM-based restoration.
Prediction-Guided Prewarming Using Transformers
- Transformer-Based Mitigation (Mouen et al., 15 Apr 2025) forecasts invocation patterns via a multi-head attention time-series model, enabling proactive scheduling of container prewarming. Experiments on Azure traces demonstrate marked reductions in cold start times versus conventional reactive provisioning.
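A compact sketch of the prediction-guided prewarming loop, using a small PyTorch attention encoder over recent invocation counts; the architecture, window length, and prewarm threshold are placeholders rather than the paper's configuration.

```python
# Forecast the next interval's invocation count from a window of recent counts
# with a small multi-head attention encoder, then prewarm above a threshold.
import torch
import torch.nn as nn

class InvocationForecaster(nn.Module):
    def __init__(self, d_model=32, nhead=4):
        super().__init__()
        self.proj = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=64, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, counts):                   # counts: (batch, window)
        x = self.proj(counts.unsqueeze(-1))      # (batch, window, d_model)
        h = self.encoder(x)
        return self.head(h[:, -1]).squeeze(-1)   # predicted next-interval count

model = InvocationForecaster()
recent = torch.rand(1, 60) * 5                   # last 60 per-minute counts (dummy)
predicted = model(recent).item()
if predicted > 0.5:                              # prewarm policy threshold (assumed)
    print(f"prewarm container (forecast={predicted:.2f})")
```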
3. Statistical and Learning-Theoretic Cold Start Strategies
Cold start is also central in active preference learning, particularly when labeled data is absent at deployment onset.
Self-Supervised Pretraining and Active Learning
- Cold Start Active Preference Learning (Fayaz-Bakhsh et al., 7 Aug 2025) initiates preference modeling by extracting pseudo-labels via Principal Component Analysis (PCA) before any oracle queries. Surrogate pairwise preferences are generated, allowing an XGBoost classifier to be pre-trained. The model then enters an active querying loop, refining its understanding by targeting informative pairs for labeling via a simulated noisy oracle (Bradley-Terry, probabilistic feedback). The PCA phase computes the leading principal direction of the unlabeled features, $\mathbf{w}_1 = \arg\max_{\|\mathbf{w}\|=1} \mathbf{w}^{\top}\Sigma\,\mathbf{w}$, and assigns the surrogate preference $x_i \succ x_j$ whenever $\mathbf{w}_1^{\top}x_i > \mathbf{w}_1^{\top}x_j$, facilitating rapid learning and superior sample efficiency on financial, career, and socio-economic datasets.
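A minimal sketch of this warm-up phase under the reading above: PCA scores induce surrogate pairwise labels that pre-train a boosted classifier before any oracle query (scikit-learn's gradient boosting stands in for XGBoost here).

```python
# Project unlabeled items onto the first principal component, derive surrogate
# pairwise preferences from the projection order, and pre-train a classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))                       # unlabeled item features

proj = PCA(n_components=1).fit_transform(X).ravel()  # scores along the first PC

# Surrogate pairwise examples: features are differences x_i - x_j,
# label 1 if the PCA score ranks i above j.
pairs = rng.integers(0, len(X), size=(1000, 2))
feats = X[pairs[:, 0]] - X[pairs[:, 1]]
labels = (proj[pairs[:, 0]] > proj[pairs[:, 1]]).astype(int)

clf = GradientBoostingClassifier().fit(feats, labels)
print("pseudo-label agreement:", clf.score(feats, labels))
```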
4. Recent Trends in Multimodal and Meta-Learning Cold Start
The emergence of large-scale multimodal models and meta-learning frameworks has motivated advanced cold start solutions:
Multimodal Reasoning via SFT and RL
- Two-stage training pipelines (Wei et al., 28 May 2025) use Supervised Fine-Tuning (SFT) as a "cold start" step that injects structured chain-of-thought (CoT) reasoning. This is followed by reinforcement learning via Group Relative Policy Optimization (GRPO), driving the model toward answer accuracy. Training involves staged curriculum learning with distinct learning rates for the SFT and RL phases and a reward signal based on verified answer correctness.
This method yields consistent improvements on MathVista, We-Math, and related benchmarks among open-source MLLMs at both 3B and 7B scales.
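The sketch below illustrates the RL stage's two core ingredients as described above: a rule-based accuracy reward and GRPO-style group-relative advantages; the exact-match verifier and group size are simplifications.

```python
# Each prompt gets a group of sampled answers; each answer receives a rule-based
# accuracy reward, and advantages are the rewards standardized within the group.
import numpy as np

def accuracy_reward(predicted: str, gold: str) -> float:
    return 1.0 if predicted.strip() == gold.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

gold = "42"
group = ["42", "41", "42", "7"]                 # sampled answers for one prompt
rewards = [accuracy_reward(a, gold) for a in group]
print(rewards, group_relative_advantages(rewards))
```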
Meta Transitional Learning for Sequential Recommendations
- MetaTL (Wang et al., 2021) casts cold-start sequential recommendation as a few-shot learning problem. A translation-based model processes item transitions, and meta-learning rapidly adapts representations via task-specific gradient steps, optimizing the query loss across sampled tasks for fast cold user adaptation.
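A compact MAML-style sketch of the idea: each cold user is a few-shot task over item transitions, an inner step adapts a shared translation vector, and the outer step updates shared parameters from the query loss (toy data and model, not MetaTL's full architecture).

```python
# Few-shot meta-learning over item transitions (prev_item -> next_item) with a
# TransE-style translation: prev + r ≈ next.
import torch

n_items, dim = 50, 16
item_emb = torch.nn.Parameter(torch.randn(n_items, dim) * 0.1)
translation = torch.nn.Parameter(torch.zeros(dim))
meta_opt = torch.optim.Adam([item_emb, translation], lr=1e-2)

def transition_loss(trans, pairs):
    prev, nxt = pairs[:, 0], pairs[:, 1]
    pred = item_emb[prev] + trans
    return ((pred - item_emb[nxt]) ** 2).sum(dim=1).mean()

def meta_step(support, query, inner_lr=0.1):
    # Inner loop: adapt the translation on the user's few support transitions.
    grad = torch.autograd.grad(transition_loss(translation, support),
                               translation, create_graph=True)[0]
    adapted = translation - inner_lr * grad
    # Outer loop: evaluate the adapted parameters on the query transition.
    return transition_loss(adapted, query)

support = torch.randint(0, n_items, (3, 2))   # 3 support transitions for one cold user
query = torch.randint(0, n_items, (1, 2))     # 1 query transition
loss = meta_step(support, query)
meta_opt.zero_grad()
loss.backward()
meta_opt.step()
print(float(loss))
```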
5. Causal and Similarity-Based Forecasting in Time Series Cold Start
In forecasting, the lack of history for some variables is exacerbated in multivariate time series:
Causal Demand Forecasting Model
- CDF-cold (Fatemi et al., 2023) integrates causal graphs (via VARLiNGAM) and Graph Neural Networks (GNNs) with LSTM layers for representation learning. The framework leverages side information and similar data centers, identified via clustering and similarity metrics (GMM, Eros norm), to produce forecasts for cold-start series. The system exhibits improved MSE, MAE, and MAPE compared to LSTM-only or non-causal models.
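The sketch below illustrates similarity-based selection of a warm source series with a simplified Eros-like score over principal directions; the weighting scheme and data are illustrative simplifications of the paper's procedure.

```python
# Compare the principal directions of two multivariate series and pick the most
# similar warm data center as a source of history for the cold-start one.
import numpy as np

rng = np.random.default_rng(4)

def eros_like(A, B):
    """A, B: (time, variables) multivariate series; compare right singular vectors."""
    _, sa, Va = np.linalg.svd(A - A.mean(0), full_matrices=False)
    _, sb, Vb = np.linalg.svd(B - B.mean(0), full_matrices=False)
    w = sa + sb
    weights = w / w.sum()                     # eigenvalue-based weights (simplified)
    return float(np.sum(weights * np.abs(np.sum(Va * Vb, axis=1))))

cold = rng.normal(size=(50, 4))               # short history of the cold series
warm_pool = {f"dc{i}": rng.normal(size=(500, 4)) for i in range(5)}

best = max(warm_pool, key=lambda k: eros_like(cold, warm_pool[k][-50:]))
print("borrow history / model from:", best)
```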
6. Implications, Limitations, and Future Directions
Across domains, several recurring themes and open challenges emerge:
- Dependency on Auxiliary Data: Most cold start mitigations perform best when content, attributes, or knowledge graph relations are available. Pure cold start (no side information, no feedback) remains a frontier, often relying on global heuristics or aggregated warm signals (Meng et al., 2020).
- Trade-offs in Warm vs. Cold Performance: Incorporating side information or offsets aids cold start but may marginally degrade warm-start accuracy (Cortes, 2018).
- Hyperparameter Sensitivity and Automation Needs: Many frameworks require empirically tuned parameters (window sizes, sampling rates, weights), and automated methods for calibration are needed (Zhou et al., 2017).
- Scalability and Real-Time Demands: Efficient computational strategies (lazy migration, batched memory restoration, prefetching) are central to making cold start solutions viable for large-scale deployment (Li et al., 13 Sep 2024, Holmes et al., 17 Sep 2025).
- Interpretability and Explainability: New methods (GRECS) focus on providing human-interpretable rationales for recommendations under cold start, utilizing explicit graph reasoning to build user trust (Frej et al., 11 Jun 2024).
The diversity of strategies and persistent emergence of new techniques reflect the foundational status of cold start as a challenge and stimulus for algorithmic and systems innovation in data-intensive computing.