Adaptive STaR (AdaSTaR): Multi-Domain Adaptivity

Updated 20 February 2026
  • Adaptive STaR is a family of adaptive algorithms that dynamically adjust data selection and parameter weighting based on latent statistics across various domains.
  • It applies to self-taught reasoning in language models, semantic table fusion in retrieval tasks, multi-domain recommender systems, and unsupervised adaptation in speech recognition.
  • Reported results include a 58.6% average reduction in training FLOPs (reasoning LMs), an 8.0% CTR uplift (recommender systems), and a 13.5% relative WER reduction (ASR).

Adaptive STaR (AdaSTaR) refers to a family of algorithms and architectures that introduce adaptivity into training frameworks named STaR or STAR (for example, the “Self-Taught Reasoner”), spanning LLM self-improvement, table representation fusion, speech recognition adaptation, and multi-domain recommender systems. All variants share one unifying principle: model parameters or training data are selected or weighted adaptively according to latent statistics, attention patterns, domain structure, or sampling diversity. The following synthesis details the main AdaSTaR instantiations and their technical underpinnings as reported in recent arXiv literature.

1. Adaptive Data Sampling for Self-Taught Reasoners

AdaSTaR in the context of self-taught reasoning LLMs addresses biased observation sampling in STaR (rejection-sampling fine-tuning) pipelines, which suffer from observation imbalance: well-solved examples are overtrained while challenging ones are undertrained (Koh et al., 22 May 2025). AdaSTaR introduces two core adaptive sampling mechanisms:

  • Adaptive Sampling for Diversity (AdaD): Each observation maintains statistics $\tilde t_i$ (last-sampled iteration) and $w_i$ (fraction of correct CoT generations). Observations are managed in a min-heap ordered lexicographically by $(\tilde t_i, w_i)$. The sampler prioritizes examples least recently updated, then hardest among them, enforcing balanced frequency and preference toward difficult observations only after sufficient coverage of the train set.
  • Adaptive Sampling for Curriculum (AdaC): Statistic updates are modulated by model strength, estimated via per-iteration training accuracy $\alpha$. Only $\lfloor m \cdot \alpha^2 \rfloor$ out of $m$ observations receive a statistic refresh, delaying a full focus on difficult cases until the model is empirically strong.
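
The two mechanisms above can be sketched with Python's standard `heapq` module. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `AdaptiveSampler`, the method names, and the choice to refresh the first $\lfloor m \cdot \alpha^2 \rfloor$ observations in priority order are all hypothetical.

```python
import heapq
import math


class AdaptiveSampler:
    """Hypothetical sketch of AdaD + AdaC; names and details are assumptions."""

    def __init__(self, num_observations):
        # Heap entries are (last_sampled_iter, win_rate, obs_id). Tuples compare
        # lexicographically, so the least recently sampled observation comes
        # first, with ties broken by the lower win rate (harder examples).
        self.heap = [(0, 0.0, i) for i in range(num_observations)]
        heapq.heapify(self.heap)

    def sample(self, m):
        """AdaD: pop the m highest-priority observations."""
        m = min(m, len(self.heap))
        return [heapq.heappop(self.heap) for _ in range(m)]

    def push_back(self, sampled, win_rates, iteration, alpha):
        """AdaC: refresh statistics for only floor(m * alpha^2) observations.

        Assumption: the refreshed observations are the first ones in priority
        order; the rest keep their old statistics and stay near the top.
        """
        k = math.floor(len(sampled) * alpha ** 2)
        for rank, (t_old, w_old, obs_id) in enumerate(sampled):
            if rank < k:
                entry = (iteration, win_rates[obs_id], obs_id)
            else:
                entry = (t_old, w_old, obs_id)
            heapq.heappush(self.heap, entry)
        return k


sampler = AdaptiveSampler(num_observations=6)
batch = sampler.sample(4)
refreshed = sampler.push_back(
    batch, {0: 0.5, 1: 0.0, 2: 1.0, 3: 0.25}, iteration=1, alpha=0.5
)
```

With a weak model (small $\alpha$), few statistics are refreshed, so most of the batch remains at the front of the heap and is revisited before the sampler moves on.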

Formally, at each iteration, $\beta^t$ batch elements are sampled adaptively, CoT chains are generated and accepted via rule-based outcome checking, and the model is fine-tuned via negative log-likelihood on accepted chains. The full AdaSTaR algorithm is provided in pseudocode in (Koh et al., 22 May 2025). Empirical results on Llama 3.2 3B and other LMs demonstrate consistent test accuracy improvement over strong STaR and SFT baselines, achieving best performance on ARC-C, CQA, CLadder 1.5, ANLI, GSM8K, and SVAMP, while reducing total training FLOPs by an average of 58.6%.
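
The generate-and-accept step of one iteration might look like the sketch below. The function `rejection_sample`, the callables `generate` and `is_correct`, and the definition of $\alpha$ as the mean per-observation win rate are assumptions made for illustration; the fine-tuning step itself is omitted.

```python
def rejection_sample(batch, generate, is_correct, num_samples=4):
    """Collect accepted CoT chains for one iteration (hypothetical sketch).

    batch:      list of (obs_id, question, answer) tuples
    generate:   callable question -> chain (stands in for the language model)
    is_correct: callable (chain, answer) -> bool (rule-based outcome check)
    """
    accepted = []   # (question, chain) pairs used for NLL fine-tuning
    win_rates = {}  # obs_id -> fraction of correct generations (w_i)
    for obs_id, question, answer in batch:
        chains = [generate(question) for _ in range(num_samples)]
        correct = [c for c in chains if is_correct(c, answer)]
        win_rates[obs_id] = len(correct) / num_samples
        accepted.extend((question, c) for c in correct)
    # Assumption: training accuracy alpha is the mean win rate over the batch.
    alpha = sum(win_rates.values()) / len(batch) if batch else 0.0
    return accepted, win_rates, alpha


# Toy stand-ins for the model and the answer checker.
toy_generate = lambda q: q + " = 4"
toy_check = lambda chain, answer: chain.endswith(answer)
accepted, win_rates, alpha = rejection_sample(
    [(0, "2+2", "4"), (1, "3+3", "6")], toy_generate, toy_check, num_samples=2
)
```

The returned `win_rates` and `alpha` are the quantities an adaptive sampler needs to refresh its per-observation statistics.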

Ablation confirms that both adaptive diversity and curriculum are essential: diversity alone induces frequency uniformity but degrades accuracy, whereas curriculum alone yields marginal gains. Compute overhead from AdaSTaR’s statistics is negligible; all updates are piggybacked on the core rejection-sampling loop.

2. Adaptive Fusion in Semantic Table Representation

In table retrieval and semantic table representation, AdaSTaR denotes a neural adaptive fusion extension to the STAR architecture (Hsu et al., 22 Jan 2026). The AdaSTaR design augments STAR’s two-stage pipeline—header-aware clustering and cluster-specific synthetic query generation—by replacing manual or fixed-cluster fusion with a per-cluster attention network:

  1. Header-aware clustering: Rows are embedded as $e_i = \alpha e_H + (1-\alpha) e_{r_i}$, balancing header (schema) content and instance.
  2. Cluster-specific query generation: Each semantic cluster $R_k$ forms a mini-table, over which an LLM generates a synthetic query $q_k$.
  3. Adaptive weighted fusion: Both representative partial-row tables and their synthetic queries are encoded separately. A one-layer MLP takes the concatenation $[E_{table}; E_{query_k}]$ for each cluster and projects it onto a learnable vector, producing normalized weights $w_k$ via softmax. The final table embedding is given by $E_{fused} = \sum_{k=1}^K w_k [E_{table}; E_{query_k}]$.
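
Steps 1 and 3 can be illustrated in plain Python. This is a sketch under assumptions: the function names, the `tanh` activation, and the parameter names `W`, `b`, and `v` are hypothetical, and the encoders and LLM query generation are replaced by hand-written vectors.

```python
import math


def header_aware_embedding(e_header, e_row, alpha):
    """Step 1: e_i = alpha * e_H + (1 - alpha) * e_{r_i}."""
    return [alpha * h + (1 - alpha) * r for h, r in zip(e_header, e_row)]


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_fusion(e_table, e_queries, W, b, v):
    """Step 3: softmax-weighted sum of [E_table; E_query_k] over clusters.

    W, b: one-layer MLP parameters (assumed tanh activation)
    v:    learnable vector that the hidden state is projected onto
    """
    concat = [e_table + e_q for e_q in e_queries]  # list concatenation
    scores = []
    for x in concat:
        hidden = [
            math.tanh(sum(w * xj for w, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)
        ]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(concat[0])
    fused = [sum(w * x[j] for w, x in zip(weights, concat)) for j in range(dim)]
    return fused, weights


# Two clusters, 2-d table and query embeddings, zero projection vector.
W = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
fused, weights = adaptive_fusion(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], W, b=[0.0, 0.0], v=[0.0, 0.0]
)
```

With `v` set to zero every cluster receives the same score, so the fusion reduces to a plain average; training `W`, `b`, and `v` is what lets informative clusters receive more weight.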

AdaSTaR’s learned fusion mechanism explicitly allocates more weight to semantically informative clusters, enhancing representation alignment with user queries and thereby improving recall in retrieval tasks. On Mimo (zh/en), OTTQA, FetaQA, and E2E-WTQ, AdaSTaR outperforms both fixed- and cosine-based fusion, demonstrating the benefits of trainable, input-conditional fusion (Hsu et al., 22 Jan 2026).

3. Star Topology Adaptive Recommender for Multi-Domain CTR Prediction

In recommender systems, AdaSTaR corresponds to the Star Topology Adaptive Recommender for simultaneously serving multiple business domains with a hybrid of shared and per-domain parameters (Sheng et al., 2021). Its core architectural properties are:

  • Star topology: A centralized shared trunk ($\Theta_\mathrm{shared}$), with lightweight domain-specific branches ($\Theta_d$) for each domain $d$. Each fully connected layer's parameters are element-wise modulated by the domain via $W^\star_d = W \otimes W_d$ and $b^\star_d = b + b_d$.
  • Partitioned normalization: Embedding outputs are normalized per domain with domain-specific moving statistics and scale/bias vectors.
  • Auxiliary domain network: The domain indicator is passed through a small MLP, yielding an additive logit adjustment to the final CTR prediction.
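
A star-topology fully connected layer and the auxiliary logit adjustment can be sketched as follows. The function names and the toy parameter values are assumptions for illustration; partitioned normalization and the embedding layers are omitted.

```python
import math


def effective_params(W_shared, b_shared, W_domain, b_domain):
    """Combine shared and domain parameters: W*_d = W (x) W_d, b*_d = b + b_d.

    The product is element-wise, so the result can be precomputed once per
    domain and reused at inference time.
    """
    W_star = [
        [ws * wd for ws, wd in zip(row_s, row_d)]
        for row_s, row_d in zip(W_shared, W_domain)
    ]
    b_star = [bs + bd for bs, bd in zip(b_shared, b_domain)]
    return W_star, b_star


def fc_forward(x, W_star, b_star):
    """Plain fully connected layer using the precomputed domain weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_star, b_star)]


def ctr_prediction(main_logit, aux_logit):
    """The auxiliary domain network contributes an additive logit adjustment."""
    return 1.0 / (1.0 + math.exp(-(main_logit + aux_logit)))


W_star, b_star = effective_params(
    W_shared=[[1.0, 2.0], [3.0, 4.0]],
    b_shared=[0.5, 0.0],
    W_domain=[[1.0, 0.5], [0.0, 2.0]],
    b_domain=[0.5, 1.0],
)
out = fc_forward([1.0, 1.0], W_star, b_star)
```

Because the modulation is element-wise, each domain adds only one extra weight matrix and bias per layer, and serving a request costs the same as a single ordinary layer once `W_star` and `b_star` are cached.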

In this context, AdaSTaR adapts the model to both global (across-domain) and local (domain-specific) user/item statistics. It combines the strengths of shared-bottom and mixture-of-experts designs, but with minimal parameter overhead and efficient inference via precomputed domain-specific weights. Production deployment at Alibaba yielded an average +8.0% CTR and +6.0% RPM uplift versus single-domain baselines, with <1% memory increase (Sheng et al., 2021).

4. Unsupervised Adaptation for Speech Foundation Models

For automatic speech recognition (ASR), AdaSTaR indicates an unsupervised adaptation algorithm for Transformer-based speech models, most notably Whisper and Canary (Hu et al., 2024). The objective is source-free unsupervised domain adaptation (UDA) using only unlabeled target-domain data $\mathcal{X}^{(t)}$, with the aim of reducing WER on new domains in the absence of supervised data or access to original source-domain data (addressing privacy, cost, and practicality constraints).

Key AdaSTaR mechanisms include:

  • Quality indicator $S_t$ for pseudo-label re-weighting: $S_t$ combines a posterior-derived confidence $C_t$ and a self-attention coherence score $A_t$, using sigmoidal and exponential transforms for stability and reliability across token positions:

$$S_t = S_t^{\mathrm{conf}} + S_t^{\mathrm{cons}},$$

where $S_t^{\mathrm{conf}}$ and $S_t^{\mathrm{cons}}$ are defined as functions of $C_t$, $A_t$, the conflict threshold $\lambda=2$, and temperature $\tau=10$.

  • Utterance filtering: Pseudo-labels are further filtered by injecting noise into model weights and measuring edit distance diversity; only utterances with low uncertainty are retained.
  • Fine-tuning with weighted loss: The selected utterances and their pseudo-label transcriptions are used to minimize a weighted cross-entropy, with each token’s contribution modulated by $S_t$. Iterative pseudo-label regeneration is possible, but 2–3 rounds suffice.
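
The filtering and weighted-loss steps can be sketched as below. This is an illustration under assumptions, not the published method: the exact form of $S_t$ is not reproduced, so the weights are taken as given; the threshold `max_mean_dist`, the length normalization of the edit distance, and the per-token averaging of the loss are hypothetical choices; and the hypotheses from noise-perturbed models are supplied as plain strings.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def keep_utterance(base_hyp, perturbed_hyps, max_mean_dist=0.2):
    """Keep an utterance if decoding is stable under weight perturbation.

    perturbed_hyps are transcripts from copies of the model with noise
    injected into the weights (produced elsewhere; not shown here).
    """
    base = base_hyp.split()
    dists = [
        edit_distance(base, h.split()) / max(len(base), 1) for h in perturbed_hyps
    ]
    return sum(dists) / len(dists) <= max_mean_dist


def weighted_cross_entropy(token_logprobs, token_weights):
    """Cross-entropy on pseudo-label tokens, each scaled by its weight S_t."""
    total = sum(s * lp for s, lp in zip(token_weights, token_logprobs))
    return -total / len(token_logprobs)


stable = keep_utterance("turn on the light", ["turn on the light"] * 2)
unstable = keep_utterance(
    "turn on the light", ["turn off a light", "burn on the night"]
)
loss = weighted_cross_entropy([math.log(0.5), math.log(0.25)], [1.0, 0.0])
```

In the toy call, the second token has weight zero and therefore contributes nothing to the loss, which is the intended effect for pseudo-label tokens judged unreliable.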

AdaSTaR demonstrated an average 13.5% relative WER reduction across 14 diverse test domains with $<1$ hour of unlabeled audio per domain, sometimes matching supervised upper bounds. Catastrophic forgetting is avoided without source-domain replay. The approach generalizes to open-source and commercial models, as well as to speech translation tasks (+0.8–2.2 BLEU on FLEURS X→En) (Hu et al., 2024).

5. Summary Table: Core Mechanisms Across Domains

| Domain | Key Adaptive Mechanism | Principal Gain |
|---|---|---|
| Reasoning LMs | Adaptive sampling (diversity/curriculum) | Accuracy↑, FLOPs↓, low imbalance |
| Table retrieval | Neural attention fusion | Recall↑, semantic alignment |
| Recommender systems | Star-topology parameter modulation | Multi-domain AUC/CTR↑ |
| ASR | Attention-based pseudo-label scoring + selection | WER↓, robust UDA, no forgetting |

6. Implementation and Empirical Characteristics

All AdaSTaR variants emphasize parameter or data adaptivity implemented via either statistics tracking (as in per-observation min-heaps, curriculum gating), attention-based fusion networks, or star-topology parameter combination. Computational overhead is consistently low relative to baseline methods, either due to negligible additional per-sample computation (reasoning LMs, ASR) or due to small auxiliary networks (table retrieval, recommenders).

Across studies, AdaSTaR achieves consistent improvements in primary task metrics (accuracy, recall, WER, CTR, RPM) and data efficiency (e.g., <1 hr unlabeled audio for ASR, 58.6% average FLOPs reduction in reasoning LMs), and generalizes across model families and domains. Empirical ablations underline the necessity of adaptivity in both data selection and parameter weighting for robust, efficient learning.

7. Scope and Generalization

AdaSTaR’s cross-domain versatility—applicable to LLM self-improvement, tabular semantic alignment, multi-domain user modeling, and robust ASR adaptation—suggests broad applicability for any context requiring adaptive selection or weighting of data, representations, or parameters. The minimal computational and implementation overhead in all studies positions AdaSTaR as a scalable paradigm for efficient model adaptation and learning across a range of challenging real-world tasks (Koh et al., 22 May 2025, Hsu et al., 22 Jan 2026, Sheng et al., 2021, Hu et al., 2024).
