YOTO: You Only Train Once
- YOTO is a machine learning paradigm that unifies diverse training steps into one end-to-end process, eliminating iterative retraining and hyperparameter sweeps.
- It leverages differentiable operators and unified architectures to integrate tasks like pruning, multi-task optimization, and loss weighting seamlessly.
- Implementations of YOTO demonstrate significant gains in speed, resource efficiency, and performance across computer vision, computational biology, and code analysis.
You Only Train Once (YOTO) denotes a family of machine learning frameworks that seek to minimize or outright eliminate retraining and multi-stage optimization, delivering competitive or state-of-the-art performance in diverse domains including computer vision, computational biology, code analysis, and automated pruning. By expressing end-task objectives, multi-task variants, or architectural compression in a unified, end-to-end differentiable pipeline, YOTO methods enable workflows in which exhaustive hyperparameter sweeps, task-by-task pipelines, and per-class or per-objective model updates become unnecessary. The YOTO principle has been explored under various task-specific instantiations, all with a common aim: amortizing or collapsing the repeated "train-and-fine-tune" cycle into a single, fully end-to-end process.
1. Unified Model Architectures and Parameterizations
A central motif in YOTO frameworks is the explicit unification of what would, in conventional approaches, be multiple separate training runs, networks, or sub-tasks—accomplished through shared architectures, embedding-based selectors, or parameter space interpolations.
- In "You Only Train Once: Multi-Identity Free-Viewpoint Neural Human Rendering from Monocular Videos," YOTO replaces per-identity NeRF networks with a single shared volumetric representation, expanded via a set of learnable identity codes and a pose-conditioned code query mechanism. Each identity is embedded via a dedicated code; pose-dependent motion is modeled through joint-pose cross-attention, enabling synthesis of arbitrary identities under arbitrary poses within a single network instance (Kim et al., 2023).
- For gene subset selection in single-cell transcriptomic data, YOTO learns a discrete mask over genes end-to-end with shared encoder modules for representation and task-specific heads for multiple predictive targets. The model is guided directly by the prediction loss, connecting subset discovery and downstream utility in one differentiable pipeline. Sparsity is enforced structurally, not as a separate regularizer, and only selected features are passed forward at inference (Chopard et al., 19 Dec 2025).
- In code vulnerability detection, YOTO decomposes the learning of vulnerability-specific detectors into independent fine-tuning on per-vulnerability datasets, then fuses the resulting parameter difference vectors (termed "Vul-Vectors") by direct linear combination. This parameter fusion process constructs a single model capable of multi-category detection without ever requiring joint training or retraining when new vulnerability types are introduced (Tian et al., 12 Mar 2025).
- In task balancing and loss weighting, YOTO frameworks embed the loss weights themselves as learnable parameters (typically via a softmax transformation to ensure positivity and normalization), making the selection of empirical loss weights a direct target of first-order optimization. This eliminates grid search, allowing both model and hyperparameters to be obtained in a single training run (Sakaridis, 4 Jun 2025).
- In pruning and network compression, Only-Train-Once (OTO) and its successors (OTOv2, Auto-Train-Once) use group-structured penalties and learned controller modules to drive entire blocks of parameters to exact zeros, thus producing pruned networks from scratch with no fine-tuning. All structured sparsity is achieved within a single training session through explicit group-wise projections and adaptive mask generation (Chen et al., 2021, Wu et al., 2024).
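The Vul-Vector fusion described above reduces to simple vector arithmetic over flattened model weights. A minimal numpy sketch, with all names, dimensions, and scales hypothetical (real Vul-Vectors are differences of full fine-tuned checkpoints):

```python
import numpy as np

# Hypothetical flattened parameter vectors: a shared pretrained base and
# two detectors fine-tuned independently on different vulnerability types.
rng = np.random.default_rng(0)
theta_base = rng.normal(size=8)
theta_task_a = theta_base + rng.normal(scale=0.01, size=8)  # fine-tuned on task A
theta_task_b = theta_base + rng.normal(scale=0.01, size=8)  # fine-tuned on task B

# "Vul-Vectors": per-task parameter difference vectors.
vul_a = theta_task_a - theta_base
vul_b = theta_task_b - theta_base

# Fusion is a direct linear combination added back onto the base weights;
# supporting a new vulnerability type later just means adding one more vector.
theta_fused = theta_base + vul_a + vul_b
```

The linear combination relies on the flat-minima assumption discussed in Section 2; no alignment or joint training step appears anywhere in the procedure.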
2. Differentiable Operators and End-to-End Optimization
A core technical strategy spans nearly all YOTO paradigms: non-differentiable operations (such as subset selection, group pruning, loss weighting, and model fusion) are expressed as differentiable relaxations amenable to end-to-end training.
- Discrete subset selection is handled by continuous relaxations (Plackett-Luce/Gumbel-Softmax permutations for differentiable sorting), followed by straight-through estimation to force hard masks at inference (Chopard et al., 19 Dec 2025).
- Multi-loss balancing replaces brute-force grid or random searches of weighting hyperparameters with a softmax-parametrized layer appended to the model. Gradients with respect to the loss weights (which are now unconstrained "logits") are derived, and a regularization term is added to avoid degenerate or collapsed solutions (Sakaridis, 4 Jun 2025).
- Pruning via OTO/ATO achieves exact group zeros by a half-space stochastic projection: after a warm-up period, parameter groups are projected to zero if their gradient step falls outside a hyperplane relative to the current value. Controllers in ATO output binary masks for group retention/removal via a Gumbel-sigmoid, allowing architectural selection to be learned during training (Chen et al., 2021, Wu et al., 2024).
- In parameter fusion for code analysis, the theoretical underpinnings derive from the empirically observed flatness of the loss landscape, enabling simple linear addition of independent task-specific parameter difference vectors without explicit alignment or further retraining (Tian et al., 12 Mar 2025).
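The softmax-parametrized loss weighting admits a closed-form gradient that is easy to verify numerically. A minimal sketch for a single training step, assuming three hypothetical task losses and omitting the anti-collapse regularization term:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical per-task losses at one training step.
losses = np.array([0.9, 0.3, 1.5])
logits = np.zeros(3)          # unconstrained weight "logits" (learnable)
w = softmax(logits)           # positive weights summing to 1
total = float(w @ losses)     # weighted total loss

# Closed-form gradient of the total loss w.r.t. the logits:
#   d(total)/dz_j = w_j * (losses_j - total)
grad = w * (losses - total)

# Finite-difference check of the analytic gradient.
eps = 1e-6
fd = np.array([
    (softmax(logits + eps * np.eye(3)[j]) @ losses - total) / eps
    for j in range(3)
])
assert np.allclose(grad, fd, atol=1e-4)
```

Because the logits are ordinary parameters, the same optimizer that updates the model weights updates the loss weights, which is what removes the outer grid search.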
3. Performance, Scalability, and Update Efficiency
YOTO approaches consistently demonstrate gains in scalability, training and inference efficiency, and update flexibility relative to prior art.
- In multi-subject NeRF rendering, YOTO attains average PSNR 30.54 (vs. HumanNeRF 30.38), SSIM 0.961, and lower LPIPS*, while reducing total training time for 6 subjects from 147 hours (6 per-subject models) to 31 hours (one shared model), and slashing storage cost by 6× (Kim et al., 2023).
- For multi-task gene subset selection, a single YOTO model trained once outperforms or matches all single-task classical and modern baselines (e.g., mRMR, HSIC-Lasso, PERSIST). Performance advantages are especially pronounced at larger subset sizes, and in multi-task or partially labeled contexts (Chopard et al., 19 Dec 2025).
- In code vulnerability detection, YOTO's fusion achieves superior recall and precision, both in binary and multiclass settings, compared to parameter averaging and even joint fine-tuning, with direct support for incremental addition of new classes with zero retraining cost (Tian et al., 12 Mar 2025).
- For structured pruning, OTO compresses VGG16 to 16.3% of the baseline's FLOPs and 2.5% of its parameters at 91.0% accuracy (baseline: 91.6%), without any fine-tuning; for BERT on SQuAD, it compresses the model to 40% of its original size with a 1.8× inference speedup (Chen et al., 2021). Auto-Train-Once extends this, outperforming all one-shot and multi-stage baselines across ResNet and MobileNet families on ImageNet and CIFAR (Wu et al., 2024).
- Object detection YOTO frameworks, such as the YOLO11n/DeiT/Qdrant pipeline for retail checkout, dramatically reduce model update cost: as new SKUs (product classes) are added, embeddings for new products are appended to the vector database but no further retraining of detector or metric backbone is needed. Training time is reduced by ~3× versus classical detection models, and incremental product onboarding incurs only a constant-time embedding operation (Hidayatullah et al., 4 Dec 2025).
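The retraining-free class expansion in the retail pipeline can be sketched with a toy in-memory vector store standing in for Qdrant and a placeholder for the frozen metric backbone (all names, vectors, and the `embed` function are illustrative):

```python
import numpy as np

def embed(x):
    # Stand-in for a frozen metric backbone (e.g. a DeiT encoder);
    # here simply L2-normalization of a raw feature vector.
    return x / np.linalg.norm(x)

# Vector "database": one reference embedding per known SKU.
db = {"sku_001": embed(np.array([1.0, 0.0, 0.2])),
      "sku_002": embed(np.array([0.1, 1.0, 0.0]))}

def classify(x):
    # Cosine similarity against every stored class embedding.
    q = embed(x)
    return max(db, key=lambda k: float(db[k] @ q))

# Onboarding a new product is a constant-time insert, with no retraining
# of either the detector or the embedding backbone.
db["sku_003"] = embed(np.array([0.0, 0.1, 1.0]))
```

A query such as `classify(np.array([0.0, 0.2, 0.9]))` now resolves to the newly added class purely through retrieval, which is the open-set mechanism the pipeline relies on.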
4. Task Expansions and Domain-General Mechanisms
YOTO frameworks have generalized to numerous domains and offer canonical solutions to several historically multi-stage tasks.
- Free-viewpoint rendering tasks drop per-identity optimization by learning a canonical volumetric space and pose-conditioned identity selection in one pass (Kim et al., 2023).
- Feature and subset selection for high-dimensional omics data is achieved via a single end-to-end model that couples selection and prediction across multi-task targets (Chopard et al., 19 Dec 2025).
- Hybrid object detection/recognition is realized through a modular separation of localization (YOLO-based), feature embedding (Deep Vision Transformer), and metric-space retrieval, enabling retraining-free model extension for open-set classification (Hidayatullah et al., 4 Dec 2025).
- In quantitative image quality assessment, YOTO proposes a universal encoder combined with hierarchical and semantic attention adapters, fusing both full-reference (FR) and no-reference (NR) IQA in a single model. Task mode is selected dynamically by learned segment embeddings, and joint training improves NR performance without compromise to FR (Yun et al., 2023).
- Bimanual robotic policy learning incorporates one-shot video demonstration, automated geometric/environmental augmentation, and SIM(3)-equivariant diffusion policies for dexterous manipulation, within a data-efficient, minimal-retraining workflow (Zhou et al., 24 Jan 2025).
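The coupled selection-and-prediction idea behind the gene-subset instantiation can be illustrated with a hard top-k mask over feature scores; in an actual implementation the backward pass would use a relaxed surrogate (e.g. Gumbel-softmax) via the straight-through trick, which plain numpy cannot express. Scores and expression values here are hypothetical:

```python
import numpy as np

def hard_topk_mask(scores, k):
    """Hard top-k mask used in the forward pass; an autodiff framework
    would route gradients through a soft relaxation of this operation."""
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0
    return mask

# Hypothetical learnable per-gene selection scores.
scores = np.array([2.0, -1.0, 0.5, 3.0, 0.0])
mask = hard_topk_mask(scores, k=2)

# Only the selected features are passed forward to the prediction heads,
# so sparsity is structural rather than a soft regularization penalty.
expression = np.array([5.0, 1.0, 2.0, 7.0, 3.0])
gated = expression * mask
```

Because the prediction loss backpropagates (through the relaxation) into `scores`, subset discovery and downstream utility are optimized jointly rather than in separate stages.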
5. Limitations, Theoretical Considerations, and Future Directions
Limitations across YOTO implementations generally stem from model capacity, domain-specific architectural constraints, or the assumptions implicit in one-shot unification.
- The number of supported identities or tasks is limited by parameterization capacity (e.g., the size of the shared MLPs or identity codebook in NeRF-YOTO); very large identity counts may require codebook scaling or hierarchical partitioning (Kim et al., 2023).
- Some frameworks, such as the code analysis YOTO, depend on the "flat minima" hypothesis in loss space for parameter fusion. Interpretability of fused representations and transfer to other backbones remain open questions (Tian et al., 12 Mar 2025).
- Models relying on accurate external estimates (e.g., pose tracking in human rendering) are vulnerable to upstream errors, although internal MLP corrections partially mitigate this (Kim et al., 2023).
- Pruning pipelines in OTO/ATO are currently task-agnostic but may be sensitive to how zero-invariant groups (ZIGs) are defined and, in the controller variant, to the adaptation horizon of the controller network. Non-smooth objective landscapes could challenge the theoretical guarantees (Wu et al., 2024).
- Certain modalities, such as out-of-distribution generalization in bimanual manipulation or dynamic addition of entirely new data regimes, may still require further methodological adaptation (Zhou et al., 24 Jan 2025).
Future directions cited include learning hierarchical or continuously parameterized identity/task embeddings, integrating stronger cross-modal or 2D–3D feedback in end-to-end training, and extending the controller paradigm further into architecture and hyperparameter selection (Kim et al., 2023, Wu et al., 2024).
6. Representative Implementations and Outcomes
The table below summarizes core characteristics of several YOTO instantiations.
| Domain | YOTO Mechanism | One-Shot Component | Outcome/Metric Improvement |
|---|---|---|---|
| Human rendering (NeRF) | Identity codes, pose-conditioned | Multi-id NeRF training | 4.7× speed, SOTA quality (Kim et al., 2023) |
| Gene subset selection | Diff. top-k selector + encoder | Mask+prediction end-to-end | Outperforms Seurat, PERSIST (Chopard et al., 19 Dec 2025) |
| Code vulnerability detection | Vul-vector parameter fusion | Model fusion, inc. updates | ↑recall/precision (Tian et al., 12 Mar 2025) |
| Model pruning | Group-structured penalty + HSPG | All pruning from scratch | 2–3× param/FLOP cut w/ no fine-tune (Chen et al., 2021, Wu et al., 2024) |
| Object detection (retail) | YOLO, DeiT, Proxy Anchor, Qdrant | No retraining for new classes | ~3× reduction in training time (Hidayatullah et al., 4 Dec 2025) |
| Image quality assessment | Hierarchical & semantic attention | Joint FR/NR training | SOTA PLCC/SROCC on LIVE, TID2013 (Yun et al., 2023) |
7. Significance and Theoretical Impact
YOTO approaches represent a paradigm shift in how resource amortization, representation sharing, and hyperparameter/process automation are handled in deep learning. By collapsing what were once multi-stage, resource-intensive workflows into a single, differentiable process, YOTO frameworks enable substantially faster iteration, more scalable systems, and smooth task or class expansion. Theoretical analyses (mirror-descent convergence for ATO, projection properties for OTO, differentiability and optimization of loss weight layers) underpin most practical implementations and suggest robustness of the paradigm to wide classes of loss/objective structures (Wu et al., 2024, Sakaridis, 4 Jun 2025).
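As a concrete illustration of the projection properties referenced above, a single deterministic step of an OTO-style half-space projection can be sketched as follows; the actual HSPG method is stochastic, applied per zero-invariant group after a warm-up phase, and the learning rate and threshold here are illustrative:

```python
import numpy as np

def half_space_project(x, grad, lr=0.1, eps=0.05):
    """Simplified sketch: take a gradient step per parameter group (rows),
    then project a group to exact zero when its trial point leaves the
    half-space aligned with the group's current direction."""
    trial = x - lr * grad
    # Keep a group only if <trial, x> >= eps * ||x||^2; otherwise zero it.
    keep = (trial * x).sum(axis=1) >= eps * (x * x).sum(axis=1)
    return trial * keep[:, None]

# Two parameter groups: the second has a gradient pushing it past zero.
x = np.array([[0.50, 0.50],
              [0.05, 0.05]])
g = np.array([[0.10, 0.10],
              [1.00, 1.00]])
x_new = half_space_project(x, g)   # second group is driven to exact zeros
```

Because rejected groups land on exact zeros rather than small values, the pruned architecture can be read off directly at the end of the single training run, with no thresholding or fine-tuning pass.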
In summary, YOTO frameworks systematically remove traditional retraining bottlenecks across modalities, architectures, and workflows. Through an overview of embedding-based parametrization, differentiable operators, structured optimization penalties, and end-to-end multi-task integration, "You Only Train Once" has established a canonical blueprint for scalable, update-efficient, and high-performance machine learning systems.