Unified Optimization with Foundation Models

Updated 24 April 2026

Unified optimization with foundation models is an emerging paradigm that leverages pretrained representations for scalable, multi-task optimization across diverse domains.
It integrates cross-domain priors and self-supervised pretraining to enhance model generalizability and address challenges like heterogeneity and sample efficiency.
Architectural innovations such as UFO, GFM, FoMEMO, and MorphDistill demonstrate practical performance improvements in speed, accuracy, and adaptability.

Unified optimization with foundation models refers to an emerging set of paradigms and algorithms in which large pretrained models are used to obtain generalizable, efficient, and often multi-task optimization solutions across domains including operations research, computer vision, federated learning, multi-objective design, and medical informatics. These approaches generalize the pretrain–transfer paradigm of LLMs to new problem classes and data modalities, augment classical optimization, and extend multi-model ensembling to foundation-scale architectures. They integrate foundation model representations with cross-task or cross-domain priors, and they address challenges such as heterogeneity, sample efficiency, and zero- or few-shot adaptation.

1. Foundations and Motivation

Unified optimization aims to overcome the inherent specialization and rigidity of conventional optimization methods and neural solvers. In traditional operations research (OR), exact algorithms (e.g., branch-and-bound) and specialized heuristics are tuned for specific problems but scale poorly or fail to generalize across different network topologies or tasks. Similarly, single-task deep models or GNN-oriented solvers manifest poor cross-task transfer and lack robustness to real-world data complexity.

In contrast, foundation models, exemplified by LLMs and large vision-language models, demonstrate that self-supervised pretraining on massive and diverse corpora induces representations with substantial transferability. The analogy extends to other domains: just as words and sentences enable semantic learning in LLMs, nodes and walks in a graph, or tokenized multi-objective trajectories, admit a grammar that is amenable to similar pretraining protocols. This motivates unified optimization approaches that leverage representation learning, self-supervision, and multi-task knowledge integration at scale (Liang et al., 29 Sep 2025, Xi et al., 2022, Yao et al., 3 Sep 2025, Khan et al., 7 Apr 2026).

2. Architectural and Algorithmic Principles

Unified optimization with foundation models is instantiated via several distinct but related architectures:

Unified Feature Optimization (UFO): UFO employs a single moderate-sized transformer backbone shared by all tasks and optimized jointly. Task-specific heads and a multi-task loss ($L_{\text{unified}}(\theta) = \sum_t \alpha_t L_t(\theta)$) coordinate training. Architectural adaptations via NAS-driven subnetwork extraction further tailor the model for deployment, balancing task performance, model size, and inference efficiency (Xi et al., 2022).
Graph Foundation Model (GFM): GFM adapts the BERT architecture to graph optimization through self-supervised pretraining on random walks, treating paths as sentences. The transformer backbone encodes structural dependencies without explicit task design. A generative heuristic decoder then tackles distance-based optimization, including shortest-path, TSP, and tour variants, through a task-agnostic Decode procedure that conditions on constraints and leverages the pretrained graph prior $\pi(\cdot \mid G)$ (Liang et al., 29 Sep 2025).
FoMEMO: For multi-objective optimization, FoMEMO pretrains a permutation-invariant transformer (PFN) to model the full posterior over scalarized objectives, conditioned on domain trajectories and user preferences. The resulting model enables in-context optimization without model retraining, as new tasks are absorbed by incorporating the trajectory and preference tokens in a single forward pass (Yao et al., 3 Sep 2025).
MorphDistill: In computational pathology, MorphDistill distills representations from multiple heterogeneous foundation models into a unified student encoder via dimension-agnostic, multi-teacher relational distillation in feature similarity space, further regularized by supervised contrastive loss for task specificity (Khan et al., 7 Apr 2026).
Ensemble Optimization (BMA/OMA): Unified optimization also includes model ensembling frameworks such as Bayesian Model Averaging (BMA) and Optimized Model Averaging (OMA), which treat ensembling of pre-trained foundation models as an optimization problem over model weights, leveraging Bayesian evidence or entropy minimization to maximize ensemble performance (Park, 28 May 2025).
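
The GFM entry above treats random walks as sentences for masked pretraining. As a minimal sketch of how such a corpus might be assembled, the following samples walks from a toy adjacency list and masks nodes in the style of BERT. The graph, the helper names `random_walk` and `mask_walk`, the walk length, and the 15% masking rate are illustrative assumptions, not details taken from Liang et al.

```python
import random

MASK = "[MASK]"  # placeholder token, by analogy with BERT (an assumption)


def random_walk(adj, start, length, rng):
    """Return a walk of up to `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk


def mask_walk(walk, mask_prob, rng):
    """Mask each node independently with probability `mask_prob`.

    Returns the masked sequence and a dict {position: original node},
    which is what a masked-prediction objective would try to recover.
    """
    masked, targets = [], {}
    for i, node in enumerate(walk):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = node
        else:
            masked.append(node)
    return masked, targets


# Hypothetical toy graph stored as adjacency lists.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c"],
}

rng = random.Random(0)  # fixed seed for reproducibility
corpus = [random_walk(adj, start, 6, rng) for start in adj for _ in range(2)]
examples = [mask_walk(walk, 0.15, rng) for walk in corpus]

for masked, targets in examples[:3]:
    print(masked, targets)
```

Each walk plays the role of a sentence and each node the role of a token, so a standard masked-token objective can be applied to the resulting sequences without task-specific labels.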

3. Optimization Objectives and Knowledge Transfer

A common thread is that unified optimization reframes classical objectives as functions over representations, weights, or outputs of large pretrained models:

Multi-task Losses: Unified models minimize aggregated losses over pre-specified tasks, sometimes with weighted synergy/conflict analysis (e.g., via Kendall $\tau$ rank correlations in UFO for synergy detection and conflict mitigation).
Self-supervised Pretraining: Foundation models are pretrained using objectives that internalize the structure or semantics of the domain: mask prediction for walks in graphs (GFM), relational alignment of pairwise feature similarities (MorphDistill), or posterior learning over aggregated (scalarized) objectives (FoMEMO).
In-context Adaptation: In FoMEMO and GFM, fast adaptation is enabled by conditioning on new data or constraints without gradient updates, moving all adaptation to the model’s attention and input tokenization.
Knowledge Distillation: MorphDistill preserves structure from multiple teachers by aligning batchwise similarity distributions, which allows for scaling knowledge transfer even across heterogeneous architectures or embedding sizes.
Meta-learning and Fine-tuning: Meta-adaptation frameworks, such as “Meta-LoRA,” integrate parameter-efficient fine-tuning (PEFT) (e.g., LoRA-style low-rank adapters) into the meta-objective. Instead of independent retraining and adaptation, the meta-objective directly optimizes for adaptability by folding in the anticipated future fine-tuning step, provably reducing error and achieving optimality in the linear case (Block et al., 2024).
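
The knowledge-distillation entry above relies on matching similarity structure rather than raw features, which is what makes it independent of embedding size. The sketch below illustrates one way this could look: each batch is reduced to a matrix of pairwise cosine similarities, each row is turned into a distribution over the other samples, and the student is penalized by the KL divergence from the teacher. The function names, the softmax temperature, the use of KL divergence, and the toy embeddings are all assumptions made for illustration; the exact loss used by MorphDistill is not specified in this article.

```python
import math


def cosine_similarity_matrix(batch):
    """Pairwise cosine similarities between the rows of a batch (rows must be nonzero)."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in batch]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(batch, norms)]
        for u, nu in zip(batch, norms)
    ]


def neighbor_distribution(sim, temperature):
    """Row-wise softmax over similarities to the *other* samples in the batch."""
    dists = []
    for i, row in enumerate(sim):
        logits = [s / temperature for j, s in enumerate(row) if j != i]
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        dists.append([e / total for e in exps])
    return dists


def relational_kl(teacher_batch, student_batch, temperature=0.5):
    """Mean KL(teacher || student) between batchwise neighbor distributions."""
    p = neighbor_distribution(cosine_similarity_matrix(teacher_batch), temperature)
    q = neighbor_distribution(cosine_similarity_matrix(student_batch), temperature)
    kl_rows = [
        sum(pi * math.log(pi / qi) for pi, qi in zip(p_row, q_row))
        for p_row, q_row in zip(p, q)
    ]
    return sum(kl_rows) / len(kl_rows)


def multi_teacher_loss(teacher_batches, student_batch, temperature=0.5):
    """Average the relational loss over several teachers (equal weights assumed)."""
    losses = [relational_kl(t, student_batch, temperature) for t in teacher_batches]
    return sum(losses) / len(losses)


# Toy embeddings for the same three samples; note the differing dimensions.
teacher_a = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.2], [0.0, 1.0, 0.5, 0.0]]
teacher_b = [[0.2, 1.0, 0.1], [0.3, 0.9, 0.0], [1.0, 0.1, 0.4]]
student = [[1.0, 0.1], [0.8, 0.3], [0.1, 1.0]]

loss = multi_teacher_loss([teacher_a, teacher_b], student)
print(f"multi-teacher relational loss: {loss:.4f}")
```

Because only the sample-by-sample similarity matrices are compared, teachers with 4-dimensional and 3-dimensional outputs can supervise a 2-dimensional student without any projection layer.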

4. Empirical Performance and Benchmarks

Experimental evidence across domains demonstrates the efficacy and efficiency of unified optimization:

GFM achieves near-optimal solutions on SP/TSP classes on real-world graphs (Berkeley, Chengdu, N up to 893), with inference times (0.04–0.10 s per instance) that are orders of magnitude faster than traditional heuristics (e.g., LKH3 requires 2328 s for a TSP on Berkeley) and with gaps to optimality as small as ≈1% (Liang et al., 29 Sep 2025).
UFO trimmed models attain parameter reductions of 30–50% (relative to the supernet) and equal or improved accuracies on all tasks (e.g., 96% on CALFW face recognition), with zero adaptation cost post-trimming (Xi et al., 2022).
In federated settings (FedOT), the combination of global classifiers and client-specific orthogonal adaptations yields higher generalization and personalization scores across five benchmarks. Ablation demonstrates that orthogonality constraints are necessary for stability—removing them causes a substantial drop in cross-client generalization (Kong et al., 26 May 2025).
FoMEMO delivers competitive or superior sample efficiency on expensive multi-objective synthetic and real-world benchmarks (best or near-best hypervolume in 10/12 real engineering tasks), with candidate selection orders of magnitude faster than Bayesian optimization with surrogates (Yao et al., 3 Sep 2025).
MorphDistill outperforms all single-teacher and baseline encoders on colorectal cancer survival prediction, showing an AUC improvement from 0.63 to 0.68, a C-index increase, and robust generalization to external cohorts (Khan et al., 7 Apr 2026).
BMA and OMA ensembles of foundation models achieve state-of-the-art classification under both in-domain and shifted distributions, beating naive averaging and best individual models, with robust scaling as new foundation models are introduced (Park, 28 May 2025).
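
The last entry contrasts weighted ensembles with naive averaging. As a rough sketch of that difference, the code below forms evidence-based weights under a uniform prior over models, so that each weight is proportional to the exponentiated log evidence, and compares the resulting ensemble prediction with a uniform average. The log-evidence values and class probabilities are invented for illustration, and the helper names are hypothetical; this is not the procedure of Park (28 May 2025), only an instance of the general idea.

```python
import math


def softmax(values):
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def evidence_weights(log_evidences):
    """Model weights proportional to exp(log evidence), assuming a uniform model prior."""
    return softmax(log_evidences)


def ensemble(probs_per_model, weights):
    """Weighted average of per-model class-probability vectors."""
    n_classes = len(probs_per_model[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probs_per_model))
        for c in range(n_classes)
    ]


def entropy(p):
    """Shannon entropy in nats; the quantity an entropy-minimizing scheme would target."""
    return -sum(x * math.log(x) for x in p if x > 0)


# Hypothetical log evidences and predictions for three pretrained models.
log_evidences = [-120.0, -118.0, -125.0]
probs = [
    [0.70, 0.20, 0.10],
    [0.85, 0.10, 0.05],
    [0.40, 0.35, 0.25],
]

uniform_weights = [1.0 / len(probs)] * len(probs)
bma_weights = evidence_weights(log_evidences)

naive_pred = ensemble(probs, uniform_weights)
bma_pred = ensemble(probs, bma_weights)

print("weights:", [round(w, 4) for w in bma_weights])
print("naive entropy:", round(entropy(naive_pred), 3))
print("weighted entropy:", round(entropy(bma_pred), 3))
```

In this toy case the model with the highest evidence dominates the weighted ensemble, whereas naive averaging lets the weakest model dilute the prediction.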

5. Challenges: Scalability, Conflict, and Specialization

Unified optimization with foundation models faces nontrivial limitations and open challenges:

Scalability: Transformer-based approaches (e.g., GFM, FoMEMO) incur $O(T^2)$ memory and compute cost, where $T$ is the sequence length or context size. Scaling to extremely large graphs, high-dimensional features, or many objectives may require sparse attention, hierarchical decomposition, or advanced embedding methods (Liang et al., 29 Sep 2025, Yao et al., 3 Sep 2025).
Task Conflicts and Synergies: Multi-task models can experience negative transfer, where certain layers or channels benefit one task while harming others. UFO implements architectural path gating and block dropping, together with NAS, to identify and mitigate these effects, preserving mutual-benefit layers while suppressing conflict (Xi et al., 2022).
Distribution Shift and Heterogeneity: In federated learning, heterogeneous client domains can cause gradient conflicts and loss of generalization. Orthogonal transformations (FedOT) are shown to preserve representation geometry and minimize gradient mismatch, with theoretical guarantees on the bound of the gradient norm differences (Kong et al., 26 May 2025).
Data Modality and Domain Prior Limitations: Pretraining on synthetic or strongly biased priors (e.g., Gaussian processes in FoMEMO) can limit performance if the downstream application diverges significantly. This suggests a need for diversified priors or hybrid approaches incorporating real data (Yao et al., 3 Sep 2025).
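
To make the quadratic cost in the scalability entry concrete, the following back-of-the-envelope calculation estimates the memory needed to hold dense attention score matrices. It assumes a full T by T score matrix per head in 32-bit floats and ignores activations, weights, and any memory-efficient attention kernel. The value 893 is borrowed from the largest graph size mentioned earlier and is treated as a sequence length purely for illustration.

```python
def attention_score_bytes(seq_len, num_heads=1, batch_size=1, bytes_per_value=4):
    """Bytes needed to store dense attention scores for one layer.

    Assumes one seq_len x seq_len matrix per head and per batch element.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


for t in (893, 10_000, 100_000):
    mib = attention_score_bytes(t) / 2**20
    print(f"T = {t:>7,}: {mib:>10,.1f} MiB per head per layer")

# Doubling the sequence length quadruples the cost.
ratio = attention_score_bytes(2_000) / attention_score_bytes(1_000)
print("cost ratio for 2x length:", ratio)
```

Under these assumptions a hundredfold increase in sequence length raises the score-matrix memory by a factor of ten thousand, which is why sparse or hierarchical schemes become attractive for very large graphs.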

6. Extensibility, Adaptation, and Future Directions

Unified optimization with foundation models extends to a variety of downstream tasks without retraining or structural changes:

Graph Routing Variants: Pretrained GFMs can be deployed on new routing problem classes (e.g., capacitated VRP or time-windows) by adapting the decode policy and imposing additional constraints at inference time (Liang et al., 29 Sep 2025).
Vision and Retrieval: UFO and its large-scale derivatives (17B-parameter VIMER-UFO) show that the same backbone can achieve state-of-the-art results on 28 distinct vision benchmarks, further emphasizing modularity and cross-task coherence (Xi et al., 2022).
Personalization and Federated Learning: Orthogonal adaptation and mixing strategies enable balancing local personalization and global generalization, with empirical robustness to scaling in client number and communication rounds (Kong et al., 26 May 2025).
Meta-learning for Adaptability: Folding adaptation into pretraining (Meta-LoRA) demonstrates that parameter-efficient fine-tuning can be optimally incorporated without the need for multi-stage pipelines, with both theoretical and empirical advantages (Block et al., 2024).
Medical Informatics: Multi-teacher distillation (MorphDistill) enables integrating knowledge from specialized and general foundation models for tailored medical prediction tasks, outperforming any individual backbone (Khan et al., 7 Apr 2026).
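
The routing entry above describes adapting the decode policy and imposing extra constraints at inference time. One simple way to picture this is greedy decoding with a feasibility mask: a scoring function ranks candidate next nodes, and candidates that would violate a constraint are removed before the choice is made. In the sketch below the scoring function is a stand-in (negative Euclidean distance) for whatever a pretrained model would provide, and the capacity constraint, coordinates, demands, and the name `greedy_decode` are all hypothetical.

```python
import math


def greedy_decode(score, demands, capacity, depot=0):
    """Greedy capacitated routing with an inference-time feasibility mask.

    `score(current, candidate)` ranks candidates; higher is better.
    Customers whose demand would exceed the remaining capacity are masked out.
    """
    unvisited = {n for n in demands if n != depot}
    routes, route, load, current = [], [depot], 0, depot
    while unvisited:
        # Feasibility mask: keep only customers that still fit in the vehicle.
        feasible = [n for n in sorted(unvisited) if load + demands[n] <= capacity]
        if not feasible:
            if current == depot:
                raise ValueError("a customer demand exceeds the vehicle capacity")
            # Nothing fits: return to the depot and start a new route.
            route.append(depot)
            routes.append(route)
            route, load, current = [depot], 0, depot
            continue
        nxt = max(feasible, key=lambda n: score(current, n))
        route.append(nxt)
        load += demands[nxt]
        unvisited.remove(nxt)
        current = nxt
    route.append(depot)
    routes.append(route)
    return routes


# Hypothetical instance: node 0 is the depot.
coords = {0: (0, 0), 1: (1, 0), 2: (2, 0), 3: (0, 1), 4: (0, 2)}
demands = {0: 0, 1: 2, 2: 2, 3: 2, 4: 2}
capacity = 4


def score(u, v):
    """Stand-in for a learned prior: prefer nearby nodes."""
    (x1, y1), (x2, y2) = coords[u], coords[v]
    return -math.hypot(x1 - x2, y1 - y2)


routes = greedy_decode(score, demands, capacity)
print(routes)
```

The scoring function is untouched when the constraint changes; only the mask differs, which is the sense in which a new problem variant can be handled at inference time.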

A plausible implication is that as the diversity and scale of available foundation models expand, unified optimization methodologies will grow in importance for both industrial deployment and scientific discovery across multi-modal, heterogeneous, and combinatorial settings.

References

(Liang et al., 29 Sep 2025) "Graph Foundation Models: Bridging LLM Paradigms and Graph Optimization"
(Xi et al., 2022) "UFO: Unified Feature Optimization"
(Kong et al., 26 May 2025) "Generalized and Personalized Federated Learning with Foundation Models via Orthogonal Transformations"
(Yao et al., 3 Sep 2025) "FoMEMO: Towards Foundation Models for Expensive Multi-objective Optimization"
(Block et al., 2024) "Meta-Learning Adaptable Foundation Models"
(Park, 28 May 2025) "Revisiting Bayesian Model Averaging in the Era of Foundation Models"
(Khan et al., 7 Apr 2026) "MorphDistill: Distilling Unified Morphological Knowledge from Pathology Foundation Models for Colorectal Cancer Survival Prediction"