
Q-MetaSur: LLM Meta-Surrogate for MTMOO

Updated 24 December 2025
  • Q-MetaSur is a meta-surrogate that leverages a sequence-to-sequence transformer to translate tokenized task metadata and decision variables into accurate objective predictions for MTMOO.
  • It employs a two-stage training strategy, combining supervised fine-tuning with reinforcement learning fine-tuning and conservative Q-learning to enhance surrogate accuracy and generalization.
  • Extensive benchmarks on CEC2019 demonstrate that Q-MetaSur significantly improves both surrogate error metrics and overall optimization performance compared to traditional baselines.

Q-MetaSur is an LLM-based meta-surrogate designed to provide unified and high-fidelity surrogate modeling for offline multi-task multi-objective optimization (MTMOO), particularly in settings where direct evaluation of expensive objective functions is prohibitive. Leveraging a sequence-to-sequence approach with tailored tokenization, Q-MetaSur generalizes across heterogeneous tasks and objectives by encoding both decision variables and task metadata as token streams and using an LLM to autoregressively predict objective values. A two-stage offline training scheme—combining supervised fine-tuning and implicit Q-learning with conservative regularization—yields robust generalization. Extensive benchmarks on CEC2019 MTMOO demonstrate significant improvements in both surrogate errors and resultant optimization performance compared to established baselines (Zhang et al., 17 Dec 2025).

1. Problem Setting and Design Rationale

Q-MetaSur addresses the scenario of expensive offline MTMOO, where one must solve $T$ distinct but related multi-objective optimization problems:

\{\mathcal{T}_t\}_{t=1}^T,\quad \mathcal{T}_t = (\mathcal{X}_t, F_t, \mathcal{M}_t, \mathcal{D}_t)

with only fixed, finite datasets $\mathcal{D}_t = \{(x_i^{(t)}, F_t(x_i^{(t)}))\}_{i=1}^{N_t}$ for each task. The limitations of classical surrogate models such as per-objective GPs—namely fragmentation, cubic scaling with respect to $T$ and $N_t$, and poor handling of variable input/output dimensions—motivate a single-sequence, autoregressive surrogate.
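The per-task data available in this offline setting can be sketched as a simple container; the field names, class name, and toy numbers below are illustrative assumptions, not the paper's interface:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class OfflineTask:
    """One task T_t = (X_t, F_t, M_t, D_t) in the offline MTMOO setting.

    Only the fixed archive D_t = {(x_i, F_t(x_i))} is available at training
    time; the true objective function F_t cannot be queried again.
    """
    metadata: dict   # M_t: e.g. function name, dimensionality, instance index
    X: np.ndarray    # archived decision vectors, shape (N_t, n_t)
    Y: np.ndarray    # recorded objective values F_t(x), shape (N_t, k_t)

# Toy 2-objective task with 5 archived evaluations (hypothetical numbers).
rng = np.random.default_rng(0)
task = OfflineTask(
    metadata={"name": "sphere_pair", "n_t": 3, "k_t": 2, "instance": 1},
    X=rng.uniform(-1.0, 1.0, size=(5, 3)),
    Y=rng.uniform(0.0, 2.0, size=(5, 2)),
)
print(task.X.shape, task.Y.shape)
```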

Q-MetaSur recasts surrogate modeling as a sequence-to-sequence learning problem, using an LLM to “translate” a concatenated, tokenized representation of task metadata and decision variables into corresponding objective values. This approach leverages the expressivity of LLMs and enables joint modeling of inter-task relationships and heterogeneous objectives.

2. Unified Tokenization and Model Architecture

Textual Tokenization

Q-MetaSur utilizes a unified textual format:

  • Each instance is described as a token stream including:
    • Task metadata (e.g., function name, dimensionality, instance index)
    • Decision vector $x \in \mathbb{R}^{n_t}$ and objective vector $\mathbf{y} \in \mathbb{R}^{k_t}$ for each task $t$
  • All scalars $z$ are encoded in scientific notation:

z = \mathrm{sign}(z) \cdot m \cdot 10^\kappa,\quad m\in[1,10),\; \kappa=\lfloor \log_{10}|z| \rfloor

and tokenized as:

\phi(z) = [\pm]\;\langle 10^\kappa\rangle\; d_1 d_2 \cdots d_{n_\text{digit}}

  • The tokenizer $\tau$ maps $(m_t, x, \mathbf{y})$ to source and target token sequences:

Z \Vert O = (\tau(m_t, x),\; \tau(\mathbf{y})) \in \mathcal{V}^* \times \mathcal{V}^*
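A minimal sketch of this tokenization, assuming a hypothetical exponent-token vocabulary such as `<E2>` and a simple `key=value` encoding for metadata (the paper's exact token set may differ):

```python
import math

def tokenize_scalar(z: float, n_digits: int = 4) -> list[str]:
    """Encode z = sign(z) * m * 10^kappa as [sign, exponent token, mantissa digits],
    mirroring the phi(z) scheme above. The "<E{kappa}>" token format is an
    illustrative assumption."""
    if z == 0.0:
        return ["+", "<E0>"] + ["0"] * n_digits
    sign = "+" if z > 0 else "-"
    kappa = math.floor(math.log10(abs(z)))   # exponent
    m = abs(z) / 10 ** kappa                 # mantissa in [1, 10)
    # Truncate (rather than round) the mantissa to n_digits digits.
    digits = f"{m:.{n_digits + 2}f}".replace(".", "")[:n_digits]
    return [sign, f"<E{kappa}>"] + list(digits)

def tokenize_instance(metadata: dict, x: list[float], y: list[float]):
    """Build the source stream (metadata + decision variables) and the
    target stream (objective values) for one training instance."""
    src = [f"{k}={v}" for k, v in metadata.items()]
    for v in x:
        src += tokenize_scalar(v)
    tgt = []
    for v in y:
        tgt += tokenize_scalar(v)
    return src, tgt

print(tokenize_scalar(-273.15))   # ['-', '<E2>', '2', '7', '3', '1']
```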

Encoder–Decoder Model

  • Backbone: T5 Transformer (t5-small or t5-base), usually 12 encoder/decoder layers, model dimension $d_m$ of 512 or 768, and 8/12 self-attention heads
  • Input processing: Source tokens $Z=(z_1,\dots,z_{n_\mathrm{src}})$ are embedded and summed with positional encodings, then passed to the encoder; the decoder autoregressively predicts objective tokens with teacher forcing
  • Output: The probability of target sequence $O = (o_1, \ldots, o_{n_\mathrm{tgt}})$ given $Z$ factorizes as:

P_\theta(O \mid Z) = \prod_{i=1}^{n_\mathrm{tgt}} P_\theta\big( o_i \mid o_{<i},\; \mathrm{Encoder}(Z) \big)
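The teacher-forced factorization can be illustrated with a toy stand-in for the decoder's per-step logits (no real T5 model is involved; the numbers are illustrative):

```python
import numpy as np

def sequence_log_prob(step_logits: np.ndarray, target_ids: list[int]) -> float:
    """Teacher-forced log P(O | Z) = sum_i log P(o_i | o_<i, Encoder(Z)).

    step_logits[i] stands in for the decoder output distribution at step i,
    already conditioned on the encoded source and the gold prefix o_<i.
    """
    total = 0.0
    for logits, oid in zip(step_logits, target_ids):
        log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
        total += log_probs[oid]
    return total

# Uniform logits over a 4-token vocabulary: each step contributes log(1/4).
print(sequence_log_prob(np.zeros((3, 4)), [0, 1, 2]))   # 3 * log(1/4)
```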

3. Two-Stage Offline Training: SFT and RLFT

Stage 1: Supervised Fine-Tuning (SFT)

  • Minimize a priority-weighted cross-entropy loss emphasizing sign and exponent tokens:

\mathcal{L}_\text{sup} = -\sum_{j=1}^{k_t} \sum_{\ell=1}^{L_j} \omega_{j,\ell} \log P_\theta\big(o_{\iota_{j,\ell}} \mid o_{<\iota_{j,\ell}},\; \mathrm{Encoder}(Z)\big)

with decay weights $\omega_{j,\ell}$.
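A sketch of this priority-weighted loss, assuming a geometric decay schedule for $\omega_{j,\ell}$ so that the leading (sign and exponent) tokens dominate; the actual weighting in the paper may differ:

```python
import numpy as np

def priority_weighted_ce(step_logits: np.ndarray, target_ids: list[int],
                         decay: float = 0.5) -> float:
    """Priority-weighted cross-entropy over one objective's token sequence:
    token l receives weight omega_l = decay**l, so sign and exponent tokens
    (emitted first) are penalized most. The geometric schedule is an
    illustrative assumption."""
    loss = 0.0
    for l, (logits, oid) in enumerate(zip(step_logits, target_ids)):
        log_probs = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
        loss -= decay ** l * log_probs[oid]
    return loss

# Uniform logits over a 4-token vocabulary: each step costs log 4,
# weighted by 1, 0.5, 0.25 -> total 1.75 * log 4.
print(priority_weighted_ce(np.zeros((3, 4)), [0, 0, 0]))
```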

Stage 2: Reinforcement Learning Fine-Tuning (RLFT)

  • Treats output decoding as a POMDP; the final sequence-level reward $R(\hat{\mathbf{y}}, \mathbf{y})$ is exponential-RMSE with exponent–sign-matching bonuses
  • Data augmentation: Add noise to the ground-truth $\mathbf{y}$ to expand reward coverage
  • Attach two Q-heads and a V-head to the decoder; minimize mixed Bellman and expectile loss:

\mathcal{L}_{Q,V} = \mathbb{E}_{\tau} \left[\sum_{i=1}^L \big(R + \gamma V_\theta(h_{i+1}) - Q_\theta(h_i, a_i)\big)^2 + \mathrm{ET}\big(Q_{\hat{\theta}}(h_i, a_i) - V_\theta(h_i)\big) \right]

  • Add a conservative Q-learning regularizer:

\mathcal{L}_\text{CQL} = \mathbb{E}_{(s, a)}\, \mathrm{CE}\big(\mathrm{softmax}(Q_\theta(s, \cdot)),\, a\big)

  • Joint RLFT objective:

\mathcal{L} = \mathcal{L}_{Q,V} + \lambda_\text{CQL}\, \mathcal{L}_\text{CQL} + \mathcal{L}_\text{sup}

  • At inference, logits can be adjusted using the advantage $(Q_\theta - V_\theta)$.
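Taken together, the RLFT ingredients above admit a compact numerical sketch; all constants (the toy Q/V numbers, the `beta` scaling) are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def bellman_loss(R, gamma, V_next, Q):
    """Squared TD error (R + gamma V(h_{i+1}) - Q(h_i, a_i))^2 per decoding step."""
    return (R + gamma * V_next - Q) ** 2

def expectile_loss(diff, tau=0.7):
    """Asymmetric expectile term ET(u) = |tau - 1(u < 0)| u^2 from implicit
    Q-learning; tau > 0.5 pulls V toward an upper expectile of Q."""
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return weight * diff ** 2

def cql_regularizer(q_values, action):
    """Conservative term CE(softmax(Q(s, .)), a): treat the Q-vector as
    logits and penalize low probability on the logged (in-dataset) action,
    which suppresses Q-values of out-of-distribution actions."""
    log_probs = q_values - np.log(np.sum(np.exp(q_values)))
    return -log_probs[action]

def advantage_adjusted_logits(lm_logits, q_values, v_value, beta=1.0):
    """Inference-time shift of decoder logits by the advantage Q - V;
    beta is an illustrative scaling factor."""
    return lm_logits + beta * (q_values - v_value)

# Toy per-step values (illustrative numbers only).
q = np.array([0.2, 0.5])
v_next = np.array([0.4, 0.0])
print(bellman_loss(1.0, 0.9, v_next, q).sum())           # TD part of the loss
print(expectile_loss(np.array([1.0, -1.0]), 0.7))        # [0.7, 0.3]
print(cql_regularizer(np.zeros(5), action=2))            # log 5 for uniform Q
print(advantage_adjusted_logits(np.array([2.0, 1.0, 0.5]),
                                np.array([0.0, 1.5, 0.0]), v_value=0.5))
```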

4. Integration into Evolutionary Multi-Task Optimization

Q-MetaSur is integrated in plug-and-play fashion into any MTMOO evolutionary framework by replacing expensive calls to $F_t$ with its learned surrogate:

u.\mathrm{Obj} \gets \hat{F}_\mathrm{MetaSur}(m_t, u.\mathrm{Dec})

where $\hat{F}_\mathrm{MetaSur}$ produces autoregressive, tokenized predictions. Standard selection, crossover, and transfer processes (e.g., NSGA-II tournament, crowding-distance, adaptive transfer) are performed as before, using the surrogate-predicted objectives (Zhang et al., 17 Dec 2025).
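The plug-and-play substitution can be sketched as a single evaluation pass; `surrogate_predict` and the mock objectives below are hypothetical stand-ins for Q-MetaSur's tokenized autoregressive prediction, not the paper's implementation:

```python
import numpy as np

def surrogate_assisted_generation(population, metadata, surrogate_predict):
    """One evaluation pass of an MTMOO backbone in which the expensive call
    to F_t is replaced by the meta-surrogate: u.Obj <- F_hat(m_t, u.Dec).
    Selection, crossover, and transfer then proceed unchanged on u.Obj."""
    for u in population:
        u["Obj"] = surrogate_predict(metadata, u["Dec"])
    return population

# Mock two-objective surrogate (hypothetical closed-form stand-in).
def mock_predict(meta, dec):
    dec = np.asarray(dec)
    return np.array([np.sum(dec ** 2), np.sum((dec - 1.0) ** 2)])

pop = [{"Dec": [0.0, 0.0]}, {"Dec": [1.0, 1.0]}]
pop = surrogate_assisted_generation(pop, {"task": 1}, mock_predict)
print(pop[0]["Obj"], pop[1]["Obj"])   # [0. 2.] [2. 0.]
```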

5. Empirical Benchmarking on CEC2019 MTMOO

Extensive experiments on CEC2019 MTMOO assess both surrogate accuracy and downstream optimization.

Surrogate Accuracy

  • Compared to state-of-the-art surrogates—single-task RBFN, Kolmogorov–Arnold Networks (KAN), and FTGP (ExTrEMO transfer GP)—Q-MetaSur achieves the lowest standardized MAE (sMAE) on 10/12 instance–objective pairs and the top $R^2$ in 9/12. On several instances (Inst2/Obj0, Inst3/Obj0, Inst5/Obj0), it attains sMAE $=0$ and $R^2=1.000$.
sMAE per instance and objective (excerpt; lower is better):

| Instance | RBFN Obj0 | RBFN Obj1 | KAN Obj0 | KAN Obj1 | FTGP Obj0 | FTGP Obj1 | Q-MetaSur Obj0 | Q-MetaSur Obj1 |
|---|---|---|---|---|---|---|---|---|
| Inst1 | 0.2064 | 0.2026 | 0.4582 | 0.4357 | 0.2077 | 0.2023 | 0.0693 | 0.0655 |
| Inst2 | 0.2615 | 0.1083 | 1.8741 | 1.0453 | 0.2510 | 0.1494 | 0.0000 | 0.1529 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |

Optimization Performance

  • Under a strict 200-function-evaluations (FE) regime, Q-MetaSur embedded within both MO-MaTDE and MTEA-DCK achieves the best mean standardized IGD score (MSS; lower is better) in 11/12 instance–backbone cases, significantly outperforming both the REAL-fitness (no surrogate) and other surrogate-assisted variants.
| Setting | Q-MetaSur MSS | REAL MSS | RBFN MSS | FTGP MSS | KAN MSS |
|---|---|---|---|---|---|
| ins1_MO-MaTDE | -0.777 ± 0.065 | -0.153 ± 0.041 (+) | -0.645 ± 0.064 (+) | 0.852 ± 0.105 (+) | 1.031 ± 0.265 (+) |
| ins6_MTEA-DCK | -1.034 ± 0.018 | -0.633 ± 0.052 (+) | 0.748 ± 0.032 (+) | -0.405 ± 0.063 (+) | 0.610 ± 0.080 (+) |

(+)(+)” denotes Q-MetaSur is significantly better (Wilcoxon, α=0.05\alpha=0.05).

Final Pareto fronts confirm that the surrogate-assisted optimization with Q-MetaSur closely aligns with the true Pareto front, whereas alternative surrogates and REAL-fitness lag behind.

6. Key Findings, Limitations, and Outlook

  • Approximation accuracy: Q-MetaSur reduces sMAE by up to 100% on nearly-deterministic objectives and decreases unexplained variance $(1-R^2)$ by 30–70% relative to single-task surrogates.
  • Optimization improvements: Under strict FE constraints and in fully offline mode, Q-MetaSur delivers statistically significant gains on virtually all task–algorithm pairs tested.
  • Generalization: Shows stable zero-shot performance on unseen tasks and moderately unseen dimensions, with sMAE $\leq 0.07$ and $R^2 > 0$.
  • Limitations: (i) Degraded stability for decision-space dimensions far exceeding the training range; (ii) surrogate quality depends on metadata completeness and reward function design; (iii) longer sequences incur additional latency.
  • Future directions: Pursue few-shot online calibration to extend applicability, explore structured numeric prediction heads and uncertainty-aware decoding (e.g., conformal intervals), and investigate lightweight adapters or mixture-of-experts to better capture localized objective landscapes.

Q-MetaSur demonstrates that, with an appropriately designed tokenization scheme, explicit loss–metric alignment via offline RL, and conservative Q-learning regularization, a single LLM-based surrogate can provide scalable, high-fidelity approximation over diverse and challenging offline MTMOO settings (Zhang et al., 17 Dec 2025).
