QF-tuner (Q-FOX): RL Hyperparameter Optimizer

Updated 26 February 2026

QF-tuner (Q-FOX) is an automated hyperparameter optimization framework for Q-learning that uses the FOX algorithm to balance global exploration and local exploitation.
It employs a multi-objective fitness formulation that prioritizes agent reward while penalizing temporal-difference error and tuning duration.
Empirical evaluations on OpenAI Gym tasks demonstrate that QF-tuner outperforms PSO, GA, and other optimizers by achieving higher rewards and faster convergence.

QF-tuner (Q-FOX) is an automated hyperparameter optimization framework for Q-learning in reinforcement learning. It uses the FOX optimization algorithm, a population-based metaheuristic, as its search engine, and introduces a multi-objective fitness formulation that prioritizes agent reward while penalizing temporal-difference error and tuning duration. On OpenAI Gym control tasks it has been reported to outperform alternative optimizers, namely particle swarm optimization (PSO), genetic algorithms (GA), the bees algorithm (Bee), and random parameter search (RND), in both final reward and convergence time (Jumaah et al., 2024).

1. FOX Optimization Algorithm

The FOX optimizer is a population-based metaheuristic inspired by the foraging dynamics of red foxes. It maintains a population of $G$ candidate solutions (agents) $X_i \in \mathbb{R}^d$ evaluated and updated over $T$ iterations. Each iteration consists of a stochastic choice between exploitation and exploration (each with 50% probability).

Exploitation: Agents are deterministically updated toward the best-known solution with multiplicative factors reflecting two possible “jump” strengths:
- For agent $i$ at iteration $t$, let $T_i$ be the agent's time variable (not the iteration budget $T$) and define
$S_i = \frac{\mathrm{BestX}_i}{T_i},\quad \mathrm{DSF}_i = S_i\;T_i,\quad \mathrm{DFP}_i = \tfrac12\,\mathrm{DSF}_i,$

$\mathrm{Jump}_i = 0.5\,g\,t^2,\quad g = 9.81,$

then update as either

$X_i(t+1) = \mathrm{DFP}_i\,\mathrm{Jump}_i\,c_1 \quad \text{or} \quad X_i(t+1) = \mathrm{DFP}_i\,\mathrm{Jump}_i\,c_2,$

with $c_1 = 0.180$ and $c_2 = 0.820$.

Exploration: Agents undergo a random walk:

$X_i(t+1) = \mathrm{BestX}(t)\,\times\,\text{rand}(1,d)\,\times\,\min(\text{tt})\,\times\,a,$

where $\min(\text{tt})$ is the minimum of the time variable $\text{tt}$, $a$ is a scaling constant, and $\text{rand}(1,d)$ yields a random vector.

This construction enables global exploration in the early optimization phase and local exploitation near the best candidate, balancing the search for globally and locally optimal hyperparameters (Jumaah et al., 2024).
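The two update rules can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes that $T_i$ is a random per-dimension time draw, that the choice between $c_1$ and $c_2$ is a fair coin flip, and that `t`, `a`, and `min_tt` are supplied by the caller. The function names `exploit`, `explore`, and `fox_step` are hypothetical.

```python
import random

G_ACCEL = 9.81            # g in the jump formula
C1, C2 = 0.180, 0.820     # the two jump-strength factors

def exploit(best_x, t, rng):
    """Exploitation: scale the half-distance to the best position by the jump term."""
    times = [rng.uniform(0.1, 1.0) for _ in best_x]    # assumed T_i draws
    speed = [b / ti for b, ti in zip(best_x, times)]   # S_i = BestX_i / T_i
    dsf = [s * ti for s, ti in zip(speed, times)]      # DSF_i = S_i * T_i
    dfp = [0.5 * v for v in dsf]                       # DFP_i = DSF_i / 2
    jump = 0.5 * G_ACCEL * t ** 2                      # Jump_i = 0.5 * g * t^2
    c = C1 if rng.random() < 0.5 else C2               # assumed coin flip
    return [v * jump * c for v in dfp]

def explore(best_x, a, min_tt, rng):
    """Exploration: random rescaling of the best position."""
    return [b * rng.random() * min_tt * a for b in best_x]

def fox_step(best_x, t, a, min_tt, rng):
    """One update: exploitation or exploration, each with probability 0.5."""
    if rng.random() < 0.5:
        return exploit(best_x, t, rng)
    return explore(best_x, a, min_tt, rng)
```

Calling `fox_step([0.7, 0.9, 0.1], t=0.5, a=1.0, min_tt=0.5, rng=random.Random(0))` returns a new three-element candidate. The result is not clipped to the search bounds here; that step is left to the caller.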

2. QF-tuner Fitness Formulation

QF-tuner employs a scalar fitness function designed to favor convergence stability as well as overall performance. The three targets are:

$\text{reward}$: average episodic cumulative reward over the final quarter window of optimization episodes
$\text{error}$: mean squared temporal-difference error over the same window
$\text{time}$: total FOX optimization time (in seconds or iterations)

The fitness value is:

$\text{fitness} = \frac{2\,\text{reward} - \text{error}}{\text{time}}$

This formula doubles the weight of the reward, subtracts the error, and normalizes by elapsed time, thus prioritizing high reward, low error, and rapid convergence in candidate selection. Evaluation is always over the final quarter of episodes to avoid premature convergence artifacts (Jumaah et al., 2024).
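A small sketch of how such a score might be computed from logged training data. It assumes the reading "twice the reward, minus the error, divided by the elapsed time", takes one TD-error value per episode, and uses the last 25% of episodes as the window. The helper names `tail_mean` and `fitness` are hypothetical.

```python
def tail_mean(values, fraction=0.25):
    """Mean over the final `fraction` of a sequence (at least one element)."""
    n = max(1, int(len(values) * fraction))
    tail = values[-n:]
    return sum(tail) / len(tail)

def fitness(episode_rewards, td_errors, elapsed):
    """Scalar score: (2 * reward - error) / time over the final-quarter window."""
    reward = tail_mean(episode_rewards)
    error = tail_mean([e * e for e in td_errors])   # mean squared TD error
    return (2.0 * reward - error) / elapsed

# Eight episodes: the window covers the last two.
score = fitness([0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 0.5, 0.5], elapsed=2.0)
```

In this toy call the window reward is 1.0 and the window error is 0.25, so the score is (2.0 - 0.25) / 2.0 = 0.875. Because time is the divisor, a candidate that reaches the same reward in half the time scores twice as high.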

3. Hyperparameter Search Space

QF-tuner optimizes the canonical Q-learning hyperparameters, searching within their complete valid domains:

Learning rate $\alpha$
Discount factor $\gamma$
Exploration rate $\epsilon$

These are continuously encoded as search dimensions for each FOX agent. All candidate solutions are direct triples $(\alpha, \gamma, \epsilon)$; bounds on each variable are enforced throughout the optimization (Jumaah et al., 2024).
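Bound enforcement can be illustrated with a short helper. The unit interval used for each variable below is an assumption made for the sketch, not a value taken from the text, and the names `BOUNDS`, `sample_candidate`, and `clip_candidate` are hypothetical.

```python
import random

# Assumed bounds for (alpha, gamma, epsilon); adjust to the domains actually used.
BOUNDS = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]

def sample_candidate(rng):
    """Draw one hyperparameter triple uniformly inside the bounds."""
    return [rng.uniform(lo, hi) for lo, hi in BOUNDS]

def clip_candidate(x):
    """Project a candidate back into the bounds after an optimizer update."""
    return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, BOUNDS)]

clipped = clip_candidate([1.7, -0.2, 0.3])   # out-of-range values are clamped
```

Clipping is only one way to keep candidates feasible; resampling or reflecting at the boundary are common alternatives, and the text does not say which one is used.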

4. Algorithmic Workflow

The high-level QF-tuner process is as follows:

Initialize parameters: number of agents $G$, maximum iterations $T$, and the per-candidate run count.
Randomly populate the $G$ agents with hyperparameter triples $(\alpha, \gamma, \epsilon)$.
Repeat for $t = 1$ to $T$:
- For each agent $i = 1, \dots, G$:
  - Execute the configured number of Q-learning episodes using $(\alpha_i, \gamma_i, \epsilon_i)$.
  - Compute the reward and error (final quarter window), and the elapsed time.
  - Compute the fitness of agent $i$.
- Select the current $\mathrm{BestX}$ (agent with maximal fitness).
- Update each agent $X_i$ per the FOX rules (exploitation or exploration).
After $T$ iterations, return the best hyperparameter set found (Jumaah et al., 2024).
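The outer loop can be sketched independently of the optimizer and the learner by passing both in as callables. In the sketch below, `tune` is a hypothetical name, candidates are assumed to live in the unit cube, and the best candidate is tracked across all iterations rather than per iteration. The two toy functions are stand-ins for a Q-learning evaluation and a FOX update, used only to make the loop runnable.

```python
import random

def tune(evaluate, propose, n_agents, n_iters, rng):
    """Evaluate every agent, remember the best so far, then move all agents."""
    agents = [[rng.random() for _ in range(3)] for _ in range(n_agents)]
    best_x, best_f = None, float("-inf")
    for _ in range(n_iters):
        for x in agents:
            f = evaluate(x)                       # fitness of this triple
            if f > best_f:
                best_x, best_f = list(x), f
        # Move each agent relative to the best candidate, then clip to [0, 1].
        agents = [[min(max(v, 0.0), 1.0) for v in propose(best_x, rng)]
                  for _ in agents]
    return best_x, best_f

def toy_fitness(x):
    """Stand-in for a Q-learning run: peak at (0.7, 0.7, 0.7)."""
    return -sum((v - 0.7) ** 2 for v in x)

def jitter(best, rng):
    """Stand-in for the FOX update: Gaussian perturbation of the best point."""
    return [v + rng.gauss(0.0, 0.1) for v in best]

best, score = tune(toy_fitness, jitter, n_agents=10, n_iters=30, rng=random.Random(1))
```

Swapping `toy_fitness` for a function that trains a Q-learning agent and `jitter` for a FOX update turns this skeleton into the workflow described above.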

The Q-learning update inside each FOX evaluation uses:

$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
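A minimal sketch of the standard tabular Q-learning update with epsilon-greedy action selection, showing where $\alpha$, $\gamma$, and $\epsilon$ enter. The Q-table is assumed to be a list of per-state lists, and the `done` flag (no bootstrapping from terminal states) is a common convention rather than something stated in the text. The function names are hypothetical.

```python
import random

def epsilon_greedy(q, s, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q[s]))
    return max(range(len(q[s])), key=lambda a: q[s][a])

def q_update(q, s, a, r, s_next, alpha, gamma, done=False):
    """Apply one Q-learning update in place and return the TD error."""
    target = r if done else r + gamma * max(q[s_next])
    td_error = target - q[s][a]
    q[s][a] += alpha * td_error
    return td_error

q_table = [[0.0, 0.0], [1.0, 2.0]]
td = q_update(q_table, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

In this call the target is 1.0 + 0.9 * 2.0 = 2.8, so the TD error is 2.8 and `q_table[0][1]` moves halfway to the target, ending at 1.4. The returned TD error is the quantity whose squared mean feeds the error term of the fitness.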

5. Empirical Evaluation on Control Tasks

QF-tuner was benchmarked on two discrete-action OpenAI Gym environments:

FrozenLake-v1: 4×4 discrete gridworld, 16 states, 4 actions; reward +1 for reaching the goal, 0 otherwise. Episode ends on goal, hole, or 100 steps.
CartPole-v1: continuous 4-feature state vector, 2 actions; reward +1 per time step balanced. Episode ends when the pole angle exceeds $\pm 12^\circ$, the cart position exceeds $\pm 2.4$, or after 200 steps.

Key protocol parameters:

FOX optimizer: $G$ agents, $T$ iterations
Each candidate: averaged over a fixed number of Q-learning episodes
Final evaluation on test episodes with the best hyperparameters (Jumaah et al., 2024)

6. Quantitative Results

QF-tuner (Q-FOX) achieved superior cumulative reward and faster convergence compared to PSO, GA, Bee, and random search methods. The following tables present the optimal hyperparameters and performance metrics found:

FrozenLake-v1 Performance

Method	α	γ	ε	Reward	Convergence (s)
Q-FOX	0.7422	0.9692	0.0030	0.9500	12.3
PSO	0.9999	0.3757	0.0010	0.8818	15.8
GA	0.3367	0.8328	0.1753	0.6182	20.4
Bee	0.9388	0.6396	0.9964	0.2545	28.6
RND	0.0921	0.6188	0.1332	0.7409	17.1

Relative reward improvement over PSO, computed from the table: about 7.7%
Over GA: about 53.7%; Bee: about 273.3%; RND: about 28.2%
Learning time reduced by about 22.2% vs. PSO (12.3 s against 15.8 s)

CartPole-v1 Performance

Method	α	γ	ε	Reward	Convergence (s)
Q-FOX	0.8287	0.9504	0.2590	32.0773	14.7
PSO	0.9990	0.6689	0.1096	29.9864	18.2
GA	0.8416	0.7069	0.3214	25.0273	22.9
Bee	0.9866	0.9940	0.7017	22.6455	26.4
RND	0.5582	0.3825	0.4235	20.4636	24.1

Relative reward improvement over PSO, computed from the table: about 7.0%
Over GA: about 28.2%; Bee: about 41.6%; RND: about 56.8%
Learning time reduced by about 19.2% vs. PSO (14.7 s against 18.2 s)
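Relative figures of this kind can be recomputed directly from the two tables. The sketch below copies the reward and convergence-time columns and reports, for each baseline, the percentage reward gain of Q-FOX and the percentage time saved. The helper names are hypothetical, and "relative to the baseline" is the assumed definition of improvement.

```python
# (reward, convergence time in seconds), copied from the two tables above
FROZENLAKE = {"Q-FOX": (0.9500, 12.3), "PSO": (0.8818, 15.8), "GA": (0.6182, 20.4),
              "Bee": (0.2545, 28.6), "RND": (0.7409, 17.1)}
CARTPOLE = {"Q-FOX": (32.0773, 14.7), "PSO": (29.9864, 18.2), "GA": (25.0273, 22.9),
            "Bee": (22.6455, 26.4), "RND": (20.4636, 24.1)}

def percent_change(new, old):
    """Change from old to new, as a percentage of old."""
    return 100.0 * (new - old) / old

def compare(table, ours="Q-FOX"):
    """Map each baseline to (reward gain %, time saved %) relative to that baseline."""
    our_reward, our_time = table[ours]
    report = {}
    for name, (reward, seconds) in table.items():
        if name == ours:
            continue
        gain = percent_change(our_reward, reward)
        saved = -percent_change(our_time, seconds)
        report[name] = (round(gain, 1), round(saved, 1))
    return report

for env_name, table in (("FrozenLake-v1", FROZENLAKE), ("CartPole-v1", CARTPOLE)):
    print(env_name, compare(table))
```

Percentages against a small baseline can look dramatic: the large gain over Bee on FrozenLake-v1 reflects Bee's low reward of 0.2545 as much as the absolute gap.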

In both environments, QF-tuner identified Q-learning parameterizations that yielded higher reward and shorter convergence time than every baseline in the tables above (Jumaah et al., 2024).

7. Reproducibility and Implementation

To replicate QF-tuner results:

Implement the FOX optimizer update rules as specified in Section 1
Wrap FOX around repeated stochastic Q-learning episodes, extracting the fitness metric in Section 2
Enforce the hyperparameter bounds from Section 3 throughout
Follow the outlined pseudocode and evaluation strategy, using standard OpenAI Gym environments for validation
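As a starting point for the second step, the sketch below trains tabular Q-learning once for a given triple and returns the resulting score. To stay dependency-free it uses `Corridor`, a hypothetical six-cell toy task standing in for a Gym environment. The fitness is again assumed to be twice the reward minus the error, divided by wall-clock time, with the per-episode mean squared TD error as the error signal.

```python
import random
import time

class Corridor:
    """Hypothetical toy task: walk right from cell 0 to the last cell for reward +1."""
    n_states, n_actions = 6, 2            # actions: 0 = left, 1 = right

    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos

    def step(self, action):
        self.pos = max(0, self.pos - 1) if action == 0 else self.pos + 1
        self.steps += 1
        at_goal = self.pos == self.n_states - 1
        return self.pos, (1.0 if at_goal else 0.0), at_goal or self.steps >= 50

def evaluate_candidate(alpha, gamma, epsilon, episodes=200, seed=0):
    """Train tabular Q-learning once and return (fitness, reward, error)."""
    rng, env = random.Random(seed), Corridor()
    q = [[0.0] * env.n_actions for _ in range(env.n_states)]
    rewards, errors = [], []
    start = time.perf_counter()
    for _ in range(episodes):
        s, done, total, sq = env.reset(), False, 0.0, []
        while not done:
            if rng.random() < epsilon:
                a = rng.randrange(env.n_actions)           # explore
            else:
                top = max(q[s])                            # exploit, random tie-break
                a = rng.choice([k for k, v in enumerate(q[s]) if v == top])
            s_next, r, done = env.step(a)
            td = r + gamma * max(q[s_next]) - q[s][a]      # goal row stays at zero
            q[s][a] += alpha * td
            sq.append(td * td)
            s, total = s_next, total + r
        rewards.append(total)
        errors.append(sum(sq) / len(sq))
    elapsed = max(time.perf_counter() - start, 1e-9)       # guard against zero time
    k = max(1, episodes // 4)                              # final quarter window
    reward, error = sum(rewards[-k:]) / k, sum(errors[-k:]) / k
    return (2.0 * reward - error) / elapsed, reward, error
```

Fixing `seed` makes the reward and error terms repeatable, but the time term depends on the machine, so fitness values are only comparable within a single tuning run on the same hardware.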

QF-tuner (Q-FOX) thus provides a robust, computationally efficient framework for hyperparameter selection in discrete-action reinforcement learning, offering both practical performance gains and direct procedural reproducibility (Jumaah et al., 2024).

References

Jumaah et al. (2024). QF-tuner: Breaking Tradition in Reinforcement Learning.
