QF-tuner (Q-FOX): RL Hyperparameter Optimizer
- QF-tuner (Q-FOX) is an automated hyperparameter optimization framework for Q-learning that uses the FOX algorithm to balance global exploration and local exploitation.
- It employs a multi-objective fitness formulation that prioritizes agent reward while penalizing temporal-difference error and tuning duration.
- Empirical evaluations on OpenAI Gym tasks demonstrate that QF-tuner outperforms PSO, GA, and other optimizers by achieving higher rewards and faster convergence.
QF-tuner (Q-FOX) is an automated hyperparameter optimization framework for Q-learning algorithms in reinforcement learning, employing the FOX optimization algorithm (FOX) as a population-based metaheuristic. QF-tuner introduces a multi-objective fitness formulation that prioritizes agent reward while also penalizing temporal-difference error and tuning duration. It has been empirically demonstrated on OpenAI Gym control tasks to outperform alternative optimizers, such as particle swarm optimization (PSO), genetic algorithms (GA), bees algorithm (BA), and random parameter search, delivering marked improvements in learning efficiency and solution quality (Jumaah et al., 2024).
1. FOX Optimization Algorithm
The FOX optimizer is a population-based metaheuristic inspired by the foraging dynamics of red foxes. It maintains a population of candidate solutions (agents) evaluated and updated over iterations. Each iteration consists of a stochastic choice between exploitation and exploration (each with 50% probability).
- Exploitation: Each agent is updated deterministically toward the best-known solution, scaled by one of two multiplicative factors that correspond to two possible "jump" strengths.
- Exploration: Each agent takes a random walk whose step is set by a scaling constant and a random vector.
This construction enables global exploration in the early optimization phase and local exploitation near the best candidate, balancing the search for globally and locally optimal hyperparameters (Jumaah et al., 2024).
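The exploit-or-explore structure described above can be illustrated with a simplified sketch. It is not the FOX update rule itself: the function name `fox_like_step`, the two jump factors `c1` and `c2`, the `step_scale` constant, the equal-probability choice between the two jumps, and the [0, 1] bounds are all assumptions chosen for illustration.

```python
import numpy as np

def fox_like_step(positions, best, rng, c1=0.18, c2=0.82,
                  step_scale=0.1, lower=0.0, upper=1.0):
    """One illustrative FOX-style iteration over a population of agents."""
    new_positions = np.empty_like(positions)
    for i, x in enumerate(positions):
        if rng.random() < 0.5:
            # Exploitation: move toward the best-known solution using
            # one of two assumed "jump" strengths.
            jump = c1 if rng.random() < 0.5 else c2
            candidate = x + jump * (best - x)
        else:
            # Exploration: random walk around the current position.
            candidate = x + step_scale * rng.uniform(-1.0, 1.0, size=x.shape)
        # Keep every coordinate inside the assumed search bounds.
        new_positions[i] = np.clip(candidate, lower, upper)
    return new_positions

rng = np.random.default_rng(0)
population = rng.uniform(0.0, 1.0, size=(5, 3))
updated = fox_like_step(population, population[0].copy(), rng)
```

Each agent is handled independently here, so roughly half the population exploits and half explores in any given iteration.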
2. QF-tuner Fitness Formulation
QF-tuner employs a scalar fitness function designed to favor convergence stability as well as overall performance. The three targets are:
- $R$: average episodic cumulative reward over the final quarter of the optimization episodes
- $E$: mean squared temporal-difference error over the same window
- $T$: total FOX optimization time (in seconds or iterations)

The fitness value $F$ is:

$$F = \frac{2R - E}{T}$$

This formula doubles the weight of $R$ (reward), subtracts the error, and normalizes by elapsed time, thus prioritizing high reward, low error, and rapid convergence in candidate selection. Evaluation always uses the final quarter of the episodes to avoid premature convergence artifacts (Jumaah et al., 2024).
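The fitness described above (doubled reward, minus error, normalized by time) can be sketched as follows. The function name `qf_fitness`, the per-episode `td_errors` input, and the `window_fraction` parameter are hypothetical; the sketch assumes that "normalizes by elapsed time" means dividing the whole difference by the time.

```python
import numpy as np

def qf_fitness(episode_rewards, td_errors, elapsed, window_fraction=0.25):
    """Scalar fitness over the final window of episodes (illustrative)."""
    rewards = np.asarray(episode_rewards, dtype=float)
    errors = np.asarray(td_errors, dtype=float)
    # Size of the final-quarter window, at least one episode.
    window = max(1, int(len(rewards) * window_fraction))
    mean_reward = rewards[-window:].mean()
    mse_td = np.mean(errors[-window:] ** 2)
    # Reward counts twice, TD error is a penalty, time normalizes.
    return (2.0 * mean_reward - mse_td) / elapsed

# Eight episodes: the final quarter is the last two.
value = qf_fitness([0, 0, 0, 0, 0, 0, 1, 1],
                   [1, 1, 1, 1, 1, 1, 0.5, 0.5],
                   elapsed=2.0)
print(value)  # 0.875
```

With this form, a candidate that reaches the same reward in half the time receives twice the fitness, which is how tuning duration enters the ranking.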
3. Hyperparameter Search Space
QF-tuner optimizes the canonical Q-learning hyperparameters, searching within their complete valid domains:
- Learning rate $\alpha$
- Discount factor $\gamma$
- Exploration rate $\varepsilon$

These are encoded as continuous search dimensions for each FOX agent. Every candidate solution is a triple $(\alpha, \gamma, \varepsilon)$; bounds on each variable are enforced throughout the optimization (Jumaah et al., 2024).
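A minimal sketch of how such a bounded, continuous search space might be represented. The [0, 1] interval used for all three hyperparameters is an assumption made for illustration, and `BOUNDS`, `sample_population`, and `enforce_bounds` are hypothetical names.

```python
import numpy as np

# Assumed bounds, one row per hyperparameter: (lower, upper).
BOUNDS = np.array([
    [0.0, 1.0],  # learning rate
    [0.0, 1.0],  # discount factor
    [0.0, 1.0],  # exploration rate
])

def sample_population(n_agents, rng):
    """Draw n_agents random hyperparameter triples inside the bounds."""
    return rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1], size=(n_agents, 3))

def enforce_bounds(population):
    """Clip every coordinate back into its allowed interval."""
    return np.clip(population, BOUNDS[:, 0], BOUNDS[:, 1])

rng = np.random.default_rng(0)
population = sample_population(4, rng)
repaired = enforce_bounds(np.array([[-0.2, 0.5, 1.3]]))
print(repaired)  # [[0.  0.5 1. ]]
```

Clipping is only one way to enforce bounds; reflecting or resampling out-of-range coordinates would serve the same purpose.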
4. Algorithmic Workflow
The high-level QF-tuner process is as follows:
- Initialize parameters: the number of agents, the maximum number of iterations, and the per-candidate run count.
- Randomly populate the agents with hyperparameter triples $(\alpha, \gamma, \varepsilon)$.
- Repeat from the first iteration to the maximum:
  - For each agent:
    - Execute the per-candidate number of Q-learning episodes using that agent's $(\alpha, \gamma, \varepsilon)$.
    - Compute the average reward and the mean squared TD error (final quarter window), and the elapsed time.
    - Compute the fitness.
  - Select the current best agent (the one with maximal fitness).
  - Update each agent per the FOX rules (exploitation or exploration).
- After the final iteration, return the best hyperparameter set found (Jumaah et al., 2024).
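The workflow above can be sketched as a generic tuning loop. This is an outline under stated assumptions, not the authors' implementation: `tune`, `toy_evaluate`, and `toy_update` are hypothetical names, the default population size and iteration count are placeholders, the [0, 1] bounds are assumed, and the toy fitness stands in for actually running Q-learning episodes.

```python
import numpy as np

def tune(evaluate, update, n_agents=5, max_iter=10, seed=0):
    """Evaluate every agent, track the best one, then update the population."""
    rng = np.random.default_rng(seed)
    agents = rng.uniform(0.0, 1.0, size=(n_agents, 3))
    best, best_fit = agents[0].copy(), -np.inf
    for _ in range(max_iter):
        # Fitness of each (alpha, gamma, epsilon) triple.
        fits = np.array([evaluate(*agent) for agent in agents])
        i = int(np.argmax(fits))
        if fits[i] > best_fit:
            best, best_fit = agents[i].copy(), float(fits[i])
        # Population update (exploitation or exploration), then bounds.
        agents = np.clip(update(agents, best, rng), 0.0, 1.0)
    return best, best_fit

def toy_evaluate(alpha, gamma, epsilon):
    # Stand-in fitness that peaks at an arbitrary target triple.
    return -((alpha - 0.7) ** 2 + (gamma - 0.9) ** 2 + (epsilon - 0.1) ** 2)

def toy_update(agents, best, rng):
    # Stand-in update: move halfway toward the best agent, plus noise.
    return agents + 0.5 * (best - agents) + 0.05 * rng.normal(size=agents.shape)

best, best_fit = tune(toy_evaluate, toy_update)
```

The sketch keeps the best triple seen across all iterations, so a later population update cannot lose it. In a real run, `evaluate` would execute the Q-learning episodes and return the fitness.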
The Q-learning update inside each FOX evaluation uses:
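$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$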
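For reference, the standard tabular Q-learning update with an epsilon-greedy policy can be written as below. The function names and the `done` flag (which skips bootstrapping on terminal transitions) are assumptions of this sketch; the source does not describe its implementation at this level of detail.

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, done, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    # No bootstrapping from the next state once the episode has ended.
    target = reward if done else reward + gamma * np.max(Q[next_state])
    td_error = target - Q[state, action]
    Q[state, action] += alpha * td_error
    return td_error

def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

# A 16-state, 4-action table, matching the FrozenLake-v1 dimensions.
Q = np.zeros((16, 4))
td = q_learning_update(Q, state=14, action=2, reward=1.0,
                       next_state=15, done=True, alpha=0.5, gamma=0.9)
print(td, Q[14, 2])  # 1.0 0.5
```

The returned TD error is the quantity whose mean square would feed the error term of the fitness.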
5. Empirical Evaluation on Control Tasks
QF-tuner was benchmarked on two discrete-action OpenAI Gym environments:
- FrozenLake-v1: 4×4 discrete gridworld, 16 states, 4 actions; reward +1 for reaching the goal, 0 otherwise. Episode ends on goal, hole, or 100 steps.
- CartPole-v1: continuous 4-feature state vector, 2 actions; reward +1 for each time step the pole stays balanced. Episode ends when the pole angle exceeds ±12°, the cart position exceeds ±2.4, or after 200 steps.
Key protocol parameters:
- FOX optimizer: 5 agents, 6 iterations
- Each candidate: averaged over 7 Q-learning episodes
- Final evaluation on 8 test episodes with best hyperparameters (Jumaah et al., 2024)
6. Quantitative Results
QF-tuner (Q-FOX) achieved superior cumulative reward and faster convergence compared to PSO, GA, Bee, and random search methods. The following tables present the optimal hyperparameters and performance metrics found:
FrozenLake-v1 Performance
| Method | α | γ | ε | Reward | Convergence (s) |
|---|---|---|---|---|---|
| Q-FOX | 0.7422 | 0.9692 | 0.0030 | 0.9500 | 12.3 |
| PSO | 0.9999 | 0.3757 | 0.0010 | 0.8818 | 15.8 |
| GA | 0.3367 | 0.8328 | 0.1753 | 0.6182 | 20.4 |
| Bee | 0.9388 | 0.6396 | 0.9964 | 0.2545 | 28.6 |
| RND | 0.0921 | 0.6188 | 0.1332 | 0.7409 | 17.1 |
- Reward improvement over PSO: about 7.7% (0.9500 vs. 0.8818)
- Over GA: about 53.7%; Bee: about 273.3%; RND: about 28.2%
- Learning time reduced by about 22.2% vs. PSO (12.3 s vs. 15.8 s)
CartPole-v1 Performance
| Method | α | γ | ε | Reward | Convergence (s) |
|---|---|---|---|---|---|
| Q-FOX | 0.8287 | 0.9504 | 0.2590 | 32.0773 | 14.7 |
| PSO | 0.9990 | 0.6689 | 0.1096 | 29.9864 | 18.2 |
| GA | 0.8416 | 0.7069 | 0.3214 | 25.0273 | 22.9 |
| Bee | 0.9866 | 0.9940 | 0.7017 | 22.6455 | 26.4 |
| RND | 0.5582 | 0.3825 | 0.4235 | 20.4636 | 24.1 |
- Reward improvement over PSO: about 7.0% (32.0773 vs. 29.9864)
- Over GA: about 28.2%; Bee: about 41.6%; RND: about 56.8%
- Learning time reduced by about 19.2% vs. PSO (14.7 s vs. 18.2 s)
In both environments, QF-tuner consistently identified superior Q-learning parameterizations, leading to significantly faster and higher-reward policies (Jumaah et al., 2024).
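The relative gains implied by the two tables can be recomputed directly from the reported rewards and convergence times. In this sketch, the helper name `relative_gains` and the choice to express each gain as a percentage of the competing method's value are assumptions.

```python
# (reward, convergence time in seconds) copied from the tables above.
frozenlake = {"Q-FOX": (0.9500, 12.3), "PSO": (0.8818, 15.8),
              "GA": (0.6182, 20.4), "Bee": (0.2545, 28.6),
              "RND": (0.7409, 17.1)}
cartpole = {"Q-FOX": (32.0773, 14.7), "PSO": (29.9864, 18.2),
            "GA": (25.0273, 22.9), "Bee": (22.6455, 26.4),
            "RND": (20.4636, 24.1)}

def relative_gains(results, reference="Q-FOX"):
    """Percent reward gain and percent time saved versus each other method."""
    ref_reward, ref_time = results[reference]
    gains = {}
    for method, (reward, seconds) in results.items():
        if method == reference:
            continue
        reward_gain = 100.0 * (ref_reward - reward) / reward
        time_saved = 100.0 * (seconds - ref_time) / seconds
        gains[method] = (round(reward_gain, 1), round(time_saved, 1))
    return gains

print(relative_gains(frozenlake)["PSO"])  # (7.7, 22.2)
print(relative_gains(cartpole)["PSO"])    # (7.0, 19.2)
```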
7. Reproducibility and Implementation
To replicate QF-tuner results:
- Implement the FOX optimizer update rules as specified in Section 1
- Wrap FOX around repeated stochastic Q-learning episodes, extracting the fitness metric in Section 2
- Enforce the parameter bounds on $\alpha$, $\gamma$, and $\varepsilon$ throughout
- Follow the outlined pseudocode and evaluation strategy, using standard OpenAI Gym environments for validation
QF-tuner (Q-FOX) thus provides a robust, computationally efficient framework for hyperparameter selection in discrete-action reinforcement learning, offering both practical performance gains and direct procedural reproducibility (Jumaah et al., 2024).