
QF-tuner (Q-FOX): RL Hyperparameter Optimizer

Updated 26 February 2026
  • QF-tuner (Q-FOX) is an automated hyperparameter optimization framework for Q-learning that uses the FOX algorithm to balance global exploration and local exploitation.
  • It employs a multi-objective fitness formulation that prioritizes agent rewards while penalizing temporal-difference errors and tuning duration for enhanced performance.
  • Empirical evaluations on OpenAI Gym tasks demonstrate that QF-tuner outperforms PSO, GA, and other optimizers by achieving higher rewards and faster convergence.

QF-tuner (Q-FOX) is an automated hyperparameter optimization framework for Q-learning algorithms in reinforcement learning, employing the FOX optimization algorithm (FOX) as a population-based metaheuristic. QF-tuner introduces a multi-objective fitness formulation that prioritizes agent reward while also penalizing temporal-difference error and tuning duration. It has been empirically demonstrated on OpenAI Gym control tasks to outperform alternative optimizers, such as particle swarm optimization (PSO), genetic algorithms (GA), bees algorithm (BA), and random parameter search, delivering marked improvements in learning efficiency and solution quality (Jumaah et al., 2024).

1. FOX Optimization Algorithm

The FOX optimizer is a population-based metaheuristic inspired by the foraging dynamics of red foxes. It maintains a population of $G$ candidate solutions (agents) $X_i \in \mathbb{R}^d$, evaluated and updated over $T$ iterations. Each iteration makes a stochastic choice between exploitation and exploration (each with 50% probability).

  • Exploitation: Agents are deterministically updated toward the best-known solution with multiplicative factors reflecting two possible “jump” strengths:
    • For agent $i$ at iteration $t$, let

    $$S_i = \frac{\mathrm{BestX}_i}{T_i},\quad \mathrm{DSF}_i = S_i\,T_i,\quad \mathrm{DFP}_i = \tfrac12\,\mathrm{DSF}_i,$$

    $$\mathrm{Jump}_i = 0.5\,g\,t^2,\quad g = 9.81,$$

    then update as either

    $$X_i(t+1) = \mathrm{DFP}_i\,\mathrm{Jump}_i\,c_1 \quad \text{or} \quad X_i(t+1) = \mathrm{DFP}_i\,\mathrm{Jump}_i\,c_2,$$

    with $c_1 = 0.180$ and $c_2 = 0.820$.

  • Exploration: Agents undergo a random walk:

    $$X_i(t+1) = \mathrm{BestX}(t)\times\text{rand}(1,d)\times\min(\text{tt})\times a,$$

    where $\min(\text{tt})$ is the minimum over the time values $\text{tt}$, $a$ is a scaling constant, and $\text{rand}(1,d)$ yields a random vector.

This construction enables global exploration in the early optimization phase and local exploitation near the best candidate, balancing the search for globally and locally optimal hyperparameters (Jumaah et al., 2024).
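The two update rules above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the equal-probability choice between $c_1$ and $c_2$, the way `t_i`, `min_tt`, and `a` are supplied by the caller, and the clipping to [0, 1] are hypothetical choices that the text above does not specify.

```python
import numpy as np

G_ACCEL = 9.81          # g in the Jump formula
C1, C2 = 0.180, 0.820   # the two jump-strength factors


def fox_exploit(best_x, t_i, t, c):
    """Exploitation update: X_i(t+1) = DFP_i * Jump_i * c."""
    s = best_x / t_i                 # S_i = BestX_i / T_i
    dsf = s * t_i                    # DSF_i = S_i * T_i
    dfp = 0.5 * dsf                  # DFP_i = DSF_i / 2
    jump = 0.5 * G_ACCEL * t ** 2    # Jump_i = 0.5 * g * t^2
    return dfp * jump * c


def fox_explore(best_x, min_tt, a, rng):
    """Exploration update: BestX * rand(1, d) * min(tt) * a."""
    return best_x * rng.random(best_x.shape) * min_tt * a


def fox_update(best_x, t_i, t, min_tt, a, rng, low=0.0, high=1.0):
    """One agent update: 50/50 choice between the two modes."""
    if rng.random() < 0.5:
        # Assumption: c1 and c2 are picked with equal probability.
        c = C1 if rng.random() < 0.5 else C2
        new_x = fox_exploit(best_x, t_i, t, c)
    else:
        new_x = fox_explore(best_x, min_tt, a, rng)
    # Assumption: bounds are enforced by clipping to [low, high].
    return np.clip(new_x, low, high)
```

Because both rules are written in terms of the best-known solution, `fox_update` takes `best_x` rather than the agent's current position.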

2. QF-tuner Fitness Formulation

QF-tuner employs a scalar fitness function designed to favor convergence stability as well as overall performance. The three targets are:

  • Reward: average episodic cumulative reward over the final quarter window of optimization episodes

  • Error: mean squared temporal-difference error over the same window

  • Time: total FOX optimization time (in seconds or iterations)

The fitness value combines them as:

$$\text{fitness} = \frac{2\,\text{Reward} - \text{Error}}{\text{Time}}$$

This formula doubles the weight of the reward, subtracts the error, and normalizes by elapsed time, thus prioritizing high reward, low error, and rapid convergence in candidate selection. Evaluation is always over the final quarter of episodes to avoid premature convergence artifacts (Jumaah et al., 2024).
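Read literally, the description above (double the reward, subtract the error, divide by elapsed time, all over the final quarter of episodes) could be computed as in the following sketch. The function name, the argument layout, and the assumption that one temporal-difference error value is logged per episode are illustrative choices, not details taken from the paper.

```python
import numpy as np


def qf_fitness(episode_rewards, td_errors, elapsed):
    """Scalar fitness: (2 * reward - error) / time.

    episode_rewards: cumulative reward of each episode, in order.
    td_errors: one TD error per episode (assumed logging scheme).
    elapsed: tuning time, in seconds or iterations.
    """
    n = max(1, len(episode_rewards) // 4)  # size of the final-quarter window
    reward = float(np.mean(episode_rewards[-n:]))
    error = float(np.mean(np.square(td_errors[-n:])))  # mean squared TD error
    return (2.0 * reward - error) / elapsed
```

With eight episodes, only the last two enter the score, so a candidate is judged on where it ends up rather than on its early exploration.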

3. Hyperparameter Search Space

QF-tuner optimizes the canonical Q-learning hyperparameters, searching within their complete valid domains:

  • Learning rate $\alpha$

  • Discount factor $\gamma$

  • Exploration rate $\varepsilon$

These are continuously encoded as search dimensions for each FOX agent. All candidate solutions are direct triples $(\alpha, \gamma, \varepsilon)$; bounds on each variable are enforced throughout the optimization (Jumaah et al., 2024).
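One way to picture the encoding is a three-element vector per agent that is decoded into named hyperparameters before each Q-learning run. The sketch below assumes bounds of [0, 1] on every dimension and enforces them by clipping; both the interval and the names `QParams` and `decode` are hypothetical.

```python
from typing import NamedTuple

import numpy as np


class QParams(NamedTuple):
    alpha: float    # learning rate
    gamma: float    # discount factor
    epsilon: float  # exploration rate


def decode(x, low=0.0, high=1.0):
    """Map an agent's position vector to a bounded hyperparameter triple."""
    a, g, e = np.clip(np.asarray(x, dtype=float), low, high)
    return QParams(float(a), float(g), float(e))
```

A position that drifts outside the box, such as `[1.2, 0.5, -0.1]`, is pulled back to the nearest valid triple before it is evaluated.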

4. Algorithmic Workflow

The high-level QF-tuner process is as follows:

  1. Initialize parameters: number of agents $G$, maximum iterations $T$, and the per-candidate run count.

  2. Randomly populate the $G$ agents with hyperparameter triples $(\alpha, \gamma, \varepsilon)$.

  3. Repeat for $t = 1$ to $T$:

    • For each agent $i$:
      • Execute the per-candidate Q-learning episodes using $(\alpha_i, \gamma_i, \varepsilon_i)$.
      • Compute the reward, the error (final quarter window), and the elapsed time.
      • Compute the fitness of agent $i$.
    • Select the current $\mathrm{BestX}$ (agent with maximal fitness).
    • Update each agent $X_i$ per the FOX rules (exploitation or exploration).
  4. After $T$ iterations, return the best hyperparameter set found (Jumaah et al., 2024).
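The steps above can be sketched as a short outer loop. To keep the example self-contained, the Q-learning evaluation and the FOX update are passed in as callables; `qf_tune`, `evaluate`, and `update` are hypothetical names, the [0, 1] initialization range is an assumption, and keeping the best triple seen across all iterations is one reading of "best hyperparameter set found".

```python
import numpy as np


def qf_tune(evaluate, update, n_agents, n_iters, rng):
    """Outer tuning loop.

    evaluate(x) -> fitness of a triple x = (alpha, gamma, epsilon).
    update(best_x, t, rng) -> new position for one agent.
    """
    # Step 2: random initial triples (assumed to lie in [0, 1]).
    pop = rng.random((n_agents, 3))
    best_x, best_f = None, -np.inf
    # Step 3: iterate t = 1 .. T.
    for t in range(1, n_iters + 1):
        fits = np.array([evaluate(x) for x in pop])
        i = int(np.argmax(fits))
        if fits[i] > best_f:  # keep the best triple seen so far
            best_x, best_f = pop[i].copy(), float(fits[i])
        # Move every agent according to the supplied update rule.
        pop = np.array([update(best_x, t, rng) for _ in range(n_agents)])
    # Step 4: return the best hyperparameter set found.
    return best_x, best_f
```

In a real run, `evaluate` would execute the Q-learning episodes and return the fitness, and `update` would apply the exploitation or exploration rule.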

The Q-learning update inside each FOX evaluation uses:

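$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right]$$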
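For a tabular agent, the standard Q-learning update and an $\varepsilon$-greedy action choice look roughly as follows. This is a generic sketch rather than the paper's code: the function names are hypothetical, and dropping the bootstrap term on terminal transitions is a common convention assumed here.

```python
import numpy as np


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(q.shape[1]))
    return int(np.argmax(q[state]))


def q_update(q, s, a, r, s_next, done, alpha, gamma):
    """Apply one Q-learning update in place and return the TD error."""
    # Assumption: no bootstrapping from a terminal next state.
    target = r if done else r + gamma * np.max(q[s_next])
    td_error = target - q[s, a]
    q[s, a] += alpha * td_error
    return td_error
```

Returning the temporal-difference error from `q_update` makes it easy to log the error term that the fitness function penalizes.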

5. Empirical Evaluation on Control Tasks

QF-tuner was benchmarked on two discrete-action OpenAI Gym environments:

  • FrozenLake-v1: 4×4 discrete gridworld, 16 states, 4 actions; reward +1 for reaching the goal, 0 otherwise. Episode ends on goal, hole, or 100 steps.
  • CartPole-v1: continuous 4-feature state vector, 2 actions; reward +1 per time step balanced. Episode ends when the pole angle or the cart position leaves its allowed range, or after 200 steps.

Key protocol parameters:

  • FOX optimizer: $G$ agents, $T$ iterations
  • Each candidate: averaged over a fixed number of Q-learning episodes
  • Final evaluation on test episodes with the best hyperparameters (Jumaah et al., 2024)

6. Quantitative Results

QF-tuner (Q-FOX) achieved superior cumulative reward and faster convergence compared to PSO, GA, Bee, and random search methods. The following tables present the optimal hyperparameters and performance metrics found:

FrozenLake-v1 Performance

Method   α        γ        ε        Reward    Convergence (s)
Q-FOX    0.7422   0.9692   0.0030   0.9500    12.3
PSO      0.9999   0.3757   0.0010   0.8818    15.8
GA       0.3367   0.8328   0.1753   0.6182    20.4
Bee      0.9388   0.6396   0.9964   0.2545    28.6
RND      0.0921   0.6188   0.1332   0.7409    17.1
  • Reward versus PSO: 0.9500 against 0.8818
  • Reward versus GA: 0.6182; Bee: 0.2545; RND: 0.7409
  • Convergence time versus PSO: 12.3 s against 15.8 s

CartPole-v1 Performance

Method   α        γ        ε        Reward     Convergence (s)
Q-FOX    0.8287   0.9504   0.2590   32.0773    14.7
PSO      0.9990   0.6689   0.1096   29.9864    18.2
GA       0.8416   0.7069   0.3214   25.0273    22.9
Bee      0.9866   0.9940   0.7017   22.6455    26.4
RND      0.5582   0.3825   0.4235   20.4636    24.1
  • Reward versus PSO: 32.0773 against 29.9864
  • Reward versus GA: 25.0273; Bee: 22.6455; RND: 20.4636
  • Convergence time versus PSO: 14.7 s against 18.2 s

In both environments, QF-tuner consistently identified superior Q-learning parameterizations, leading to significantly faster and higher-reward policies (Jumaah et al., 2024).
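The gaps in the two tables can be expressed as relative differences. The sketch below uses one common definition, the difference divided by the baseline value; this definition is an assumption, and the percentages reported in the paper may be computed differently.

```python
def relative_gain(ours, baseline):
    """Fractional reward improvement of `ours` over `baseline`."""
    return (ours - baseline) / baseline


def relative_reduction(ours, baseline):
    """Fractional time saved by `ours` relative to `baseline`."""
    return (baseline - ours) / baseline


# Reward and convergence time (s) copied from the tables above.
results = {
    "FrozenLake-v1": {"Q-FOX": (0.9500, 12.3), "PSO": (0.8818, 15.8)},
    "CartPole-v1": {"Q-FOX": (32.0773, 14.7), "PSO": (29.9864, 18.2)},
}

for env, rows in results.items():
    (r_fox, t_fox), (r_pso, t_pso) = rows["Q-FOX"], rows["PSO"]
    gain = relative_gain(r_fox, r_pso)
    saved = relative_reduction(t_fox, t_pso)
    print(f"{env}: reward {gain:+.1%} vs. PSO, time {saved:.1%} lower")
```

The same two helpers apply unchanged to the GA, Bee, and RND rows.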

7. Reproducibility and Implementation

To replicate QF-tuner results:

  • Implement the FOX optimizer update rules as specified in Section 1
  • Wrap FOX around repeated stochastic Q-learning episodes, extracting the fitness metric in Section 2
  • Enforce the bounds on $\alpha$, $\gamma$, and $\varepsilon$ throughout
  • Follow the outlined pseudocode and evaluation strategy, using standard OpenAI Gym environments for validation

QF-tuner (Q-FOX) thus provides a robust, computationally efficient framework for hyperparameter selection in discrete-action reinforcement learning, offering both practical performance gains and direct procedural reproducibility (Jumaah et al., 2024).
