Rethinking How to Act: Action-Space Engineering for Reinforcement Learning-Based Circuit Routing in Distributed Quantum Systems

Published 4 May 2026 in quant-ph | (2605.02389v1)

Abstract: As it becomes increasingly difficult to monolithically scale a quantum processor, distributed quantum computing (DQC) offers an alternative by distributing qubits across multiple smaller interconnected quantum processor modules. In such an architecture, the challenge of quantum circuit compilation shifts from placing and routing qubits within one module to placing, routing and using the qubits efficiently across modules. In order to optimize circuit execution time, the right state-dependent networking decisions must be found, such as when and where to generate shared remote quantum states to support remote operations. Reinforcement learning (RL) provides a natural framework for this problem, generating a compilation policy that can generalize across different circuits. Building on the framework of Promponas et al. (2024), we introduce an agent that combines a novel action-space formulation with effective action-masking strategies. A comprehensive numerical comparison of the two approaches under different coupling constraints shows that our agent achieves improved training and inference performance with a relative reduction in the modeled execution time of up to 35%.


Authors (3)

Summary

The paper introduces a revised RL agent that restructures the action space: routing actions are defined between qubit pairs and executed along shortest paths precomputed with Dijkstra's algorithm.
It reports a reduction in modeled circuit execution time of up to 38% during training (35% at inference on unseen circuits) and improved scalability across varying circuit depths on distributed quantum architectures.
The agent achieves computational efficiency gains by reducing wall-clock training time by 64% through structured Q-value approximation and enhanced action masking.

Action-Space Engineering for RL-Based Circuit Routing in Distributed Quantum Systems

Introduction and Motivation

The scalability challenges of monolithic quantum architectures have led the community to modular and distributed quantum computing (DQC), where multiple smaller quantum processor modules are interconnected via quantum channels. One core problem in this landscape is quantum circuit compilation: mapping and routing quantum circuits across these modules so as to minimize execution time, subject to local and non-local connectivity constraints and to the stochastic, resource-intensive nature of remote entanglement generation (Figure 1).

Figure 1: An entangled state is generated between two remote qubits used as communication qubits connected via a quantum channel and heralded by a classical message sent over a classical channel. Within each module, local two-qubit gates can be applied between all qubits according to the available connectivity (physical coupling edges).

Standard approaches for distributed circuit compilation rely primarily on static partitioning, graph partitioning heuristics, or network-aware flow and scheduling methods that largely abstract away the underlying network and resource dynamics. Recent advances incorporate explicit modeling of entanglement generation and decoherence [main_distributed_2025], but they still pay the optimization cost anew for every circuit and still rely on heuristics.

Reinforcement learning (RL), framed as a Markov decision process (MDP), can amortize experience across circuits and, with sufficient generalization, offer fast inference for new circuits after a potentially expensive training phase. Early RL approaches for DQC use large action spaces tied to individual physical operations and only limited action masking, which results in suboptimal training and inference efficiency [promponas_compiler_2024].

Novel Contributions and Methodology

This paper introduces a revised RL agent for circuit compilation in DQC featuring:

Action-space restructuring: Instead of associating actions solely with physical operations (SWAP, tele-qubit, entanglement generation) at specific edges, actions are generalized to pairwise qubit routing operations, so that a single RL action can specify a chain of swaps or teleportations along a precomputed path between any qubit pair. The paths are derived from the underlying hardware coupling graph using Dijkstra's shortest-path algorithm.
Enhanced action masking: The admissible action set at each timestep is restricted to only those actions that are immediately relevant for progress with respect to the current "frontier" layer of the circuit DAG or those that prepare for future remote interactions. Actions are masked unless they (1) decrease the spatial separation between qubits required for imminent two-qubit gates, (2) move unassigned qubits towards a channel endpoint to prepare resource states, or (3) synchronize EPR qubits with required frontier operations.
Efficient Q-value representation: To mitigate the quadratic growth in Q-function outputs (for all qubit pairs), a structured Q-value approximation is used. Each action Q-value is parameterized in terms of source and target qubits' Q-values with a directionality-preserving convex combination. The Q-network thus scales linearly rather than quadratically with the number of physical qubits, yielding significant computational gains in large systems.
Modified reward structure: Additional rewards penalize unnecessary system idling, and the reward for routing actions is zero unless they actually reduce the total frontier qubit distance in the hardware graph. This focuses credit assignment strictly on progress-relevant decisions.
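
The path-based actions and the first masking rule described above can be illustrated with a small sketch. It is not the authors' implementation: the six-qubit two-module layout, the edge weights (including the cost of 3.0 on the inter-module channel), and the helper names `apply_route`, `frontier_distance`, and `admissible_actions` are all assumptions made for illustration. Only masking rule (1) is modeled; the rules for channel endpoints and EPR synchronization are left out.

```python
import heapq

def make_graph(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

def dijkstra(graph, source):
    """Return (distances, predecessors) for shortest paths from `source`."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev

def shortest_path(prev, source, target):
    """Rebuild the node sequence source..target from a predecessor map."""
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Hypothetical layout: modules 0-1-2 and 3-4-5, joined by a channel 2-3.
# The weight 3.0 is an assumed cost for crossing the channel.
EDGES = [(0, 1, 1.0), (1, 2, 1.0), (3, 4, 1.0), (4, 5, 1.0), (2, 3, 3.0)]
GRAPH = make_graph(EDGES)
ALL_PAIRS = {s: dijkstra(GRAPH, s) for s in GRAPH}  # precomputed once

def apply_route(placement, path):
    """Move the content of path[0] to path[-1] by swapping along the path."""
    new = dict(placement)
    for a, b in zip(path, path[1:]):
        new[a], new[b] = new[b], new[a]
    return new

def frontier_distance(placement, frontier):
    """Sum of hardware distances between the qubits of each frontier gate."""
    where = {q: p for p, q in placement.items() if q is not None}
    return sum(ALL_PAIRS[where[a]][0][where[b]] for a, b in frontier)

def admissible_actions(placement, frontier):
    """Keep (src, dst) routes that strictly reduce the frontier distance."""
    base = frontier_distance(placement, frontier)
    allowed = []
    for src in GRAPH:
        for dst in GRAPH:
            if src == dst or placement.get(src) is None:
                continue
            path = shortest_path(ALL_PAIRS[src][1], src, dst)
            if frontier_distance(apply_route(placement, path), frontier) < base:
                allowed.append((src, dst))
    return allowed

# Physical qubit -> logical qubit (None = unoccupied); one pending gate a-b.
PLACEMENT = {0: "c", 1: "a", 2: None, 3: None, 4: "b", 5: None}
FRONTIER = [("a", "b")]
ALLOWED = admissible_actions(PLACEMENT, FRONTIER)
print(ALLOWED)
```

In this toy instance, routes that push `a` or `b` away from each other, such as `(1, 0)` or `(4, 5)`, are masked, as are all routes that start from the uninvolved qubit `c`. The same strict-decrease test is what a zero-unless-progress routing reward would check, although the actual reward values are not specified here.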

Numerical Evaluation and Key Results

The proposed RL agent is evaluated against the baseline double deep Q-network (DDQN) agent of [promponas_compiler_2024] on two architectural testbeds: a low-connectivity two-module IBM Guadalupe layout, and a high-connectivity pair of $4 \times 4$ grid QPUs. Random 18-qubit circuits with between 30 and 50 CNOT gates are generated for both the training and test sets. The primary metric is modeled circuit execution time (in architecture-specific timesteps), reported together with wall-clock training time.

The main numerical findings and claims are:

Significant reduction in execution time: On the constrained Guadalupe architecture, average circuit execution time over the last 100 training episodes dropped from ~1,210 to ~746 timesteps, a 38% relative reduction. The improvement carries over to inference on unseen circuits, where the reduction is 35%.
Improved scalability: In contrast to the baseline, execution time for the new agent continued to improve with increasing circuit depth (number of CNOTs), indicating better generalization and scalability for larger circuit instances.
Computational efficiency gains: Wall-clock time for completing 250 training episodes dropped from ~66 hours (baseline) to ~24 hours (proposed agent), a 64% reduction, primarily attributed to the lower-dimensional Q-function and reduced neural network requirements.
Effective action masking: Restrictive masking accelerated convergence and stabilized learning, demonstrated by reduced reward variance at late-stage training relative to the baseline agent.
Lookahead analysis: Performance consistently improved as agents were trained on longer circuits, indicating that, for the tested problem sizes, all circuit gates impact routing; thus, aggressive lookahead truncation is not advisable at these scales.
Competitive behavior in highly connected architectures: On the $4\times4$ grid topology, despite initially slower convergence due to non-optimal path interference, the final performance matched or slightly surpassed the baseline.

Figure 2: Training results of the baseline agent [promponas_compiler_2024] over 250 episodes, showing cumulative rewards and execution time per episode for circuits of differing sizes.
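
As a quick check, the relative reductions quoted in the findings above follow directly from the reported before/after values. Those values are themselves approximate (they are given with "~"), so the percentages are rounded:

```latex
\[
\frac{1210 - 746}{1210} \approx 0.383 \;(\approx 38\%), \qquad
\frac{66 - 24}{66} \approx 0.636 \;(\approx 64\%).
\]
```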

Implications and Future Directions

Practical Significance

These results show that RL agents with engineered action spaces and aggressive masking can deliver substantial practical benefits for DQC compilation. Because inference is fast once training is complete, the training cost is amortized across workloads, which makes RL-based compilation competitive for batch and streaming quantum computing scenarios. Further, the dimensionality-reduction techniques used here (structuring the Q-function and exploiting architectural regularities) may transfer to other RL scheduling and resource allocation problems with large action spaces.

Theoretical Impact

The work highlights the importance of action-space design and Q-value parameterization in RL for combinatorial scheduling over networked quantum devices. By directly coupling the action structure to the physical system topology and the circuit computation graph, the agent is better aligned with the underlying hardware constraints and resource bottlenecks. The methodology suggests directions for theory: analyzing the expressivity–efficiency trade-off of structured parametric value functions, and the convergence implications of heavy masking, especially regarding policy optimality when the admissible set is highly filtered.
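
The trade-off mentioned above can be made concrete with a minimal sketch of a structured Q-value table. The exact parameterization is not given in this summary, so the form used here, Q(i, j) = λ·s_i + (1 − λ)·t_j with a fixed λ = 0.7, is an assumption, as are the function names and the example numbers. In a real agent the per-qubit source values s and target values t would come from the Q-network; here they are hard-coded.

```python
def structured_q(src_values, tgt_values, lam=0.7):
    """Pairwise Q-values built from per-qubit values (assumed form).

    Q(i, j) = lam * src_values[i] + (1 - lam) * tgt_values[j] for i != j.
    Using lam != 0.5 and separate source/target values keeps Q(i, j)
    distinct from Q(j, i), i.e. the direction of the route matters.
    """
    n = len(src_values)
    return {
        (i, j): lam * src_values[i] + (1.0 - lam) * tgt_values[j]
        for i in range(n)
        for j in range(n)
        if i != j
    }

def masked_argmax(q_values, allowed):
    """Greedy action among the admissible (unmasked) qubit pairs."""
    return max(allowed, key=lambda action: q_values[action])

# Hypothetical network outputs for 4 physical qubits: 2 * 4 = 8 numbers
# instead of one output per ordered pair (4 * 3 = 12).
SRC = [0.2, 0.9, 0.1, 0.4]
TGT = [0.5, 0.3, 0.8, 0.6]
Q = structured_q(SRC, TGT)

BEST_UNMASKED = masked_argmax(Q, list(Q))
BEST_MASKED = masked_argmax(Q, [(0, 2), (3, 1), (2, 3)])
print(len(SRC) + len(TGT), len(Q), BEST_UNMASKED, BEST_MASKED)
```

The efficiency side is visible in the output count: 2n values for n qubits instead of n(n − 1). The expressivity cost, under this assumed additive form, is that the table is a sum of a source term and a target term, so it cannot represent arbitrary pair-specific preferences; without masking, the greedy action simply pairs the best source with the best target.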

Broader AI and Quantum Computing Outlook

From the broader AI perspective, this research exemplifies the maturation of RL for hybrid quantum-classical optimization domains—where both the state and action spaces are governed by physical constraints, and simulation costs are substantial. The approach is thus relevant to other emerging applications in quantum network scheduling, memory management, and error correction, especially as scalable quantum architectures move beyond experimental regimes.

For distributed quantum computing, scalable and efficient circuit compilation is a key bottleneck for practical, large-scale platforms. As fault-tolerance technologies mature and DQC architectures proliferate, the need for compiler approaches that sustain high throughput and minimal overhead, and adapt rapidly to hardware changes or workload shifts, will remain paramount.

Limitations and Open Challenges

Despite strong performance improvements, the fundamental scalability of the RL approach is still bound by the growth of the system state space with circuit size and qubit count. The current approach does not abstract away all combinatorial complexity. Furthermore, strict masking strategies—while beneficial for credit assignment and training—may impede discovery of globally optimal, non-myopic routes, particularly if circuit scheduling subtleties require temporary regressions for long-term gains.

Future work must address:

Learning compact, transferable representations of quantum circuit state that support zero-shot or few-shot generalization across diverse workloads and hardware topologies.
Balancing exploration and exploitation in heavily masked action sets, possibly by using policy distillation or offline RL to recover global optimality.
Incorporating physical noise models, decoherence, and stochastic entanglement generation more faithfully to bridge the gap between simulated environments and actual hardware.
Extending the approach to integrate compiling for quantum error correction and multi-level logical-physical mapping.

Conclusion

This work demonstrates that RL agents engineered with problem-specific action spaces, masking, and Q-function parameterizations can substantially improve both modeled execution time and computational efficiency for circuit routing in distributed quantum systems. These advances support the future deployment of scalable, low-latency DQC platforms and suggest design principles for RL in quantum resource management tasks.
