Voxel-Level A3C Module for 3D Segmentation
- The paper presents a voxel-level A3C method that treats each voxel as an independent agent using composite rewards to correct noisy labels.
- It leverages a shared encoder with multiple decoders to output segmentation, value, and policy predictions, ensuring stable asynchronous updates.
- Empirical results demonstrate Dice score improvements of up to +3.72% on challenging datasets, confirming the module's robust performance.
The voxel-level Asynchronous Advantage Actor-Critic (vA3C) module is a reinforcement learning component formulated for robust 3D medical image segmentation in the presence of noisy annotations. It is a core innovation of the Staged Voxel-Level Deep Reinforcement Learning (SVL-DRL) framework, which frames each image voxel as an autonomous agent operating asynchronously in parallel. Unlike conventional sample-level or patch-level denoising approaches, vA3C exploits local policy adjustments driven by composite rewards that fuse segmentation accuracy metrics with anatomical constraints, thereby incrementally rectifying labeling inaccuracies at voxel resolution (Fu et al., 7 Jan 2026).
1. Voxel-wise Reinforcement Learning Agent Formulation
In the vA3C formulation, each voxel $x_i$ of a 3D image volume $X$, where $i = 1, \dots, N$ indexes the $N$ voxels, is modeled as an independent agent. The state of agent $i$ at step $t$ is denoted $s_i^{(t)}$; it is initialized to the voxel's raw intensity and, in practice, set to the feature vector at location $i$ within a global feature map output by a shared encoder. This state captures both local and contextual neighborhood information. The collective system state is $s^{(t)} = \{s_i^{(t)}\}_{i=1}^{N}$, allowing all voxels to update policy-relevant representations simultaneously.
2. Discrete Action Space and Voxel Manipulations
Each voxel-agent samples from a finite set of three actions:
- do nothing,
- enhance tissue/lesion,
- weaken tissue/lesion.
The selected action modulates the current voxel value by a controlled amount. These controlled perturbations simulate potential medical effects and facilitate error correction in the segmentation process.
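As a rough illustration of how the three actions could act on voxel values, the sketch below uses an additive perturbation with clipping. The integer action codes, the additive rule, the `step` size, and the `[low, high]` range are all assumptions made for this example and are not taken from the source.

```python
from enum import IntEnum


class VoxelAction(IntEnum):
    # Integer codes are hypothetical; the source only names the three actions.
    DO_NOTHING = 0
    ENHANCE = 1
    WEAKEN = 2


def apply_action(value, action, step=0.1, low=0.0, high=1.0):
    """Perturb one voxel value. The additive rule and `step` are assumptions."""
    if action == VoxelAction.ENHANCE:
        value = value + step
    elif action == VoxelAction.WEAKEN:
        value = value - step
    # Clip so that repeated actions keep the value inside the assumed range.
    return min(high, max(low, value))


def apply_actions(values, actions, step=0.1):
    """Apply one action per voxel to a flat list of voxel values."""
    return [apply_action(v, a, step) for v, a in zip(values, actions)]


updated = apply_actions(
    [0.5, 0.5, 0.5],
    [VoxelAction.DO_NOTHING, VoxelAction.ENHANCE, VoxelAction.WEAKEN],
)
```

The clipping step is one simple way to keep perturbations bounded; a multiplicative rule would be an equally plausible choice.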
3. Shared-Encoder and Multi-Decoder Architecture
The network backbone consists of a shared Swin UNETR encoder, which generates hierarchical feature maps from the entire input volume. This is followed by three parallel decoder branches:
- Segmentation head: produces a probability map for voxel-wise segmentation.
- Value head: estimates the expected return for each voxel.
- Policy head: outputs action logits for each voxel, parameterizing the per-voxel policy.
Encoder parameters are shared, while the decoders for segmentation, value, and policy are independent; together they make up the network's trainable parameters.
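The shared-encoder, three-head layout can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the `encoder` here is a hand-written per-voxel feature map standing in for Swin UNETR (which uses spatial context), each head is a single linear layer, and all function and parameter names are hypothetical.

```python
import math
import random


def encoder(intensity):
    """Toy stand-in for the shared encoder: one voxel intensity -> features."""
    return [intensity, intensity * intensity, 1.0]


def linear(features, weights):
    """Dot product used by every head."""
    return sum(f * w for f, w in zip(features, weights))


def segmentation_head(features, weights):
    """Foreground probability via a sigmoid."""
    return 1.0 / (1.0 + math.exp(-linear(features, weights)))


def value_head(features, weights):
    """Scalar estimate of the expected return."""
    return linear(features, weights)


def policy_head(features, weights_per_action):
    """Softmax over one logit per action."""
    logits = [linear(features, w) for w in weights_per_action]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(num_features=3, num_actions=3, seed=0):
    """Independent weights for each head (hypothetical initialisation)."""
    rng = random.Random(seed)

    def draw():
        return [rng.uniform(-1.0, 1.0) for _ in range(num_features)]

    return {"seg": draw(), "value": draw(),
            "policy": [draw() for _ in range(num_actions)]}


def forward(voxels, params):
    """Run all voxels through the shared encoder and the three heads."""
    outputs = []
    for intensity in voxels:
        features = encoder(intensity)  # computed once, reused by all heads
        outputs.append({
            "seg": segmentation_head(features, params["seg"]),
            "value": value_head(features, params["value"]),
            "policy": policy_head(features, params["policy"]),
        })
    return outputs


outputs = forward([0.1, 0.5, 0.9], init_params())
```

The point of the sketch is the data flow: features are computed once per voxel and consumed by three heads that hold separate weights.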
4. Voxel-Adaptive A3C: Workflow and Loss Formulations
The conventional A3C framework—where multiple actors interact with independent environments—is adapted such that every voxel is an agent in a shared environment, executed in parallel in each forward pass. Training proceeds in three main stages:
- Warmup: Pure supervised segmentation,
- Transition: Mixed Dice and value-MSE loss,
- Full RL: RL loss with segmentation, value, and policy terms.
At each RL step:
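- Actions are sampled per voxel from the current policy: $a_i^{(t)} \sim \pi_\theta(\cdot \mid s_i^{(t)})$, where $s_i^{(t)}$ is the state of voxel $i$ at step $t$.
- The action effect is applied to yield the updated voxel value and the next state $s_i^{(t+1)}$.
- Step rewards $r_i^{(t)}$ are computed (see Section 5).
- $n$-step returns are accumulated: $R_i^{(t)} = \sum_{k=0}^{n-1} \gamma^{k} r_i^{(t+k)} + \gamma^{n} V(s_i^{(t+n)})$, where $\gamma \in (0, 1]$ is the discount factor and $V$ is the value head's estimate.
- Advantages $A_i^{(t)} = R_i^{(t)} - V(s_i^{(t)})$ are used in policy gradients.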
- Gradients for segmentation, value, and policy loss are aggregated and used to update the parameters.
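The return and advantage steps in the list above follow the standard A3C recipe, which can be sketched for a single voxel as below. The function names and the default discount `gamma=0.99` are assumptions for illustration; the source's discount value is not reproduced here.

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step returns for one voxel.

    rewards         -- the n step rewards r_t, ..., r_{t+n-1}
    bootstrap_value -- value estimate of the state reached after n steps
    gamma           -- discount factor (hypothetical default)
    """
    returns = []
    running = bootstrap_value
    # Work backwards so that each return reuses the one after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages(returns, values):
    """Advantage = n-step return minus the critic's value estimate."""
    return [ret - val for ret, val in zip(returns, values)]


returns = n_step_returns([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5)
adv = advantages(returns, [1.5, 2.5, 3.0])
```

In the voxel-level setting the same computation would run independently for every voxel, each with its own reward sequence and value estimates.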
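The staged loss functions follow the three stages above: a segmentation loss in the warmup stage, a weighted combination of Dice and value-MSE losses in the transition stage, and a weighted combination of segmentation, value, and policy losses in the full RL stage. The three weighting coefficients are set to $0.2$, $0.2$, and $0.3$, respectively.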
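A minimal sketch of the staged schedule is shown below. It only encodes which loss terms are active in each stage, as listed above; the epoch boundaries (`warmup_epochs`, `transition_epochs`), the weight names, and their default values of 1.0 are placeholders chosen for this example, not the source's settings.

```python
def training_stage(epoch, warmup_epochs=10, transition_epochs=10):
    """Map an epoch index to a stage name. Boundaries are hypothetical."""
    if epoch < warmup_epochs:
        return "warmup"
    if epoch < warmup_epochs + transition_epochs:
        return "transition"
    return "full_rl"


def staged_loss(stage, seg_loss, value_loss, policy_loss,
                w_seg=1.0, w_value=1.0, w_policy=1.0):
    """Combine already-computed loss terms according to the current stage."""
    if stage == "warmup":
        # Pure supervised segmentation.
        return seg_loss
    if stage == "transition":
        # Segmentation (Dice) loss mixed with the value-MSE loss.
        return w_seg * seg_loss + w_value * value_loss
    if stage == "full_rl":
        # Segmentation, value, and policy terms together.
        return (w_seg * seg_loss + w_value * value_loss
                + w_policy * policy_loss)
    raise ValueError(f"unknown stage: {stage}")


stage = training_stage(epoch=12)
loss = staged_loss(stage, seg_loss=0.5, value_loss=2.0, policy_loss=3.0)
```

Keeping the stage selection separate from the loss combination makes it easy to test ablations that drop a stage.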
5. Advantage Policy Gradient and Composite Reward Function
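The advantage function follows the A3C standard: $A(s_i^{(t)}, a_i^{(t)}) = R_i^{(t)} - V(s_i^{(t)})$, where $R_i^{(t)}$ is the $n$-step return and $V(s_i^{(t)})$ is the value head's estimate for voxel $i$ at step $t$. The policy gradient for voxel $i$ is
$$\nabla_\theta \log \pi_\theta(a_i^{(t)} \mid s_i^{(t)}) \, A(s_i^{(t)}, a_i^{(t)}),$$
aggregated over all voxels and $n$-step windows.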
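For a softmax policy, the advantage-weighted policy gradient has a closed form with respect to the action logits, which the sketch below computes per voxel and then averages. The function names and the choice of a mean (rather than a sum) over voxels are assumptions for illustration; the advantage is treated as a constant, as is usual in actor-critic updates.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_loss_and_grad(logits, action, advantage):
    """Loss -log pi(action) * advantage and its gradient w.r.t. the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[action]) * advantage
    # d(-log pi(a))/d(logit_k) = pi_k - 1[k == a]; scale by the advantage.
    grad = [advantage * (p - (1.0 if k == action else 0.0))
            for k, p in enumerate(probs)]
    return loss, grad


def aggregate_over_voxels(per_voxel):
    """Average the per-voxel losses. `per_voxel` holds (logits, action, adv)."""
    losses = [policy_loss_and_grad(lg, a, adv)[0] for lg, a, adv in per_voxel]
    return sum(losses) / len(losses)


loss, grad = policy_loss_and_grad([0.0, 0.0, 0.0], action=1, advantage=2.0)
mean_loss = aggregate_over_voxels([([0.0, 0.0, 0.0], 1, 2.0),
                                   ([0.0, 0.0, 0.0], 0, 0.0)])
```

A positive advantage lowers the loss when the chosen action's probability rises, so gradient descent on this loss reinforces actions that did better than the critic expected.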
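The reward at each step combines a segmentation-accuracy term with an anatomical-constraint term. The accuracy term is the change in Dice score between consecutive steps (the Dice delta). The anatomical constraint is computed from the number of connected components in the binary mask and the total variation of the segmentation. A lower constraint value corresponds to greater anatomical plausibility, so fragmented or irregular outputs are penalized.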
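The ingredients of the composite reward can be sketched on small binary volumes stored as nested `[z][y][x]` lists. The way the terms are combined here (Dice delta minus a weighted penalty), the weights `lam` and `beta`, the 6-connectivity, and the single scalar output are all assumptions for this example; the source's exact formula is not reproduced.

```python
from collections import deque

NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def dice(pred, target, eps=1e-6):
    """Dice overlap between two binary volumes."""
    p = [v for plane in pred for row in plane for v in row]
    t = [v for plane in target for row in plane for v in row]
    inter = sum(a * b for a, b in zip(p, t))
    return (2.0 * inter + eps) / (sum(p) + sum(t) + eps)


def count_components(mask):
    """Number of 6-connected foreground components, found by BFS."""
    depth, height, width = len(mask), len(mask[0]), len(mask[0][0])
    seen, count = set(), 0
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if not mask[z][y][x] or (z, y, x) in seen:
                    continue
                count += 1
                seen.add((z, y, x))
                queue = deque([(z, y, x)])
                while queue:
                    cz, cy, cx = queue.popleft()
                    for dz, dy, dx in NEIGHBORS:
                        n = (cz + dz, cy + dy, cx + dx)
                        inside = (0 <= n[0] < depth and 0 <= n[1] < height
                                  and 0 <= n[2] < width)
                        if inside and mask[n[0]][n[1]][n[2]] and n not in seen:
                            seen.add(n)
                            queue.append(n)
    return count


def total_variation(mask):
    """Sum of absolute differences between axis-adjacent voxels."""
    depth, height, width = len(mask), len(mask[0]), len(mask[0][0])
    tv = 0
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                if z + 1 < depth:
                    tv += abs(mask[z + 1][y][x] - mask[z][y][x])
                if y + 1 < height:
                    tv += abs(mask[z][y + 1][x] - mask[z][y][x])
                if x + 1 < width:
                    tv += abs(mask[z][y][x + 1] - mask[z][y][x])
    return tv


def composite_reward(prev_mask, new_mask, target, lam=0.1, beta=0.01):
    """Dice delta minus an anatomical penalty (hypothetical combination)."""
    dice_delta = dice(new_mask, target) - dice(prev_mask, target)
    extra_components = max(0, count_components(new_mask) - 1)
    penalty = extra_components + beta * total_variation(new_mask)
    return dice_delta - lam * penalty


target = [[[1, 1, 0], [1, 1, 0], [0, 0, 0]]]
prev_mask = [[[1, 0, 0], [0, 0, 0], [0, 0, 1]]]  # fragmented: two components
new_mask = [[[1, 1, 0], [1, 0, 0], [0, 0, 0]]]   # one compact component
reward = composite_reward(prev_mask, new_mask, target)
```

In this toy case the step from the fragmented mask to the compact one raises the Dice score and removes a spurious component, so the reward is positive; the reverse step is penalized.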
6. Empirical Validation: Ablation and Robustness under Noisy Annotations
Ablation studies (Tables 7–8 in the source) quantify vA3C's contribution to segmentation accuracy under synthetic noise. For example, on the LA dataset with 50% SFDA-Noise, the Dice score improves over the baseline when the full vA3C module is used, and on Pancreas-CT with SFDA-Noise, performance likewise increases. Removing any of the RL training stages degrades performance further. These results support the claim that voxel-level asynchronous policy updates are the dominant factor for robust learning under annotation noise (Fu et al., 7 Jan 2026).
7. Theoretical Rationale for Voxel-Level Asynchronous Actor-Critic
Treating each voxel as an agent enables correction of mislabeled regions without requiring exclusion of entire volumes or patches. Asynchronous updates across spatially disjoint voxels decorrelate gradients, promoting stable optimization in high-dimensional settings. The composite reward, combining localized accuracy gain (Dice delta) and global anatomical regularity, incentivizes the model to incrementally ameliorate noisy labels. The staged training regimen ensures effective initialization, mitigating policy collapse. Overall, vA3C's granularity and parallel reinforcement maximize resilience to global annotation noise and accelerate convergence, outperforming conventional supervised baselines and sample-filtering methods (Fu et al., 7 Jan 2026).