Cross-architecture generality of the post-trained PrivEsc agent

Determine whether the high success rate under fixed round budgets achieved by PrivEsc-LLM (produced by applying supervised fine-tuning on procedurally generated privilege-escalation traces, followed by reinforcement learning with verifiable rewards, to the Qwen3-4B base model) extends to other base language model architectures when evaluated on the same 12-scenario Linux privilege-escalation benchmark.

Background

The paper post-trains the 4B-parameter Qwen3-4B base model with a two-stage pipeline, supervised fine-tuning followed by reinforcement learning with verifiable rewards, to create PrivEsc-LLM, which reaches a 95.8% success rate at a 20-round budget on a held-out 12-scenario Linux privilege-escalation benchmark.
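The headline number is a budgeted success rate: the fraction of benchmark scenarios the agent solves within a fixed number of interaction rounds. A minimal sketch of this metric, assuming per-scenario round counts are recorded (the function and data below are illustrative, not the authors' harness):

```python
# Hypothetical sketch of "success within a round budget": each scenario
# records the number of rounds the agent needed to escalate privileges,
# or None if it never succeeded. Names and numbers are illustrative.

def success_at_budget(rounds_to_success, budget):
    """Fraction of scenarios solved within `budget` interaction rounds."""
    solved = sum(1 for r in rounds_to_success if r is not None and r <= budget)
    return solved / len(rounds_to_success)

# Example: 12 scenarios, one never solved and one solved past the budget.
outcomes = [3, 5, 8, 2, 14, 19, 6, 4, 11, 7, None, 25]
rate = success_at_budget(outcomes, budget=20)  # 10/12
```

Reporting the rate at several budgets (e.g. 5, 10, 20 rounds) also captures the interaction efficiency the paper discusses, not just eventual success.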

All empirical results in the paper are obtained with a single base architecture, Qwen3-4B. The authors explicitly note that this restriction limits the scope of the generality claims and highlights uncertainty about whether similar gains would hold across different model families.

The open issue concerns whether the demonstrated improvements and interaction efficiency from this training methodology generalize beyond Qwen3-4B to other base LLM architectures under the same evaluation protocol.
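The study this implies is straightforward to frame: hold the post-training pipeline and evaluation protocol fixed, and vary only the base model. A sketch of such a harness, where all function names and model identifiers other than Qwen3-4B are placeholder assumptions:

```python
# Hypothetical cross-architecture study harness. `post_train` stands in for
# the fixed SFT -> RLVR pipeline and `evaluate` for one budgeted benchmark
# run; both are placeholders, as are the non-Qwen model names.

BASE_MODELS = ["Qwen3-4B", "Llama-3.2-3B", "Gemma-2-2B"]  # illustrative
ROUND_BUDGETS = [5, 10, 20]

def run_study(post_train, evaluate, scenarios):
    """post_train(name) -> agent; evaluate(agent, scenario, budget) -> bool.

    Returns {model_name: {budget: success_rate}} over the fixed scenario set.
    """
    results = {}
    for name in BASE_MODELS:
        agent = post_train(name)  # identical pipeline for every base model
        results[name] = {
            b: sum(evaluate(agent, s, b) for s in scenarios) / len(scenarios)
            for b in ROUND_BUDGETS
        }
    return results
```

Keeping the 12 scenarios and round budgets identical across models is what makes the resulting per-architecture success rates directly comparable to the paper's Qwen3-4B numbers.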

References

"Specifically, we study only one base architecture, Qwen3-4B, so cross-family generality remains open."

Post-Training Local LLM Agents for Linux Privilege Escalation with Verifiable Rewards  (2603.17673 - Normann et al., 18 Mar 2026) in Section 6 (Discussion)