ProGuard: Java Optimization & Multimodal Safety
- ProGuard is a tool for Java bytecode shrinking, optimization, and obfuscation, streamlining code by removing unused classes, methods, and fields.
- Its debloating quality is assessed via soundness and precision: static reachability analysis yields nearly 100% precision, but soundness suffers on dynamic and reflective code paths.
- ProGuard also denotes an RL-trained multimodal safety system using a ViT-based encoder and transformer to classify and detect safety risks in vision-language content.
ProGuard refers to both a widely used Java bytecode optimization and debloating tool, and a recently introduced vision-language proactive moderation system for multimodal generative content. Although the two share a name, their operational contexts are entirely separate: the Java tool targets code reduction and optimization, while the multimodal guard system focuses on safety risk identification across text and image modalities.
1. ProGuard for Java Bytecode Debloating
ProGuard (v7.7) is a static analysis tool for shrinking, optimizing, and obfuscating Java bytecode, operating on input JAR archives. The workflow involves reading the specified JARs; constructing call and reachability graphs from bytecode instructions, constant-pool references, inheritance relations, and direct calls; and pruning program constructs unreachable from designated entry points (typically a main class). It removes unused classes, strips unused methods and fields from surviving classes, and produces an optimized JAR, optionally applying obfuscation and further optimizations. Because it relies on purely static reachability, ProGuard cannot identify code paths activated only at runtime via reflection, dynamic class loading, or framework-based annotations (Klauke et al., 23 Oct 2025).
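This read–analyze–prune workflow is driven by a configuration file. A minimal sketch is shown below; the JAR names and the entry-point class are placeholders, not taken from any benchmark:

```
# Input/output JARs (names are illustrative)
-injars  app.jar
-outjars app-shrunk.jar

# Runtime library needed for reachability analysis (Java 9+ layout)
-libraryjars <java.home>/jmods/java.base.jmod(!**.jar;!module-info.class)

# Entry point: everything statically reachable from main() is kept
-keep public class com.example.Main {
    public static void main(java.lang.String[]);
}
```

Everything not reachable from the `-keep` roots is a candidate for removal, which is exactly why runtime-only paths (reflection, dynamic loading) fall through the cracks.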
2. Soundness and Precision in Java Debloating Workflows
Soundness and precision are central metrics for assessing any debloating tool. The definitions, as employed in contemporary benchmarking, are:
- Soundness: The proportion of required classes, methods, or fields (i.e., ground-truth constructs actually invoked at runtime) that the tool preserves.
- Precision: Represents the fraction of retained constructs that are genuinely required, penalizing retention of bloated code.
These metrics are computed at class, method, and field levels against a curated ground truth of runtime-used versus bloated constructs.
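In code, these definitions reduce to standard recall/precision over retained constructs. A minimal sketch (the set contents are illustrative, not from the benchmark):

```python
def soundness_precision(required: set, retained: set):
    """Soundness = fraction of required constructs kept (recall);
    precision = fraction of kept constructs that are required."""
    tp = len(required & retained)   # required and kept
    fn = len(required - retained)   # required but removed
    fp = len(retained - required)   # bloated but kept
    soundness = tp / (tp + fn) if required else 1.0
    precision = tp / (tp + fp) if retained else 1.0
    return soundness, precision

# Example: 6 required classes; the tool keeps 4 of them and nothing extra
required = {"A", "B", "C", "D", "E", "F"}
retained = {"A", "B", "C", "D"}
s, p = soundness_precision(required, retained)
# s ≈ 0.67 (two required classes dropped), p = 1.0
```

This is the pattern behind results like "S=67%, P=100%": perfectly precise removal that nonetheless breaks the program by dropping required code.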
3. Benchmarking Methodology: The Deblometer Suite
For systematic evaluation, the Deblometer micro-benchmark suite comprises 59 JAR files, each emphasizing one or more of 13 Java language features. Features covered include abstract classes, annotations, deserialization, dynamic class loading, exception types, generics, interfaces, method overloading and overriding, reflection, and serialization. Each benchmark provides a manually curated ground truth in JSON format, labeling required and bloated classes, methods, and fields. During validation, the SootUp framework extracts post-debloating program constructs, enabling objective, automated calculation of TP, FP, and FN for each feature (Klauke et al., 23 Oct 2025).
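Per-feature scoring along these lines can be sketched as follows; the JSON schema shown here is illustrative, not Deblometer's actual format:

```python
import json

# Hypothetical ground-truth format: required vs. bloated constructs
ground_truth = json.loads("""
{"required": ["pkg.Main", "pkg.Plugin"], "bloated": ["pkg.Unused"]}
""")

# Constructs found in the debloated JAR (as a bytecode extraction step
# such as SootUp's would yield)
retained = {"pkg.Main"}

required = set(ground_truth["required"])
tp = len(required & retained)   # required and kept
fn = len(required - retained)   # required but removed (unsoundness)
fp = len(retained - required)   # bloated but kept (imprecision)
# Here: tp=1, fn=1 (pkg.Plugin was dropped), fp=0
```

Aggregating these counts per language feature and per construct level (class, method, field) yields the benchmark tables reported below.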
4. Quantitative Evaluation of ProGuard’s Debloating Performance
ProGuard demonstrates strong precision in removing unused code, with nearly 100% precision at the class level for most features. However, soundness—preservation of required constructs—is compromised in scenarios involving dynamic or metadata-driven language features. Highlights from benchmarking scores include:
| Feature | Class Level | Method Level | Field Level |
|---|---|---|---|
| abstract | S=100%, P=100% | S=100%, P=83% | S=100%, P=67% |
| annotation | S=100%, P=100% | S=40%, P=100% | S=100%, P=50% |
| deserialization | S=100%, P=100% | S=33%, P=100% | S=100%, P=80% |
| dynamic loading | S=67%, P=100% | S=50%, P=100% | S=0%, P=0% |
| reflection | S=100%, P=100% | S=0%, P=0% | S=50%, P=50% |
Unsoundness is most pronounced for dynamic class loading, reflection, and annotation-driven callbacks, where required constructs are systematically removed because they are invisible to static reachability analysis. For example, ProGuard retained only 4 of 6 required classes in a dynamic class loading test (S=67%), dropped reflective methods essential at runtime (S=0%), and removed annotated callback methods (S=40%) or metadata-bearing fields (S=0% in certain interface tests) (Klauke et al., 23 Oct 2025). For method overloading, it dropped 1 of 8 required overloaded methods (S=88%, P=100%).
5. Comparative Perspective: ProGuard, Deptrim, JShrink
ProGuard’s static-only approach contrasts with Deptrim’s hybrid of static and dynamic analysis and with JShrink’s partial static workflows. Deptrim yields high soundness across features but lower precision, retaining substantial bloated code due to its dynamic instrumentation. ProGuard achieves nearly perfect precision but is unsound for reflection-based, dynamic, and annotation-driven constructs; JShrink produces corrupted JARs and crashes when handling annotation metadata or lambdas. The root limitation for ProGuard is its inability to detect code paths exercised only at runtime, such as via custom ClassLoaders, java.lang.reflect, or framework-invoked annotations (Klauke et al., 23 Oct 2025).
6. Recommendations for Mitigating ProGuard’s Unsoundness
Empirical evidence suggests several practices for improving the reliability of shrinking workflows with ProGuard:
- integrating dynamic analysis or lightweight bytecode instrumentation to capture dynamically loaded or reflectively invoked constructs;
- supplying explicit -keep rules for annotated callbacks or framework hooks;
- enhancing annotation constant-pool rewriting to avoid corrupted artifacts;
- extending static analysis to cover common reflection idioms; and
- writing main-driver stubs that invoke complex features to expose hidden code paths.
For projects relying on reflection, dynamic loading, or annotation-driven libraries such as Jackson or JAXB, explicit whitelisting or short dynamic profiling runs are essential to safeguard against semantic breakage (Klauke et al., 23 Oct 2025).
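As an illustration of explicit -keep whitelisting for reflective and annotation-driven code, a hedged configuration fragment (all class and annotation names are placeholders) might look like:

```
# Keep classes loaded reflectively or via custom ClassLoaders
-keep class com.example.plugins.** { *; }

# Keep members invoked through framework annotations (e.g. Jackson)
-keepclassmembers class * {
    @com.fasterxml.jackson.annotation.JsonProperty *;
    @com.fasterxml.jackson.annotation.JsonCreator <init>(...);
}

# Retain annotation and generic-signature metadata in the output
-keepattributes *Annotation*, Signature
```

Rules like these make runtime-only dependencies visible to the shrinker at the cost of retaining some code that static analysis alone would have removed.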
7. ProGuard: Vision-Language Multimodal Proactive Guard
In a distinct context, ProGuard denotes an RL-trained multimodal safety guard designed for the proactive detection of safety risks in generative vision-language systems. Built on Qwen2.5-VL-7B, it processes text, images, and text-image pairs with a modality-balanced dataset of 87K samples, annotated with binary safety labels and fine-grained risk categories from an 11×28 hierarchical taxonomy. A majority voting scheme among three open-source annotators ensures inter-annotator consistency (Fleiss-Kappa ≈ 0.7). The model architecture couples a ViT-style vision encoder, multimodal transformer, and language decoder outputting reasoning chains and final answers.
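The majority-vote labeling and the agreement statistic quoted above follow standard formulas; a self-contained sketch (the labels and vote counts are illustrative):

```python
from collections import Counter

def majority_vote(labels):
    """Label chosen by the most annotators."""
    return Counter(labels).most_common(1)[0][0]

def fleiss_kappa(ratings, categories):
    """ratings: one list of labels per item, each from the same
    fixed number of raters n."""
    n = len(ratings[0])                       # raters per item
    counts = [Counter(r) for r in ratings]
    # Per-item observed agreement P_i
    p_i = [(sum(c[cat] ** 2 for cat in categories) - n) / (n * (n - 1))
           for c in counts]
    p_bar = sum(p_i) / len(ratings)
    # Chance agreement from marginal category proportions
    total = n * len(ratings)
    p_j = [sum(c[cat] for c in counts) / total for cat in categories]
    p_e = sum(p ** 2 for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

votes = [["safe", "safe", "unsafe"], ["unsafe", "unsafe", "unsafe"]]
label0 = majority_vote(votes[0])              # "safe"
kappa = fleiss_kappa(votes, ["safe", "unsafe"])
```

A kappa near 0.7, as reported for ProGuard's dataset, indicates substantial agreement beyond chance among the three annotators.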
ProGuard employs pure Group Relative Policy Optimization (GRPO) for RL training, optimizing rewards for correct safety classification, fine-grained categorization, and OOD detection/description. For OOD cases, a synonym-bank similarity reward encourages concise, semantically aligned descriptions based on sentence transformer embeddings. OOD category inference involves random removal of 50% of taxonomy categories during training, requiring the model to detect in/out status and generate plausible category names when out-of-taxonomy.
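At the heart of GRPO is the group-relative advantage: rewards for a group of sampled responses to the same prompt are normalized by the group's mean and standard deviation, removing the need for a learned value critic. A minimal sketch of that normalization (reward values are illustrative):

```python
import statistics

def group_relative_advantages(rewards):
    """A_i = (r_i - mean(r)) / std(r) over one sampled group;
    degenerate groups (zero variance) get all-zero advantages."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)        # population std dev
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# Example: four rollouts scored on safety classification correctness
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Symmetric rewards yield symmetric advantages: [1.0, -1.0, 1.0, -1.0]
```

In ProGuard's setup the scalar rewards would combine the classification, categorization, and OOD-description terms described above before this normalization is applied.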
Experimental results show ProGuard-7B reaches ~90% F1 on text safety classification, 88.5% F1 on image+text (competitive with closed-source GPT4o-mini), and outperforms open-source VLM guards by 15–30 pp in multiclass categorization. In OOD detection/description, ProGuard-7B achieves ~61% F1 and 24/100 mean reward, surpassing GPT4o-mini (36%) and Gemini2.5 (16/100) by 52.6% and 64.8% respectively. RL training yields concise reasoning traces while sustaining overall accuracy. Limitations include remaining OOD detection challenges and synonym-bank coverage bias. Proposed future directions involve continuous taxonomy expansion, larger-scale multimodal pretraining, and dynamic synonym augmentation (Yu et al., 29 Dec 2025).