RoFTCodeSum: Robust Code Summaries
- RoFTCodeSum is a fine-tuning method that combines curriculum learning and model-agnostic meta-learning to robustly summarize obfuscated code.
- It incrementally trains on multiple levels of code obfuscation to overcome degradation from mangled identifiers and injected dead code.
- Experiments show significant gains in BLEU and SBERT metrics, outperforming standard fine-tuning and curriculum learning baselines.
RoFTCodeSum is a fine-tuning method for code summarization models designed to address robustness issues arising when code readability is degraded by obfuscation or semantic interference. In real-world scenarios such as reverse engineering, legacy maintenance, and security analysis, code often lacks clear semantic cues due to mangled identifiers or injected dead code, resulting in substantial performance drops for standard LLMs. RoFTCodeSum combines curriculum learning with model-agnostic meta-learning, constructing multi-level training datasets with increasing obfuscation and optimizing model parameters through both base and meta-gradient updates. This approach yields significant improvements in summary quality and robustness across both original and obfuscated code relative to prevailing fine-tuning and curriculum learning baselines (Zeng et al., 9 Jan 2026).
1. Motivation and Problem Scope
State-of-the-art code summarization models—such as DeepSeek-Coder (1.3B) and Qwen2.5-Coder (1.5B)—demonstrate strong performance on high-readability code, where clear semantic cues in function names, variable names, and comments enable precise summarization. However, practical applications routinely encounter code whose naming conventions are inconsistent, whose identifiers have been obfuscated, or into which non-functional (dead) code has been injected. Systematic evaluation reveals substantial degradation in summary quality, with even leading models like GPT-4o and DeepSeek-V3 experiencing marked BLEU and SBERT score drops on low-readability data. Previous mitigations, including prompt engineering, contrastive pre-training, and tree-based architectures, yield only modest gains or incur substantial architectural intervention costs. This evidences the need for fine-tuning methods that remain robust to semantic perturbations (Zeng et al., 9 Jan 2026).
2. Algorithmic Framework: Meta Curriculum Learning
RoFTCodeSum’s training integrates curriculum learning (CL) and model-agnostic meta-learning (MAML):
- Curriculum Learning (CL): Tasks are partitioned at multiple readability levels, e.g., original code (high readability), moderate obfuscation, and heavy obfuscation. CL encourages the model to incrementally master progressively challenging summarization tasks.
- Meta-Learning (MAML): For each obfuscation level k, inner-loop adaptation steps update model parameters using support sets. The outer loop aggregates meta-gradients from query sets at each level to refine the global parameter initialization.
The iterative training process comprises:
- Base Update: For the original data batch D^1_i, update parameters θ using the cross-entropy loss with learning rate α.
- Meta-Adaptation: For each obfuscation level k ∈ {2, 3}, perform inner gradient step(s) on the support subset D^{k,supp}_i (learning rate β), then compute the meta-gradient on the query subset D^{k,qry}_i. Update the global parameters via the accumulated meta-gradients (learning rate γ).
Mathematically, the key update steps are:
- Base fine-tuning update: θ ← θ − α ∇_θ L(θ | D^1_i)
- Inner adaptation (k = 2, 3): θ^k_i = θ − β ∇_θ L(θ | D^{k,supp}_i)
- Outer meta-update: θ ← θ − γ Σ_{k=2,3} ∇_θ L(θ^k_i | D^{k,qry}_i)
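The updates above can be traced by hand on a toy problem. The sketch below runs one base step and one round of meta-updates using scalar quadratic losses L_k(θ) = (θ − t_k)² as stand-ins for the per-level cross-entropy losses; the targets t_k and learning rates are illustrative, and the meta-gradient is taken at the adapted point with the adapted parameters treated as constant (the common first-order approximation of MAML).

```python
# Toy trace of RoFTCodeSum's update equations on scalar quadratic
# losses L_k(theta) = (theta - t_k)^2, standing in for the per-level
# cross-entropy losses. Targets t_k and learning rates are illustrative.

def loss(theta, t):
    return (theta - t) ** 2

def grad(theta, t):
    return 2.0 * (theta - t)

alpha, beta, gamma = 0.1, 0.1, 0.1
theta = 0.0
targets = {1: 1.0, 2: 1.5, 3: 2.0}   # one "dataset" per curriculum level

# Base update on the original-code level (k = 1)
theta = theta - alpha * grad(theta, targets[1])

# Meta-updates for the obfuscated levels (k = 2, 3):
# adapt on the support data, then accumulate the query gradient
# evaluated at the adapted parameters (first-order approximation)
meta_grad = 0.0
for k in (2, 3):
    theta_k = theta - beta * grad(theta, targets[k])   # inner adaptation
    meta_grad += grad(theta_k, targets[k])             # query meta-gradient
theta = theta - gamma * meta_grad
```

After one iteration θ has moved from 0 toward all three targets, with the obfuscated levels contributing through the accumulated meta-gradient rather than direct fine-tuning.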
3. Curricular Dataset Construction
RoFTCodeSum employs two curriculum paradigms with hierarchical levels reflecting escalating difficulty:
- Semantic-Obfuscation Curriculum (Levels 1–3):
- Original: No obfuscation.
- Function Name Erosion (FNE): Replace each function name with generic tokens (func_1, func_2, ...).
- Identifier Renaming (IRN): Further replace all local identifiers with generic forms (var_1, var_2, ...); external API names remain intact.
- Semantic-Interference Curriculum:
- Original: No dead code.
- DCI-5: Inject 5 lines of non-functional (dead) code maintaining functional equivalence.
- DCI-10: Inject 10 lines of dead code.
Difficulty metrics include obfuscation depth and a readability score proxy (ratio of semantically meaningful tokens to total tokens). Each level induces a measurable BLEU drop (10–30%) for baseline models (Zeng et al., 9 Jan 2026).
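The paper does not specify its transformation toolchain, but the three level constructions can be approximated with simple source transforms. The sketch below uses regex-based renaming as a minimal stand-in (a production pipeline would use an AST to distinguish local identifiers from external API names, which is why `rename_identifiers` here takes the local names explicitly); `readability_proxy` is one plausible instantiation of the readability-score proxy described above.

```python
import re

def erode_function_names(code: str) -> str:
    """FNE: replace each defined function name with a generic func_i token."""
    names = re.findall(r"def\s+(\w+)", code)
    for i, name in enumerate(dict.fromkeys(names), start=1):
        code = re.sub(rf"\b{re.escape(name)}\b", f"func_{i}", code)
    return code

def rename_identifiers(code: str, local_names: list[str]) -> str:
    """IRN: replace the given local identifiers with var_i. Callers supply
    the local names so that external API names remain intact."""
    for i, name in enumerate(local_names, start=1):
        code = re.sub(rf"\b{re.escape(name)}\b", f"var_{i}", code)
    return code

def inject_dead_code(code: str, n_lines: int) -> str:
    """DCI-n: append n lines of never-executed code, preserving behavior."""
    dead = "\n".join(f"    _unused_{i} = {i}" for i in range(n_lines))
    return code + "\nif False:\n" + dead

def readability_proxy(code: str) -> float:
    """Ratio of word tokens that are not generic func_i / var_i placeholders."""
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    meaningful = [t for t in tokens if not re.fullmatch(r"(?:func|var)_\d+", t)]
    return len(meaningful) / max(len(tokens), 1)

src = "def total(prices):\n    s = sum(prices)\n    return s\n"
level2 = erode_function_names(src)                    # FNE
level3 = rename_identifiers(level2, ["prices", "s"])  # FNE + IRN
```

Applied to `src`, the Level 3 transform yields `def func_1(var_1): var_2 = sum(var_1); return var_2`-style code: the external `sum` call survives while every local cue is erased, and the readability proxy drops accordingly.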
4. Pseudocode Summary
The RoFTCodeSum algorithmic workflow is captured as:
Input: α, β, γ (learning rates), θ (model parameters), D^1 (original dataset), D^2 (Level 2), D^3 (Level 3)
While not converged do
1. Sample mini‐batches D^1_i, D^2_i, D^3_i from each level.
2. Base update:
Compute g^1 ← ∇_θ L(θ|D^1_i);
θ ← θ – α g^1
3. Meta‐updates for k=2,3:
Split D^k_i into D^{k,supp}_i and D^{k,qry}_i.
Compute θ^k_i = θ – β ∇_θ L(θ|D^{k,supp}_i).
Compute g^k ← ∇_θ L(θ^k_i | D^{k,qry}_i).
4. θ ← θ – γ (g^2 + g^3)
End while
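The pseudocode above is directly executable on a toy problem. The sketch below runs the full loop on a scalar least-squares model, with noisier data standing in for heavier obfuscation levels; the model, losses, and hyperparameters are illustrative (the paper applies the loop to LLM cross-entropy), and the meta-gradient uses the first-order approximation.

```python
import random

def grad_mse(theta, batch):
    """Gradient of mean squared error for a scalar linear model y ~ theta * x."""
    return sum(2.0 * (theta * x - y) * x for x, y in batch) / len(batch)

def mse(theta, batch):
    return sum((theta * x - y) ** 2 for x, y in batch) / len(batch)

def make_level(n, noise, rng):
    """Toy stand-in for a curriculum level: noisier data ~ heavier obfuscation."""
    return [(x, 3.0 * x + rng.gauss(0.0, noise))
            for x in (rng.uniform(-1, 1) for _ in range(n))]

rng = random.Random(0)
levels = {1: make_level(200, 0.05, rng),   # D^1: original code
          2: make_level(200, 0.20, rng),   # D^2: moderate obfuscation
          3: make_level(200, 0.40, rng)}   # D^3: heavy obfuscation

alpha = beta = gamma = 0.05
theta = 0.0
for step in range(200):
    # 1. Sample a mini-batch from each level
    batches = {k: rng.sample(levels[k], 20) for k in levels}
    # 2. Base update on the original-code batch
    theta -= alpha * grad_mse(theta, batches[1])
    # 3. Meta-updates for k = 2, 3: split into support/query, adapt on
    #    support, accumulate the query gradient at the adapted point
    meta_grad = 0.0
    for k in (2, 3):
        supp, qry = batches[k][:10], batches[k][10:]
        theta_k = theta - beta * grad_mse(theta, supp)   # inner adaptation
        meta_grad += grad_mse(theta_k, qry)              # query meta-gradient
    # 4. Outer update with the accumulated meta-gradients
    theta -= gamma * meta_grad
```

On this toy problem θ converges near the true coefficient 3.0, with the noisy levels contributing only through the support/query meta-path, mirroring how obfuscated code shapes the initialization without dominating the base update.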
5. Experimental Results and Performance Analysis
Experiments utilize MLRC (human-labeled readability, split at the median) and the Python subset of CodeSearchNet (2,000 test examples; validation on held-out splits per curriculum level). The following training regimes are compared:
| Model/Method | Training Regime | Obfuscation Robustness |
|---|---|---|
| Zero-shot | No fine-tuning | Poor |
| FT₁ | Fine-tune on D1 (original only) | Moderate |
| FT_all | Fine-tune on all levels concatenated | Moderate-high |
| CL | Sequential curriculum D1→D2→D3 | High |
| CLAWSAT | Alternating robust fine-tuning | Moderate-high |
| RoFTCodeSum | CL+MAML (meta curriculum) | Highest |
Key evaluation metrics:
- BLEU-4 (n-gram precision, up to 4-grams)
- SBERT cosine similarity (sentence embedding comparison)
- Wilcoxon signed-rank test for statistical significance (p<0.05)
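For concreteness, the BLEU-4 metric above can be sketched as follows. This is a simplified sentence-level variant with uniform weights, add-one smoothing, and a brevity penalty; published evaluations typically use standard smoothed implementations (e.g., NLTK or sacreBLEU), and SBERT similarity requires a sentence-embedding model, so it is not reproduced here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate: str, reference: str) -> float:
    """Simplified sentence-level BLEU-4: geometric mean of clipped n-gram
    precisions (n = 1..4) times a brevity penalty. Add-one smoothing keeps
    a single missing n-gram order from zeroing the score."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, 5):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c[g], r[g]) for g in c)   # clipped matches
        total = max(sum(c.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1)) / 4
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)
```

An exact match scores 1.0, while an unrelated summary of similar length (e.g., the "Filter a new glass" drift discussed in Section 6) scores near the smoothing floor.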
RoFTCodeSum consistently achieves:
- Highest BLEU and SBERT scores across all test levels (Origin, FNE, IRN).
- Average BLEU gain vs. FT_all: +3.31 (DeepSeek-Coder), +1.96 (Qwen2.5-Coder).
- Maintained or improved performance on original code (e.g., DeepSeek BLEU 24.94 vs. 24.39 for FT_all).
- Doubling of robustness gains in semantic-interference curriculum relative to FT_all.
Ablation studies indicate that removing meta-learning (γ = 0) reduces BLEU by 1.5–2.0 points; eliminating the curriculum (training on a single level) reduces robustness by 2.0 points. Hyperparameter sensitivity analysis finds the best trade-off at α = β = γ = 5e-5, with performance more sensitive to α and γ than to β.
6. Insights and Limitations
RoFTCodeSum’s improvements derive from two synergistic effects: curriculum learning exposes the model to incrementally more challenging obfuscations, preventing abrupt jumps to poorly-readable examples; meta-learning configures model initialization to facilitate rapid adaptation on obfuscated data. This dual mechanism reduces reliance on superficial cues in source code—such as identifier names—and prioritizes semantic features from control-flow and data dependencies. Qualitative analysis demonstrates that, in the IRN setting, RoFTCodeSum generates semantically meaningful summaries (e.g., “Convert numpy array to shared variable”) as opposed to degraded outputs from FT_all (“Function 1”). In semantic-interference settings, RoFTCodeSum preserves intent (“Add a new filter to the filter list”) where baselines drift (“Filter a new glass”).
Notable limitations include reliance on synthetic obfuscation—potentially missing edge cases found in more severe real-world scenarios (e.g., control-flow flattening)—and the absence of formal human evaluation. The three learning rates (α, β, γ) require careful tuning; future research could investigate automated meta-schedulers or broader curriculum types.
7. Applications and Future Directions
RoFTCodeSum is applicable in domains requiring robust model performance on low-readability code, such as code forensics, reverse engineering, or security auditing. Its methodologically agnostic framework (compatible with DeepSeek-Coder, Qwen2.5-Coder, etc.) and empirically validated gains position it as a practical fine-tuning recipe for model robustness. Plausible future work may encompass broader obfuscation patterns, integration with qualitative human evaluation, and adaptive curriculum/meta-learning scheduling. These directions promise further enhancement of code summarization reliability in increasingly challenging program comprehension settings (Zeng et al., 9 Jan 2026).