Source Code Authorship Attribution
- SCAA is the computational task of inferring the author of code samples by analyzing distinctive stylistic, structural, and semantic patterns.
- It combines traditional stylometry with deep learning and transformer models, achieving high accuracy in controlled settings despite challenges in adversarial environments.
- Its practical applications include software forensics, plagiarism detection, and intellectual property litigation, driving ongoing research in interpretability and robustness.
Source Code Authorship Attribution (SCAA) is the computational task of inferring the author of a code sample—at the file, fragment, project, or even binary level—by exploiting stylistic, structural, or semantic coding patterns. SCAA has direct applications in software forensics, plagiarism detection, malware lineage tracing, and intellectual property litigation. While methods span classic stylometry, deep learning, robust adversarial modeling, and LLMs, the field faces unique challenges due to the highly structured, variable, and obfuscatable nature of source code.
1. Technical Foundations and Problem Formulation
SCAA formalizes the problem as multiclass supervised classification over a closed—or, less often, open—set of candidate authors. Given a training corpus $D = \{(x_i, a_i)\}_{i=1}^{N}$ of source code samples $x_i$ with author labels $a_i \in \mathcal{A}$ and an unseen code sample $x$, the goal is to construct a function $f : \mathcal{X} \to \mathcal{A} \cup \{\bot\}$ that correctly assigns $x$ to its true author or abstains ($\bot$). Classification accuracy, precision, recall, and $F_1$ score are standard metrics, typically macro- or micro-averaged over the author set $\mathcal{A}$ (Bogomolov et al., 2020). Some systems also support confidence-calibrated output for "open-world" rejection, e.g., rejecting low-confidence attributions (Dauber et al., 2017).
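As a minimal sketch of this formulation, the snippet below trains a closed-set classifier and abstains on low-confidence inputs. The toy corpus, the character n-gram pipeline, and the 0.6 rejection threshold are all illustrative assumptions, not a system from the cited work:

```python
# Closed-set attribution with open-world abstention via confidence thresholding.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: code samples (strings) with author labels.
train_code = ["for(int i=0;i<n;++i) sum+=a[i];",
              "total = sum(values)  # pythonic habit",
              "while (p != NULL) { p = p->next; }"]
train_authors = ["alice", "bob", "alice"]

# Character n-grams capture layout and naming habits without parsing.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_code, train_authors)

def attribute(sample: str, reject_below: float = 0.6) -> str:
    """Return the predicted author, or abstain on low confidence."""
    probs = clf.predict_proba([sample])[0]
    best = probs.argmax()
    return clf.classes_[best] if probs[best] >= reject_below else "<unknown>"

print(attribute("for(int j=0;j<m;++j) acc+=b[j];"))
```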
SCAA can target whole files, small incomplete fragments, or binary executables (Caliskan et al., 2015). Multi-author segments, code written by groups or with heavy borrowing, present a further complication addressed by tailored modeling (Mahbub et al., 2022). Adversarial robustness (i.e., reliable prediction under style obfuscation or adversarial perturbation) is now a central focus (Abuhamad et al., 2023, Quiring et al., 2019).
2. Stylometric and Syntactic Feature Engineering
Early and ongoing SCAA research relies on coding style ("stylometry") as found in layout, identifier naming, indentation, and syntactic habits. Features fall into several broad categories (a feature-extraction sketch follows the list):
- Lexical: word n-grams, character n-grams (Dauber et al., 2017, Frantzeskou et al., 2021), token frequency, API symbols, comment ratios.
- Syntactic: AST (abstract syntax tree) node frequencies, syntactic bigrams, subtree depth distributions, parse tree statistics (Bogomolov et al., 2020, Joshi et al., 2024).
- Structural: control-flow graph node counts, statement and block types, function parameter patterns.
- Layout/Style: whitespace ratios, brace placement, comment density, line length, indentation habits (Mahbub et al., 2022, Joshi et al., 2024).
- Semantic/Dynamic: binary disassembly length, execution-time/memory usage, code complexity metrics (Joshi et al., 2024).
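A minimal sketch of hand-crafted layout and lexical features of the kinds listed above; the particular feature names and normalizations are illustrative choices rather than a fixed set from any cited system:

```python
# Extract simple stylometric features from raw source text.
import re

def layout_features(src: str) -> dict:
    lines = src.splitlines() or [""]
    n = len(lines)
    return {
        # Layout/style cues
        "avg_line_len": sum(len(l) for l in lines) / n,
        "tab_indent_ratio": sum(l.startswith("\t") for l in lines) / n,
        "blank_line_ratio": sum(not l.strip() for l in lines) / n,
        "brace_own_line_ratio": sum(l.strip() == "{" for l in lines) / n,
        # Lexical cues
        "comment_density": len(re.findall(r"//|#|/\*", src)) / n,
        "whitespace_ratio": sum(c.isspace() for c in src) / max(len(src), 1),
    }

print(layout_features("if (x)\n{\n\treturn 0; // early exit\n}\n"))
```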
Feature selection is commonly performed via information gain, mutual information (Caliskan et al., 2015, Bogomolov et al., 2020), or discriminative filtering to reduce high cardinality (often hundreds of thousands of initial features). Language-agnostic approaches using AST paths and path-contexts (e.g., code2vec representations) have demonstrated cross-language portability (Bogomolov et al., 2020). In practice, a hybrid combination of these features provides the most robust performance, particularly when supported by interpretation tools such as SHAP for tree-based models (Joshi et al., 2024).
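The discriminative-filtering step can be sketched with scikit-learn's mutual-information scorer; the synthetic matrix below stands in for the large initial feature sets mentioned above:

```python
# Prune a high-cardinality feature matrix by mutual information with labels.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((200, 5000))          # 200 samples, 5,000 candidate features
y = rng.integers(0, 10, size=200)    # 10 candidate authors

# Keep the 500 features with the highest mutual information with the label.
selector = SelectKBest(mutual_info_classif, k=500).fit(X, y)
print(selector.transform(X).shape)   # (200, 500)
```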
Surface features—such as identifiers—have ambiguous effects: certain identifier types (notably class/object names) can be highly discriminative, whereas method and simple variable names often add noise; globally renaming all user-defined identifiers can actually increase overall attribution accuracy by removing generic, non-informative tokens (Frantzeskou et al., 2021). A renaming sketch follows.
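The sketch, for Python only, maps every user-defined identifier to a canonical token; it is a hypothetical normalizer in the spirit of the study above, not its exact procedure:

```python
# Globally rename user-defined identifiers to id_0, id_1, ...
# Requires Python >= 3.9 for ast.unparse. Caveat: this also renames
# references to builtins; a real normalizer would whitelist them.
import ast

class RenameIdentifiers(ast.NodeTransformer):
    def __init__(self):
        self.mapping = {}

    def _canon(self, name: str) -> str:
        # Reuse the same canonical token for repeated occurrences of a name.
        return self.mapping.setdefault(name, f"id_{len(self.mapping)}")

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)        # also rewrites arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

src = "def total(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
print(ast.unparse(RenameIdentifiers().visit(ast.parse(src))))
```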
3. Learning Architectures and Attribution Pipelines
The primary supervised learning paradigms for SCAA span classic machine learning, deep neural networks, ensemble stacking, and—emergently—transformer-based LLMs:
Traditional ML:
- Random Forests: frequently used with token/AST-path features; ensemble voting provides robust multiclass outputs (Dauber et al., 2017, Caliskan et al., 2015, Joshi et al., 2024). Feature importances can be directly interpreted.
- SVMs, XGBoost, Gradient Boosting: used for higher-dimensional or structured feature vectors.
- SCAP (Source Code Author Profiles): n-gram intersection between author “profiles” and the code sample (Frantzeskou et al., 2021); sketched below.
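A minimal SCAP-style sketch, assuming byte-level 6-grams and a profile size L of 2000; both parameters are illustrative:

```python
# SCAP: author profiles as most-frequent n-gram sets; attribution by
# Simplified Profile Intersection (largest profile overlap).
from collections import Counter

def ngrams(text: str, n: int = 6):
    return (text[i:i + n] for i in range(len(text) - n + 1))

def profile(texts, n: int = 6, L: int = 2000) -> set:
    counts = Counter()
    for t in texts:
        counts.update(ngrams(t, n))
    return {g for g, _ in counts.most_common(L)}

def attribute(sample: str, author_profiles: dict) -> str:
    sp = profile([sample])
    return max(author_profiles, key=lambda a: len(author_profiles[a] & sp))

profiles = {"alice": profile(["for(int i=0;i<n;++i) s+=a[i];"]),
            "bob":   profile(["total = sum(values)"])}
print(attribute("for(int j=0;j<m;++j) t+=b[j];", profiles))
```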
Deep Learning:
- DNNs/LSTMs/RNNs: sequence and bag-of-token features; bi-LSTM encoders combine n-gram/statistical input; softmax for multiclass output (Abuhamad et al., 2023, Li et al., 2022).
- CNNs: 1D convolutions over n-gram or embedding sequences for encoding syntax and order (Abuhamad et al., 2023).
- Stacked (Ensemble) Models: multiple heterogeneous base classifiers (RF, SVM, DNN variants) whose outputs are concatenated as meta-features and ingested by a second-level neural meta-learner; this consistently yields higher accuracy, especially in the multi-author scenario (Mahbub et al., 2022). Sketched below.
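A minimal stacking sketch with scikit-learn; the synthetic feature matrix and eight author groups stand in for real stylometric/AST inputs:

```python
# Heterogeneous stacking: base-classifier probabilities become meta-features
# for a small neural meta-learner.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X, y = rng.random((300, 50)), rng.integers(0, 8, size=300)  # 8 author groups

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("svm", SVC(probability=True))],
    final_estimator=MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000),
    stack_method="predict_proba",   # concatenate class probabilities
)
stack.fit(X, y)
print(stack.predict(X[:3]))
```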
Language-Aware Transformers:
- CodeBERT, GraphCodeBERT, UniXcoder, Code Llama, DeepSeek-Coder: encoder, graph-aware, and decoder models fine-tuned to predict authorship from code tokens; code LLMs require careful hyperparameter tuning (batch size, learning rate, LoRA) and sometimes necessitate input chunking for long code (Dipongkor et al., 20 Jun 2025).
- LLM Zero-/Few-Shot Prompting: Off-the-shelf LLMs (GPT-4o, Gemini 1.5-Pro, etc.) can perform attribution with “same-author verification” or few-shot identification via in-context learning. Tournament-style querying can scale attribution to hundreds of authors under context limitations (Choi et al., 14 Jan 2025).
- Transformer Head Adaptation: Custom decoderless transformer heads (e.g., CodeT5-JSA for JavaScript) can dramatically improve multi-class attribution performance, especially for code generated by LLMs (Tihanyi et al., 12 Oct 2025).
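As a hedged sketch of the fine-tuning route, the snippet below wraps the real microsoft/codebert-base checkpoint in a sequence-classification head; the label count, sample, and single gradient step are placeholders, and long files would need the chunking noted above:

```python
# Fine-tuning an encoder code model for attribution as sequence classification.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=100)   # 100 candidate authors

batch = tok(["for(int i=0;i<n;++i) s+=a[i];"], return_tensors="pt",
            truncation=True, max_length=512)     # longer files: chunk inputs
labels = torch.tensor([42])                      # placeholder author index

out = model(**batch, labels=labels)              # cross-entropy loss
out.loss.backward()                              # one illustrative update step
print(float(out.loss))
```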
4. Empirical Performance and Dataset Considerations
Reported SCAA accuracy varies by language, number of authors, fragment granularity, feature set, and learning architecture:
| Dataset / Model | # Authors | # Samples | Accuracy (Top-1) | Reference |
|---|---|---|---|---|
| GCJ C++ (RF, LSTM, DNN) | 100–200 | 900–1800 | 88–96% | (Caliskan et al., 2015, Abuhamad et al., 2023, Dauber et al., 2017) |
| GitHub/Competitive C++ (RF) | 50 | 300–1000 | 65% | (Caliskan et al., 2015) |
| Python/Java/Multilingual | 70–200 | 1000–2500 | 88–98% | (Bogomolov et al., 2020, Li et al., 2022) |
| LLM Prompting (C++/Java, LLM Tournament) | 500–686 | 26,000–55,000 | 65–69% | (Choi et al., 14 Jan 2025) |
| Multi-author segments (Python, Stacking Ensemble) | 8 (groups) | 6063 | 87% | (Mahbub et al., 2022) |
| Small, incomplete fragments | 106 | ~100 / author | 60–75% (single); 99% (multi-aggregated) | (Dauber et al., 2017) |
| JavaScript (LLM-NodeJS, CodeT5-JSA) | 5 / 10 / 20 | 250,000 | 95.8% / 94.6% / 88.5% | (Tihanyi et al., 12 Oct 2025) |
On clean, single-author data with language/style-constrained contexts, state-of-the-art models regularly attain near-perfect accuracy. When evaluated in context-separated, temporally-separated, or multi-author conditions, accuracy can fall to 20–30% (Bogomolov et al., 2020, Mahbub et al., 2022). Models trained on realistic industry datasets, where author style evolves and project conventions dominate, are much less accurate than on synthetic or balanced problems.
5. Security, Robustness, and Adversarial Perspectives
SCAA models—regardless of method—are highly vulnerable to adversarially crafted, semantics-preserving code transformations. Black-box code insertion (e.g., dead code snippets, unreachable branches), Monte Carlo Tree Search–guided transformation sequences, and automatic style imitation can reduce attribution accuracy to near-random (≤1–5%) (Quiring et al., 2019, Abuhamad et al., 2023). Targeted attacks (impersonation of another author) show 66–88% success rates on standard methods when only a few semantically null lines are added (Abuhamad et al., 2023).
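To make the attack class concrete, the sketch below injects an unreachable branch into C-like source; real attacks (e.g., MCTS-guided transformation search) are far more systematic, and this snippet only illustrates a semantics-preserving perturbation:

```python
# Black-box dead-code insertion: the injected branch never executes, so
# program semantics are preserved while surface statistics shift.
DEAD_CODE = "if (0) { volatile int __pad = 1337; }  /* never executes */\n"

def inject_dead_code(src: str, every: int = 2) -> str:
    """Insert the dead branch after every `every`-th statement line."""
    out = []
    for i, line in enumerate(src.splitlines(keepends=True), start=1):
        out.append(line)
        if i % every == 0 and line.rstrip().endswith(";"):
            out.append(DEAD_CODE)
    return "".join(out)

print(inject_dead_code("a = 1;\nb = a + 1;\nc = a * b;\nd = c - a;\n"))
```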
Defenses investigated include:
- Adversarial Training: Augmenting training data with perturbed/adversarial examples, as in RoPGen, reduces attacker success rates by 22–41% (Li et al., 2022).
- Normalize-and-Predict (N&P): Preprocessing all code via deterministic normalization provably blocks whole classes of relational (equivalence-class–based) attacks, with robust accuracy gains of 45–70 pp over vanilla or adversarially trained deep networks (Wang et al., 2020); a normalization sketch follows this list.
- Structural/Deep Stylometry: Moving from surface features to AST/data-flow–based signatures increases robustness under obfuscation (Tihanyi et al., 12 Oct 2025).
- Prompt Engineering: LLMs prompted to focus on “persistent author-specific traits” can partially resist style-transfer attacks, with resilience improving by ~10 pp (Choi et al., 14 Jan 2025).
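A toy normalizer in the N&P spirit, using Python's tokenize module to strip comments and let untokenize impose canonical spacing; an actual N&P deployment uses a richer, provably relational-attack-blocking transformation set:

```python
# Deterministic normalization: drop comments and re-emit tokens with
# canonical spacing, so stylistic variants map to one representative.
import io
import tokenize

def normalize(src: str) -> str:
    toks = [t for t in tokenize.generate_tokens(io.StringIO(src).readline)
            if t.type != tokenize.COMMENT]
    # 2-tuples trigger untokenize's compatibility mode, which discards the
    # author's original spacing.
    return tokenize.untokenize((t.type, t.string) for t in toks)

print(normalize("x  =  1   # author habit: extra spaces\n"))
```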
However, no known transformation offers universal $k$-anonymity for code authorship: the problem is formally undecidable due to the equivalence problem for Turing-complete languages (Horlboge et al., 2022). The weaker measure of $k$-uncertainty (closeness of attribution confidences among the top $k$ authors) can be increased empirically with heavy obfuscation (as with Tigress), but even sophisticated transformations only yield practical privacy on select datasets and can often be neutralized when the attacker adapts classifier training (Horlboge et al., 2022).
6. Limitations, Open Problems, and Future Directions
A number of critical research directions are highlighted in the literature:
- Multi-author and Mixed-style Segments: Attribution for segments co-authored or containing heavily borrowed code is largely unsolved; ensemble stacking marginally increases accuracy for group labels, but with unknown scalability beyond small cases (Mahbub et al., 2022).
- Cross-language, Multi-project, and Context Variation: SCAA performance degrades when training and testing across different projects, time intervals, or language boundaries. Generalization remains an open challenge (Bogomolov et al., 2020).
- Evolving Coding Style (Temporal Drift): Developers’ patterns change over time; models trained on early commits underperform on later contributions (Bogomolov et al., 2020).
- Open-world and Unknown-author Attribution: Explicit mechanisms for “author unknown” submission remain limited to confidence thresholding; calibration curves improve robustness but may discard valid predictions (Dauber et al., 2017, Caliskan et al., 2015).
- Binary/Obfuscated Attribution: Machine learning on decompiled binaries, combining disassembly n-grams, ASTs, and statistical distributions, is effective with 65–96% accuracy over 50–100 known authors even after compilation, optimization, and moderate obfuscation (Caliskan et al., 2015).
- Privacy/Defensive Transformations: General, automated $k$-anonymizing (or high $k$-uncertainty) transformations are provably impossible. Practical privacy mechanisms require black-box adversarial awareness, extensive code rewriting, and continual arms-race adaptation (Horlboge et al., 2022).
- Interpretability and Plagiarism Analysis: SHAP and integrated gradients provide model-level explanation of decisive stylometric cues, supporting forensic evidence and diagnosis of authorship features (Joshi et al., 2024, Dipongkor et al., 20 Jun 2025); a SHAP usage sketch follows this list.
- Scalability: Systems such as LLM tournament-based few-shot attribution enable scaling to hundreds of authors, but efficient attribution at greater scale, in streaming or evolving corpora, with provable calibration, remains unresolved (Choi et al., 14 Jan 2025).
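A short SHAP usage sketch for a tree-based attributor; the feature matrix is synthetic and the shap package must be installed separately:

```python
# Explain a Random Forest attributor's decisions with SHAP values.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.random((200, 20))            # stylometric feature vectors
y = rng.integers(0, 5, size=200)     # 5 candidate authors

model = RandomForestClassifier(n_estimators=100).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])   # per-feature, per-class contributions
print(np.shape(shap_values))
```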
A plausible implication is that ensemble, explainable, and normalization-augmented models, supported by adversarial training, currently define the frontier for robust, high-accuracy authorship attribution in practical settings. Open challenges remain in non-cooperative, real-world environments, where project, temporal, and author multiplicity conspire with adversarial activity to limit attribution confidence and reliability.
7. Key Resources and Selected Comparative Results
| Approach | Language(s) | Author Classes | Adversarial Robustness | Key Strengths | Reference |
|---|---|---|---|---|---|
| RF/AST/CFG + DNN ensemble | Python | 8 (groups) | Not evaluated | Multi-author, stacking gain | (Mahbub et al., 2022) |
| CNN/RNN/Stylometry | C++ | 200 | Weak | High clean accuracy | (Abuhamad et al., 2023) |
| RoPGen (Adversarial Training) | C/C++/Java | 40–204 | Improves 22–41% | Reduces attack success rate | (Li et al., 2022) |
| Language-agnostic PbRF, PbNN | Java/C++/Py | 40–1600 | Not robust to context | High accuracy (benchmarks) | (Bogomolov et al., 2020) |
| LLM tournament/few-shot | C++/Java | 500–686 | Partial | Zero-/few-shot generalization | (Choi et al., 14 Jan 2025) |
| Structural Transformer (CodeT5-JSA) | JS | 5–20 (LLMs) | Effective | Robust to mangling, scalable | (Tihanyi et al., 12 Oct 2025) |
| Normalize-and-Predict | C++ | 204 | 70% gain, provable | Efficient, adversarial defense | (Wang et al., 2020) |
| Binary attribution | C/C++ | 50–600 | Moderate | Works post-compile, stripped | (Caliskan et al., 2015) |
These results reflect progress from classic stylometric techniques to advanced neural and transformer approaches, with special attention to design for interpretability, scalability, and adversarial resilience. Continuous evaluation on realistic cross-context, multi-author, and adversarially perturbed corpora is necessary for future progress in source code authorship attribution.