Source Code Authorship Attribution
- SCAA is the computational task of inferring the author of code samples by analyzing distinctive stylistic, structural, and semantic patterns.
- It combines traditional stylometry with deep learning and transformer models, achieving high accuracy in controlled settings despite challenges in adversarial environments.
- Its practical applications include software forensics, plagiarism detection, and intellectual property litigation, driving ongoing research in interpretability and robustness.
Source Code Authorship Attribution (SCAA) is the computational task of inferring the author of a code sample—at the file, fragment, project, or even binary level—by exploiting stylistic, structural, or semantic coding patterns. SCAA has direct applications in software forensics, plagiarism detection, malware lineage tracing, and intellectual property litigation. While methods span classic stylometry, deep learning, robust adversarial modeling, and LLMs, the field faces unique challenges due to the highly structured, variable, and obfuscatable nature of source code.
1. Technical Foundations and Problem Formulation
SCAA formalizes the problem as multiclass supervised classification over a closed—or, less often, open—set of candidate authors. Given a training corpus $D = \{(x_i, a_i)\}_{i=1}^{N}$ of source code samples $x_i$ with author labels $a_i \in \mathcal{A}$ and an unseen code sample $x$, the goal is to construct a function $f : \mathcal{X} \to \mathcal{A} \cup \{\bot\}$ that correctly assigns $x$ to its true author or abstains ($\bot$). Classification accuracy, precision, recall, and $F_1$ score are standard metrics, typically macro- or micro-averaged over the author set $\mathcal{A}$ (Bogomolov et al., 2020). Some systems also support confidence-calibrated output for "open-world" rejection, e.g., rejecting low-confidence attributions (Dauber et al., 2017).
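As a minimal sketch of this formulation, the snippet below trains a closed-set classifier and abstains on low-confidence inputs. The toy corpus, the character n-gram pipeline, and the 0.6 rejection threshold are all illustrative assumptions, not a system from the cited work:

```python
# Closed-set attribution with open-world abstention via confidence thresholding.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: code samples (strings) with author labels.
train_code = ["for(int i=0;i<n;++i) sum+=a[i];",
              "total = sum(values)  # pythonic habit",
              "while (p != NULL) { p = p->next; }"]
train_authors = ["alice", "bob", "alice"]

# Character n-grams capture layout and naming habits without parsing.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_code, train_authors)

def attribute(sample: str, reject_below: float = 0.6) -> str:
    """Return the predicted author, or abstain on low confidence."""
    probs = clf.predict_proba([sample])[0]
    best = probs.argmax()
    return clf.classes_[best] if probs[best] >= reject_below else "<unknown>"

print(attribute("for(int j=0;j<m;++j) acc+=b[j];"))
```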
SCAA can target whole files, small incomplete fragments, or binary executables (Caliskan et al., 2015). Multi-author segments, code written by groups or with heavy borrowing, present a further complication addressed by tailored modeling (Mahbub et al., 2022). Adversarial robustness (i.e., reliable prediction under style obfuscation or adversarial perturbation) is now a central focus (Abuhamad et al., 2023, Quiring et al., 2019).
2. Stylometric and Syntactic Feature Engineering
Early and ongoing SCAA research relies on coding style ("stylometry") as found in layout, identifier naming, indentation, and syntactic habits. Features fall into several broad categories (a feature-extraction sketch follows the list):
- Lexical: word n-grams, character n-grams (Dauber et al., 2017, Frantzeskou et al., 2021), token frequency, API symbols, comment ratios.
- Syntactic: AST (abstract syntax tree) node frequencies, syntactic bigrams, subtree depth distributions, parse tree statistics (Bogomolov et al., 2020, Joshi et al., 2024).
- Structural: control-flow graph node counts, statement and block types, function parameter patterns.
- Layout/Style: whitespace ratios, brace placement, comment density, line length, indentation habits (Mahbub et al., 2022, Joshi et al., 2024).
- Semantic/Dynamic: binary disassembly length, execution-time/memory usage, code complexity metrics (Joshi et al., 2024).
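A minimal sketch of hand-crafted layout and lexical features of the kinds listed above; the particular feature names and normalizations are illustrative choices rather than a fixed set from any cited system:

```python
# Extract simple stylometric features from raw source text.
import re

def layout_features(src: str) -> dict:
    lines = src.splitlines() or [""]
    n = len(lines)
    return {
        # Layout/style cues
        "avg_line_len": sum(len(l) for l in lines) / n,
        "tab_indent_ratio": sum(l.startswith("\t") for l in lines) / n,
        "blank_line_ratio": sum(not l.strip() for l in lines) / n,
        "brace_own_line_ratio": sum(l.strip() == "{" for l in lines) / n,
        # Lexical cues
        "comment_density": len(re.findall(r"//|#|/\*", src)) / n,
        "whitespace_ratio": sum(c.isspace() for c in src) / max(len(src), 1),
    }

print(layout_features("if (x)\n{\n\treturn 0; // early exit\n}\n"))
```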
Feature selection is commonly performed via information gain, mutual information (Caliskan et al., 2015, Bogomolov et al., 2020), or discriminative filtering to reduce high cardinality (often hundreds of thousands of initial features). Language-agnostic approaches using AST paths and path-contexts (e.g., code2vec representations) have demonstrated cross-language portability (Bogomolov et al., 2020). In practice, a hybrid combination of these features provides the most robust performance, particularly when supported by interpretation tools such as SHAP for tree-based models (Joshi et al., 2024).
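The discriminative-filtering step can be sketched with scikit-learn's mutual-information scorer; the synthetic matrix below stands in for the large initial feature sets mentioned above:

```python
# Prune a high-cardinality feature matrix by mutual information with labels.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((200, 5000))          # 200 samples, 5,000 candidate features
y = rng.integers(0, 10, size=200)    # 10 candidate authors

# Keep the 500 features with the highest mutual information with the label.
selector = SelectKBest(mutual_info_classif, k=500).fit(X, y)
print(selector.transform(X).shape)   # (200, 500)
```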
Surface features—such as identifiers—have ambiguous effects: certain identifier types (notably class/object names) can be highly discriminative, whereas method and simple variable names often add noise; globally renaming all user-defined identifiers can actually increase overall attribution accuracy by removing generic, non-informative tokens (Frantzeskou et al., 2021). A renaming sketch follows.
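The sketch, for Python only, maps every user-defined identifier to a canonical token; it is a hypothetical normalizer in the spirit of the study above, not its exact procedure:

```python
# Globally rename user-defined identifiers to id_0, id_1, ...
# Requires Python >= 3.9 for ast.unparse. Caveat: this also renames
# references to builtins; a real normalizer would whitelist them.
import ast

class RenameIdentifiers(ast.NodeTransformer):
    def __init__(self):
        self.mapping = {}

    def _canon(self, name: str) -> str:
        # Reuse the same canonical token for repeated occurrences of a name.
        return self.mapping.setdefault(name, f"id_{len(self.mapping)}")

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)        # also rewrites arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

src = "def total(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
print(ast.unparse(RenameIdentifiers().visit(ast.parse(src))))
```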
3. Learning Architectures and Attribution Pipelines
The primary supervised learning paradigms for SCAA span classic machine learning, deep neural networks, ensemble stacking, and—emergently—transformer-based LLMs:
Traditional ML:
- Random Forests: frequently used with token/AST-path features; ensemble voting provides robust multiclass outputs (Dauber et al., 2017, Caliskan et al., 2015, Joshi et al., 2024). Feature importances can be directly interpreted.
- SVMs, XGBoost, Gradient Boosting: used for higher-dimensional or structured feature vectors.
- SCAP (Source Code Author Profiles): n-gram intersection between author “profiles” and the code sample (Frantzeskou et al., 2021); sketched below.
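A minimal SCAP-style sketch, assuming byte-level 6-grams and a profile size L of 2000; both parameters are illustrative:

```python
# SCAP: author profiles as most-frequent n-gram sets; attribution by
# Simplified Profile Intersection (largest profile overlap).
from collections import Counter

def ngrams(text: str, n: int = 6):
    return (text[i:i + n] for i in range(len(text) - n + 1))

def profile(texts, n: int = 6, L: int = 2000) -> set:
    counts = Counter()
    for t in texts:
        counts.update(ngrams(t, n))
    return {g for g, _ in counts.most_common(L)}

def attribute(sample: str, author_profiles: dict) -> str:
    sp = profile([sample])
    return max(author_profiles, key=lambda a: len(author_profiles[a] & sp))

profiles = {"alice": profile(["for(int i=0;i<n;++i) s+=a[i];"]),
            "bob":   profile(["total = sum(values)"])}
print(attribute("for(int j=0;j<m;++j) t+=b[j];", profiles))
```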
Deep Learning:
- DNNs/LSTMs/RNNs: sequence and bag-of-token features; bi-LSTM encoders combine n-gram/statistical input; softmax for multiclass output (Abuhamad et al., 2023, Li et al., 2022).
- CNNs: 1D convolutions over n-gram or embedding sequences for encoding syntax and order (Abuhamad et al., 2023).
- Stacked (Ensemble) Models: multiple heterogeneous base classifiers (RF, SVM, DNN variants) whose outputs are concatenated as meta-features and ingested by a second-level neural meta-learner; this consistently yields higher accuracy, especially in the multi-author scenario (Mahbub et al., 2022). Sketched below.
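A minimal stacking sketch with scikit-learn; the synthetic feature matrix and eight author groups stand in for real stylometric/AST inputs:

```python
# Heterogeneous stacking: base-classifier probabilities become meta-features
# for a small neural meta-learner.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X, y = rng.random((300, 50)), rng.integers(0, 8, size=300)  # 8 author groups

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100)),
                ("svm", SVC(probability=True))],
    final_estimator=MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000),
    stack_method="predict_proba",   # concatenate class probabilities
)
stack.fit(X, y)
print(stack.predict(X[:3]))
```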
Language-Aware Transformers:
- CodeBERT, GraphCodeBERT, UniXcoder, Code Llama, DeepSeek-Coder: encoder, graph-aware, and decoder models fine-tuned to predict authorship from code tokens; code LLMs require careful hyperparameter tuning (batch size, learning rate, LoRA) and sometimes necessitate input chunking for long code (Dipongkor et al., 20 Jun 2025).
- LLM Zero-/Few-Shot Prompting: Off-the-shelf LLMs (GPT-4o, Gemini 1.5-Pro, etc.) can perform attribution with “same-author verification” or few-shot identification via in-context learning. Tournament-style querying can scale attribution to hundreds of authors under context limitations (Choi et al., 14 Jan 2025).
- Transformer Head Adaptation: Custom decoderless transformer heads (e.g., CodeT5-JSA for JavaScript) can dramatically improve multi-class attribution performance, especially for code generated by LLMs (Tihanyi et al., 12 Oct 2025).
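As a hedged sketch of the fine-tuning route, the snippet below wraps the real microsoft/codebert-base checkpoint in a sequence-classification head; the label count, sample, and single gradient step are placeholders, and long files would need the chunking noted above:

```python
# Fine-tuning an encoder code model for attribution as sequence classification.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=100)   # 100 candidate authors

batch = tok(["for(int i=0;i<n;++i) s+=a[i];"], return_tensors="pt",
            truncation=True, max_length=512)     # longer files: chunk inputs
labels = torch.tensor([42])                      # placeholder author index

out = model(**batch, labels=labels)              # cross-entropy loss
out.loss.backward()                              # one illustrative update step
print(float(out.loss))
```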
4. Empirical Performance and Dataset Considerations
Reported SCAA accuracy varies by language, number of authors, fragment granularity, feature set, and learning architecture:
| Dataset / Model | # Authors | # Samples | Accuracy (Top-1) | Reference |
|---|---|---|---|---|
| GCJ C++ (RF, LSTM, DNN) | 100–200 | 900–1800 | 88–96% | (Caliskan et al., 2015, Abuhamad et al., 2023, Dauber et al., 2017) |
| GitHub/Competitive C++ (RF) | 50 | 300–1000 | 65% | (Caliskan et al., 2015) |
| Python/Java/Multilingual | 70–200 | 1000–2500 | 88–98% | (Bogomolov et al., 2020, Li et al., 2022) |
| LLM Prompting (C++/Java, LLM Tournament) | 500–686 | 26,000–55,000 | 65–69% | (Choi et al., 14 Jan 2025) |
| Multi-author segments (Python, Stacking Ensemble) | 8 (groups) | 6063 | 87% | (Mahbub et al., 2022) |
| Small, incomplete fragments | 106 | ~100 / author | 60–75% (single); 99% (multi-aggregated) | (Dauber et al., 2017) |
| JavaScript (LLM-NodeJS, CodeT5-JSA) | 5 / 10 / 20 | 250,000 | 95.8% / 94.6% / 88.5% | (Tihanyi et al., 12 Oct 2025) |
On clean, single-author data with language/style-constrained contexts, state-of-the-art models regularly attain near-perfect accuracy. When evaluated in context-separated, temporally-separated, or multi-author conditions, accuracy can fall to 20–30% (Bogomolov et al., 2020, Mahbub et al., 2022). Models trained on realistic industry datasets, where author style evolves and project conventions dominate, are much less accurate than on synthetic or balanced problems.
5. Security, Robustness, and Adversarial Perspectives
SCAA models—regardless of method—are highly vulnerable to adversarially crafted, semantics-preserving code transformations. Black-box code insertion (e.g., dead code snippets, unreachable branches), Monte Carlo Tree Search–guided transformation sequences, and automatic style imitation can reduce attribution accuracy to near-random (≤1–5%) (Quiring et al., 2019, Abuhamad et al., 2023). Targeted attacks (impersonation of another author) show 66–88% success rates on standard methods when only a few semantically null lines are added (Abuhamad et al., 2023).
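To make the attack class concrete, the sketch below injects an unreachable branch into C-like source; real attacks (e.g., MCTS-guided transformation search) are far more systematic, and this snippet only illustrates a semantics-preserving perturbation:

```python
# Black-box dead-code insertion: the injected branch never executes, so
# program semantics are preserved while surface statistics shift.
DEAD_CODE = "if (0) { volatile int __pad = 1337; }  /* never executes */\n"

def inject_dead_code(src: str, every: int = 2) -> str:
    """Insert the dead branch after every `every`-th statement line."""
    out = []
    for i, line in enumerate(src.splitlines(keepends=True), start=1):
        out.append(line)
        if i % every == 0 and line.rstrip().endswith(";"):
            out.append(DEAD_CODE)
    return "".join(out)

print(inject_dead_code("a = 1;\nb = a + 1;\nc = a * b;\nd = c - a;\n"))
```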
Defenses investigated include:
- Adversarial Training: Augmenting training data with perturbed/adversarial examples, as in RoPGen, reduces attacker success rates by 22–41% (Li et al., 2022).
- Normalize-and-Predict (N&P): Preprocessing all code via deterministic normalization provably blocks whole classes of relational (equivalence-class–based) attacks, with robust accuracy gains of 45–70 pp over vanilla or adversarially trained deep networks (Wang et al., 2020); a normalization sketch follows this list.
- Structural/Deep Stylometry: Moving from surface features to AST/data-flow–based signatures increases robustness under obfuscation (Tihanyi et al., 12 Oct 2025).
- Prompt Engineering: LLMs prompted to focus on “persistent author-specific traits” can partially resist style-transfer attacks, with resilience improving by ~10 pp (Choi et al., 14 Jan 2025).
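A toy normalizer in the N&P spirit, using Python's tokenize module to strip comments and let untokenize impose canonical spacing; an actual N&P deployment uses a richer, provably relational-attack-blocking transformation set:

```python
# Deterministic normalization: drop comments and re-emit tokens with
# canonical spacing, so stylistic variants map to one representative.
import io
import tokenize

def normalize(src: str) -> str:
    toks = [t for t in tokenize.generate_tokens(io.StringIO(src).readline)
            if t.type != tokenize.COMMENT]
    # 2-tuples trigger untokenize's compatibility mode, which discards the
    # author's original spacing.
    return tokenize.untokenize((t.type, t.string) for t in toks)

print(normalize("x  =  1   # author habit: extra spaces\n"))
```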
However, no known transformation offers universal $k$-anonymity for code authorship: the problem is formally undecidable due to the equivalence problem for Turing-complete languages (Horlboge et al., 2022). The weaker measure of $k$-uncertainty (closeness of attribution confidences among the top $k$ authors) can be increased empirically with heavy obfuscation (as with Tigress), but even sophisticated transformations only yield practical privacy on select datasets and can often be neutralized when the attacker adapts classifier training (Horlboge et al., 2022).
6. Limitations, Open Problems, and Future Directions
A number of critical research directions are highlighted in the literature:
- Multi-author and Mixed-style Segments: Attribution for segments co-authored or containing heavily borrowed code is largely unsolved; ensemble stacking marginally increases accuracy for group labels, but with unknown scalability beyond small cases (Mahbub et al., 2022).
- Cross-language, Multi-project, and Context Variation: SCAA performance degrades when training and testing across different projects, time intervals, or language boundaries. Generalization remains an open challenge (Bogomolov et al., 2020).
- Evolving Coding Style (Temporal Drift): Developers’ patterns change over time; models trained on early commits underperform on later contributions (Bogomolov et al., 2020).
- Open-world and Unknown-author Attribution: Explicit mechanisms for “author unknown” submission remain limited to confidence thresholding; calibration curves improve robustness but may discard valid predictions (Dauber et al., 2017, Caliskan et al., 2015).
- Binary/Obfuscated Attribution: Machine learning on decompiled binaries, combining disassembly n-grams, ASTs, and statistical distributions, is effective with 65–96% accuracy over 50–100 known authors even after compilation, optimization, and moderate obfuscation (Caliskan et al., 2015).
- Privacy/Defensive Transformations: General, automated $k$-anonymizing (or high $k$-uncertainty) transformations are provably impossible. Practical privacy mechanisms require black-box adversarial awareness, extensive code rewriting, and continual arms-race adaptation (Horlboge et al., 2022).
- Interpretability and Plagiarism Analysis: SHAP and integrated gradients provide model-level explanation of decisive stylometric cues, supporting forensic evidence and diagnosis of authorship features (Joshi et al., 2024, Dipongkor et al., 20 Jun 2025); a SHAP usage sketch follows this list.
- Scalability: Systems such as LLM tournament-based few-shot attribution enable scaling to hundreds of authors, but efficient attribution at greater scale, in streaming or evolving corpora, with provable calibration, remains unresolved (Choi et al., 14 Jan 2025).
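A short SHAP usage sketch for a tree-based attributor; the feature matrix is synthetic and the shap package must be installed separately:

```python
# Explain a Random Forest attributor's decisions with SHAP values.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.random((200, 20))            # stylometric feature vectors
y = rng.integers(0, 5, size=200)     # 5 candidate authors

model = RandomForestClassifier(n_estimators=100).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])   # per-feature, per-class contributions
print(np.shape(shap_values))
```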
A plausible implication is that ensemble, explainable, and normalization-augmented models, supported by adversarial training, currently define the frontier for robust, high-accuracy authorship attribution in practical settings. Open challenges remain in non-cooperative, real-world environments, where project, temporal, and author multiplicity conspire with adversarial activity to limit attribution confidence and reliability.
7. Key Resources and Selected Comparative Results
| Approach | Language(s) | Author Classes | Adversarial Robustness | Key Strengths | Reference |
|---|---|---|---|---|---|
| RF/AST/CFG + DNN ensemble | Python | 8 (groups) | Not evaluated | Multi-author, stacking gain | (Mahbub et al., 2022) |
| CNN/RNN/Stylometry | C++ | 200 | Weak | High clean accuracy | (Abuhamad et al., 2023) |
| RoPGen (Adversarial Training) | C/C++/Java | 40–204 | Improves 22–41% | Reduces attack success rate | (Li et al., 2022) |
| Language-agnostic PbRF, PbNN | Java/C++/Py | 40–1600 | Not robust to context | High accuracy (benchmarks) | (Bogomolov et al., 2020) |
| LLM tournament/few-shot | C++/Java | 500–686 | Partial | Zero-/few-shot generalization | (Choi et al., 14 Jan 2025) |
| Structural Transformer (CodeT5-JSA) | JS | 5–20 (LLMs) | Effective | Robust to mangling, scalable | (Tihanyi et al., 12 Oct 2025) |
| Normalize-and-Predict | C++ | 204 | 70% gain, provable | Efficient, adversarial defense | (Wang et al., 2020) |
| Binary attribution | C/C++ | 50–600 | Moderate | Works post-compile, stripped | (Caliskan et al., 2015) |
These results reflect progress from classic stylometric techniques to advanced neural and transformer approaches, with special attention to design for interpretability, scalability, and adversarial resilience. Continuous evaluation on realistic cross-context, multi-author, and adversarially perturbed corpora is necessary for future progress in source code authorship attribution.