Software Birthmarks in Code Analysis
- Software birthmarks are unique signatures derived from static and dynamic analyses that help identify software plagiarism.
- They leverage methods like CFG analysis, opcode n-grams, and embedding models to quantify code similarity accurately.
- Project Martial integrates these techniques in a modular ensemble system that ensures robust detection and empirical performance benchmarking.
Project Martial is an open-source, modular framework designed for the automated detection of software plagiarism, integrating the principal methodologies and insights established in code similarity analysis. The system is architected to allow researchers, instructors, and practitioners to combine static fingerprinting, dynamic birthmarking, and embedding-based detectors in a single unified pipeline. Project Martial is engineered around precise classification of detection artifacts, extensible plugin interfaces, and empirical benchmarking guided by legal and academic precedent (Folea et al., 1 Jan 2026).
1. System Architecture and Modularity
Project Martial is structured in three principal layers:
- Front-End Ingestion: Project Martial currently supports source code (SC) parsing, which encompasses stripping comments, extracting linter directives, and building abstract syntax trees. Instrumentation harnesses for binaries (BIN) enable dynamic analysis by executing compiled programs with profilers (e.g., Linux perf), collecting hardware-level metrics (BIN.D).
- Core Analysers: These plug-in modules encapsulate different detection strategies:
- Comment-Embedding Analyser (SC.S): Extracts and embeds code comments using pre-trained models such as RoBERTa or the Universal Sentence Encoder.
- Linter-Directive Analyser (SC.S): Processes linter directives for command pattern identification.
- Dynamic Complexity Analyser (BIN.D): Profiles binaries at run-time to generate execution complexity signatures.
The plug-in API is designed so that additional fingerprinting and birthmark modules (e.g., Winnowing, CFG spectrum, dynamic opcode n-grams) can be integrated with minimal code adaptation.
- Comparison & Indexing Engine: Each analyser emits signatures (e.g., vector embeddings, fingerprint sets) per submission. These signatures are indexed in structures such as LSH forests or inverted indices for efficient nearest-neighbor queries and pairwise similarity computations.
The full detection pipeline—parser invocation, feature stream extraction, analyser signature computation, storage, and scheduled similarity querying—produces a ranked list of suspect pairs for human review (Folea et al., 1 Jan 2026).
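The analyser plug-in contract implied by this pipeline can be sketched as follows. This is an illustrative stand-in, not Martial's actual API: the `Analyser` and `Signature` names are hypothetical.

```python
# Hypothetical sketch of a Martial-style analyser plug-in interface.
# Each analyser emits one Signature per submission; the comparison engine
# later calls similarity() on pairs of signatures from the same module.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Signature:
    module: str      # which analyser produced this signature
    payload: object  # e.g. a fingerprint set or an embedding vector


class Analyser(ABC):
    @abstractmethod
    def analyse(self, artifact_path: str) -> Signature:
        """Extract a signature from one submission."""

    @abstractmethod
    def similarity(self, a: Signature, b: Signature) -> float:
        """Pairwise similarity in [0, 1]."""
```

A concrete module (fingerprinting, CFG birthmark, embedding) would subclass `Analyser`, keeping the comparison engine agnostic to what the payload actually contains.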
2. Detection Methodologies
Project Martial supports the three dominant categories of software plagiarism detection techniques:
2.1 Fingerprinting
- Workflow: Generating k-gram tokens from program source, computing Rabin hash fingerprints, and selecting representative hashes via Winnowing.
- Signature: $F(P)$, the set of selected fingerprints for program $P$.
- Similarity: Jaccard index $J(P_1, P_2) = \frac{|F(P_1) \cap F(P_2)|}{|F(P_1) \cup F(P_2)|}$.
- Properties: Fingerprinting is robust to identifier renaming, whitespace changes, and minor reorderings, tunable via the k-gram length $k$ and the winnowing window size $w$.
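A minimal sketch of this workflow, with Python's built-in `hash` standing in for a rolling Rabin hash and illustrative defaults for $k$ and $w$:

```python
# Winnowing sketch (Schleimer et al.): hash all k-grams, then keep the
# minimum hash in each sliding window of w consecutive hashes.
def kgram_hashes(text: str, k: int = 5) -> list[int]:
    # A real implementation would use rolling Rabin hashes for O(1) updates;
    # Python's built-in hash() stands in here for simplicity.
    return [hash(text[i:i + k]) for i in range(len(text) - k + 1)]


def winnow(hashes: list[int], w: int = 4) -> set[int]:
    fingerprints = set()
    for i in range(len(hashes) - w + 1):
        fingerprints.add(min(hashes[i:i + w]))
    return fingerprints


def jaccard(a: set[int], b: set[int]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0
```

In practice the source would be tokenized and normalized before k-gram hashing, so that identifier renaming maps plagiarized variants onto the same fingerprints.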
2.2 Software Birthmarks
- Static CFG Birthmarks: Construct the control-flow graph $G = (V, E)$, compute the spectrum of its adjacency matrix, and use the top-$k$ eigenvalues as the feature vector $(\lambda_1, \dots, \lambda_k)$, with Euclidean distance for comparison.
- Dynamic Opcode n-gram Birthmarks: Record the sequence of executed opcodes and represent its n-gram histogram as a vector $v$; compare via cosine similarity.
- Properties: CFG-based methods are insensitive to variable names and focus on structural similarity. Dynamic signatures capture behavioral similarity and tolerate some instruction reordering.
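The opcode n-gram birthmark can be sketched as follows, assuming a recorded trace is available as a list of opcode mnemonics (the trace format is an assumption; real profiler output would need parsing first):

```python
# Dynamic opcode n-gram birthmark sketch: build an n-gram histogram over
# an executed-opcode trace and compare two traces by cosine similarity.
from collections import Counter
from math import sqrt


def ngram_histogram(opcodes: list[str], n: int = 2) -> Counter:
    # Each key is an n-tuple of consecutive opcodes; the value is its count.
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def cosine(h1: Counter, h2: Counter) -> float:
    dot = sum(h1[g] * h2[g] for g in h1)
    norm = sqrt(sum(v * v for v in h1.values())) * sqrt(sum(v * v for v in h2.values()))
    return dot / norm if norm else 0.0
```

Because the histogram discards absolute positions, locally reordered instructions perturb only a few n-gram counts, which is why the signature tolerates limited rearrangement.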
2.3 Code Embeddings
- Technique: Embed comments or larger code fragments with transformer-based models. Project Martial’s Comment-Embedding Analyser computes aggregated file-level embeddings for direct similarity comparison.
- Similarity: Cosine similarity of embedding vectors.
- Properties: Embedding methods detect paraphrased or obfuscated code, provided natural-language commentary persists.
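The extraction-and-aggregation half of such an analyser can be sketched as below. The sentence encoder itself (e.g. RoBERTa or USE) is deliberately omitted; `mean_pool` simply aggregates whatever per-comment vectors such a model would return into one file-level embedding.

```python
# Sketch of file-level comment embedding: extract comments from source,
# embed each with an external encoder (not shown), then mean-pool the
# resulting vectors into a single file-level vector.
import re


def extract_comments(source: str) -> list[str]:
    # Matches //-style and #-style line comments; block comments would
    # need additional patterns.
    return re.findall(r"(?://|#)\s*(.+)", source)


def mean_pool(vectors: list[list[float]]) -> list[float]:
    # Assumes a non-empty list of equal-length embedding vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]
```

Two file-level vectors produced this way are then compared with cosine similarity, as described above.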
Combining different analysers and requiring consensus among multiple detectors mitigates false positives, especially under advanced obfuscation scenarios (e.g., identifier renaming, branch inversion, loop unrolling) (Folea et al., 1 Jan 2026).
3. Classification, Adversarial Robustness, and Limitations
Project Martial’s architecture is grounded in explicit categorization of artifact types:
- SC.S: Static source code
- SC.D: Dynamic source code instrumentation (future work)
- BIN.S: Static binary analysis (future work)
- BIN.D: Dynamic binary profiling
Each detection module targets specific obfuscation challenges:
- Fingerprinting modules tolerate superficial edits (name changes, formatting).
- CFG birthmarks are independent of lexical features.
- Opcode n-gram birthmarks are resilient to limited instruction rearrangement but may degrade when $n$ is large or under aggressive optimizer passes.
- Embedding modules ignore code structure but require the presence of comments or documentation.
A majority-voting scheme among analysers is enforced to elevate only submissions flagged by at least two independent modules, addressing the risk of single-module evasion (Folea et al., 1 Jan 2026).
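The voting rule can be sketched as follows; the module names and per-module thresholds are illustrative, not values from the paper:

```python
# Majority-vote elevation rule: a submission pair is flagged for human
# review only if at least `quorum` independent analysers report a
# similarity score at or above that module's threshold.
def flagged(scores: dict[str, float],
            thresholds: dict[str, float],
            quorum: int = 2) -> bool:
    votes = sum(1 for module, score in scores.items()
                if score >= thresholds.get(module, 1.0))
    return votes >= quorum
```

Requiring two independent votes means an adversary must simultaneously evade detectors with disjoint feature spaces (e.g. lexical fingerprints and CFG structure), which is considerably harder than evading any single one.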
4. Related Approaches and Integration Pathways
Project Martial is not a stand-alone reimplementation of known methods; rather, it provides an orchestration layer for integrating:
- Fingerprinting tools (e.g., MOSS/Winnowing, Schleimer et al.),
- Birthmark algorithms (e.g., Myles & Collberg’s dynamic techniques, Lu et al.’s opcode n-grams),
- Embedding models (e.g., code2vec, code2seq).
The plug-in system enables rapid extension: for example, a Winnowing fingerprint plug-in can be implemented in ~200 lines of Go or Python. CFG-birthmark and code2vec modules can use external analyzers or inference libraries connected to the Martial API (Folea et al., 1 Jan 2026).
5. Indexing, Evaluation, and Performance Benchmarking
Project Martial’s comparison engine supports large-scale indexing:
- Vector storage: $O(Nd)$ space for $N$ stored $d$-dimensional vectors.
- Similarity search: LSH forest indexing provides top-$k$ approximate nearest-neighbor queries.
- Fingerprint indices: Inverted index structure scaling as $O(N\bar{m})$ for average fingerprint set size $\bar{m}$.
- Module-level metrics: Precision, recall, and $F_1$ computed per module and for ensemble combinations.
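The inverted fingerprint index can be sketched as follows: fingerprints map to the submissions that contain them, so candidate pairs surface without an exhaustive all-pairs scan.

```python
# Inverted fingerprint index sketch: map each fingerprint hash to the set
# of submissions containing it; any two submissions sharing a fingerprint
# become a candidate pair for full Jaccard comparison.
from collections import defaultdict
from itertools import combinations


def build_index(corpus: dict[str, set[int]]) -> dict[int, set[str]]:
    index = defaultdict(set)
    for submission, fingerprints in corpus.items():
        for fp in fingerprints:
            index[fp].add(submission)
    return index


def candidate_pairs(index: dict[int, set[str]]) -> set[tuple[str, str]]:
    pairs = set()
    for submissions in index.values():
        for a, b in combinations(sorted(submissions), 2):
            pairs.add((a, b))
    return pairs
```

Only candidate pairs then pay the cost of a full similarity computation, which is what keeps the pipeline tractable at corpus scale.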
Empirical results demonstrate:
- Comment analyser (USE): strong precision and recall at a similarity threshold of $0.75$ in the VSCode+Kubernetes comment experiments.
- Dynamic complexity birthmark: comparable precision and recall on profiled binaries.
- Ensemble majority-vote: the best overall performance on the suspicious-pairs test set.
- Martial’s ensemble surpasses MOSS, JPlag, and Andromeda in recall–precision tradeoff and obfuscated pair identification (Folea et al., 1 Jan 2026).
6. Evaluation Methodology and Real-World Benchmarks
Project Martial evaluates on diverse datasets:
- Open-source history: VSCode (25 releases, ESLint directives), Kubernetes (30 incremental releases, Go source code).
- Academic data: Corpus of ~200 Java programs with hand-labeled plagiarism relationships (obfuscations: renaming, loop unrolling, branch inversion).
- Experimental split: Training and test partitions (80/20), with parameter tuning by grid search, and benchmarked against external tools.
- Ranking principle: Only suspect pairs reaching a high similarity threshold in two or more independent modules are highlighted for manual review, operationalizing the “good faith” flagging principle.
7. Legal, Academic, and Practical Implications
The system design incorporates legal and pedagogical insights:
- Oracle v. Google: Focuses similarity detection away from API surface and toward natural language/comments (for static) and dynamic performance (for binaries).
- MOSS "good faith" principle: Outputs ranked lists instead of binary decisions to prevent overreliance on automated judgments.
- Obfuscation and LLMs: System modularity allows future integration of LLM-based sanitization/normalization steps to counter advanced transformations.
- Usability: Recognizing low automation tool adoption (8% among instructors), Martial provides both CLI and web UI options.
Project Martial thus constitutes a configurable ensemble platform for code similarity detection, directly integrating state-of-the-art analytical modules, scalable indexing, empirical benchmarking, and modular extensibility, all within a framework sensitive to the realities of software copyright and academic standards (Folea et al., 1 Jan 2026).