
Project Martial: Open-Source Plagiarism Detector

Updated 8 January 2026
  • Project Martial is an open-source, modular framework for automated software plagiarism detection that combines static fingerprinting, dynamic birthmarks, and code embeddings.
  • Its three-layer architecture processes source code through front-end ingestion, analyzer plug-ins, and an indexing engine for pairwise similarity comparisons.
  • Empirical evaluations report F1 scores above 0.90 for the ensemble of analysers, with improved recall on obfuscated samples and a lower false positive rate than individual analysers.

Project Martial is an open-source, modular framework for automated software plagiarism detection, integrating static fingerprinting, dynamic birthmark, and code embedding strategies into a unified system designed for extensibility and rigorous detection. By orchestrating a variety of classic and modern code similarity techniques, Project Martial aims to address the challenges posed by both superficial and semantically obfuscated code modifications in academic, professional, and legal contexts (Folea et al., 1 Jan 2026).

1. System Architecture and Workflow

Project Martial is structured into three core layers: front-end ingestion, plug-in analyser modules, and a comparison/indexing engine.

  • Front-End Ingestion: The system parses source code (SC) in the currently supported languages, stripping comments and linter directives and constructing abstract syntax trees. For dynamic or binary analysis, a dedicated harness collects execution-time data, such as performance counter traces, through direct instrumentation of binaries (BIN).
  • Core Analysers: The framework includes three primary analyser modules in its initial release:
    • Comment-Embedding Analyser (static, SC.S): Extracts and embeds code comments using transformer models for natural language similarity.
    • Linter-Directive Analyser (static, SC.S): Processes linter directives for similarity, often represented as one-hot vectors.
    • Dynamic Complexity Analyser (dynamic, BIN.D): Profiles binaries for dynamic execution features, such as CPU cycles and branch misses.
  • Indexing and Comparison Engine: Each analyser outputs a fixed-length numeric signature or set of fingerprints for each submission. These are inserted into a global index (e.g., LSH forest, inverted index), and pairwise similarity is computed either via all-pairs comparison or approximate nearest neighbor queries, with flagged pairs elevated for human review.
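
The inverted-index path of the comparison engine can be illustrated with a minimal sketch. It assumes each submission has already been reduced to a set of integer fingerprints; the function names, the `min_shared` cut-off, and the toy data are hypothetical and are not taken from Martial's code base.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(fingerprints):
    """Map each fingerprint hash to the set of submissions that contain it."""
    index = defaultdict(set)
    for name, hashes in fingerprints.items():
        for h in hashes:
            index[h].add(name)
    return index


def candidate_pairs(index, min_shared=2):
    """Count shared fingerprints per pair; keep pairs sharing at least min_shared."""
    shared = defaultdict(int)
    for names in index.values():
        for a, b in combinations(sorted(names), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Toy fingerprint sets (hypothetical values, not real submissions).
fingerprints = {
    "alice.java": {11, 27, 42, 58},
    "bob.java": {11, 27, 42, 99},
    "carol.java": {5, 58, 73},
}
index = build_inverted_index(fingerprints)
pairs = candidate_pairs(index, min_shared=2)
print(pairs)  # only the alice/bob pair shares enough fingerprints to be a candidate
```

Only submissions that share at least one fingerprint are ever paired, which is what lets an inverted index avoid a full all-pairs comparison when most submissions are unrelated.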

The modular API allows users to insert classic fingerprint-based, birthmark-based, or embedding-based methods as independent plug-ins, facilitating research and comparative evaluation (Folea et al., 1 Jan 2026).
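
What such a plug-in contract might look like is sketched here. The `Analyser` base class, its two methods, the directive vocabulary, and `compare_all` are all assumptions made for illustration; the source does not specify Martial's actual interface.

```python
from abc import ABC, abstractmethod
import math


class Analyser(ABC):
    """Hypothetical plug-in contract: a signature per submission, a score per pair."""

    name: str

    @abstractmethod
    def signature(self, submission: str):
        ...

    @abstractmethod
    def similarity(self, sig_a, sig_b) -> float:
        ...


class LinterDirectiveAnalyser(Analyser):
    """Toy stand-in: marks which known directives appear in a submission."""

    name = "linter-directive"
    VOCAB = ("eslint-disable", "noqa", "nolint", "SuppressWarnings")  # illustrative only

    def signature(self, submission):
        # One vector slot per known directive: 1.0 if present, else 0.0.
        return [1.0 if d in submission else 0.0 for d in self.VOCAB]

    def similarity(self, sig_a, sig_b):
        # Cosine similarity between the two directive vectors.
        dot = sum(x * y for x, y in zip(sig_a, sig_b))
        norm = math.sqrt(sum(x * x for x in sig_a)) * math.sqrt(sum(y * y for y in sig_b))
        return dot / norm if norm else 0.0


def compare_all(analysers, sub_a, sub_b):
    """Run every registered analyser on a pair and collect per-analyser scores."""
    return {
        an.name: an.similarity(an.signature(sub_a), an.signature(sub_b))
        for an in analysers
    }


a = "x = 1  # noqa\n// nolint"
b = "y = 2  # noqa"
scores = compare_all([LinterDirectiveAnalyser()], a, b)
print(scores)
```

Keeping the signature and the similarity function behind one small interface means the comparison engine never needs to know whether a plug-in is fingerprint-based, birthmark-based, or embedding-based.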

2. Detection Methodologies

Martial integrates and standardizes three principal detection paradigms:

  • Fingerprinting (k-gram/Rabin Hashing, Winnowing):
    • Token stream segmentation into k-grams followed by Rabin fingerprint computation:

    $$h_i = \sum_{j=0}^{k-1} t_{i+j} p^{k-1-j} \bmod M$$

    • Sliding-window selection of minimum hashes provides a robust “fingerprint set” $F(P)$, enabling resilience to non-semantic code changes.
    • Jaccard index is used for set similarity:

    $$\operatorname{sim}_J(P,Q) = \frac{|F(P)\cap F(Q)|}{|F(P)\cup F(Q)|}$$

  • Software Birthmarks (Static CFG and Dynamic Opcode n-grams):

    • CFG Spectra: Extraction of control-flow graphs, followed by spectral analysis (top-$d$ eigenvalues of the adjacency matrix) to form signature vectors; Euclidean distance evaluates structure similarity.
    • Dynamic n-gram Birthmarks: Instrumented runtime opcode streams converted into $n$-gram histograms, with cosine similarity as the matching metric.
  • Embeddings (Transformers, Paragraph Vectors):
    • Natural-language comments embedded using pretrained models (e.g., RoBERTa, USE).
    • File-level vectors constructed via averaging or concatenation of block embeddings, compared using cosine similarity.
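
The fingerprinting paradigm in this list can be made concrete with a small sketch of k-gram hashing, window-minimum selection, and Jaccard comparison. Several simplifications are assumptions of this example and not details from the source: tokens come from whitespace splitting instead of a real lexer, token values come from `zlib.crc32`, each hash is recomputed instead of rolled, and `base`, `mod`, `k`, and `w` are arbitrary choices.

```python
import zlib


def kgram_hashes(tokens, k, base=257, mod=(1 << 61) - 1):
    """Polynomial hash of every k-gram: sum of t_{i+j} * base^(k-1-j) mod M."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        h = 0
        for tok in tokens[i:i + k]:
            # Horner's rule; crc32 gives a deterministic integer per token.
            h = (h * base + zlib.crc32(tok.encode())) % mod
        hashes.append(h)
    return hashes


def winnow(hashes, w):
    """Keep the minimum hash from each window of w consecutive k-gram hashes."""
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def fingerprint(code, k=4, w=3):
    return winnow(kgram_hashes(code.split(), k), w)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


p = "int a = 0 ; for ( i = 0 ; i < n ; i ++ ) a += i ;"
q = "int a = 0 ;\n    for ( i = 0 ; i < n ; i ++ )\n        a += i ;"
r = "while ( queue . size ( ) > 0 ) { process ( queue . pop ( ) ) }"

print(jaccard(fingerprint(p), fingerprint(q)))  # same tokens, different layout
print(jaccard(fingerprint(p), fingerprint(r)))  # unrelated program
```

Because this sketch hashes raw tokens, it is insensitive to layout but not to identifier renaming; a lexer that maps identifiers to a common placeholder before hashing would be needed for that.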

Martial’s plug-in interface allows orchestration of multiple modules and supports a “measure-and-vote” strategy, where consensus across analysers yields higher confidence in flagged pairs (Folea et al., 1 Jan 2026).
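
A measure-and-vote step can be sketched in a few lines. The analyser names follow the three modules listed earlier, but the scores, the per-analyser thresholds, and the default of two required votes are hypothetical values chosen for this illustration.

```python
def vote(scores, thresholds, min_votes=2):
    """Flag a pair when at least min_votes analysers meet their own threshold."""
    votes = [name for name, s in scores.items() if s >= thresholds[name]]
    return len(votes) >= min_votes, votes


# Hypothetical per-analyser similarity scores for one pair of submissions.
scores = {
    "comment-embedding": 0.93,
    "linter-directive": 0.40,
    "dynamic-complexity": 0.88,
}
# Hypothetical thresholds; in practice these would be tuned on labelled pairs.
thresholds = {
    "comment-embedding": 0.85,
    "linter-directive": 0.80,
    "dynamic-complexity": 0.80,
}

flagged, voters = vote(scores, thresholds)
print(flagged, voters)
```

Giving each analyser its own threshold matters because the scores are not on a common scale: a cosine similarity over comment embeddings and one over performance-counter features need not mean the same thing at 0.8.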

3. Obfuscation Robustness and Detection Challenges

Project Martial systematically classifies input artifacts by modality: static source code (SC.S), dynamic source code (SC.D), static binary (BIN.S), and dynamic binary (BIN.D). It explicitly addresses common obfuscations:

  • Whitespace/Identifier Substitution: Fingerprinting methods are parameterized by the k-gram length $k$ and the window size $w$, which trade off sensitivity to short matches against robustness to code reordering.
  • Structural Modifications: CFG techniques defeat identifier renaming, while opcode n-gram birthmarks tolerate some instruction reordering.
  • Superficial Comment Changes: Embedding-based detectors operate on semantic content, being resilient to variable or function name changes but reliant on substantive comments.
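
How an opcode n-gram birthmark degrades gracefully under instruction reordering can be seen in a toy example. The opcode names and traces are invented for illustration and do not correspond to any real instruction set or to Martial's instrumentation output.

```python
from collections import Counter
import math


def ngram_histogram(opcodes, n):
    """Count every run of n consecutive opcodes in a trace."""
    return Counter(tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram histograms."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical opcode traces.
original = ["load", "load", "add", "store", "load", "cmp", "jmp"]
reordered = ["load", "load", "add", "load", "store", "cmp", "jmp"]  # two opcodes swapped
unrelated = ["push", "call", "pop", "ret"]

h_orig = ngram_histogram(original, 2)
h_reord = ngram_histogram(reordered, 2)
h_other = ngram_histogram(unrelated, 2)

print(cosine(h_orig, h_reord))  # about 0.5: three of six bigrams survive the swap
print(cosine(h_orig, h_other))  # 0.0: no bigram in common
```

A single swap disturbs only the n-grams that straddle the swapped positions, so the score drops but does not collapse. Larger values of n make the birthmark more distinctive and also more fragile under reordering.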

The ensemble approach—requiring consensus from at least two analysers before deeming a pair suspicious—further reduces false positives in the face of adversarially manipulated code (Folea et al., 1 Jan 2026).

4. Integration of Classic and Modern Methods

Project Martial provides templates for rapid integration of established tools:

  • Winnowing Fingerprint: Implementable in ≈200 lines, exposing a standard interface to the Martial comparison engine.
  • CFG Birthmark: Can use third-party control-flow analyzers or binary lifting frameworks, producing vectors of leading eigenvalues.
  • Code Embeddings: Transformer-based or neural code representation models can be inserted behind Martial’s Embedder abstraction.

The architecture is thus positioned as an integration and orchestration platform for diverse and evolving software similarity paradigms (Folea et al., 1 Jan 2026).

5. Indexing, Performance, and Evaluation

  • Indexing:
    • LSH forests for vector signatures: time complexity $O(N \ell d)$ (insertion), $O(k d \log N)$ (query).
    • Fingerprint sets: inverted index storage, $O(mN)$ for $N$ programs.
  • Performance Metrics:
    • Precision, recall, and $F_1$:

    $$P = \frac{TP}{TP+FP}, \quad R = \frac{TP}{TP+FN}, \quad F_1 = 2\frac{PR}{P+R}$$

  • Empirical Results:

    • On the VSCode+Kubernetes corpus and hand-labelled student Java files, the ensemble approach achieves $F_1 > 0.90$ on held-out suspicious pairs, outperforming individual analysers and established tools such as MOSS and JPlag, with improved recall on obfuscated samples and reduced false positive rate (Folea et al., 1 Jan 2026).
  • Legal Precedents and Principles:
    • “Good faith” practice from MOSS [29]: automatic tools serve as aids for human reviewers, not arbiters.
    • Oracle v. Google [19]: Martial’s analysers ignore API calls, focusing on implementation and behavioral similarity.
    • Anticipation of LLM-based obfuscation: future plans for an “LLM-sanitizer” preprocessor to canonicalize code.
  • Usability: Simple CLI and web-based UI are provided to address low adoption rates among instructors (only 8% currently use such tools in academic settings).
  • Evaluation Protocol:
    • 80/20 training/testing splits, grid search for thresholding, and comparative evaluation against MOSS, JPlag, and Andromeda on authentic and adversarial code pairs.
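
The thresholding part of this protocol can be sketched as a grid search that maximises $F_1$ on the training pairs and then scores the held-out pairs. The similarity scores, labels, and grid below are fabricated toy values for illustration only and say nothing about the reported results.

```python
def prf1(labels, preds):
    """Precision, recall, and F1 from boolean ground truth and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(scores, labels, grid):
    """Return the grid value with the highest training F1 (first one on ties)."""
    return max(grid, key=lambda t: prf1(labels, [s >= t for s in scores])[2])


# Toy data: one similarity score and one plagiarism label per pair.
scores = [0.95, 0.91, 0.88, 0.72, 0.65, 0.40, 0.35, 0.20, 0.90, 0.30]
labels = [True, True, True, False, True, False, False, False, True, False]

split = int(0.8 * len(scores))  # 80/20 split: first 8 pairs train, last 2 test
grid = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

threshold = best_threshold(scores[:split], labels[:split], grid)
test_metrics = prf1(labels[split:], [s >= threshold for s in scores[split:]])
print(threshold, test_metrics)
```

Selecting the threshold on the training portion only keeps the held-out $F_1$ an honest estimate; tuning on all pairs would overstate it.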

Project Martial’s open-source nature and modularity extend its applicability across academic integrity, copyright infringement disputes, and detection of synthetic code generated by LLMs (Folea et al., 1 Jan 2026).

7. Extensibility and Future Directions

Project Martial’s architecture invites addition of new analysers, including:

  • Transformer-based and neural code representation models.
  • LLM-generated code obfuscation handlers.
  • Fine-grained binary-level instrumentation modules.
  • Integration with academic and commercial corpora for benchmarking and legal cases.

The system's API and evaluation framework facilitate participation by the research community, ensuring adaptability to new forms of code generation and manipulation as the software plagiarism landscape evolves (Folea et al., 1 Jan 2026).
