KitPhishr: Automated Phishing Kit Detection
- KitPhishr is a technical framework designed for early detection and analysis of phishing kits using both static code analysis and dynamic behavioral clustering.
- It combines deterministic source code parsing with supervised machine learning to identify obfuscation tactics and author fingerprints in phishing campaigns.
- Dynamic clustering via JavaScript API traces and HDBSCAN enables precise grouping of phishing pages, supporting actionable threat intelligence and rapid response.
KitPhishr is a technical framework for the automated early detection and analysis of phishing kits, web pages, and campaigns, using both static and dynamic features of phishing kit deployments. It combines deterministic code analysis with large-scale behavioral clustering of browser JavaScript execution traces to address the scale and adaptability of phishing attacks. The system relies on supervised machine learning and is designed to provide scalable, actionable intelligence for researchers, analysts, and web platform providers.
1. Core Principles: Phishing Kit Detection and Differentiation
KitPhishr focuses on systematic kit identification rather than page-by-page signature matching. Phishing kits are collections of files (HTML, PHP, JavaScript, configuration, and assets) disseminated within the cybercrime ecosystem to facilitate rapid, repeatable deployment of phishing campaigns.
Central principles include:
- Deterministic source code analysis: Identifies the presence and nature of evasion (e.g., forbidden access, redirection) and obfuscation mechanisms (e.g., base64 encoding, eval, urldecode), as well as code signatures and embedded author tags.
- Feature engineering: Captures discriminative features such as the existence of “.htaccess” and “robots.txt”; prevalence of obfuscation techniques; structural file composition; and traces of framework usage.
- Behavioral fingerprinting: Aggregates client-side JavaScript API call sets as derived by instrumented browsers executing first-party scripts.
This design establishes a ground truth for subsequent classification and supports profiling of the design habits of kit authors.
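The deterministic analysis and feature engineering described above can be illustrated with a minimal sketch. It assumes a kit is available as a mapping from relative file path to source text; the pattern names, the regular expressions, and the `scan_kit` helper are hypothetical and chosen only for illustration, not taken from KitPhishr itself.

```python
import re

# Hypothetical indicator patterns (illustrative only, not KitPhishr's actual rules).
OBFUSCATION_PATTERNS = {
    "base64_decode": re.compile(r"\bbase64_decode\s*\("),
    "eval": re.compile(r"\beval\s*\("),
    "urldecode": re.compile(r"\burldecode\s*\("),
}
EVASION_PATTERNS = {
    "forbidden_access": re.compile(r"403\s+Forbidden", re.IGNORECASE),
    "redirection": re.compile(r"header\s*\(\s*['\"]Location:", re.IGNORECASE),
}
MARKER_FILES = (".htaccess", "robots.txt")


def scan_kit(files: dict[str, str]) -> dict[str, int]:
    """Build a flat feature dictionary from {relative_path: source_text}."""
    features: dict[str, int] = {}
    # Count how often each obfuscation or evasion pattern occurs across all files.
    for name, pattern in {**OBFUSCATION_PATTERNS, **EVASION_PATTERNS}.items():
        features[f"count_{name}"] = sum(
            len(pattern.findall(text)) for text in files.values()
        )
    # Record whether well-known marker files exist anywhere in the kit.
    for marker in MARKER_FILES:
        features[f"has_{marker}"] = int(
            any(path.split("/")[-1] == marker for path in files)
        )
    # A simple structural metric: total number of files in the kit.
    features["num_files"] = len(files)
    return features


example_kit = {
    "index.php": "<?php eval(base64_decode($payload)); header('Location: https://example.com'); ?>",
    ".htaccess": "deny from all",
    "assets/login.js": "console.log('hi');",
}
features = scan_kit(example_kit)
```

A feature dictionary of this shape could then serve as one input row for the supervised classifiers; real kits would likely need more careful parsing than plain regular expressions.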
2. Static and Dynamic Classification Methodologies
KitPhishr employs a two-phase approach for kit recognition:
- Static Deterministic Analysis: Source code from kits is parsed using regular expressions and AST-based heuristics to flag obfuscation/evasion directives and functional constructs. Discriminant structural metrics (number of files, types, framework clues, signature artifacts) are recorded.
- Supervised Machine Learning Integration: Building on deterministic labels, a range of binary classifiers (Decision Trees, ensemble Random Forests [RF10, RF100], Linear SVM, Gaussian Processes, Naive Bayes) is trained to generalize to previously unseen evasion or obfuscation techniques. Feature sets incorporate code-level indicators, author profiling, and statically flagged functions.
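Performance evaluation uses standard classification metrics, including precision, recall, and the F1 score, where TP, FP, and FN denote true positives, false positives, and false negatives:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$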
Experimental results demonstrate robust detection (F1 ≈ 0.96 for evasive kits via RF100), with limited training data sufficing to generalize across recurring author patterns and novel obfuscation variants.
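As a worked illustration of how an F1 score such as the one reported above is computed, the following sketch derives precision, recall, and F1 from binary labels. The label vectors are made-up toy data and the function name is hypothetical; it does not reproduce the reported results.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: 1 = kit flagged as evasive, 0 = not evasive.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here there are four true positives, one false positive, and one false negative, so precision, recall, and F1 all come out at 0.8.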
3. Behavioral Clustering by JavaScript Capabilities
A distinguishing feature is KitPhishr’s use of client-side dynamic analysis:
- API Trace Collection: VisibleV8 (Chromium-based) records first-party JavaScript API calls during page execution, forming unordered API usage sets per page.
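- Clustering via Jaccard Index and HDBSCAN: The pairwise distance between two pages with API sets $A$ and $B$ is the Jaccard distance
$$d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}.$$
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) then groups pages without supervision and without requiring the number of clusters to be fixed in advance.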
- Annotation of Clusters: API call sets are mapped to phishing techniques (e.g., UI interactivity, dynamic code execution, basic fingerprinting) according to curated mappings. With over 434,000 unique phishing pages, KitPhishr identified 11,377 behavioral clusters, each annotated with observed tactics.
This pipeline achieves high kit identification accuracy on ground-truth families (Fowlkes-Mallows index, FMI = 0.97; V-measure = 0.91) and reveals both near-universal and rare behaviors: UI interactivity appears in 90% of clusters, whereas mouse detection appears in only 35 clusters.
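The distance computation behind this clustering step can be sketched as follows. The example assumes each page is represented as an unordered set of API names; the specific API names and the helper functions are hypothetical placeholders, not actual VisibleV8 output or KitPhishr code.

```python
from itertools import combinations


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(api_sets):
    """Build a symmetric distance matrix (list of lists) over all pages."""
    n = len(api_sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = jaccard_distance(api_sets[i], api_sets[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical API usage sets for three pages.
pages = [
    frozenset({"Document.getElementById", "Window.addEventListener", "Navigator.userAgent"}),
    frozenset({"Document.getElementById", "Window.addEventListener"}),
    frozenset({"Window.eval", "Window.atob"}),
]
distances = pairwise_distances(pages)
```

A matrix like this could be handed to an HDBSCAN implementation that accepts precomputed distances (for instance, scikit-learn's `HDBSCAN` with `metric="precomputed"`). Which implementation and parameters KitPhishr uses is not stated here, so that choice is an assumption. Note also that a full pairwise matrix grows quadratically with the number of pages, so a corpus of hundreds of thousands of pages would likely need deduplication of identical sets or another scaling strategy.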
4. KitPhishr Dataset Acquisition and Feature Evolution
KitPhishr incorporates the resource collection methodology described in (Kulkarni et al., 11 Sep 2025):
- URLs are sourced from PhishTank, with automated extraction and resource downloading including HTML, CSS, JavaScript, favicons, images, and screenshots.
- Datasets contain both legitimate (4,056 URLs) and phishing (5,666 URLs) cases, with structured subdirectories per PHISHID for distinct resource types.
Feature analysis illustrates evolving attacker patterns:
- Modern phishing domains tend to have short lifetimes, which makes domain age (“Age_of_Domain”) a highly correlated feature.
- URL structure and length have become more discriminative, with attackers shifting from naïve anchor tag tricks to mimicking legitimate outlinks.
- Attackers now use IP addresses as landing points less often and rely instead on compromised domain infrastructure, which marks a shift in their operational practices.
This approach supports multi-modal training, benchmarking, and continuous feature relevance assessment for more generalized and adaptive detection models.
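Some of the URL-level features discussed above can be derived with the Python standard library alone. The sketch below is illustrative: the feature names and the `url_features` helper are hypothetical, and domain age is omitted because it requires an external registration lookup that is not described here.

```python
import ipaddress
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Extract a few simple lexical features from a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Detect whether the landing host is a raw IPv4/IPv6 address.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "host_is_ip": host_is_ip,
        # Number of dot-separated labels in the host (not meaningful for IP hosts).
        "num_host_labels": len(host.split(".")) if host and not host_is_ip else 0,
        # Number of non-empty path segments.
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
    }


ip_url = "http://192.0.2.10/login/verify.php"
domain_url = "https://secure-login.example.com/account"
ip_feats = url_features(ip_url)
domain_feats = url_features(domain_url)
```

Comparing such features between older and newer collections is one way to track whether a signal like IP-address landing pages is losing relevance.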
5. Applications: Early Detection, Threat Intelligence, and Research Utility
KitPhishr enables several practical capabilities:
- Early detection for platform providers: The integration of deterministic and ML-driven modules supports real-time warning systems able to spot both known and emergent phishing kits, including those that use previously unseen evasion or obfuscation techniques.
- Author and Kit Profiling: Persistent author signatures and reusable code fragments permit intelligence-driven tracking of adversary groups and adaptive countermeasures.
- Automated clustering at scale: Dynamic API fingerprinting replaces costly manual kit identification and serves as a scalable mechanism for combing large volumes of pages.
- Research support: Systematic labeling and behavioral annotation facilitate comparative evaluations and longitudinal studies of evolving phishing tactics.
A plausible implication is that dynamic clustering and behavior-aware detection could generalize across phishing domains and extend to adjacent threat classes, such as evasive malware web pages.
6. Impacts and Future Directions
KitPhishr’s combined approach has empirically demonstrated that:
- Recurring design habits among kit authors allow for generalizable and robust detection, even when the ground-truth dataset is sparse.
- Dynamic, behavioral fingerprinting via API call sets achieves markedly higher clustering accuracy than static hashing methods, suggesting a shift for future detection frameworks.
- Universal techniques (e.g., UI interactivity, basic fingerprinting) dominate, but detection of rare capabilities (e.g., advanced mouse and timing bot detection) highlights the necessity of comprehensive behavioral profiling.
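To make the contrast between universal and rare techniques concrete, the following sketch annotates clusters with technique labels and measures how widespread each label is. It assumes each cluster is summarized by one set of API names (for example, the union over its member pages); the mapping, the API names, and the cluster data are all hypothetical and do not reproduce KitPhishr's curated mappings.

```python
from collections import Counter

# Hypothetical curated mapping: technique label -> API names that signal it.
TECHNIQUE_APIS = {
    "ui_interactivity": {"EventTarget.addEventListener", "Document.getElementById"},
    "dynamic_code_execution": {"Window.eval", "Function"},
    "basic_fingerprinting": {"Navigator.userAgent", "Screen.width"},
    "mouse_detection": {"MouseEvent.clientX", "MouseEvent.clientY"},
}


def annotate(api_set):
    """Return every technique whose signalling APIs overlap the given API set."""
    return {tech for tech, apis in TECHNIQUE_APIS.items() if api_set & apis}


def technique_prevalence(cluster_api_sets):
    """Return the fraction of clusters exhibiting each technique."""
    total = len(cluster_api_sets)
    if total == 0:
        return {tech: 0.0 for tech in TECHNIQUE_APIS}
    counts = Counter()
    for api_set in cluster_api_sets.values():
        counts.update(annotate(api_set))
    return {tech: counts[tech] / total for tech in TECHNIQUE_APIS}


# Hypothetical clusters, each summarized by one API set.
clusters = {
    "c0": {"EventTarget.addEventListener", "Navigator.userAgent"},
    "c1": {"Document.getElementById", "Window.eval"},
    "c2": {"EventTarget.addEventListener", "MouseEvent.clientX"},
    "c3": {"Screen.width"},
}
prevalence = technique_prevalence(clusters)
```

In this toy data, UI interactivity appears in three of four clusters while mouse detection appears in one, mirroring in miniature the kind of skew described above.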
Continued evolution in KitPhishr’s methodology may incorporate explainability frameworks (as exemplified in PhishXplain (Roy et al., 11 May 2025)) to augment analyst trust and end-user comprehension, as well as personalized, privacy-preserving mechanisms to thwart phishing in anonymity-focused environments.
7. Technical Summary Table: KitPhishr Key Components
Component | Methodology | Result/Significance |
---|---|---|
Static kit classification | Deterministic code analysis with supervised classifiers | F1 ≈ 0.96 (RF100, evasive kits) with limited training data |
Dynamic kit clustering | API trace (Jaccard, HDBSCAN) | FMI = 0.97, V-measure = 0.91 |
Resource acquisition | PhishTank, automated fetch | Multi-modal, up-to-date datasets |
Feature annotation | Manual mapping of APIs | Behavioral profiling of kits |
Threat intelligence integration | Author profiling | Enhanced early detection, attribution |
This comprehensive, layered methodology positions KitPhishr as a fundamental contribution to the automatic triage, categorization, and behavioral analysis of phishing web infrastructure, supporting both operational defense and academic research into emerging adversarial tactics.