DecoyDB: Cybersecurity and Molecular Data Repository
- DecoyDB is a dual-purpose repository comprising adaptive decoy networks for cybersecurity deception and a large-scale molecular dataset for binding affinity prediction.
- In cybersecurity, it structures and updates inter-referential decoy elements in directed graphs, enabling real-time attack diversion and SIEM integration.
- In molecular learning, it provides high-resolution decoy poses and RMSD annotations to support self-supervised graph contrastive learning for improved affinity prediction.
DecoyDB refers to two unrelated concepts in computational science: (1) a lightweight database and repository architecture supporting adaptive inter-referential decoy networks for cybersecurity deception, and (2) a large-scale structure-aware dataset for self-supervised graph contrastive learning in protein–ligand binding affinity prediction. Both address the need for scalable, context-sensitive repositories, one in the adversarial domain of security and the other in molecular machine learning, by providing precisely structured collections of either decoy elements or molecular complexes and their derivatives.
1. DecoyDB as an Adaptive Repository in Cybersecurity Deception Frameworks
DecoyDB, introduced as part of advanced deception systems, is the central repository for managing, tracking, and adapting networks of decoy elements ("honeytokens") in defended computing environments (Reti et al., 2021). The decoy elements reference one another in a directed graph, forming what is described as a “rabbit hole” of deception intended to guide attackers away from critical assets while tracking their movement.
The repository provides the following core capabilities:
- Records deployments of decoy types (e.g., source code comments, SQLite DB-entries, OS accounts, document files, file-metadata, and URLs) with explicit outbound references (edges) to other decoy elements.
- Stores metadata essential for adaptive operation, including the decoy’s location, type, reference relationships, and interaction history.
- Serves as the persistent state for the deployment and monitoring software, supporting real-time updates in response to observed attacker interactions.
DecoyDB is implemented in SQLite, chosen for lightweight and reliable storage. This enables fast queries and lets the deception daemon add or update decoy-relation entries when it detects attacker interactions through file-system events (e.g., via inotify on Linux).
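The storage layout described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration only: the table names (`decoy`, `reference`), the column names, and the helper functions are hypothetical and are not taken from the source.

```python
import sqlite3

# Hypothetical schema: one row per decoy, one row per directed reference (edge).
SCHEMA = """
CREATE TABLE decoy (
    id           INTEGER PRIMARY KEY,
    type         TEXT NOT NULL,               -- e.g. 'os_account', 'url', 'document'
    location     TEXT NOT NULL,               -- path, URL, or account name
    interactions INTEGER NOT NULL DEFAULT 0   -- simple interaction counter
);
CREATE TABLE reference (
    src INTEGER NOT NULL REFERENCES decoy(id),
    dst INTEGER NOT NULL REFERENCES decoy(id),
    PRIMARY KEY (src, dst)
);
"""

def open_db(path=":memory:"):
    """Open (or create) the repository and install the schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def add_decoy(conn, dtype, location):
    """Record a deployed decoy and return its id."""
    cur = conn.execute(
        "INSERT INTO decoy (type, location) VALUES (?, ?)", (dtype, location)
    )
    return cur.lastrowid

def add_reference(conn, src, dst):
    """Record an outbound reference src -> dst; duplicates are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO reference (src, dst) VALUES (?, ?)", (src, dst)
    )

def record_interaction(conn, decoy_id):
    """Increment the interaction counter after an observed access."""
    conn.execute(
        "UPDATE decoy SET interactions = interactions + 1 WHERE id = ?", (decoy_id,)
    )

def outbound(conn, decoy_id):
    """Return the ids of decoys referenced by the given decoy."""
    rows = conn.execute(
        "SELECT dst FROM reference WHERE src = ? ORDER BY dst", (decoy_id,)
    )
    return [r[0] for r in rows]
```

In this sketch a daemon would call `record_interaction` from its file-system event handler and then use `outbound` to decide which referenced decoys to refresh or extend.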
2. Integration with Stochastic and Adaptive Deception Models
At the core of the DecoyDB-driven deception system is a stochastic model based on time-homogeneous discrete-time Markov chains (DTMCs), which determine the sequence of decoy types to be revealed to adversaries (Reti et al., 2021). The transition probability $p_{ij}$ specifies the chance that, after encountering a decoy of type $i$, an attacker interacts with a referenced decoy of type $j$. This framework allows:
- Systematic and probabilistically controlled deployment of reference chains.
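- Dynamic adjustment of transition probabilities through an online learning rule. Writing $p_{ij}$ for the probability of moving from decoy type $i$ to decoy type $j$:
For the followed reference $i \to j$: $p_{ij} \leftarrow p_{ij} + \alpha\,(1 - p_{ij})$
For alternatives $k \neq j$: $p_{ik} \leftarrow (1 - \alpha)\,p_{ik}$
where $\alpha \in [0, 1]$ is the learning rate.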
Modifications to the underlying DTMC (as recorded in DecoyDB) ensure the evolving decoy network adapts to observed attacker behaviors, increasing the system’s effectiveness over time. This probabilistic adaptation is constrained to maintain normalization across transition probability vectors.
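One update that satisfies the normalization constraint is a linear reward scheme: scale every entry of the current row by $(1 - \alpha)$ and add $\alpha$ to the followed entry. The sketch below assumes that rule for illustration; it is not confirmed as the exact rule used by Reti et al., and the function name and dictionary representation are hypothetical.

```python
def update_row(row, followed, alpha=0.1):
    """Return an updated transition row after the attacker follows `followed`.

    `row` maps decoy type -> probability and is assumed to sum to 1.
    Assumed rule: the followed entry moves toward 1, all others shrink by (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if followed not in row:
        raise KeyError(followed)
    # Scale every entry, then add alpha to the followed one:
    # (1 - alpha) * p + alpha == p + alpha * (1 - p)
    new_row = {k: (1.0 - alpha) * p for k, p in row.items()}
    new_row[followed] += alpha
    return new_row


row = {"url": 0.5, "os_account": 0.3, "document": 0.2}
updated = update_row(row, "url", alpha=0.2)
```

Because the scaled entries sum to $(1 - \alpha)$ and the followed entry receives the remaining $\alpha$, the row still sums to 1 after every update.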
3. Role of DecoyDB in Inter-referential Deployment and Operational Modes
DecoyDB’s structured repository enables three deployment strategies:
- Graphical Mode: Precomputed directed graphs (e.g., Erdős-Rényi or k-out models) determine decoy layout and references, facilitating controlled attacker pathways.
- Infinite Mode: New decoy elements (and database entries) are generated and linked dynamically upon interaction, resulting in branching, potentially deep trees.
- Graphical-Infinite (Hybrid) Mode: Combines initial graph-based deployment (with records stored in DecoyDB) and subsequent online expansion as attacks unfold.
The repository ensures that decoy references are consistently updated as new links are activated and new elements are deployed, maintaining an accurate topological map of the "rabbit hole" network throughout an attack scenario.
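The deployment modes can be illustrated with a small in-memory graph. The sketch below assumes a k-out construction (each decoy receives `k` random outbound references) for the precomputed layout, and a fixed fan-out for on-demand expansion. The function names, the adjacency-dictionary representation, and the "expand once per decoy" policy are assumptions, not details from the source.

```python
import random

def k_out_graph(n, k, seed=None):
    """Graphical mode: each of n decoys gets k distinct references to other decoys."""
    if not 0 <= k < n:
        raise ValueError("need 0 <= k < n")
    rng = random.Random(seed)
    return {
        v: sorted(rng.sample([u for u in range(n) if u != v], k))
        for v in range(n)
    }

def expand_on_interaction(graph, node, fanout, touched):
    """Infinite or hybrid mode: on the first interaction with `node`,
    create `fanout` new decoys and link them from `node`."""
    if node in touched:
        return []                      # already expanded; nothing new to deploy
    touched.add(node)
    next_id = max(graph) + 1
    new_nodes = list(range(next_id, next_id + fanout))
    for u in new_nodes:
        graph[u] = []                  # new leaves have no references yet
    graph[node] = graph[node] + new_nodes
    return new_nodes


graph = k_out_graph(5, 2, seed=0)      # initial precomputed layout
touched = set()
created = expand_on_interaction(graph, 0, 3, touched)   # attacker touches decoy 0
```

Starting from `k_out_graph` and then calling `expand_on_interaction` corresponds to the hybrid mode; starting from a single-node graph and only expanding corresponds to the infinite mode.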
4. Applications and Practical Implications in Security Operations
DecoyDB’s use in deception frameworks delivers several measurable benefits:
- Enhanced Intrusion Detection: By structuring decoy elements in inter-referenced chains, DecoyDB increases attacker "dwell time" and monitoring granularity. Anomalies—such as unexpected reference traversals or repeated cycles—provide high signal-to-noise alerts for defenders.
- Insider Threat Mitigation: When deployed with OS-level decoy accounts and document files, early-warning footprints can be mined from DecoyDB for forensic correlation.
- SIEM Integration: The repository can be polled or interfaced via SIEM-conform APIs, streamlining integration into larger enterprise threat intelligence pipelines.
A plausible implication is that the structured, real-time updating capabilities of DecoyDB support more nuanced policy-driven alerting and automated remediation strategies, as opposed to static, isolated decoy deployments.
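As a rough illustration of how traversal anomalies could be derived from the stored graph, the sketch below flags two of the patterns mentioned above: a move between decoys that is not backed by a recorded reference, and a revisit of a decoy already seen. The event fields and function names are hypothetical, and JSON lines are used only as a generic export format, not as the interface of any particular SIEM product.

```python
import json

def traversal_alerts(graph, visits):
    """Scan an ordered list of touched decoys and return alert dictionaries."""
    alerts = []
    seen = set()
    prev = None
    for step, node in enumerate(visits):
        # A move that does not follow a recorded reference is unexpected.
        if prev is not None and node not in graph.get(prev, ()):
            alerts.append(
                {"step": step, "kind": "unexpected_traversal", "src": prev, "dst": node}
            )
        # Returning to a decoy that was already touched indicates a cycle.
        if node in seen:
            alerts.append({"step": step, "kind": "revisit", "decoy": node})
        seen.add(node)
        prev = node
    return alerts

def to_json_lines(alerts):
    """Serialize alerts as one JSON object per line."""
    return "\n".join(json.dumps(a, sort_keys=True) for a in alerts)


graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
alerts = traversal_alerts(graph, ["a", "b", "c", "a", "c"])
payload = to_json_lines(alerts)
```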
5. DecoyDB as a Molecular Structure Dataset for Graph Contrastive Learning
Under a separate definition, DecoyDB is a large-scale, high-resolution dataset designed for graph contrastive learning (GCL) tasks in the prediction of protein–ligand binding affinities (Zhang et al., 8 Jul 2025). Its design directly addresses bottlenecks in machine learning-driven drug discovery, where scarcity of labeled data hinders effective GNN training.
The key properties of this dataset include:
- High-Resolution Reference Structures: Complexes from the PDB with crystallography resolutions ≤2.5 Å.
- Extensive Decoy Pose Generation: For each reference complex, ≈88 computationally docked decoy structures generated via AutoDock Vina with diverse RMSDs (from <0.1 Å up to ~25.5 Å) from the native pose.
- RMSD Annotation: Each decoy structure is annotated with its RMSD from the native conformation. Decoys with RMSD ≤2 Å are labeled "near-native" and serve as positive pairs, while those above this threshold serve as negative samples graded by their RMSD.
- Scale: The database covers over 5.35 million decoys, facilitating large-batch self-supervised pretraining.
The comprehensive structure-aware design of DecoyDB is crucial for training GNNs to discriminate finely between subtle and pronounced deviations in molecular recognition.
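The RMSD-based labeling can be sketched as follows. The example assumes that the decoy and native ligand poses share the receptor frame and atom ordering, so no superposition or symmetry correction is applied. Those simplifications, along with the function names, are assumptions made for illustration.

```python
import math

NEAR_NATIVE_RMSD = 2.0  # threshold in Å, as stated in the text

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

def label_decoy(rmsd_value):
    """Near-native decoys act as positives; all others are negatives."""
    return "positive" if rmsd_value <= NEAR_NATIVE_RMSD else "negative"


native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]   # every atom displaced by 3 Å
far = rmsd(native, shifted)
```

For negatives, the RMSD value itself can be kept alongside the label so that it is available later as a grading signal.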
6. Framework for Contrastive Pretraining Using DecoyDB
The dataset underpins a dual-objective pretraining framework for graph neural networks:
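- Two-Category Contrastive Loss:
$$\mathcal{L}_{\mathrm{con}} = -\log \frac{\exp\big(\mathrm{sim}(z_a, z_p)/\tau\big)}{\exp\big(\mathrm{sim}(z_a, z_p)/\tau\big) + \sum_{n} w_n \exp\big(\mathrm{sim}(z_a, z_n)/\tau\big)}$$
where $z_a$ and $z_p$ are the latent codes for the anchor and positive, $z_n$ ranges over the negative samples, $\mathrm{sim}(\cdot,\cdot)$ is a similarity function, $\tau$ is a temperature parameter, and $w_n$ is an importance weight (set proportional to RMSD for decoy negative samples; 1 otherwise).
- Denoising Score Matching Regularization:
$$\mathcal{L}_{\mathrm{dn}} = \mathbb{E}\Big[\big\lVert s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q(\tilde{x} \mid x) \big\rVert^2\Big]$$
where $\tilde{x}$ is a noise-perturbed conformation of the structure $x$, $q$ is the noise distribution, and $s_\theta$ is the predicted score.
- Aggregate Loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{con}} + \lambda\, \mathcal{L}_{\mathrm{dn}}$$
where $\lambda$ balances the two terms.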
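The following sketch shows one way the pieces could fit together numerically. It assumes an InfoNCE-style contrastive term with per-negative weights, cosine similarity, and a Gaussian denoising target. These are standard choices used here for illustration and are not confirmed as the exact formulation of Zhang et al. All function and parameter names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def weighted_contrastive_loss(anchor, positive, negatives, weights, tau=0.1):
    """InfoNCE-style loss where each negative carries an importance weight."""
    if len(negatives) != len(weights):
        raise ValueError("one weight per negative is required")
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(w * math.exp(cosine(anchor, n) / tau) for n, w in zip(negatives, weights))
    return -math.log(pos / (pos + neg))

def denoising_loss(pred_score, clean, noisy, sigma):
    """Squared error against the Gaussian score target (clean - noisy) / sigma^2."""
    target = [(c - n) / sigma ** 2 for c, n in zip(clean, noisy)]
    return sum((p - t) ** 2 for p, t in zip(pred_score, target))

def total_loss(l_con, l_dn, lam):
    """Aggregate objective: contrastive term plus lam times the denoising term."""
    return l_con + lam * l_dn


anchor, positive = [1.0, 0.0], [1.0, 0.0]
negative = [0.0, 1.0]                      # orthogonal to the anchor
base = weighted_contrastive_loss(anchor, positive, [negative], [1.0], tau=1.0)
heavy = weighted_contrastive_loss(anchor, positive, [negative], [5.0], tau=1.0)
```

Raising a negative's weight increases its contribution to the denominator, so the loss grows for the same embeddings. This is the mechanism by which an RMSD-proportional weight would penalize far-from-native decoys more strongly.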
This contrastive-denoising pretraining results in representations that encode both absolute and context-specific binding pose quality, relevant for subsequent affinity regression or classification.
7. Impact and Evaluation of DecoyDB in Drug Discovery
Empirical results using GNNs pre-trained on DecoyDB demonstrate:
- Improved Prediction Accuracy: Lower RMSE and a higher Pearson correlation coefficient in binding affinity prediction on established benchmarks (e.g., PDBbind core sets).
- Superior Label Efficiency: Models converge faster and generalize better when fine-tuned on limited labeled data.
- Generalizability: Evaluation on "leakage-proof" data splits (excluding similar proteins/ligands between train and test) evidences sustained performance, highlighting representation robustness.
DecoyDB thus contributes to overcoming data scarcity and facilitating scalable, structure-aware machine learning pipelines for virtual screening and lead optimization in pharmaceutical research.
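For reference, the two evaluation metrics named above can be computed as in the sketch below. The affinity values are made-up numbers chosen only to exercise the functions and do not represent any reported result.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted affinities."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)


measured = [1.0, 2.0, 3.0]      # illustrative values only
predicted = [2.0, 4.0, 6.0]
error = rmse(measured, predicted)
corr = pearson_r(measured, predicted)
```

The example also shows why both metrics are reported: predictions can be perfectly correlated with the measurements while still carrying a large absolute error.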
In summary, DecoyDB denotes, in both cybersecurity deception and molecular machine learning, an adaptive, structurally annotated repository built for an advanced modeling task: dynamic adversarial defense in one case and high-throughput molecular property prediction in the other. In each context, the repository's structure and metadata enable data-driven adaptation (Reti et al., 2021; Zhang et al., 8 Jul 2025).