SolRPDS: DeFi Rug Pull Dataset & RS Storage
- SolRPDS-DeFi is a comprehensive dataset capturing 3.69B Solana on-chain transactions, enabling advanced liquidity analytics to identify rug pulls in DeFi.
- SolRPDS-Storage (APLS) is an innovative RS-coded storage solution that uses parallel agent nodes to reduce degraded-read latency by up to 28% under heavy load.
- Both contributions exemplify practical applications of robust data processing and algorithmic design in blockchain security analytics and distributed storage performance.
SolRPDS refers to two distinct but technically significant works in contemporary computing: (1) the Solana Rug Pull Dataset for decentralized finance security research, and (2) the Solution for RS-coded PDS (APLS) in distributed storage systems. Both contributions are referenced as SolRPDS and are independently prominent in their domains—blockchain security analytics and erasure-coded storage performance, respectively. This article covers both, referencing each as SolRPDS-DeFi and SolRPDS-Storage (Editor's term) for clarity, and presents technical details based strictly on the published research.
1. SolRPDS-DeFi: Dataset for Rug Pull Analysis on Solana
SolRPDS-DeFi is the first public dataset designed to facilitate research on rug pulls in the Solana decentralized finance (DeFi) ecosystem (Alhaidari et al., 6 Apr 2025). Rug pulls are a prevalent class of exit scams in DeFi, characterized by developers extracting liquidity from decentralized exchange (DEX) token pools, often leaving users with unredeemable tokens. While similar datasets exist for Ethereum and Binance Smart Chain, systematic analyses for Solana were previously lacking.
The dataset provides comprehensive coverage of Solana DeFi activity over nearly four years (February 12, 2021 through November 1, 2024), including 3.69 billion on-chain transactions, 278 million liquidity pool actions, and 3.42 million token swaps covering major DEXs such as Raydium and Jupiter. It comprises 62,895 suspicious liquidity pools, annotated for inactivity (a primary rug-pull indicator), and details liquidity movements (additions, removals), inactivity periods, and amounts withdrawn.
2. SolRPDS-DeFi: Data Collection, Schema, and Feature Engineering
Data extraction for SolRPDS-DeFi employs a pipeline leveraging Flipside Crypto and Google BigQuery for Solana. All raw transactions are parsed to identify 15 liquidity action types, including deposit, addLiquidity, removeLiquidity, and withdraw. Two primary SQL common table expressions (CTEs)—RecentLiquidityAdds and RecentLiquidityRemoves—aggregate pool-level statistics such as total_added_liquidity, total_removed_liquidity, the number of actions, and timestamps.
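The two-CTE aggregation described above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the authors' query: the `liquidity_events` table, its columns, the sample rows, and the grouping of `deposit` with adds and `withdraw` with removes are all assumptions made for the example. Only the CTE names and output field names come from the text.

```python
import sqlite3

# Hypothetical, simplified event table. The real pipeline queries Flipside
# Crypto / BigQuery tables whose names and columns are not given in the text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE liquidity_events ("
    "liquidity_pool_address TEXT, mint TEXT, action TEXT, amount REAL, block_ts TEXT)"
)
conn.executemany(
    "INSERT INTO liquidity_events VALUES (?, ?, ?, ?, ?)",
    [
        ("poolA", "mintA", "addLiquidity", 100.0, "2024-01-01T00:00:00"),
        ("poolA", "mintA", "deposit", 50.0, "2024-01-01T01:00:00"),
        ("poolA", "mintA", "removeLiquidity", 140.0, "2024-01-01T06:00:00"),
        ("poolB", "mintB", "addLiquidity", 80.0, "2024-02-01T00:00:00"),
    ],
)

# Assumption: 'deposit' counts as an add and 'withdraw' as a remove.
QUERY = """
WITH RecentLiquidityAdds AS (
    SELECT liquidity_pool_address, mint,
           SUM(amount) AS total_added_liquidity,
           COUNT(*)    AS num_liquidity_adds
    FROM liquidity_events
    WHERE action IN ('addLiquidity', 'deposit')
    GROUP BY liquidity_pool_address, mint
),
RecentLiquidityRemoves AS (
    SELECT liquidity_pool_address,
           SUM(amount) AS total_removed_liquidity,
           COUNT(*)    AS num_liquidity_removes
    FROM liquidity_events
    WHERE action IN ('removeLiquidity', 'withdraw')
    GROUP BY liquidity_pool_address
)
SELECT a.liquidity_pool_address, a.mint,
       a.total_added_liquidity,
       COALESCE(r.total_removed_liquidity, 0.0) AS total_removed_liquidity,
       a.num_liquidity_adds,
       COALESCE(r.num_liquidity_removes, 0) AS num_liquidity_removes
FROM RecentLiquidityAdds AS a
LEFT JOIN RecentLiquidityRemoves AS r
       ON a.liquidity_pool_address = r.liquidity_pool_address
ORDER BY a.liquidity_pool_address
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The `LEFT JOIN` keeps pools that have adds but no removes, which is why the removal columns are wrapped in `COALESCE`.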
Key schema fields include:
| Field | Type | Description |
|---|---|---|
| liquidity_pool_address | String | Identifies the unique pool account |
| mint | String | Token public key |
| total_added_liquidity | Float | Cumulative liquidity added |
| total_removed_liquidity | Float | Cumulative liquidity removed |
| num_liquidity_adds | Integer | Number of add events |
| num_liquidity_removes | Integer | Number of remove events |
| add_to_remove_ratio | Float | total_added_liquidity / total_removed_liquidity |
| first_pool_activity_ts | Timestamp | First recorded pool operation |
| last_pool_activity_ts | Timestamp | Most recent pool action timestamp |
| last_swap_ts | Timestamp | Most recent swap involving token |
| inactivity_status | Boolean | Indicates whether pool has become inactive |
Important derived features include:
- Inactivity period
- Liquidity removal rate
- Suspicion score (illustrative), with weights tuned on labeled data
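One way these derived features could be computed from the schema fields is sketched below. Every definition here is an assumption made for illustration: the inactivity period is taken as the time from `last_swap_ts` to the dataset cutoff, the removal rate as removed over added liquidity (the inverse of `add_to_remove_ratio`), and the suspicion score as a capped weighted sum. The weights and the 30-day horizon are placeholders, not tuned values from the paper.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Dataset cutoff stated in the text (November 1, 2024); UTC is an assumption.
CUTOFF = datetime(2024, 11, 1, tzinfo=timezone.utc)


@dataclass
class PoolRow:
    total_added_liquidity: float
    total_removed_liquidity: float
    last_swap_ts: datetime


def inactivity_days(row, cutoff=CUTOFF):
    """Assumed definition: days elapsed between the last swap and the cutoff."""
    return (cutoff - row.last_swap_ts).total_seconds() / 86400.0


def removal_rate(row):
    """Assumed definition: removed / added liquidity (0 when nothing was added)."""
    if row.total_added_liquidity <= 0:
        return 0.0
    return row.total_removed_liquidity / row.total_added_liquidity


def suspicion_score(row, w_rate=0.7, w_inactive=0.3, horizon_days=30.0):
    """Hypothetical weighted sum of two capped features; weights are placeholders."""
    rate_part = min(removal_rate(row), 1.0)
    inactive_part = min(inactivity_days(row) / horizon_days, 1.0)
    return w_rate * rate_part + w_inactive * inactive_part


example = PoolRow(200.0, 190.0, datetime(2024, 10, 2, tzinfo=timezone.utc))
print(inactivity_days(example), removal_rate(example), suspicion_score(example))
```

Capping both terms at 1.0 keeps the score in the range 0 to 1, which makes a single alert threshold easier to reason about.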
3. SolRPDS-DeFi: Annotation, Labeling, and Statistical Insights
Pools are annotated for activity using the last swap timestamp. A pool is labeled inactive if no further swaps occur after a RemoveLiquidity event before the dataset cutoff date. Labeling distinguishes:
- Active pools: ongoing swaps, balanced flows, inactivity_status = false
- Inactive (likely fraudulent) pools: high removal bursts, inactivity_status = true, negligible post-removal volume
- Suspected rug pulls: exhibit suspicious liquidity movements but remain tradable
- Confirmed rug pulls: inactivity and near/full liquidity drain
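The inactivity rule and the coarse label categories above can be approximated in code. This is a sketch under stated assumptions: the function names, the label strings, and the 0.9 drain threshold are hypothetical, and the mapping onto four labels is a simplification of the on-chain part of the procedure only.

```python
from datetime import datetime
from typing import Optional


def is_inactive(last_remove_ts: Optional[datetime],
                last_swap_ts: Optional[datetime],
                cutoff: datetime) -> bool:
    """True if liquidity was removed before the cutoff and no swap followed it."""
    if last_remove_ts is None or last_remove_ts > cutoff:
        return False
    if last_swap_ts is None:
        return True
    return last_swap_ts <= last_remove_ts


def coarse_label(added: float, removed: float, inactive: bool,
                 drain_threshold: float = 0.9) -> str:
    """Hypothetical mapping onto the four categories; the threshold is a placeholder."""
    drained = added > 0 and removed / added >= drain_threshold
    if inactive and drained:
        return "confirmed_rug_pull"
    if inactive:
        return "inactive"
    if drained:
        return "suspected_rug_pull"
    return "active"


cutoff = datetime(2024, 11, 1)
inactive_flag = is_inactive(datetime(2024, 5, 2), datetime(2024, 5, 1), cutoff)
print(inactive_flag, coarse_label(100.0, 98.0, inactive_flag))
```

A label produced this way is only a starting point, since it cannot capture off-chain evidence or manual review.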
Confirmation procedures combine on-chain forensics—such as mint timing, liquidity add/removal chronology, and post-removal inactivity—with off-chain corroboration (community reports, project disappearance, etc.). Manual review refines heuristics for ambiguous cases.
Key statistics:
- Unique tokens: 33,746; unique pools: 63,520
- Suspicious pools: 62,895
- Inactive/rug-pull tokens: 22,195; active tokens: 11,551
- Mean(total_added_liquidity): ; Mean(total_removed_liquidity):
- Mean(num_liquidity_adds): 1,485; Mean(num_liquidity_removes): 1,027
- Mean(add_to_remove_ratio): (heavy-tailed)
- Inactive pools cluster at fewer removals (mean ≈ 13); 75% of inactive tokens last <1 day
- Marked surge in rug-pull tokens in 2023–2024
4. SolRPDS-DeFi: Applications, Detection Algorithms, and Limitations
SolRPDS-DeFi supports multiple research vectors:
- Online detection: uses the add/remove ratio (add_to_remove_ratio) and inactivity (inactivity_status) as features. Threshold rules over these two features raise an alert and flag "suspicious" pools.
- Machine learning: Features from SolRPDS enable classifiers such as Random Forest and AdaBoost to achieve ~97% accuracy for rug-pull identification.
- Heuristic alarm systems: Threshold-based real-time surveillance
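A threshold-based monitor of the kind listed above might look like the following sketch. The class name, the event vocabulary (`add`, `remove`, `swap`), and both thresholds are hypothetical. The direction of the rule (a low add/remove ratio combined with a long idle period) is an assumption, since the text does not give the exact condition.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolState:
    added: float = 0.0
    removed: float = 0.0
    last_swap_ts: Optional[float] = None  # seconds; None until the pool is first seen


class RugPullMonitor:
    """Streaming rule: alert when add/remove ratio is low AND swaps have stopped."""

    def __init__(self, ratio_threshold: float = 1.1, idle_threshold_s: float = 86_400.0):
        # Both thresholds are illustrative placeholders, not values from the paper.
        self.ratio_threshold = ratio_threshold
        self.idle_threshold_s = idle_threshold_s
        self.pools = defaultdict(PoolState)

    def observe(self, pool: str, action: str, amount: float, ts: float) -> bool:
        state = self.pools[pool]
        if state.last_swap_ts is None:
            state.last_swap_ts = ts  # treat first sighting as the idle baseline
        if action == "add":
            state.added += amount
        elif action == "remove":
            state.removed += amount
        elif action == "swap":
            state.last_swap_ts = ts
        return self.is_suspicious(pool, ts)

    def is_suspicious(self, pool: str, now: float) -> bool:
        state = self.pools[pool]
        if state.removed <= 0 or state.last_swap_ts is None:
            return False
        ratio = state.added / state.removed
        idle = now - state.last_swap_ts
        return ratio <= self.ratio_threshold and idle >= self.idle_threshold_s


monitor = RugPullMonitor()
alerts = [
    monitor.observe("poolA", "add", 100.0, 0.0),
    monitor.observe("poolA", "swap", 1.0, 3_600.0),
    monitor.observe("poolA", "remove", 95.0, 7_200.0),
]
later = monitor.is_suspicious("poolA", 3_600.0 + 86_400.0)
print(alerts, later)
```

The removal alone does not trigger an alert in this sketch. The pool is flagged only once a full idle window has passed without swaps, which trades detection speed for fewer false alarms.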
Limitations include annotation uncertainty (benign inactivity confounded with attacks), focus on DEX liquidity (omitting OTC/cross-chain drains), and potential mislabeling due to cutoff-date semantics.
The dataset repository is available via https://github.com/DeFiLabX/SolRPDS under CC BY 4.0, provided in CSV and JSON, and compatible with Python (Pandas, scikit-learn), Spark, SQL engines, and cloud-based querying interfaces.
5. SolRPDS-Storage (APLS): Low-Latency Degraded Reads in RS-Coded Storage
In the domain of distributed storage, SolRPDS refers to the APLS (All Parallelism + Light-loaded Starter) solution for accelerating degraded reads in Reed-Solomon (RS)-coded storage systems (Xie et al., 2023). In RS(k, m) codes, each stripe spans k data plus m parity chunks; recovery from node unavailability (degraded read) typically requires contacting k remaining nodes and reconstructing the missing chunk—an operation that incurs higher latency than normal reads.
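The shape of a degraded read (fetch k surviving chunks, decode the missing one) can be shown with a single-parity XOR code. This is a deliberate simplification: XOR parity tolerates only one lost chunk (m = 1), whereas Reed-Solomon codes use Galois-field arithmetic to tolerate m failures. The function names and sample data are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(data_chunks):
    """Single parity chunk: XOR of all k data chunks."""
    return reduce(xor_bytes, data_chunks)


def degraded_read(surviving_chunks):
    """Rebuild the one missing chunk from the k survivors (data plus parity)."""
    return reduce(xor_bytes, surviving_chunks)


k = 4
data = [bytes([i] * 8) for i in range(1, k + 1)]  # four 8-byte data chunks
parity = encode_parity(data)

lost = 2  # index of the unavailable data chunk
survivors = [chunk for i, chunk in enumerate(data) if i != lost] + [parity]
rebuilt = degraded_read(survivors)
print(rebuilt == data[lost])
```

Even in this toy setting the reader must pull k chunks to serve one, which is the source of the extra latency relative to a normal read.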
Traditional systems, including ECPipe, use agent nodes and pipelined decoding but remain bottlenecked by the need to deliver the reconstruction traffic to a single receiver. Even under optimal conditions, their degraded reads take 1.3–1.6× the latency of normal reads.
6. SolRPDS-Storage (APLS): Algorithmic Design, Formulation, and Performance
APLS addresses degraded-read latency by (1) engaging all surviving source nodes as agents and (2) dynamically designating a light-loaded starter node with high spare bandwidth, rather than limiting the receiver to an existing source. The rebuilt chunk is divided into packets. Each agent reconstructs an equal share of the chunk's bytes, assembling its portion from the surviving packets of each segment in a rotating assignment. Agents then transmit their packets in parallel to the starter node, which assembles the full chunk for client delivery.
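The two ideas above, picking a light-loaded starter and rotating packet reconstruction across agents, can be sketched as follows. This is not the APLS implementation: it runs in a single process, uses XOR parity in place of Reed-Solomon decoding, and the node names, bandwidth figures, and function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def split(chunk: bytes, packet_size: int):
    """Cut a chunk into fixed-size packets."""
    return [chunk[i:i + packet_size] for i in range(0, len(chunk), packet_size)]


def apls_style_read(survivors: dict, spare_bw: dict, packet_size: int):
    """Return (starter, plan, rebuilt_chunk) for one degraded read."""
    agents = sorted(survivors)                 # every surviving source acts as an agent
    starter = max(spare_bw, key=spare_bw.get)  # node with the most spare bandwidth
    packets = {node: split(chunk, packet_size) for node, chunk in survivors.items()}
    num_packets = len(packets[agents[0]])

    plan = {agent: [] for agent in agents}
    received = {}                              # what the starter collects
    for i in range(num_packets):
        agent = agents[i % len(agents)]        # rotating assignment
        plan[agent].append(i)
        # The assigned agent gathers packet i from all survivors and decodes it.
        received[i] = reduce(xor_bytes, (packets[node][i] for node in agents))

    rebuilt = b"".join(received[i] for i in range(num_packets))
    return starter, plan, rebuilt


d0, d1, d2 = bytes(range(0, 8)), bytes(range(8, 16)), bytes(range(16, 24))
parity = reduce(xor_bytes, [d0, d1, d2])
survivors = {"n0": d0, "n2": d2, "p": parity}   # d1's node is unavailable
spare_bw = {"n0": 100, "n2": 50, "p": 80, "n5": 900}  # Mbps, illustrative

starter, plan, rebuilt = apls_style_read(survivors, spare_bw, packet_size=2)
print(starter, plan, rebuilt == d1)
```

In this example the starter `n5` is not one of the source nodes, which mirrors the point that the receiver need not be an existing source.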
Analytically, the latency model is parameterized by the agent count and the per-node bandwidth, of which only a fraction is allocable to degraded reads. It compares two cases: Classic/ECPipe, and APLS with the starter not among the sources. With maximal agent use, the modeled APLS degraded-read latency approaches a value below that of a direct (normal) read.
Empirical evaluation on a 16-node Alibaba Cloud testbed (RS(10,4), chunk size 64 MB, helper bandwidth capped at 100–1500 Mbps) showed:
- 6–25% latency reduction over ECPipe under heavy load (100 Mbps)
- For small chunks (256 KB/4 MB), up to 28% lower latency than ECPipe at 200 Mbps
- Increasing the agent count widened the latency reduction from 16% to 45%, consistent with the theoretical bound
- APLS outperformed both single-agent and multi-agent ECPipe modes in nearly all scenarios
7. SolRPDS-Storage (APLS): Complexity, Trade-Offs, and Future Prospects
APLS retains the same overall network load per degraded read as the classic approach, but distributes it across agents for higher parallelism. Each agent performs finite-field operations only on its assigned share of bytes.
Trade-offs include:
- Diminishing returns at high agent counts (more nodes, smaller per-agent shares, greater packet overhead)
- Potential for suboptimal starter allocation if load statistics are stale
- Overhead from very small packets (inefficiency, protocol overhead)
- Variability in per-node bandwidth across heterogeneous networks may call for weighted (non-uniform) data partitioning
- Batch processing and integration with regenerating codes or locally repairable codes (LRC) may further improve degraded read efficiency
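The weighted partitioning mentioned among the trade-offs could, for instance, split the chunk in proportion to each agent's available bandwidth. The sketch below is one hypothetical way to do that with integer arithmetic and largest-remainder rounding. It is not part of APLS as described, and the function name and bandwidth values are assumptions.

```python
def weighted_partition(chunk_size: int, bandwidth: dict) -> dict:
    """Split chunk_size bytes across agents in proportion to integer bandwidths."""
    total = sum(bandwidth.values())
    if total <= 0:
        raise ValueError("total bandwidth must be positive")
    # Floor of each agent's proportional share.
    shares = {agent: chunk_size * bw // total for agent, bw in bandwidth.items()}
    leftover = chunk_size - sum(shares.values())
    # Hand the remaining bytes to the agents with the largest fractional remainder.
    by_remainder = sorted(
        bandwidth,
        key=lambda agent: (chunk_size * bandwidth[agent]) % total,
        reverse=True,
    )
    for agent in by_remainder[:leftover]:
        shares[agent] += 1
    return shares


shares = weighted_partition(1000, {"a": 100, "b": 200, "c": 300})
print(shares)
```

Largest-remainder rounding guarantees that the shares sum exactly to the chunk size, so no byte is left unassigned or reconstructed twice.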
APLS substantially closes the degraded-read performance gap, delivering degraded reads at nearly the cost of normal reads and surpassing prior state-of-the-art ECPipe by up to 28% under practical workloads (Xie et al., 2023).
References:
- "SolRPDS: A Dataset for Analyzing Rug Pulls in Solana Decentralized Finance" (Alhaidari et al., 6 Apr 2025)
- "Boosting the Performance of Degraded Reads in RS-coded Distributed Storage Systems" (Xie et al., 2023)