SolRPDS: DeFi Rug Pull Dataset & RS Storage

Updated 12 December 2025

SolRPDS-DeFi is a comprehensive dataset capturing 3.69B Solana on-chain transactions, enabling advanced liquidity analytics to identify rug pulls in DeFi.
SolRPDS-Storage (APLS) is an RS-coded storage solution that uses parallel agent nodes to reduce degraded-read latency by up to 28% relative to ECPipe.
Both contributions exemplify practical applications of robust data processing and algorithmic design in blockchain security analytics and distributed storage performance.

SolRPDS refers to two distinct works in contemporary computing: (1) the Solana Rug Pull Dataset for decentralized finance security research, and (2) the Solution for RS-coded PDS (APLS) in distributed storage systems. Both are referred to as SolRPDS, but they belong to separate domains: blockchain security analytics and erasure-coded storage performance, respectively. This article covers both, calling them SolRPDS-DeFi and SolRPDS-Storage (editor's terms) for clarity, and presents technical details based strictly on the published research.

1. SolRPDS-DeFi: Dataset for Rug Pull Analysis on Solana

SolRPDS-DeFi is the first public dataset designed to facilitate research on rug pulls in the Solana decentralized finance (DeFi) ecosystem (Alhaidari et al., 6 Apr 2025). Rug pulls are a prevalent class of exit scams in DeFi, characterized by developers extracting liquidity from decentralized exchange (DEX) token pools, often leaving users with unredeemable tokens. While similar datasets exist for Ethereum and Binance Smart Chain, systematic analyses for Solana were previously lacking.

The dataset provides comprehensive coverage of Solana DeFi activity over nearly four years (February 12, 2021 through November 1, 2024), including 3.69 billion on-chain transactions, 278 million liquidity pool actions, and 3.42 million token swaps covering major DEXs such as Raydium and Jupiter. It comprises 62,895 suspicious liquidity pools, annotated for inactivity (a primary rug-pull indicator), and details liquidity movements (additions, removals), inactivity periods, and amounts withdrawn.

2. SolRPDS-DeFi: Data Collection, Schema, and Feature Engineering

Data extraction for SolRPDS-DeFi employs a pipeline leveraging Flipside Crypto and Google BigQuery for Solana. All raw transactions are parsed to identify 15 liquidity action types, including deposit, addLiquidity, removeLiquidity, and withdraw. Two primary SQL common table expressions (CTEs)—RecentLiquidityAdds and RecentLiquidityRemoves—aggregate pool-level statistics such as total_added_liquidity, total_removed_liquidity, the number of actions, and timestamps.

Key schema fields include:

Field	Type	Description
liquidity_pool_address	String	Identifies the unique pool account
mint	String	Token public key
total_added_liquidity	Float	Cumulative liquidity added
total_removed_liquidity	Float	Cumulative liquidity removed
num_liquidity_adds	Integer	Number of add events
num_liquidity_removes	Integer	Number of remove events
add_to_remove_ratio	Float	total_added_liquidity / total_removed_liquidity
first_pool_activity_ts	Timestamp	First recorded pool operation
last_pool_activity_ts	Timestamp	Most recent pool action timestamp
last_swap_ts	Timestamp	Most recent swap involving token
inactivity_status	Boolean	Indicates whether pool has become inactive
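
The pool-level fields in the table can be derived from raw liquidity events with a simple group-by. The sketch below is a minimal pure-Python illustration, not the dataset's actual pipeline (which the source describes as SQL CTEs). The event tuple layout and the simplified action labels "add" and "remove" are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_pools(events):
    """Aggregate raw liquidity events into per-pool statistics.

    `events` is an iterable of (pool_address, action, amount, timestamp)
    tuples, where action is "add" or "remove" (hypothetical layout).
    """
    pools = defaultdict(lambda: {
        "total_added_liquidity": 0.0,
        "total_removed_liquidity": 0.0,
        "num_liquidity_adds": 0,
        "num_liquidity_removes": 0,
        "first_pool_activity_ts": None,
        "last_pool_activity_ts": None,
    })
    for pool, action, amount, ts in events:
        if action not in ("add", "remove"):
            continue  # other action types are outside this simplified sketch
        row = pools[pool]
        if action == "add":
            row["total_added_liquidity"] += amount
            row["num_liquidity_adds"] += 1
        else:
            row["total_removed_liquidity"] += amount
            row["num_liquidity_removes"] += 1
        first = row["first_pool_activity_ts"]
        last = row["last_pool_activity_ts"]
        row["first_pool_activity_ts"] = ts if first is None else min(first, ts)
        row["last_pool_activity_ts"] = ts if last is None else max(last, ts)
    for row in pools.values():
        removed = row["total_removed_liquidity"]
        # The ratio is undefined when nothing was removed; None marks that case.
        row["add_to_remove_ratio"] = (
            row["total_added_liquidity"] / removed if removed > 0 else None
        )
    return dict(pools)
```

Leaving the ratio undefined for pools with no removals avoids a division by zero; how the published dataset encodes that case is not stated in the source.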

Important derived features include:

Inactivity period: $T_{\mathrm{inactivity}} = T_{\mathrm{last\_interaction}} - T_{\mathrm{previous\_interaction}}$
Liquidity removal rate: $R_{\mathrm{remove}} = \frac{\mathrm{total\_removed\_liquidity}}{\mathrm{total\_added\_liquidity}}$
Suspicion score (illustrative): $\mathrm{SuspicionScore} = \alpha R_{\mathrm{remove}} + \beta T_{\mathrm{inactivity}}$ (with weights $\alpha$ , $\beta$ tuned on labeled data)
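
The three derived features above translate directly into small functions. This is a minimal sketch: the function names, the choice of seconds for timestamps and days for the inactivity period, and the default weights are all assumptions for illustration. The source only says that $\alpha$ and $\beta$ would be tuned on labeled data.

```python
SECONDS_PER_DAY = 86_400


def removal_rate(total_removed, total_added):
    """R_remove = total_removed_liquidity / total_added_liquidity."""
    if total_added <= 0:
        return 0.0  # assumption: treat pools with no added liquidity as rate 0
    return total_removed / total_added


def inactivity_days(last_interaction_ts, previous_interaction_ts):
    """T_inactivity in days, from two Unix timestamps in seconds."""
    return (last_interaction_ts - previous_interaction_ts) / SECONDS_PER_DAY


def suspicion_score(r_remove, t_inactivity, alpha=1.0, beta=1.0):
    """Illustrative linear score; alpha and beta are placeholder weights."""
    return alpha * r_remove + beta * t_inactivity
```

Because $R_{\mathrm{remove}}$ is a ratio and $T_{\mathrm{inactivity}}$ is a duration, the two weights also absorb the difference in scale between the features.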

3. SolRPDS-DeFi: Annotation, Labeling, and Statistical Insights

Pools are annotated for activity using the last swap timestamp. A pool is labeled inactive if no swaps occur between a RemoveLiquidity event and the dataset cutoff date. Labeling distinguishes:

Active pools: ongoing swaps, balanced flows, inactivity_status = false
Inactive (likely fraudulent) pools: high removal bursts, inactivity_status = true, negligible post-removal volume
Suspected rug pulls: exhibit suspicious liquidity movements but remain tradable
Confirmed rug pulls: inactivity and near/full liquidity drain
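
The inactivity rule can be expressed as a short predicate. The sketch below assumes the most recent RemoveLiquidity event is the reference point and that swap timestamps are available as a list; both are assumptions, since the source does not specify which removal event anchors the check.

```python
def is_inactive(remove_ts, swap_timestamps, cutoff_ts):
    """Return True if no swap falls after `remove_ts` and up to `cutoff_ts`.

    remove_ts: timestamp of the reference RemoveLiquidity event, or None.
    swap_timestamps: iterable of swap timestamps for the pool.
    cutoff_ts: dataset cutoff timestamp.
    """
    if remove_ts is None or remove_ts > cutoff_ts:
        return False  # no removal inside the observation window
    return not any(remove_ts < s <= cutoff_ts for s in swap_timestamps)
```

A pool whose removal happens shortly before the cutoff is labeled inactive even if trading would have resumed later, which is one way the cutoff date can cause mislabeling.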

Confirmation procedures combine on-chain forensics—such as mint timing, liquidity add/removal chronology, and post-removal inactivity—with off-chain corroboration (community reports, project disappearance, etc.). Manual review refines heuristics for ambiguous cases.

Key statistics:

Unique tokens: 33,746; unique pools: 63,520
Suspicious pools: 62,895
Inactive/rug-pull tokens: 22,195; active tokens: 11,551
Mean(total_added_liquidity): $4.99 \times 10^{13}$ ; Mean(total_removed_liquidity): $1.55 \times 10^{13}$
Mean(num_liquidity_adds): 1,485; Mean(num_liquidity_removes): 1,027
Mean(add_to_remove_ratio): $6.88 \times 10^{4}$ (heavy-tailed)
Inactive pools cluster at fewer removals (mean ≈ 13); 75% of inactive tokens last <1 day
Marked surge in rug-pull tokens in 2023–2024

4. SolRPDS-DeFi: Applications, Detection Algorithms, and Limitations

SolRPDS-DeFi supports multiple research vectors:

Online detection: using the liquidity removal rate ($R_{\mathrm{remove}}$) and inactivity ($T_{\mathrm{inactivity}}$) as features. For example, a rule can raise an alert when $R_{\mathrm{remove}} > \theta_1$ and $(\mathrm{current\_ts} - \mathrm{last\_remove\_ts}) < \theta_2$, flagging the pool as "suspicious".
Machine learning: Features from SolRPDS enable classifiers such as Random Forest and AdaBoost to achieve ~97% accuracy for rug-pull identification.
Heuristic alarm systems: Threshold-based real-time surveillance
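
The threshold rule for online detection can be sketched as a small stateful monitor that consumes liquidity events for one pool. The class name, the "add"/"remove" action labels, and any concrete threshold values are hypothetical; the source gives only the form of the rule, not its parameters.

```python
class PoolMonitor:
    """Tracks one pool and applies the rule R_remove > theta1 and
    (current_ts - last_remove_ts) < theta2."""

    def __init__(self, theta1, theta2):
        self.theta1 = theta1  # removal-rate threshold (assumed, unitless)
        self.theta2 = theta2  # recency window, same units as the timestamps
        self.added = 0.0
        self.removed = 0.0
        self.last_remove_ts = None

    def observe(self, action, amount, ts):
        """Update running totals from a single liquidity event."""
        if action == "add":
            self.added += amount
        elif action == "remove":
            self.removed += amount
            self.last_remove_ts = ts

    def is_suspicious(self, current_ts):
        """Evaluate the alert rule at time `current_ts`."""
        if self.last_remove_ts is None or self.added <= 0:
            return False
        r_remove = self.removed / self.added
        recent = (current_ts - self.last_remove_ts) < self.theta2
        return r_remove > self.theta1 and recent
```

The recency condition means an alert expires once the last removal is older than $\theta_2$, so this rule targets removals that have just happened rather than long-dead pools.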

Limitations include annotation uncertainty (benign inactivity confounded with attacks), focus on DEX liquidity (omitting OTC/cross-chain drains), and potential mislabeling due to cutoff-date semantics.

The dataset repository is available via https://github.com/DeFiLabX/SolRPDS under CC BY 4.0, provided in CSV and JSON, and compatible with Python (Pandas, scikit-learn), Spark, SQL engines, and cloud-based querying interfaces.

5. SolRPDS-Storage (APLS): Low-Latency Degraded Reads in RS-Coded Storage

In the domain of distributed storage, SolRPDS refers to the APLS (All Parallelism + Light-loaded Starter) solution for accelerating degraded reads in Reed-Solomon (RS)-coded storage systems (Xie et al., 2023). In RS(k, m) codes, each stripe spans k data plus m parity chunks; recovery from node unavailability (degraded read) typically requires contacting k remaining nodes and reconstructing the missing chunk—an operation that incurs higher latency than normal reads.

Traditional systems, including ECPipe, use agent nodes and pipelined decoding but remain bottlenecked by the standard requirement to transmit $k \cdot c$ bytes (chunk size $c$) to a single receiver, so degraded reads still take 1.3–1.6$\times$ the latency of normal reads, even under optimal conditions.

6. SolRPDS-Storage (APLS): Algorithmic Design, Formulation, and Performance

APLS addresses degraded-read latency by (1) engaging all surviving source nodes ($k \leq q \leq k+m-1$) as agents and (2) dynamically designating a light-loaded starter node with high spare bandwidth, rather than limiting the receiver to an existing source. The rebuilt chunk is divided into packets; each agent reconstructs $c/q$ bytes, assembling its portion from $k$ surviving packets per segment in a rotating assignment. Agents then transmit their rebuilt packets in parallel to the starter node, which assembles the full chunk for client delivery.
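
To make the rotating assignment concrete, the sketch below builds one possible plan and tallies per-node traffic. The specific rotation used here (agent $i$ rebuilds segment $i$ from its own packet plus packets from the next $k-1$ nodes, modulo $q$) is an assumption for illustration; the source states only that the assignment rotates. The starter is assumed not to be one of the source nodes.

```python
def rotating_plan(k, q):
    """Map each agent to the k source nodes whose packets it combines.

    Hypothetical rotation: agent i uses nodes i, i+1, ..., i+k-1 (mod q).
    """
    if not 1 <= k <= q:
        raise ValueError("need 1 <= k <= q")
    return {agent: [(agent + j) % q for j in range(k)] for agent in range(q)}


def traffic(k, q, c):
    """Return (upload per node, download per agent, total bytes sent)."""
    seg = c / q  # bytes rebuilt by each agent
    upload = [0.0] * q
    download = [0.0] * q
    for agent, sources in rotating_plan(k, q).items():
        for node in sources:
            if node != agent:  # the agent reads its own packet locally
                upload[node] += seg
                download[agent] += seg
        upload[agent] += seg  # rebuilt segment goes to the starter
    return upload, download, sum(upload)
```

Under this plan every node uploads $k \cdot c/q$ bytes and every agent downloads $(k-1) \cdot c/q$ bytes, so the load is spread evenly, and the total bytes sent come to $k \cdot c$.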

Analytically, for agent count $q$ and per-node bandwidth $B$ (with a fraction $\theta_s$ allocable to degraded reads), the transfer times for a chunk of size $c$ are:

Classic/ECPipe: $T_1(c) = c / (\theta_s B)$
APLS (starter not in sources): $T_2(c) = \max\{k\cdot c/(q\theta_s B),\, (k-1)\cdot c/(q\theta_s B)\} \approx k \cdot c/(q\theta_s B)$

With $q > k$, the factor $k/q$ is below 1 and therefore $T_2(c) < T_1(c)$. For maximal agent use ($q = k+m-1$) and $\theta_s \sim 1$, $T_2(c)$ approaches $(k/(k+m-1)) \cdot c/B$, which is less than the $c/B$ of a direct (normal) read.
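
The two latency expressions are easy to evaluate numerically. The following sketch implements them as written; the function names are illustrative, and the example inputs (a 64 MB chunk taken as $512 \times 10^6$ bits, 100 Mbps, $\theta_s = 1$) are assumptions chosen to resemble the evaluation setting rather than measured values.

```python
def t_classic(c_bits, bandwidth_bps, theta_s=1.0):
    """T_1(c) = c / (theta_s * B)."""
    return c_bits / (theta_s * bandwidth_bps)


def t_apls(c_bits, bandwidth_bps, k, q, theta_s=1.0):
    """T_2(c) = max(k, k - 1) * c / (q * theta_s * B), starter not a source."""
    if not 1 <= k <= q:
        raise ValueError("need 1 <= k <= q")
    per_unit = c_bits / (q * theta_s * bandwidth_bps)
    return max(k * per_unit, (k - 1) * per_unit)


if __name__ == "__main__":
    c_bits = 64 * 8 * 10**6   # 64 MB chunk, assuming 1 MB = 10^6 bytes
    bandwidth = 100 * 10**6   # 100 Mbps
    t1 = t_classic(c_bits, bandwidth)
    t2 = t_apls(c_bits, bandwidth, k=10, q=13)  # RS(10,4), q = k + m - 1
    print(f"T1 = {t1:.2f} s, T2 = {t2:.2f} s, ratio = {t2 / t1:.3f}")
```

For RS(10,4) with all 13 surviving nodes as agents, the model gives $T_2/T_1 = 10/13 \approx 0.77$, independent of chunk size and bandwidth. With $q = k$ the two expressions coincide.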

Empirical evaluation on a 16-node Alibaba Cloud testbed (RS(10,4), chunk size 64 MB, helper bandwidth capped at 100–1500 Mbps) showed:

6–25% latency reduction over ECPipe under heavy load (100 Mbps)
For small chunks (256 KB/4 MB), up to 28% lower latency than ECPipe at 200 Mbps
Increasing agent count $q$ from $k$ to $k+m-1$ yielded latency declines from 16% to 45%, consistent with the $k/q$ theoretical bound
APLS outperformed both single-agent and multi-agent ECPipe modes in nearly all scenarios

7. SolRPDS-Storage (APLS): Complexity, Trade-Offs, and Future Prospects

APLS retains overall network load of $k \cdot c$ per degraded read (same as classic), but distributes it for higher parallelism. Each agent executes $k$ finite-field operations per byte but only for its assigned $c/q$ bytes.
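
The compute cost stated above can be checked with a short calculation. Counting one finite-field operation per source byte combined (a simplification), and writing $W_{\text{agent}}$ and $W_{\text{total}}$ for per-agent and total decoding work (notation introduced here, not taken from the source):

```latex
\begin{align*}
W_{\text{agent}} &= k \cdot \frac{c}{q}, \\
W_{\text{total}} &= q \cdot W_{\text{agent}} = k \cdot c .
\end{align*}
```

The total decoding work therefore matches that of a single decoder rebuilding all $c$ bytes from $k$ chunks, while each agent carries only a $1/q$ share. For RS(10,4) with $q = 13$, each agent performs about $10c/13 \approx 0.77\,c$ operations.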

Trade-offs include:

Diminishing returns for high $q$ (more nodes, smaller $c/q$ , greater packet overhead)
Potential for suboptimal starter allocation if load statistics are stale
Overhead from very small packets (inefficiency, protocol overhead)
Variability in $\theta_s$ across heterogeneous networks may call for weighted (non-uniform) data partitioning

Looking ahead, batch processing and integration with regenerating codes or locally repairable codes (LRC) may further improve degraded-read efficiency.
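
The weighted (non-uniform) partitioning mentioned among the trade-offs could look like the sketch below, which splits the rebuilt chunk in proportion to each agent's spare-bandwidth fraction. This is not part of APLS as described in the source, which assigns a uniform $c/q$ bytes per agent; the function name and the rounding policy are assumptions.

```python
def weighted_partition(c_bytes, thetas):
    """Split `c_bytes` among agents in proportion to their theta_s values.

    thetas: per-agent spare-bandwidth fractions (positive numbers).
    Returns a list of integer byte counts that sums to c_bytes.
    """
    if not thetas or any(t <= 0 for t in thetas):
        raise ValueError("thetas must be a non-empty list of positive values")
    total = sum(thetas)
    shares = [int(c_bytes * t / total) for t in thetas]
    # Give the rounding leftover to the agent with the most spare bandwidth.
    best = max(range(len(thetas)), key=lambda i: thetas[i])
    shares[best] += c_bytes - sum(shares)
    return shares
```

With equal $\theta_s$ values this reduces to the uniform $c/q$ split up to rounding, so it can be read as a generalization of the uniform scheme.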

APLS substantially closes the degraded-read performance gap, delivering degraded reads at nearly the cost of normal reads and surpassing prior state-of-the-art ECPipe by up to 28% under practical workloads (Xie et al., 2023).

References:

"SolRPDS: A Dataset for Analyzing Rug Pulls in Solana Decentralized Finance" (Alhaidari et al., 6 Apr 2025)
"Boosting the Performance of Degraded Reads in RS-coded Distributed Storage Systems" (Xie et al., 2023)