
SolRPDS: DeFi Rug Pull Dataset & RS Storage

Updated 12 December 2025
  • SolRPDS-DeFi is a comprehensive dataset capturing 3.69B Solana on-chain transactions, enabling advanced liquidity analytics to identify rug pulls in DeFi.
  • SolRPDS-Storage (APLS) is an innovative RS-coded storage solution that uses parallel agent nodes to reduce degraded-read latency by up to 28% under heavy load.
  • Both contributions exemplify practical applications of robust data processing and algorithmic design in blockchain security analytics and distributed storage performance.

SolRPDS refers to two distinct but technically significant works in contemporary computing: (1) the Solana Rug Pull Dataset for decentralized finance security research, and (2) the Solution for RS-coded PDS (APLS) in distributed storage systems. Both contributions are referenced as SolRPDS and are independently prominent in their domains—blockchain security analytics and erasure-coded storage performance, respectively. This article covers both, referencing each as SolRPDS-DeFi and SolRPDS-Storage (Editor's term) for clarity, and presents technical details based strictly on the published research.

1. SolRPDS-DeFi: Dataset for Rug Pull Analysis on Solana

SolRPDS-DeFi is the first public dataset designed to facilitate research on rug pulls in the Solana decentralized finance (DeFi) ecosystem (Alhaidari et al., 6 Apr 2025). Rug pulls are a prevalent class of exit scams in DeFi, characterized by developers extracting liquidity from decentralized exchange (DEX) token pools, often leaving users with unredeemable tokens. While similar datasets exist for Ethereum and Binance Smart Chain, systematic analyses for Solana were previously lacking.

The dataset provides comprehensive coverage of Solana DeFi activity over nearly four years (February 12, 2021 through November 1, 2024), including 3.69 billion on-chain transactions, 278 million liquidity pool actions, and 3.42 million token swaps covering major DEXs such as Raydium and Jupiter. It comprises 62,895 suspicious liquidity pools, annotated for inactivity (a primary rug-pull indicator), and details liquidity movements (additions, removals), inactivity periods, and amounts withdrawn.

2. SolRPDS-DeFi: Data Collection, Schema, and Feature Engineering

Data extraction for SolRPDS-DeFi employs a pipeline leveraging Flipside Crypto and Google BigQuery for Solana. All raw transactions are parsed to identify 15 liquidity action types, including deposit, addLiquidity, removeLiquidity, and withdraw. Two primary SQL common table expressions (CTEs)—RecentLiquidityAdds and RecentLiquidityRemoves—aggregate pool-level statistics such as total_added_liquidity, total_removed_liquidity, the number of actions, and timestamps.
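
The pool-level aggregation described above can be sketched in plain Python. This is only an illustration, not the dataset's actual pipeline: the event tuple layout `(pool, action, amount, ts)`, the function name, and the assignment of `deposit` to additions and `withdraw` to removals are all assumptions, and the source names only four of the 15 action types.

```python
from collections import defaultdict

# Assumed grouping of action types (the source names only these four of 15).
ADD_ACTIONS = {"deposit", "addLiquidity"}
REMOVE_ACTIONS = {"removeLiquidity", "withdraw"}


def aggregate_pool_stats(events):
    """Aggregate per-pool statistics from (pool, action, amount, ts) tuples."""
    stats = defaultdict(lambda: {
        "total_added_liquidity": 0.0,
        "total_removed_liquidity": 0.0,
        "num_liquidity_adds": 0,
        "num_liquidity_removes": 0,
        "first_pool_activity_ts": None,
        "last_pool_activity_ts": None,
    })
    for pool, action, amount, ts in events:
        if action in ADD_ACTIONS:
            total_key, count_key = "total_added_liquidity", "num_liquidity_adds"
        elif action in REMOVE_ACTIONS:
            total_key, count_key = "total_removed_liquidity", "num_liquidity_removes"
        else:
            continue  # not a liquidity action (for example, a swap)
        row = stats[pool]
        row[total_key] += amount
        row[count_key] += 1
        first, last = row["first_pool_activity_ts"], row["last_pool_activity_ts"]
        row["first_pool_activity_ts"] = ts if first is None else min(first, ts)
        row["last_pool_activity_ts"] = ts if last is None else max(last, ts)
    return dict(stats)


# Hypothetical events, with integer timestamps for brevity.
events = [
    ("poolA", "addLiquidity", 100.0, 1),
    ("poolA", "removeLiquidity", 90.0, 5),
    ("poolA", "swap", 3.0, 6),
    ("poolB", "deposit", 50.0, 2),
]
pool_stats = aggregate_pool_stats(events)
```

In this sketch the activity timestamps are derived from liquidity actions only, so the swap at `ts = 6` does not move `last_pool_activity_ts`.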

Key schema fields include:

| Field | Type | Description |
|---|---|---|
| liquidity_pool_address | String | Identifies the unique pool account |
| mint | String | Token public key |
| total_added_liquidity | Float | Cumulative liquidity added |
| total_removed_liquidity | Float | Cumulative liquidity removed |
| num_liquidity_adds | Integer | Number of add events |
| num_liquidity_removes | Integer | Number of remove events |
| add_to_remove_ratio | Float | total_added_liquidity / total_removed_liquidity |
| first_pool_activity_ts | Timestamp | First recorded pool operation |
| last_pool_activity_ts | Timestamp | Most recent pool action timestamp |
| last_swap_ts | Timestamp | Most recent swap involving token |
| inactivity_status | Boolean | Indicates whether pool has become inactive |

Important derived features include:

  • Inactivity period: $T_{\mathrm{inactivity}} = T_{\mathrm{last\_interaction}} - T_{\mathrm{previous\_interaction}}$
  • Liquidity removal rate: $R_{\mathrm{remove}} = \frac{\mathrm{total\_removed\_liquidity}}{\mathrm{total\_added\_liquidity}}$
  • Suspicion score (illustrative): $\mathrm{SuspicionScore} = \alpha R_{\mathrm{remove}} + \beta T_{\mathrm{inactivity}}$ (with weights $\alpha$, $\beta$ tuned on labeled data)
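
As a minimal sketch, the three derived features can be computed as follows. The function names, the choice of days as the unit for the inactivity period, and the example weights `alpha=0.7` and `beta=0.01` are assumptions made for illustration. The source says only that the weights are tuned on labeled data.

```python
from datetime import datetime


def removal_rate(total_added_liquidity, total_removed_liquidity):
    """R_remove = total_removed_liquidity / total_added_liquidity."""
    if total_added_liquidity <= 0:
        return None  # undefined when nothing was added
    return total_removed_liquidity / total_added_liquidity


def inactivity_days(last_interaction, previous_interaction):
    """T_inactivity in days (the unit is an assumption of this sketch)."""
    return (last_interaction - previous_interaction).total_seconds() / 86400.0


def suspicion_score(r_remove, t_inactivity, alpha, beta):
    """Illustrative linear score: alpha * R_remove + beta * T_inactivity."""
    return alpha * r_remove + beta * t_inactivity


# Hypothetical pool: 90 of 100 units removed, 30 days between interactions.
r = removal_rate(100.0, 90.0)
t = inactivity_days(datetime(2024, 3, 31), datetime(2024, 3, 1))
score = suspicion_score(r, t, alpha=0.7, beta=0.01)  # placeholder weights
```

Because $R_{\mathrm{remove}}$ is dimensionless while $T_{\mathrm{inactivity}}$ carries a time unit, the weight $\beta$ only makes sense relative to the unit chosen.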

3. SolRPDS-DeFi: Annotation, Labeling, and Statistical Insights

Pools are annotated for activity using the last swap timestamp. A pool is labeled inactive if no swaps occur between a RemoveLiquidity event and the dataset cutoff date. Labeling distinguishes:

  • Active pools: ongoing swaps, balanced flows, inactivity_status = false
  • Inactive (likely fraudulent) pools: high removal bursts, inactivity_status = true, negligible post-removal volume
  • Suspected rug pulls: exhibit suspicious liquidity movements but remain tradable
  • Confirmed rug pulls: inactivity and near/full liquidity drain
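
The inactivity rule can be sketched as a small predicate. This is one interpretation of the rule as stated, not the authors' labeling code: the function name and arguments are hypothetical, and both timestamps are assumed to be already truncated at the dataset cutoff.

```python
from datetime import datetime


def is_inactive(last_remove_ts, last_swap_ts):
    """Return True when no swap follows the most recent liquidity removal.

    Both timestamps are assumed to lie at or before the dataset cutoff,
    so "no swap after the removal" means "none before the cutoff".
    """
    if last_remove_ts is None:
        return False  # liquidity was never removed, so the rule does not apply
    if last_swap_ts is None:
        return True   # liquidity was removed and the pool never saw a swap
    return last_swap_ts <= last_remove_ts


# Hypothetical pools.
drained = is_inactive(datetime(2024, 5, 2), datetime(2024, 5, 1))
still_traded = is_inactive(datetime(2024, 5, 2), datetime(2024, 6, 1))
```

A pool whose last removal falls shortly before the cutoff is flagged by this predicate even if trading would have resumed later, which is the cutoff-date ambiguity the dataset's limitations mention.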

Confirmation procedures combine on-chain forensics—such as mint timing, liquidity add/removal chronology, and post-removal inactivity—with off-chain corroboration (community reports, project disappearance, etc.). Manual review refines heuristics for ambiguous cases.

Key statistics:

  • Unique tokens: 33,746; unique pools: 63,520
  • Suspicious pools: 62,895
  • Inactive/rug-pull tokens: 22,195; active tokens: 11,551
  • Mean(total_added_liquidity): $4.99 \times 10^{13}$; Mean(total_removed_liquidity): $1.55 \times 10^{13}$
  • Mean(num_liquidity_adds): 1,485; Mean(num_liquidity_removes): 1,027
  • Mean(add_to_remove_ratio): $6.88 \times 10^{4}$ (heavy-tailed)
  • Inactive pools cluster at fewer removals (mean ≈ 13); 75% of inactive tokens last <1 day
  • Marked surge in rug-pull tokens in 2023–2024

4. SolRPDS-DeFi: Applications, Detection Algorithms, and Limitations

SolRPDS-DeFi supports multiple research vectors:

  • Online detection: uses the liquidity removal rate ($R_{\mathrm{remove}}$) and the inactivity period ($T_{\mathrm{inactivity}}$) as features. For example, a rule can flag a pool as "suspicious" when $R_{\mathrm{remove}} > \theta_1$ and $(\mathrm{current\_ts} - \mathrm{last\_remove\_ts}) < \theta_2$.
  • Machine learning: Features from SolRPDS enable classifiers such as Random Forest and AdaBoost to achieve ~97% accuracy for rug-pull identification.
  • Heuristic alarm systems: Threshold-based real-time surveillance
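
The threshold rule from the online-detection item could look like the sketch below. The function name and the example thresholds (`theta_1 = 0.8`, `theta_2` of one hour) are placeholders chosen for illustration; the source does not give values for $\theta_1$ or $\theta_2$.

```python
def should_alert(r_remove, current_ts, last_remove_ts, theta_1, theta_2):
    """Flag a pool when R_remove > theta_1 and the last removal is recent.

    Timestamps and theta_2 are in seconds in this sketch.
    """
    return r_remove > theta_1 and (current_ts - last_remove_ts) < theta_2


# Hypothetical pool: 95% of added liquidity removed 10 minutes ago.
alert = should_alert(
    r_remove=0.95,
    current_ts=1_700_000_600,
    last_remove_ts=1_700_000_000,
    theta_1=0.8,    # placeholder threshold on the removal rate
    theta_2=3600,   # placeholder recency window (1 hour)
)
```

The rule fires only while the removal is recent, so it suits streaming surveillance rather than retrospective labeling.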

Limitations include annotation uncertainty (benign inactivity confounded with attacks), focus on DEX liquidity (omitting OTC/cross-chain drains), and potential mislabeling due to cutoff-date semantics.

The dataset repository is available via https://github.com/DeFiLabX/SolRPDS under CC BY 4.0, provided in CSV and JSON, and compatible with Python (Pandas, scikit-learn), Spark, SQL engines, and cloud-based querying interfaces.

5. SolRPDS-Storage (APLS): Low-Latency Degraded Reads in RS-Coded Storage

In the domain of distributed storage, SolRPDS refers to the APLS (All Parallelism + Light-loaded Starter) solution for accelerating degraded reads in Reed-Solomon (RS)-coded storage systems (Xie et al., 2023). In RS(k, m) codes, each stripe spans k data plus m parity chunks; recovery from node unavailability (degraded read) typically requires contacting k remaining nodes and reconstructing the missing chunk—an operation that incurs higher latency than normal reads.

Traditional systems, including ECPipe, use agent nodes and pipelined decoding but remain bottlenecked by the need to transmit $k \cdot c$ bytes (for chunk size $c$) to a single receiver, so degraded reads still take 1.3–1.6$\times$ as long as normal reads, even under optimal conditions.

6. SolRPDS-Storage (APLS): Algorithmic Design, Formulation, and Performance

APLS addresses degraded-read latency by (1) engaging all surviving source nodes ($k \leq q \leq k+m-1$) as agents and (2) dynamically designating a light-loaded starter node with high spare bandwidth, rather than limiting the receiver to an existing source. The rebuilt chunk is divided into packets; each agent reconstructs $c/q$ bytes, assembling its portion using $k$ surviving packets per segment in a rotating assignment. Agents then transmit packets in parallel to the starter node, which assembles the full chunk for client delivery.
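
One way to picture the partitioning and rotation is the sketch below. The specific rotation rule used here (segment `i` is rebuilt by agent `i` from the packets of agents `i, i+1, ..., i+k-1` modulo `q`) is an assumption for illustration. The source says only that the assignment rotates, and the function names are hypothetical.

```python
def segment_sizes(c, q):
    """Split a chunk of c bytes into q near-equal segments (about c/q each)."""
    base, extra = divmod(c, q)
    return [base + 1 if i < extra else base for i in range(q)]


def rotating_sources(q, k):
    """For each segment, list the k agents whose packets are combined.

    Assumed rule: segment i is rebuilt by agent i from agents
    i, i+1, ..., i+k-1 (mod q), so agent i reads one packet locally.
    """
    if not 1 <= k <= q:
        raise ValueError("need 1 <= k <= q")
    return [[(seg + j) % q for j in range(k)] for seg in range(q)]


k, q = 10, 13                      # RS(10,4) with all k+m-1 survivors as agents
sizes = segment_sizes(64 * 2**20, q)
sources = rotating_sources(q, k)
# Number of segments each agent contributes a packet to.
load = [sum(agent in s for s in sources) for agent in range(q)]
```

Under this rule every agent contributes to exactly `k` segments, so the per-agent work is uniform. A scheme that always used the same `k` sources would leave the remaining `q - k` agents idle.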

Analytically, for agent count $q$ and per-node bandwidth $B$ (with a fraction $\theta_s$ allocable to degraded reads), the transfer times are:

  • Classic/ECPipe: $T_1(c) = c / (\theta_s B)$
  • APLS (starter not in sources): $T_2(c) = \max\{k \cdot c/(q\theta_s B),\, (k-1) \cdot c/(q\theta_s B)\} \approx k \cdot c/(q\theta_s B)$

With $q > k$, degraded-read latency satisfies $T_2(c) < T_1(c)$. For maximal agent use ($q = k+m-1$) and $\theta_s \sim 1$, $T_2(c)$ approaches $(k/(k+m-1)) \cdot c/B$, which is less than the direct (normal) read.
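
To make the $k/q$ scaling concrete, the two expressions can be evaluated numerically. This sketch only evaluates the formulas as stated and is not a measurement. The function names, the bandwidth of 1000 Mbps, $\theta_s = 1$, and the reading of 64 MB as $64 \times 10^6$ bytes are assumptions.

```python
def t_classic(c_bits, bandwidth_bps, theta_s):
    """T_1(c) = c / (theta_s * B)."""
    return c_bits / (theta_s * bandwidth_bps)


def t_apls(c_bits, k, q, bandwidth_bps, theta_s):
    """T_2(c) = max{k*c/(q*theta_s*B), (k-1)*c/(q*theta_s*B)}."""
    if q < k:
        raise ValueError("APLS needs at least k agents")
    per_segment = c_bits / (q * theta_s * bandwidth_bps)
    return max(k * per_segment, (k - 1) * per_segment)


k, m = 10, 4                      # RS(10,4)
c_bits = 64 * 10**6 * 8           # 64 MB chunk, taken as 64e6 bytes
bandwidth = 1000 * 10**6          # assumed 1000 Mbps per node
theta_s = 1.0                     # assumed: all bandwidth available

t1 = t_classic(c_bits, bandwidth, theta_s)
for q in range(k, k + m):         # q = k, ..., k+m-1
    t2 = t_apls(c_bits, k, q, bandwidth, theta_s)
    print(f"q={q}: T2/T1 = {t2 / t1:.3f}")
```

The ratio $T_2/T_1$ equals $k/q$ in this model, so for RS(10,4) it falls from 1 at $q = 10$ to $10/13$ at $q = 13$.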

Empirical evaluation on a 16-node Alibaba Cloud testbed (RS(10,4), chunk size 64 MB, helper bandwidth capped at 100–1500 Mbps) showed:

  • 6–25% latency reduction over ECPipe under heavy load (100 Mbps)
  • For small chunks (256 KB/4 MB), up to 28% lower latency than ECPipe at 200 Mbps
  • Increasing agent count $q$ from $k$ to $k+m-1$ yielded latency declines from 16% to 45%, consistent with the $k/q$ theoretical bound
  • APLS outperformed both single-agent and multi-agent ECPipe modes in nearly all scenarios

7. SolRPDS-Storage (APLS): Complexity, Trade-Offs, and Future Prospects

APLS retains overall network load of $k \cdot c$ per degraded read (same as classic), but distributes it for higher parallelism. Each agent executes $k$ finite-field operations per byte but only for its assigned $c/q$ bytes.
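
The unchanged total can be checked with a short accounting. This is an illustrative sanity check, not a derivation taken from the paper. It assumes that each of the $q$ agents already holds one of the $k$ packets it needs, fetches the other $k-1$ packets (each of size $c/q$) from peers, and forwards its rebuilt $c/q$ bytes to a starter that is not itself a source.

```latex
\underbrace{q \cdot (k-1) \cdot \frac{c}{q}}_{\text{packets fetched by agents}}
\;+\;
\underbrace{q \cdot \frac{c}{q}}_{\text{segments sent to starter}}
\;=\; (k-1)\,c + c \;=\; k \cdot c
```

Under the same assumptions each agent receives $(k-1) \cdot c/q$ bytes and sends $k \cdot c/q$ bytes, which correspond to the two terms inside the $\max$ in the latency expression for APLS.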

Trade-offs include:

  • Diminishing returns for high $q$ (more nodes, smaller $c/q$, greater packet overhead)
  • Potential for suboptimal starter allocation if load statistics are stale
  • Overhead from very small packets (inefficiency, protocol overhead)
  • Variability in $\theta_s$ across heterogeneous networks may call for weighted (non-uniform) data partitioning
  • Batch processing and integration with regenerating codes or locally repairable codes (LRC) may further improve degraded read efficiency

APLS substantially closes the degraded-read performance gap, delivering degraded reads at nearly the cost of normal reads and surpassing prior state-of-the-art ECPipe by up to 28% under practical workloads (Xie et al., 2023).

