BlockFound: Customized blockchain foundation model for anomaly detection (2410.04039v3)

Published 5 Oct 2024 in cs.CR and cs.AI

Abstract: We propose BlockFound, a customized foundation model for anomaly blockchain transaction detection. Unlike existing methods that rely on rule-based systems or directly apply off-the-shelf LLMs, BlockFound introduces a series of customized designs to model the unique data structure of blockchain transactions. First, a blockchain transaction is multi-modal, containing blockchain-specific tokens, texts, and numbers. We design a modularized tokenizer to handle these multi-modal inputs, balancing the information across different modalities. Second, we design a customized mask language learning mechanism for pretraining with RoPE embedding and FlashAttention for handling longer sequences. After training the foundation model, we further design a novel detection method for anomaly detection. Extensive evaluations on Ethereum and Solana transactions demonstrate BlockFound's exceptional capability in anomaly detection while maintaining a low false positive rate. Remarkably, BlockFound is the only method that successfully detects anomalous transactions on Solana with high accuracy, whereas all other approaches achieved very low or zero detection recall scores. This work not only provides new foundation models for blockchain but also sets a new benchmark for applying LLMs in blockchain data.

Summary

The paper introduces a novel blockchain foundation model that leverages a modularized tokenizer to accurately parse multi-modal transaction data.
The paper demonstrates customized pretraining using RoPE and FlashAttention to enhance anomaly detection in complex blockchain environments.
The paper shows superior performance on Ethereum and Solana by achieving high precision and reduced false positive rates in detecting irregular transactions.

Customized Blockchain Foundation Model for Anomaly Detection

The paper introduces a novel approach, coined as a customized blockchain foundation model specifically tailored for anomaly detection in blockchain transactions. This model aims to improve upon existing methods that primarily rely on rule-based systems or generic LLMs, offering a more sophisticated mechanism to model the complex and unique data structures of blockchain transactions.

Key Contributions

The proposed model differentiates itself through several design innovations aimed at handling the multi-modal nature of blockchain transactions, which encompass tokens, texts, and numeric data specific to blockchain environments. Important contributions include:

Modularized Tokenizer: The model employs a specially designed tokenizer tailored to manage the diverse modalities in transaction data. This modular design balances the informational content by appropriately tokenizing blockchain-specific elements, such as address signatures and numeric transaction values, which are often inaccurately handled by traditional LLM tokenizers.
Customized Pretraining Techniques: The foundation model integrates a mask language learning mechanism with Rotary Position Embeddings (RoPE) and FlashAttention, a strategy that effectively manages longer sequence inputs typical in blockchain data. This pretraining is guided by a BERT-like architecture, deviating from the GPT-based approaches used in prior work, thus optimizing computational efficiency and improving model training.
Anomaly Detection Methodology: The paper introduces a novel method for post-training anomaly detection, which uses reconstruction errors from the foundation model to identify irregular transactions. This metric effectively highlights transactions that deviate from the learned normal patterns, demonstrating superior detection capabilities.

Evaluation and Results

The model's effectiveness was rigorously tested on Ethereum and Solana transaction datasets. Key findings from these evaluations include:

The model achieved exceptional accuracy in detecting anomalous transactions while maintaining a low false positive rate. Notably, it was the only method that successfully identified anomalous transactions on Solana with high precision, whereas alternative approaches yielded poor recall scores.
Comparative analysis with other methods like BlockGPT and heuristic-based models revealed that this customized model significantly outperformed its peers in both precision and detection accuracy.
Extensive ablation studies validated the importance of the core components, particularly the tokenizer and the pretraining modifications, supporting the model's design choices.

Implications

This work sets a new benchmark for applying foundation models to blockchain data, showcasing significant advancements in anomaly detection capabilities. The practical implications of this model are profound: it offers a robust tool for real-time monitoring of DeFi applications, thereby enhancing security measures critical to the financial landscape of blockchain systems.

From a theoretical perspective, the research advances understanding in the application of transformer models to non-traditional data formats, such as those found in blockchain transactions. It encourages further exploration of customized LLMs tailored to specific domains, leveraging modular design and targeted pretraining schemas.

Future Directions

The promising results point towards several future research avenues. One key area is incorporating explainability into the model to ensure transparent anomaly detection, crucial for user trust and regulatory compliance in financial contexts. Additionally, refining the integration of commercial LLMs with specialized tokenization strategies and prompt engineering could enhance detection capabilities, suggesting a hybrid approach that benefits from both pre-trained LLM insights and domain-specific adjustments.

In conclusion, the proposed foundation model for blockchain anomaly detection marks a significant contribution to both practical application and theoretical understanding of AI in blockchain technology, potentially influencing future developments in secure, decentralized financial services.