NFT1000: A Cross-Modal Dataset for Non-Fungible Token Retrieval (2402.16872v2)

Published 29 Jan 2024 in cs.IR

Abstract: With the rise of "Metaverse" and "Web 3.0", Non-Fungible Token (NFT) has emerged as a kind of pivotal digital asset, garnering significant attention. By the end of March 2024, more than 1.7 billion NFTs have been minted across various blockchain platforms. To effectively locate a desired NFT, conducting searches within a vast array of NFTs is essential. The challenge in NFT retrieval is heightened due to the high degree of similarity among different NFTs, regarding regional and semantic aspects. In this paper, we will introduce a benchmark dataset named "NFT Top1000 Visual-Text Dataset" (NFT1000), containing 7.56 million image-text pairs, and being collected from 1000 most famous PFP NFT collections by sales volume on the Ethereum blockchain. Based on this dataset and leveraging the CLIP series of pre-trained models as our foundation, we propose the dynamic masking fine-tuning scheme. This innovative approach results in a 7.4% improvement in the top1 accuracy rate, while utilizing merely 13% of the total training data (0.79 million vs. 6.1 million). We also propose a robust metric Comprehensive Variance Index (CVI) to assess the similarity and retrieval difficulty of visual-text pairs data. The dataset will be released as an open-source resource. For more details, please refer to: https://github.com/ShuxunoO/NFT-Net.git.

Summary

The paper introduces NFT1000, the first NFT-centric visual-text dataset designed for effective retrieval of high-similarity NFTs.
It outlines a novel retrieval task and benchmarks various CLIP models to establish baseline performance metrics.
The study presents the Comprehensive Variance Index, a robust metric that closely correlates with retrieval difficulty in NFT datasets.

Overview of "NFT1000: A VISUAL-TEXT DATASET FOR NON-FUNGIBLE TOKEN RETRIEVAL"

The paper introduces the NFT1000 dataset, a contribution at the intersection of blockchain and computer vision that focuses on the retrieval of Non-Fungible Tokens (NFTs). With the proliferation of NFTs in the context of "Metaverse" and "Web 3.0", the need for efficient retrieval methods has grown. By November 2023, more than 1.4 billion NFTs had been minted. Because many of these tokens closely resemble one another, both regionally and semantically, retrieving a specific one efficiently and precisely is a challenge for academia and industry alike.

NFT1000 Dataset

The NFT1000 dataset encompasses 7.56 million image-text pairs, derived from the top 1000 NFT collections by sales volume on the Ethereum blockchain. Each collection represents an NFT project compliant with the ERC-721 standard, averaging 6600 image-text pairs per collection. This results in a total data volume of 1.75TB, suitable for various downstream tasks like retrieval, generation, and visual question answering in the NFT domain.

Contributions

The primary contributions of the paper are as follows:

Construction of the NFT1000 Dataset: This is the first NFT-centric visual-text dataset in the computer vision domain.
Introduction of a Retrieval Task: The paper proposes a task that focuses on the retrieval of high-similarity image-text pairs, relevant in the context of large-scale NFT datasets.
Benchmark Testing: The authors evaluate several CLIP (Contrastive Language-Image Pretraining) models to provide baseline performance metrics.
Comprehensive Variance Index (CVI): The development of CVI offers a robust metric to assess the similarity and retrieval difficulty of visual-text pairs.

Data Characteristics and Processing

The inherent structure of NFTs involves metadata files that describe the attributes of each token. This dataset standardizes these attributes into image-caption pairs, facilitating machine learning applications. In terms of preprocessing, static images are converted to PNG format while dynamic media like GIFs and MP4s are represented by a single frame.
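The attribute-to-caption step can be illustrated with a minimal sketch. It assumes the metadata follows the widely used marketplace convention of an `attributes` list of `trait_type`/`value` entries; the caption template, the function name `metadata_to_caption`, and the example record are hypothetical and are not taken from the paper.

```python
import json


def metadata_to_caption(metadata: dict) -> str:
    """Flatten one NFT metadata record into a single caption string.

    Assumption: attributes are stored as a list of
    {"trait_type": ..., "value": ...} entries. The output template
    is illustrative only.
    """
    name = str(metadata.get("name", "an NFT"))
    traits = []
    for attr in metadata.get("attributes", []):
        trait_type = attr.get("trait_type")
        value = attr.get("value")
        if trait_type is None or value is None:
            continue  # skip malformed attribute entries
        traits.append(f"{trait_type}: {value}")
    if not traits:
        return name
    return f"{name}, " + ", ".join(traits)


# Hypothetical metadata record, for illustration only.
example = json.loads("""
{
  "name": "Example Ape #1",
  "attributes": [
    {"trait_type": "Background", "value": "Blue"},
    {"trait_type": "Fur", "value": "Golden"}
  ]
}
""")

print(metadata_to_caption(example))
# Example Ape #1, Background: Blue, Fur: Golden
```

Because tokens in one collection usually share the same trait types and differ only in a few values, captions produced this way are nearly identical across tokens, which is the high-similarity property the dataset is built around.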

To avoid data leakage during model training and testing, the dataset is divided into training, validation, and test sets based on entire NFT projects rather than random image samples.
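A project-level split of this kind might be sketched as follows. The split fractions, the seed, and the `(project_id, token_id)` sample layout are assumptions for illustration; the summary does not state the actual ratios.

```python
import random


def split_projects(project_ids, val_frac=0.1, test_frac=0.1, seed=0):
    """Assign every project (collection) to exactly one split.

    Splitting at the project level means all tokens of a collection
    land in the same split, so near-duplicate tokens cannot leak
    from training into evaluation.
    """
    projects = sorted(set(project_ids))  # sort so the shuffle is reproducible
    random.Random(seed).shuffle(projects)
    n_test = int(len(projects) * test_frac)
    n_val = int(len(projects) * val_frac)
    split_of = {}
    for idx, project in enumerate(projects):
        if idx < n_test:
            split_of[project] = "test"
        elif idx < n_test + n_val:
            split_of[project] = "val"
        else:
            split_of[project] = "train"
    return split_of


# Hypothetical samples: 10 projects with 3 tokens each.
samples = [(f"project_{p}", t) for p in range(10) for t in range(3)]
split_of = split_projects(p for p, _ in samples)

by_split = {"train": [], "val": [], "test": []}
for project, token in samples:
    by_split[split_of[project]].append((project, token))

print({name: len(items) for name, items in by_split.items()})
```

A random split over individual samples would instead scatter tokens of one collection across all three sets, and the model could score well on the test set simply by having seen sibling tokens during training.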

Experimental Evaluation

The NFT1000 dataset's validity is assessed using zero-shot inference and fine-tuning on various CLIP models, including OpenAI's CLIP-ViT variants, Meta's META-CLIP, and BAAI's EVA-CLIP02. The experiments indicate that the dataset's distribution differs from that of the data these models were trained on.
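To make the retrieval metric concrete, here is a minimal sketch of top-k accuracy for text-to-image retrieval over precomputed embeddings. It assumes the text at index i is paired with the image at index i and that no embedding is the zero vector; the toy vectors are invented and do not come from any CLIP model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k_accuracy(text_embs, image_embs, k=1):
    """Fraction of text queries whose paired image ranks in the top k."""
    hits = 0
    for i, text in enumerate(text_embs):
        scores = [cosine(text, image) for image in image_embs]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy embeddings: items 0 and 2 are near-duplicates, item 1 is distinct.
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
image_embs = [[1.0, 0.05], [0.0, 1.0], [1.0, 0.0]]

print(top_k_accuracy(text_embs, image_embs, k=1))
print(top_k_accuracy(text_embs, image_embs, k=2))
```

In this toy batch the two near-duplicate items are confused with each other at k=1 but recovered at k=2, which mirrors why retrieval among highly similar NFTs is harder than retrieval over visually diverse data.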

Comprehensive Variance Index (CVI)

The Comprehensive Variance Index is proposed as a metric for evaluating the similarity within batches of image-text pairs. The CVI is based on the variance of the cosine similarity distributions of feature vectors, considering both intra-modal (image-image, text-text) and inter-modal (image-text) similarities. Empirical results show a high correlation between CVI and retrieval difficulty, validating its effectiveness.
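The ingredients named above can be sketched as follows. This summary does not give the formula that combines them, so the sketch only computes the variance of each cosine-similarity distribution (image-image, text-text, image-text) and does not reproduce the CVI itself. Excluding self-similarities from the intra-modal distributions is an assumption, as are the function names and toy vectors.

```python
import math
from statistics import pvariance


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def pairwise_similarities(xs, ys, skip_diagonal=False):
    """All cosine similarities between xs and ys, optionally without i == j."""
    return [
        cosine(x, y)
        for i, x in enumerate(xs)
        for j, y in enumerate(ys)
        if not (skip_diagonal and i == j)
    ]


def similarity_variances(image_embs, text_embs):
    """Variance of each similarity distribution within one batch.

    Assumption: self-similarities are dropped for the intra-modal
    cases because they are always exactly 1.
    """
    return {
        "image-image": pvariance(pairwise_similarities(image_embs, image_embs, True)),
        "text-text": pvariance(pairwise_similarities(text_embs, text_embs, True)),
        "image-text": pvariance(pairwise_similarities(image_embs, text_embs)),
    }


# Toy batches: one of near-duplicates, one of clearly different items.
near_duplicates = [[1.0, 0.0], [1.0, 0.01], [1.0, 0.02]]
diverse = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(similarity_variances(near_duplicates, near_duplicates))
print(similarity_variances(diverse, diverse))
```

A batch of near-duplicates yields tightly clustered similarities and therefore a small variance, while a diverse batch spreads them out. This is the kind of signal a variance-based index can use to distinguish easy batches from hard ones.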

Implications and Future Work

The NFT1000 dataset and the associated tasks set a new benchmark for cross-modal retrieval in the burgeoning field of NFTs. Practically, this dataset can enhance AI-driven search and retrieval systems in blockchain environments. Theoretically, it bridges a gap in the computer vision domain by presenting a unique challenge of high-similarity data.

Future work includes:

Data Optimization: Removing redundant data to enhance model efficiency and generalization.
Dataset Expansion: Extending beyond Ethereum to include NFTs from other blockchains like Solana and Polygon, aiming to construct a dataset with hundreds of millions of pairs.
Generative Models: Exploring the generative potential aligned with the NFT1000 dataset to create diverse NFT artworks.

Conclusion

The construction of the NFT1000 dataset marks a significant development in NFT retrieval research. By addressing the high-similarity challenge inherent to NFTs, the authors provide a dataset that is poised to advance the capabilities of AI in the blockchain domain. The introduction of the Comprehensive Variance Index further broadens the methodological toolkit available for cross-modal retrieval tasks. Future efforts will focus on expanding and refining this dataset to maintain its relevance and utility in ongoing research and industry applications.
