AndroidCode Dataset
- AndroidCode is a richly annotated dataset aggregating over 5 million Android APKs from multiple global app stores.
- It uses custom web crawlers and unique SHA-256 identifiers to ensure data integrity and provide a continuously refreshed snapshot of the Android app ecosystem.
- The dataset supports diverse research applications including malware detection, static/dynamic analysis, and longitudinal software evolution studies.
AndroidCode, also known as AndroZoo++, is the largest and most richly annotated research corpus of Android application packages (APKs) and associated metadata for empirical, reproducible research on the Android ecosystem. It is designed as research infrastructure for large-scale analysis in security, data mining, and software engineering, and it provides a continuously updated snapshot of the Android app landscape that spans multiple app marketplaces and supports a wide range of analytic workflows. The term "AndroidCode" does not appear in major survey literature as of 2018 (Geiger et al., 2018); in the primary dataset documentation the label is synonymous with AndroZoo++ (Li et al., 2017).
1. Collection Scope and Metadata Dimensions
AndroZoo++ aggregates more than 5 million unique APKs, continuously crawling and archiving applications from approximately ten global app stores including Google Play, F-Droid, Anzhi, and AppChina. The acquisition strategy employs custom web crawlers per market, deduplicating against market-native identifiers and storing each APK with a cryptographic SHA-256 key. The dataset is organized along six major metadata groupings (M₁–M₆), covering over 30 subfields per application:
- M₁ (APK): SHA256, SHA1, MD5, APK byte size, source market, signing certificate
- M₂ (Manifest): application ID, version code, target API, declared permissions, supported features
- M₃ (DEX): DEX size, last modification, flags for native code, cryptography, dynamic loading, reflection, and a full class list
- M₄ (Releasing): Play Store category, author, rating, install bucket, updating date, contact descriptors
- M₅ (Security): VirusTotal scan reports, AndroBugs security scan outputs
- M₆ (Miscellanea): Piggyback pairs, a catalog of 1,113 common libraries, 240 ad libraries, app lineage traces
These metadata facilitate research into app provenance, behavioral analysis, fast malware triage, library identification, and longitudinal software evolution.
2. Data Acquisition Methodology
Crawling and archival are engineered for breadth, freshness, and compliance with each market's protocol. For Google Play, the platform reverse-engineers private APIs, spreads requests over multiple accounts, and distributes them across globally located hosts to stay within rate limits. Every APK is stored under its unique SHA-256 digest, which makes the archive immutable and free of duplicate files.
The releasing and manifest metadata (M₂, M₄) are refreshed nightly so that they reflect the current store-side state, while archived APKs are never overwritten. Crawlers enumerate, deduplicate, and acquire only free applications, and drawing on several markets broadens ecosystem coverage.
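The digest-keyed, write-once archiving described above can be illustrated with a minimal sketch. The `ApkArchive` class, its directory layout, and the `.apk` file naming are assumptions made for illustration only; they are not taken from the AndroZoo++ implementation.

```python
import hashlib
from pathlib import Path


def sha256_of(data: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


class ApkArchive:
    """Hypothetical write-once store that keys each APK by its SHA-256 digest."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def path_for(self, digest: str) -> Path:
        # Assumed layout: one file per digest directly under the root.
        return self.root / f"{digest}.apk"

    def add(self, data: bytes) -> tuple[str, bool]:
        """Store data if unseen; return (digest, True if newly written)."""
        digest = sha256_of(data)
        target = self.path_for(digest)
        if target.exists():
            # Same content already archived: never overwrite.
            return digest, False
        target.write_bytes(data)
        return digest, True
```

Because the file name is derived from the content, the same APK fetched from two markets maps to one archive entry, and a later crawl cannot silently replace an existing file.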
3. Access Mechanics, Data Schema, and Licensing
The repository is accessible at https://androzoo.uni.lu. Data is available for academic research on a non-commercial, non-redistributable basis, subject to registration with faculty sponsor endorsement and compliance with copyright restrictions.
- Metadata Tables: Delivered as bulk CSV/JSON files, structured with fields such as sha256, apk_size, permission_list (semicolon-delimited), class_list (newline-separated), vt_report (JSON), etc.
- APKs: Downloaded individually by SHA256 key over HTTP.
- REST API: Supports JSON queries for per-app metadata and APK streaming.
- Licensing: Grants research-only usage, prohibits direct APK redistribution, and requires credit to “AndroZoo++ (University of Luxembourg).” Community contributions (e.g., parsers for new security scan outputs) are encouraged under Apache 2.0–compatible terms via contributor license agreements.
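As a rough illustration of the metadata schema listed above, the following sketch parses one CSV row with a semicolon-delimited `permission_list`, a newline-separated `class_list`, and a JSON `vt_report`. The column order, the sample values, and the `positives`/`total` keys inside the report are assumptions; the actual export may differ.

```python
import csv
import io
import json


def build_sample() -> str:
    """Build a one-row CSV sample using the field names from the schema description."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["sha256", "apk_size", "permission_list", "class_list", "vt_report"])
    writer.writerow([
        "ab" * 32,  # placeholder 64-character hex digest
        "1048576",
        "android.permission.INTERNET;android.permission.CAMERA",
        "com/example/Main\ncom/example/Util",
        json.dumps({"positives": 2, "total": 60}),  # assumed report shape
    ])
    return buf.getvalue()


def parse_row(row: dict) -> dict:
    """Convert the raw string fields of one CSV row into typed values."""
    return {
        "sha256": row["sha256"].lower(),
        "apk_size": int(row["apk_size"]),
        "permissions": [p for p in row["permission_list"].split(";") if p],
        "classes": [c for c in row["class_list"].splitlines() if c],
        "vt_report": json.loads(row["vt_report"]) if row["vt_report"] else None,
    }


def load_metadata(text: str) -> list[dict]:
    """Parse CSV text into a list of typed records."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return [parse_row(r) for r in reader]
```

Calling `load_metadata(build_sample())` yields a single record whose `permissions` and `classes` are Python lists and whose `vt_report` is a dictionary.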
4. Dataset Statistics and Empirical Properties
Empirical properties of AndroZoo++ include the following:
- App Count and Growth: Over 5,000,000 APKs as of the latest snapshot, with tens of thousands of new entries added monthly.
- Malware Prevalence: ≈4% ($|Mal| / |A| \approx 0.04$), where |Mal| denotes the number of APKs with ≥1 VirusTotal detection and |A| the total number of APKs.
- Family Structure: 75,963 app families (grouped by application ID) with a median family size of 3 and a heavy-tailed distribution; many families have 100+ versions.
- Library Usage: >60% of DEX classes in typical apps originate from known libraries in the curated catalogs.
Table summarizing select metrics:
| Property | Value/Formula | Comment |
|---|---|---|
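| Total Apps | $\lvert A \rvert > 5{,}000{,}000$ | As of the latest snapshot |
| Malware Prevalence | $\lvert Mal \rvert / \lvert A \rvert \approx 4\%$ | Based on VirusTotal multi-engine detection |
| App Families | $\lvert F \rvert = 75{,}963$ | Grouped by application ID |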
| Library Class Ratio | >60% | Fraction of classes from known libraries |
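The metrics above are simple ratios and group counts. The sketch below shows one way they could be computed from per-APK records; the function names, input shapes, and toy values are illustrative assumptions and do not reproduce the dataset's own tooling or figures.

```python
from collections import Counter
from statistics import median


def malware_prevalence(detections: dict[str, int]) -> float:
    """Fraction of APKs with at least one VirusTotal detection (|Mal| / |A|)."""
    if not detections:
        return 0.0
    flagged = sum(1 for count in detections.values() if count >= 1)
    return flagged / len(detections)


def family_sizes(app_ids: dict[str, str]) -> Counter:
    """Count APKs per application ID; each application ID is one family."""
    return Counter(app_ids.values())


def library_class_ratio(classes: list[str], library_prefixes: tuple[str, ...]) -> float:
    """Fraction of classes whose name starts with a known library prefix."""
    if not classes:
        return 0.0
    from_libs = sum(1 for c in classes if c.startswith(library_prefixes))
    return from_libs / len(classes)


# Toy inputs keyed by (shortened, made-up) SHA-256 identifiers.
detections = {"a": 0, "b": 3, "c": 0, "d": 0, "e": 1}
app_ids = {"a": "com.x", "b": "com.x", "c": "com.x", "d": "com.y", "e": "com.z"}

prevalence = malware_prevalence(detections)
sizes = family_sizes(app_ids)
median_size = median(sizes.values())
ratio = library_class_ratio(
    ["com/google/ads/A", "com/google/ads/B", "com/example/Main"],
    ("com/google/ads/",),
)
```

On the full corpus, `detections` would come from the M₅ VirusTotal reports and `library_prefixes` from the M₆ library catalogs.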
5. Maintenance, Extensibility, and Community Integration
AndroZoo++ supports extensibility by integrating new metadata fields and analytic outputs based on community demand. Requests for new metadata are triaged, implemented as extraction scripts, and incorporated into the nightly processing pipeline. Community pull requests for new parsers or scan reports are accepted, with CLA requirements to ensure licensing clarity. This design ensures the dataset remains a contemporary benchmark and reference set for the research community.
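One plausible way to organize such extraction scripts is a registry that maps each metadata field to a function, which the nightly job then applies to every app. The sketch below is hypothetical: the `register` decorator, the two example extractors, and the input dictionary keys are invented for illustration and do not describe the actual AndroZoo++ pipeline.

```python
from typing import Callable

# Maps a metadata field name to the function that computes it (hypothetical design).
Extractor = Callable[[dict], object]
EXTRACTORS: dict[str, Extractor] = {}


def register(field: str) -> Callable[[Extractor], Extractor]:
    """Decorator that adds an extraction function to the registry under `field`."""
    def decorator(fn: Extractor) -> Extractor:
        if field in EXTRACTORS:
            raise ValueError(f"extractor for {field!r} already registered")
        EXTRACTORS[field] = fn
        return fn
    return decorator


@register("uses_reflection")
def uses_reflection(app: dict) -> bool:
    """Flag apps that reference classes from the Java reflection package."""
    return any(
        c.startswith("java/lang/reflect/")
        for c in app.get("referenced_classes", [])
    )


@register("permission_count")
def permission_count(app: dict) -> int:
    """Number of permissions declared in the manifest."""
    return len(app.get("permissions", []))


def run_pipeline(apps: list[dict]) -> list[dict]:
    """Apply every registered extractor to every app, keyed by SHA-256."""
    results = []
    for app in apps:
        row = {"sha256": app["sha256"]}
        for field, fn in EXTRACTORS.items():
            row[field] = fn(app)
        results.append(row)
    return results
```

With this layout, a contributed parser only needs to add one decorated function; the processing loop itself does not change.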
6. Research Applications and Supported Workflows
The dataset underpins a diverse range of research use-cases:
- Malware Detection and Characterization: Leveraging VirusTotal (M₅) and library information (M₆) for family detection, payload analysis, and repackaging detection.
- Static and Dynamic Analysis: ICC- and reflection-aware analyses, parameter mining, API recommendation, and code similarity studies.
- Software Evolution: Longitudinal analysis via app lineages, market-driven adoption curves, and family versioning histories.
- Product-Line Mining: Clustering and analysis of code variants and application “products” using SimiDroid and similar techniques.
Supported research continues to expand as new metadata fields and analytic modules are integrated.
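To make the longitudinal workflow concrete, the sketch below reconstructs app lineages by grouping APK records on application ID and ordering them by version code, then diffs the permissions of two consecutive versions. The `ApkRecord` type and both helper functions are illustrative assumptions, not part of the dataset's tooling.

```python
from collections import defaultdict
from typing import NamedTuple


class ApkRecord(NamedTuple):
    """Minimal per-APK record (field names assumed from the M₁/M₂ groupings)."""
    sha256: str
    app_id: str
    version_code: int


def build_lineages(records: list[ApkRecord]) -> dict[str, list[ApkRecord]]:
    """Group records by application ID and sort each group by version code."""
    by_app: dict[str, list[ApkRecord]] = defaultdict(list)
    for record in records:
        by_app[record.app_id].append(record)
    return {
        app_id: sorted(versions, key=lambda r: r.version_code)
        for app_id, versions in by_app.items()
    }


def permission_delta(old: set[str], new: set[str]) -> tuple[set[str], set[str]]:
    """Return (added, removed) permissions between two consecutive versions."""
    return new - old, old - new


records = [
    ApkRecord("c3", "com.example.app", 12),
    ApkRecord("a1", "com.example.app", 3),
    ApkRecord("b2", "com.example.app", 7),
    ApkRecord("d4", "org.other.tool", 1),
]
lineages = build_lineages(records)
ordered = [r.sha256 for r in lineages["com.example.app"]]
added, removed = permission_delta({"INTERNET", "CAMERA"}, {"INTERNET", "READ_CONTACTS"})
```

Walking each lineage pairwise with `permission_delta` gives a per-release change log that evolution studies can aggregate.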
7. Comparative Perspective and Nomenclature
No dataset named “AndroidCode” appears among the 31 Android dataset collections surveyed by Geiger & Malavolta (Geiger et al., 2018). Since 2017, the label “AndroidCode” has been used synonymously with AndroZoo++ (Li et al., 2017). Researchers who need specific dataset features (commit-history granularity, source-binary alignment, or domain-specific analytic tags) are directed to AndroZoo++ or to other surveyed collections whose documented scope matches their use case.
AndroZoo++, under the “AndroidCode” label, remains the principal corpus for Android app research in both scale and structural richness, enabling reproducible evaluation, method benchmarking, and advanced program analysis across security and software engineering disciplines.