AndroidCode Dataset

Updated 17 February 2026
  • AndroidCode is a richly annotated dataset aggregating over 5 million Android APKs from multiple global app stores.
  • It uses custom web crawlers and unique SHA-256 identifiers to ensure data integrity and provide a continuously refreshed snapshot of the Android app ecosystem.
  • The dataset supports diverse research applications including malware detection, static/dynamic analysis, and longitudinal software evolution studies.

AndroidCode, also known as AndroZoo++, constitutes the largest and most richly annotated research corpus of Android application packages (APKs) and associated metadata for empirical and reproducible research across the Android ecosystem. Designed as a research infrastructure to foster large-scale analysis in security, data mining, and software engineering, the dataset provides a comprehensive, continuously updated snapshot of the Android app landscape, spanning multiple app marketplaces and supporting a wide spectrum of analytic workflows. While the term "AndroidCode" does not appear in major survey literature as of 2018 (Geiger et al., 2018), AndroZoo++ is synonymous with the "AndroidCode" label in the primary dataset documentation (Li et al., 2017).

1. Collection Scope and Metadata Dimensions

AndroZoo++ aggregates more than 5 million unique APKs, continuously crawling and archiving applications from approximately ten global app stores including Google Play, F-Droid, Anzhi, and AppChina. The acquisition strategy employs custom web crawlers per market, deduplicating against market-native identifiers and storing each APK with a cryptographic SHA-256 key. The dataset is organized along six major metadata groupings (M₁–M₆), covering over 30 subfields per application:

  • M₁ (APK): SHA256, SHA1, MD5, APK byte size, source market, signing certificate
  • M₂ (Manifest): application ID, version code, target API, declared permissions, supported features
  • M₃ (DEX): DEX size, last modification, flags for native code, cryptography, dynamic loading, reflection, and a full class list
  • M₄ (Releasing): Play Store category, author, rating, install bucket, update date, contact descriptors
  • M₅ (Security): VirusTotal scan reports, AndroBugs security scan outputs
  • M₆ (Miscellanea): Piggyback pairs, a catalog of 1,113 common libraries, 240 ad libraries, app lineage traces

These metadata facilitate research into app provenance, behavioral analysis, fast malware triage, library identification, and longitudinal software evolution.

2. Data Acquisition Methodology

Crawling and archival are engineered for breadth, freshness, and compliance with each market's protocol. For Google Play, the platform reverse-engineers private APIs, uses multiple accounts, and distributes requests across globally located hosts so that per-account and per-host rate limits are respected. Every archived APK is keyed by its SHA-256 digest, so an entry is immutable once stored and identical files are archived only once.

Metadata refreshes occur nightly for the manifest and releasing fields (M₂, M₄), so that these fields reflect the current store-side state, while archived APKs are never overwritten. Crawlers enumerate, deduplicate, and acquire only free applications, and drawing on multiple markets maximizes ecosystem coverage.
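The digest-keyed, never-overwrite archiving described above can be sketched as a small content-addressed store. This is a minimal illustration, not the AndroZoo++ implementation: the function names and the two-character directory sharding are assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest used as the archive key."""
    return hashlib.sha256(data).hexdigest()


def archive_apk(store: Path, data: bytes) -> tuple[str, bool]:
    """Store `data` under its digest and return (digest, newly_written).

    An existing entry is never overwritten, so identical bytes fetched
    from two different markets occupy a single file.
    """
    digest = sha256_of(data)
    # Hypothetical layout: shard by the first two hex characters.
    target = store / digest[:2] / f"{digest}.apk"
    if target.exists():
        return digest, False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return digest, True


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    first = archive_apk(store, b"fake apk bytes")
    second = archive_apk(store, b"fake apk bytes")  # same content, other market
    third = archive_apk(store, b"different bytes")
    stored_files = sorted(p.name for p in store.rglob("*.apk"))
```

Because the key is derived from the content itself, deduplication needs no cross-market identifier matching: the second call finds the file already present and skips the write.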

3. Access Mechanics, Data Schema, and Licensing

The repository is accessible at https://androzoo.uni.lu. Data is available for academic research on a non-commercial, non-redistributable basis, subject to registration with faculty sponsor endorsement and compliance with copyright restrictions.

  • Metadata Tables: Delivered as bulk CSV/JSON files, structured with fields such as sha256, apk_size, permission_list (semicolon-delimited), class_list (newline-separated), vt_report (JSON), etc.
  • APKs: Downloaded individually by SHA256 key over HTTP.
  • REST API: Supports JSON queries for per-app metadata and APK streaming.
  • Licensing: Grants research-only usage, prohibits direct APK redistribution, and requires credit to “AndroZoo++ (University of Luxembourg).” Community contributions (e.g., parsers for new security scan outputs) are encouraged under Apache 2.0–compatible terms via contributor license agreements.
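A short sketch of how a bulk metadata CSV with the field conventions listed above might be parsed. The sample rows are invented, and the `positives` key inside `vt_report` is an assumption made for illustration; the actual column set and report structure should be taken from the dataset documentation.

```python
import csv
import io
import json

# Invented sample in the described layout: semicolon-delimited permissions
# and a JSON-encoded VirusTotal report (hypothetical "positives" key).
SAMPLE_CSV = '''sha256,apk_size,permission_list,vt_report
aaa111,1048576,android.permission.INTERNET;android.permission.CAMERA,"{""positives"": 3}"
bbb222,2048,,"{""positives"": 0}"
'''


def parse_row(row: dict) -> dict:
    """Convert one raw CSV row into typed Python values."""
    perms = row["permission_list"]
    return {
        "sha256": row["sha256"],
        "apk_size": int(row["apk_size"]),
        # An empty field means no declared permissions, not [""].
        "permissions": perms.split(";") if perms else [],
        "vt_report": json.loads(row["vt_report"]),
    }


records = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
```

A newline-separated `class_list` field would be handled the same way with `splitlines()`, provided the CSV writer quoted the field.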

4. Dataset Statistics and Empirical Properties

Empirical properties of AndroZoo++ include the following:

  • App Count and Growth: Over 5,000,000 APKs as of the latest snapshot, with tens of thousands of new entries added monthly.
  • Malware Prevalence: ≈4%, i.e. \( p_{\mathrm{malware}} = \frac{|Mal|}{|A|} \approx 0.04 \), where |A| is the number of archived APKs and |Mal| the number of APKs with ≥1 VirusTotal detection.
  • Family Structure: 75,963 app families (grouped by application ID), median family size 3, with heavy-tailed distribution—many families have 100+ versions.
  • Library Usage: >60% of DEX classes in typical apps originate from known libraries in the curated catalogs.
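The prevalence ratio and family-size statistics above can be computed from per-APK records as in the following sketch. The toy records are invented and far too small to reproduce the reported figures; they only show the arithmetic, using the "≥1 VirusTotal detection" criterion and grouping by application ID as stated.

```python
from collections import Counter
from statistics import median

# Toy records: (application ID, number of VirusTotal detections).
apps = [
    ("com.example.a", 0), ("com.example.a", 0), ("com.example.a", 2),
    ("com.example.b", 0),
    ("com.example.c", 0), ("com.example.c", 0), ("com.example.c", 0),
    ("com.example.c", 0), ("com.example.c", 1),
    ("com.example.d", 0),
]


def malware_prevalence(records):
    """p_malware = |Mal| / |A|, with Mal = APKs having >= 1 detection."""
    flagged = sum(1 for _, detections in records if detections >= 1)
    return flagged / len(records)


def family_sizes(records):
    """Number of archived versions per application ID (one family each)."""
    return Counter(app_id for app_id, _ in records)


p_malware = malware_prevalence(apps)
sizes = family_sizes(apps)
median_size = median(sizes.values())
```

On this toy input, two of ten APKs are flagged and the four families have sizes 3, 1, 5, and 1.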

Table summarizing select metrics:

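| Property | Value/Formula | Comment |
|---|---|---|
| Total Apps | \( \lvert A \rvert > 5{,}000{,}000 \) | As of the latest snapshot |
| Malware Prevalence | \( p_{\mathrm{malware}} \approx 0.04 \) | Based on VirusTotal multi-engine detection |
| App Families | \( \lvert F \rvert = 75{,}963 \) | Grouped by application ID |
| Library Class Ratio | >60% | Fraction of classes from known libraries |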

5. Maintenance, Extensibility, and Community Integration

AndroZoo++ supports extensibility by integrating new metadata fields and analytic outputs based on community demand. Requests for new metadata are triaged, implemented as extraction scripts, and incorporated into the nightly processing pipeline. Community pull requests for new parsers or scan reports are accepted, with CLA requirements to ensure licensing clarity. This design ensures the dataset remains a contemporary benchmark and reference set for the research community.
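One way such an extensible pipeline could be organized is a registry that maps each metadata field to an extraction function. The sketch below is purely hypothetical: the registry, the decorator, the field names, and the record layout are all assumptions for illustration and do not describe the project's actual code.

```python
from typing import Callable, Dict

# Hypothetical registry: metadata field name -> extraction function.
EXTRACTORS: Dict[str, Callable[[dict], object]] = {}


def register(field_name: str):
    """Decorator that adds an extractor to the pipeline under `field_name`."""
    def decorator(func):
        EXTRACTORS[field_name] = func
        return func
    return decorator


@register("permission_count")
def permission_count(app: dict) -> int:
    return len(app.get("permissions", []))


@register("uses_reflection")
def uses_reflection(app: dict) -> bool:
    # Flags any reference into the java.lang.reflect package.
    return any(c.startswith("java.lang.reflect.")
               for c in app.get("referenced_classes", []))


def run_pipeline(apps):
    """Apply every registered extractor to every app record."""
    return {
        app["sha256"]: {name: fn(app) for name, fn in EXTRACTORS.items()}
        for app in apps
    }


sample = [
    {"sha256": "aaa111", "permissions": ["INTERNET", "CAMERA"],
     "referenced_classes": ["java.lang.reflect.Method", "java.util.List"]},
    {"sha256": "bbb222"},
]
results = run_pipeline(sample)
```

With this shape, a contributed parser adds a field by registering one function, and existing extractors are left untouched.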

6. Research Applications and Supported Workflows

The dataset underpins a diverse range of research use-cases:

  • Malware Detection and Characterization: Leveraging VirusTotal (M₅) and library information (M₆) for family detection, payload analysis, and repackaging detection.
  • Static and Dynamic Analysis: ICC- and reflection-aware analyses, parameter mining, API recommendation, and code similarity studies.
  • Software Evolution: Longitudinal analysis via app lineages, market-driven adoption curves, and family versioning histories.
  • Product-Line Mining: Clustering and analysis of code variants and application “products” using SimiDroid and similar techniques.
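As an illustration of the software-evolution workflow, the sketch below groups versions by application ID, orders each lineage by version code, and reports permission changes between consecutive versions. The records and short permission names are invented; real inputs would come from the manifest metadata (M₂).

```python
from collections import defaultdict

# Invented version records for two applications.
versions = [
    {"app_id": "com.example.a", "version_code": 3,
     "permissions": {"CAMERA", "READ_CONTACTS"}},
    {"app_id": "com.example.a", "version_code": 1,
     "permissions": {"INTERNET"}},
    {"app_id": "com.example.b", "version_code": 7,
     "permissions": {"INTERNET"}},
    {"app_id": "com.example.a", "version_code": 2,
     "permissions": {"INTERNET", "CAMERA"}},
]


def build_lineages(records):
    """Group records by application ID, each group sorted by version code."""
    lineages = defaultdict(list)
    for rec in records:
        lineages[rec["app_id"]].append(rec)
    for history in lineages.values():
        history.sort(key=lambda r: r["version_code"])
    return dict(lineages)


def permission_changes(history):
    """List (old_version, new_version, added, removed) for adjacent versions."""
    changes = []
    for old, new in zip(history, history[1:]):
        added = sorted(new["permissions"] - old["permissions"])
        removed = sorted(old["permissions"] - new["permissions"])
        changes.append((old["version_code"], new["version_code"], added, removed))
    return changes


lineages = build_lineages(versions)
changes_a = permission_changes(lineages["com.example.a"])
changes_b = permission_changes(lineages["com.example.b"])
```

A single-version lineage yields no changes, which keeps the many small families from contributing spurious events to a longitudinal study.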

Supported research continues to expand as new metadata fields and analytic modules are integrated.

7. Comparative Perspective and Nomenclature

No dataset named “AndroidCode” appears among the 31 Android dataset collections surveyed by Geiger & Malavolta (Geiger et al., 2018). In post-2017 usage, “AndroidCode” refers to AndroZoo++ (Li et al., 2017). Researchers who need specific dataset features, such as commit-history granularity, source-binary alignment, or domain-specific analytic tags, are directed to AndroZoo++ or to another surveyed collection whose documented scope matches their use case.

AndroZoo++—as “AndroidCode”—remains the principal, scale-leading, and structurally rich corpus for Android app research, enabling reproducible evaluation, method benchmarking, and advanced program analysis across security- and engineering-centric disciplines.
