
ProgrammableWeb Dataset Overview

Updated 16 November 2025
  • ProgrammableWeb Dataset is a comprehensive, community-driven catalog of Web APIs and mashups that provides detailed metadata, usage relations, and time-evolution data.
  • It supports static and dynamic ecosystem analyses through sophisticated correction techniques and temporal annotations, ensuring reliable research outputs.
  • The dataset underpins empirical studies on API recommendation, survival analysis, and network modeling with rigorous preprocessing and feature extraction methods.

The ProgrammableWeb dataset is a comprehensive, community-driven catalog of Web APIs (application programming interfaces) and of the mashups (Web application projects) that use them, sourced from http://www.programmableweb.com. It is foundational for empirical research in service ecosystem analysis, personalized API recommendation, temporal network modeling, and the study of software service co-evolution. The dataset, as described in canonical studies such as “WebAPIRec: Recommending Web APIs to Software Projects via Personalized Ranking” (Thung et al., 2017) and “Data Correction and Evolution Analysis of the ProgrammableWeb Service Ecosystem” (Liu et al., 2021), encompasses metadata, usage relations, time-evolution data, and corrected status for APIs and mashups, supporting both static and dynamic analyses of web service ecosystems.

1. Dataset Collection and Scope

ProgrammableWeb aggregates and organizes information on publicly documented Web APIs and application "mashups" that consume these APIs. Each entity (API or mashup) is described by textual summaries, tags (keywords), submission metadata, and, where relevant, linkage to associated APIs.

  • Origin and Time Span: Data are crawled from ProgrammableWeb, covering all entries from inception to the snapshot of interest. The WebAPIRec paper (Thung et al., 2017) utilized a mid-2014 snapshot; evolution analysis (Liu et al., 2021) uses later snapshots, with dynamic testing and correction methods applied as of September 2020.
  • Scale (mid-2014 snapshot):
    • 9,883 APIs (|A|)
    • 4,315 Mashup projects (|P|)
    • Entity and relationship statistics reflect strict filtering: deprecated entities are excluded, and only active relationships are retained.

The dataset structure supports fine-grained ecosystem, recommendation, and evolutionary studies by accurately capturing API-mashup consumption relationships, entity metadata, and time-varying status.

2. Data Schema and Entity Attributes

Entities and their attributes reflect specialized needs for both static recommendation and longitudinal network analysis.

| Entity | Primary Fields |
|---|---|
| API | api_id, name, provider, developer, description_text, endpoint_url, primary_category (≈30 types), tags, submission and status timestamps, deprecation metadata (both raw, as reported, and corrected, algorithmic, “death” times), lineage fields (split/transfer) |
| Mashup | mashup_id, name, developer, description_text, homepage_url, primary_category, tags, submission and status timestamps, lists of associated APIs (both raw and corrected for split/transfer/death events) |
| Category | category_id (e.g., “Mapping”, “Social”) |
| Tag | tag_string |

  • Relationships:
    • Mashup–API invocation: Each mashup invokes a subset of APIs. In corrected datasets, invocation edges are temporally localized with [start, end] intervals reflecting service availability.
    • API–API co-occurrence: Defined when two APIs are used in the same mashup at the same time.
    • Category–Category co-occurrence: Derived based on API categories co-invoked within the same mashup.
  • Temporal Annotations: All entities and relations are enriched with creation, deprecation, and corrected activity intervals, enabling the construction of dynamic network snapshots at daily/yearly resolutions.
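
The schema and the interval-annotated invocation edges described above could be represented roughly as follows. This is a minimal sketch: the class names (`Api`, `Invocation`), the helper `snapshot`, and the convention that `end=None` means "still active" are illustrative assumptions, not the layout of the published dataset files.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Api:
    # Field names follow the schema table; types are assumptions.
    api_id: str
    name: str
    primary_category: str
    tags: List[str] = field(default_factory=list)
    created: Optional[date] = None
    raw_death: Optional[date] = None        # deprecation date as reported
    corrected_death: Optional[date] = None  # deprecation date after correction


@dataclass
class Invocation:
    # One mashup-API edge with its [start, end] activity interval.
    mashup_id: str
    api_id: str
    start: date
    end: Optional[date] = None  # assumption: None means open-ended


def snapshot(invocations: List[Invocation], t: date) -> List[Tuple[str, str]]:
    """Return the (mashup_id, api_id) edges whose interval contains day t."""
    return [
        (e.mashup_id, e.api_id)
        for e in invocations
        if e.start <= t and (e.end is None or t <= e.end)
    ]
```

Calling `snapshot` for a sequence of dates yields the daily or yearly snapshots mentioned above, one edge list per date.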

3. Data Quality Challenges and Correction Methodologies

Several reliability issues have been empirically identified in the raw ProgrammableWeb data (Liu et al., 2021):

  • Untrustworthy “deathpool” timestamps: Many deprecation dates cluster artificially, some precede creation dates, and survival analysis based on these dates is biased.
  • Incorrect API and Mashup Status: Only 44.7% of APIs labeled "available" are verifiably online; for mashups, only 32% remain live despite 80.4% labeled available. Misreported status leads to erroneous ecosystem structure if uncorrected.
  • Transfer, Split, and Composition Errors: Functionality sometimes migrates between APIs (“transfer”) or splits into multiple APIs; mashups may list dead APIs or exhibit partial functionality due to these events.

Correction Approach:

  1. Automated endpoint network testing, repeated over multiple dates, identifies actual reachability.
  2. Text mining captures split/transfer patterns from API/mashup descriptions.
  3. Manual validation is performed on random subsamples (≈100 APIs, 530 mashups).
  4. Unknown deprecation timelines are imputed with a normal-distribution survival model:

    $$\hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i,\quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2$$

    For each API/mashup, a life duration $d_x$ is sampled and used to estimate a corrected death date, subject to last-verified interval constraints.

  5. Mashup–API composition lists are updated to reflect corrected API availability, transfer, or split events.
  6. Node and edge temporal activity intervals are recomputed for use in dynamic network modeling.
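
Step 4 can be illustrated with a short sketch. It assumes that the $x_i$ are observed lifetimes in days, that the "last-verified interval" is the window between the last date an endpoint was seen alive and the first date it was seen dead, and that the constraint is enforced by rejection sampling with a midpoint fallback. The function names and the rejection strategy are illustrative assumptions; the authoritative procedure is the pseudocode in Liu et al. (2021).

```python
import random
from datetime import date, timedelta
from statistics import fmean, pstdev


def fit_normal(lifetimes_days):
    """Maximum-likelihood estimates (1/n variance) of a normal lifetime model."""
    mu = fmean(lifetimes_days)
    sigma = pstdev(lifetimes_days, mu)  # population standard deviation
    return mu, sigma


def impute_death(created, last_seen_alive, first_seen_dead,
                 mu, sigma, rng, max_tries=1000):
    """Sample a lifetime d_x ~ N(mu, sigma^2) and keep the first draw whose
    implied death date falls inside [last_seen_alive, first_seen_dead]."""
    for _ in range(max_tries):
        d_x = rng.gauss(mu, sigma)
        candidate = created + timedelta(days=round(d_x))
        if last_seen_alive <= candidate <= first_seen_dead:
            return candidate
    # Assumed fallback: midpoint of the verified window if no draw fits.
    return last_seen_alive + (first_seen_dead - last_seen_alive) / 2


if __name__ == "__main__":
    mu, sigma = fit_normal([400, 900, 1500, 2100])  # toy lifetimes in days
    rng = random.Random(42)
    print(impute_death(date(2010, 1, 1), date(2012, 1, 1),
                       date(2014, 1, 1), mu, sigma, rng))
```

Whichever branch is taken, the returned date lies inside the verified window, which is the property the correction relies on.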

Pseudocode for life-cycle estimation and mashup correction is given in Liu et al. (2021).

4. Preprocessing and Feature Construction

Textual information undergoes a multi-step transformation to enable effective information retrieval (IR), feature engineering, and machine learning tasks (Thung et al., 2017):

  • Noun Retention: POS tagging retains only nouns to focus on conceptual salience.
  • Tokenization and Stopword Removal: Standard tokenization and SMART-list stopword removal are applied.
  • Stemming: Porter stemming reduces words to root forms.
  • Textual Field Aggregation: API names, summaries, and long descriptions are merged; for mashups, long descriptions and tag lists are used.

From the processed text, a vector space model (VSM) is constructed using tf-idf weights:

$$w_{d, D, C} = \mathrm{TF}_{d, D} \cdot \log\left(\frac{N_C}{\mathrm{DF}_{d, C}}\right)$$

Cosine similarity between vectors is then used in constructing features for recommendation and classification.
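
The weighting and similarity steps can be sketched in a few lines. The sketch assumes raw term counts for TF, the natural logarithm, and documents that have already been tokenized, filtered, and stemmed; `tfidf_vectors` and `cosine` are hypothetical helper names, and the original study may differ in these details.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists (one per API or mashup description).
    Returns one sparse vector {term: TF * log(N / DF)} per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all-zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    docs = [["map", "photo"], ["map", "tweet"], ["photo", "share"]]
    vecs = tfidf_vectors(docs)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[1], vecs[2]))
```

A term that appears in every document receives weight zero under this formula, so it contributes nothing to the similarity.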

Feature Functions: For recommendation, 12 features $x_j(p, a)$ describe project–API pairs, including neighborhood textual and tag similarity metrics at multiple neighbor sizes (k = 5, 10, 15, 20, 25) and direct project–API similarities.

  • For reproducibility, researchers can replicate these steps by re-harvesting the data and applying the same transformations to the appropriate snapshot.

5. Statistical Properties and Ecosystem Dynamics

Aggregate Metrics

  • Sparsity: The API–project usage matrix $R$ of size $4315 \times 9883$ has density $< 0.2\%$; each project uses approximately 3.2 APIs on average (median = 2), each API is used by approximately 1.4 projects (median = 1).
  • Distributional Characteristics: API use is long-tailed; most projects use 1–3 APIs, a minority use many. Popular APIs (e.g., Google Maps, Twitter) are used by hundreds of projects; most APIs are rarely used.
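
The aggregate figures above are mutually consistent: with about 3.2 APIs per project, the density is roughly 3.2 / 9883, or about 0.03%, and the implied mean number of projects per API is 4315 × 3.2 / 9883, or about 1.4. The following sketch shows how such statistics could be computed from a mashup-to-APIs mapping; the function name `usage_stats` and the input layout are assumptions for illustration.

```python
from statistics import mean, median


def usage_stats(usage, n_apis):
    """usage: dict mapping mashup_id -> set of api_ids it invokes.
    n_apis: total number of APIs in the catalog (used or not)."""
    n_projects = len(usage)
    per_project = [len(apis) for apis in usage.values()]
    nonzeros = sum(per_project)  # filled cells of the usage matrix R
    return {
        "density": nonzeros / (n_projects * n_apis),
        "mean_apis_per_project": mean(per_project),
        "median_apis_per_project": median(per_project),
        "mean_projects_per_api": nonzeros / n_apis,
    }


if __name__ == "__main__":
    # Consistency check on the reported snapshot figures.
    print(3.2 / 9883)          # implied density of R
    print(4315 * 3.2 / 9883)   # implied mean projects per API
```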

Dynamic Network Models

Liu et al. (2021) demonstrate three temporal network constructs based on the corrected dataset:

  1. Mashup–API Bipartite Graph: Nodes ($M$, $A$), edges $(m, a, \mathrm{start}, \mathrm{end})$ for mashup-invoked APIs over their joint activity interval.
  2. API–API Homogeneous Network: Unweighted edge $(u, v)$ exists at $t$ if APIs co-occur in any mashup alive at $t$.
  3. Category–Category Hypergraph: Derived by mapping APIs to primary categories; co-occurrence in mashups is used to define inter-category edges.

Network evolution is analyzed via snapshot sequences, with network statistics such as degree, edge growth, temporal centrality, and power-law fit computed per time window.
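
As an illustration of the second construct, the sketch below derives the API–API edge set at a time t from interval-annotated invocation edges. It assumes edges are given as (mashup, api, start, end) tuples with comparable time values (years are used here) and that an API counts as co-occurring only while its own invocation interval is active; the function name `api_api_edges` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def api_api_edges(invocations, t):
    """invocations: iterable of (mashup, api, start, end) tuples.
    Returns the set of unordered API pairs co-invoked by some mashup at t."""
    active = defaultdict(set)
    for mashup, api, start, end in invocations:
        if start <= t <= end:
            active[mashup].add(api)
    edges = set()
    for apis in active.values():
        # Sorting gives each unordered pair one canonical representation.
        edges.update(combinations(sorted(apis), 2))
    return edges


if __name__ == "__main__":
    inv = [
        ("m1", "maps", 2008, 2015),
        ("m1", "twitter", 2008, 2012),
        ("m2", "maps", 2010, 2020),
        ("m2", "flickr", 2010, 2020),
    ]
    for year in (2011, 2014):
        print(year, sorted(api_api_edges(inv, year)))
```

Evaluating the function over a range of t values gives the snapshot sequence on which per-window statistics such as degree or edge growth can be computed.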

Correction-Induced Revisions

  • The corrected dataset revises the apparent health, connectivity, and diversity trends of the ecosystem:
    • Both the number of available APIs/mashups and the ecosystem’s diversity peak around 2014–2016 and decline thereafter, in contrast to the illusion of continuous growth in uncorrected data.
    • Degree distributions depart from clear power-law behavior post-correction.
    • The largest API–API network component shrank from over 1,000 APIs in 2013 to under 200 by 2020.

6. Experimental Protocols and Data Splitting

In API recommendation and retrieval tasks (Thung et al., 2017), standard experimental protocols are enforced:

  • 10-Fold Cross-Validation: Projects are partitioned into 10 folds; in each, parameters are trained on 9 and evaluated on the 10th.
  • Learning Curves: Training set size is varied from 10% to 90% to assess generalization and data-efficiency.
  • Pairwise Ranking Loss: Training data comprises triplets $(p, a, a')$ expressing a pairwise preference (project $p$ uses $a$ and not $a'$). A linear scoring function $f(p, a)$ is trained with squared hinge loss and $L_2$ regularization (λ = 1).
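
The pairwise objective can be sketched as follows. The sketch assumes a linear score f(p, a) = w · x(p, a), an objective of the form Σ max(0, 1 − w · (x(p, a) − x(p, a')))² + λ‖w‖², and plain batch gradient descent. The exact scaling of the regularizer and the optimizer used by Thung et al. (2017) are not stated here, so both are assumptions, as are the function names.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def ranking_loss(w, pairs, lam=1.0):
    """pairs: list of (x_pos, x_neg) feature vectors for one project, where
    x_pos belongs to a used API and x_neg to an unused one."""
    hinge = 0.0
    for x_pos, x_neg in pairs:
        diff = [a - b for a, b in zip(x_pos, x_neg)]
        hinge += max(0.0, 1.0 - dot(w, diff)) ** 2
    return hinge + lam * dot(w, w)


def train(pairs, dim, lam=1.0, lr=0.01, epochs=200):
    """Batch gradient descent on the squared-hinge ranking objective."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [2.0 * lam * wi for wi in w]  # gradient of lam * ||w||^2
        for x_pos, x_neg in pairs:
            diff = [a - b for a, b in zip(x_pos, x_neg)]
            margin = 1.0 - dot(w, diff)
            if margin > 0.0:  # only violated pairs contribute
                for j in range(dim):
                    grad[j] -= 2.0 * margin * diff[j]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


if __name__ == "__main__":
    pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.8, 0.1], [0.2, 0.5])]
    w = train(pairs, dim=2)
    print(w, ranking_loss(w, pairs))
```

At recommendation time, candidate APIs for a project would be sorted by `dot(w, x)` in descending order.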

7. Availability, Limitations, and Research Directions

  • Download and Reproducibility: The corrected ProgrammableWeb dataset (complete with status corrections, activity intervals, and dynamic network snapshots) is available at https://github.com/HIT-ICES/Correted-ProgrammableWeb-dataset, with code and Jupyter notebooks for analysis (Liu et al., 2021).
  • Known Limitations:
    • A high proportion of misreported statuses and unreliable “deathpool” dates in the raw dataset mandates reliance on the corrected version for longitudinal or reliability-sensitive studies.
    • Descriptive text is often noisy or sparse. Practice/demo (“toy”) mashups distort usage statistics.
    • API invocation order within mashups is not recorded.
    • Gaps exist in the coverage of newer web service types, such as cognitive/speech/vision APIs.

Open Problems and Opportunities:

  • Generative dynamic models for birth-death processes in API/mashup networks.
  • Survival-aware recommendation algorithms incorporating estimated service longevity.
  • Filtering and improved curation for “toy” mashups and sparse text scenarios.
  • Analysis of API family trees, split/transfer events, and microservice integration.
  • Extension of these methodologies for multi-layered enterprise and microservice-centric ecosystems.

A plausible implication is that studies not addressing the data quality issues or using the uncorrected version risk substantially biased ecosystem models or incorrect inference about API/mashup evolution.
