
WebLI Dataset: Multi-Modal Web Analysis

Updated 1 January 2026
  • WebLI is a multi-modal dataset comprising 49,438 validated web pages from diverse global regions and thematic categories.
  • It integrates visual webshots, HTML attributes, and quantitative features to enable comprehensive analysis of web structure.
  • The dataset supports empirical studies in web design, error-page detection, and ML benchmarking using reproducible methods.

The WebLI dataset constitutes a large-scale, multi-modal resource for computational web page analysis, integrating comprehensive visual, qualitative, and quantitative data from 49,438 fully validated web pages. Designed to facilitate empirical studies of web appearance, structure, and topical distribution, WebLI is notable for combining visual webshots, HTML-derived features, and categorical metadata with global coverage spanning all sovereign countries and major thematic domains. Developed by Mejía-Escobar, Cazorla, and Martínez-Martín, the dataset is publicly released under unrestricted CC0-like licensing, supporting both academic and commercial usage (Mejia-Escobar et al., 2021).

1. Dataset Composition and Topical-Geographic Scope

WebLI encompasses 49,438 unique, non-error web pages. The topical segmentation consists of six high-level classes: Arts & Entertainment (7,752 pages), Business & Economy (8,438), Education (7,892), Government (7,354), News & Media (10,044), and Science & Environment (7,958), distributed across a global corpus of URLs. Geographic stratification was achieved by harvesting from country-specific TLDs, yielding approximate continental representation as follows: Europe (35%), Asia (30%), North America (15%), South America (10%), Africa (7%), and Oceania (3%). Only web pages with both successful parameter extraction and valid webshot were retained, ensuring a high-quality reference set (Mejia-Escobar et al., 2021).
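
A quick sanity check on these figures can be run directly against the metadata. A minimal sketch, assuming the metadata file /data/webli_parameters.csv described in Section 2 and a category column matching the schema in Section 4:

import pandas as pd

# Load the per-page metadata table (path per the OSF layout in Section 2).
df = pd.read_csv("data/webli_parameters.csv")

# Tally pages per thematic class; the six counts should sum to 49,438.
print(df["category"].value_counts())
print("total pages:", len(df))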

2. Data Modalities, Structure, and Storage

Each web page record in WebLI consists of four complementary modalities:

  • Screenshot image ("webshot"): Full-page JPEG image capturing rendered appearance.
  • Raw HTML: Original HTML file retrieved during collection.
  • Extracted quantitative features: Key HTML parameters, including counts of <img>, <script>, <link rel="stylesheet">, <table>, <iframe>, and <style> tags, as well as timing and size metrics.
  • Qualitative labels: Categorical variables specifying category, country, continent, and retrieval method ("Searching" or "Browsing").

The dataset is organized via a hierarchical storage structure on OSF:

| Directory | Content Type                   | Example Path                                        |
|-----------|--------------------------------|-----------------------------------------------------|
| /images/  | Webshots organized by category | /images/Arts_and_Entertainment/*.jpg                |
| /data/    | Feature & metadata files       | /data/webli_parameters.csv, country_list.txt        |
| /code/    | Scripts and notebooks          | /code/scraping_search.py, cnn_error_detection.ipynb |
| /docs/    | Documentation                  | /docs/README.md                                     |

Total storage requirement is approximately 18 GB, including ~17 GB for images and ~200 KB for CSV metadata.
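
Given this layout, the modalities belonging to one record can be resolved from its metadata row. A minimal sketch; the webshot path follows the documented /images/<category>/ convention, while the raw-HTML location is an assumption (its exact subdirectory is not specified above):

from pathlib import Path

ROOT = Path("webli")  # local copy of the OSF project; path is illustrative

def record_paths(row):
    """Resolve the webshot and (assumed) raw-HTML file for one metadata row."""
    webshot = ROOT / "images" / row["category"] / f"{row['name']}.jpg"
    html = ROOT / "data" / "html" / f"{row['name']}.html"  # assumed location
    return webshot, html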

3. Automated Data Collection and Quality Control

Data acquisition combined two collection strategies:

  • "Searching": Automated Google queries scripted in Python to retrieve the top 100 results per country-category pairing, e.g., “(arts OR entertainment) site:.fr ext:html.”
  • "Browsing": Iterative extraction from the Best-Of-The-Web directory using Python scripting templates.

Feature extraction employed a Python+BeautifulSoup workflow to record HTML download times, sizes, and structural feature counts. Page rendering and screenshot capture utilized an R-scripted pipeline (capture_webshots.R) with PhantomJS, storing webshots as JPGs following a standardized naming convention, [source][categoryID][countryCode]_[seq].jpg (e.g., B2NL_791.jpg). Data rows lacking either parameter extraction or a valid webshot were discarded. Additional filtering included programmatic outlier removal based on the interquartile-range (IQR) rule and binary CNN-based error-page detection with a final validation accuracy of ≈97% on held-out "Browsing" samples. Missing data in specific fields were tagged with −1 (Mejia-Escobar et al., 2021).
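
The structural counts can be reproduced with a short BeautifulSoup pass. A minimal sketch; field names follow the schema in Section 4, and interpreting the *_files counts as externally referenced resources is an assumption:

import time
import requests
from bs4 import BeautifulSoup

def extract_features(url: str) -> dict:
    """Download one page and count the structural features listed in Section 2."""
    t0 = time.time()
    resp = requests.get(url, timeout=30)
    download_time_ms = (time.time() - t0) * 1000
    soup = BeautifulSoup(resp.content, "html.parser")
    return {
        "download_time": download_time_ms,
        "html_bytes": len(resp.content),
        "n_images": len(soup.find_all("img")),
        "n_script_files": len(soup.find_all("script", src=True)),  # external scripts (assumption)
        "n_css_files": len(soup.find_all("link", rel="stylesheet")),
        "n_tables": len(soup.find_all("table")),
        "n_iframes": len(soup.find_all("iframe")),
        "n_style_tags": len(soup.find_all("style")),
    }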

4. Metadata Schema and Example Record

The dataset's metadata is captured in CSV or JSON with one entry per web page. Key fields include:

  • name (unique identifier), url, source, country, continent, category
  • download_time (ms), html_bytes (bytes), n_images, n_script_files, n_css_files, n_tables, n_iframes, n_style_tags
  • img_bytes (webshot file size), img_width, img_height (pixels)

Sample JSON record:

{
  "name": "B2NL_791",
  "url": "http://www.example.nl/news/today.html",
  "source": "Browsing",
  "country": "Netherlands",
  "continent": "Europe",
  "category": "News_and_Media",
  "download_time": 24.8,
  "html_bytes": 48216,
  "n_images": 4,
  "n_script_files": 2,
  "n_css_files": 1,
  "n_tables": 0,
  "n_iframes": 0,
  "n_style_tags": 1,
  "img_bytes": 312384,
  "img_width": 992,
  "img_height": 3859
}
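
Because missing values are tagged with −1 (Section 3), it is convenient to convert that sentinel to NaN before computing statistics. A small sketch over the numeric columns of the metadata file:

import numpy as np
import pandas as pd

df = pd.read_csv("data/webli_parameters.csv")

# Replace the -1 missing-data sentinel with NaN in numeric fields so that
# means and quartiles are not skewed by the placeholder value.
numeric_cols = df.select_dtypes("number").columns
df[numeric_cols] = df[numeric_cols].replace(-1, np.nan)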

5. Feature Definitions, Evaluation Metrics, and Baseline Models

Fundamental statistical and ML constructs used in dataset curation and benchmarking include:

  • Outlier removal: Values where $x < Q_1 - 1.5 \cdot \mathrm{IQR}$ or $x > Q_3 + 1.5 \cdot \mathrm{IQR}$ (with $Q_1$, $Q_3$ the lower and upper quartiles).
  • Error-page classification accuracy: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.
  • Binary cross-entropy loss for error-page CNN:

$$\mathcal{L}_{\mathit{binary}} = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

  • Categorical cross-entropy loss for 6-way classification:

$$\mathcal{L}_{\mathit{categorical}} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{c=1}^{6} y_{i,c} \log(\hat{y}_{i,c})$$

with $y_{i,c}$ as one-hot labels.
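
The IQR rule above translates directly into a vectorized filter. A minimal sketch over a single numeric column (the column choice is illustrative):

import pandas as pd

def iqr_mask(series: pd.Series) -> pd.Series:
    """Boolean mask keeping values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Example: drop rows whose HTML size falls outside the IQR fences.
# df = df[iqr_mask(df["html_bytes"])]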

Baseline models and achieved benchmarks:

  • Error-page detection: CNN with three Conv2D+ReLU+MaxPool blocks, a fully-connected classifier, binary cross-entropy loss, and RMSprop optimization; validated accuracy ≈97% (see the sketch after this list).
  • Multi-class subject categorization: Transfer learning with a frozen ImageNet pre-trained ResNet-50 backbone, a new 256-unit dense head, and softmax output; training accuracy ≈94% after 500 epochs, but validation accuracy only ≈40% on 2,202 images, indicating pronounced overfitting (Mejia-Escobar et al., 2021).
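
A hedged Keras sketch of the error-page detector; the three-block structure, loss, and optimizer follow the description above, while the filter counts and input resolution are illustrative assumptions:

from tensorflow import keras
from tensorflow.keras import layers

# Binary error-page detector: three Conv2D+ReLU+MaxPool blocks feeding a
# dense classifier. Layer widths and the 224x224 input are assumptions.
model = keras.Sequential([
    layers.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"), layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(error page)
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])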

6. Access, Licensing, and Reproducibility

WebLI is publicly accessible via OSF at https://osf.io/7ghd2/, including all raw images, feature CSVs, and full processing code base (Python and R scripts, Jupyter notebooks). Licensing is CC0/Public Domain—users may employ the data and scripts without restriction beyond standard citation requirements. The dataset design enables immediate reproducibility for web-scale ML benchmarking and comparative studies.

7. Applications and Research Significance

WebLI supports a diverse array of empirical web research tasks, including:

  • Automated detection and exclusion of error or “under construction” web pages from downstream pipelines (via CNNs).
  • Analysis of global web design characteristics, structural complexity, and content typology by country or topical category.
  • Benchmarking ML models for visual and structural web page categorization.
  • Analysis of cross-country and cross-domain design languages and feature distributions.

A plausible implication is that WebLI provides a foundation for extending ML-driven approaches to web page classification, rendering analysis, and cross-cultural studies in web design at scale. Its integration of multiple data modalities distinguishes it from previous datasets limited to either visual or textual content (Mejia-Escobar et al., 2021).
