C-WebShop: A Multimodal Web Page Dataset
- C-WebShop is a multimodal dataset that integrates visual, textual, and numerical features from 49,438 web pages across six subject areas.
- It supports machine learning experiments through tasks like error page detection with CNNs and subject categorization using transfer learning with ResNet-50.
- The dataset facilitates cross-cultural analysis of web design by providing region-specific metadata and reproducible data collection pipelines.
The C-WebShop dataset is a large-scale, multimodal resource specifically designed for empirical research on the characteristics and classification of web pages. It integrates visual, textual, and numerical features extracted from 49,438 web pages spanning all countries and six major subject categories. By combining automated scraping from both search engines and curated web directories, the dataset achieves broad thematic coverage and substantial cultural diversity, making it suitable for machine learning experiments in web analytics, computer vision, and deep learning.
1. Composition and Structure
C-WebShop is explicitly multimodal, consisting of three primary data types:
- Visual Data: Full-page “webshots” saved as JPG images, capturing the complete visual appearance of each web page.
- Textual Data: Core descriptors, including the page URL and a structured “Name” identifier that encodes the source, thematic category, and originating country.
- Numerical Data: Quantitative metrics parsed from HTML, such as download time (ms), total source code size, counts of images, scripts, CSS files, tables, iFrames, style tags, and image metadata (file size and pixel dimensions).
All page samples are pre-categorized into one of six topics: Arts and Entertainment, Business and Economy, Education, Government, News and Media, Science and Environment. Metadata also records country and continent, facilitating region-specific and cross-cultural analysis of web design.
| Modality | Example Features | File Format / Encoding |
|---|---|---|
| Visual | Full-page screenshot (“webshot”) | JPG |
| Textual | URL, “Name” identifier (e.g., “B2Netherlands_791”) | Plaintext/CSV |
| Numerical | download time, HTML element counts, file size | Structured CSV/TSV |
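The “Name” identifier can be split into its parts mechanically. The sketch below is only an illustration: the exact encoding is not spelled out here, so the pattern (a leading letter, a number, a country name, an underscore, and a running index) is inferred from the single example “B2Netherlands_791”. The field names `prefix_letter`, `prefix_number`, `country`, and `index` are hypothetical labels, not the dataset's own terminology, and which prefix part stands for the source and which for the category is left open.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred only from the example "B2Netherlands_791":
# one letter, one or more digits, a country name, "_", then a numeric index.
NAME_PATTERN = re.compile(r"^([A-Za-z])(\d+)([A-Za-z][A-Za-z .'-]*)_(\d+)$")


class ParsedName(NamedTuple):
    prefix_letter: str   # hypothetical label
    prefix_number: int   # hypothetical label
    country: str
    index: int


def parse_name(name: str) -> Optional[ParsedName]:
    """Split a Name identifier into its parts, or return None if it does not match."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    letter, number, country, index = match.groups()
    return ParsedName(letter, int(number), country, int(index))


print(parse_name("B2Netherlands_791"))
```

Returning `None` for non-matching names, rather than raising, makes it easy to count how many identifiers deviate from the assumed pattern before relying on it.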
2. Data Collection and Processing Workflow
The dataset was constructed via two complementary pipelines:
- Searching: Automated Python scripts executed Google search queries using permutations of keywords, operators (“OR”, “site”, “ext:html”), and country codes. Each query fetched approximately 100 results, for which URLs and metadata (including inferred continent and topic) were stored.
- Browsing: Customized scripts traversed the Best of the Web (BOTW) directory, leveraging its built-in organization by country and category to systematically scrape curated links.
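The query construction in the searching pipeline can be sketched as follows. This is a minimal illustration, not the original collection script: the keyword list, the pairing of keywords with “OR”, and the use of the country code as a top-level domain in `site:` are all assumptions, and the function `build_queries` is a hypothetical name. The sketch only builds query strings and sends no requests.

```python
from itertools import permutations, product


def build_queries(keywords, country_codes, file_ext="html"):
    """Build search query strings from keyword permutations and country codes.

    Assumption: each query joins an ordered keyword pair with OR and restricts
    results to a country-code domain and a file extension.
    """
    queries = []
    for (kw_a, kw_b), cc in product(permutations(keywords, 2), country_codes):
        queries.append(f"{kw_a} OR {kw_b} site:.{cc} ext:{file_ext}")
    return queries


# Hypothetical keywords for the Education topic and two example country codes.
queries = build_queries(["university", "school", "college"], ["nl", "br"])
print(len(queries))
print(queries[0])
```

With three keywords and two country codes, the sketch yields one query per ordered keyword pair and country, which illustrates how quickly the number of queries grows as keywords are added.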
Subsequent steps involved:
- Parameter Extraction: HTML parsing (via Beautiful Soup) extracted the prescribed set of quantitative indicators.
- Webshot Acquisition: R scripts, utilizing Webshot and PhantomJS, generated complete page screenshots. The naming convention precisely links each image to its source attributes.
- Robust Error Handling: Automated routines marked missing or erroneous downloads with a placeholder (-1). Post-hoc debugging used a manually verified subset and a convolutional neural network (CNN) to automatically discriminate valid from error pages.
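The parameter-extraction and placeholder steps above can be sketched together. Note the substitution: the pipeline described here uses Beautiful Soup, whereas this sketch uses Python's standard-library `html.parser` so that it runs without third-party packages. The output keys, the choice to count `<link rel="stylesheet">` as CSS files, and the function name `extract_parameters` are assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

ERROR_PLACEHOLDER = -1  # marks missing or erroneous downloads, as described above

COUNTED_TAGS = ("img", "script", "table", "iframe", "style")


class TagCounter(HTMLParser):
    """Count selected start tags while the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in COUNTED_TAGS:
            self.counts[tag] += 1
        elif tag == "link":
            # Assumption: external CSS files are <link> tags with rel="stylesheet".
            rel = (dict(attrs).get("rel") or "").lower()
            if "stylesheet" in rel.split():
                self.counts["css"] += 1


def extract_parameters(html):
    """Return element counts and source size, or placeholders if html is None."""
    keys = COUNTED_TAGS + ("css",)
    if html is None:  # download failed
        return {"source_size": ERROR_PLACEHOLDER, **{k: ERROR_PLACEHOLDER for k in keys}}
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return {"source_size": len(html.encode("utf-8")), **{k: parser.counts[k] for k in keys}}


sample = (
    '<html><head><link rel="stylesheet" href="a.css"><style>p{}</style>'
    '<script src="a.js"></script></head>'
    '<body><img src="x.jpg"><img src="y.jpg"/><table></table></body></html>'
)
print(extract_parameters(sample))
print(extract_parameters(None))
```

Writing the placeholder into every field keeps failed downloads as rows in the datasheet, so they can be filtered or inspected later instead of silently disappearing.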
3. Analytical Applications
C-WebShop underpins a range of machine learning tasks:
- Error Web Page Detection (Binary Classification): A CNN using visual webshots as input and sigmoid output identifies error states (e.g., “404 Not Found”, suspension pages). The architecture utilizes stacked convolutional + max-pooling layers followed by a dense classifier.
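```python
# Import path assumed; standalone Keras exposes the same classes under `keras`.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Conv2D, Dense, MaxPooling2D

model = Sequential()
# First convolution + max-pooling block on 256x256 RGB webshots
model.add(Conv2D(32, (3, 3), input_shape=(256, 256, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# ... additional layers ...
# Final layer for binary classification:
model.add(Dense(1))
model.add(Activation('sigmoid'))
```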
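The model's accuracy is evaluated by:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP and TN denote true positives and true negatives, and FP and FN denote false positives and false negatives, respectively.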
- Subject-Based Categorization (Multi-Class): Transfer learning with pre-trained ResNet-50 CNNs is employed. The convolutional layers remain static for feature extraction, while the classifier is retrained for six-class prediction. This setting is complicated by substantial visual heterogeneity within subjects.
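For the binary error-page detector, accuracy is the share of correct predictions, (TP + TN) divided by all predictions. The sketch below shows one way to compute it from sigmoid outputs. It is an illustration under stated assumptions: the label convention (1 = error page, 0 = valid page), the 0.5 decision threshold, and the function names are hypothetical, and the example scores are made up rather than taken from the dataset.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count TP, TN, FP, FN after thresholding sigmoid outputs.

    Assumption: label 1 means "error page", label 0 means "valid page".
    """
    tp = tn = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        elif pred == 1 and truth == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


# Made-up labels and scores, for illustration only.
y_true = [1, 0, 1, 0, 1]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.8]
counts = confusion_counts(y_true, y_prob)
print(counts, accuracy(*counts))
```

Keeping the four counts separate from the accuracy value is useful when error pages are rare, since accuracy alone can look high even if most error pages are missed.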
4. Cultural and Regional Design Representation
C-WebShop systematically samples pages from “all countries worldwide” and tracks both country and continent for each entry. This enables empirical analysis of:
- Regional Aesthetics: Variation in color schemes, layout conventions, and iconographic trends according to geographic origin.
- Cultural Characteristics: Thematic versus stylistic elements across categories and countries.
The scope ensures that cross-cultural and international differences in web design are quantitatively represented, supporting studies in information aesthetics and cultural informatics.
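A region-level comparison of the numerical features could start from a simple grouped average. The sketch below is a hypothetical example: the column names `continent`, `country`, and `download_time_ms` are assumed rather than taken from the published datasheets, and the rows are invented toy values, not dataset measurements. Rows carrying the -1 placeholder are skipped so that failed downloads do not distort the means.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key, placeholder=-1.0):
    """Average a numeric column per group, ignoring placeholder values."""
    groups = defaultdict(list)
    for row in rows:
        value = float(row[value_key])
        if value == placeholder:  # missing or erroneous download
            continue
        groups[row[group_key]].append(value)
    return {group: mean(values) for group, values in groups.items()}


# Invented toy rows with assumed column names (not real dataset values).
SAMPLE_CSV = """continent,country,download_time_ms
Europe,Netherlands,420
Europe,France,380
Asia,Japan,510
Asia,India,-1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(mean_by_group(rows, "continent", "download_time_ms"))
```

The same function can group by `country` instead of `continent`, or average any other numeric column, by changing the two key arguments.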
5. Data Accessibility and Reproducibility
The entire dataset, including collection and preprocessing scripts (Python and R), is publicly available via the Open Science Framework (OSF). Download options provide access to:
- Webshot image archives (class- and country-organized).
- Textual and numeric datasheets (structured CSV/TSV).
- Source code for reproduction and extension.
No restrictive licenses are noted; distribution is free, promoting open research in web analytics and deep learning.
6. Significance for Research and Methodological Implications
C-WebShop is a substantial contribution to the empirical study of web pages, overcoming the challenge of integrating visual and functional parameters at scale. Its combination of automated search and curated browsing leads to high thematic and regional diversity, which is pivotal for training and evaluating computer vision and web classification models under realistic, heterogeneous conditions.
A plausible implication is that models validated on this dataset may generalize more robustly to “in-the-wild” web data than those trained on narrowly scoped web page corpora. The availability of complete sampling scripts and reproducible pipelines facilitates methodological transparency and allows adaptation for expanded research objectives.
C-WebShop thus serves as both a benchmark for machine learning on visual web page attributes and a source for cross-cultural design analysis, positioned at the intersection of web mining, empirical computer vision, and global information studies.