Iterative National TLD Crawling
- Iterative national TLD crawling is a systematic process combining public data streams and specialized infrastructure to extract, filter, and update ccTLD web corpora.
- It employs methodologies such as CT log parsing, Common Crawl snapshot analysis, and real-time polling to maintain high coverage and freshness.
- The approach achieves measurable improvements in web corpus quality and volume, with roughly 80% novel content between biennial re-crawls and over 90% of discovered domains resolving on the live web.
Iterative national top-level domain (TLD) crawling refers to the repeated, systematic acquisition and processing of web data from country-code TLDs (ccTLDs) such as .hr, .bg, .ru, and others, using a combination of public data sources and specialized crawling infrastructure. This approach addresses the challenge of obtaining representative, up-to-date web corpora for distinct national webs where access to authoritative domain lists (i.e., zone files) is often restricted. Recent research formalizes methodologies for constructing, maintaining, and evaluating large-scale ccTLD crawls, with particular attention to coverage, freshness, and noise management (Sommese et al., 2023, Pungeršek et al., 16 Jan 2026).
1. Data Sources and Extraction Techniques
Two principal public data streams inform ccTLD discovery: Certificate Transparency (CT) logs and Common Crawl (CC) web snapshots. CT logs comprise append-only records of issued TLS certificates and can be queried via public mirrors, APIs, or streamed in near real-time. Extraction involves parsing X.509 certificate fields—specifically CommonName and SubjectAlternativeName fields—followed by normalization to the registered domain level using resources such as the Public Suffix List. Filtering restricts occurrences to target ccTLDs, discarding wildcards and, where needed, de-duplicating expired or repeated domain names.
Common Crawl data is available as monthly archives of WARC-format crawls on Amazon S3. Hostnames are extracted using URL index processing (typically with tools such as jq and tldextract), then filtered to ccTLDs of interest. Filtering criteria remain consistent with those applied to CT output.
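A hostname-extraction pass over the Common Crawl URL index can be sketched as below, assuming one JSON record per line with a `url` field (stdlib parsing stands in for the jq + tldextract toolchain mentioned above):

```python
# Extract unique hostnames under target ccTLDs from Common Crawl
# URL-index records (one JSON object with a "url" field per line).
import json
from urllib.parse import urlsplit

def cc_hosts(index_lines, cctlds=(".hr", ".bg")):
    """Yield unique hostnames whose suffix matches a target ccTLD."""
    seen = set()
    for line in index_lines:
        host = urlsplit(json.loads(line)["url"]).hostname or ""
        if host.endswith(cctlds) and host not in seen:
            seen.add(host)
            yield host
```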
The iterative crawling process is typically initialized by aggregating all historical data available from these sources, followed by periodic re-extraction: CT logs are continuously tailed or polled, whereas Common Crawl is sampled according to monthly or bimonthly release cadence. Additional augmentation may derive from port scans (e.g., masscan or zmap) across the IPv4 address space for web ports, and—where available—passive DNS datasets.
2. Iterative Crawling Architecture and Workflow
A canonical ccTLD-focused crawling infrastructure encompasses the following pipeline:
- Seed-list construction: Begin with national ccTLDs and augment with generic TLDs whose hosts directly link into ccTLD sites.
- Frontier and concurrency management: Maintain a prioritized per-host queue (e.g., crawlDB in MaCoCu/SpiderLing) to ensure adherence to politeness constraints and robots.txt directives. Rate-limiting policies typically enforce a delay of at least one second between successive requests to the same host, served from a finite global worker-thread pool.
- Batching and segmentation: Segment seeds by TLD or domain hash; label each crawl iteration with a unique ID and timestamp. Options include excluding previously seen URLs or, as in CLASSLA-web, fully re-crawling with deduplication downstream (Pungeršek et al., 16 Jan 2026).
- Fetch and content processing pipeline:
- Raw HTML to cleaned text using jusText.
- Paragraph-level near-duplicate removal with normalization (Onion), masking digits, punctuation, and hyperlinks.
- Document and paragraph-level language identification via ensembles (e.g., CLD2, trigrams, Naïve Bayes specialized by language cluster).
- Exclusion of out-of-language, short, or malformed documents.
- Manual audit of top-N contributing hosts to filter machine-generated or spam-laden domains.
- Annotation for genre (retaining labels with classifier confidence p ≥ 0.8) and topic (IPTC classifier, p ≥ 0.6), plus linguistic features using frameworks such as CLASSLA-Stanza.
- Export of finalized corpora in standard formats (e.g., JSONL, VERT).
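The frontier-management step above can be sketched as a per-host queue with a minimum inter-request delay. This is a simplified stand-in for SpiderLing's crawlDB, with hypothetical class and method names; a production frontier would add robots.txt handling, priorities, and persistence:

```python
# Per-host politeness frontier: a host only becomes eligible for its
# next fetch once the configured delay has elapsed since its last one.
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class Frontier:
    def __init__(self, delay=1.0):
        self.delay = delay                   # seconds between fetches per host
        self.queues = defaultdict(deque)     # host -> pending URLs
        self.next_ok = defaultdict(float)    # host -> earliest next fetch time

    def add(self, url):
        self.queues[urlsplit(url).hostname].append(url)

    def pop(self, now=None):
        """Return a fetchable URL respecting per-host delays, else None."""
        now = time.monotonic() if now is None else now
        for host, q in self.queues.items():
            if q and now >= self.next_ok[host]:
                self.next_ok[host] = now + self.delay
                return q.popleft()
        return None
```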
A procedural overview is summarized below:
| Stage | Tool/Component | Objective |
|---|---|---|
| Seed extraction | CT/CC + tldextract | National domain discovery |
| Frontier management | crawlDB or equivalent | Politeness, concurrency, segmentation |
| Fetching | Multi-threaded crawlers | Raw content acquisition |
| Filtering/Dedup | jusText, Onion | Content cleaning, duplication control |
| Audit/Annotation | CLD2, Naïve Bayes, CLASSLA-Stanza | Language and genre/topic labeling |
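The filtering/dedup stage can be sketched in the spirit of Onion: each paragraph is normalized (digits, punctuation, and hyperlinks masked) before hashing, so boilerplate that differs only in dates or counters still collapses to one copy. The masking rules below are illustrative, not Onion's actual normalization:

```python
# Paragraph-level near-duplicate removal via normalized exact matching.
import re

def normalize(paragraph: str) -> str:
    p = paragraph.lower()
    p = re.sub(r"https?://\S+", "<url>", p)   # mask hyperlinks
    p = re.sub(r"\d+", "<num>", p)            # mask digits
    p = re.sub(r"[^\w<>\s]", "", p)           # drop punctuation
    return " ".join(p.split())

def dedupe(paragraphs):
    """Keep the first occurrence of each normalized paragraph."""
    seen, kept = set(), []
    for p in paragraphs:
        key = normalize(p)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept
```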
3. Formal Evaluation Metrics and Coverage
Distinct metrics have been established to quantify and compare crawl effectiveness over time and across methodologies:
- Coverage ($C$) measures the fraction of ground-truth ccTLD domains recovered from public sources: $C = |D_{\text{public}} \cap D_{\text{gt}}| \,/\, |D_{\text{gt}}|$.
- Overlap and Gain for successive corpus iterations $C_t$ and $C_{t+1}$:
  - Overlap: $\mathrm{Overlap}(C_t, C_{t+1}) = |C_t \cap C_{t+1}| \,/\, |C_{t+1}|$
  - Gain: $\mathrm{Gain}(C_t, C_{t+1}) = |C_{t+1} \setminus C_t| \,/\, |C_{t+1}| = 1 - \mathrm{Overlap}(C_t, C_{t+1})$
- Live web presence: Percentage of extracted domains resolving to valid A/AAAA records and with ports 80/443 open, reflecting practical web visibility.
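Over sets of identifiers (URLs or document hashes), the overlap and gain metrics reduce to elementary set arithmetic. The sketch below assumes overlap is measured relative to the newer corpus, so that gain is its complement:

```python
# Overlap and gain between two crawl iterations, as set ratios.
def overlap(old: set, new: set) -> float:
    """Share of the newer corpus already present in the older one."""
    return len(old & new) / len(new)

def gain(old: set, new: set) -> float:
    """Share of the newer corpus that is novel material."""
    return len(new - old) / len(new)
```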
Empirically:
- Average coverage for public-source ccTLD lists was 59% in 2023 (43%–80% across 19 ccTLDs), rising from 37% (15%–50%) in 2018—an approximate 1.6× growth (+4–5 percentage points per year) (Sommese et al., 2023).
- Among extracted domains, >90% resolved and served HTTP(S), and CT logs contributed an absolute ~52 percentage points (pp) of coverage, with Common Crawl adding ~7 pp.
- Typical lag for newly registered ccTLD domains to appear in CT logs was ≤1 day for 60% of cases, ≤5 days for 80%.
For iterative corpus builds as in CLASSLA-web:
- Two-year interval re-crawls produced corpora with only ~18% overlap, yielding ~82% new material.
- Corpus size increased by +46% (texts) and +57% (words). Some national webs saw overlap as low as 11–14% after two years, while even modest-sized ccTLDs yielded >70% novel content (Pungeršek et al., 16 Jan 2026).
4. Quality Challenges: Machine-Generated and Low-Quality Content
A critical challenge in iterative ccTLD crawling is the proliferation of machine-generated, templated, or low-quality sites. In CLASSLA-web 2.0, manual inspection identified ~15% of texts from the top 250 domains as predominantly low-quality or auto-generated, a significant increase from ~1% in CLASSLA-web 1.0 for comparable sets. This degradation necessitates both automated (near-duplicate heuristics, template detection) and manual interventions. Strategies include maintaining a domain blacklist, prioritizing or blocking suspect sub-domains, and ongoing audits focusing on high-volume contributors.
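One cheap triage signal for the manual-audit step is the share of repeated paragraphs within a domain, since templated and machine-generated sites tend to recycle text heavily. The heuristic and its threshold below are hypothetical starting points, not values taken from the cited corpora:

```python
# Rank domains by repeated-paragraph share to prioritize manual audits.
from collections import Counter

def template_score(paragraph_lists):
    """Fraction of paragraph occurrences in a domain that are repeats."""
    counts = Counter(p for doc in paragraph_lists for p in doc)
    total = sum(counts.values())
    repeats = sum(c - 1 for c in counts.values() if c > 1)
    return repeats / total if total else 0.0

def flag_domains(domain_docs, threshold=0.5):
    """Return domains whose template score exceeds the threshold."""
    return sorted(d for d, docs in domain_docs.items()
                  if template_score(docs) > threshold)
```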
5. Practical Management and Best Practices
Sustained iterative crawling requires robust operational strategies:
- Automation: Maximize automation for crawling, deduplication, and language identification, supplemented by scheduled manual audits for key hosts.
- Language identification: Deploy ensemble models and specialist classifiers for linguistically diverse webs.
- Near-duplicate detection: Tune normalization and deduplication heuristics with reference to manual gold standards; mask variable tokens during matching.
- Corpus overlap estimation: Employ URL overlap as a computationally efficient proxy for text overlap, calibrated with linear regressions fit on empirical datasets.
- Freshness and recrawl intervals: Empirical findings indicate that 18–24 month recrawl intervals consistently produce >80% novel content, balancing data freshness and infrastructure demands (Pungeršek et al., 16 Jan 2026).
- Transparency: Document filter policies and provide mechanisms for content removal on request, enhancing reproducibility and user control.
Technical implementation frequently utilizes Bloom filters or persistent key–value stores for deduplication and timestamp tracking, along with resilient queues such as Kafka or Redis for domain scheduling. Politeness and rate-limiting—enforced via per-host queues and connection limits—are essential both for ethical practice and effective resource utilization.
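A Bloom filter for seen-URL tracking, as mentioned above, can be sketched as follows; the k probe positions are derived from a single digest via double hashing. This is a self-contained illustration, not production code, where a library or persistent key-value store would be used instead:

```python
# Minimal Bloom filter for membership tests over previously seen URLs.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1   # odd step for double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: str) -> bool:
        # May return a false positive, but never a false negative.
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))
```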
6. Limitations, Trade-offs, and Recommendations
Iterative ccTLD crawling based solely on public resources faces inherent limitations:
- CT logs offer incomplete but rapidly improving coverage—restricted to TLS-enabled domains.
- Common Crawl is limited to domains encountered through hyperlink traversal.
- Port scans and passive DNS datasets, while costlier or less universally available, recover an additional 10–20% of domains.
Trade-offs include the computational and storage burden of parsing large-scale CT and CC data (e.g., 38 CT logs, petabytes of WARC files), offset against the cost and administrative challenges of obtaining ground-truth zone files or conducting exhaustive port scans.
Recommendations underscore deploying a combined, regularly refreshed CT+CC pipeline to maximize web-relevant ccTLD coverage at minimal financial outlay. To further approach full national web enumeration, supplementary scans and, where feasible, negotiated zone file access remain advisable. Continuous monitoring of corpus health and proactive filtering are considered critical to offset growing volumes of non-human-generated content, preserving the utility and representativeness of the resulting national web corpora (Sommese et al., 2023, Pungeršek et al., 16 Jan 2026).