Mosaiks Housing Price Data

Updated 4 February 2026

Mosaiks Housing Price Data is derived from online advertisements and refined through rigorous deduplication and machine learning techniques.
The methodology integrates spatial blocking, record linkage, and clustering to generate unbiased market indicators such as supply, demand, liquidity, and price indexes.
High-frequency data with fine spatial granularity enables timely detection of housing market cycles and facilitates robust policy analysis.

Mosaiks Housing Price Data refers to granular, high-frequency data derived from online housing sales advertisements and systematically processed for research and applied market analysis. These data, exemplified by the methodology outlined in Loberto, Luciani, and Pangallo (2020), integrate machine learning-based deduplication with robust statistical techniques to construct unbiased indicators of housing supply, demand, liquidity, and price dynamics. Their distinctive merits are real-time availability, fine spatial granularity, and the informational richness of asking-price and user-interaction metrics, which together facilitate the timely detection of housing market cycles and shocks (Loberto et al., 2020).

1. Biases and Structural Challenges in Online Listing Data

Online housing advertisements introduce several pervasive biases, predominantly through the presence of duplicates, which stem from open-listing mandates (permitting a homeowner to list with multiple brokers), brokers' strategic reposting, and self-posting by private sellers. These practices substantially inflate the apparent supply of dwellings, especially at fine spatial levels, and induce overrepresentation of expensive or illiquid properties. This leads to distortions in aggregated metrics such as average asking price, time-on-market, and estimated turnover. A rigorous approach to deduplication is thus essential prior to any inferential or descriptive analysis.
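The inflation effect can be illustrated with a toy calculation. The sketch below uses invented ads in which the most expensive dwelling is advertised three times; the dwelling IDs are assumed to be known here, whereas in real listing data they have to be inferred by deduplication.

```python
from statistics import mean

# Hypothetical ads as (dwelling_id, asking_price). Dwelling "C" is advertised
# three times (for example by several brokers), the others once.
ads = [
    ("A", 200_000),
    ("B", 200_000),
    ("C", 500_000),
    ("C", 500_000),
    ("C", 500_000),
]

# Naive indicators computed directly on ads.
raw_supply = len(ads)
raw_mean_price = mean(price for _, price in ads)

# Indicators after keeping one record per dwelling.
dwellings = dict(ads)  # later duplicates overwrite earlier ones
true_supply = len(dwellings)
true_mean_price = mean(dwellings.values())

print(raw_supply, raw_mean_price)    # computed on ads
print(true_supply, true_mean_price)  # computed on dwellings
```

In this example the ad-level supply is 5 against 3 actual dwellings, and the ad-level mean asking price is 380,000 against 300,000, because the duplicated dwelling is also the most expensive one.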

2. Deduplication: Machine Learning Record Linkage and Clustering

Deduplication employs a multi-stage workflow combining record linkage, supervised learning, and graph-based clustering:

Blocking (candidate selection): Restricts pairwise comparisons to ads located within 400 m of each other whose asking prices differ by no more than 25%, drastically reducing computational complexity.
Feature engineering: Constructs similarity metrics, including absolute and relative differences in price ( $\Delta P$ , $\Delta P\%$ ), area, floor, encoded categorical variables (e.g., maintenance status, energy class), boolean matches (elevator, terrace, etc.), geodesic distance, and textual similarity (Levenshtein distance for same-poster ads, doc2vec cosine distance for cross-poster ads).
Supervised classification: Utilizes C5.0 decision trees trained on hand-labeled ad pairs (with distinct models for same- and different-agency pairs), optimized to minimize misclassification. Out-of-sample $F_1$ scores range from $0.93$ to $0.95$ .
Graph-based clustering: Ads are nodes, and an edge links two ads if their duplicate probability $p_{ij}>0.5$ . Clusters (candidate dwellings) are validated if they contain at least $5/6$ of all possible internal edges, pruning as necessary.
Time-machine deduplication: Processes weekly snapshots in temporal order, preserving historical integrity by limiting cluster mergers and incrementally updating the deduplicated inventory.
Aggregation: For each dwelling cluster, attributes are set via modal or mean values; entry/exit dates correspond to the earliest ad and latest removal, respectively (Loberto et al., 2020).
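The blocking and cluster-validation steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the ad dictionary layout, and the choice to measure the price gap relative to the lower of the two prices are all assumptions, and the pairwise loop is the naive quadratic version rather than a spatially indexed one.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6_371_000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def candidate_pairs(ads, max_dist_m=400.0, max_price_gap=0.25):
    """Return the id pairs that pass both the distance and the price filter."""
    pairs = []
    for a, b in combinations(ads, 2):
        dist = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
        # Assumption: the gap is taken relative to the lower of the two prices.
        gap = abs(a["price"] - b["price"]) / min(a["price"], b["price"])
        if dist <= max_dist_m and gap <= max_price_gap:
            pairs.append((a["id"], b["id"]))
    return pairs

def is_valid_cluster(nodes, edges, min_density=5 / 6):
    """True if the cluster contains at least min_density of all possible edges."""
    n = len(nodes)
    if n < 2:
        return True
    possible = n * (n - 1) // 2
    present = sum(
        1 for u, v in combinations(sorted(nodes), 2)
        if (u, v) in edges or (v, u) in edges
    )
    return present / possible >= min_density

# Hypothetical ads: "a" and "b" are close in space and price, "c" is far away,
# "d" is nearby but much more expensive.
ads = [
    {"id": "a", "lat": 41.9028, "lon": 12.4964, "price": 300_000},
    {"id": "b", "lat": 41.9030, "lon": 12.4970, "price": 320_000},
    {"id": "c", "lat": 41.9500, "lon": 12.5500, "price": 310_000},
    {"id": "d", "lat": 41.9029, "lon": 12.4965, "price": 500_000},
]
pairs = candidate_pairs(ads)

# A four-ad cluster with 5 of its 6 possible edges meets the 5/6 threshold;
# the same cluster with only 4 edges does not.
dense = is_valid_cluster({1, 2, 3, 4}, {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4)})
sparse = is_valid_cluster({1, 2, 3, 4}, {(1, 2), (1, 3), (1, 4), (2, 3)})
```

Only the pairs returned by `candidate_pairs` would then be passed to feature engineering and classification, which is where the reduction in computational cost comes from.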

3. Construction of Market Indicators: Definitions and Formulas

The deduplicated Mosaiks data enables construction of core housing market indicators:

Supply: Short-run supply in market $j$ at time $t$ is the active dwelling count, $S_{jt} = \#\{ \text{dwellings listed in } j \text{ at } t \}$ .
Demand: Micro-demand is proxied by page-view counts (Clicks), aggregated per dwelling per week. Relative demand at dwelling level ( $\mathrm{ONLINT}_i$ ) is normalized by market averages over each dwelling's time-on-market. Aggregate demand in market $j$ at time $t$ is obtained by aggregating these click counts over the dwellings listed in $j$ at $t$ .

Liquidity: Measured by the delisting rate (sold or withdrawn), $L_{jt} = X_{jt} / S_{jt}$ , where $X_{jt}$ is the number of delistings in market $j$ during period $t$ .
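Price indexes: Hedonic regressions in log-price form,

$$\ln P_{it} = \alpha + \beta^{\top} x_i + \sum_{\tau} \delta_{\tau} \, \mathbf{1}\{t = \tau\} + \varepsilon_{it}$$

where $x_i$ collects the characteristics of dwelling $i$ , with time-dummy coefficients $\delta_t$ defining the ask-price index, $I^{\mathrm{ask}}_t = \exp(\delta_t)$ . If available, transaction prices yield an analogous index $I^{\mathrm{sale}}_t$ , with the fractional discount computed as $d_t = (P^{\mathrm{ask}}_t - P^{\mathrm{sale}}_t) / P^{\mathrm{ask}}_t$ .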

Official OMI-based measures: Use min/max band midpoints $(P^{\min}_z + P^{\max}_z)/2$ for each microzone $z$ , with city/national aggregates formed by weighted averages.
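A time-dummy hedonic index of the kind described above can be sketched in a few lines. The example below is illustrative only: it assumes a single hedonic covariate (log floor area), normalizes the index to 1 in the base period, and uses a small hand-written least-squares solver so that it runs without external libraries. The listings are invented so that prices are exactly 5% and 10% above the base period.

```python
import math

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical deduplicated listings: (period, area_m2, asking_price).
listings = [
    (0, 50, 100_000), (0, 80, 160_000),
    (1, 60, 126_000), (1, 100, 210_000),
    (2, 70, 154_000), (2, 90, 198_000),
]

periods = sorted({t for t, _, _ in listings})
# Regressors: constant, log area, and one dummy per non-base period.
X = [
    [1.0, math.log(area)] + [1.0 if t == p else 0.0 for p in periods[1:]]
    for t, area, _ in listings
]
y = [math.log(price) for _, _, price in listings]

coefs = ols(X, y)
deltas = [0.0] + coefs[2:]                  # base-period dummy fixed at zero
ask_index = [math.exp(d) for d in deltas]   # index relative to the base period
print([round(v, 3) for v in ask_index])
```

Because the toy prices are an exact log-linear function of area and period, the regression recovers index values of 1.0, 1.05, and 1.10; with real data the dummies would absorb only the price variation not explained by dwelling characteristics.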

4. Comparative Analysis: Asking versus Transaction Prices

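The relationship between asking and transaction prices is characterized via the dynamic “discount” factor $d_t$ , the fractional gap between asking and transaction prices. Transaction-price indexes lag asking-price indexes during downturns because discounts increase ( $\Delta d_t > 0$ ); they are also published with greater latency and lower frequency, and carry higher sampling noise. In empirical analyses (Rome/Milan 2015–18), the ask-price index anticipated cyclical turning points missed by sale statistics. For nowcasting, dynamic models include the estimated discount:

$$\ln I^{\mathrm{sale}}_t = \alpha + \beta \ln I^{\mathrm{ask}}_t + \gamma \, d_t + \varepsilon_t$$

where $I^{\mathrm{sale}}_t$ and $I^{\mathrm{ask}}_t$ denote the transaction-price and ask-price indexes.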

A practical implication is that unless a sale-price nowcast is essential, the ask-price index alone provides a smoother and more timely signal (Loberto et al., 2020).
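As a worked illustration with invented numbers, assume the fractional discount is defined as $d_t = (P^{\mathrm{ask}}_t - P^{\mathrm{sale}}_t)/P^{\mathrm{ask}}_t$ (an assumed definition). Suppose the average asking price stays at 300,000 while the discount rises from 8% to 10%:

$$P^{\mathrm{sale}}_1 = 300{,}000 \times (1 - 0.08) = 276{,}000, \qquad P^{\mathrm{sale}}_2 = 300{,}000 \times (1 - 0.10) = 270{,}000$$

$$\frac{P^{\mathrm{sale}}_2 - P^{\mathrm{sale}}_1}{P^{\mathrm{sale}}_1} = \frac{-6{,}000}{276{,}000} \approx -2.2\%$$

Under these assumptions the ask-price series is flat while transaction prices fall by about 2.2%, so the gap between the two series is driven entirely by the change in the discount.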

5. Timeliness and Spatial Granularity: Microzone Analysis

With weekly snapshot frequencies and geospatial coverage of 109 Italian NUTS-3 capitals subdivided into approximately 27,000 OMI “microzones” (homogeneous neighborhoods), the Mosaiks data achieves spatial and temporal granularity that surpasses official sources (commonly quarterly and at city or NUTS-2 level). This enables detection of short-run shocks and submarket shifts by computing weekly microzone-level supply, demand, and liquidity. The data structure is well-suited for high-resolution cycle tracking and targeted policy analysis (Loberto et al., 2020).
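Weekly microzone-level supply and liquidity can be computed directly from the entry and exit weeks of deduplicated dwellings. The sketch below uses invented data and two labeled assumptions: a dwelling counts as delisted in the last week it is observed, and the delisting rate is taken relative to the same week's supply.

```python
from collections import defaultdict

# Hypothetical deduplicated dwellings: (microzone, first_week, last_week).
# last_week is the last week the dwelling is observed online; None means it
# is still listed at the end of the sample.
dwellings = [
    ("Z1", 1, 2),
    ("Z1", 1, None),
    ("Z1", 2, 3),
    ("Z2", 1, 1),
    ("Z2", 2, None),
]

def weekly_indicators(dwellings, weeks):
    """Return {(zone, week): (supply, delisting_rate)}."""
    supply = defaultdict(int)
    delistings = defaultdict(int)
    for zone, first, last in dwellings:
        for week in weeks:
            # A dwelling is active from its first to its last observed week.
            if first <= week and (last is None or week <= last):
                supply[(zone, week)] += 1
        if last is not None:
            # Assumption: the delisting is attributed to the last observed week.
            delistings[(zone, last)] += 1
    # Assumption: the rate is delistings divided by same-week supply.
    return {key: (s, delistings[key] / s) for key, s in supply.items()}

indicators = weekly_indicators(dwellings, range(1, 4))
for key in sorted(indicators):
    print(key, indicators[key])
```

The same grouping key could be extended with click counts to obtain a weekly microzone-level demand series.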

6. Stepwise Preprocessing and Indicator Workflow

The recommended workflow for exploiting Mosaiks housing data involves:

Raw Data Ingestion: Load time-stamped snapshots (typically weekly) with full metadata—ad and broker IDs, location, attributes, text, price, and click counts.
Preliminary Cleaning: Restrict to target geographies, drop listings with a lifetime of less than 2 weeks, code ordered categorical fields as integers, and impute missing attributes via doc2vec and multiple imputation.
Blocking and Candidate Pair Generation: Restrict duplicate checks to spatial–price-filtered pairs.
Feature Calculation: Compute attribute gaps, encodings, and similarities for each candidate pair.
Model Training: Label and fit C5.0 duplicate classifiers, monitoring $F_1$ for both broker scenarios.
Prediction and Clustering: Classify candidate pairs, build and prune clusters via connectivity thresholds and time-machine rules.
Dwelling Aggregation: Assign consensus attribute values, entry, and exit dates for each “true dwelling.”
Outlier Screening: Exclude anomalous dwellings based on hedonic regression residuals (those falling below a lower or above an upper threshold).
Indicator Calculation: Compile supply, demand, liquidity, ask-price index, and (optionally) discount and turnover-based indexes.
Bias Adjustment: For accurate sale-price nowcasting, regress transaction indexes on asks and discounts; otherwise, use ask-based indicators for market assessment.

Key principles include mandatory deduplication, reliance on real-time click data for demand, and the use of live listings for supply definitions. The data infrastructure enables the extraction of micro-level, bias-corrected housing market indicators (Loberto et al., 2020).
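The dwelling aggregation step in the workflow above can be sketched as a simple group-by. The field names and the ads are hypothetical, and the rule of using the mean for prices and the mode for discrete attributes is an assumption; the source states only that modal or mean values are used.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, mode

# Hypothetical ads already assigned to dwelling clusters by the deduplication.
ads = [
    {"cluster": 1, "price": 300_000, "floor": 2,
     "first_seen": date(2017, 1, 2), "removed": date(2017, 3, 6)},
    {"cluster": 1, "price": 310_000, "floor": 2,
     "first_seen": date(2017, 1, 9), "removed": date(2017, 2, 27)},
    {"cluster": 1, "price": 305_000, "floor": 3,
     "first_seen": date(2017, 1, 16), "removed": date(2017, 3, 13)},
    {"cluster": 2, "price": 180_000, "floor": 1,
     "first_seen": date(2017, 2, 6), "removed": date(2017, 2, 20)},
]

def aggregate(cluster_ads):
    """Collapse the ads of one cluster into a single dwelling record."""
    return {
        "price": mean(ad["price"] for ad in cluster_ads),  # assumed: mean
        "floor": mode(ad["floor"] for ad in cluster_ads),  # assumed: mode
        "entry": min(ad["first_seen"] for ad in cluster_ads),  # earliest ad
        "exit": max(ad["removed"] for ad in cluster_ads),      # latest removal
    }

by_cluster = defaultdict(list)
for ad in ads:
    by_cluster[ad["cluster"]].append(ad)

dwellings = {cid: aggregate(group) for cid, group in by_cluster.items()}
print(dwellings[1])
```

The resulting entry and exit dates are what define each dwelling's time on market and feed the supply and liquidity counts.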

7. Significance, Limitations, and Use Cases

The systematic cleaning and indicator construction with Mosaiks housing price data harness the benefits of high-frequency, fine-scale online listing analytics, overcoming traditional data source limitations. Timely ask-price indexes support real-time cycle detection and submarket monitoring. A limitation is the need for careful adjustment for duplicate and misclassified listings, as failure to do so introduces significant biases. Additionally, while official transaction prices retain value for benchmarking, their lag and lower frequency reduce their utility for short-run analysis. The methodology is extensible to diverse geographies, provided analogous advertising and attribute structures exist (Loberto et al., 2020).


References

Loberto, M., Luciani, A., and Pangallo, M. (2020). What do online listings tell us about the housing market?
