
Synthesizing Products for Online Catalogs (1105.4251v1)

Published 21 May 2011 in cs.DB

Abstract: A high-quality, comprehensive product catalog is essential to the success of Product Search engines and shopping sites such as Yahoo! Shopping, Google Product Search or Bing Shopping. But keeping catalogs up-to-date is a challenging task that calls for automated techniques. In this paper, we introduce the problem of product synthesis, a key component of catalog creation and maintenance. Given a set of offers advertised by merchants, the goal is to identify new products and add them to the catalog together with their (structured) attributes. A fundamental challenge is the scale of the problem: a Product Search engine receives data from thousands of merchants and millions of products; the product taxonomy contains thousands of categories, where each category comes in a different schema; and merchants use representations for products that are different from the ones used in the catalog of the Product Search engine. We propose a system that provides an end-to-end solution to the product synthesis problem and addresses the issues involved in data extraction from offers, schema reconciliation, and data fusion. We developed a novel and scalable technique for schema matching which leverages knowledge about previously known instance-level associations between offers and products; it is trained using automatically created training sets (no manually labeled data is needed). We present an experimental evaluation of our system using data from Bing Shopping for more than 800K offers, a thousand merchants, and 400 categories. The evaluation confirms that our approach is able to automatically generate a large number of accurate product specifications, and that our schema reconciliation component outperforms state-of-the-art schema matching techniques in terms of precision and recall.
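
As a rough illustration of the idea behind instance-based schema matching, the sketch below uses offers that are already known to correspond to catalog products and counts how often a merchant attribute carries the same value as a catalog attribute. This is not the paper's technique, which is a trained matcher; the function names `normalize` and `match_schema`, the `min_score` threshold, and the sample attribute names are all hypothetical and chosen only for the example.

```python
from collections import defaultdict


def normalize(value):
    """Lowercase and collapse whitespace so trivially different values compare equal."""
    return " ".join(str(value).lower().split())


def match_schema(matched_pairs, min_score=0.5):
    """Map merchant attributes to catalog attributes using known offer-product pairs.

    matched_pairs: iterable of (offer, product) dicts already known to describe
    the same item. The score of a candidate mapping is the fraction of offers
    carrying the merchant attribute whose value equals the catalog value.
    (Illustrative heuristic only, not the learned matcher from the paper.)
    """
    agree = defaultdict(int)  # (offer_attr, product_attr) -> agreeing pairs
    seen = defaultdict(int)   # offer_attr -> number of offers that carry it
    for offer, product in matched_pairs:
        for o_attr, o_val in offer.items():
            seen[o_attr] += 1
            for p_attr, p_val in product.items():
                if normalize(o_val) == normalize(p_val):
                    agree[(o_attr, p_attr)] += 1

    mapping = {}
    for (o_attr, p_attr), count in agree.items():
        score = count / seen[o_attr]
        best_so_far = mapping.get(o_attr, (None, 0.0))[1]
        if score >= min_score and score > best_so_far:
            mapping[o_attr] = (p_attr, score)
    return mapping


# Hypothetical sample data: merchant offers paired with catalog products.
pairs = [
    ({"mfr": "Canon", "mp": "18"}, {"brand": "canon", "megapixels": "18"}),
    ({"mfr": "Nikon", "mp": "16"}, {"brand": "Nikon", "megapixels": "16"}),
    ({"mfr": "Sony", "mp": "20"}, {"brand": "Sony", "megapixels": "24"}),
]
mapping = match_schema(pairs)
print(mapping)
```

In this toy data, `mfr` agrees with `brand` in every pair and `mp` agrees with `megapixels` in two of three, so both mappings clear the assumed threshold. A real system would need more robust value comparison (units, abbreviations, numeric tolerance) than exact matching after normalization.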

Authors (5)
  1. Hoa Nguyen (11 papers)
  2. Ariel Fuxman (10 papers)
  3. Stelios Paparizos (2 papers)
  4. Juliana Freire (46 papers)
  5. Rakesh Agrawal (7 papers)
Citations (31)
