Leveraging Transitive Relations for Crowdsourced Joins (1408.6916v2)

Published 29 Aug 2014 in cs.DB

Abstract: The development of crowdsourced query processing systems has recently attracted significant attention in the database community. A variety of crowdsourced queries have been investigated. In this paper, we focus on the crowdsourced join query which aims to utilize humans to find all pairs of matching objects from two collections. As a human-only solution is expensive, we adopt a hybrid human-machine approach which first uses machines to generate a candidate set of matching pairs, and then asks humans to label the pairs in the candidate set as either matching or non-matching. Given the candidate pairs, existing approaches will publish all pairs for verification to a crowdsourcing platform. However, they neglect the fact that the pairs satisfy transitive relations. As an example, if $o_1$ matches with $o_2$, and $o_2$ matches with $o_3$, then we can deduce that $o_1$ matches with $o_3$ without needing to crowdsource $(o_1, o_3)$. To this end, we study how to leverage transitive relations for crowdsourced joins. We propose a hybrid transitive-relations and crowdsourcing labeling framework which aims to crowdsource the minimum number of pairs to label all the candidate pairs. We prove the optimal labeling order in an ideal setting and propose a heuristic labeling order in practice. We devise a parallel labeling algorithm to efficiently crowdsource the pairs following the order. We evaluate our approaches in both a simulated environment and a real crowdsourcing platform. Experimental results show that our approaches with transitive relations can save much more money and time than existing methods, with a little loss in the result quality.

Authors (5)

Jiannan Wang
Guoliang Li
Tim Kraska
Michael J. Franklin
Jianhua Feng

Citations (225)


Summary

Leveraging Transitive Relations for Crowdsourced Joins

The paper "Leveraging Transitive Relations for Crowdsourced Joins" addresses a significant challenge in crowdsourced query processing: crowdsourced join queries for entity resolution. The proposed approach combines human and machine contributions so that fewer pairs have to be labeled by people, reducing the cost and effort of human labeling.

The core challenge tackled by the authors is using the crowd to identify pairs of matching objects from two collections, typically a labor-intensive process. Because a human-only solution is too expensive, the paper adopts a hybrid approach: machines first generate a candidate set of matching pairs, and crowd workers then label each candidate pair as matching or non-matching. Existing methods publish every candidate pair for verification and neglect the transitive relations that hold among the pairs, which leads to unnecessary verifications.

Specifically, the paper introduces a framework that utilizes these transitive relations to deduce certain pairs without human labeling. For instance, if object $o_1$ matches $o_2$, and $o_2$ matches $o_3$, the transitive relationship implies a match between $o_1$ and $o_3$. This deduction circumvents the need to explicitly verify the pair $(o_1, o_3)$, thus potentially saving significant resources.
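The deduction step can be illustrated with a small union-find structure. This is a minimal sketch, not the paper's implementation: the class name `Deducer` and its methods are hypothetical, and the handling of non-matching labels (if $o_1$ matches $o_2$ and $o_2$ does not match $o_3$, then $o_1$ does not match $o_3$) is an assumption added here because it follows from matching being an equivalence relation.

```python
class Deducer:
    """Hypothetical helper: stores crowd labels and deduces others by transitivity."""

    def __init__(self):
        self.parent = {}      # union-find forest over objects known to match
        self.differs = set()  # frozenset({root_a, root_b}): clusters known not to match

    def find(self, x):
        # Return the representative of x's cluster (with path halving).
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def add_label(self, a, b, is_match):
        ra, rb = self.find(a), self.find(b)
        if is_match:
            if ra == rb:
                return
            self.parent[ra] = rb
            # Re-point recorded non-match edges from the old root to the new one.
            self.differs = {
                frozenset(rb if r == ra else r for r in edge)
                for edge in self.differs
            }
        else:
            self.differs.add(frozenset((ra, rb)))

    def deduce(self, a, b):
        """Return True or False if the label follows from known labels, else None."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        if frozenset((ra, rb)) in self.differs:
            return False
        return None


deducer = Deducer()
deducer.add_label("o1", "o2", True)
deducer.add_label("o2", "o3", True)
print(deducer.deduce("o1", "o3"))  # True: no need to crowdsource (o1, o3)
```

A `None` result means the pair cannot be deduced from the labels collected so far and would still have to be sent to the crowd.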

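The paper further refines this concept by considering the order in which pairs are labeled, since the order determines how many labels can be deduced rather than crowdsourced. It proves the optimal labeling order in an ideal setting, in which all matching pairs are labeled before the non-matching ones. Because the true labels are not known in advance, it proposes a heuristic order for practice that ranks pairs by their machine-estimated likelihood of being a match, most likely matches first, with the aim of minimizing the number of crowdsourced validations.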
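The effect of the labeling order can be seen on a three-object example. The sketch below is illustrative only: the function names, the likelihood scores, and the simulated crowd (a `truth` dictionary mapping each object to an entity id, assumed to answer correctly) are all assumptions made for the example.

```python
def count_crowdsourced(pairs, truth):
    """Count how many pairs must be asked when labeling `pairs` in the given order.

    truth maps each object to an entity id and stands in for an error-free crowd.
    """
    parent, differs, asked = {}, set(), 0

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra == rb or frozenset((ra, rb)) in differs:
            continue  # label is deduced by transitivity, nothing to ask
        asked += 1
        if truth[a] == truth[b]:
            parent[ra] = rb
            differs = {frozenset(rb if r == ra else r for r in e) for e in differs}
        else:
            differs.add(frozenset((ra, rb)))
    return asked


def order_by_likelihood(scored_pairs):
    """Heuristic order: most likely matches first (scores are made-up values)."""
    return [pair for pair, _ in sorted(scored_pairs, key=lambda item: -item[1])]


truth = {"o1": "A", "o2": "A", "o3": "B"}  # o1 and o2 match; o3 is a different entity
scored = [(("o1", "o3"), 0.3), (("o2", "o3"), 0.4), (("o1", "o2"), 0.9)]

non_matches_first = [pair for pair, _ in scored]
matches_first = order_by_likelihood(scored)

print(count_crowdsourced(non_matches_first, truth))  # 3
print(count_crowdsourced(matches_first, truth))      # 2
```

Labeling the matching pair first lets the second non-matching pair be deduced, whereas two non-matching labels alone say nothing about the remaining pair.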

In addition to the ordering technique, a parallel labeling algorithm is proposed, which crowdsources multiple pairs at once while still following the labeling order. Compared with labeling pairs one at a time and waiting for each answer, this reduces overall completion time and improves the scalability and responsiveness of crowdsourced systems. The approach was evaluated in both a simulated environment and on a real crowdsourcing platform, where it saved more money and time than existing methods.
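One way to picture parallel labeling is as a sequence of rounds. The sketch below is not claimed to reproduce the paper's algorithm; it rests on a stated assumption: a pending pair is published in the current round if it could not be deduced even when every earlier pending pair is optimistically assumed to match. The names `Labels`, `parallel_label`, and `crowd` are hypothetical, and the crowd is simulated without errors.

```python
import copy


class Labels:
    """Known labels with transitive deduction (hypothetical helper)."""

    def __init__(self):
        self.parent, self.differs = {}, set()

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def add(self, a, b, is_match):
        ra, rb = self.find(a), self.find(b)
        if not is_match:
            self.differs.add(frozenset((ra, rb)))
        elif ra != rb:
            self.parent[ra] = rb
            self.differs = {
                frozenset(rb if r == ra else r for r in e) for e in self.differs
            }

    def deduce(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        return False if frozenset((ra, rb)) in self.differs else None


def parallel_label(pairs, crowd):
    """Label `pairs` (given in labeling order) in rounds; crowd(batch) returns booleans."""
    known, result, batches = Labels(), {}, []
    while len(result) < len(pairs):
        # Step 1: resolve whatever follows from the real labels collected so far.
        for p in pairs:
            if p not in result and known.deduce(*p) is not None:
                result[p] = known.deduce(*p)
        pending = [p for p in pairs if p not in result]
        if not pending:
            break
        # Step 2: optimistic scan, assuming each published pair will turn out to match.
        optimistic, batch = copy.deepcopy(known), []
        for p in pending:
            if optimistic.deduce(*p) is None:
                batch.append(p)
                optimistic.add(*p, True)
        # Step 3: publish the whole batch at once and record the answers.
        for p, label in zip(batch, crowd(batch)):
            known.add(*p, label)
            result[p] = label
        batches.append(batch)
    return result, batches


truth = {"o1": "A", "o2": "A", "o3": "A", "o4": "B", "o5": "B"}
pairs = [("o1", "o2"), ("o2", "o3"), ("o1", "o3"),
         ("o4", "o5"), ("o3", "o4"), ("o1", "o5")]
result, batches = parallel_label(
    pairs, lambda batch: [truth[a] == truth[b] for a, b in batch]
)
print(len(batches), batches[0])
```

In this example four pairs go out together in a single round and the remaining two are deduced, whereas asking the same four pairs one after another would take four round trips to the crowd.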

The experimental results reported in the paper support the claim that transitive relations can significantly reduce the cost of crowdsourced tasks. The paper acknowledges a small loss in result quality caused by errors in deduced labels, but the tradeoff is deemed manageable on practical datasets, where the savings in money and time outweigh the loss.

One of the paper's notable outcomes is the substantial reduction, across diverse experimental settings, in the number of pairs needing human verification. The savings are especially pronounced in datasets with larger clusters of matching objects, where transitive relations allow the most labels to be deduced.
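A short calculation suggests why larger clusters help. It is a best-case sketch under assumptions not taken from the paper's experiments: a cluster of $k$ mutually matching objects, every pair within the cluster present in the candidate set, error-free crowd answers, and crowdsourced pairs that connect the cluster (a spanning tree).

$$
\text{pairs in cluster} = \frac{k(k-1)}{2}, \qquad
\text{crowdsourced} = k - 1, \qquad
\text{deduced} = \frac{k(k-1)}{2} - (k-1) = \frac{(k-1)(k-2)}{2}
$$

The deduced fraction is therefore $(k-2)/k$. For $k = 10$, the cluster has 45 pairs, of which 9 are crowdsourced and 36 (80%) are deduced; for $k = 3$, only 1 of 3 pairs (about 33%) is deduced.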

From a theoretical perspective, the paper identifies gaps in existing claims regarding the optimal ordering problem, formally acknowledging its NP-hard complexity. This insight offers a realistic limitation to optimal solutions but concurrently opens up avenues for heuristic and probabilistic methods to find practical, effective solutions.

In summary, this research contributes to both the theory and the practical design of crowdsourced database systems, specifically to using human intelligence effectively alongside machine capabilities. Future work might explore more dynamic and adaptive systems that model human response patterns to improve efficiency further. Extending the approach to non-equality joins or to other kinds of relations could also open up new applications for the framework.

Related Papers

CrowdER: Crowdsourcing Entity Resolution (2012)
Bayesian Crowdsourcing with Constraints (2020)
Candidate Labeling for Crowd Learning (2018)
CDAS: A Crowdsourcing Data Analytics System (2012)
Getting It All from the Crowd (2012)
