The Complexity of Causality and Responsibility for Query Answers and non-Answers (1009.2021v2)

Published 10 Sep 2010 in cs.DB and cs.AI

Abstract: An answer to a query has a well-defined lineage expression (alternatively called how-provenance) that explains how the answer was derived. Recent work has also shown how to compute the lineage of a non-answer to a query. However, the cause of an answer or non-answer is a more subtle notion and consists, in general, of only a fragment of the lineage. In this paper, we adapt Halpern, Pearl, and Chockler's recent definitions of causality and responsibility to define the causes of answers and non-answers to queries, and their degree of responsibility. Responsibility captures the notion of degree of causality and serves to rank potentially many causes by their relative contributions to the effect. Then, we study the complexity of computing causes and responsibilities for conjunctive queries. It is known that computing causes is NP-complete in general. Our first main result shows that all causes to conjunctive queries can be computed by a relational query which may involve negation. Thus, causality can be computed in PTIME, and very efficiently so. Next, we study computing responsibility. Here, we prove that the complexity depends on the conjunctive query and demonstrate a dichotomy between PTIME and NP-complete cases. For the PTIME cases, we give a non-trivial algorithm, consisting of a reduction to the max-flow computation problem. Finally, we prove that, even when it is in PTIME, responsibility is complete for LOGSPACE, implying that, unlike causality, it cannot be computed by a relational query.

Summary

The paper introduces a framework that adapts Halpern-Pearl’s causality to determine responsible tuples for both query answers and non-answers.
It shows that causes for conjunctive queries are computable in PTIME, while computing responsibility exhibits a dichotomy between PTIME and NP-complete cases depending on the structure of the query.
It partitions tuples into endogenous and exogenous sets, which focuses causal analysis on the tuples of interest and supports error diagnosis and data cleaning in databases.

An Analysis of Causality and Responsibility in Database Query Results

The paper "The Complexity of Causality and Responsibility for Query Answers and Non-Answers" by Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, and Dan Suciu studies causality and responsibility in the context of database queries. It covers both the causes of answers (Why-So) and the causes of non-answers (Why-No) for queries posed against relational databases.

Theoretical Framework and Concepts

The core concept introduced in the paper is an adaptation of Halpern and Pearl's definition of actual causality to the tuples of a database. Causality is characterized by a counterfactual test: a tuple is a cause of an answer if removing it, possibly together with a set of other tuples called a contingency, changes the query result. For a non-answer, the test instead asks whether inserting a missing tuple would produce the answer. A cause is related to lineage, the expression that records how an answer was derived from the data, but in general it consists of only a fragment of that lineage. Responsibility refines the notion further by ranking causes according to the minimal size of their contingency sets: the smaller the contingency a tuple needs, the higher its responsibility.
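The definitions above can be written compactly for the Why-So case. The notation here is an assumption made for illustration: $D$ is the database, $q$ a Boolean query, $t$ a tuple, and $\Gamma$ a contingency drawn from the tuples considered as candidate causes. The responsibility formula follows the Chockler and Halpern definition that the paper adapts.

```latex
\begin{align*}
t \text{ is a counterfactual cause}
  &\iff D \models q \ \text{ and } \ D \setminus \{t\} \not\models q, \\
t \text{ is an actual cause}
  &\iff \exists\, \Gamma \subseteq D \setminus \{t\}:\;
        D \setminus \Gamma \models q \ \text{ and } \
        D \setminus (\Gamma \cup \{t\}) \not\models q, \\
\rho_t
  &= \frac{1}{1 + \min_{\Gamma} |\Gamma|}
  \qquad (\rho_t = 0 \text{ if } t \text{ is not a cause}).
\end{align*}
```

A counterfactual cause is the special case $\Gamma = \emptyset$, which gives the maximal responsibility $\rho_t = 1$.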

The paper advocates the division of tuples into endogenous and exogenous categories. This separation is pivotal for computing explanations for query results as it isolates tuples of interest (endogenous) for causal inspection.
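To make the effect of the endogenous/exogenous partition concrete, here is a minimal brute-force sketch in Python. The toy relations, the query `q :- R(x, y), S(y, z)`, and all function names are hypothetical choices for illustration, not taken from the paper. The search is exponential and only meant to mirror the definition on tiny inputs.

```python
from itertools import combinations

# Hypothetical toy database: each tuple is (relation name, value, value).
R = [("R", "a1", "b1"), ("R", "a2", "b1"), ("R", "a3", "b2")]
S = [("S", "b1", "c1"), ("S", "b2", "c1")]

def query_holds(db):
    """Boolean conjunctive query q :- R(x, y), S(y, z) over a set of tuples."""
    return any(r[0] == "R" and s[0] == "S" and r[2] == s[1]
               for r in db for s in db)

def responsibility(t, endogenous, exogenous):
    """Return 1 / (1 + size of the smallest contingency), or 0.0 if t is not a cause.

    Contingencies may only contain endogenous tuples; exogenous tuples
    are never removed.
    """
    everything = set(endogenous) | set(exogenous)
    others = [u for u in endogenous if u != t]
    for k in range(len(others) + 1):          # try smallest contingencies first
        for gamma in combinations(others, k):
            rest = everything - set(gamma)
            # t must be counterfactual once the contingency is removed
            if query_holds(rest) and not query_holds(rest - {t}):
                return 1.0 / (1 + k)
    return 0.0

# Same tuple, different partitions, different responsibility.
all_endogenous = responsibility(R[2], R + S, [])   # S tuples may be removed
s_exogenous = responsibility(R[2], R, S)           # S tuples are fixed context
```

With every tuple endogenous, `R(a3, b2)` needs only `S(b1, c1)` removed to become counterfactual. Once the `S` tuples are declared exogenous, both other `R` tuples must be removed instead, so its responsibility drops.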

Complexity Analysis

One of the primary contributions of the paper is its analysis of the complexity of causality for conjunctive queries. Although computing causes is NP-complete in general, the paper shows that for conjunctive queries all causes can be computed in PTIME. More specifically, it proves that the causes can be computed by a relational query that may involve negation. The causes for both Why-So and Why-No can be expressed in stratified Datalog, so a standard database system can carry out the computation efficiently.
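The PTIME result can be illustrated with a small Python sketch that works on lineage instead of Datalog. It assumes the following characterization: a tuple is a cause exactly when it appears in a non-redundant conjunct of the lineage after exogenous tuples are treated as always present. The function name and the string tuple identifiers are hypothetical.

```python
def causes_from_lineage(witnesses, exogenous):
    """Compute the causes of an answer from its lineage in DNF.

    witnesses: iterable of sets of tuple ids, one set per derivation.
    exogenous: set of tuple ids that are not candidate causes.
    """
    exo = frozenset(exogenous)
    # Treat exogenous tuples as always true by dropping them from each conjunct.
    conjuncts = {frozenset(w) - exo for w in witnesses}
    # A conjunct is redundant if another conjunct is a strict subset of it.
    minimal = [c for c in conjuncts if not any(d < c for d in conjuncts)]
    # The causes are the tuples that occur in some non-redundant conjunct.
    return set().union(*minimal)

# r1 joins with both s1 and s2; s1 is exogenous.
causes = causes_from_lineage([{"r1", "s1"}, {"r1", "s2"}], {"s1"})
```

Here the conjunct `{r1, s2}` becomes redundant because `{r1}` alone already derives the answer, so `s2` is not a cause. The work is a pairwise subset check over conjuncts, which is polynomial in the size of the lineage.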

A second major result identifies the tractable and intractable cases of computing responsibility. For conjunctive queries without self-joins, the paper establishes a strict dichotomy: responsibility is computable in polynomial time (PTIME) if the query satisfies certain structural properties, and is NP-hard otherwise. This distinction clarifies when efficient algorithms suffice and when the problem becomes intractable. For the tractable cases, the paper gives a non-trivial algorithm based on a reduction to the max-flow problem. It also shows that, even when responsibility is in PTIME, it cannot be computed by a relational query, unlike causality.
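The following Python sketch shows a max-flow reduction for one specific query, `q :- R(x, y), S(y, z)`. It is an illustration under stated assumptions, not the paper's general construction. Each tuple becomes an edge with capacity 1 if endogenous and infinite if exogenous. For each witness through the target tuple, the target gets capacity 0, the rest of that witness gets infinite capacity, and the min cut (equal to the max flow) is the size of the smallest contingency that keeps that witness alive. The network layout, names, and toy data are hypothetical.

```python
from collections import defaultdict, deque

INF = float("inf")

def max_flow(cap, source, sink):
    """Edmonds-Karp on a dict-of-dicts capacity map; returns the flow value."""
    res = defaultdict(lambda: defaultdict(float))
    for u in cap:
        for v, c in cap[u].items():
            res[u][v] += c
    flow = 0.0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:       # BFS for an augmenting path
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        if push == INF:                           # an uncuttable path exists
            return INF
        for u, v in path:
            res[u][v] -= push
            res[v][u] += push
        flow += push

def edge(tup):
    """Map R(a, b) to x:a -> y:b and S(b, c) to y:b -> z:c."""
    rel, u, v = tup
    return (("x", u), ("y", v)) if rel == "R" else (("y", u), ("z", v))

def responsibility_flow(t, db, endogenous):
    """Responsibility of tuple t for q :- R(x, y), S(y, z), via min cut."""
    if t not in endogenous:
        return 0.0
    witnesses = [(r, s) for r in db for s in db
                 if r[0] == "R" and s[0] == "S" and r[2] == s[1] and t in (r, s)]
    best = INF
    for path in witnesses:
        cap = defaultdict(dict)
        for tup in db:
            u, v = edge(tup)
            if tup == t:
                cap[u][v] = 0                     # cutting t itself is free
            elif tup in path or tup not in endogenous:
                cap[u][v] = INF                   # must survive, or cannot be removed
            else:
                cap[u][v] = 1
            if tup[0] == "R":
                cap["src"][u] = INF
            else:
                cap[v]["snk"] = INF
        best = min(best, max_flow(cap, "src", "snk"))
    return 0.0 if best == INF else 1.0 / (1 + best)

DB = [("R", "a1", "b1"), ("R", "a2", "b1"), ("R", "a3", "b2"),
      ("S", "b1", "c1"), ("S", "b2", "c1")]
rho = responsibility_flow(DB[2], DB, set(DB))
```

The sketch runs one max-flow computation per witness through the target tuple, so the total cost stays polynomial in the database size, in contrast to enumerating all candidate contingencies.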

Practical and Theoretical Implications

The implications of this research are significant and multifaceted. Practically, the ability to rank causes by their degree of responsibility can aid in prioritizing data cleaning or revision tasks, and in diagnosing unexpected query results. The paper advances the theory of database provenance by augmenting existing lineage-based approaches with a robust causal framework. The introduction of endogenous vs. exogenous tuple partitions enhances the granularity of data analysis, offering new avenues for interpretation and error diagnosis in large data sets.

The research paves the way for future studies to explore causality further in more complex query types, potentially extending the results to cover a broader class of logic-based or procedural queries, including those with self-joins.

In conclusion, the paper offers a methodologically rigorous approach to understanding the underpinnings of query causality and responsibility in databases, providing clear delineation of computational feasibility and laying the groundwork for both theoretical advancements and pragmatic applications in database management systems.
