- The paper introduces a novel causal dropout model (CDM) and test-wise deletion method to recover accurate conditional independence relations in gene regulatory network inference.
- The method, evaluated on simulated and real scRNA-seq datasets, consistently outperforms imputation and full-sample approaches in terms of graph accuracy and computational efficiency.
- The framework not only corrects dropout biases but also allows its key structural assumptions to be validated, for example revealing whether dropouts in gene expression data are indeed self-masking, supporting reliable causal discovery.
This paper addresses the challenge of Gene Regulatory Network Inference (GRNI) from single-cell RNA sequencing (scRNA-seq) data, which is often plagued by "dropouts" – an abundance of zero values. These zeros can be either biological (true absence of expression) or technical (due to limitations in the sequencing process). Technical dropouts can distort the data distribution and bias GRNI.
The authors propose a novel framework centered around a Causal Dropout Model (CDM) to characterize the dropout mechanism.
The CDM defines four sets of variables for each gene i:
- Zi: The true underlying gene expression. The network among Z={Zi} is the GRN to be recovered.
- Di: A binary variable indicating a technical dropout (Di=1 for dropout, 0 otherwise).
- Xi: The observed expression, generated by Xi=(1−Di)Zi.
- Ri: A binary zero indicator, Ri = Di OR 1(Zi=0), so Ri=1 if and only if Xi=0. Ri is fully observable from Xi.
A key aspect of the CDM is the edge Zi→Di, representing a "self-masking" dropout mechanism where a gene's true expression level influences its probability of dropout. The paper shows that many existing parametric models for dropouts (e.g., zero-inflated models, truncation models, probabilistic dropout models) can be seen as specific instances of this non-parametric CDM.
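To make the generative structure concrete, here is a toy simulation sketch of the CDM for a single gene, with an illustrative logistic self-masking mechanism; the specific distributions and the dropout function are illustrative choices, not taken from the paper.

```python
# Toy CDM sketch (illustrative choices, not the paper's code):
# Z is the true expression, D a self-masking dropout indicator, X the observation.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
Z = rng.lognormal(mean=1.0, sigma=0.5, size=n)   # true expression Z_i (always > 0 here)
# Self-masking dropout Z_i -> D_i: lower expression -> higher dropout probability.
p_drop = 1.0 / (1.0 + np.exp(Z - 2.0))           # an arbitrary logistic choice
D = rng.random(n) < p_drop                        # D_i = 1 means technical dropout
X = np.where(D, 0.0, Z)                           # X_i = (1 - D_i) * Z_i
R = (X == 0).astype(int)                          # R_i = D_i OR 1(Z_i = 0), observable from X
```

Replacing the logistic rule with, e.g., D = (Z < c) recovers a truncation-style dropout, one of the parametric special cases mentioned above.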
Limitations of Existing Approaches:
The paper argues that imputation methods, which treat all zeros as missing values to be filled in, lack theoretical guarantees. The true underlying distribution p(Z) is generally unidentifiable from the observed p(X) due to the self-masking nature of dropouts and the fact that technical zeros cannot be distinguished from biological zeros (i.e., Di is latent). Examples are provided to show this unidentifiability even under specific parametric assumptions.
Proposed Solution: Test-wise Deletion for CI Testing
Despite the unidentifiability of p(Z), the authors demonstrate theoretically that Conditional Independence (CI) relations in the true data Z can be recovered. The core theoretical contribution is Theorem 1 ("Correct CI estimations"):
Under assumptions (A1) causal sufficiency, Markov and faithfulness over Z∪X∪R, acyclicity, and consistent CI testing; (A2) no edges Di→Zj, i.e., dropout does not affect the underlying gene expressions; (A3) each Di is directly affected only by Zi; and (A4) faithful observability (conditional dependencies are preserved in the non-zero samples), the following holds:
Zi ⊥ Zj | ZS ⇔ Xi ⊥ Xj | ZS, RS = 0
This means that to test for Zi⊥Zj∣ZS, one can test for Xi⊥Xj∣ZS using only the data samples where the conditioning variables Xk (for k∈S) are non-zero (i.e., RS=0). This procedure is termed "test-wise deletion."
Implementation and Application:
This test-wise deletion procedure can be seamlessly integrated into existing constraint-based causal discovery algorithms (like the PC algorithm) and greedy score-based methods (like Greedy Equivalence Search, GES).
Definition 1 (General procedure for causal discovery with dropout correction):
- Perform any consistent causal discovery algorithm.
- For CI tests Zi⊥Zj∣ZS, use the equivalence Xi⊥Xj∣ZS,RS=0.
- Infer the graph structure among Z.
Assumption (A5) ensures that after deletion, sufficient samples remain for CI tests.
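To illustrate how Definition 1 plugs into a constraint-based search, below is a minimal, simplified sketch (an illustration, not the authors' implementation) of a PC-style adjacency search that routes every CI query through a test-wise deleted CI test; a runnable version of such a test is sketched in the implementation section further down.

```python
# Simplified PC-style skeleton search; every CI query uses a deletion-corrected test.
from itertools import combinations

def pc_skeleton_with_deletion(X, ci_test, alpha=0.05, max_cond=2):
    """`ci_test(X, i, j, S, alpha)` returns True if X_i and X_j are judged
    independent given X_S on the test-wise deleted samples."""
    p = X.shape[1]
    adj = {frozenset((i, j)) for i in range(p) for j in range(i + 1, p)}
    for level in range(max_cond + 1):
        for pair in list(adj):
            i, j = tuple(pair)
            # Candidate conditioning sets: current neighbours of i, excluding j.
            nbrs = [k for k in range(p)
                    if k not in (i, j) and frozenset((i, k)) in adj]
            for S in combinations(nbrs, level):
                if ci_test(X, i, j, list(S), alpha):
                    adj.discard(pair)   # remove the edge i - j
                    break
    return adj
```

Because the dropout correction lives entirely inside the CI test, the same idea carries over to score-based methods such as GES/FGES, where, as described above, the correction is applied within the local computations.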
Validating the Causal Dropout Model:
The paper also provides a way to validate the structural assumptions of the CDM, particularly (A3), which states that a gene's dropout is only affected by its own expression. By relaxing (A3), the authors propose a method to discover the GRN structure and the causal relationships from Z to R (dropout mechanisms).
Definition 2 (Generalized GRN and dropout mechanisms discovery):
Perform the procedure in Definition 1, but infer Zi⊥Zj∣ZS if and only if Zi⊥Zj∣ZS,RS∪{i,j}=0. This involves deleting samples with zeros for all variables involved in the CI test (not just the conditioning set).
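The only mechanical difference from Definition 1 is which samples are deleted; a small sketch (an illustration, not the paper's code) of the two deletion masks:

```python
# Contrast of the deletion patterns in Definitions 1 and 2 (illustrative sketch).
import numpy as np

def kept_rows_definition1(X, S):
    # Definition 1: keep cells where every conditioning gene is non-zero (R_S = 0).
    return np.all(X[:, list(S)] != 0, axis=1)

def kept_rows_definition2(X, i, j, S):
    # Definition 2: additionally require the two tested genes to be non-zero
    # (R_{S ∪ {i, j}} = 0).
    return np.all(X[:, list(S) + [i, j]] != 0, axis=1)
```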
Theorem 2 ("Identification of GRN and dropout mechanisms") (under A1, A2, A4, A5):
- If Zi,Zj are non-adjacent in the output of Definition 2, they are indeed non-adjacent in the true GRN, and Zi does not cause Rj, and Zj does not cause Ri.
- If Zi, Zj are adjacent in the output, they are either adjacent in the true GRN or fall into one specific unidentifiable configuration; moreover, the cross-gene dropout effects Zi→Rj and Zj→Ri are inherently unidentifiable, since when Zi and Zj are adjacent a direct edge Zi→Rj cannot be distinguished from the dependence induced through Zj (e.g., via the path Zi←Zj→Rj).
This implies that the proposed framework can help assess if dropout causes are indeed primarily self-masking or if cross-gene effects on dropout exist.
Experimental Evaluation:
The proposed method ("test-wise deletion") was extensively evaluated:
- Linear SEM Simulated Data:
- Compared against MAGIC (imputation), mixedCCA (parametric model), using full samples, and an Oracle (true Z data). PC and GES algorithms were used.
- Settings: p∈{10,20,30,100} nodes, various graph densities. Data distributions: Gaussian, Lognormal. Dropout mechanisms: fixed rates, truncating low expressions, dropout probabilistically determined by expression.
- Results: Test-wise deletion consistently outperformed other methods in terms of Structural Hamming Distance (SHD) and was close to Oracle performance. Applying algorithms to full samples led to denser, less accurate graphs. MixedCCA performed well in its ideal setting (Gaussian, truncation) but was still outperformed.
- An experiment on 20,000 nodes using FGES also showed a superior F1-score and faster runtime for the test-wise deletion method compared to using full samples.
- Realistic BoolODE Synthetic and Curated Data (BEELINE framework):
- Used BoolODE simulator for 6 synthetic and 4 literature-curated datasets (5000 cells, 50% dropout).
- Algorithms: PC, GES, and 7 other GRNI-specific SOTA algorithms (e.g., SINCERITIES, GRNBOOST2).
- Strategies: oracle (true Z data), test-wise deletion, full samples, imputation, and binarization.
- Results (F1-score of skeleton edges):
- Dropouts significantly harm GRNI performance.
- Existing strategies like imputation and binarization often performed worse than using full samples, sometimes being counterproductive.
- Test-wise deletion consistently improved performance over full samples and imputation across most dataset-algorithm pairs.
- Real-World Experimental Data:
- Perturb-seq data (Dixit et al., 2016) for 21 transcription factors in unperturbed bone-marrow dendritic cells (9843 cells).
- Compared PC on full samples (PC-full) with PC using test-wise deletion (PC-test-del) against known regulatory interactions.
- Results: PC-full inferred numerous edges not supported by prior knowledge and some incorrect directions. PC-test-del produced a sparser graph where the majority of edges were previously known, with fewer unsupported predictions. Similar improvements were noted on HESC and CMLC datasets (details in appendix).
Implementation Considerations and Pseudocode for Test-wise Deletion CI Test:
To implement a conditional independence test Xi⊥Xj∣XS using test-wise deletion for inferring Zi⊥Zj∣ZS:
    function TestWiseDeletionCI(data_X, i, j, S, alpha):
        // data_X is the n_samples x p_genes matrix of observed expressions
        // i, j are indices of the two genes to test
        // S is a list of indices of conditioning genes
        // alpha is the significance level
        relevant_samples = data_X
        // Test-wise deletion: keep only samples with non-zero values in the conditioning set
        if S is not empty:
            for k in S:
                relevant_samples = relevant_samples[relevant_samples[:, k] != 0]
        // Check that enough samples remain (addresses Assumption A5)
        if number_of_rows(relevant_samples) < min_sample_threshold:
            return DEPENDENT  // or raise an error / skip the test
        // Extract X_i, X_j, and X_S from the retained samples
        X_i_filtered = relevant_samples[:, i]
        X_j_filtered = relevant_samples[:, j]
        X_S_filtered = relevant_samples[:, S]
        // Run a standard conditional independence test on the filtered data
        // (e.g., partial correlation for roughly Gaussian data, or a kernel-based CI test)
        p_value = PerformCITest(X_i_filtered, X_j_filtered, X_S_filtered)
        if p_value > alpha:
            return INDEPENDENT
        else:
            return DEPENDENT
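For concreteness, a runnable version of the pseudocode above is sketched below, assuming roughly Gaussian (or Gaussianized) data so that a Fisher-z partial-correlation test is appropriate; the function name, the sample-size threshold, and the choice of test are illustrative assumptions, not the paper's released code.

```python
# Runnable sketch of the test-wise deleted CI test (Fisher-z partial correlation).
import numpy as np
from scipy import stats

def testwise_deletion_ci(data_X, i, j, S, alpha=0.05, min_samples=30):
    """Return True if X_i _||_ X_j | X_S is not rejected on the samples where
    all conditioning genes in S are non-zero (test-wise deletion)."""
    S = list(S)
    # Test-wise deletion: keep rows with non-zero values in the conditioning set.
    keep = np.all(data_X[:, S] != 0, axis=1) if S else np.ones(len(data_X), dtype=bool)
    sub = data_X[keep][:, [i, j] + S]
    n = sub.shape[0]
    if n < min_samples + len(S):      # crude guard in the spirit of Assumption (A5)
        return False                  # too few samples: conservatively treat as dependent
    # Partial correlation of the first two columns given the rest, via the precision matrix.
    corr = np.corrcoef(sub, rowvar=False)
    prec = np.linalg.pinv(corr)
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.999999, 0.999999)
    # Fisher z-transform and two-sided p-value.
    z = abs(0.5 * np.log((1 + r) / (1 - r))) * np.sqrt(n - len(S) - 3)
    p_value = 2 * (1 - stats.norm.cdf(z))
    return p_value > alpha
```

A function with this signature can be passed directly as the `ci_test` callback in the skeleton-search sketch given earlier; for strongly non-Gaussian counts, a kernel-based or rank-based CI test could be substituted on the same deleted samples.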
Computational Requirements and Limitations:
The main computational overhead comes from the chosen causal discovery algorithm (PC, GES, etc.); the test-wise deletion itself is only a per-test data filtering step. A potential limitation, acknowledged by the authors, is the reduction in sample size for each CI test, especially with large conditioning sets or high dropout rates, which could affect statistical power (Assumption A5). Figure 4 in the paper and an appendix figure on varying dropout rates show that even with high dropout rates (e.g., 70%), sufficient samples can remain.
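As a rough back-of-envelope illustration (assuming approximately independent zero patterns across conditioning genes, which is not a claim from the paper): with 5,000 cells, a 50% per-gene zero rate, and a conditioning set of size 3, about 5,000 × 0.5³ ≈ 625 cells survive test-wise deletion, which is typically still adequate for a low-dimensional CI test.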
Conclusion:
The paper introduces a principled, non-parametric causal framework (CDM) for understanding dropouts in scRNA-seq data. The key practical outcome is a simple yet effective "test-wise deletion" procedure for CI testing that, when integrated into standard causal discovery algorithms, significantly improves GRNI accuracy in the presence of dropouts. This method is shown to be superior to common imputation techniques. The framework also allows for the validation of assumptions about dropout mechanisms. Future work aims to address the sample size reduction caused by deletion.