
Gene Regulatory Network Inference in the Presence of Dropouts: a Causal View (2403.15500v1)

Published 21 Mar 2024 in q-bio.QM, cs.LG, and q-bio.MN

Abstract: Gene regulatory network inference (GRNI) is a challenging problem, particularly owing to the presence of zeros in single-cell RNA sequencing data: some are biological zeros representing no gene expression, while some others are technical zeros arising from the sequencing procedure (aka dropouts), which may bias GRNI by distorting the joint distribution of the measured gene expressions. Existing approaches typically handle dropout error via imputation, which may introduce spurious relations as the true joint distribution is generally unidentifiable. To tackle this issue, we introduce a causal graphical model to characterize the dropout mechanism, namely, Causal Dropout Model. We provide a simple yet effective theoretical result: interestingly, the conditional independence (CI) relations in the data with dropouts, after deleting the samples with zero values (regardless if technical or not) for the conditioned variables, are asymptotically identical to the CI relations in the original data without dropouts. This particular test-wise deletion procedure, in which we perform CI tests on the samples without zeros for the conditioned variables, can be seamlessly integrated with existing structure learning approaches including constraint-based and greedy score-based methods, thus giving rise to a principled framework for GRNI in the presence of dropouts. We further show that the causal dropout model can be validated from data, and many existing statistical models to handle dropouts fit into our model as specific parametric instances. Empirical evaluation on synthetic, curated, and real-world experimental transcriptomic data comprehensively demonstrate the efficacy of our method.

Citations (4)

Summary

  • The paper introduces a novel causal dropout model (CDM) and test-wise deletion method to recover accurate conditional independence relations in gene regulatory network inference.
  • The method, evaluated on simulated and real scRNA-seq datasets, consistently outperforms imputation and full-sample approaches in terms of graph accuracy and computational efficiency.
  • The framework not only corrects dropout biases but also validates key structural assumptions, revealing self-masking effects in gene expression data for reliable causal discovery.

This paper addresses the challenge of Gene Regulatory Network Inference (GRNI) from single-cell RNA sequencing (scRNA-seq) data, which is often plagued by "dropouts" – an abundance of zero values. These zeros can be either biological (true absence of expression) or technical (due to limitations in the sequencing process). Technical dropouts can distort the data distribution and bias GRNI.

The authors propose a novel framework centered around a Causal Dropout Model (CDM) to characterize the dropout mechanism. The CDM defines four variables for each gene $i$:

  • $Z_i$: the true underlying gene expression. The network among $Z = \{Z_i\}$ is the GRN to be recovered.
  • $D_i$: a boolean variable indicating a technical dropout ($D_i = 1$ for dropout, $0$ otherwise).
  • $X_i$: the observed expression, generated by $X_i = (1 - D_i) Z_i$.
  • $R_i$: a boolean zero observational indicator, $R_i = D_i \text{ OR } \mathbb{1}(Z_i = 0)$, so that $R_i = 1$ iff $X_i = 0$. $R_i$ is fully observable from $X_i$.
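As a concrete illustration, the four variables can be simulated directly from the CDM's structural equations. The Poisson expression model and the exponential self-masking dropout probability below are assumed forms chosen for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # samples for a single gene, purely illustrative

# Z: true expression (Poisson counts here, an assumed distribution)
Z = rng.poisson(lam=3.0, size=n)

# D: technical dropout indicator; self-masking means P(D=1) depends on Z
# (lower expression -> higher dropout probability, an assumed functional form)
p_drop = np.exp(-0.5 * Z)
D = (rng.random(n) < p_drop).astype(int)

# X: observed expression, X = (1 - D) * Z
X = (1 - D) * Z

# R: zero indicator, R = D OR 1(Z == 0); equivalently R = 1(X == 0)
R = np.logical_or(D == 1, Z == 0).astype(int)

# R is fully observable from X alone, as stated above
assert np.array_equal(R, (X == 0).astype(int))
```

Note that from $X$ alone one cannot tell whether a particular zero came from $D_i = 1$ (technical) or $Z_i = 0$ (biological), which is exactly why $D_i$ is latent.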

A key aspect of the CDM is the edge $Z_i \rightarrow D_i$, representing a "self-masking" dropout mechanism where a gene's true expression level influences its probability of dropout. The paper shows that many existing parametric models for dropouts (e.g., zero-inflated models, truncation models, probabilistic dropout models) can be seen as specific instances of this non-parametric CDM.

Limitations of Existing Approaches:

The paper argues that imputation methods, which treat all zeros as missing values to be filled in, lack theoretical guarantees. The true underlying distribution $p(Z)$ is generally unidentifiable from the observed $p(X)$, due to the self-masking nature of dropouts and the fact that technical zeros cannot be distinguished from biological zeros (i.e., $D_i$ is latent). Examples are provided to show this unidentifiability even under specific parametric assumptions.

Proposed Solution: Test-wise Deletion for CI Testing

Despite the unidentifiability of $p(Z)$, the authors demonstrate theoretically that conditional independence (CI) relations in the true data $Z$ can be recovered. The core theoretical contribution is Theorem 1 ("Correct CI estimations"):

Under assumptions (A1) causal sufficiency, Markov, and faithfulness over $Z \cup X \cup R$, acyclicity, and consistent CI testing; (A2) $D_i \not\rightarrow Z_j$ (dropout does not affect gene expression); (A3) $D_i$ is directly affected only by $Z_i$; and (A4) faithful observability (dependencies conditioned on $R$ are preserved in the non-zero values), the following holds:

$$Z_i \perp Z_j \mid Z_S \;\Leftrightarrow\; X_i \perp X_j \mid Z_S, R_S = \mathbf{0}$$

This means that to test $Z_i \perp Z_j \mid Z_S$, one can test $X_i \perp X_j \mid Z_S$ using only the data samples in which the conditioning variables $X_k$ (for $k \in S$) are non-zero (i.e., $R_S = \mathbf{0}$). This procedure is termed "test-wise deletion."
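On a toy observed matrix, the deletion step of Theorem 1 amounts to a row mask over the conditioning columns only; the data and gene indices below are illustrative:

```python
import numpy as np

# Toy observed matrix X (rows = cells, columns = genes); zeros may be
# biological or technical -- indistinguishable from X alone.
X = np.array([
    [2.0, 0.0, 1.5],
    [0.0, 3.1, 2.2],
    [1.1, 2.4, 0.0],
    [0.7, 1.9, 3.3],
])

i, j, S = 0, 1, [2]  # test X_0 _||_ X_1 | X_2 (indices are illustrative)

# Test-wise deletion: keep only samples where every conditioning gene is
# non-zero (R_S = 0). Zeros in X_i or X_j themselves are NOT deleted.
mask = np.all(X[:, S] != 0, axis=1)
X_kept = X[mask]

print(X_kept.shape[0])  # rows 0, 1, 3 survive -> 3
```

A standard CI test is then run on `X_kept[:, i]`, `X_kept[:, j]`, and `X_kept[:, S]`.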

Implementation and Application:

This test-wise deletion procedure can be seamlessly integrated into existing constraint-based causal discovery algorithms (like the PC algorithm) and greedy score-based methods (like Greedy Equivalence Search, GES). Definition 1 (General procedure for causal discovery with dropout correction):

  1. Perform any consistent causal discovery algorithm.
  2. For each CI test $Z_i \perp Z_j \mid Z_S$, use the equivalence with $X_i \perp X_j \mid Z_S, R_S = \mathbf{0}$.
  3. Infer the graph structure among $Z$. Assumption (A5) ensures that after deletion, sufficient samples remain for CI tests.

Validating the Causal Dropout Model:

The paper also provides a way to validate the structural assumptions of the CDM, particularly (A3), which states that a gene's dropout is affected only by its own expression. By relaxing (A3), the authors propose a method to discover both the GRN structure and the causal relationships from $Z$ to $R$ (the dropout mechanisms). Definition 2 (Generalized GRN and dropout mechanisms discovery):

Perform the procedure in Definition 1, but infer $Z_i \perp Z_j \mid Z_S$ if and only if $Z_i \perp Z_j \mid Z_S, R_{S \cup \{i,j\}} = \mathbf{0}$. This involves deleting samples with zeros for all variables involved in the CI test (not just the conditioning set). Theorem 2 ("Identification of GRN and dropout mechanisms") states, under A1, A2, A4, and A5:

  • If $Z_i, Z_j$ are non-adjacent in the output of Definition 2, they are indeed non-adjacent in the true GRN, $Z_i$ does not cause $R_j$, and $Z_j$ does not cause $R_i$.
  • If $Z_i, Z_j$ are adjacent, then their true GRN adjacency has one specific unidentifiable case, and the causal relationships $Z_i \rightarrow R_j$ and $Z_j \rightarrow R_i$ are naturally unidentifiable (due to $Z_i \leftrightarrow Z_j$ in the GRN making $Z_i \rightarrow R_j$ unresolvable from $Z_i \leftarrow Z_j \rightarrow R_j$).
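The only mechanical difference between the two deletion schemes is which columns the row mask covers; a toy contrast (data and indices are illustrative):

```python
import numpy as np

X = np.array([
    [2.0, 0.0, 1.5],
    [0.0, 3.1, 2.2],
    [1.1, 2.4, 0.0],
    [0.7, 1.9, 3.3],
])
i, j, S = 0, 1, [2]

# Definition 1: delete samples with zeros in the conditioning set only (R_S = 0)
mask_def1 = np.all(X[:, S] != 0, axis=1)

# Definition 2: delete samples with zeros in S union {i, j} (R_{S u {i,j}} = 0)
mask_def2 = np.all(X[:, S + [i, j]] != 0, axis=1)

print(mask_def1.sum(), mask_def2.sum())  # -> 3 1
```

Definition 2 always keeps a subset of Definition 1's samples, which is the price paid for additionally identifying the dropout mechanisms.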

This implies that the proposed framework can help assess if dropout causes are indeed primarily self-masking or if cross-gene effects on dropout exist.

Experimental Evaluation:

The proposed method ("test-wise deletion") was extensively evaluated:

  1. Linear SEM Simulated Data:
    • Compared against MAGIC (imputation), mixedCCA (parametric model), using full samples, and an Oracle with access to the true $Z$ data. PC and GES algorithms were used.
    • Settings: $p \in \{10, 20, 30, 100\}$ nodes, various graph densities. Data distributions: Gaussian, lognormal. Dropout mechanisms: fixed rates, truncating low expressions, and dropout probabilistically determined by expression.
    • Results: Test-wise deletion consistently outperformed other methods in terms of Structural Hamming Distance (SHD) and was close to Oracle performance. Applying algorithms to full samples led to denser, less accurate graphs. MixedCCA performed well in its ideal setting (Gaussian, truncation) but was still outperformed. An experiment on 20,000 nodes using FGES also showed superior F1-score and faster runtime for the test-wise deletion method compared to using full samples.
  2. Realistic BoolODE Synthetic and Curated Data (BEELINE framework):
    • Used BoolODE simulator for 6 synthetic and 4 literature-curated datasets (5000 cells, 50% dropout).
    • Algorithms: PC, GES, and 7 other GRNI-specific SOTA algorithms (e.g., SINCERITIES, GRNBOOST2).
    • Strategies: oracle*, test-wise deletion, full samples, imputed, binarization.
    • Results (F1-score of skeleton edges):
      • Dropouts significantly harm GRNI performance.
      • Existing strategies like imputation and binarization often performed worse than using full samples, sometimes being counterproductive.
      • Test-wise deletion consistently improved performance over full samples and imputation across most dataset-algorithm pairs.
  3. Real-World Experimental Data:
    • Perturb-seq data (Dixit et al., 2016) for 21 transcription factors in unperturbed bone-marrow dendritic cells (9843 cells).
    • Compared PC on full samples (PC-full) with PC using test-wise deletion (PC-test-del) against known regulatory interactions.
    • Results: PC-full inferred numerous edges not supported by prior knowledge and some incorrect directions. PC-test-del produced a sparser graph where the majority of edges were previously known, with fewer unsupported predictions. Similar improvements were noted on HESC and CMLC datasets (details in appendix).

Implementation Considerations and Pseudocode for Test-wise Deletion CI Test:

To implement a conditional independence test $X_i \perp X_j \mid X_S$ using test-wise deletion for inferring $Z_i \perp Z_j \mid Z_S$:

function TestWiseDeletionCI(data_X, i, j, S, alpha, min_sample_threshold):
  // data_X is the n_samples x p_genes matrix of observed expressions
  // i, j are indices of the genes to test
  // S is a list of indices of conditioning genes
  // alpha is the significance level
  // min_sample_threshold guards against underpowered tests (Assumption A5)

  relevant_samples = data_X

  // Keep only samples with non-zero values for every conditioning gene (R_S = 0)
  for k in S:
    relevant_samples = relevant_samples[relevant_samples[:, k] != 0]

  // Check that enough samples remain (addresses Assumption A5)
  if number_of_rows(relevant_samples) < min_sample_threshold:
    // Handle insufficient data, e.g., conservatively return DEPENDENT or skip the test
    return DEPENDENT

  // Extract X_i, X_j, and X_S from the filtered samples
  X_i_filtered = relevant_samples[:, i]
  X_j_filtered = relevant_samples[:, j]
  X_S_filtered = relevant_samples[:, S]

  // Perform a standard CI test on the filtered data, e.g., partial
  // correlation (Fisher's z) for Gaussian data, or a kernel-based CI test
  p_value = PerformCITest(X_i_filtered, X_j_filtered, X_S_filtered)

  if p_value > alpha:
    return INDEPENDENT
  else:
    return DEPENDENT
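For Gaussian data, the undefined `PerformCITest` above can be instantiated as a partial-correlation (Fisher's z) test. A minimal Python sketch; the function name and the precision-matrix route to the partial correlation are my own choices, not from the paper:

```python
import numpy as np
from math import sqrt, atanh, erf

def fisher_z_ci_test(x, y, Z):
    # p-value for H0: x _||_ y | Z, assuming joint Gaussianity.
    # x, y: 1-D arrays of length n; Z: (n, k) array, k may be 0.
    n = len(x)
    cols = [x, y] + ([Z] if Z.size else [])
    corr = np.corrcoef(np.column_stack(cols), rowvar=False)
    # Partial correlation of x and y given Z, read off the precision matrix
    prec = np.linalg.pinv(corr)
    r = -prec[0, 1] / sqrt(prec[0, 0] * prec[1, 1])
    r = max(-0.999999, min(0.999999, r))  # guard atanh at |r| = 1
    k = Z.shape[1] if Z.size else 0
    z = atanh(r) * sqrt(n - k - 3)
    # Two-sided p-value from the standard normal CDF
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
```

On the filtered samples from `TestWiseDeletionCI`, one would call `fisher_z_ci_test(X_i_filtered, X_j_filtered, X_S_filtered)` and compare the returned p-value to `alpha`.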

Computational Requirements and Limitations:

The main computational overhead comes from the chosen causal discovery algorithm (PC, GES, etc.); the test-wise deletion itself is a per-CI-test data preprocessing step. A potential limitation, acknowledged by the authors, is the reduction in sample size for CI tests, especially with large conditioning sets or high dropout rates, which could affect statistical power (Assumption A5). Figure 4 in the paper and an appendix figure on varying dropout rates show that even with high dropout rates (e.g., 70%), sufficient samples can remain.

Conclusion:

The paper introduces a principled, non-parametric causal framework (CDM) for understanding dropouts in scRNA-seq data. The key practical outcome is a simple yet effective "test-wise deletion" procedure for CI testing that, when integrated into standard causal discovery algorithms, significantly improves GRNI accuracy in the presence of dropouts. This method is shown to be superior to common imputation techniques. The framework also allows for the validation of assumptions about dropout mechanisms. Future work aims to address the sample size reduction caused by deletion.