- The paper introduces a novel causal dropout model (CDM) and test-wise deletion method to recover accurate conditional independence relations in gene regulatory network inference.
- The method, evaluated on simulated and real scRNA-seq datasets, consistently outperforms imputation and full-sample approaches in terms of graph accuracy and computational efficiency.
- The framework not only corrects dropout biases but also allows its key structural assumptions to be validated, for example revealing whether dropouts in gene expression data are indeed self-masking, supporting reliable causal discovery.
This paper addresses the challenge of Gene Regulatory Network Inference (GRNI) from single-cell RNA sequencing (scRNA-seq) data, which is often plagued by "dropouts" – an abundance of zero values. These zeros can be either biological (true absence of expression) or technical (due to limitations in the sequencing process). Technical dropouts can distort the data distribution and bias GRNI.
The authors propose a novel framework centered around a Causal Dropout Model (CDM) to characterize the dropout mechanism.
The CDM defines four sets of variables for each gene i:
- Zi: The true underlying gene expression. The network among Z={Zi} is the GRN to be recovered.
- Di: A binary variable indicating a technical dropout (Di=1 for dropout, 0 otherwise).
- Xi: The observed expression, generated by Xi=(1−Di)Zi.
- Ri: A binary zero indicator, Ri = Di OR 1(Zi=0), so Ri=1 if and only if Xi=0. Ri is fully observable from Xi.
A key aspect of the CDM is the edge Zi→Di, representing a "self-masking" dropout mechanism where a gene's true expression level influences its probability of dropout. The paper shows that many existing parametric models for dropouts (e.g., zero-inflated models, truncation models, probabilistic dropout models) can be seen as specific instances of this non-parametric CDM.
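To make the generative structure concrete, here is a toy simulation sketch of the CDM for a single gene, with an illustrative logistic self-masking mechanism; the specific distributions and the dropout function are illustrative choices, not taken from the paper.

```python
# Toy CDM sketch (illustrative choices, not the paper's code):
# Z is the true expression, D a self-masking dropout indicator, X the observation.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
Z = rng.lognormal(mean=1.0, sigma=0.5, size=n)   # true expression Z_i (always > 0 here)
# Self-masking dropout Z_i -> D_i: lower expression -> higher dropout probability.
p_drop = 1.0 / (1.0 + np.exp(Z - 2.0))           # an arbitrary logistic choice
D = rng.random(n) < p_drop                        # D_i = 1 means technical dropout
X = np.where(D, 0.0, Z)                           # X_i = (1 - D_i) * Z_i
R = (X == 0).astype(int)                          # R_i = D_i OR 1(Z_i = 0), observable from X
```

Replacing the logistic rule with, e.g., D = (Z < c) recovers a truncation-style dropout, one of the parametric special cases mentioned above.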
Limitations of Existing Approaches:
The paper argues that imputation methods, which treat all zeros as missing values to be filled in, lack theoretical guarantees. The true underlying distribution p(Z) is generally unidentifiable from the observed p(X) due to the self-masking nature of dropouts and the fact that technical zeros cannot be distinguished from biological zeros (i.e., Di is latent). Examples are provided to show this unidentifiability even under specific parametric assumptions.
Proposed Solution: Test-wise Deletion for CI Testing
Despite the unidentifiability of p(Z), the authors demonstrate theoretically that Conditional Independence (CI) relations in the true data Z can be recovered. The core theoretical contribution is Theorem 1 ("Correct CI estimations"):
Under assumptions (A1) causal sufficiency, Markov and faithfulness over Z∪X∪R, acyclicity, and consistent CI testing; (A2) no edges Di→Zj, i.e., dropout does not affect the underlying gene expressions; (A3) each Di is directly affected only by Zi; and (A4) faithful observability (conditional dependencies are preserved in the non-zero samples), the following holds:
Zi ⊥ Zj | ZS ⇔ Xi ⊥ Xj | ZS, RS = 0
This means that to test for Zi⊥Zj∣ZS, one can test for Xi⊥Xj∣ZS using only the data samples where the conditioning variables Xk (for k∈S) are non-zero (i.e., RS=0). This procedure is termed "test-wise deletion."
Implementation and Application:
This test-wise deletion procedure can be seamlessly integrated into existing constraint-based causal discovery algorithms (like the PC algorithm) and greedy score-based methods (like Greedy Equivalence Search, GES).
Definition 1 (General procedure for causal discovery with dropout correction):
- Perform any consistent causal discovery algorithm.
- For CI tests Zi⊥Zj∣ZS, use the equivalence Xi⊥Xj∣ZS,RS=0.
- Infer the graph structure among Z.
Assumption (A5) ensures that after deletion, sufficient samples remain for CI tests.
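To illustrate how Definition 1 plugs into a constraint-based search, below is a minimal, simplified sketch (an illustration, not the authors' implementation) of a PC-style adjacency search that routes every CI query through a test-wise deleted CI test; a runnable version of such a test is sketched in the implementation section further down.

```python
# Simplified PC-style skeleton search; every CI query uses a deletion-corrected test.
from itertools import combinations

def pc_skeleton_with_deletion(X, ci_test, alpha=0.05, max_cond=2):
    """`ci_test(X, i, j, S, alpha)` returns True if X_i and X_j are judged
    independent given X_S on the test-wise deleted samples."""
    p = X.shape[1]
    adj = {frozenset((i, j)) for i in range(p) for j in range(i + 1, p)}
    for level in range(max_cond + 1):
        for pair in list(adj):
            i, j = tuple(pair)
            # Candidate conditioning sets: current neighbours of i, excluding j.
            nbrs = [k for k in range(p)
                    if k not in (i, j) and frozenset((i, k)) in adj]
            for S in combinations(nbrs, level):
                if ci_test(X, i, j, list(S), alpha):
                    adj.discard(pair)   # remove the edge i - j
                    break
    return adj
```

Because the dropout correction lives entirely inside the CI test, the same idea carries over to score-based methods such as GES/FGES, where, as described above, the correction is applied within the local computations.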
Validating the Causal Dropout Model:
The paper also provides a way to validate the structural assumptions of the CDM, particularly (A3), which states that a gene's dropout is only affected by its own expression. By relaxing (A3), the authors propose a method to discover the GRN structure and the causal relationships from Z to R (dropout mechanisms).
Definition 2 (Generalized GRN and dropout mechanisms discovery):
Perform the procedure in Definition 1, but infer Zi⊥Zj∣ZS if and only if Zi⊥Zj∣ZS,RS∪{i,j}=0. This involves deleting samples with zeros for all variables involved in the CI test (not just the conditioning set).
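The only mechanical difference from Definition 1 is which samples are deleted; a small sketch (an illustration, not the paper's code) of the two deletion masks:

```python
# Contrast of the deletion patterns in Definitions 1 and 2 (illustrative sketch).
import numpy as np

def kept_rows_definition1(X, S):
    # Definition 1: keep cells where every conditioning gene is non-zero (R_S = 0).
    return np.all(X[:, list(S)] != 0, axis=1)

def kept_rows_definition2(X, i, j, S):
    # Definition 2: additionally require the two tested genes to be non-zero
    # (R_{S ∪ {i, j}} = 0).
    return np.all(X[:, list(S) + [i, j]] != 0, axis=1)
```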
Theorem 2 ("Identification of GRN and dropout mechanisms") (under A1, A2, A4, A5):
- If Zi,Zj are non-adjacent in the output of Definition 2, they are indeed non-adjacent in the true GRN, and Zi does not cause Rj, and Zj does not cause Ri.
- If Zi, Zj are adjacent in the output, they are either adjacent in the true GRN or fall into one specific unidentifiable configuration; moreover, the cross-gene dropout effects Zi→Rj and Zj→Ri are inherently unidentifiable, since when Zi and Zj are adjacent a direct edge Zi→Rj cannot be distinguished from the dependence induced through Zj (e.g., via the path Zi←Zj→Rj).
This implies that the proposed framework can help assess if dropout causes are indeed primarily self-masking or if cross-gene effects on dropout exist.
Experimental Evaluation:
The proposed method ("test-wise deletion") was extensively evaluated:
- Linear SEM Simulated Data:
- Compared against MAGIC (imputation), mixedCCA (parametric model), using full samples, and an Oracle (true Z data). PC and GES algorithms were used.
- Settings: p∈{10,20,30,100} nodes, various graph densities. Data distributions: Gaussian, Lognormal. Dropout mechanisms: fixed rates, truncating low expressions, dropout probabilistically determined by expression.
- Results: Test-wise deletion consistently outperformed other methods in terms of Structural Hamming Distance (SHD) and was close to Oracle performance. Applying algorithms to full samples led to denser, less accurate graphs. MixedCCA performed well in its ideal setting (Gaussian, truncation) but was still outperformed.
- An experiment on 20,000 nodes using FGES also showed a superior F1-score and faster runtime for the test-wise deletion method compared to using full samples.
- Realistic BoolODE Synthetic and Curated Data (BEELINE framework):
- Used BoolODE simulator for 6 synthetic and 4 literature-curated datasets (5000 cells, 50% dropout).
- Algorithms: PC, GES, and 7 other GRNI-specific SOTA algorithms (e.g., SINCERITIES, GRNBOOST2).
- Strategies: oracle (true Z data), test-wise deletion, full samples, imputation, and binarization.
- Results (F1-score of skeleton edges):
- Dropouts significantly harm GRNI performance.
- Existing strategies like imputation and binarization often performed worse than using full samples, sometimes being counterproductive.
- Test-wise deletion consistently improved performance over full samples and imputation across most dataset-algorithm pairs.
- Real-World Experimental Data:
- Perturb-seq data (Dixit et al., 2016) for 21 transcription factors in unperturbed bone-marrow dendritic cells (9843 cells).
- Compared PC on full samples (PC-full) with PC using test-wise deletion (PC-test-del) against known regulatory interactions.
- Results: PC-full inferred numerous edges not supported by prior knowledge and some incorrect directions. PC-test-del produced a sparser graph where the majority of edges were previously known, with fewer unsupported predictions. Similar improvements were noted on HESC and CMLC datasets (details in appendix).
Implementation Considerations and Pseudocode for Test-wise Deletion CI Test:
To implement a conditional independence test Xi⊥Xj∣XS using test-wise deletion for inferring Zi⊥Zj∣ZS:
    function TestWiseDeletionCI(data_X, i, j, S, alpha):
        // data_X is the n_samples x p_genes matrix of observed expressions
        // i, j are indices of the two genes to test
        // S is a list of indices of conditioning genes
        // alpha is the significance level
        relevant_samples = data_X
        // Test-wise deletion: keep only samples with non-zero values in the conditioning set
        if S is not empty:
            for k in S:
                relevant_samples = relevant_samples[relevant_samples[:, k] != 0]
        // Check that enough samples remain (addresses Assumption A5)
        if number_of_rows(relevant_samples) < min_sample_threshold:
            return DEPENDENT  // or raise an error / skip the test
        // Extract X_i, X_j, and X_S from the retained samples
        X_i_filtered = relevant_samples[:, i]
        X_j_filtered = relevant_samples[:, j]
        X_S_filtered = relevant_samples[:, S]
        // Run a standard conditional independence test on the filtered data
        // (e.g., partial correlation for roughly Gaussian data, or a kernel-based CI test)
        p_value = PerformCITest(X_i_filtered, X_j_filtered, X_S_filtered)
        if p_value > alpha:
            return INDEPENDENT
        else:
            return DEPENDENT
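For concreteness, a runnable version of the pseudocode above is sketched below, assuming roughly Gaussian (or Gaussianized) data so that a Fisher-z partial-correlation test is appropriate; the function name, the sample-size threshold, and the choice of test are illustrative assumptions, not the paper's released code.

```python
# Runnable sketch of the test-wise deleted CI test (Fisher-z partial correlation).
import numpy as np
from scipy import stats

def testwise_deletion_ci(data_X, i, j, S, alpha=0.05, min_samples=30):
    """Return True if X_i _||_ X_j | X_S is not rejected on the samples where
    all conditioning genes in S are non-zero (test-wise deletion)."""
    S = list(S)
    # Test-wise deletion: keep rows with non-zero values in the conditioning set.
    keep = np.all(data_X[:, S] != 0, axis=1) if S else np.ones(len(data_X), dtype=bool)
    sub = data_X[keep][:, [i, j] + S]
    n = sub.shape[0]
    if n < min_samples + len(S):      # crude guard in the spirit of Assumption (A5)
        return False                  # too few samples: conservatively treat as dependent
    # Partial correlation of the first two columns given the rest, via the precision matrix.
    corr = np.corrcoef(sub, rowvar=False)
    prec = np.linalg.pinv(corr)
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    r = np.clip(r, -0.999999, 0.999999)
    # Fisher z-transform and two-sided p-value.
    z = abs(0.5 * np.log((1 + r) / (1 - r))) * np.sqrt(n - len(S) - 3)
    p_value = 2 * (1 - stats.norm.cdf(z))
    return p_value > alpha
```

A function with this signature can be passed directly as the `ci_test` callback in the skeleton-search sketch given earlier; for strongly non-Gaussian counts, a kernel-based or rank-based CI test could be substituted on the same deleted samples.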
Computational Requirements and Limitations:
The main computational overhead comes from the chosen causal discovery algorithm (PC, GES, etc.); the test-wise deletion itself is only a per-test data filtering step. A potential limitation, acknowledged by the authors, is the reduction in sample size for each CI test, especially with large conditioning sets or high dropout rates, which could affect statistical power (Assumption A5). Figure 4 in the paper and an appendix figure on varying dropout rates show that even with high dropout rates (e.g., 70%), sufficient samples can remain.
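As a rough back-of-envelope illustration (assuming approximately independent zero patterns across conditioning genes, which is not a claim from the paper): with 5,000 cells, a 50% per-gene zero rate, and a conditioning set of size 3, about 5,000 × 0.5³ ≈ 625 cells survive test-wise deletion, which is typically still adequate for a low-dimensional CI test.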
Conclusion:
The paper introduces a principled, non-parametric causal framework (CDM) for understanding dropouts in scRNA-seq data. The key practical outcome is a simple yet effective "test-wise deletion" procedure for CI testing that, when integrated into standard causal discovery algorithms, significantly improves GRNI accuracy in the presence of dropouts. This method is shown to be superior to common imputation techniques. The framework also allows for the validation of assumptions about dropout mechanisms. Future work aims to address the sample size reduction caused by deletion.