Benchmarking methods for identification of DE genes along continuous trajectory

Introduction

In this example, we will demonstrate how to use scDesign3Py to generate negative control and benchmark methods for identifying differentially expressed (DE) genes along continuous trajectory from scRNA-seq data. Please note that here we only did a very brief benchmarking for illustration purpose, not for formal comparison.

Import packages and Read in data

import pacakges

import anndata as ad
import numpy as np
import pandas as pd
import scDesign3Py

Read in data

The raw data is from the scvelo, which describes pancreatic endocrinogenesis, and is converted to .h5ad file using the R package sceasy. We pre-select the top 1000 highly variable genes and filter out some cell types to ensure a single trajectory.

To save computational time, we only use the top 200 genes.

data = ad.read_h5ad("data/PANCREAS.h5ad")
data = data[:,:200]
data
View of AnnData object with n_obs × n_vars = 2087 × 200
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'cell_type', 'sizeFactor', 'pseudotime'
    var: 'highly_variable_genes'
    obsm: 'X_pca', 'X_umap', 'X_x_pca', 'X_x_umap'

Simulation

We use the step-by-step functions instead of the one-shot function to generate synthetic data since these step-by-step functions allow us to alter estimated parameters and generate new data based on our desired parameters.

example = scDesign3Py.scDesign3(n_cores=3,parallelization="pbmcmapply")
example_data = example.construct_data(    
    anndata = data,
    default_assay_name = "counts",
    celltype = "cell_type",
    pseudotime="pseudotime",
    corr_formula = "1")
example.set_r_random_seed(123)
example_marginal = example.fit_marginal(
    mu_formula = "s(pseudotime, k = 10, bs = 'cr')",
    sigma_formula = "1",
    family_use = "nb",
    usebam = False
)
example.set_r_random_seed(123)
example_copula = example.fit_copula()
example_para = example.extract_para()

Here, we examine the mean_mat, which is one of the outputs from the previous function extract_para(). For each gene, we calculate the difference in the between the maximum mean parameter and minimum mean parameter across all cells. We select genes which the gene’s mean difference across cells are in the top 50 largest differences. We regard these genes as DE genes. Then, we manually set the mean parameters of the rest genes to be the same across all cells. We regard all genes with the same mean parameter across cells as non-DE genes. Of course, this is a very flexible step and users may choose other ideas to modify the mean matrix.

diff_idx = (example_para["mean_mat"].max() - example_para["mean_mat"].min()).sort_values(ascending=False).index
de_idx = diff_idx[:50]
no_de_idx = diff_idx[50:]
example_para["mean_mat"][no_de_idx] = ((example_para["mean_mat"].max() + example_para["mean_mat"].min())/2)[no_de_idx]
example.set_r_random_seed(123)
example_newcount = example.simu_new(mean_mat=example_para["mean_mat"])

DE genes identification

Now, we use the simulated data to benchmark the performance of scFates, a DEG identification method as an example.

simu_data = ad.AnnData(X=example_newcount, obs=example_data["newCovariate"])
simu_data.layers["log_transformed"] = np.log1p(simu_data.X)
simu_data.obs["t"] = simu_data.obs["pseudotime"]
simu_data.obs["seg"] = 1
import scFates as scf
qvals = pd.DataFrame(index=simu_data.var_names)
methods = ["scFates"]
scf.tl.test_association(simu_data)
qvals["scFates"] = simu_data.var["fdr"]

Since we manually created non-DE genes in the extra_para() step, now we can calculate the actual false discovery proportion(FDP) and power of the DE tests we conducted above with various target FDR threshold.

targetFDR = np.concatenate([np.arange(0.01,0.11,0.01),np.arange(0.2,0.6,0.1)])
fdp = pd.DataFrame(index=targetFDR,columns=methods)
power = pd.DataFrame(index=targetFDR,columns=methods)

for method in methods:
    curr_p = qvals[method]
    for threshold in targetFDR:
        discovery = curr_p[curr_p <= threshold]
        true_positive = discovery.index.intersection(de_idx).shape[0]

        if len(discovery) == 0:
            fdp.loc[threshold,method] = 0
        else:
            fdp.loc[threshold,method] = (len(discovery) - true_positive)/len(discovery)
        
        power.loc[threshold,method] = true_positive/de_idx.shape[0]

Visualization

We visualize the Target FDR vs Actual FDP and Target FDR vs Power below.

import matplotlib.pyplot as plt
Hide code cell source
for method in methods:
    plt.plot(targetFDR,fdp[method],"o-",label=method)

plt.plot(targetFDR, targetFDR, '--', color='gray')
plt.xlabel('Target FDR')
plt.ylabel('Actual FDP')
plt.legend(loc="best")
plt.show()
../../_images/015a346f0cc865836876ff74f8275821cd11d3bc2d6a6952f0ae78dbc41d7f8d.png
Hide code cell source
for method in methods:
    plt.plot(targetFDR,power[method],"o-",label=method)

plt.xlabel('Target FDR')
plt.ylabel('Power')
plt.legend(loc="best")
plt.show()
../../_images/a568f17c55304c2f67ec8b81bf5cdbc4b47d5b2dfa2b690a8d1de788eb6b01c9.png