Run scDesign3Py pipeline step by step
Introduction
In this section, we will show how to run the whole scDesign3Py pipeline step by step.
Some basic introduction of scDesign3Py package is included in the All in one simulation section, so if you have no idea of how to use the package, you can first go through that tutorial.
In this section, we will introduce the following methods according to their actual execution order.
Step 0: Preparation
import packages
import anndata as ad
import scDesign3Py
Read in data
The raw data is from the scvelo and we only choose top 30 genes to save time.
data = ad.read_h5ad("data/PANCREAS.h5ad")
data = data[:, 0:30]
data
View of AnnData object with n_obs × n_vars = 2087 × 30
obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'cell_type', 'sizeFactor', 'pseudotime'
var: 'highly_variable_genes'
obsm: 'X_pca', 'X_umap', 'X_x_pca', 'X_x_umap'
Create the scDesign3
instance
test = scDesign3Py.scDesign3(n_cores=3, parallelization="pbmcmapply", return_py=True)
test.set_r_random_seed(123)
Step 1: Construct data
This function construct the input dataset.
Note
The default assay counts stored in anndata.AnnData.X don’t have a specified name, when you are going to use the default assay, you should assign a name to it in the default_assay_name parameter. Else if you are using the assay stored in anndata.AnnData.layers[assay_use], you can specify the name in the assay_use parameter.
const_data = test.construct_data(
anndata=data,
default_assay_name="counts",
celltype="cell_type",
pseudotime="pseudotime",
corr_formula="1",
)
The results are all converted to pandas.DataFrame
so that you can easily check and manipulate the result in python.
const_data["dat"].head()
cell_type | pseudotime | corr_group | |
---|---|---|---|
AAACCTGAGAGGGATA | Pre-endocrine | 0.677236 | 1.0 |
AAACCTGGTAAGTGGC | Ngn3 high EP | 0.431569 | 1.0 |
AAACGGGCAAAGAATC | Beta | 0.749373 | 1.0 |
AAACGGGGTACAGTTC | Beta | 0.925441 | 1.0 |
AAACGGGGTGAAATCA | Ngn3 high EP | 0.380345 | 1.0 |
Step 2: Fit marginal
Fit regression models for each gene (feature) based on your specification.
Note
Though we have already set the parallel method when creating the instance, we can change the setting temporarily when executing the methods one by one.
Here is an example where we change the parallel method to bpmapply , thus we need an extra bpparam object got from the get_bpparam() function. (Details of the function is included in Get BPPARAM section)
bpparam = scDesign3Py.get_bpparam(mode="MulticoreParam", show=False)
marginal = test.fit_marginal(
data=const_data,
mu_formula="s(pseudotime, k = 10, bs = 'cr')",
sigma_formula="1",
usebam=False,
family_use="nb",
n_cores=3,
parallelization="bpmapply",
bpparam=bpparam,
)
Warning
So far there has been an unfixed problem in converting the marginal list to OrdDict. Use .rx2 method to get values.
If you want to manipulate the results
print(marginal.rx2("Pyy").rx2("fit").rx2("coefficients"))
(Intercept) s(pseudotime).1 s(pseudotime).2 s(pseudotime).3 s(pseudotime).4
1.9507488 -1.4124446 -1.8604564 -1.6723522 -0.8250225
s(pseudotime).5 s(pseudotime).6 s(pseudotime).7 s(pseudotime).8 s(pseudotime).9
1.5108538 2.7502221 3.5714840 3.9859354 2.8515292
marginal.rx2("Pyy").rx2("fit").rx2("coefficients")[0] = 1.96
print(marginal.rx2("Pyy").rx2("fit").rx2("coefficients"))
(Intercept) s(pseudotime).1 s(pseudotime).2 s(pseudotime).3 s(pseudotime).4
1.9600000 -1.4124446 -1.8604564 -1.6723522 -0.8250225
s(pseudotime).5 s(pseudotime).6 s(pseudotime).7 s(pseudotime).8 s(pseudotime).9
1.5108538 2.7502221 3.5714840 3.9859354 2.8515292
Step 3: Fit copula
Fit a copula, obtain AIC and BIC.
copula = test.fit_copula(
input_data=const_data["dat"],
marginal_dict=marginal,
important_feature="auto",
copula="vine"
)
We can evaluate the model by checking the AIC.
copula["model_aic"]
aic.marginal 271036.643970
aic.copula -2254.105974
aic.total 268782.537996
dtype: float64
Note
The return value is a rpy2.rlike.container.OrdDict . Not all elements in this dict like object have to be named but they have a given order. None as a key value means an absence of name for the element. For the values without a named key, you can call byindex method to get them by index (rank).
Here, we show an example to get the vine copula values. If you call byindex
method, you will get a tuple with the first value being the key and the second value being the value.
The example fetches the R vinecop class property pair_copulas
, and get the family
info. The equal R version code is copula$copula_list$"1"$pair_copulas[[1]][[1]]$"family"
.
print(copula["copula_list"]["1"]["pair_copulas"].byindex(0)[-1].byindex(0)[-1]["family"])
[1] "gaussian"
Step 4: Extract parameters
Extract out the estimated parameters so you can make some modifications and use the modified parameters to generate new data if needed. The following parameters can be extracted:
a cell-by-gene mean matrix
a sigma matrix which is:
a cell-by-gene matrix of \(\frac{1}{\phi}\) for negative binomial distribution
a cell-by-gene matrix of the standard deviation \(\sigma\) for Gaussian distribution
a cell-by-gene matrix of 1 for poisson distribution
a zero matrix which is:
a cell-by-gene matrix of zero probabilities for zero-inflated negative binomial and zero-inflated poisson distributions
a zero matrix for negative binomial, Gaussian, and poisson distributions
para = test.extract_para(
marginal_dict=marginal,
data=const_data["dat"],
new_covariate=None,
)
The output matrix can be modified based on pandas.DataFrame
syntax.
para["mean_mat"].iloc[0:6,0:5]
Pyy | Iapp | Chgb | Rbp4 | Spp1 | |
---|---|---|---|---|---|
AAACCTGAGAGGGATA | 50.419465 | 1.181685 | 53.842404 | 20.240920 | 0.242246 |
AAACCTGGTAAGTGGC | 1.301210 | 0.256551 | 5.701465 | 0.623444 | 0.215905 |
AAACGGGCAAAGAATC | 104.864109 | 8.201911 | 38.288643 | 39.306191 | 0.217556 |
AAACGGGGTACAGTTC | 162.884022 | 289.929054 | 18.686662 | 46.505082 | 0.276204 |
AAACGGGGTGAAATCA | 0.740181 | 0.236181 | 0.941316 | 0.253139 | 0.229452 |
AAACGGGTCAAACAAG | 0.565350 | 0.321886 | 0.076826 | 0.234823 | 6.812986 |
Step 5: Simulate new data
simu_new = test.simu_new(
mean_mat=para["mean_mat"],
sigma_mat=para["sigma_mat"],
zero_mat=para["zero_mat"],
copula_dict=copula["copula_list"],
input_data=const_data["dat"],
new_covariate=const_data["newCovariate"],
important_feature=copula["important_feature"],
filtered_gene=const_data["filtered_gene"],
)
The final simulated result is also a pandas.DataFrame
object.
simu_new.iloc[0:6,0:6]
Pyy | Iapp | Chgb | Rbp4 | Spp1 | Chga | |
---|---|---|---|---|---|---|
AAACCTGAGAGGGATA | 31.0 | 1.0 | 62.0 | 19.0 | 0.0 | 12.0 |
AAACCTGGTAAGTGGC | -0.0 | 1.0 | 5.0 | -0.0 | 1.0 | 1.0 |
AAACGGGCAAAGAATC | 201.0 | 2.0 | 128.0 | 29.0 | -0.0 | 33.0 |
AAACGGGGTACAGTTC | 22.0 | 261.0 | 13.0 | 88.0 | 0.0 | 10.0 |
AAACGGGGTGAAATCA | 0.0 | 1.0 | -0.0 | -0.0 | -0.0 | -0.0 |
AAACGGGTCAAACAAG | 0.0 | -0.0 | 0.0 | 0.0 | 5.0 | 0.0 |