Run scDesign3Py pipeline step by step

Introduction

In this section, we will show how to run the whole scDesign3Py pipeline step by step.

Some basic introduction of scDesign3Py package is included in the All in one simulation section, so if you have no idea of how to use the package, you can first go through that tutorial.

In this section, we will introduce the following methods according to their actual execution order.

Construct data (API)
Fit marginal (API)
Fit copula (API)
Extract parameters (API)
Simulate new data (API)

Step 0: Preparation

import packages

import anndata as ad
import scDesign3Py

Read in data

The raw data is from the scvelo and we only choose top 30 genes to save time.

data = ad.read_h5ad("data/PANCREAS.h5ad")
data = data[:, 0:30]
data

View of AnnData object with n_obs × n_vars = 2087 × 30
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'cell_type', 'sizeFactor', 'pseudotime'
    var: 'highly_variable_genes'
    obsm: 'X_pca', 'X_umap', 'X_x_pca', 'X_x_umap'

Create the `scDesign3` instance

test = scDesign3Py.scDesign3(n_cores=3, parallelization="pbmcmapply", return_py=True)
test.set_r_random_seed(123)

Step 1: Construct data

This function construct the input dataset.

Note

The default assay counts stored in anndata.AnnData.X don’t have a specified name, when you are going to use the default assay, you should assign a name to it in the default_assay_name parameter. Else if you are using the assay stored in anndata.AnnData.layers[assay_use], you can specify the name in the assay_use parameter.

const_data = test.construct_data(
    anndata=data,
    default_assay_name="counts",
    celltype="cell_type",
    pseudotime="pseudotime",
    corr_formula="1",
)

The results are all converted to pandas.DataFrame so that you can easily check and manipulate the result in python.

const_data["dat"].head()

	cell_type	pseudotime	corr_group
AAACCTGAGAGGGATA	Pre-endocrine	0.677236	1.0
AAACCTGGTAAGTGGC	Ngn3 high EP	0.431569	1.0
AAACGGGCAAAGAATC	Beta	0.749373	1.0
AAACGGGGTACAGTTC	Beta	0.925441	1.0
AAACGGGGTGAAATCA	Ngn3 high EP	0.380345	1.0

Step 2: Fit marginal

Fit regression models for each gene (feature) based on your specification.

Note

Though we have already set the parallel method when creating the instance, we can change the setting temporarily when executing the methods one by one.

Here is an example where we change the parallel method to bpmapply , thus we need an extra bpparam object got from the get_bpparam() function. (Details of the function is included in Get BPPARAM section)

bpparam = scDesign3Py.get_bpparam(mode="MulticoreParam", show=False)
marginal = test.fit_marginal(
    data=const_data,
    mu_formula="s(pseudotime, k = 10, bs = 'cr')",
    sigma_formula="1",
    usebam=False,
    family_use="nb",
    n_cores=3,
    parallelization="bpmapply",
    bpparam=bpparam,
)

Warning

So far there has been an unfixed problem in converting the marginal list to OrdDict. Use .rx2 method to get values.

If you want to manipulate the results

print(marginal.rx2("Pyy").rx2("fit").rx2("coefficients"))

    (Intercept) s(pseudotime).1 s(pseudotime).2 s(pseudotime).3 s(pseudotime).4 
      1.9507488      -1.4124446      -1.8604564      -1.6723522      -0.8250225 
s(pseudotime).5 s(pseudotime).6 s(pseudotime).7 s(pseudotime).8 s(pseudotime).9 
      1.5108538       2.7502221       3.5714840       3.9859354       2.8515292 

marginal.rx2("Pyy").rx2("fit").rx2("coefficients")[0] = 1.96
print(marginal.rx2("Pyy").rx2("fit").rx2("coefficients"))

    (Intercept) s(pseudotime).1 s(pseudotime).2 s(pseudotime).3 s(pseudotime).4 
      1.9600000      -1.4124446      -1.8604564      -1.6723522      -0.8250225 
s(pseudotime).5 s(pseudotime).6 s(pseudotime).7 s(pseudotime).8 s(pseudotime).9 
      1.5108538       2.7502221       3.5714840       3.9859354       2.8515292 

Step 3: Fit copula

Fit a copula, obtain AIC and BIC.

copula = test.fit_copula(
    input_data=const_data["dat"],
    marginal_dict=marginal,
    important_feature="auto",
    copula="vine"
)

We can evaluate the model by checking the AIC.

copula["model_aic"]

aic.marginal    271036.643970
aic.copula       -2254.105974
aic.total       268782.537996
dtype: float64

Note

The return value is a rpy2.rlike.container.OrdDict . Not all elements in this dict like object have to be named but they have a given order. None as a key value means an absence of name for the element. For the values without a named key, you can call byindex method to get them by index (rank).

Here, we show an example to get the vine copula values. If you call byindex method, you will get a tuple with the first value being the key and the second value being the value.

The example fetches the R vinecop class property pair_copulas, and get the family info. The equal R version code is copula$copula_list$"1"$pair_copulas[[1]][[1]]$"family".

print(copula["copula_list"]["1"]["pair_copulas"].byindex(0)[-1].byindex(0)[-1]["family"])

[1] "gaussian"

Step 4: Extract parameters

Extract out the estimated parameters so you can make some modifications and use the modified parameters to generate new data if needed. The following parameters can be extracted:

a cell-by-gene mean matrix
a sigma matrix which is:
- a cell-by-gene matrix of $\frac{1}{\phi}$ for negative binomial distribution
- a cell-by-gene matrix of the standard deviation $\sigma$ for Gaussian distribution
- a cell-by-gene matrix of 1 for poisson distribution
a zero matrix which is:
- a cell-by-gene matrix of zero probabilities for zero-inflated negative binomial and zero-inflated poisson distributions
- a zero matrix for negative binomial, Gaussian, and poisson distributions

para = test.extract_para(
    marginal_dict=marginal,
    data=const_data["dat"],
    new_covariate=None,
)

The output matrix can be modified based on pandas.DataFrame syntax.

para["mean_mat"].iloc[0:6,0:5]

	Pyy	Iapp	Chgb	Rbp4	Spp1
AAACCTGAGAGGGATA	50.419465	1.181685	53.842404	20.240920	0.242246
AAACCTGGTAAGTGGC	1.301210	0.256551	5.701465	0.623444	0.215905
AAACGGGCAAAGAATC	104.864109	8.201911	38.288643	39.306191	0.217556
AAACGGGGTACAGTTC	162.884022	289.929054	18.686662	46.505082	0.276204
AAACGGGGTGAAATCA	0.740181	0.236181	0.941316	0.253139	0.229452
AAACGGGTCAAACAAG	0.565350	0.321886	0.076826	0.234823	6.812986

Step 5: Simulate new data

simu_new = test.simu_new(
    mean_mat=para["mean_mat"],
    sigma_mat=para["sigma_mat"],
    zero_mat=para["zero_mat"],
    copula_dict=copula["copula_list"],
    input_data=const_data["dat"],
    new_covariate=const_data["newCovariate"],
    important_feature=copula["important_feature"],
    filtered_gene=const_data["filtered_gene"],
)

The final simulated result is also a pandas.DataFrame object.

simu_new.iloc[0:6,0:6]

	Pyy	Iapp	Chgb	Rbp4	Spp1	Chga
AAACCTGAGAGGGATA	31.0	1.0	62.0	19.0	0.0	12.0
AAACCTGGTAAGTGGC	-0.0	1.0	5.0	-0.0	1.0	1.0
AAACGGGCAAAGAATC	201.0	2.0	128.0	29.0	-0.0	33.0
AAACGGGGTACAGTTC	22.0	261.0	13.0	88.0	0.0	10.0
AAACGGGGTGAAATCA	0.0	1.0	-0.0	-0.0	-0.0	-0.0
AAACGGGTCAAACAAG	0.0	-0.0	0.0	0.0	5.0	0.0

Run scDesign3Py pipeline step by step

Introduction

Step 0: Preparation

import packages

Read in data

Create the scDesign3 instance

Step 1: Construct data

Step 2: Fit marginal

Step 3: Fit copula

Step 4: Extract parameters

Step 5: Simulate new data

Create the `scDesign3` instance