Run scDesign3Py pipeline step by step

Introduction

In this section, we will show how to run the whole scDesign3Py pipeline step by step.

Some basic introduction of scDesign3Py package is included in the All in one simulation section, so if you have no idea of how to use the package, you can first go through that tutorial.

In this section, we will introduce the following methods according to their actual execution order.

Step 0: Preparation

import packages

import anndata as ad
import scDesign3Py

Read in data

The raw data is from the scvelo and we only choose top 30 genes to save time.

data = ad.read_h5ad("data/PANCREAS.h5ad")
data = data[:, 0:30]
data
View of AnnData object with n_obs × n_vars = 2087 × 30
    obs: 'clusters_coarse', 'clusters', 'S_score', 'G2M_score', 'cell_type', 'sizeFactor', 'pseudotime'
    var: 'highly_variable_genes'
    obsm: 'X_pca', 'X_umap', 'X_x_pca', 'X_x_umap'

Create the scDesign3 instance

test = scDesign3Py.scDesign3(n_cores=3, parallelization="pbmcmapply", return_py=True)
test.set_r_random_seed(123)

Step 1: Construct data

This function construct the input dataset.

Note

The default assay counts stored in anndata.AnnData.X don’t have a specified name, when you are going to use the default assay, you should assign a name to it in the default_assay_name parameter. Else if you are using the assay stored in anndata.AnnData.layers[assay_use], you can specify the name in the assay_use parameter.

const_data = test.construct_data(
    anndata=data,
    default_assay_name="counts",
    celltype="cell_type",
    pseudotime="pseudotime",
    corr_formula="1",
)

The results are all converted to pandas.DataFrame so that you can easily check and manipulate the result in python.

const_data["dat"].head()
cell_type pseudotime corr_group
AAACCTGAGAGGGATA Pre-endocrine 0.677236 1.0
AAACCTGGTAAGTGGC Ngn3 high EP 0.431569 1.0
AAACGGGCAAAGAATC Beta 0.749373 1.0
AAACGGGGTACAGTTC Beta 0.925441 1.0
AAACGGGGTGAAATCA Ngn3 high EP 0.380345 1.0

Step 2: Fit marginal

Fit regression models for each gene (feature) based on your specification.

Note

Though we have already set the parallel method when creating the instance, we can change the setting temporarily when executing the methods one by one.

Here is an example where we change the parallel method to bpmapply , thus we need an extra bpparam object got from the get_bpparam() function. (Details of the function is included in Get BPPARAM section)

bpparam = scDesign3Py.get_bpparam(mode="MulticoreParam", show=False)
marginal = test.fit_marginal(
    data=const_data,
    mu_formula="s(pseudotime, k = 10, bs = 'cr')",
    sigma_formula="1",
    usebam=False,
    family_use="nb",
    n_cores=3,
    parallelization="bpmapply",
    bpparam=bpparam,
)

Warning

So far there has been an unfixed problem in converting the marginal list to OrdDict. Use .rx2 method to get values.

If you want to manipulate the results

print(marginal.rx2("Pyy").rx2("fit").rx2("coefficients"))
    (Intercept) s(pseudotime).1 s(pseudotime).2 s(pseudotime).3 s(pseudotime).4 
      1.9507488      -1.4124446      -1.8604564      -1.6723522      -0.8250225 
s(pseudotime).5 s(pseudotime).6 s(pseudotime).7 s(pseudotime).8 s(pseudotime).9 
      1.5108538       2.7502221       3.5714840       3.9859354       2.8515292 
marginal.rx2("Pyy").rx2("fit").rx2("coefficients")[0] = 1.96
print(marginal.rx2("Pyy").rx2("fit").rx2("coefficients"))
    (Intercept) s(pseudotime).1 s(pseudotime).2 s(pseudotime).3 s(pseudotime).4 
      1.9600000      -1.4124446      -1.8604564      -1.6723522      -0.8250225 
s(pseudotime).5 s(pseudotime).6 s(pseudotime).7 s(pseudotime).8 s(pseudotime).9 
      1.5108538       2.7502221       3.5714840       3.9859354       2.8515292 

Step 3: Fit copula

Fit a copula, obtain AIC and BIC.

copula = test.fit_copula(
    input_data=const_data["dat"],
    marginal_dict=marginal,
    important_feature="auto",
    copula="vine"
)

We can evaluate the model by checking the AIC.

copula["model_aic"]
aic.marginal    271036.643970
aic.copula       -2254.105974
aic.total       268782.537996
dtype: float64

Note

The return value is a rpy2.rlike.container.OrdDict . Not all elements in this dict like object have to be named but they have a given order. None as a key value means an absence of name for the element. For the values without a named key, you can call byindex method to get them by index (rank).

Here, we show an example to get the vine copula values. If you call byindex method, you will get a tuple with the first value being the key and the second value being the value.

The example fetches the R vinecop class property pair_copulas, and get the family info. The equal R version code is copula$copula_list$"1"$pair_copulas[[1]][[1]]$"family".

print(copula["copula_list"]["1"]["pair_copulas"].byindex(0)[-1].byindex(0)[-1]["family"])
[1] "gaussian"

Step 4: Extract parameters

Extract out the estimated parameters so you can make some modifications and use the modified parameters to generate new data if needed. The following parameters can be extracted:

  • a cell-by-gene mean matrix

  • a sigma matrix which is:

    • a cell-by-gene matrix of \(\frac{1}{\phi}\) for negative binomial distribution

    • a cell-by-gene matrix of the standard deviation \(\sigma\) for Gaussian distribution

    • a cell-by-gene matrix of 1 for poisson distribution

  • a zero matrix which is:

    • a cell-by-gene matrix of zero probabilities for zero-inflated negative binomial and zero-inflated poisson distributions

    • a zero matrix for negative binomial, Gaussian, and poisson distributions

para = test.extract_para(
    marginal_dict=marginal,
    data=const_data["dat"],
    new_covariate=None,
)

The output matrix can be modified based on pandas.DataFrame syntax.

para["mean_mat"].iloc[0:6,0:5]
Pyy Iapp Chgb Rbp4 Spp1
AAACCTGAGAGGGATA 50.419465 1.181685 53.842404 20.240920 0.242246
AAACCTGGTAAGTGGC 1.301210 0.256551 5.701465 0.623444 0.215905
AAACGGGCAAAGAATC 104.864109 8.201911 38.288643 39.306191 0.217556
AAACGGGGTACAGTTC 162.884022 289.929054 18.686662 46.505082 0.276204
AAACGGGGTGAAATCA 0.740181 0.236181 0.941316 0.253139 0.229452
AAACGGGTCAAACAAG 0.565350 0.321886 0.076826 0.234823 6.812986

Step 5: Simulate new data

simu_new = test.simu_new(
    mean_mat=para["mean_mat"],
    sigma_mat=para["sigma_mat"],
    zero_mat=para["zero_mat"],
    copula_dict=copula["copula_list"],
    input_data=const_data["dat"],
    new_covariate=const_data["newCovariate"],
    important_feature=copula["important_feature"],
    filtered_gene=const_data["filtered_gene"],
)

The final simulated result is also a pandas.DataFrame object.

simu_new.iloc[0:6,0:6]
Pyy Iapp Chgb Rbp4 Spp1 Chga
AAACCTGAGAGGGATA 31.0 1.0 62.0 19.0 0.0 12.0
AAACCTGGTAAGTGGC -0.0 1.0 5.0 -0.0 1.0 1.0
AAACGGGCAAAGAATC 201.0 2.0 128.0 29.0 -0.0 33.0
AAACGGGGTACAGTTC 22.0 261.0 13.0 88.0 0.0 10.0
AAACGGGGTGAAATCA 0.0 1.0 -0.0 -0.0 -0.0 -0.0
AAACGGGTCAAACAAG 0.0 -0.0 0.0 0.0 5.0 0.0