Simulate datasets with multiple lineages

Introduction

In this example, we will show how to use scDesign3Py to simulate multiple lineages single-cell data.

Import packages and Read in data

import pacakges

import anndata as ad
import numpy as np
import scDesign3Py

Read in the reference data

The raw data is from the GEO with ID GSE72859, which describes myeloid progenitors from mouse bone marrow.

We pre-select the top 1000 highly variable genes. To save time, we only use the top 30 genes.

data = ad.read_h5ad("data/MARROW.h5ad")
data = data[:,0:30]
data

View of AnnData object with n_obs × n_vars = 2660 × 30
    obs: 'Seq_batch_ID', 'Amp_batch_ID', 'well_coordinates', 'Plate_ID', 'Pool_barcode', 'Cell_barcode', 'CD34_measurement', 'FcgR3_measurement', 'cluster', 'Size_Factor', 'cell_type', 'Pseudotime', 'State', 'cell_type2', 'ery_meg_lineage_score', 'cell_stemness_score', 'no_expression', 'sizeFactor', 'pseudotime1', 'pseudotime2', 'l1', 'l2'
    var: 'gene_short_name'
    obsm: 'X_umap'

As we can see, this example dataset has two sets of pseudotime, thus two lineages. The variables pseudotime1 and pseudotime2 contain the corresponding pseudotime for each cell. The variables l1 and l2 indicate whether a particular cell belong to the first and/or second lineages.

data.obs[["pseudotime1","pseudotime2","l1","l2"]].head()

	pseudotime1	pseudotime2	l1	l2
W31105	0.950862	0.568357	TRUE	TRUE
W31106	9.168276	-1.000000	TRUE	FALSE
W31107	-1.000000	7.981990	FALSE	TRUE
W31108	11.394132	-1.000000	TRUE	FALSE
W31109	-1.000000	8.080133	FALSE	TRUE

Simulation

Then, we can use this multiple-lineage dataset to generate new data by setting the parameter mu_formula as two smooth terms for each lineage.

test = scDesign3Py.scDesign3(n_cores=3,parallelization="pbmcmapply")
test.set_r_random_seed(123)
simu_res = test.scdesign3(
    anndata=data,
    default_assay_name="counts",
    celltype="cell_type",
    pseudotime=["pseudotime1", "pseudotime2", "l1", "l2"],
    mu_formula="s(pseudotime1, k = 10, by = l1, bs = 'cr') + s(pseudotime2, k = 10, by = l2, bs = 'cr')",
    sigma_formula="1",
    family_use="nb",
    usebam=False,
    corr_formula="1",
    copula="gaussian",
)

Then we can construct new data using the simulated count matrix.

simu_data = ad.AnnData(X=simu_res["new_count"], obs=simu_res["new_covariate"])
simu_data.layers["log_transformed"] = np.log1p(simu_data.X)
data.layers["log_transformed"] = np.log1p(data.X)

Visualization

plot1 = scDesign3Py.plot_reduceddim(
    ref_anndata=data,
    anndata_list=simu_data,
    name_list=["Reference", "scDesign3"],
    assay_use="log_transformed",
    if_plot=True,
    color_by="pseudotime1",
    n_pc=20,
    point_size=5,
)
plot2 = scDesign3Py.plot_reduceddim(
    ref_anndata=data,
    anndata_list=simu_data,
    name_list=["Reference", "scDesign3"],
    assay_use="log_transformed",
    if_plot=True,
    color_by="pseudotime2",
    n_pc=20,
    point_size=5,
)

Pseudotime1

plot1["p_umap"]

../../_images/023f1041c251646fe26758ded675cbba876f305d11baa180b11d63263fb20630.png

Pseudotime2

plot2["p_umap"]

../../_images/e7d8706d030541e5eec02742c22bfc47026ef06e85dd713b4e5830c5beea6517.png