Simulate datasets with multiple lineages

Introduction

In this example, we will show how to use scDesign3Py to simulate multiple lineages single-cell data.

Import packages and Read in data

import pacakges

import anndata as ad
import numpy as np
import scDesign3Py

Read in the reference data

The raw data is from the GEO with ID GSE72859, which describes myeloid progenitors from mouse bone marrow.

We pre-select the top 1000 highly variable genes. To save time, we only use the top 30 genes.

data = ad.read_h5ad("data/MARROW.h5ad")
data = data[:,0:30]
data
View of AnnData object with n_obs × n_vars = 2660 × 30
    obs: 'Seq_batch_ID', 'Amp_batch_ID', 'well_coordinates', 'Plate_ID', 'Pool_barcode', 'Cell_barcode', 'CD34_measurement', 'FcgR3_measurement', 'cluster', 'Size_Factor', 'cell_type', 'Pseudotime', 'State', 'cell_type2', 'ery_meg_lineage_score', 'cell_stemness_score', 'no_expression', 'sizeFactor', 'pseudotime1', 'pseudotime2', 'l1', 'l2'
    var: 'gene_short_name'
    obsm: 'X_umap'

As we can see, this example dataset has two sets of pseudotime, thus two lineages. The variables pseudotime1 and pseudotime2 contain the corresponding pseudotime for each cell. The variables l1 and l2 indicate whether a particular cell belong to the first and/or second lineages.

data.obs[["pseudotime1","pseudotime2","l1","l2"]].head()
pseudotime1 pseudotime2 l1 l2
W31105 0.950862 0.568357 TRUE TRUE
W31106 9.168276 -1.000000 TRUE FALSE
W31107 -1.000000 7.981990 FALSE TRUE
W31108 11.394132 -1.000000 TRUE FALSE
W31109 -1.000000 8.080133 FALSE TRUE

Simulation

Then, we can use this multiple-lineage dataset to generate new data by setting the parameter mu_formula as two smooth terms for each lineage.

test = scDesign3Py.scDesign3(n_cores=3,parallelization="pbmcmapply")
test.set_r_random_seed(123)
simu_res = test.scdesign3(
    anndata=data,
    default_assay_name="counts",
    celltype="cell_type",
    pseudotime=["pseudotime1", "pseudotime2", "l1", "l2"],
    mu_formula="s(pseudotime1, k = 10, by = l1, bs = 'cr') + s(pseudotime2, k = 10, by = l2, bs = 'cr')",
    sigma_formula="1",
    family_use="nb",
    usebam=False,
    corr_formula="1",
    copula="gaussian",
)

Then we can construct new data using the simulated count matrix.

simu_data = ad.AnnData(X=simu_res["new_count"], obs=simu_res["new_covariate"])
simu_data.layers["log_transformed"] = np.log1p(simu_data.X)
data.layers["log_transformed"] = np.log1p(data.X)

Visualization

plot1 = scDesign3Py.plot_reduceddim(
    ref_anndata=data,
    anndata_list=simu_data,
    name_list=["Reference", "scDesign3"],
    assay_use="log_transformed",
    if_plot=True,
    color_by="pseudotime1",
    n_pc=20,
    point_size=5,
)
plot2 = scDesign3Py.plot_reduceddim(
    ref_anndata=data,
    anndata_list=simu_data,
    name_list=["Reference", "scDesign3"],
    assay_use="log_transformed",
    if_plot=True,
    color_by="pseudotime2",
    n_pc=20,
    point_size=5,
)

Pseudotime1

plot1["p_umap"]
../../_images/023f1041c251646fe26758ded675cbba876f305d11baa180b11d63263fb20630.png

Pseudotime2

plot2["p_umap"]
../../_images/e7d8706d030541e5eec02742c22bfc47026ef06e85dd713b4e5830c5beea6517.png