Simulate single-cell ATAC-seq data

Introduction

In this example, we show how to use scDesign3Py to simulate the peak by cell matrix of scATAC-seq data.

Import packages and Read in data

import pacakges

import anndata as ad
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
import scDesign3Py

Read in the reference data

The raw data is from the Signac, which is of human peripheral blood mononuclear cells (PBMCs) provided by 10x Genomics. We pre-select the differentially accessible peaks between clusters. The data was converted to .h5ad file using the R package sceasy.

To save time, we subset 1000 cells and 100 genes

data = ad.read_h5ad("data/ATAC.h5ad")
data = data[data.obs.sample(1000, random_state=123).index,0:100]
data
View of AnnData object with n_obs × n_vars = 1000 × 100
    obs: 'nCount_peaks', 'nFeature_peaks', 'total', 'duplicate', 'chimeric', 'unmapped', 'lowmapq', 'mitochondrial', 'passed_filters', 'cell_id', 'TSS_fragments', 'DNase_sensitive_region_fragments', 'enhancer_region_fragments', 'promoter_region_fragments', 'on_target_fragments', 'blacklist_region_fragments', 'peak_region_fragments', 'nucleosome_signal', 'nucleosome_percentile', 'TSS.enrichment', 'TSS.percentile', 'pct_reads_in_peaks', 'blacklist_ratio', 'peaks_snn_res.0.8', 'seurat_clusters', 'ident', 'cell_type'
    var: 'name'
    obsm: 'X_lsi', 'X_umap'

Simulation

Here we choose the Zero-inflated Poisson (ZIP) as the distribution due to its good empirical performance. Users may explore other distributions (Poisson, NB, ZINB) since there is no conclusion on the best distribution of ATAC-seq.

test = scDesign3Py.scDesign3(n_cores=3,parallelization="pbmcmapply")
test.set_r_random_seed(123)
simu_res = test.scdesign3(
    anndata=data,
    default_assay_name="counts",
    celltype="cell_type",
    mu_formula="cell_type",
    sigma_formula="1",
    family_use="zip",
    usebam=False,
    corr_formula="cell_type",
    copula="gaussian",
)

We also run the TF-IDF transformation.

tfidf = TfidfTransformer()
org_tfidf = tfidf.fit_transform(data.X)
simu_tfidf = tfidf.fit_transform(simu_res["new_count"])

Then we can construct new data using the simulated count matrix and add the tfidf layer.

simu_data = ad.AnnData(X=simu_res["new_count"], obs=simu_res["new_covariate"], layers={"tfidf": simu_tfidf})
data.layers["tfidf"] = org_tfidf

Visualization

plot = scDesign3Py.plot_reduceddim(
    ref_anndata=data,
    anndata_list=simu_data,
    name_list=["Reference", "scDesign3"],
    assay_use="tfidf",
    if_plot=True,
    color_by="cell_type",
    n_pc=20,
    point_size=5,
)
plot["p_umap"]
../../_images/729ac1825d028053f24b6480fe0a7fa5a3be639af36d1b534f4e4590044c68da.png