scDesign3Py.scDesign3.scdesign3

scDesign3.scdesign3(anndata: anndata.AnnData, corr_formula: str | list[str], mu_formula: str, sigma_formula: str, family_use: Literal['binomial', 'poisson', 'nb', 'zip', 'zinb', 'gaussian'] | list[str], usebam: bool, assay_use: str | None = None, default_assay_name: str | None = None, celltype: str | None = None, pseudotime: str | list[str] | None = None, spatial: list[str] | None = None, other_covariates: str | list[str] | None = None, ncell: int = 'default', copula: Literal['gaussian', 'vine'] = 'gaussian', empirical_quantile: bool = False, family_set: str | list[str] = ['gaussian', 'indep'], important_feature: Literal['all', 'auto'] | list[bool] = 'all', dt: bool = True, pseudo_obs: bool = False, fastmvn: bool = False, nonnegative: bool = True, nonzerovar: bool = False, return_model: bool = False, n_cores: int = 'default', parallelization: Literal['mcmapply', 'bpmapply', 'pbmcmapply'] = 'default', bpparam: rpy2.robjects.methods.RS4 | None = 'default', trace: bool = False, return_py: bool = 'default') → rpy2.robjects.vectors.ListVector

The wrapper for the whole scDesign3 pipeline.
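
As a rough illustration of how the wrapper might be called, the sketch below collects the required keyword arguments and checks family_use and copula against the values allowed by the signature. The helper names build_kwargs and run_pipeline, the covariate column "cell_type", the assay name "counts" and the formula strings are assumptions made for illustration. It also assumes the scDesign3 class can be constructed with its default settings. run_pipeline needs a working R installation with scDesign3, so it is defined here but not executed.

```python
FAMILIES = ("binomial", "poisson", "nb", "zip", "zinb", "gaussian")
COPULAS = ("gaussian", "vine")


def build_kwargs(celltype, family_use="nb", copula="gaussian"):
    """Collect keyword arguments for scdesign3 (hypothetical helper)."""
    if family_use not in FAMILIES:
        raise ValueError(f"family_use must be one of {FAMILIES}")
    if copula not in COPULAS:
        raise ValueError(f"copula must be one of {COPULAS}")
    return {
        "default_assay_name": "counts",  # assumed name for anndata.X
        "celltype": celltype,            # column in anndata.obs
        "mu_formula": celltype,          # assumed: mean depends on cell type
        "sigma_formula": "1",            # assumed: intercept-only sigma
        "family_use": family_use,
        "usebam": False,
        "corr_formula": "1",             # one correlation for all cells
        "copula": copula,
    }


def run_pipeline(adata, **kwargs):
    """Call the wrapper; requires R, scDesign3 and scDesign3Py."""
    import scDesign3Py  # imported lazily so build_kwargs runs without R

    model = scDesign3Py.scDesign3()  # assumption: default initial settings
    return model.scdesign3(anndata=adata, **kwargs)


kwargs = build_kwargs("cell_type")
print(kwargs["family_use"], kwargs["copula"])
```

With an AnnData object at hand, the call would then look like run_pipeline(adata, **kwargs). The helper only handles a single family string, although family_use also accepts a list of strings.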

Arguments:

anndata: anndata.AnnData: anndata.AnnData object to store the single cell experiment information.
corr_formula: str or list[str]: Indicates the groups for the correlation structure. If ‘1’, a single correlation structure is estimated for all cells. If ‘ind’, no correlation is estimated (features are independent). Otherwise, the named variable defines the groups used for the correlation structures.
assay_use: str (default: None): Indicates the assay you will use. If None, please specify a name for the assay stored in anndata.AnnData.X in @default_assay_name.
default_assay_name: str (default: None): Specified only when @assay_use is None. Assign a name to your default single cell experiment.
celltype: str (default: None): The name of cell type variable in the anndata.AnnData.obs.
pseudotime: str or list[str] (default: None): The name of the pseudotime variable and, if they exist, the names of multiple lineages in the anndata.AnnData.obs.
spatial: list[str] (default: None): The names of spatial coordinates in the anndata.AnnData.obs.
other_covariates: str or list[str] (default: None): The other covariates you want to include in the data.
ncell: int (default: ‘default’): The number of cells you want to simulate. Default is ‘default’, which means only the provided cells in the anndata.AnnData object will be used. If an arbitrary number is provided, the function will use Vine Copula to simulate a new covariate matrix.
mu_formula: str: A string of the mu parameter formula
sigma_formula: str: A string of the sigma parameter formula
family_use: str or list[str]: A string or a list of strings of the marginal distribution. Must be one of ‘binomial’, ‘poisson’, ‘nb’, ‘zip’, ‘zinb’ or ‘gaussian’, which represent the binomial, Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial and Gaussian distributions respectively.
usebam: bool: If True, call R function mgcv::bam for calculation acceleration.
copula: str (default: ‘gaussian’): A string of the copula choice. Must be one of ‘gaussian’ or ‘vine’. Note that the vine copula may model high-dimensional data better, but can be very slow when there are more than 1000 features.
empirical_quantile: bool (default: False): Please only use it if you clearly know what will happen! If True, DO NOT fit the copula and use the EMPIRICAL CDF values of the original data; it will make the simulated data fixed (no randomness). Only works if ncell is the same as your original data.
family_set: str or list[str] (default: [‘gaussian’, ‘indep’]): A string or a string list of the bivariate copula families.
important_feature: str or list[bool] (default: ‘all’): A string or list which indicates whether a gene will be used in correlation estimation or not. If this is a string, it must be either “all” (using all genes) or “auto”, which indicates that the genes will be automatically selected based on the proportion of zero expression across cells for each gene. Genes with a zero proportion greater than 0.8 will be excluded from gene-gene correlation estimation. If this is a list, it should be a list of booleans with length equal to the number of genes in @anndata. True means the corresponding gene will be included in gene-gene correlation estimation and False means it will be excluded.
dt: bool (default: True): If True, perform the distributional transformation to make the discrete data continuous. This is useful for discrete distributions (e.g., Poisson, NB). Note that for continuous data (e.g., Gaussian), DT does not make sense and should be set as False.
pseudo_obs: bool (default: False): If True, use the empirical quantiles instead of theoretical quantiles for fitting copula.
fastmvn: bool (default: False): If True, the sampling of multivariate Gaussian is done by the R package mvnfast, otherwise by the R package mvtnorm.
nonnegative: bool (default: True): If True, values < 0 in the synthetic data will be converted to 0. Default is True, since the expression matrix is nonnegative.
nonzerovar: bool (default: False): If True, for any gene with zero variance, a cell will be replaced with 1. This is designed to avoid potential errors in downstream steps, for example, PCA.
return_model: bool (default: False): If True, the marginal models and copula models will be returned.
n_cores: int (default: ‘default’): The number of cores to use. Default is ‘default’, which uses the setting given at initialization.
parallelization: str (default: ‘default’): The specific parallelization function to use. If ‘bpmapply’, first call method @get_bpparam. Default is ‘default’, which uses the setting given at initialization.
bpparam: rpy2.robjects.methods.RS4 (default: ‘default’): If @parallelization is ‘bpmapply’, first call function @get_bpparam to get the robject. If @parallelization is ‘mcmapply’ or ‘pbmcmapply’, it should be None. Default is ‘default’, which uses the setting given at initialization.
trace: bool (default: False): If True, the warning/error log and runtime for gam/gamlss will be returned.
return_py: bool (default: ‘default’): If True, functions will return a result that is easy to manipulate in Python. Default is ‘default’, which uses the setting given at initialization.
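
The “auto” rule described for important_feature can be pictured with a small pure-Python sketch: compute each gene's proportion of zeros across cells and keep the genes whose proportion is not greater than 0.8. The function name auto_important_feature and the toy matrix are hypothetical. This mimics the documented rule and is not the library's own code.

```python
def auto_important_feature(counts, max_zero_prop=0.8):
    """Return one bool per gene; counts is a cells x genes list of rows."""
    n_cells = len(counts)
    n_genes = len(counts[0])
    keep = []
    for g in range(n_genes):
        zeros = sum(1 for row in counts if row[g] == 0)
        # Genes with a zero proportion greater than the cutoff are excluded.
        keep.append(zeros / n_cells <= max_zero_prop)
    return keep


# Toy data: 6 cells (rows) x 3 genes (columns), values are made up.
toy = [
    [3, 0, 1],
    [0, 0, 2],
    [5, 0, 0],
    [1, 0, 4],
    [2, 7, 0],
    [4, 0, 3],
]
mask = auto_important_feature(toy)
print(mask)
```

The resulting list has one boolean per gene, which is the form expected when important_feature is passed as a list[bool].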

Output:

A dict-like object.

new_count: pandas.DataFrame

The new simulated count (expression) matrix.

Rows correspond to observations and columns correspond to genes.

new_covariate: pandas.DataFrame

A dataframe of the new covariate matrix.

model_aic: pandas.Series

An array with three values. In order, they are the marginal AIC, the copula AIC, the total AIC.

model_bic: pandas.Series

An array with three values. In order, they are the marginal BIC, the copula BIC, the total BIC.

marginal_list: rpy2.rlike.container.OrdDict

A dict of marginal regression models if return_model = True.

Note that although its name is list, it is actually a dict-like object.

corr_list: rpy2.rlike.container.OrdDict

A dict of correlation models (conditional copulas) if return_model = True.

Note that although its name is list, it is actually a dict-like object.
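
Since model_aic and model_bic each hold three values in a fixed order, it can help to attach names to them. The sketch below assumes the result behaves like a mapping with the keys listed above. label_criteria is a hypothetical helper, and a plain dict with placeholder numbers stands in for a real result.

```python
def label_criteria(values):
    """Name the three entries of model_aic or model_bic in documented order."""
    marginal, copula, total = values  # works for a list or a pandas.Series
    return {"marginal": marginal, "copula": copula, "total": total}


# Stand-in for a real result; the numbers are placeholders, not real output.
result = {"model_aic": [10.0, 2.0, 12.0], "model_bic": [11.0, 3.0, 14.0]}

aic = label_criteria(result["model_aic"])
bic = label_criteria(result["model_bic"])
print(aic["total"], bic["copula"])
```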