Evaluate clustering goodness-of-fit

Introduction

In this example, we show how to use scDesign3Py to evaluate the clustering goodness-of-fit of different cell-type assignments. When true labels are unavailable and we have little prior knowledge, the BIC reported by scDesign3Py can serve as an unsupervised metric.
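For reference, for a fitted model with \(k\) parameters and maximized likelihood \(\hat{L}\) on \(n\) observations, the Bayesian information criterion is

\(\mathrm{BIC} = k \ln n - 2 \ln \hat{L},\)

so a lower BIC indicates a better fit after penalizing model complexity. The bic.marginal value extracted later in this example applies this criterion to the fitted per-gene marginal models.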

Import packages and read in data

Import packages

import anndata as ad
import pandas as pd
import scanpy as sc
import scDesign3Py

from sklearn.metrics import adjusted_rand_score

Read in data

The raw data come from the R package DuoClustering2018 and were converted to an .h5ad file using the R package sceasy.

The clustering results are also taken from DuoClustering2018, which provides precomputed results for a variety of clustering methods.

data = ad.read_h5ad("data/Zhengmix4eq.h5ad")
cluster_res = pd.read_csv("data/clusterres_Zhengmix4eq.csv", index_col=0)
# keep the PCA + k-means clustering results from the first run
cluster_res = cluster_res[cluster_res["method"].isin(["PCAKmeans"]) & (cluster_res["run"] == 1)]

For demonstration purposes, we use the Zhengmix4eq dataset from the package, restricted to the top 30 highly variable genes, together with the corresponding k-means clustering results for \(k = 2, \dots, 10\).

kmeans_res = cluster_res.groupby("k")  # one group of assignments per value of k
ncell = data.n_obs
ngene = 30
ntrain = int(ncell / 5)  # train on 20% of the cells to speed up fitting
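Iterating over a pandas groupby yields (key, sub-DataFrame) pairs, which the simulation loop below relies on. As a quick sanity check of the assignments we loaded, we can list the available values of \(k\):

# each group holds one clustering, keyed by its number of clusters k
print([k for k, _ in kmeans_res])  # expected: [2, 3, ..., 10]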

# select the top highly variable genes on log-normalized counts
data.layers["log"] = data.X.copy()
sc.pp.normalize_total(data, target_sum=1e4, layer="log")
sc.pp.log1p(data, layer="log")
sc.pp.highly_variable_genes(data, layer="log", n_top_genes=ngene)
data = data[:, data.var["highly_variable"]]
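The subsetting above should leave exactly ngene genes; a one-line verification:

print(data.n_vars)  # expected: 30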

train_data = data[data.obs.sample(n=ntrain, random_state=123).index, :].copy()
train_data
AnnData object with n_obs × n_vars = 711 × 30
    obs: 'barcode', 'phenoid', 'total_features', 'log10_total_features', 'total_counts', 'log10_total_counts', 'pct_counts_top_50_features', 'pct_counts_top_100_features', 'pct_counts_top_200_features', 'pct_counts_top_500_features', 'sizeFactor', 'cell_type'
    var: 'id', 'symbol', 'mean_counts', 'log10_mean_counts', 'rank_counts', 'n_cells_counts', 'pct_dropout_counts', 'total_counts', 'log10_total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
    obsm: 'X_pca', 'X_tsne'
    layers: 'logcounts', 'normcounts', 'log'
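Before fitting, it is worth confirming that the 20% subsample still covers all four cell types; in this dataset the annotated labels are stored in the phenoid column of obs:

# distribution of true cell types in the training subsample
print(train_data.obs["phenoid"].value_counts())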

Simulation

We then use each candidate cell-type assignment in turn to fit the model and simulate new data.

simu_res = {}
for key, value in kmeans_res:
    print(f"Fitting with the k = {key} assignment.")
    # map each training cell to its cluster label under this assignment
    tmp = value.set_index("cell")
    train_data.obs["cell_type"] = tmp.loc[train_data.obs.index, "cluster"].tolist()
    # store the labels as string categories
    train_data.obs["cell_type"] = train_data.obs["cell_type"].astype("int").astype("str").astype("category")

    test = scDesign3Py.scDesign3(n_cores=2, parallelization="pbmcmapply")
    test.set_r_random_seed(123)
    # NB marginals whose mean and dispersion depend on the assigned cell type,
    # tied together with a Gaussian copula
    res = test.scdesign3(anndata=train_data,
                         celltype="cell_type",
                         corr_formula="1",
                         mu_formula="cell_type",
                         sigma_formula="cell_type",
                         copula="gaussian",
                         default_assay_name="counts",
                         family_use="nb",
                         usebam=False)

    simu_res[key] = res
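Each entry of simu_res stores the scDesign3 output for one value of \(k\). For instance, the marginal BIC that we extract below can be inspected directly (here for \(k = 2\)):

# marginal BIC of the model fitted with the k = 2 assignment
print(simu_res[2]["model_bic"]["bic.marginal"])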

Visualization

After the simulations, we can compare the BIC reported by our package with the adjusted Rand index (ARI) between each clustering and the true labels to evaluate the quality of the k-means clusterings.

import matplotlib.pyplot as plt
bics = pd.Series([v["model_bic"]["bic.marginal"] for v in simu_res.values()])
# ARI between each k-means assignment and the true labels
ari = pd.Series([adjusted_rand_score(v["cluster"], v["trueclass"]) for _, v in kmeans_res])
spearman_corr = bics.corr(ari, method="spearman")

# plot
plt.scatter(x=ari,y=bics)
plt.xlabel('ARI')
plt.ylabel('scDesign3 BIC')
plt.text(x=0.25,y=25500,s="Spearman Correlation: %.2f" % spearman_corr)
plt.show()
Figure: scatter plot of scDesign3 BIC against ARI for \(k = 2, \dots, 10\), annotated with the Spearman correlation.
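Since a lower BIC indicates a better model fit while a higher ARI indicates closer agreement with the true labels, a strongly negative Spearman correlation supports using the scDesign3 BIC as an unsupervised proxy for clustering quality when true labels are unavailable.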