Marginal distribution for genes

Introduction

This example explains the forms the covariate function can take when fitting the marginal distribution of each gene, depending on the covariate type (cell type, pseudotime, or spatial location).

Notation

The following notation is used:

  • \({\mathbf{Y}} = [Y_{ij}] \in \mathbb{R}^{n \times m}\): the cell-by-feature matrix with \(n\) cells as rows, \(m\) features as columns, and \(Y_{ij}\) as the measurement of feature \(j\) in cell \(i\); for single-cell sequencing data, \({\mathbf{Y}}\) is often a count matrix.

  • \(\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_n]^T \in \mathbb{R}^{n\times p}\): the cell-by-state-covariate matrix with \(n\) cells as rows and \(p\) cell-state covariates as columns; example covariates are cell type, cell pseudotime, and cell spatial locations.

  • \(\mathbf{Z} = [\mathbf{b}, \mathbf{c}]\): the cell-by-design-covariate matrix with \(n\) cells as rows and two design covariates (batch and condition) as columns; \(\mathbf{b} = (b_1, \ldots, b_n)^T\) has \(b_i \in \{1, \ldots, B \}\) representing cell \(i\)’s batch, and \(\mathbf{c} = (c_1, \ldots, c_n)^T\) has \(c_i \in \{1, \ldots, C \}\) representing cell \(i\)’s condition.
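The layout of the three matrices can be sketched with a toy data set. This is only an illustration of the notation: every value below is made up, and plain Python lists stand in for whatever matrix type is actually used.

```python
# Toy illustration of the notation; all numbers are invented for the example.
n, m, p = 4, 3, 1  # cells, features, cell-state covariates

# Y: n x m count matrix (rows = cells, columns = features).
Y = [
    [0, 3, 1],
    [2, 0, 0],
    [5, 1, 0],
    [0, 0, 4],
]

# X: n x p matrix of cell-state covariates (here one pseudotime value per cell).
X = [[0.0], [0.4], [0.9], [1.5]]

# Z = [b, c]: batch label in {1, ..., B} and condition label in {1, ..., C}.
b = [1, 1, 2, 2]
c = [1, 2, 1, 2]
Z = list(zip(b, c))  # row i is z_i = (b_i, c_i)

# Measurement of feature j in cell i, with that cell's covariates (0-based here).
i, j = 2, 0
print(Y[i][j], X[i], Z[i])
```

Note that the text indexes cells and features from 1, whereas the Python lists are 0-based.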

For each feature \(j=1,\ldots,m\) in every cell \(i=1,\ldots,n\), the measurement \(Y_{ij}\), conditional on cell \(i\)’s state covariates \(\mathbf{x}_i\) and design covariates \(\mathbf{z}_i = (b_i, c_i)^T\), is assumed to follow a distribution \(F_{j}( \cdot~|~\mathbf{x}_i, \mathbf{z}_i~;~\mu_{ij}, \sigma_{ij}, p_{ij})\), which is specified by a generalized additive model for location, scale and shape (GAMLSS):

\[
\begin{cases}
Y_{ij}~|~\mathbf{x}_i, \mathbf{z}_i &\overset{\mathrm{ind}}{\sim} F_{j}( \cdot~|~\mathbf{x}_i, \mathbf{z}_i~;~\mu_{ij}, \sigma_{ij}, p_{ij})\\
\theta_{j}(\mu_{ij}) &= \alpha_{j0} + \alpha_{jb_i} + \alpha_{jc_i} + f_{jc_i}(\mathbf{x}_i) \\
\log(\sigma_{ij}) &= \beta_{j0}+ \beta_{jb_i} + \beta_{jc_i} + g_{jc_i}(\mathbf{x}_i) \\
\operatorname{logit}(p_{ij}) &= \gamma_{j0} + \gamma_{jb_i}+ \gamma_{jc_i}+ h_{jc_i}(\mathbf{x}_i)
\end{cases}
\]

Here \(\mu_{ij}\), \(\sigma_{ij}\), and \(p_{ij}\) are the location, scale, and shape parameters of \(F_j\) (for zero-inflated distributions, \(p_{ij}\) is the zero-inflation proportion), and \(\theta_j(\cdot)\) is the link function for the location parameter. Each linear predictor combines an intercept, a batch effect, a condition effect, and a condition-specific function of the cell-state covariates.

The various specifications of \(f_{jc_i}(\cdot)\), \(g_{jc_i}(\cdot)\), and \(h_{jc_i}(\cdot)\) are summarized in the next section.
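To make the three link equations concrete, the sketch below evaluates them for a single feature in a single cell. It is an illustration under stated assumptions, not the package's implementation: the helper names and all coefficient values are hypothetical, and \(\theta_j\) is assumed to be the log link (the actual link depends on the chosen distribution).

```python
import math


def inv_logit(eta):
    """Inverse of logit: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def marginal_parameters(alpha, beta, gamma, batch, condition,
                        f_x=0.0, g_x=0.0, h_x=0.0):
    """Return (mu, sigma, p) for one feature j in one cell i.

    alpha, beta, gamma: dicts with keys "intercept", "batch", "condition"
    holding the coefficients of the three linear predictors.
    f_x, g_x, h_x: values of f, g, h at the cell's state covariates.
    """
    def linear_predictor(coef, smooth):
        return (coef["intercept"] + coef["batch"][batch]
                + coef["condition"][condition] + smooth)

    mu = math.exp(linear_predictor(alpha, f_x))     # ASSUMPTION: theta_j = log
    sigma = math.exp(linear_predictor(beta, g_x))   # log link, as in the model
    p = inv_logit(linear_predictor(gamma, h_x))     # logit link, as in the model
    return mu, sigma, p


# Hypothetical coefficients; batch 1 and condition 1 act as reference levels.
alpha = {"intercept": 1.0, "batch": {1: 0.0, 2: 0.2}, "condition": {1: 0.0, 2: -0.5}}
beta = {"intercept": -1.0, "batch": {1: 0.0, 2: 0.1}, "condition": {1: 0.0, 2: 0.0}}
gamma = {"intercept": 0.0, "batch": {1: 0.0, 2: 0.0}, "condition": {1: 0.0, 2: 0.0}}

mu, sigma, p = marginal_parameters(alpha, beta, gamma, batch=2, condition=1, f_x=0.3)
print(mu, sigma, p)
```

Because the scale and shape predictors pass through `exp` and the inverse logit, \(\sigma_{ij}\) stays positive and \(p_{ij}\) stays in \((0, 1)\) for any coefficient values.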

Summary

| Covariate type | Covariate form | Function form | Explanation | Geometric meaning | Code example |
| --- | --- | --- | --- | --- | --- |
| Discrete cell type | \(x_i \in \left\{1, \ldots, K_C\right\}\) | \(f_{jc_i}(x_i) = \alpha_{jc_ix_i}\) | Cell type \(x_i\) has the effect \(\alpha_{jc_ix_i}\); for identifiability, \(\alpha_{jc_ix_i} = 0\) if \(x_i = 1\) | One intercept for each cell type | `mu_formula = "cell_type"` |
| Continuous pseudotime in one lineage | \(x_i \in [0,\infty)\) | \(f_{jc_i}({x}_i) = \sum_{k = 1}^Kb_{jc_ik}(x_{i})\beta_{jc_ik}\) | \(b_{jc_ik}(\cdot)\) is a cubic spline basis function; \(K\) is the dimension of the basis | A curve along the pseudotime | `mu_formula = "s(pseudotime)"` |
| Continuous pseudotimes in \(p\) lineages | \(\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^T \in [0,\infty)^{p}\) | \(f_{jc_i}(\mathbf{x}_i) = \sum_{l = 1}^p \sum_{k = 1}^Kb_{jc_ilk}(x_{il})\beta_{jc_ilk}\) | \(b_{jc_ilk}(\cdot)\) is a cubic spline basis function; \(K\) is the dimension of the basis (default \(K=10\)) | One curve along each lineage | `mu_formula = "s(pseudotime1, k = 10, by = l1, bs = 'cr') + s(pseudotime2, k = 10, by = l2, bs = 'cr')"` (\(p = 2\) in this case) |
| Spatial location | \(\mathbf{x}_i = (x_{i1}, x_{i2})^T \in \mathbb{R}^{2}\) | \(f_{jc_i}(\mathbf{x}_i) = f_{jc_i}^{\operatorname{GP}}(x_{i1}, x_{i2}, K)\) | \(f_{jc_i}^{\operatorname{GP}}(\cdot, \cdot, K)\) is a Gaussian process smoother; \(K\) is the dimension of the basis (default \(K=400\)) | A smooth surface | `mu_formula = "s(spatial1, spatial2, bs = 'gp', k = 400)"` |

For simplicity, we only show the form of \(f_{jc_i}(\cdot)\) because \(g_{jc_i}(\cdot)\) and \(h_{jc_i}(\cdot)\) have the same form.
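The first three function forms can be sketched in a few lines of Python. This is a minimal illustration, not the fitting code: all function names and coefficients are hypothetical, and a cubic truncated-power basis is swapped in for the cubic regression spline basis (`bs = 'cr'`) because it is easy to write out by hand. The basis has no constant term, since the intercept \(\alpha_{j0}\) is already part of the linear predictor.

```python
def cell_type_effect(x, effects):
    """Discrete cell type: f(x) = alpha_x, with alpha_1 = 0 for identifiability."""
    return 0.0 if x == 1 else effects[x]


def truncated_power_basis(x, knots):
    """Cubic truncated-power basis [x, x^2, x^3, (x - k_1)_+^3, ...].

    ASSUMPTION: a simple stand-in for a cubic regression spline basis;
    its dimension is K = 3 + len(knots).
    """
    return [x, x ** 2, x ** 3] + [max(x - k, 0.0) ** 3 for k in knots]


def spline_effect(x, knots, coefs):
    """One lineage: f(x) = sum_k b_k(x) * beta_k."""
    basis = truncated_power_basis(x, knots)
    return sum(b_k * beta_k for b_k, beta_k in zip(basis, coefs))


def multi_lineage_effect(xs, knots, coefs_per_lineage):
    """p lineages: f(x) = sum_l sum_k b_lk(x_l) * beta_lk."""
    return sum(spline_effect(x_l, knots, coefs_l)
               for x_l, coefs_l in zip(xs, coefs_per_lineage))


# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
effects = {2: 0.7, 3: -0.2}            # cell types 2 and 3; type 1 is the reference
knots = [1.0, 3.0]                     # K = 5 basis functions
coefs = [1.0, 0.0, 0.0, 1.0, 1.0]

print(cell_type_effect(3, effects))
print(spline_effect(2.0, knots, coefs))
print(multi_lineage_effect([2.0, 0.0], knots, [coefs, [1.0] * 5]))
```

In each case \(f_{jc_i}\) is linear in its coefficients, so switching covariate type only changes the columns of the design matrix, not the structure of the model.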