Introductions

Single-cell sequencing measures thousands of features for each cell, which is hard to perceive directly. Dimension reduction techniques such as PCA, t-SNE, and UMAP thus become essential to map the cells with high-dimensional information to a low dimensional space. The reduced dimension representations are then visualized with scatterplots to understand the latent structure of the data, to identify cell subpopulations or developmental trajectories, or to compare across different samples. However, we observe significant biases in such visualization procedure which could lead to problematic interpretations of the data in many real applications. The first bias arises when visualizing the gene expression levels or cell identities. The scatterplot only shows a subset of cells plotted last and the cells plotted earlier are masked and unseen. The second bias arises when comparing the cell type compositions from different types of samples. The scatterplot is biased by the unbalanced total numbers of cells across samples.

To this end, we develop a new visualization method, SCUBI, that tackles these biases. To address the first bias, SCUBI partitions the coordinate space into small non-overlapping squares and visualizes the aggregated information of cells within each square. To address the second bias, SCUBI calculates the proportions of cells falling in non-overlapping squares and visualizes the differences of cell proportions across samples. SCUBI more faithfully reveals the real biological information in the single-cell genomic datasets.

SCUBI comes with six modes designed for six different scenarios in single-cell data visualization. The first three modes are designed for cell-level visualizations and the last three modes are designed for comparisons across different types of samples.

Data loading

We demonstrate the utility of SCUBI using a small dataset, which is a subset of a single-cell RNA-seq dataset measuring the gene expression profiles in COVID-19 patients and healthy donors collected from Su et al. Cell, 2020. The dataset is in the format of a Seurat object and we used the following code to extract the first two dimensions of UMAP, the gene expression count matrix, and the severity levels of the samples from which the cells were collected. SCUBI accepts data generated by other software or other pipelines as well, as long as the format follows the basic requirements of SCUBI. Details of the requirements are described in each mode.

library(scubi)
library(Seurat)
## This version of bslib is designed to work with shiny version 1.6.0 or higher.
## Attaching SeuratObject
library(Matrix)
# read in the seurat object
data <- readRDS(paste0(system.file('data',package = 'scubi'),'/scubiexample.rds'))
# the first two dimensions of umap
umap1 <- data$umap@cell.embeddings[,1]
umap2 <- data$umap@cell.embeddings[,2]
str(umap1)
##  Named num [1:2500] -10.108 -9.388 -10.124 14.939 0.656 ...
##  - attr(*, "names")= chr [1:2500] "Mi:1" "HD:2" "Mod:3" "Se:4" ...
str(umap2)
##  Named num [1:2500] 2.78 1.17 2.63 -3.81 2.01 ...
##  - attr(*, "names")= chr [1:2500] "Mi:1" "HD:2" "Mod:3" "Se:4" ...
# the gene expression count matrix
count <- data$RNA@counts
str(count)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:2232404] 0 10 11 15 17 18 19 20 21 22 ...
##   ..@ p       : int [1:2501] 0 1189 1978 3036 4041 4819 5752 6527 7451 8340 ...
##   ..@ Dim     : int [1:2] 3282 2500
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:3282] "A1BG" "AAK1" "AAMP" "AASDHPPT" ...
##   .. ..$ : chr [1:2500] "Mi:1" "HD:2" "Mod:3" "Se:4" ...
##   ..@ x       : num [1:2232404] 1 2 2 1 1 1 1 1 1 2 ...
##   ..@ factors : list()
# total number of reads for each cell
readcount <- colSums(data)
# sample identity from which cells were colleted
sample <- sub(':.*','',colnames(count))
table(sample)
## sample
##   HD   Mi  Mod   Se 
## 1421  448  313  318
# sample design matrix
uniquesample <- unique(sample)
design <- data.frame(type=as.numeric(uniquesample=='Se'),row.names = uniquesample)
design

Mode 1: SCUBI for aggregated gene expression

This mode visualizes the expression of a single gene. It can be used to identify groups of cells with high or low expression levels of a marker gene. SCUBI partitions the coordinate space into small non-overlapping squares and visualizes the aggregated information across cells falling in each square.

There are four required inputs: the two dimensions (dim1 and dim2), the gene expression count matrix (count), and the name of the gene to be plotted (gene). Below shows an example plotting the expression of CD3D gene, which is a marker gene for T cells.

scubi_expression(dim1 = umap1, dim2 = umap2, count = count, gene = 'CD3D')

SCUBI supports the plotting of multiple genes on the same plot. For example, below shows and example plotting the expression of CD3D, CD4, CD14 , and CCL3 genes.

scubi_expression(dim1 = umap1, dim2 = umap2, count = count, gene = list('CD3D','CD4','CD14','CCL3'))

Optionally, the size of the squares can be controlled by resolution. Resolution is set to be around 10 for UMAP by default. A lower resolution will result in larger squares which aggregates information from more cells but with lower resolution. A small resolution is desired with small number of cells and a large resolution is desired with large number of cells.

Below shows the effect if we decrease resolution to 2, which results in larger squares.

scubi_expression(dim1 = umap1, dim2 = umap2, count = count, gene = 'CD3D', resolution=2)

SCUBI provides a function to assist in deciding the resolution. For a vector of potential resolution, it calculates the averaged numbers of cells falling in squares. A scatterplot is generated to visualize how this averaged numbers of cells changes with the resolution. A desired resolution is chosen at the elbow point of the plot.

resolution(dim1 = umap1, dim2 = umap2)

Users can supplier their own palettes using the ‘palette’ argument.

scubi_expression(dim1 = umap1, dim2 = umap2, count = count, gene = 'CD3D', palette = colorRampPalette(c('orange','blue'))(10))

By default, SCUBI will automatically generate a plot. For more advanced usage, the plotting process can be turned off and SCUBI will return a data frame used for plotting instead. Users can further customize their own plots using the returned data frame. In this mode, the first four columns define the x- and y-axis boundaries of each square and the last column contains the averaged expression levels.

df <- scubi_expression(dim1 = umap1, dim2 = umap2, count = count, gene = 'CD3D', resolution=2, plot = FALSE)
str(df)
## 'data.frame':    203 obs. of  5 variables:
##  $ xmin      : num  -0.75 -0.75 -0.75 -0.75 -5.25 -5.25 -5.25 -5.25 -5.25 -5.75 ...
##  $ xmax      : num  -0.25 -0.25 -0.25 -0.25 -4.75 -4.75 -4.75 -4.75 -4.75 -5.25 ...
##  $ ymin      : num  -15.75 -16.25 0.75 1.25 -0.25 ...
##  $ ymax      : num  -15.25 -15.75 1.25 1.75 0.25 ...
##  $ expression: num  0 0 6.56 6.16 6.51 ...

Mode 2: SCUBI for averaged continuous measure

This mode visualizes a continuous measure of a cell, such as the total number of reads or the proportion of reads mapped to the mitochondrial genome. It can be used to identify groups of cells with high or low values of the continuous measure. SCUBI partitions the coordinate space into small non-overlapping squares and visualizes the averaged value of the continuous measure across cells falling in each square.

There are three required inputs: the two dimensions (dim1 and dim2) and the continuous value (feature). Below shows an example plotting the total number of reads for each cell.

scubi_continuous(dim1 = umap1, dim2 = umap2, feature = readcount)

This mode also accepts ‘resolution’, ‘plot’, and ‘palette’ as optional arguments.

Mode 3: SCUBI for discrete categories

This mode visualizes the distribution of cells with discrete categories, such as the cell types, cell clusters, or the types of samples from which the cells were collected. It can be used to identify the location of a cell cluster, or regions where cells from certain samples are enriched. SCUBI splits the square into multiple pieces, each piece corresponding to one category. The area of the piece reflects the proportion of cells in each category falling in the square. The pieces are colored differently corresponding to different categories.

There are three required inputs: the two dimensions (dim1 and dim2) and the discrete category (feature). Below shows an example plotting the severity levels of the samples from which the cells were collected.

scubi_discrete(dim1 = umap1, dim2 = umap2, feature = sample)

This mode also accepts ‘resolution’, ‘plot’, and ‘palette’ as optional arguments.

Mode 4: SCUBI for comparing cell type compositions across samples

This mode compares the cell type compositions across different types of samples, such as samples with different levels of disease severities. It can be used to identify the regions where cells from a certain type of samples are enriched. This mode can deal with a continuous covariate, such as to identify enriched cell types with increasing age. Additional sample covariates can be included as well to control for potential confounding effects.

There are four required inputs: the two dimensions (dim1 and dim2), a vector indicating which sample each cell belongs to (sample), and the sample covariates (design). The design can have multiple columns to include multiple covariates, but only the coefficient for the first covariate (first column) will be visualized. Below shows an example plotting the difference of cell type compositions between severe and all other samples. Cell types more enriched in severe are marked as red and cell types deleted in severe are marked as blue.

scubi_sample_composition(dim1 = umap1, dim2 = umap2, sample = sample, design = design)

By default, the enrichment is smoothed using loess. Smoothing can also be turned off to show the original enrichment.

scubi_sample_composition(dim1 = umap1, dim2 = umap2, sample = sample, design = design, smooth=FALSE)

Two additional factors supported in this mode are ‘clip’ and ‘scale’. ‘clip’ replaces very large values for smaller ones and replaces very small values for larger ones, thus reducing the range of data and making the color palette more evenly distributed. ‘scale’ standardizes the enrichment across all squares to have mean 0 and standard deviation of 1.

This mode also accepts ‘resolution’, ‘plot’, and ‘palette’ as optional arguments.

Mode 5: SCUBI for comparing gene expression across samples

This mode compares the expression of a single gene across different types of samples. It can be used to identify the regions where gene expression is positively or negatively correlated with the covariate of interest. Similar to mode 4, it can also deal with a continuous variable or include additional covariates.

There are six required inputs: the two dimensions (dim1 and dim2), a gene expression count matrix (count), the name of the gene (gene), a vector indicating which sample each cell belongs to (sample), and the sample covariates (design). Below shows an example plotting the difference of S100A8 gene expression between severe and all other samples. A positive difference (severe higher than others) is marked as red and a negative difference (severe lower than others) is marked as blue.

scubi_sample_expression(dim1 = umap1, dim2 = umap2, count = count, gene = 'S100A8', sample = sample, design = design,resolution=3)

This mode also accepts ‘smooth’, ‘clip’, ‘scale’, ‘resolution’, ‘plot’, and ‘palette’ as optional arguments.

For this mode and Mode 6 below, SCUBI provides another function to assist in deciding the resolution. For a vector of potential resolution, it calculates the proportion of squares that contain cells from samples at least two different covariate values. A scatterplot is generated to visualize how this proportion changes with the resolution. A desired resolution is chosen at the elbow point of the plot.

resolution_sample_expression(dim1 = umap1, dim2 = umap2,sample=sample,design=design)

Mode 6: SCUBI for comparing continuous measure across samples

This mode compares a continuous measure of cells across different types of samples. It can be used to identify the regions where the continuous measure is positively or negatively correlated with the covariate of interest. Similar to mode 4, it can also deal with a continuous variable or include additional covariates.

There are five required inputs: the two dimensions (dim1 and dim2), a vector of continuous measure (feature), a vector indicating which sample each cell belongs to (sample), and the sample covariates (design). Below shows an example plotting the difference of number of read counts between severe and all other samples. A positive difference (severe higher than others) is marked as red and a negative difference (severe lower than others) is marked as blue.

scubi_sample_continuous(dim1 = umap1, dim2 = umap2, feature = readcount, sample = sample, design = design,resolution=3)

This mode also accepts ‘smooth’, ‘clip’, ‘scale’, ‘resolution’, ‘plot’, and ‘palette’ as optional arguments.

Session Info

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Matrix_1.2-18      SeuratObject_4.0.0 Seurat_4.0.0       scubi_1.0         
## 
## loaded via a namespace (and not attached):
##   [1] nlme_3.1-148         matrixStats_0.57.0   RcppAnnoy_0.0.18    
##   [4] RColorBrewer_1.1-2   httr_1.4.2           sctransform_0.3.2   
##   [7] tools_4.0.2          bslib_0.2.5.1        utf8_1.2.1          
##  [10] R6_2.5.0             irlba_2.3.3          rpart_4.1-15        
##  [13] KernSmooth_2.23-17   uwot_0.1.10          mgcv_1.8-33         
##  [16] DBI_1.1.0            lazyeval_0.2.2       colorspace_2.0-1    
##  [19] tidyselect_1.1.0     gridExtra_2.3        compiler_4.0.2      
##  [22] plotly_4.9.2.1       labeling_0.4.2       sass_0.4.0          
##  [25] scales_1.1.1         spatstat.data_2.0-0  lmtest_0.9-37       
##  [28] ggridges_0.5.2       pbapply_1.4-2        goftest_1.2-2       
##  [31] spatstat_1.64-1      stringr_1.4.0        digest_0.6.27       
##  [34] spatstat.utils_2.0-0 rmarkdown_2.7        pkgconfig_2.0.3     
##  [37] htmltools_0.5.1.1    highr_0.8            fastmap_1.0.1       
##  [40] htmlwidgets_1.5.1    rlang_0.4.11         pdist_1.2           
##  [43] shiny_1.5.0          farver_2.1.0         jquerylib_0.1.4     
##  [46] generics_0.1.0       zoo_1.8-8            jsonlite_1.7.1      
##  [49] ica_1.0-2            dplyr_1.0.2          magrittr_2.0.1      
##  [52] patchwork_1.0.1      Rcpp_1.0.5           munsell_0.5.0       
##  [55] fansi_0.4.2          abind_1.4-5          reticulate_1.16     
##  [58] lifecycle_1.0.0      stringi_1.5.3        yaml_2.2.1          
##  [61] MASS_7.3-51.6        Rtsne_0.15           plyr_1.8.6          
##  [64] grid_4.0.2           parallel_4.0.2       listenv_0.8.0       
##  [67] promises_1.1.1       ggrepel_0.8.2        crayon_1.4.1        
##  [70] deldir_0.2-10        miniUI_0.1.1.1       lattice_0.20-41     
##  [73] cowplot_1.1.0        splines_4.0.2        tensor_1.5          
##  [76] knitr_1.33           pillar_1.6.1         igraph_1.2.5        
##  [79] future.apply_1.6.0   reshape2_1.4.4       codetools_0.2-16    
##  [82] leiden_0.3.3         glue_1.4.2           evaluate_0.14       
##  [85] data.table_1.13.2    png_0.1-7            vctrs_0.3.8         
##  [88] httpuv_1.5.4         polyclip_1.10-0      tidyr_1.1.2         
##  [91] gtable_0.3.0         RANN_2.6.1           purrr_0.3.4         
##  [94] scattermore_0.7      clue_0.3-59          future_1.17.0       
##  [97] ggplot2_3.3.3        xfun_0.22            mime_0.9            
## [100] xtable_1.8-4         later_1.1.0.1        viridisLite_0.4.0   
## [103] survival_3.1-12      tibble_3.1.2         cluster_2.1.0       
## [106] globals_0.12.5       fitdistrplus_1.1-1   ellipsis_0.3.2      
## [109] ROCR_1.0-11