Introduction

Cell type annotation is an essential step in single-cell RNA-seq analysis. However, it is time-consuming and often requires expertise in collecting canonical marker genes and manually annotating cell types. Automated cell type annotation methods typically require high-quality reference datasets and additional pipelines. In this manuscript, we demonstrated that GPT-4, a large language model, can automatically and accurately annotate cell types using marker gene information generated by standard single-cell RNA-seq analysis pipelines. We developed the software GPTCelltype to provide automated cell type annotation with GPT-4 for single-cell RNA-seq analysis.

Installation

GPTCelltype can be installed from GitHub with the following command.

remotes::install_github("Winnie09/GPTCelltype")

GPTCelltype depends on the R package openai. Please make sure it is installed as well.

install.packages("openai")
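Before loading the packages, it can help to confirm that both are installed. The following is a minimal sketch using only base R; the variable names are illustrative and not part of GPTCelltype.

```r
# Packages this tutorial relies on
pkgs <- c("GPTCelltype", "openai")

# requireNamespace() returns FALSE (rather than raising an error)
# when a package cannot be found
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]

if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```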

Set up OpenAI API key as an environment variable

GPTCelltype integrates the OpenAI API into the software. Connecting to the OpenAI API requires a secret API key. To reduce the risk of exposing the key, users need to set it as a system environment variable before running GPTCelltype. If the API key is provided, cell type annotations are returned. If the API key is not provided, GPTCelltype returns the prompt itself, which users can then paste into the GPT chatbot.

You can generate your API key on your OpenAI account webpage: log in to OpenAI, click on “Personal” in the upper right corner, click on “View API keys” in the drop-down list, and then click on “Create new secret key”, which directs you to the API key page. Copy the key and save it in a secure place for later use. Avoid sharing your API key with others or uploading it to public spaces. Make sure it is not visible in browsers or any client-side scripts.

Set up the API key as a system environment variable before running GPTCelltype.

Sys.setenv(OPENAI_API_KEY = 'your_openai_API_key')
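To avoid hard-coding the key in a script, one option is to store it in the user-level .Renviron file, which R reads at startup, as a line of the form OPENAI_API_KEY=your_openai_API_key. The sketch below checks whether a key is visible to the current session without printing it; has_key is an illustrative name, not part of GPTCelltype.

```r
# Sys.getenv() returns an empty string when the variable is unset
key <- Sys.getenv("OPENAI_API_KEY")
has_key <- nzchar(key)

if (has_key) {
  message("OPENAI_API_KEY is set; annotations will be requested from the API.")
} else {
  message("OPENAI_API_KEY is empty; gptcelltype() will return the prompt instead.")
}
```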

Run GPTCelltype

First of all, please load the packages.

library(GPTCelltype)
library(openai)

We demonstrate how to run GPTCelltype as follows. The main function is gptcelltype(). It annotates cell types using OpenAI GPT models, either in a Seurat pipeline or with a custom gene list. If gptcelltype() is used in a Seurat pipeline, the Seurat FindAllMarkers() function needs to be run first, and the differential gene table it generates serves as the input. If the input is a custom list of genes, one cell type is identified for each element in the list.

Among the input arguments, input can be either the differential gene table returned by the Seurat FindAllMarkers() function or a list of genes. tissuename (optional) is a tissue name. model is a valid GPT-4 or GPT-3.5 model name listed on the OpenAI Models page; the default is ‘gpt-4’. topgenenumber is the number of top differential genes to use per cluster when the input is a Seurat differential gene table. The output is a vector of cell types.
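To make the role of topgenenumber concrete, here is a minimal base R sketch of selecting the top genes per cluster from a table shaped like FindAllMarkers() output. It assumes that "top" means the first rows per cluster among genes with positive avg_log2FC, in the order the table is given; the package's internal selection may differ. The toy table and the names markers, top_n, and top_genes are invented for illustration.

```r
# Toy stand-in for a FindAllMarkers() table (only the columns used here)
markers <- data.frame(
  cluster    = c("0", "0", "0", "1", "1", "1"),
  gene       = c("LYZ", "HLA-DRA", "CD74", "S100A8", "S100A9", "CST3"),
  avg_log2FC = c(2.1, 1.8, -0.5, 3.0, 2.7, 1.2),
  stringsAsFactors = FALSE
)

top_n <- 2  # plays the role of topgenenumber in this sketch

# Keep up-regulated genes only (an assumption), then take the
# first top_n genes of each cluster and join them with commas
up <- markers[markers$avg_log2FC > 0, ]
top_genes <- tapply(up$gene, up$cluster, function(g) {
  paste(head(g, top_n), collapse = ",")
})

print(top_genes)
```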

Example 1: Seurat object as input

GPTCelltype integrates seamlessly with the Seurat pipeline. It takes as input the marker gene table computed from a Seurat object. Specifically, this table is obtained by running the Seurat function FindAllMarkers(). An example follows.

Load the Seurat package.

library(Seurat, quietly = TRUE)
## 'SeuratObject' was built under R 4.3.1 but the current version is
## 4.3.2; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed
## 'SeuratObject' was built with package 'Matrix' 1.6.3 but the current
## version is 1.6.5; it is recomended that you reinstall 'SeuratObject' as
## the ABI for 'Matrix' may have changed
## 
## Attaching package: 'SeuratObject'
## The following object is masked from 'package:base':
## 
##     intersect

In the example below, we use a Seurat object called ‘pbmc_small’ provided by the Seurat package. In real applications, a Seurat object obtained after running the standard Seurat pipeline should be prepared. The Seurat object should have cell clustering available. Use the FindAllMarkers() function to generate the differential gene table if you haven’t done so:

data("pbmc_small")
suppressWarnings({
  all.markers <- FindAllMarkers(object = pbmc_small)
})
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2

Perform cell type annotation by GPT-4 using the gptcelltype() function. Here you can optionally provide the actual name of the tissue for your dataset.

res <- gptcelltype(all.markers, 
            tissuename = 'human PBMC', 
            model = 'gpt-4'
)
## [1] "Note: OpenAI API key found: returning the cell type annotations."
## [1] "Note: It is always recommended to check the results returned by GPT-4 in case of\n AI hallucination, before going to down-stream analysis."

It is always recommended to check the results returned by GPT-4 in case of AI hallucination, before going to down-stream analysis.

res
##                0                1                2 
##   "1. Monocytes" "2. Neutrophils"     "3. B Cells"
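One way to start checking the result is to tidy the labels and compare them with the cell types you expect in the tissue. The sketch below copies the labels printed above into a literal vector, strips the leading numbering, and flags any label outside a list of expected types. The expected vector is a hypothetical example chosen for illustration, not a reference list shipped with GPTCelltype.

```r
# Labels as printed above, keyed by cluster
res <- c("0" = "1. Monocytes", "1" = "2. Neutrophils", "2" = "3. B Cells")

# Remove a leading "1. "-style prefix; sub() keeps the names
clean <- sub("^[[:space:]]*[0-9]+\\.[[:space:]]*", "", res)

# Hypothetical list of cell types expected for this tissue
expected <- c("Monocytes", "B Cells", "T Cells", "NK Cells")

# Clusters whose label is not in the expected list deserve a closer look
flagged <- clean[!clean %in% expected]
print(flagged)
```

A flagged label is not necessarily wrong; it only marks a cluster whose marker genes are worth inspecting by hand.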

If the results make sense, we can assign the cell type annotations back to the Seurat object and visualize the cell type annotations on the UMAP:

pbmc_small@meta.data$celltype <- as.factor(res[as.character(Idents(pbmc_small))])
DimPlot(pbmc_small,group.by='celltype')

If the results need to be fine-tuned, it is easy to reassign cell type annotations for some clusters. For example, to change the cell type annotation for cluster 0:

res[1] <- 'Classical monocytes'
pbmc_small@meta.data$celltype <- res[as.character(Idents(pbmc_small))]
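Because the result is a vector named by cluster, labels can also be reassigned by cluster name rather than by position, which avoids off-by-one mistakes when cluster numbering starts at 0. The sketch below uses a literal vector and a small factor, idents, as a hypothetical stand-in for Idents(pbmc_small), so it runs without Seurat.

```r
# Annotation vector named by cluster (values are illustrative)
res <- c("0" = "Monocytes", "1" = "Neutrophils", "2" = "B Cells")

# Reassign by cluster name instead of by position
res["0"] <- "Classical monocytes"

# Hypothetical per-cell cluster identities
idents <- factor(c("0", "2", "1", "0"))

# as.character() makes the lookup go by name; indexing with a
# factor directly would use its integer codes instead
celltype <- unname(res[as.character(idents)])
print(celltype)
```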

If you prefer not to connect to the GPT-4 API or do not have an OpenAI API key, you can set Sys.setenv(OPENAI_API_KEY = ''). In this case, the gptcelltype() function returns the prompt itself, which can be copied and pasted into the GPT-4 or ChatGPT online user interface to obtain cell type annotations.

Sys.setenv(OPENAI_API_KEY = '')
data("pbmc_small")
suppressWarnings({
  all.markers <- FindAllMarkers(object = pbmc_small)
})
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
res <- gptcelltype(all.markers, 
            tissuename = 'human PBMC', 
            model = 'gpt-4'
)
## [1] "Note: OpenAI API key not found: returning the prompt itself."
cat(res)
## Identify cell types of human PBMC cells using the following markers separately for each
##  row. Only provide the cell type name. Do not show numbers before the name.
##  Some can be a mixture of multiple cell types. 
## 0:HLA-DPB1,HLA-DRB1,HLA-DPA1,HLA-DRA,HLA-DRB5,HLA-DQB1,LYZ,TYMP,HLA-DQA1,HLA-DMB
## 1:S100A8,TYMP,S100A9,LYZ,CST3,FCGRT,LST1,AIF1,TYROBP,IFITM3
## 2:HLA-DPB1,MS4A1,HLA-DQB1,HLA-DRB1,HLA-DRA,TCL1A,CD79A,CD79B,HLA-DPA1,HLA-DRB5
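After pasting the prompt into the chat interface, the reply still has to be mapped back to clusters. The following is a minimal sketch that assumes the reply contains one cell type per line, in the same order as the rows of the prompt; the reply string and the name res_manual are hypothetical.

```r
# Hypothetical reply copied from the chat interface, one label per line
reply <- "Monocytes\nNeutrophils\nB cells"

# Cluster names in the order they appear in the prompt
clusters <- c("0", "1", "2")

# Split into lines, trim whitespace, and drop empty lines
labels <- trimws(strsplit(reply, "\n", fixed = TRUE)[[1]])
labels <- labels[nzchar(labels)]

# Guard against a reply with a different number of lines
stopifnot(length(labels) == length(clusters))

res_manual <- setNames(labels, clusters)
print(res_manual)
```

The resulting named vector has the same shape as the output of gptcelltype() shown earlier, so it can be used in the same way to label clusters.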

Example 2: use a list of genes as input

Set up your OpenAI API key as a system environment variable before running GPTCelltype.

Sys.setenv(OPENAI_API_KEY = 'your_openai_API_key')

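Suppose we provide a list with two elements: the first holds the markers CD4 and CD3D (written here as a single comma-separated string), and the second holds CD14. The function can then be called as follows. The output shown was generated without an API key set, so the prompt itself is returned: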

res <- gptcelltype(
  input = list(cluster1 = c('CD4, CD3D'), cluster2 = 'CD14'),
  tissuename = 'human PBMC',
  model = 'gpt-4'
)
## [1] "Note: OpenAI API key not found: returning the prompt itself."
res
## [1] "Identify cell types of human PBMC cells using the following markers separately for each\n row. Only provide the cell type name. Do not show numbers before the name.\n Some can be a mixture of multiple cell types. \ncluster1:CD4, CD3D\ncluster2:CD14"
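If each cluster's markers are stored as a character vector with one gene per element, they can be joined into comma-separated rows like those in the prompt above. This is a base R sketch of that formatting step under that assumption, not the package's internal code; gene_list, rows, and prompt_rows are illustrative names.

```r
# One character vector of marker genes per cluster
gene_list <- list(cluster1 = c("CD4", "CD3D"), cluster2 = "CD14")

# Join the genes of each cluster with commas
rows <- vapply(gene_list, paste, character(1), collapse = ",")

# Prefix each row with its cluster name, as in the printed prompt
prompt_rows <- paste0(names(rows), ":", rows)
cat(prompt_rows, sep = "\n")
```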

Session Info

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Seurat_5.0.1       SeuratObject_5.0.1 sp_2.1-3           openai_0.4.1      
## [5] GPTCelltype_1.0.1 
## 
## loaded via a namespace (and not attached):
##   [1] deldir_2.0-2           pbapply_1.7-2          gridExtra_2.3         
##   [4] rlang_1.1.3            magrittr_2.0.3         RcppAnnoy_0.0.22      
##   [7] spatstat.geom_3.2-8    matrixStats_1.2.0      ggridges_0.5.6        
##  [10] compiler_4.3.2         png_0.1-8              vctrs_0.6.5           
##  [13] reshape2_1.4.4         stringr_1.5.1          pkgconfig_2.0.3       
##  [16] fastmap_1.1.1          ellipsis_0.3.2         labeling_0.4.3        
##  [19] utf8_1.2.4             promises_1.2.1         rmarkdown_2.25        
##  [22] purrr_1.0.2            xfun_0.42              cachem_1.0.8          
##  [25] jsonlite_1.8.8         goftest_1.2-3          highr_0.10            
##  [28] later_1.3.2            spatstat.utils_3.0-4   irlba_2.3.5.1         
##  [31] parallel_4.3.2         cluster_2.1.4          R6_2.5.1              
##  [34] ica_1.0-3              spatstat.data_3.0-4    bslib_0.6.1           
##  [37] stringi_1.8.3          RColorBrewer_1.1-3     reticulate_1.35.0     
##  [40] parallelly_1.37.0      lmtest_0.9-40          jquerylib_0.1.4       
##  [43] scattermore_1.2        assertthat_0.2.1       Rcpp_1.0.12           
##  [46] knitr_1.45             tensor_1.5             future.apply_1.11.1   
##  [49] zoo_1.8-12             sctransform_0.4.1      httpuv_1.6.14         
##  [52] Matrix_1.6-5           splines_4.3.2          igraph_2.0.2          
##  [55] tidyselect_1.2.0       abind_1.4-5            rstudioapi_0.15.0     
##  [58] yaml_2.3.8             spatstat.random_3.2-2  spatstat.explore_3.2-6
##  [61] codetools_0.2-19       miniUI_0.1.1.1         curl_5.2.0            
##  [64] listenv_0.9.1          lattice_0.21-9         tibble_3.2.1          
##  [67] plyr_1.8.9             withr_3.0.0            shiny_1.8.0           
##  [70] ROCR_1.0-11            evaluate_0.23          Rtsne_0.17            
##  [73] future_1.33.1          fastDummies_1.7.3      survival_3.5-7        
##  [76] polyclip_1.10-6        fitdistrplus_1.1-11    pillar_1.9.0          
##  [79] KernSmooth_2.23-22     plotly_4.10.4          generics_0.1.3        
##  [82] RcppHNSW_0.6.0         ggplot2_3.5.0          munsell_0.5.0         
##  [85] scales_1.3.0           globals_0.16.2         xtable_1.8-4          
##  [88] glue_1.7.0             lazyeval_0.2.2         tools_4.3.2           
##  [91] data.table_1.15.0      RSpectra_0.16-1        RANN_2.6.1            
##  [94] leiden_0.4.3.1         dotCall64_1.1-1        cowplot_1.1.3         
##  [97] grid_4.3.2             tidyr_1.3.1            colorspace_2.1-0      
## [100] nlme_3.1-163           patchwork_1.2.0        presto_1.0.0          
## [103] cli_3.6.2              spatstat.sparse_3.0-3  spam_2.10-0           
## [106] fansi_1.0.6            viridisLite_0.4.2      dplyr_1.1.4           
## [109] uwot_0.1.16            gtable_0.3.4           sass_0.4.8            
## [112] digest_0.6.34          progressr_0.14.0       ggrepel_0.9.5         
## [115] farver_2.1.1           htmlwidgets_1.6.4      htmltools_0.5.7       
## [118] lifecycle_1.0.4        httr_1.4.7             mime_0.12             
## [121] MASS_7.3-60.0.1