Cell type annotation is an essential step in single-cell RNA-seq analysis. However, it is a time-consuming process that often requires expertise in collecting canonical marker genes and manually annotating cell types. Automated cell type annotation methods typically require high-quality reference datasets and the development of additional pipelines. In our manuscript, we demonstrated that GPT-4, a large language model, can automatically and accurately annotate cell types using marker gene information generated by standard single-cell RNA-seq analysis pipelines. We developed GPTCelltype to provide an automated cell type annotation approach using GPT-4 for single-cell RNA-seq analysis.
GPTCelltype can be installed from GitHub as follows.
remotes::install_github("Winnie09/GPTCelltype")
GPTCelltype depends on the R package openai. Please make sure it is installed as well.
install.packages("openai")
GPTCelltype integrates the OpenAI API. Connecting to the OpenAI API requires a secret API key. To avoid exposing the API key or making it visible in browsers, users need to set it as a system environment variable before running GPTCelltype. If the API key is provided, cell type annotations are returned. Otherwise, GPTCelltype returns the prompt itself, which users can then use to communicate with the GPT chatbot.
You can generate your API key on your OpenAI account page. Log in to OpenAI. In the pop-up window, click “->” next to “API”; next, click the “API key” icon on the left-hand side, which takes you to the API key page; then click “Create new secret key” to create your key. Copy the key and save it in a note for later use. Avoid sharing your API key with others or uploading it to public spaces. Make sure it is not visible in browsers or any client-side scripts. Finally, click “Settings” in the left bar, click “Billing” in the drop-down list, and make sure you have a non-zero credit balance.
Set up the API key as a system environment variable before running GPTCelltype.
Sys.setenv(OPENAI_API_KEY = 'your_openai_API_key')
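Before calling gptcelltype(), it can help to confirm whether the key is actually visible to the R session, since that determines whether annotations or only the prompt come back. The sketch below uses base R only; has_openai_key() is a hypothetical helper name introduced here for illustration and is not part of GPTCelltype.

```r
# Hypothetical helper (not part of GPTCelltype): reports whether the
# OPENAI_API_KEY environment variable is set to a non-empty value.
has_openai_key <- function() {
  nzchar(Sys.getenv("OPENAI_API_KEY"))
}

if (has_openai_key()) {
  message("API key found: gptcelltype() is expected to return annotations.")
} else {
  message("No API key set: gptcelltype() is expected to return the prompt.")
}
```

To keep the key out of scripts altogether, one option is to define OPENAI_API_KEY in the user's .Renviron file, which R reads at startup.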
First, load the packages.
library(GPTCelltype)
library(openai)
We demonstrate how to run GPTCelltype as follows. The main function is gptcelltype(). It can annotate cell types by OpenAI GPT models in a Seurat pipeline or with a custom gene list. If gptcelltype() is used in a Seurat pipeline, the Seurat FindAllMarkers() function needs to be run first, and the differential gene table generated by Seurat will serve as the input. If the input is a custom list of genes, one cell type is identified for each element in the list.
Among the input arguments, input can either be the differential gene table returned by the Seurat FindAllMarkers() function, or a list of genes. tissuename (optional) is a tissue name. model is a valid GPT-4 or GPT-3.5 model name listed on the Models page. Default is ‘gpt-4’. topgenenumber is the number of top differential genes to be used when the input is Seurat differential genes. The output is a vector of cell types.
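To make the role of topgenenumber concrete, here is a minimal sketch of how the top genes per cluster could be picked from a marker table. It is an illustration under stated assumptions, not the package's actual implementation: the toy table only mimics the cluster, gene, and avg_log2FC columns of FindAllMarkers() output, and top_genes_per_cluster() is a hypothetical helper that ranks by positive log fold change.

```r
# Toy marker table with the column names used by Seurat's FindAllMarkers().
markers <- data.frame(
  cluster    = c("0", "0", "0", "1", "1", "1"),
  gene       = c("LYZ", "CST3", "MS4A1", "CD79A", "MS4A1", "LYZ"),
  avg_log2FC = c(2.1, 1.5, -0.8, 3.0, 2.2, -1.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep up-regulated genes only, sort each cluster by
# decreasing fold change, and keep the first `topgenenumber` genes.
top_genes_per_cluster <- function(markers, topgenenumber = 10) {
  up <- markers[markers$avg_log2FC > 0, , drop = FALSE]
  up <- up[order(up$cluster, -up$avg_log2FC), , drop = FALSE]
  lapply(split(up$gene, up$cluster), head, n = topgenenumber)
}

top <- top_genes_per_cluster(markers, topgenenumber = 2)
str(top)
```

The result is a named list with one gene vector per cluster, which is also the shape gptcelltype() accepts as a custom gene list.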
GPTCelltype integrates seamlessly with the Seurat pipeline. It takes as input the marker gene table that the Seurat function FindAllMarkers() computes from a Seurat object. Here is an example.
Load the Seurat package.
library(Seurat, quietly = TRUE)
## 'SeuratObject' was built under R 4.3.1 but the current version is
## 4.3.2; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed
## 'SeuratObject' was built with package 'Matrix' 1.6.3 but the current
## version is 1.6.5; it is recomended that you reinstall 'SeuratObject' as
## the ABI for 'Matrix' may have changed
##
## Attaching package: 'SeuratObject'
## The following object is masked from 'package:base':
##
## intersect
In the example below, we use a Seurat object called ‘pbmc_small’ provided by the Seurat package. In real applications, prepare a Seurat object obtained by running the standard Seurat pipeline, with cell clustering available. Use the FindAllMarkers() function to generate the differential gene table if you haven’t done so:
data("pbmc_small")
suppressWarnings({
  all.markers <- FindAllMarkers(object = pbmc_small)
})
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
Perform cell type annotation by GPT-4 using the gptcelltype() function. Here you can optionally provide the actual name of the tissue for your dataset.
res <- gptcelltype(all.markers,
                   tissuename = 'human PBMC',
                   model = 'gpt-4'
)
## [1] "Note: OpenAI API key found: returning the cell type annotations."
## [1] "Note: It is always recommended to check the results returned by GPT-4 in case of\n AI hallucination, before going to down-stream analysis."
Always check the results returned by GPT-4 for AI hallucinations before proceeding to downstream analysis.
res
## 0 1 2
## "Monocytes" "Neutrophils" "B cells"
If the results make sense, we can assign the cell type annotations back to the Seurat object and visualize the cell type annotations on the UMAP:
pbmc_small@meta.data$celltype <- as.factor(res[as.character(Idents(pbmc_small))])
DimPlot(pbmc_small, group.by = 'celltype')
If the results need to be fine-tuned, it is easy to reassign cell type annotations for some clusters. For example, to change the cell type annotation for cluster 0:
res[1] <- 'Classical monocytes'
pbmc_small@meta.data$celltype <- res[as.character(Idents(pbmc_small))]
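The as.character() call in the assignments above matters because cluster identities are stored as a factor, and R indexes a vector with a factor by its integer codes rather than its labels. The following base R sketch illustrates this with toy data standing in for res and Idents(pbmc_small); the idents vector is made up for the example.

```r
# Toy annotation vector in the same shape as the gptcelltype() output:
# names are cluster IDs, values are cell type labels.
res <- c("0" = "Monocytes", "1" = "Neutrophils", "2" = "B cells")

# Toy per-cell cluster identities. Only clusters 1 and 2 occur, so the
# factor's integer codes (1, 2) do not line up with positions in res.
idents <- factor(c("2", "1", "2"))

# Lookup by cluster name gives the intended labels.
by_name <- unname(res[as.character(idents)])

# Lookup by the factor itself uses the integer codes and silently
# returns the wrong labels.
by_code <- unname(res[idents])

print(by_name)
print(by_code)

# Reassigning by name avoids relying on the position of a cluster in res.
res["0"] <- "Classical monocytes"
```

Indexing by name, as in res["0"], is an alternative to res[1] that stays correct even if the order of clusters in res changes.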
If you prefer not to connect to the GPT-4 API or do not have an OpenAI API key, you can set Sys.setenv(OPENAI_API_KEY = ''). In this case, the gptcelltype() function returns the prompt itself, which can be copied and pasted into the GPT-4 or ChatGPT online user interface to obtain cell type annotations.
Sys.setenv(OPENAI_API_KEY = '')
data("pbmc_small")
suppressWarnings({
  all.markers <- FindAllMarkers(object = pbmc_small)
})
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
res <- gptcelltype(all.markers,
                   tissuename = 'human PBMC',
                   model = 'gpt-4'
)
## [1] "Note: OpenAI API key not found: returning the prompt itself."
cat(res)
## Identify cell types of human PBMC cells using the following markers separately for each
## row. Only provide the cell type name. Do not show numbers before the name.
## Some can be a mixture of multiple cell types.
## 0:HLA-DPB1,HLA-DRB1,HLA-DPA1,HLA-DRA,HLA-DRB5,HLA-DQB1,LYZ,TYMP,HLA-DQA1,HLA-DMB
## 1:S100A8,TYMP,S100A9,LYZ,CST3,FCGRT,LST1,AIF1,TYROBP,IFITM3
## 2:HLA-DPB1,MS4A1,HLA-DQB1,HLA-DRB1,HLA-DRA,TCL1A,CD79A,CD79B,HLA-DPA1,HLA-DRB5
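The printed prompt has a simple structure: one instruction header followed by one "cluster:gene,gene,..." row per cluster. As a rough sketch of that structure, the hypothetical build_prompt() below assembles a similar string from a named list of genes. Its wording is modeled on the output shown above and its line breaks are simplified; it is not the package's own code.

```r
# Hypothetical helper (not part of GPTCelltype): builds a prompt with an
# instruction header and one "name:genes" row per list element.
build_prompt <- function(gene_lists, tissuename = NULL) {
  tissue <- if (is.null(tissuename)) "" else paste0(tissuename, " ")
  header <- paste0(
    "Identify cell types of ", tissue, "cells using the following markers ",
    "separately for each row. Only provide the cell type name. ",
    "Do not show numbers before the name. ",
    "Some can be a mixture of multiple cell types."
  )
  # Collapse each gene vector into a comma-separated string; names are kept.
  rows <- vapply(gene_lists, paste, character(1), collapse = ",")
  paste(c(header, paste0(names(rows), ":", rows)), collapse = "\n")
}

prompt <- build_prompt(
  list("0" = c("LYZ", "TYMP"), "1" = c("S100A8", "S100A9")),
  tissuename = "human PBMC"
)
cat(prompt)
```

A string of this form can be pasted into a chat interface in the same way as the prompt returned by gptcelltype().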
Set up your OpenAI API key as a system environment variable before running GPTCelltype.
Sys.setenv(OPENAI_API_KEY = 'your_openai_API_key')
Suppose we provide a list of two gene vectors, where the first contains CD4 and CD3D and the second contains CD14. We can then call the function as follows:
res <- gptcelltype(
  input = list(cluster1 = c('CD4, CD3D'), cluster2 = 'CD14'),
  tissuename = 'human PBMC',
  model = 'gpt-4'
)
## [1] "Note: OpenAI API key not found: returning the prompt itself."
res
## [1] "Identify cell types of human PBMC cells using the following markers separately for each\n row. Only provide the cell type name. Do not show numbers before the name.\n Some can be a mixture of multiple cell types. \ncluster1:CD4, CD3D\ncluster2:CD14"
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Seurat_5.0.1 SeuratObject_5.0.1 sp_2.1-3 openai_0.4.1
## [5] GPTCelltype_1.0.1
##
## loaded via a namespace (and not attached):
## [1] deldir_2.0-2 pbapply_1.7-2 gridExtra_2.3
## [4] rlang_1.1.3 magrittr_2.0.3 RcppAnnoy_0.0.22
## [7] spatstat.geom_3.2-8 matrixStats_1.2.0 ggridges_0.5.6
## [10] compiler_4.3.2 png_0.1-8 vctrs_0.6.5
## [13] reshape2_1.4.4 stringr_1.5.1 pkgconfig_2.0.3
## [16] fastmap_1.1.1 ellipsis_0.3.2 labeling_0.4.3
## [19] utf8_1.2.4 promises_1.2.1 rmarkdown_2.25
## [22] purrr_1.0.2 xfun_0.42 cachem_1.0.8
## [25] jsonlite_1.8.8 goftest_1.2-3 highr_0.10
## [28] later_1.3.2 spatstat.utils_3.0-4 irlba_2.3.5.1
## [31] parallel_4.3.2 cluster_2.1.4 R6_2.5.1
## [34] ica_1.0-3 spatstat.data_3.0-4 bslib_0.6.1
## [37] stringi_1.8.3 RColorBrewer_1.1-3 reticulate_1.35.0
## [40] parallelly_1.37.0 lmtest_0.9-40 jquerylib_0.1.4
## [43] scattermore_1.2 assertthat_0.2.1 Rcpp_1.0.12
## [46] knitr_1.45 tensor_1.5 future.apply_1.11.1
## [49] zoo_1.8-12 sctransform_0.4.1 httpuv_1.6.14
## [52] Matrix_1.6-5 splines_4.3.2 igraph_2.0.2
## [55] tidyselect_1.2.0 abind_1.4-5 rstudioapi_0.15.0
## [58] yaml_2.3.8 spatstat.random_3.2-2 spatstat.explore_3.2-6
## [61] codetools_0.2-19 miniUI_0.1.1.1 curl_5.2.0
## [64] listenv_0.9.1 lattice_0.21-9 tibble_3.2.1
## [67] plyr_1.8.9 withr_3.0.0 shiny_1.8.0
## [70] ROCR_1.0-11 evaluate_0.23 Rtsne_0.17
## [73] future_1.33.1 fastDummies_1.7.3 survival_3.5-7
## [76] polyclip_1.10-6 fitdistrplus_1.1-11 pillar_1.9.0
## [79] KernSmooth_2.23-22 plotly_4.10.4 generics_0.1.3
## [82] RcppHNSW_0.6.0 ggplot2_3.5.0 munsell_0.5.0
## [85] scales_1.3.0 globals_0.16.2 xtable_1.8-4
## [88] glue_1.7.0 lazyeval_0.2.2 tools_4.3.2
## [91] data.table_1.15.0 RSpectra_0.16-1 RANN_2.6.1
## [94] leiden_0.4.3.1 dotCall64_1.1-1 cowplot_1.1.3
## [97] grid_4.3.2 tidyr_1.3.1 colorspace_2.1-0
## [100] nlme_3.1-163 patchwork_1.2.0 presto_1.0.0
## [103] cli_3.6.2 spatstat.sparse_3.0-3 spam_2.10-0
## [106] fansi_1.0.6 viridisLite_0.4.2 dplyr_1.1.4
## [109] uwot_0.1.16 gtable_0.3.4 sass_0.4.8
## [112] digest_0.6.34 progressr_0.14.0 ggrepel_0.9.5
## [115] farver_2.1.1 htmlwidgets_1.6.4 htmltools_0.5.7
## [118] lifecycle_1.0.4 httr_1.4.7 mime_0.12
## [121] MASS_7.3-60.0.1