
2024-03-07 22:06:24

Tumor immune infiltration analysis: the xCell package (principles and usage) - Zhihu

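Author: 剑鱼师兄 (clinician, self-taught in bioinformatics, SEER database mining)

I. Input data

From the original documentation: "The expression matrix should be a matrix with genes in rows and samples in columns. The rownames should be gene symbols. xCell uses the expression levels ranking and not the actual values, thus normalization does not have an effect, however normalizing to gene length is required. Importantly, xCell performs best with heterogenous dataset. Thus it is recommended to use all data combined in one run, and not break down to pieces (especially not cases and control in different runs)."

1. The input is a matrix with gene symbols as row names and samples as columns.

2. xCell is built on ssGSEA, which ranks genes by expression, so it does not need the actual expression values. The data must still be normalized for gene length, which in practice means TPM is the best choice.

3. xCell performs better on heterogeneous datasets, so analyze all samples together in one run rather than splitting the dataset by stratum or group (xCell computes a score for each cell type in every sample).

II. Principles

1. xCell considers 10800 genes; the list is available as xCell.data$genes. Your dataset must contain at least 5000 of these 10800 genes. If too few are present, the results will be inaccurate.

2. xCell runs the enrichment analysis with several signature gene sets per cell type: 489 signatures in total for 64 cell types. They can be inspected with xCell.data$signatures.

3. The enrichment scores are computed with single-sample gene set enrichment analysis (ssGSEA). For each cell type, the score is the average of the enrichment scores of its signatures.

4. The averaged scores are then normalized across samples in a simple way: each cell type's scores are shifted so that its minimum across the samples is 0. This is why xCell performs best on heterogeneous datasets. If a cell type is present at similar levels in all samples, every sample has the minimum subtracted and ends up near 0, which does not reflect the true proportion. It is therefore recommended to use all available data in one run rather than splitting it into smaller subsets.

5. As a consequence, scores from samples analyzed in different runs are not comparable. [The minimum subtracted in step 4 is taken over the samples in the current run, so changing the dataset changes the baseline and therefore every score. Analyses run separately on different datasets cannot be compared, but if you merge the datasets and run xCell once, the scores within that run can be.]

III. The xCell pipeline

1. rawEnrichmentAnalysis

Takes the expression matrix and returns enrichment scores for the 64 cell types.

scores = rawEnrichmentAnalysis(expr, signatures, genes, file.name, parallel.sz, parallel.type = "SOCK")

'expr' is the expression matrix; 'signatures' and 'genes' are the signature gene sets and gene list described above; 'file.name' is optional and saves the raw scores to a tab-delimited text file; 'parallel.sz' is the number of threads to use (default 4); 'parallel.type' is the cluster architecture, 'SOCK' (default) or 'FORK'. FORK is faster but is not supported on Windows.

2. transformScores

This function converts the raw enrichment scores to a linear scale that resembles percentages, using pre-computed calibration parameters. xCell uses different parameter sets for sequencing-based and microarray-based expression values: xCell.data$spill$fv for sequencing data and xCell.data$spill.array$fv for microarray data.

tscores = transformScores(scores, fit.vals, scale, fn)

'scores' is the output of rawEnrichmentAnalysis; 'fit.vals' are the calibration parameters above; 'scale' controls whether the transformed scores are scaled (TRUE is both the default and the recommendation); 'fn' is an optional file name for saving the transformed scores to a tab-delimited text file.

3. spillOver

This function applies xCell's spillover compensation, which reduces the dependence between closely related cell types. The compensation uses a pre-computed dependency matrix "K", which differs between sequencing and microarray input: xCell.data$spill$K for sequencing data and xCell.data$spill.array$K for microarray data.

spillOver(transformedScores, K, alpha = 0.5, file.name = NULL)

Removing spillover can be too aggressive and may remove real signal, so the alpha parameter calibrates how much is removed: alpha = 0 means no compensation and alpha = 1 means full compensation. In their experiments the authors found that alpha = 0.5 reduced the dependencies while increasing the real signal (this is also the default).

Finally, the authors wrapped these three steps into a single pipeline function, so one call does the work of all three:

xCellAnalysis(expr, signatures = NULL, genes = NULL, spill = NULL, rnaseq = TRUE, file.name = NULL, scale = TRUE, alpha = 0.5, save.raw = FALSE, parallel.sz = 4, parallel.type = "SOCK", cell.types.use = NULL)

IV. Hands-on

1. Installing xCell

# Install the dependencies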

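# install.packages() takes the package names as one character vector;
# 'utils' and 'stats' ship with base R and need no installation
install.packages(c('pracma', 'MASS', 'digest', 'curl', 'quadprog'))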

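# The original pinned version = "3.8", which only installs on the matching older R release;
# without the argument, BiocManager uses the Bioconductor release for your R version
BiocManager::install(c("GSVA","GSEABase"))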

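# Install xCell

devtools::install_github('dviraran/xCell')

If network problems prevent R from downloading from GitHub, you can download the zip archive and install it locally, provided the dependencies are already installed. xCell download page: GitHub - dviraran/xCell: Cell types enrichment analysis

# After downloading the archive, install it locally by pointing to its location

devtools::install_local("C:/Users/yxz/Desktop/xCell-master.zip")

2. Loading the package and preparing the expression data

I use bladder cancer transcriptome data downloaded from TCGA, stored in 'TCGA-BLCA.Rdata' as an object named RNAseq_TPM_tumor, and reshape it into a matrix with gene names as row names and sample names as column names.

> library(xCell)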

> load('TCGA-BLCA.Rdata')

> exp <- RNAseq_TPM_tumor[,-c(1,3)]

> row.names(exp) <- exp$gene_name

> exp$gene_name <- NULL

> exp <- as.matrix(exp)

> head(exp[1:6,1:4])

TCGA-DK-A2I6 TCGA-FD-A6TK TCGA-UY-A78L TCGA-4Z-AA84

MT-CO2 62170.34 19359.957 15928.54 38257.86

MT-CO3 41175.06 19359.516 21500.40 31115.51

MT-ND4 27932.33 13639.380 12138.97 30899.84

MT-CO1 27255.06 16299.942 14013.53 27248.09

MT-ATP6 26884.94 10122.535 11184.26 18820.21

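MT-ND3 12948.43 9792.113 6283.82 18559.25

3. rawEnrichmentAnalysis

Loading the xCell package makes available a list object named "xCell.data", which holds the spillover and calibration parameters, the signatures, and the genes xCell uses.

> scores <- rawEnrichmentAnalysis(exp,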

signatures = xCell.data$signatures,

genes = xCell.data$genes,

parallel.sz = 4,

parallel.type = "SOCK")

> head(scores[1:6,1:4])

TCGA-DK-A2I6 TCGA-FD-A6TK TCGA-UY-A78L TCGA-4Z-AA84

aDC 2120.9099 3407.5979 2203.3173 1560.0276

Adipocytes 578.9236 918.9322 613.4070 911.7966

Astrocytes 1720.9034 2873.3526 2122.2867 1507.2499

B-cells 484.3791 922.0077 487.2223 833.6574

Basophils 564.3821 420.2946 572.7025 635.7317

CD4+ memory T-cells 574.9492 1039.5864 472.7169 506.2913

4. transformScores

# This is sequencing data, so inspect the matching calibration parameters

> head(xCell.data$spill$fv[1:6,])

V1 V2 V3

Adipocytes 1.511297 3.133350 1.5368631

Astrocytes 2.697510 2.015096 0.3161583

B-cells 4.355125 1.856475 0.3228619

Basophils 9.175250 1.450814 0.1211131

CD4+ memory T-cells 7.825308 2.210082 0.3063117

CD4+ naive T-cells 5.825380 1.868170 0.2871547

> tscores <- transformScores(scores,

fit.vals = xCell.data$spill$fv,

scale = T)

> head(tscores[1:6,1:4])

TCGA-DK-A2I6 TCGA-FD-A6TK TCGA-UY-A78L TCGA-4Z-AA84

aDC 0.2946003913 0.7682611482 0.3182683579 0.1577735604

Adipocytes 0.0001608162 0.0009657404 0.0002036621 0.0009383422

Astrocytes 0.1502547695 0.4593741390 0.2386719742 0.1115486654

B-cells 0.0110567460 0.0499855339 0.0112245443 0.0400672600

Basophils 0.0812310672 0.0357448471 0.0841881330 0.1076197499

CD4+ memory T-cells 0.0067416677 0.0352566010 0.0036216243 0.0045305083

5. spillOver

# Inspect the dependency matrix for sequencing data

> head(xCell.data$spill$K[1:6,1:6])

aDC Adipocytes Astrocytes B-cells Basophils CD4+ memory T-cells

aDC 1 4.075718e-01 0 0.03981875 5.619359e-05 0

Adipocytes 0 1.000000e+00 0 0.00000000 0.000000e+00 0

Astrocytes 0 2.999426e-01 1 0.00000000 0.000000e+00 0

B-cells 0 0.000000e+00 0 1.00000000 4.271959e-02 0

Basophils 0 9.218091e-02 0 0.16016181 1.000000e+00 0

CD4+ memory T-cells 0 4.251974e-05 0 0.02674187 1.713277e-01 1

> tscores <- spillOver(tscores,K = xCell.data$spill$K, alpha = 0.5)

> head(tscores[1:6,1:3])

TCGA-DK-A2I6 TCGA-FD-A6TK TCGA-UY-A78L

aDC 0.2524046618 0.58816988 0.22413203692987015980264687

Adipocytes 0.0000000000 0.00000000 0.00198908334143779773686700

Astrocytes 0.0561374012 0.32519978 0.13219687233163704420668694

B-cells 0.0047849676 0.03453208 0.00826267497952571337849204

Basophils 0.0587344782 0.00000000 0.04495982215816759358650856

CD4+ memory T-cells 0.0002222366 0.02099696 0.00000000000000000001394422

6. Pipeline

> tscores2 <- xCellAnalysis(exp,rnaseq = TRUE,scale = TRUE, alpha = 0.5,

parallel.sz = 4, parallel.type = "SOCK")

> head(tscores2[1:6,1:4])

TCGA-DK-A2I6 TCGA-FD-A6TK TCGA-UY-A78L TCGA-4Z-AA84

aDC 0.2524046618 0.58816988 0.22413203692987015980264687 0.13105769278020731882783423

Adipocytes 0.0000000000 0.00000000 0.00198908334143779773686700 0.00063422913516181667753502

Astrocytes 0.0561374012 0.32519978 0.13219687233163704420668694 0.01253547593958248378143150

B-cells 0.0047849676 0.03453208 0.00826267497952571337849204 0.03229280346962691561341074

Basophils 0.0587344782 0.00000000 0.04495982215816759358650856 0.06693861377818721702936955

CD4+ memory T-cells 0.0002222366 0.02099696 0.00000000000000000001394422 0.00000000000000000006264071

That is the whole process, from preparing the data to running xCell and obtaining the immune cell enrichment scores.

Edited 2022-08-31 20:47
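Points 4 and 5 of the principles section (the shift to a zero minimum, and why separate runs are not comparable) can be illustrated with a small base R sketch. It does not call xCell: the score matrix is made up and `shift_to_zero` is a hypothetical helper that only mimics the shift step as described in the text.

```r
# Toy raw enrichment scores: 2 cell types x 6 samples (made-up numbers).
raw <- matrix(
  c(500, 520, 510, 900, 880, 910,   # "B-cells": low in s1-s3, high in s4-s6
    300, 310, 305, 302, 308, 304),  # "Basophils": similar in every sample
  nrow = 2, byrow = TRUE,
  dimnames = list(c("B-cells", "Basophils"), paste0("s", 1:6))
)

# Hypothetical helper: shift each cell type (row) so its minimum across samples is 0.
shift_to_zero <- function(scores) scores - apply(scores, 1, min)

all_run   <- shift_to_zero(raw)         # all six samples analyzed in one run
split_run <- shift_to_zero(raw[, 4:6])  # only s4-s6 analyzed in a separate run

# The same sample gets a different B-cell score depending on its companions.
print(all_run["B-cells", "s4"])
print(split_run["B-cells", "s4"])

# A cell type with similar levels everywhere ends up close to 0 in every sample.
print(all_run["Basophils", ])
```

In the combined run, sample s4 is measured against the lowest sample overall. In the split run it is measured only against s5 and s6, so its B-cell score shrinks even though the underlying data did not change.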

Immune infiltration | xCell: a brief overview and practice - Zhihu

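Author: 被炸熟的虾 (bioinformatics trainee; WeChat public account of the same name). First published in the column 免疫浸润 (Immune infiltration).

I. Introduction to xCell

xCell was developed in 2017 by the dviraran team (also the developers of SingleR). From gene expression data it estimates the infiltration level of up to 64 cell types, including multiple adaptive and innate immune cells, hematopoietic progenitors, epithelial cells, and extracellular matrix cells.

xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 2017 Nov 15. PMID: 29141660.

Main principle: first, expression signatures for 64 immune and stromal cell types are derived from expression profiles of purified cell types and serve as the cell type signatures. ssGSEA is then used to compute each sample's enrichment score for every signature. A fitted formula converts the enrichment scores of each cell type into cell type scores. Finally, xCell applies a compensation step to the scores of closely related cell types, reducing the effect of collinearity/correlation between cell types.

The R package is hosted on GitHub:

devtools::install_github('dviraran/xCell')

II. Testing xCell

xCell expects an expression matrix with genes in rows and samples in columns, with gene symbols as gene names. xCell uses the ranking of expression levels rather than the actual values, so FPKM, TPM, or RSEM quantifications all work, but raw counts must not be used.

The code is simple; a single call runs everything:

xCellAnalysis(expr, signatures = NULL, genes = NULL, spill = NULL,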

rnaseq = TRUE, file.name = NULL, scale = TRUE, alpha = 0.5,

save.raw = FALSE, parallel.sz = 4, parallel.type = "SOCK",

cell.types.use = NULL)

Main parameters: note that for RNA-seq data you should set rnaseq = T.

library(xCell)

## array

xCell <- xCellAnalysis(exprSet,rnaseq=F)

## RNA-seq

xCell_RNAseq <- xCellAnalysis(exprSet,rnaseq = T)

Example with TCGA data:

library(xCell)

load("D:/data/TCGA/TCGAbiolinks-EXP/TCGA-CHOL-GeneSymbol.Rdata")

xCell_RNAseq <- xCellAnalysis(exp_TPM,rnaseq = T)

#[1] "Num. of genes: 10523"

#Setting parallel calculations through a MulticoreParam back-end

#with workers=4 and tasks=100.

#Estimating ssGSEA scores for 489 gene sets.

#[1] "Calculating ranks..."

#[1] "Calculating absolute values from ranks..."

This yields, for every sample, scores for the 64 cell subtypes plus ImmuneScore, StromaScore, and MicroenvironmentScore.

xCell_RNAseq[1:4,1:4]

# TCGA-W5-AA31-11A-11R-A41I-07 TCGA-W5-AA2I-11A-11R-A41I-07 TCGA-W5-AA2T-01A-12R-A41I-07 TCGA-ZH-A8Y5-01A-11R-A41I-07

#aDC 7.516058e-02 1.535893e-02 1.738765e-18 0.000000e+00

#Adipocytes 2.008655e-01 2.499020e-01 0.000000e+00 1.618808e-03

#Astrocytes 1.170599e-17 6.764681e-18 3.437587e-18 8.949610e-18

#B-cells 0.000000e+00 2.432688e-18 3.925314e-03 0.000000e+00

rownames(xCell_RNAseq)

# [1] "aDC" "Adipocytes" "Astrocytes"

# [4] "B-cells" "Basophils" "CD4+ memory T-cells"

# [7] "CD4+ naive T-cells" "CD4+ T-cells" "CD4+ Tcm"

#[10] "CD4+ Tem" "CD8+ naive T-cells" "CD8+ T-cells"

#[13] "CD8+ Tcm" "CD8+ Tem" "cDC"

#[16] "Chondrocytes" "Class-switched memory B-cells" "CLP"

#[19] "CMP" "DC" "Endothelial cells"

#[22] "Eosinophils" "Epithelial cells" "Erythrocytes"

#[25] "Fibroblasts" "GMP" "Hepatocytes"

#[28] "HSC" "iDC" "Keratinocytes"

#[31] "ly Endothelial cells" "Macrophages" "Macrophages M1"

#[34] "Macrophages M2" "Mast cells" "Megakaryocytes"

#[37] "Melanocytes" "Memory B-cells" "MEP"

#[40] "Mesangial cells" "Monocytes" "MPP"

#[43] "MSC" "mv Endothelial cells" "Myocytes"

#[46] "naive B-cells" "Neurons" "Neutrophils"

#[49] "NK cells" "NKT" "Osteoblast"

#[52] "pDC" "Pericytes" "Plasma cells"

#[55] "Platelets" "Preadipocytes" "pro B-cells"

#[58] "Sebocytes" "Skeletal muscle" "Smooth muscle"

#[61] "Tgd cells" "Th1 cells" "Th2 cells"

#[64] "Tregs" "ImmuneScore" "StromaScore"

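#[67] "MicroenvironmentScore"

What follows is visualization (box plots, violin plots, bar charts, and so on) and comparison between groups.

Other notes: xCell is an enrichment method, not a deconvolution method, and it returns enrichment scores rather than cell proportions. Scores for the same cell type can therefore be compared across samples, but comparisons between different cell types are not meaningful. xCell performs better on heterogeneous datasets, so rather than splitting the samples across several xCell runs, merge all the data and run it once.

III. Web tool

xCell is also available as an online tool: http://xcell.ucsf.edu/

The site provides pre-computed results for TCGA data and states the input requirements more explicitly. Interestingly, besides xCell the site offers other signatures to choose from, but running them produced an error.

Published 2023-10-20 14:19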
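The advice that FPKM/TPM work while raw counts do not follows from xCell using within-sample ranks. A small base R sketch, with made-up counts and hypothetical gene lengths, shows the difference: scaling a sample by a constant leaves the gene ranking untouched, whereas dividing by gene length can reorder the genes.

```r
# Made-up read counts for five genes in one sample, with hypothetical gene lengths (kb).
counts    <- c(geneA = 100, geneB = 400, geneC = 250, geneD = 900, geneE = 50)
length_kb <- c(geneA = 0.5, geneB = 4.0, geneC = 1.0, geneD = 10,  geneE = 2.0)

# Library-size scaling (counts per million): every gene is multiplied by the same factor.
cpm <- counts / sum(counts) * 1e6

# TPM: divide by gene length first, then rescale the sample to one million.
rate <- counts / length_kb
tpm  <- rate / sum(rate) * 1e6

# Within-sample ranks are what a rank-based method actually sees.
print(rbind(counts = rank(counts), cpm = rank(cpm), tpm = rank(tpm)))

same_after_scaling <- identical(rank(counts), rank(cpm))
same_after_length  <- identical(rank(counts), rank(tpm))
print(c(scaling = same_after_scaling, length_norm = same_after_length))
```

The long gene (geneD) has the most reads but drops in rank once length is accounted for, which is why length-normalized values are the appropriate input.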

xCell: digitally portraying the tissue cellular heterogeneity landscape | Genome Biology | Full Text


Method

Open access

Published: 15 November 2017


Dvir Aran, Zicheng Hu & Atul J. Butte (ORCID: orcid.org/0000-0002-7433-2740)

Genome Biology

volume 18, Article number: 220 (2017)


Abstract

Tissues are complex milieus consisting of numerous cell types. Several recent methods have attempted to enumerate cell subsets from transcriptomes. However, the available methods have used limited sources for training and give only a partial portrayal of the full cellular landscape. Here we present xCell, a novel gene signature-based method, and use it to infer 64 immune and stromal cell types. We harmonized 1822 pure human cell type transcriptomes from various sources and employed a curve fitting approach for linear comparison of cell types and introduced a novel spillover compensation technique for separating them. Using extensive in silico analyses and comparison to cytometry immunophenotyping, we show that xCell outperforms other methods. xCell is available at http://xCell.ucsf.edu/.

Background

In addition to malignant proliferating cells, tumors are also composed of numerous distinct non-cancerous cell types and activation states of those cell types. Together these are termed the tumor microenvironment, which has been in the research spotlight in recent years and is being further explored by novel techniques. The most studied set of non-cancerous cell types are the tumor-infiltrating lymphocytes (TILs). However, TILs are only part of a variety of innate and adaptive immune cells, stromal cells, and many other cell types that are found in the tumor and interact with the malignant cells. This complex and dynamic microenvironment is now recognized to be important both in promoting and inhibiting tumor growth, invasion, and metastasis [1, 2]. Understanding the cellular heterogeneity composing the tumor microenvironment is key for improving existing treatments, the discovery of predictive biomarkers, and development of novel therapeutic strategies.

Traditional approaches for dissecting the cellular heterogeneity in liquid tissues are difficult to apply in solid tumors [3]. Therefore, in the past decade, several methods have been published for digitally dissecting the tumor microenvironment using gene expression profiles [4, 5, 6, 7] (reviewed in [8]). Recently, a multitude of studies have been published applying published and novel techniques on publicly available tumor sample resources, such as The Cancer Genome Atlas (TCGA) [6, 9–13]. Two general types of techniques are used: deconvolving the complete cellular composition and assessing enrichments of individual cell types.

At least seven major issues raise concerns that the in silico methods could be prone to errors and cannot reliably portray the cellular heterogeneity of the tumor microenvironment. First, current techniques depend on the expression profiles of purified cell types to identify reference genes and therefore rely heavily on the data source from which the references are inferred and could thus be inclined to overfit these data. Second, current methods focus on only a very narrow range of the tumor microenvironment, usually a subset of immune cell types, and thus do not account for the further richness of cell types in the microenvironment, including blood vessels and other different forms of cell subsets [14, 15]. A third problem is the ability of cancer cells to "imitate" other cell types by expressing immune-specific genes, such as a macrophage-like expression pattern in tumors with parainflammation [16]; only a few of the methods take this into account. Fourth, the ability of existing methods to estimate cell abundance has not yet been comprehensively validated in mixed samples. Cytometry is a common method for counting cell types in a mixture and, when performed in combination with gene expression profiling, can allow validation of the estimations. However, in most studies that included cytometry validation, these analyses were performed on only a very limited number of cell types and a limited number of samples [7, 13].

A fifth challenge is that deconvolution approaches are prone to many different biases because of the strict dependencies among all cell types that are inferred. This could highly affect reliability when analyzing tumor samples, which are prone to form non-conventional expression profiles. A sixth problem comes with inferring an increasing number of closely related cell types [10]. Finally, deconvolution analysis heavily relies on the structure of the reference matrix, which limits its application to the resource used to develop the matrix. One such deconvolution approach is CIBERSORT, the most comprehensive study to date, which allows the enumeration of 22 immune subsets [7]. Newman et al. [7] performed adequate evaluation across data sources and validated the estimations using cytometry immunophenotyping. However, the shortcomings of deconvolution approaches are apparent in CIBERSORT, which is limited to Affymetrix microarray studies.

On the other hand, gene set enrichment analysis (GSEA) is a simple technique which can be easily applied across data types and can be quickly applied for cancer studies. In GSEA each gene signature is used independently of all other signatures and it is thus protected from the limitations of deconvolution approaches. However, because of this independence, it is often hard to differentiate between closely related cell types. In addition, gene signature-based methods only provide enrichment scores and thus do not allow comparison across cell types and cannot provide insights into the abundance of cell types in the mixture.

Here, we present xCell, a novel method that integrates the advantages of gene set enrichment with deconvolution approaches. We present a compendium of newly generated gene signatures for 64 cell types, spanning multiple adaptive and innate immunity cells, hematopoietic progenitors, epithelial cells, and extracellular matrix cells derived from thousands of expression profiles. Using in silico mixtures, we transform the enrichment scores to a linear scale, and using a spillover compensation technique we reduce dependencies between closely related cell types. We evaluate these adjusted scores in RNA-seq and microarray data from primary cell type samples from various independent sources. We examine their ability to digitally dissect the tumor microenvironment by in silico analyses, and perform the most comprehensive comparison to date with cytometry immunophenotyping. We compare our inferences with available methods and show that scores from xCell are more reliable for digital dissection of mixed tissues. Finally, we apply our method on TCGA tumor samples to portray a full tumor microenvironment landscape across thousands of samples. We provide these estimations to the community and hope that this resource will allow researchers to gain a better perspective of the complex cellular heterogeneity in tumor tissues.

Results

Generating a gene signature compendium of cell types

To generate our compendium of gene signatures for cell types, we collected gene expression profiles from six sources: the FANTOM5 project, from which we annotated 719 samples from 39 cell types analyzed by the Cap Analysis Gene Expression (CAGE) technique [17]; the ENCODE project, from which we annotated 115 samples from 17 cell types analyzed by RNA-seq [18]; the Blueprint project, from which we annotated 144 samples from 28 cell types analyzed by RNA-seq [19]; the IRIS project, from which we annotated 95 samples from 13 cell types analyzed by Affymetrix microarrays [20]; the Novershtern et al. [21] study, from which we annotated 180 samples from 24 cell types analyzed by Affymetrix microarrays; and the Human Primary Cells Atlas (HPCA), a collection of Affymetrix microarrays composed of many different Gene Expression Omnibus (GEO) datasets, from which we annotated 569 samples from 41 cell types [22] (Fig. 1a). Altogether we collected and curated gene expression profiles from 1822 samples of pure cell types, annotated to 64 distinct cell types and cell subsets (Fig. 1b; Additional file 1). Of those, 54 cell types were found in at least two of these data sources. For cell types with five or more samples in a data source, we left one sample out for testing. All together, 97 samples were left out, and all of the model training described below was performed on the remaining 1725 samples.

Fig. 1. xCell study design. a A summary of the data sources used in the study to generate the gene signatures, showing the number of pure cell types and number of samples curated from them. b Our compendium of 64 human cell type gene signatures grouped into five cell type families. c The xCell pipeline. Using the data sources and based on different thresholds, we derived gene signatures for 64 cell types. Of this collection of 6573 signatures, we chose the 489 most reliable signatures, three for each cell type from each data source where available. The raw score is then the average single-sample GSEA (ssGSEA) score of all signatures corresponding to the cell type. Using simulations of gene expression for each cell type, we derived a function to transform the non-linear association between the scores to a linear scale. Using the simulations we also derive the dependencies between cell type scores and apply a spillover compensation method to adjust the scores.
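The step in Fig. 1c where a cell type's raw score is the average over all of its signatures can be sketched in base R. This is only an illustration under stated assumptions: the mean within-sample rank of the signature genes stands in for ssGSEA, and the expression values, gene names, and the "celltype%signature" naming scheme are all hypothetical.

```r
set.seed(1)

# Toy expression matrix: 20 genes x 3 samples (random values).
expr <- matrix(rexp(60), nrow = 20,
               dimnames = list(paste0("g", 1:20), paste0("s", 1:3)))

# Hypothetical signatures: two for cell type "X", one for cell type "Y".
signatures <- list(
  "X%sig1" = c("g1", "g2", "g3"),
  "X%sig2" = c("g2", "g4", "g5"),
  "Y%sig1" = c("g10", "g11", "g12")
)

# Stand-in for ssGSEA (an assumption): mean within-sample rank of the signature genes.
ranks <- apply(expr, 2, rank)
sig_scores <- t(sapply(signatures, function(genes) {
  colMeans(ranks[genes, , drop = FALSE])
}))

# Raw score per cell type: average the scores of all signatures of that cell type.
cell_type  <- sub("%.*$", "", rownames(sig_scores))
n_per_type <- as.vector(table(cell_type))
raw_scores <- rowsum(sig_scores, group = cell_type) / n_per_type

print(sig_scores)
print(raw_scores)
```

Averaging several signatures per cell type is what makes the raw score less sensitive to any single small gene set or to the platform a given signature was learned on.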

Our strategy for selecting reliable cell type gene signatures is shown in Fig. 1c (see Additional file 2: Figure S1 and "Methods" for a full description and technical details). For each data source independently we identified genes that are overexpressed in one cell type compared to all other cell types. We applied different thresholds for choosing sets of genes to represent the cell type gene signatures; hence, from each source, we generated dozens of signatures per cell type. This scheme yielded 6573 gene signatures corresponding to 64 cell types. Importantly, since our primary aim is to develop a tool for studying the cellular heterogeneity in the tumor microenvironment, we applied a methodology we previously developed [16] to filter out genes that tend to be overexpressed in a set of 634 carcinoma cell lines from the Cancer Cell Line Encyclopedia (CCLE) [23].

Next, we used single-sample GSEA (ssGSEA) to score each sample based on all signatures. ssGSEA is a well-known method for determining a single, aggregate score of the enrichment of a set of genes in the top of a ranked gene expression profile [24]. To choose the most reliable signatures we tested their performance in identifying the corresponding cell type in each of the data sources. To prevent overfitting, each signature learned from one data source was tested in other sources, but not in the data source from which it was originally inferred. To reduce biases resulting from a small number of genes and from the analysis of different platforms, instead of one signature per cell type, the top three ranked signatures from each data source were chosen. Altogether we generated 489 gene signatures corresponding to 64 cell types spanning multiple adaptive and innate immunity cells, hematopoietic progenitors, epithelial cells, and extracellular matrix cells (Additional file 3). Observing the scores in the 97 test primary cell type samples affirmed their ability to identify the corresponding cell type compared to other cell types across data sources (Additional file 2: Figure S2). We defined the raw enrichment score per cell type to be the average ssGSEA score from all the cell types' corresponding signatures.

Spillover compensation between closely related cell types

Our primary objective is to accurately identify enrichment of cell types in mixtures. To imitate such admixtures, we performed an array of simulations of gene expression combinations for different cell types to assess the accuracy and sensitivity of our gene signatures. We generated such in silico expression profiles using different data sources and different sets of cell types in mixtures and by choosing randomly one sample per cell type from all available samples in the data source. The simulations revealed that our raw scores reliably predict even small changes in the proportions of cell types, distinguish between most cell types, and are reliable in different transcriptomic analysis platforms (Additional file 2: Figure S3). However, the simulations also revealed that raw scores of RNA-seq samples are not linearly associated with the abundance and that they do not allow comparisons across cell types (Additional file 2: Figure S4). Thus, using the training samples we generated synthetic expression profiles by mixing the cell type of interest with other, non-related cell types. We then fit a formula that transforms the raw scores to cell type abundances. We found that the transformed scores showed resemblance to the known fractions of the cell types in simulations, thus enabling comparison of scores across cell types, and not just across samples (Additional file 2: Figure S5).

The simulations also revealed another limitation of the raw scores: closely related cell types tend to have correlating scores (Additional file 2: Figure S5). That is, scores may show enrichment for a cell type due to a "spillover effect" between closely related cell types. This problem mimics the spillover problem in flow cytometry, in which fluorescent signals correlate with each other due to spectrum overlaps. Inspired by the compensation method used in flow cytometry studies [25], we leveraged our simulations to generate a spillover matrix that allows correcting for correlations between cell types. To better compensate for low abundances in mixtures, we created a simulated dataset where each sample contains 25% of the cell type of interest with the rest from a non-related cell type and produced a spillover matrix, a representation of the dependencies of scores between different cell types.

Applying the spillover correction procedure on the pure cell types (Fig. 2a) and simulated expression profiles (Fig. 2b, c; Additional file 2: Figures S5 and S6) showed that this method was able to successfully reduce associations between closely related cell types. For example, we generated simulated mixtures using an independent data source of multiple cell types that was not used for the development of the method (GSE60424) [26], and used our method to infer the underlying abundances. We observed decent performance in recapitulating the cell type distributions. However, before correcting for spillovers, there were false associations between CD4+ and CD8+ T cells, as well as between monocytes and neutrophils. The spillover correction was able to reduce these associations significantly without harming the correlations on the diagonal (Fig. 2b). In addition, we generated simulated mixtures using the training samples (Additional file 2: Figure S5) and the test samples (Additional file 2: Figure S6). In the 18 simulated mixtures using the test samples, we observed an overall average decrease of 17.1% in significant correlations off the diagonal (Fig. 2c; Additional file 2: Figure S5). Unexpectedly, following the spillover compensation we observed slightly improved associations on the diagonal between the scores and the underlying abundances (1.4% average improvement).

Fig. 2. Evaluation of the performance of xCell using simulated mixtures. a An overview of adjusted scores for 43 cell types in 259 purified cell type samples from the Blueprint and ENCODE data sources (other data sources are in Additional file 2: Figure S4). Most signatures clearly distinguish the corresponding cell type from all other cell types. b A simulation analysis using GSE60424 as the data source [26], which was not used in the development of xCell. This data source contains 114 RNA-seq samples from six major immune cell types. Left: Pearson correlation coefficients using our method before spillover adjustment and after the adjustment. Dependencies between CD4+ T cells, CD8+ T cells, and NK cells were greatly reduced; spillover from monocytes to neutrophils was also removed. Right: Comparison of the correlation coefficients across the different methods. The first column corresponds to xCell's predictions of the underlying abundances of the cell types in the simulations (both color and pie chart correspond to average Pearson coefficients). Bindea, Charoentong, Palmer, Rooney, and Tirosh represent sets of signatures for cell types from the corresponding manuscripts. Newman refers to the inferences produced using CIBERSORT on the simulations. xCell outperformed the other methods in 17 of 18 comparisons. c Comparison of the correlation coefficients across the different methods based on 18 simulations generated using the left-out testing samples. Here rows correspond to methods and columns show the average Pearson coefficient for the corresponding cell type across the simulations. Independent simulations are available in Additional file 2: Figure S6. xCell outperformed the other methods in 64 of 67 comparisons.
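The idea behind spillover compensation can be sketched with a small linear system in base R. This is a simplified illustration, not the package's implementation: the matrix `K`, the abundances, and the `compensate` helper are hypothetical, the orientation of `K` is an assumption, and negative values are simply clamped to zero after an unconstrained solve.

```r
# Hypothetical spillover matrix K for three cell types: entry [i, j] is how much of
# cell type j's signal leaks into the score of cell type i (assumed orientation).
types <- c("CD4+ T-cells", "CD8+ T-cells", "Monocytes")
K <- matrix(c(1.0, 0.4, 0.0,
              0.3, 1.0, 0.0,
              0.0, 0.0, 1.0),
            nrow = 3, byrow = TRUE, dimnames = list(types, types))

# Made-up "true" abundances and the scores observed after leakage.
truth    <- c(0.20, 0.00, 0.10)
observed <- as.vector(K %*% truth)  # CD8+ T-cells score above 0 although absent

# Hypothetical helper: alpha scales how much of the dependency is removed.
compensate <- function(scores, K, alpha = 0.5) {
  K_alpha <- K * alpha   # shrink the off-diagonal dependencies
  diag(K_alpha) <- 1     # keep each cell type's own signal at full weight
  fixed <- solve(K_alpha, scores)
  pmax(fixed, 0)         # clamp: a compensated score should not be negative
}

full <- compensate(observed, K, alpha = 1)    # full compensation
half <- compensate(observed, K, alpha = 0.5)  # partial compensation

print(rbind(observed = observed, half = unname(half), full = unname(full)))
```

With full compensation the false CD8+ T-cell signal disappears entirely. A smaller alpha removes only part of it, which is the trade-off between reducing dependencies and risking the removal of real signal.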

Finally, many of the cell types we estimate are not expected to be in a given mixture; however, the pipeline we described will often produce non-zero scores. In the 18 test simulated mixtures, 56.4% of the scores for cell types that are not part of the mixture were non-negligible (> 0.001). To overcome this inadequacy, we introduce a statistical significance test of whether a produced enrichment score is not random—whether the cell type of interest is in the mixture. Using the reference training data sets, for each cell type we generated random mixtures of all cell types except the corresponding cell type, and calculated the cell type-adjusted scores. We then fit a beta distribution for each of the cell types and used these distributions to calculate the probability that the score of the corresponding cell type is present in the mixture by random (Additional file 2: Figure S7). Applying this procedure to the test simulated mixtures enabled detection of about half of the non-expected non-negligible scores as non-significant (46.9% change—from 56.4% non-negligible scores to 28.8% with p value > 0.2), while detecting as non-significant only 15.3% of non-negligible scores for cell types used for generating the mixture (from 88.6% non-negligible scores to 75.1%) (Additional file 4).This pipeline for generating adjusted cell type enrichment scores from gene expression profiles, which we named xCell, is available as an R package and a simple web tool (http://xCell.ucsf.edu/).Validation of enrichment scores in simulated expression profilesWe next compared the ability of xCell scores to infer the underlying cell type enrichments in simulated mixtures with a set of 53 previously published signatures corresponding to 26 cell types [6, 12, 27, 28] (Additional file 5). 
Our analyses showed that xCell outperformed the previously published signatures in recapitulating the underlying abundances, in mixtures generated using the training samples (Additional file 2: Figure S5) and the test samples (Additional file 2: Figure S6) and an independent data source (GSE60424 [26]) (Fig. 2b), in the vast majority of the comparable cell types (51 of 53 comparisons of mixtures generated using training samples, 46 of 49 using test samples, and 17 of 18 using GSE60424) (Fig. 2c). xCell showed overall better performance with all data sources used, proving its versatility across platforms. Importantly, our compensation technique was able to completely remove associations between cell types, while previously published signatures showed considerate dependencies between closely related cell types, such as between CD8+ T cells and NK cells (Additional file 2: Figure S8).In addition, we also compared xCell’s performance on test mixtures with that of CIBERSORT, a prominent deconvolution-based method [7]. Unlike signature-based methods, which output independent enrichment scores per cell type, the output from deconvolution-based methods is the inferred proportions of the cell types in the mixture. Similar to the performance comparisons using signatures, xCell also outperformed CIBERSORT in all comparable cell types, across all data sources (Fig. 2b, c; Additional file 2: Figures S5 and S6).Validation of enrichment scores with cytometry immunoprofilingIn addition to the simulated mixture analysis, we compared our estimates for cell type enrichments from gene expression profiles with mass spectrometry (CyTOF) immunophenotyping. We utilized independent publicly available studies in which a total of 165 individuals were studied for both gene expression from whole blood and FACS across 18 cell subsets from peripheral blood mononuclear cells (PBMCs; available from ImmPort, studies SDY311 and SDY420) [29]. 
We calculated xCell scores for each of the signatures using the studies’ expression profiles and correlated the scores with the FACS fractions of the cell subsets. Of the 14 cell types with at least 1% abundance, xCell was able to significantly recover 10 and 12 cell subsets in SDY311 and SDY420, respectively (Pearson correlation between calculated and actual cell counts p value < 0.05; Fig. 3). Comparing the performance of xCell to previously published signatures and CIBERSORT revealed that no other method was able to recover cell types that our method was not able to recover in both data sets (Fig. 3). In general, previous methods were able to recover signal only from major cell types, including B cells, CD4+ and CD8+ T cells, and monocytes, suggesting that their performance was not reliable in more specialized cell subsets. While our method also struggled in these cell subsets, it still showed significant correlations with most of the cell subsets, including effector memory CD8+ T cells, naïve CD4+ T cells, and naïve B cells. In addition, xCell was more reliable in CD4+ T cells and monocytes and equally reliable in B cells (Fig. 3). In CD8+ T cells xCell was outperformed by methods depending solely on CD8A expression, which may not serve as a reliable biomarker in cancer settings (Additional file 2: Figure S9).

Fig. 3 Comparison of digital dissection methods with flow cytometry counts. Left: Scatter plots of CyTOF fractions in PBMCs vs. cell type scores from whole blood of 61 samples from SDY311 (top) and 104 samples from SDY420 (bottom). Only the top correlating cell types in each study are shown. Right: Correlation coefficients produced by our method compared to other methods. Only cell types with abundance of at least 1% on average, as measured by CyTOF, are shown. Non-significant correlations (p value > 0.05) are marked with a gray “x”.

Despite the generally improved ability of xCell to estimate cell populations, we do note that in some cases the correlations we observed were relatively low, emphasizing the difficulty of estimating cell subsets in mixed samples, and the need for cautious examination and further validation of findings.

Cell type enrichment in tumor samples

We next applied our methodology to 9947 primary tumor samples across 37 cancer types from the TCGA and TARGET projects [30] (Additional file 2: Figure S10). Average scores of cell types in each cancer type affirmed prior knowledge of expected enriched cell types, validating the power of our method for identifying the cell type of origin of different cancer types. For example, epithelial cells were enriched in carcinomas, keratinocytes in squamous cell carcinomas, mesangial cells in kidney cancers, chondrocytes in sarcoma, neurons in brain tumors, hepatocytes in hepatocellular carcinoma, melanocytes in melanomas, B cells in B-cell lymphoma, T cells in thymoma, myeloid cells in acute myeloid leukemia, and lymphocytes in acute lymphocytic leukemia (Fig. 4a). While these results are expected, it is reassuring that xCell can be applied to human cancers.

Fig. 4 Cell type enrichment analysis in tumors. a Average scores for nine cell types across 24 cancer types from TCGA (The Cancer Genome Atlas). Scores were normalized across rows. Signatures were chosen such that they are the cell of origin of a cancer type or the most significant signature of the cancer type compared to all others. b t-SNE (t-Distributed Stochastic Neighbor Embedding) plot of 8875 primary cancer samples from TCGA and TARGET colored by cancer type. The t-SNE plot was generated using the enrichment scores of 48 non-epithelial, non-stem cell, and non-cell type-specific scores. Many of the cancer types create distinct clusters, emphasizing the important role of the tumor microenvironment in characterizing tumors.

Most of the cell types we infer are part of the complex cellular heterogeneity of the tumor microenvironment. We hypothesized that an additive combination of all cell types’ scores would be negatively correlated with tumor purity. Thus, we generated a microenvironment score as the sum of all immune and stromal cell types. We then correlated this microenvironment score with our previously generated purity estimations, which are based on copy number variations, gene expression, DNA methylation, and H&E slides [31]. Our analysis showed highly significant negative correlations in all cancer types, suggesting this score as a novel measurement for tumor microenvironment abundance (Additional file 2: Figure S11).

Finally, to provide insight into the potential of xCell to portray the tumor microenvironment, we plotted all tumor samples based on their cell type scores. Using different sets of cell type inferences, we applied the t-Distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction technique [32] (Additional file 2: Figure S12). Interestingly, the analysis revealed that unique microenvironment compositions characterize different cancer indications. For example, prostate cancers form a unique cluster based on their immune cell type composition, while head and neck tumors are distinguished by their stromal composition. Remarkably, only when performing the analysis with all immune and stromal cell types did clear clusters form distinguishing between most of the cancer types (Fig. 4b), demonstrating the unique composition of the tumor microenvironment, which differs between cancer types. This notion emphasizes the importance of portraying the full cellular heterogeneity of the tumor microenvironment for the study of cancer. 
To this end, we calculated the enrichment scores for 64 cell types across the TCGA spectrum, and provide these data with the hope that they will serve the research community as a resource to further explore novel associations of cell type enrichment in human tumors (Additional file 6).

Discussion

Recently, many studies have shown different methodologies for the digital dissection of cancer samples [3, 6, 9–13]. These studies have provided novel insights into cancer research and related to therapy efficacy. However, it is important to remember that the methods that have been applied for portraying the tumor microenvironment have only attained limited validation, and it is unclear how reliable their estimations are. In this study, we took a step back and focused on generating cell type gene scores that could reliably estimate enrichment of cell types. Our method, which is gene signature-based, is more reliable due to its reliance on a group of signatures for each cell type, learned from multiple data sources, which increases the ability to distinguish the signal from the noise. Our method also integrates a novel approach to remove dependencies between cell types, which allows better reliability when studying closely related cell types.

To develop xCell, we collected the most comprehensive resource to date of primary cell types, spanning the largest set of human cell types. We then performed an extensive validation of the predicted cell type inferences in mixed samples. Our method for choosing a set of signatures that are reliable across several data sources has proven to be beneficial, as our scores robustly outperformed all available methods in predicting the abundance of cell types in in silico mixtures and blood samples. 
Based on our evaluation, xCell provides the most accurate and sensitive way to identify enrichment of many cell types in an admixture, allowing the detection of subtle differences in the enrichment of a particular cell type in the tumor microenvironment with high confidence.

It is important to note that xCell, as all other methods, performed significantly better in simulated mixtures than in real mixtures. Several technical reasons account for this discrepancy. First, the cytometry analyses were performed on PBMCs, while the gene expression profiles were generated from whole blood. Second, not all genes required by xCell were present; in fact, in SDY420 only 54.5% of the genes required by xCell were available. However, other explanations for the lower success when inferring abundances in real samples are possible: the expression patterns of marker genes in mixtures may well differ from those in purified cells. Recent technologies such as single-cell RNA-sequencing may be able to clarify how much this may perturb the analyses.

We chose to apply a gene signature enrichment approach over deconvolution methods because of several advantages that the former provides. First, gene signatures are rank-based and are therefore suitable for cross-platform transcriptome measurements. We showed here that our scores reliably predict enrichment when using different RNA-seq techniques and different microarray platforms. They are agnostic to normalization methods or concerns related to batch effects, making them robust to both technical and biological noise. Second, there is no decline in performance with increasing numbers of cell types. The tumor microenvironment is a rich milieu of cell types, and our analyses show enrichment of many cells derived from mesenchyme in tumors. A partial portrayal of the tumor microenvironment may result in misleading findings. 
Finally, gene signatures are simple and can easily be adjusted.

The main disadvantage of gene signatures is that they do not discriminate between closely related cell types well, though it is not clear how well other methods distinguish between such cell types [10]. Our method takes this into account and uses a novel technique, inspired by flow cytometry analyses, to remove such dependencies between closely related cell types. It is important to note that, until this step, the cell type scores are independent of each other, and a false inference of one cell type will not harm all other cell types. However, the spillover correction adjustment removes this strict independence between cell type inferences, as in deconvolution methods. Yet, the compensation is very limited, and between most cell types there is no compensation at all; thus, most of the inferences are still independent.

Despite the utility of our signatures for characterizing the tumor microenvironment, several issues require further investigation. While our signatures outperformed previous methods, it is important to note that our correlations with direct measurements were still far from perfect. More expression data from pure cell types, especially cell types with limited samples, and more expression data coupled with cytometry counts from various tissue types will allow more precise definition of signatures and, in turn, better reliability. Meanwhile, it is necessary to refer to inferences made by our method or other methods with caution. Discoveries made using digital dissection methods must be rigorously validated using other technologies to avoid hasty conclusions.

Another limitation of our method is that the inferences are strictly enrichment scores, and cannot be interpreted as proportions. This is due to the inability to translate the minimum and maximum scores produced by ssGSEA to clear proportions. 
Thus, while our method attempts to calibrate the scores to resemble proportions, these cannot be reliably used as such. This limitation also prevents us from assigning statistical significance to the inferences by calculating an empirical p value, as suggested by Newman et al. [7].

Conclusions

Tissue dissection methods are an emerging tool for large-scale characterization of tumor cellular heterogeneity. These approaches do not rely on tissue dissociation, as opposed to single-cell techniques, and therefore provide an effective tool for dissecting solid tumors. The wide availability of public gene expression profiles allows these methods to be efficiently performed on hundreds of historical cohorts spanning thousands of patients, and to associate them with clinical outcomes. Here we present the most comprehensive collection of gene expression enrichment scores for cell types. Our methodology for generating cell type enrichment scores and adjusting them to cell type proportions allowed us to create a powerful tool that is the most reliable and robust tool currently available for identifying cell types across data sources. We provide a simple web tool, xCell (http://xCell.ucsf.edu/), to the community and hope that further studies will utilize it for the discovery of novel predictive and prognostic biomarkers, and new therapeutic targets.

Methods

Data sources

Signature data sources

RNA-seq and cap analysis gene expression (CAGE) normalized FPKM values were downloaded from the FANTOM5 [33], ENCODE [34], and Blueprint data portals [19]. Raw Affymetrix microarray CEL files were downloaded from the Gene Expression Omnibus (GEO), accessions GSE22886 (IRIS) [35], GSE24759 (Novershtern) [36], and GSE49910 (HPCA) [37], and analyzed using the Robust Multi-array Average (RMA) procedure on probe-level data using Matlab functions. The analysis was performed using custom CDF files downloaded from Brainarray [38]. 
All samples were manually annotated to 64 cell types (Additional file 1).

Other expression data sources

RNA-seq normalized counts were downloaded from the GEO, accession GSE60424 [39]. Illumina HumanHT-12 V4.0 Beadchip data of PBMC samples and the accompanying CyTOF data were downloaded from ImmPort accession SDY311 [40] and quantile normalized using Matlab functions. Similarly, Agilent Whole Human Genome 4 × 44 K slide data of PBMC samples and the accompanying CyTOF data were downloaded from ImmPort accession SDY420 [41] and quantile normalized using Matlab functions. Multiple probes per gene were collapsed using averages. RNA-seq data from the Cancer Cell Line Encyclopedia (CCLE) was obtained using the PharmacoGx R package [42]. RSEM levels for 9947 primary tumor samples from TCGA and TARGET were downloaded from https://toil.xenahubs.net. Published signatures were collected from their corresponding papers [6, 12, 27, 28] (Additional file 5).

In silico simulated mixtures

We generated several types of simulated mixtures, but all are based on the same pipeline:

1. Given a data source of pure cell types, choose n cell types available in the data and choose a random fraction for each cell type (the fractions sum to 1). We denote this vector of fractions as f.

2. Generate an expression matrix of pure cell types, M, with n columns. The generation of the expression matrix varied between the experiments we performed: a) Synthetic mixtures for learning the power coefficient and spillover matrix were generated using the median expression profile of each cell type, creating a homogeneous and noiseless mixture. b) Training mixtures were generated by randomly choosing one of the multiple available samples for each of the cell types chosen to be included in the mixture. This random selection introduces significant noise into the mixture, and between mixtures in the mixture set, which reflects the variation we observe in real datasets. c) Test mixtures, where only one sample per cell type was available, were generated by adding random noise for each gene of up to 20% of the expression level. Cell types included in a mixture were chosen randomly, while avoiding cell types that cannot be distinguished (e.g., CD4+ T cells and CD4+ memory T cells).

3. To generate a simulated expression profile we use the formula M × f, which returns one simulated mixed gene expression profile based on additive expression of the expression profiles of the cell types. This process is then repeated 500 times with different f and different M, generating distinct mixtures using the same set of cell types. As explained in 2b and 2c, M is recreated for each simulation by choosing a random sample (in b) or by adding random noise (in c).
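The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the toy matrix `pure`, the helper names, and the choice of symmetric per-entry noise of up to ±20% (a stand-in for the test-mixture case c) are all hypothetical.

```python
import random


def random_fractions(n, rng):
    """Draw n positive random weights and rescale them so they sum to 1."""
    raw = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def add_noise(M, rng, max_noise=0.20):
    """Perturb every entry by up to +/- max_noise of its value (assumed symmetric)."""
    return [[v * (1.0 + rng.uniform(-max_noise, max_noise)) for v in row] for row in M]


def mix(M, f):
    """Return M x f: one mixed profile, with M as genes (rows) by cell types (columns)."""
    return [sum(v * w for v, w in zip(row, f)) for row in M]


def simulate(M, n_mixtures, seed=0):
    """Generate n_mixtures profiles, redrawing f and a noisy copy of M each time."""
    rng = random.Random(seed)
    n_types = len(M[0])
    out = []
    for _ in range(n_mixtures):
        f = random_fractions(n_types, rng)
        out.append((f, mix(add_noise(M, rng), f)))
    return out


# Toy pure profiles: 3 genes (rows) x 2 cell types (columns), made-up values.
pure = [[10.0, 0.0], [0.0, 20.0], [5.0, 5.0]]
mixtures = simulate(pure, n_mixtures=500)
```

Each element of `mixtures` pairs the known fractions f with the simulated profile, which is what allows inferred scores to be compared against the underlying abundances.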

The xCell development pipeline

A workflow of the xCell development pipeline can be found in Additional file 2: Figure S1, and is described in detail below.

Filtering cancer genes

In a previous study [16] we used CCLE to calculate the number of cell lines that over-express each gene (twofold more than the peak of the expression distribution). For generating the signatures we only use genes that have an overexpression rate of less than 5% (fewer than 32 of the 634 carcinoma cell lines). We use this stringent threshold to eliminate genes that tend to be overexpressed in tumors, regardless of the cellular composition. Of 18,988 genes analyzed, 9506 were identified as not being overexpressed in tumors. For signatures of cell types that may be the cell of origin of solid tumors, including epithelial cells, sebocytes, keratinocytes, hepatocytes, melanocytes, astrocytes, and neurons, we used all genes.

Generating gene signatures

Expression profiles were reduced to 10,808 genes that are shared across all six data sources. Gene expression was converted to log scale by adding 3, to restrict inclusion of small changes, followed by log2 conversion. In each group of samples corresponding to a cell type we calculated the 10th, 25th, 33.3rd, and 50th percentiles of low expression ($Q1_q$), and the 90th, 75th, 66.6th, and 50th percentiles of high expression ($Q2_{1-q}$). For cell type A we calculated the difference for each gene between $Q1_q(A)$ and $\max(Q2_{1-q}(\text{all other cell types}))$. We repeated this also for the second and third largest $Q2_{1-q}(\text{all other cell types})$. The signature of cell type A consists of all genes that pass a threshold. We used different thresholds here: 0, 0.1, log2(1.5), log(2), 3, 4, and 5. We repeated this procedure for each of the six data sources independently. Only gene sets of at least eight genes and no more than 200 genes were retained. This scheme yielded 6573 gene signatures corresponding to 64 cell types. We calculated single-sample gene set enrichment analysis (ssGSEA) for each of those signatures to score each sample in each of the data sources using the GSVA R package [43].

Choosing the “best” signature

For each signature we computed the t-statistic between the scores of the corresponding cell type compared to all other samples, omitting samples from parental or descendant cell types (e.g., for CD4+ naïve T cells, the general CD4+ T cells were not used in the calculations). The procedure was performed for each data source where the corresponding cell type was available, except the data source from which the signature was learned. Thus, a signature was only chosen if it is reliable in a data source it was not trained upon. If the cell type was available in only one data source, the signature was tested in that data source. From each data source the top three signatures were chosen. Altogether we chose 489 signatures corresponding to 64 cell types (across the six data sources we have 163 cell types; Additional file 3). The raw score for a cell type is the average of all corresponding signatures, after shifting scores of each signature to have a minimal score of 0 across all samples.

Learning parameters for raw score transformation

For each cell type we created a synthetic mixture using the median expression profile of the cell type (cell X) and an additional “control” cell type. For the control in sequencing-based data sources we used multipotent progenitor (MPP) cell samples or endothelial cell samples, because both are found in all the sequencing-based data sets. 
In microarray-based data sources we used erythrocytes and monocytes. We generated such mixtures using increasing levels of the corresponding cell type (0.8% of cell X and 99.2% control, 1.6% cell X and 98.4% control, etc.). We noticed two problems with the raw scores. First, ssGSEA scores have different distributions between different signatures, so a score from one signature cannot be compared with a score from another signature. Second, in sequencing-based data, the score was not linearly associated with the underlying levels of the cell type. We thus designed a transformation pipeline for the scores (which is applied to sequencing-based and microarray-based datasets separately): for each cell type, using the synthetic mixtures, we first shifted the scores to 0 using the minimal score (which corresponded to mixtures containing 0.8% of the cell type) and divided by 5000. We then fit a power function to the scores corresponding to abundances of 0.8 to 25.6%. We used this range because we are mostly interested in identifying cell types with low abundance, and above that range the function’s exponential increase may interfere with precise fitting. The power coefficient was then averaged across the data sources where the cell type is available (we denote this vector as P). After adjusting the score using the learned power coefficient, we fit a linear curve, and used the learned slope as a calibration parameter for the adjusted scores (denoted as V1).

Learning the spillover compensation reference matrix

Another limitation that was observed in the mixtures is the dependencies between closely related cell types: scores that predict enrichment of one cell type also predict enrichment of another cell type, which might not even be in the mixture. To overcome this problem we created a reference matrix of “spillovers” between cell types. 
Below we focus on the generation of the sequencing-based spillover matrix, but an equivalent process was performed to generate the microarray-based spillover matrix. We first generated a synthetic mixture set, where each mixture contains 25% of each of the cell types (median expression) and 75% of a “control” cell type, as in the previous section. We then calculated raw cell type scores and transformed them using the learned coefficients as explained above. We combined all sequencing-based data sources together by using the average scores, and completed the matrix to be 64 × 64 by adding columns from cell types that are not present in any of the sequencing-based data using the microarray reference matrix. We then normalized each row of cell type scores by dividing it by the diagonal (denoted as K; in the spillover matrix rows are cell type scores and columns are cell type samples). The diagonal, before the normalization, is also used for calibration (denoted as V2). The “spillover” between a cell type score (x) and another cell type (y) is the ratio between x and y. Finally, we cleaned the spillover matrix to not compensate between parent and descendant cell types by compensating parent cell types only with other parent cell types (CD4+ T cells are compensated against CD8+ T cells, but not CD8+ Tem), and compensating child cell types only against other child cell types from the same parent and all other parents, but not child cell types from other parents. Some of the compensations were too strong, removing correlations between cell types and their corresponding signatures; thus, we limited the compensation levels, off the diagonal, to 0.5. The spillover matrix, power, and calibration coefficients are available in Additional file 7.

Calculating scores for a mixture

The input comprises a gene expression data set normalized to gene length (such as FPKM or TPM), where rows are genes and columns are samples (N is the number of samples). 
Duplicate gene names are combined together. xCell uses a set of 10,808 genes for the scoring. It is recommended to use data sets that contain at least the majority of these genes. Missing values in a sample are treated as missing genes (the xCell web tool requires intersection of at least 5000 genes). It is also recommended to use as many samples as possible, with high expected variation in cell type fractions. (1) ssGSEA scores are calculated for each of the 489 gene signatures. (2) Scores of all signatures corresponding to a cell type are averaged. The result is a matrix (A) with 64 rows and N columns. (3) Each element in the scores matrix ($A_{ij}$) is transformed using the following formula:$$ T_{ij}=\frac{\left(\left(A_{ij}-\min\left(A_i\right)\right)/5000\right)^{P_i}}{V1_i\cdot V2_i} $$
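As a minimal sketch of step (3), the transformation can be written directly from the formula: shift each row by its minimum, divide by 5000, raise to the cell type's power coefficient, and divide by the product of the two calibration values. The function name and the toy values of `raw`, `P`, `V1`, and `V2` below are hypothetical; the real coefficients are the ones learned as described in the preceding sections.

```python
def transform_scores(A, P, V1, V2):
    """Apply T_ij = (((A_ij - min(A_i)) / 5000) ** P_i) / (V1_i * V2_i).

    A is a list of rows (one row per cell type, one column per sample);
    P, V1 and V2 hold one coefficient per cell type.
    """
    T = []
    for i, row in enumerate(A):
        row_min = min(row)
        T.append([((a - row_min) / 5000.0) ** P[i] / (V1[i] * V2[i]) for a in row])
    return T


# Toy raw scores for 2 cell types across 3 samples (made-up numbers).
raw = [[1000.0, 3500.0, 6000.0],
       [200.0, 200.0, 5200.0]]
T = transform_scores(raw, P=[2.0, 1.0], V1=[0.5, 1.0], V2=[1.0, 2.0])
```

Because the minimum is taken per row across all samples in the run, the transformed scores depend on which samples are analyzed together.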

The output is matrix T of transformed scores. Different P, V1, and V2 are used for sequencing-based and microarray-based datasets. (4) Spillover compensation is then performed for each row using linear least squares that minimizes the following (as performed in flow cytometry analyses and explained in Bagwell and Adams [25]):$$ \left\Vert K\cdot x-{T}_i\right\Vert, \mathrm{such}\kern0.17em \mathrm{that}\;x\ge 0 $$
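To make the constrained least-squares step concrete, the sketch below solves min ‖K·x − t‖ subject to x ≥ 0 for one vector of transformed scores t. It uses plain projected gradient descent as a stand-in solver, which is an assumption made for illustration and not necessarily the solver used by the authors; the 2 × 2 matrix `K`, the vector values, and the function names are hypothetical. The `alpha` scaling of off-diagonal entries follows the description in the text.

```python
def scale_off_diagonal(K, alpha):
    """Multiply every off-diagonal entry of K by alpha, leaving the diagonal as is."""
    return [[v if i == j else alpha * v for j, v in enumerate(row)]
            for i, row in enumerate(K)]


def nnls_projected_gradient(K, t, iterations=2000):
    """Minimize ||K x - t||^2 subject to x >= 0 by projected gradient descent."""
    n_rows, n_cols = len(K), len(K[0])
    # 1 / (squared Frobenius norm) is a safe step size: it never exceeds
    # 1 / (largest eigenvalue of K^T K).
    step = 1.0 / sum(v * v for row in K for v in row)
    x = [0.0] * n_cols
    for _ in range(iterations):
        residual = [sum(K[r][c] * x[c] for c in range(n_cols)) - t[r]
                    for r in range(n_rows)]
        grad = [sum(K[r][c] * residual[r] for r in range(n_rows))
                for c in range(n_cols)]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, x[c] - step * grad[c]) for c in range(n_cols)]
    return x


# Toy spillover matrix for two related cell types (made-up numbers).
K = [[1.0, 0.4],
     [0.2, 1.0]]
K_half = scale_off_diagonal(K, alpha=0.5)

# Scores produced by true levels (0.3, 0.1) under K_half are recovered.
x_recovered = nnls_projected_gradient(K_half, [0.32, 0.13])

# When the unconstrained solution would be negative, the constraint clips it to 0.
x_clipped = nnls_projected_gradient(K_half, [0.0, 0.1])
```

With `alpha = 0` the scaled matrix becomes the identity and the scores pass through unchanged, whereas `alpha = 1` applies the full spillover matrix.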

All x values are then combined to create the final xCell scores. The compensation may result in deteriorating real associations; thus, we provide a scaling parameter (alpha) to multiply all off-diagonal cells in matrix K. In all experiments in this study we used alpha = 0.5. Different K matrices are used for sequencing-based and array-based data.

Significance assessment

We provide a statistical significance assessment for the presence of a cell type in the mixture by learning score distributions for cell types in random mixtures. For each cell type X, we generate a random matrix as follows: In each reference data set we find all cell types corresponding to samples, except X and its parent or descendants (if X is CD8+ Tem cells, then we also exclude CD8+ T cells; if X is CD8+ T cells, we exclude all CD8+ cell types). We then use the same procedure we used for generating training samples, but adding an additional 5% random noise. The main difference here is that we randomly mix in all cell types (except X) and not just a small subset. We then run the xCell pipeline for these random mixtures. In most cell types the produced scores show similarity to a beta distribution; thus, using the fitdistr function from the MASS package, we fit such a distribution for each of the mixtures we generated (e.g., for a mixture excluding cell type X we fit a beta distribution for cell type X). In five of the cell types the scores from the random mixtures consistently produced 0; thus, we define those distributions as constant 0.001 (Additional file 2: Figure S7). Given an input data set, we can now calculate a p value for each xCell score with the null hypothesis that the cell type is not present in the mixture. The actual distributions we use to calculate the p values are combinations of those learned from FANTOM5, Blueprint, and ENCODE for sequencing-based input, and IRIS, HPCA, and Novershtern for microarray-based input. 
The p value for a score of a cell type in a sample is the fraction of the corresponding cell type’s fitted distribution that exceeds the score. In the testing samples we used a threshold of 20% to define a non-significant score. We chose this threshold as a trade-off between flagging non-negligible scores of cell types that are not in the mixture and failing to detect scores of cell types that are in the mixture, which would reduce the power of estimating the underlying cell type fractions (Additional file 4).

Cytometry analyses

Gene expression and cytometry data were downloaded from ImmPort (SDY311 [40] and SDY420 [41]). The gene expression data were quantile normalized using Matlab functions, and multiple probes per gene were collapsed using averages. The cytometry data counts were divided by the viable/singlet counts. In the SDY311 dataset, ten patients had two replicates of expression profiles, and those were averaged. Two outlier samples in the cytometry data set were removed from further analyses (SUB134240, SUB134283).

Other tools

The CIBERSORT web tool was used for inferring proportions using the expression profile (https://cibersort.stanford.edu). CIBERSORT results for activated and resting cell types were combined; B cell and CD4+ T cell percentages are the combination of all their subtypes. t-SNE plots were produced using the Rtsne R package. Purity measurements were obtained from our previous publication [31]. Correlation plots were generated using the corrplot R package.
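The idea behind the significance assessment, that a score is compared with the scores the same cell type receives in mixtures where it is absent, can be illustrated with a simplified stand-in. The sketch below does not fit a beta distribution as described in the text; it instead uses the empirical upper-tail fraction of a set of null scores (with a common add-one correction), which is an assumption made to keep the example dependency-free. The null scores and function names are hypothetical.

```python
def empirical_p_value(score, null_scores):
    """Upper-tail fraction of null scores at least as large as the observed score.

    The +1 terms keep the estimate away from exactly 0 for a finite null sample.
    """
    exceed = sum(1 for s in null_scores if s >= score)
    return (exceed + 1) / (len(null_scores) + 1)


def is_significant(score, null_scores, threshold=0.2):
    """Apply the 20% threshold mentioned in the text to the empirical p value."""
    return empirical_p_value(score, null_scores) <= threshold


# Made-up null scores standing in for scores of a cell type in random
# mixtures that do not contain it.
null = [i / 100 for i in range(100)]

p_high = empirical_p_value(0.895, null)   # only the 10 largest null scores exceed it
p_mid = empirical_p_value(0.495, null)    # half of the null scores exceed it
```

In the procedure described above, the tail probability would instead be read from the fitted beta distribution of the corresponding cell type.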

ReferencesGalon J, Costes A, Sanchez-Cabo F, Kirilovsky A, Mlecnik B, Lagorce-Pagès C, et al. Type, density, and location of immune cells within human colorectal tumors predict clinical outcome. Science. 2006;313:1960–4.Article

CAS

PubMed

Google Scholar

Hanahan D, Coussens LM. Accessories to the crime: functions of cells recruited to the tumor microenvironment. Cancer Cell. 2012;21:309–22.Article

CAS

PubMed

Google Scholar

Gentles AJ, Newman AM, Liu CL, Bratman SV, Feng W, Kim D, et al. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat Med. 2015;21:938–45.Article

CAS

PubMed

PubMed Central

Google Scholar

Abbas AR, Wolslegel K, Seshasayee D, Modrusan Z, Clark HF. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS One. 2009;4:e6098.Shen-Orr SS, Gaujoux R. Computational deconvolution: extracting cell type-specific information from heterogeneous samples. Curr Opin Immunol. 2013;25:571–8.Article

CAS

PubMed

Google Scholar

Rooney MS, Shukla SA, Wu CJ, Getz G, Hacohen N. Molecular and genetic properties of tumors associated with local immune cytolytic activity. Cell. 2015;160:48–61.Article

CAS

PubMed

PubMed Central

Google Scholar

Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, et al. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015;12:453–7.Article

CAS

PubMed

PubMed Central

Google Scholar

Newman AM, Alizadeh AA. High-throughput genomic profiling of tumor-infiltrating leukocytes. Curr Opin Immunol. 2016;41:77–84.Article

CAS

PubMed

Google Scholar

Angelova M, Charoentong P, Hackl H, Fischer ML, Snajder R, Krogsdam AM, et al. Characterization of the immunophenotypes and antigenomes of colorectal cancers reveals distinct tumor escape mechanisms and novel targets for immunotherapy. Genome Biol. 2015;16:64.Article

PubMed

PubMed Central

Google Scholar

Li B, Severson E, Pignon J-C, Zhao H, Li T, Novak J, et al. Comprehensive analyses of tumor immunity: implications for cancer immunotherapy. Genome Biol. 2016;17:14.Article

Google Scholar

Iglesia MD, Parker JS, Hoadley KA, Serody JS, Perou CM, Vincent BG. Genomic Analysis of immune cell infiltrates across 11 tumor types. J Natl Cancer Inst. 2016;108:djw144.Charoentong P, Finotello F, Angelova M, Mayer C, Efremova M, Rieder D, et al. Pan-cancer immunogenomic analyses reveal genotype-immunophenotype relationships and predictors of response to checkpoint blockade. Cell Rep. 2017;18:248–62.Article

CAS

PubMed

Google Scholar

Şenbabaoğlu Y, Gejman RS, Winer AG, Liu M, Van Allen EM, de Velasco G, et al. Tumor immune microenvironment characterization in clear cell renal cell carcinoma identifies prognostic and immunotherapeutically relevant messenger RNA signatures. Genome Biol. 2016;17:231.Article

PubMed

PubMed Central

Google Scholar

Pattabiraman DR, Weinberg RA. Tackling the cancer stem cells--what challenges do they pose? Nat Rev Drug Discov. 2014;13:497–512.Article

CAS

PubMed

PubMed Central

Google Scholar

Turley SJ, Cremasco V, Astarita JL. Immunological hallmarks of stromal cells in the tumour microenvironment. Nat Rev Immunol. 2015;15:669–82.Article

CAS

PubMed

Google Scholar

Aran D, Lasry A, Zinger A, Biton M, Pikarsky E, Hellman A, et al. Widespread parainflammation in human cancer. Genome Biol BioMed Central. 2016;17:145.Article

Google Scholar

Lizio M, Harshbarger J, Shimoji H, Severin J, Kasukawa T, Sahin S, et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol. 2015;16:22.

Consortium EP, Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.

Blueprint Epigenome Project. 2015. http://www.blueprint-epigenome.eu/. Accessed 3 May 2016.

Abbas AR, Baldwin D, Ma Y, Ouyang W, Gurney A, Martin F, et al. Immune response in silico (IRIS): immune-specific genes identified from a compendium of microarray expression data. Genes Immun. 2005;6:319–31.

Novershtern N, Subramanian A, Lawton LN, Mak RH, Haining WN, McConkey ME, et al. Densely interconnected transcriptional circuits control cell states in human hematopoiesis. Cell. 2011;144:296–309.

Mabbott NA, Baillie JK, Brown H, Freeman TC, Hume DA. An expression atlas of human primary cells: inference of gene function from coexpression networks. BMC Genomics. 2013;14:632.

Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483:603–7.

Barbie DA, Tamayo P, Boehm JS, Kim SY, Moody SE, Dunn IF, et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature. 2009;462:108–12.

Bagwell CB, Adams EG. Fluorescence spectral overlap compensation for any number of flow cytometry parameters. Ann N Y Acad Sci. 1993;677:167–84.

Linsley PS, Speake C, Whalen E, Chaussabel D. Copy number loss of the interferon gene cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis. PLoS One. 2014;9:e109760.

Bindea G, Mlecnik B, Tosolini M, Kirilovsky A, Waldner M, Obenauf AC, et al. Spatiotemporal dynamics of intratumoral immune cells reveal the immune landscape in human cancer. Immunity. 2013;39:782–95.

Tirosh I, Izar B, Prakadan SM, Wadsworth MH, Treacy D, Trombetta JJ, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science. 2016;352:189–96.

Bhattacharya S, Andorf S, Gomes L, Dunn P, Schaefer H, Pontius J, et al. ImmPort: Disseminating data to the public for the future of immunology. Immunol Res. 2014;58:234–9.

Vivian J, Rao AA, Nothaft FA, Ketchum C, Armstrong J, Novak A, et al. Toil enables reproducible, open source, big biomedical data analyses. Nat Biotechnol. 2017;35:314–6.

Aran D, Sirota M, Butte AJ. Systematic pan-cancer analysis of tumour purity. Nat Commun. 2015;6:8971.

van der Maaten L, Hinton GE. Visualizing high-dimensional data using t-SNE. J Mach Learn Res. 2008;9:2579–605.

FANTOM5 project. http://fantom.gsc.riken.jp/5/. Accessed 2 May 2016.

ENCODE: Encyclopedia of DNA Elements. https://www.encodeproject.org/. Accessed 5 May 2016.

Abbas AR et al. Expression profiles from a variety of resting and activated human immune cells. 2010. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22886. Accessed 7 Nov 2014.

Novershtern N et al. Densely interconnected transcriptional circuits control cell states in human hematopoiesis. 2011. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24759. Accessed 11 Nov 2014.

Mabbott NA et al. An Expression Atlas of Human Primary Cells: Inference of Gene Function from Coexpression Networks. 2013. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49910. Accessed 8 July 2016.

Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005;33:e175.

Speake C et al. Next generation sequencing of human immune cell subsets across diseases. 2015. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60424. Accessed 5 Jan 2017.

Immport. 2010. http://www.immport.org/immport-open/public/study/study/displayStudyDetail/SDY311. Accessed 17 July 2016.

Immport. 2010. http://www.immport.org/immport-open/public/study/study/displayStudyDetail/SDY420.

Smirnov P, Safikhani Z, El-Hachem N, Wang D, She A, Olsen C, et al. PharmacoGx: an R package for analysis of large pharmacogenomic datasets. Bioinformatics. 2016;32:1244–6.

Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. 2013;14:7.

Aran D. xCell R package and development scripts. 2017. http://doi.org/10.5281/zenodo.1004662.

Acknowledgments

We thank Marina Sirota and Thomas Peterson for helpful discussions. We thank the anonymous reviewers for comments on an initial draft of this manuscript, which resulted in an improved publication.

Funding

This work was supported by the Gruss Lipper Postdoctoral Fellowship to D.A., and the National Cancer Institute (U24 CA195858) and the National Institute of Allergy and Infectious Diseases (Bioinformatics Support Contract HHSN272201200028C) to A.J.B. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Availability of data and materials

The xCell R package for generating the cell type scores and R scripts for the development of xCell are available at https://github.com/dviraran/xCell (under the GNU 3.0 license) and deposited to Zenodo (assigned DOI http://doi.org/10.5281/zenodo.1004662) [44].

Author information

Authors and Affiliations

Institute for Computational Health Sciences, University of California, San Francisco, California, 94158, USA

Dvir Aran, Zicheng Hu & Atul J. Butte

Contributions

DA conceived and led the development of the algorithm, conducted all the analyses, prepared the software, and wrote the manuscript. ZH contributed to the design of the algorithm. AJB supervised the project. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Dvir Aran or Atul J. Butte.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional files

Additional file 1: Summary table of primary cell types used in this study. (XLSX 51 kb)

Additional file 2: Figures S1–S12. (PDF 7074 kb)

Additional file 3: The 489 cell type gene signatures. (XLSX 417 kb)

Additional file 4: Summary table of the statistical significance analysis in the testing mixtures. (XLSX 50 kb)

Additional file 5: The 53 previously published cell type gene signatures. (XLSX 58 kb)

Additional file 6: xCell scores of 9947 samples from TCGA and TARGET. (TSV 6996 kb)

Additional file 7: The spillover matrix and calibrating coefficients. (XLSX 110 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Aran, D., Hu, Z. & Butte, A.J. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol 18, 220 (2017). https://doi.org/10.1186/s13059-017-1349-1

Received: 06 March 2017

Accepted: 10 October 2017

Published: 15 November 2017

DOI: https://doi.org/10.1186/s13059-017-1349-1

Keywords

Cap Analysis Gene Expression (CAGE); Spillover Matrix; Cell Type Enrichment; Cancer Cell Line Encyclopedia (CCLE); Pure Cell Types


Genome Biology

ISSN: 1474-760X


xCell: an expert tool for immune infiltration analysis of gene expression data - Zhihu

Author: 生信果

xCell is a tool that estimates immune cell infiltration from gene expression data. It identifies candidate immune cell subsets and scores their relative abundance in a tissue, and it is widely used for immune infiltration analysis in cancer, autoimmune disease, and other conditions. This post walks through a practical xCell analysis of an expression dataset.

Data preprocessing

Install and load the xCell package:

install.packages('devtools')
devtools::install_github('dviraran/xCell')
library(xCell)

xCell immune infiltration analysis

Read the gene expression matrix and pass it to xCellAnalysis, which estimates the relative abundance of each cell type in every sample:

exprSet = read.csv("GSE57065_ExprMatrix.csv", header = TRUE, row.names = 1)
# The data are from a microarray, so the array calibration is selected with rnaseq = FALSE
xcell_scores <- xCellAnalysis(exprSet, rnaseq = FALSE)
write.csv(xcell_scores, "xCell_score.csv")  # write the scores to a table file

Processing the xCell results

Next, the cell type scores are split by sample group. The file group.txt holds the group label of each sample, and the samples fall into two groups, "septic_shock" and "healthy":

group = read.delim("group.txt", header = TRUE, stringsAsFactors = FALSE, sep = "\t")
score = read.csv("xCell_score.csv", header = TRUE, row.names = 1, stringsAsFactors = FALSE)
# Keep the first 64 rows (the cell types) and split the columns (samples) into
# the "septic_shock" (disease) group and the "healthy" (control) group.
# This assumes group.txt lists the samples in the same order as the score columns.
case_data = score[1:64, group$group == "septic_shock"]
control_data = score[1:64, group$group == "healthy"]
all(row.names(case_data) == row.names(control_data))
normal = ncol(control_data)
tumor = ncol(case_data)
normal_data = as.data.frame(t(control_data))
tumor_data = as.data.frame(t(case_data))
# Stack the two groups: control rows first, then case rows
rt_total = rbind(normal_data, tumor_data)
cell_group_file = read.delim("cell_type.txt", header = TRUE, stringsAsFactors = FALSE)
cell_group = unique(cell_group_file$Subgroup)

Downstream analysis and violin plots

With the scores extracted and grouped, the downstream analysis runs per cell subgroup: for every subgroup, test each of its cell types for a difference between the two groups, keep the cell types with a significant difference, draw them as violin plots, and save the result table. The variable names normal/tumor are kept from the original script; here they stand for the healthy and septic shock groups.

library(vioplot)
for (subgroup in cell_group) {
  s = which(cell_group_file$Subgroup == subgroup)  # indices of the cell types in this subgroup
  cells = cell_group_file$Cell.types[s]            # cell types that belong to this subgroup
  rt = rt_total[, cells, drop = FALSE]             # their scores
  cell = c()
  p.value = c()
  for (i in 1:ncol(rt)) {
    normalData = rt[1:normal, i]
    tumorData = rt[(normal + 1):(normal + tumor), i]
    wilcoxtest = wilcox.test(normalData, tumorData, exact = FALSE)
    # store the p-value in p.value and the cell type name in cell
    p = round(wilcoxtest$p.value, 3)
    p.value = c(p.value, p)
    cell = c(cell, colnames(rt)[i])
  }
  sig = data.frame(cell, p.value)
  sig = sig[sig$p.value < 0.05, ]  # cell types with p < 0.05
  s = which(colnames(rt) %in% as.character(sig$cell))
  rt = rt[, s, drop = FALSE]       # drop = FALSE keeps a data frame when one column is left
  if (ncol(rt) == 0) next          # nothing significant in this subgroup, nothing to plot

  pdf(paste(subgroup, "Xcell_score.pdf", sep = "_"), height = 8, width = 15)  # output file name
  par(las = 1, mar = c(10, 6, 3, 3))
  x = c(1:ncol(rt))
  y = c(1:ncol(rt))
  # empty scatter plot that the violins are added to
  plot(x, y, xlim = c(0, (ncol(rt) - 1) * 3 + 2), ylim = c(min(rt), max(rt) + 0.02),
       main = "GSE57065", xlab = "", ylab = "Xcell score",
       pch = 21, col = "white", xaxt = "n", cex.lab = 1.3, cex.main = 1.5)
  for (i in 1:ncol(rt)) {
    normalData = rt[1:normal, i]
    tumorData = rt[(normal + 1):(normal + tumor), i]
    wilcoxtest = wilcox.test(normalData, tumorData, exact = FALSE)
    p = round(wilcoxtest$p.value, 3)
    vioplot(normalData, at = 3 * (i - 1), lty = 1, add = TRUE, col = '#42B540FF')     # healthy group
    vioplot(tumorData, at = 3 * (i - 1) + 1, lty = 1, add = TRUE, col = '#925E9FFF')  # septic shock group
    mx = max(c(normalData, tumorData))
    lines(c(x = 3 * (i - 1) + 0.2, x = 3 * (i - 1) + 0.8), c(mx, mx))
    text(x = 3 * (i - 1) + 1, y = mx + 0.02,
         labels = ifelse(p < 0.001, paste0("p<0.001"), paste0("p=", p)), cex = 0.8)
  }
  text(seq(1, ((ncol(rt) - 1) * 3 + 1), 3), -0.01, xpd = NA, labels = colnames(rt),
       cex = 1, srt = 45, pos = 2, font = 2)
  dev.off()
  write.csv(sig, paste(subgroup, "_sig_cell.csv", sep = ""))  # save the result table
}

That completes the workflow from xCell scoring to visualization; each cell subgroup produces one PDF of violin plots and one table of significant cell types.

Published 2023-05-18 12:06
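The per-cell-type comparison in the loop above can also be written as a small helper. The sketch below is only an illustration on made-up data: the function name `compare_cell_types` and the toy matrix are assumptions, and the Benjamini-Hochberg adjustment is an addition that the original post does not use.

```r
# Hypothetical helper: one Wilcoxon rank-sum test per cell type, then BH adjustment.
# scores:  numeric matrix, cell types in rows, samples in columns
# is_case: logical vector with one entry per sample (column)
compare_cell_types <- function(scores, is_case) {
  stopifnot(ncol(scores) == length(is_case))
  p <- apply(scores, 1, function(v) {
    wilcox.test(v[is_case], v[!is_case], exact = FALSE)$p.value
  })
  data.frame(cell = rownames(scores),
             p.value = p,
             p.adjusted = p.adjust(p, method = "BH"),
             row.names = NULL)
}

# Toy data (made up): 3 cell types, 10 controls followed by 10 cases
set.seed(1)
toy <- rbind(A = c(rnorm(10, 0.1, 0.02), rnorm(10, 0.5, 0.02)),
             B = rnorm(20, 0.3, 0.05),
             C = c(rnorm(10, 0.4, 0.02), rnorm(10, 0.1, 0.02)))
is_case <- rep(c(FALSE, TRUE), each = 10)
res <- compare_cell_types(toy, is_case)
print(res)
```

Adjusting for multiple testing matters here because dozens of cell types are tested at once; whether to filter on the raw or the adjusted p-value is a design choice for the analyst.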

GitHub - dviraran/xCell: Cell types enrichment analysis


Immune infiltration: a brief guide to using xCell - Wang Jin's personal website (王进的个人网站)


Posted on 2021-10-05

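xCell is a recently published method based on ssGSEA that estimates abundance scores for 64 cell types, including adaptive and innate immune cells, hematopoietic progenitors, epithelial cells, and extracellular matrix cells.

xCell is built on ssGSEA (single-sample GSEA). As the name suggests, ssGSEA is a special form of GSEA, proposed for the case where ordinary GSEA cannot be run because there is only a single sample. The principle is similar to GSEA; the difference is that GSEA needs a prepared expression profile file (gct) and computes the rank value of each gene from that file. References: https://shengxin.ren/article/403 and https://support.bioconductor.org/p/98463/

Finding the right site for xCell matters; I started out in the wrong place.

https://github.com/dviraran/xCell Start with the README; happily, it was exactly what I needed.

Installation used to fail often, and a number of supporting packages had to be installed first.

install.packages('Rcpp')  # install supporting packages

devtools::install_github('dviraran/xCell')

More errors can still appear during installation.

The moment it finally installed was a happy one.

Usage

Step 1: compute the xCell scores

library(xCell)

exprMatrix = read.table(file = '/Users/chenyuqiao/Desktop/TCGA-LUAD.htseq_counts.tsv',header=TRUE,row.names=1, as.is=TRUE)

xCellAnalysis(exprMatrix)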

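The row names of this file are versioned Ensembl gene IDs, while xCell matches genes by symbol, so the full script converts the IDs before running the analysis:

library(xCell)
library(org.Hs.eg.db)

# Note: for RNA-seq, xCell expects values normalized to gene length (e.g. TPM or FPKM);
# the count file is used here only because the original post uses it.
exprMatrix = read.table(file = '/Users/chenyuqiao/Desktop/TCGA-LUAD.htseq_counts.tsv', header = TRUE, row.names = 1, as.is = TRUE)
### exprMatrix <- exprMatrix[1:10, 1:10]   # optional: small subset for testing

# Strip the version suffix from the Ensembl IDs (text after the first ".")
Ensemble_ID <- rownames(exprMatrix)
ID <- strsplit(Ensemble_ID, "[.]")
str(ID)
IDlast <- sapply(ID, "[", 1)
# Stored as a column named ensembl_id so it can be merged below; it is not used as
# row names because stripped IDs are not guaranteed to be unique.
exprMatrix$ensembl_id <- IDlast
save(exprMatrix, file = 'TCGA.Rdata')
load(file = 'TCGA.Rdata')

# Build an Ensembl ID to gene symbol table from org.Hs.eg.db
ls("package:org.Hs.eg.db")
g2s = toTable(org.Hs.egSYMBOL); head(g2s)
g2e = toTable(org.Hs.egENSEMBL); head(g2e)
tmp = merge(g2e, g2s, by = 'gene_id')
head(tmp)

exprMatrix[1:4, 1:4]
exprMatrix <- merge(tmp, exprMatrix, by = 'ensembl_id')
exprMatrix[1:4, 1:4]
exprMatrix <- exprMatrix[, -c(1, 2)]                         # drop ensembl_id and gene_id; symbol is now column 1
exprMatrix = exprMatrix[!duplicated(exprMatrix$symbol), ]    # keep the first row of each symbol
row.names(exprMatrix) <- exprMatrix[, 1]
exprMatrix <- exprMatrix[, -1]
exprMatrix[1:4, 1:4]

result <- xCellAnalysis(exprMatrix)   # the whole analysis is a single call
save(result, file = 'Xcell_result.Rdata')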
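The ID conversion above can be sketched without any annotation package to show what each step does. Everything in this example is made up for illustration: the IDs, the mapping table `id_map`, and the helper names. Unlike the script above, which keeps the first row of each duplicated symbol, this sketch keeps the row with the highest mean expression, which is one common alternative.

```r
# Toy expression matrix with versioned, made-up Ensembl-style IDs
expr <- matrix(c(5, 0, 3,
                 8, 2, 1,
                 1, 1, 1,
                 4, 6, 2),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("ENSG0001.3", "ENSG0002.1", "ENSG0003.7", "ENSG0004.2"),
                               c("S1", "S2", "S3")))

# Hypothetical ID-to-symbol table; two IDs map to the same symbol, one ID is unmapped
id_map <- data.frame(ensembl_id = c("ENSG0001", "ENSG0002", "ENSG0003"),
                     symbol = c("GENE1", "GENE2", "GENE2"),
                     stringsAsFactors = FALSE)

# Remove everything from the first "." onwards
strip_version <- function(ids) sub("\\..*$", "", ids)

to_symbols <- function(expr, id_map) {
  ids <- strip_version(rownames(expr))
  symbol <- id_map$symbol[match(ids, id_map$ensembl_id)]
  keep <- !is.na(symbol)                      # drop rows without a symbol
  expr <- expr[keep, , drop = FALSE]
  symbol <- symbol[keep]
  ord <- order(rowMeans(expr), decreasing = TRUE)
  expr <- expr[ord, , drop = FALSE]
  symbol <- symbol[ord]
  first <- !duplicated(symbol)                # highest-mean row per symbol
  out <- expr[first, , drop = FALSE]
  rownames(out) <- symbol[first]
  out
}

sym <- to_symbols(expr, id_map)
print(sym)
```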

Step 2: batch survival analysis

load(file = 'Xcell_result.Rdata')
result <- as.data.frame(result)

library(dplyr)
library(survival)
library(survminer)
library(ggplot2)

TCGA.LUAD.GDC_phenotype <- read.delim("TCGA-LUAD.GDC_phenotype.tsv")
#colnames(TCGA.LUAD.GDC_phenotype)
#head(TCGA.LUAD.GDC_phenotype)

LUAD_Pheno <- select(TCGA.LUAD.GDC_phenotype, "submitter_id.samples", "vital_status.diagnoses", "days_to_death.diagnoses", "days_to_last_follow_up.diagnoses", "pathologic_N", "pathologic_M", "days_to_new_tumor_event_after_initial_treatment")
LUAD_Pheno <- LUAD_Pheno[grep('01A', LUAD_Pheno$submitter_id.samples), ]  # keep only 01A samples (tumor)
LUAD_Pheno[is.na(LUAD_Pheno)] <- 0

# Event indicator: 0 when there is neither a new tumor event nor a death, otherwise 1
LUAD_Pheno$PFS_status <- ifelse((LUAD_Pheno$days_to_new_tumor_event_after_initial_treatment == 0 & LUAD_Pheno$days_to_death.diagnoses == 0), 0, 1)
LUAD_Pheno$OS <- ifelse(LUAD_Pheno$days_to_last_follow_up.diagnoses > LUAD_Pheno$days_to_death.diagnoses, LUAD_Pheno$days_to_last_follow_up.diagnoses, LUAD_Pheno$days_to_death.diagnoses)
LUAD_Pheno$PFS <- ifelse(LUAD_Pheno$days_to_new_tumor_event_after_initial_treatment == 0, LUAD_Pheno$OS, LUAD_Pheno$days_to_new_tumor_event_after_initial_treatment)
LUAD_Pheno$OS_status <- as.factor(LUAD_Pheno$vital_status.diagnoses)

# read.table replaced "-" with "." in the sample names of the score matrix, so do the
# same here. gsub replaces every hyphen (sub would replace only the first one), and
# the conversion is done once, outside the loop.
LUAD_Pheno$submitter_id.samples <- gsub('-', '.', as.character(LUAD_Pheno$submitter_id.samples))

# Survival curves: one plot per xCell cell type
cell_types <- row.names(result)

for (x in cell_types) {
  # scores of one cell type across all samples
  score_data <- data.frame(submitter_id.samples = colnames(result),
                           Expressionvalue = as.numeric(unlist(result[x, ])))
  finaldata <- inner_join(LUAD_Pheno, score_data, by = "submitter_id.samples")
  finaldata$group <- ifelse(finaldata$Expressionvalue > median(finaldata$Expressionvalue), 'high', 'low')
  # plain column names in the formula (no finaldata$), which ggsurvplot handles reliably
  fit <- survfit(Surv(PFS, PFS_status) ~ group, data = finaldata)
  pp <- ggsurvplot(fit, data = finaldata, conf.int = FALSE, pval = TRUE,
                   xlim = c(0, 2000),        # narrower X axis; does not affect the survival estimates
                   xlab = "Time in days",    # X axis label
                   break.time.by = 200)      # X axis breaks every 200 days
  ggsave(filename = paste("plot_", x, ".pdf", sep = ""), plot = pp$plot)
  print(x)
}
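The grouping step in the loop above, a median split followed by a comparison of the two survival curves, can be sketched on its own. The helper name `logrank_by_median` and the simulated data are assumptions for illustration; the log-rank test comes from `survdiff` in the survival package, which is the test behind the p-value that `ggsurvplot` prints by default.

```r
library(survival)

# Split samples at the median score and compare the two groups with a log-rank test.
logrank_by_median <- function(time, status, score) {
  group <- ifelse(score > median(score), "high", "low")
  fit <- survdiff(Surv(time, status) ~ group)
  p <- 1 - pchisq(fit$chisq, df = length(fit$n) - 1)
  list(group = group, p.value = p)
}

# Toy data (made up): a higher score goes with a shorter time to event
set.seed(42)
score <- runif(40)
time <- round(rexp(40, rate = 0.001 + 0.004 * score)) + 1
status <- rbinom(40, 1, 0.8)   # 1 = event, 0 = censored
res <- logrank_by_median(time, status, score)
table(res$group)
res$p.value
```

A median split is simple but discards information; treating the score as a continuous covariate in a Cox model is a common alternative when the sample size allows it.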

Reference: Xcell in practice - Jianshu (jianshu.com)


xCell: digitally portraying the tissue cellular heterogeneity landscape - PubMed


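Eighteen Skills of Single-Cell Analysis 11: xCell - Tencent Cloud Developer Community

Author: 生信技能树jimmy. Published 2021-05-18 12:35:09 in the column 单细胞天地.

Single-cell sequencing technology changes quickly, and new analysis tools keep appearing. Every tool has strengths and weaknesses. In a field with no authoritative tool or pipeline, knowing several analysis methods and tools often brings unexpected insights when exploring data.

Earlier posts in this series

Single-cell basics (8 lectures) and advanced analysis (8 lectures)

Eighteen Skills of Single-Cell Analysis 1: harmony

Eighteen Skills of Single-Cell Analysis 2: LIGER

Eighteen Skills of Single-Cell Analysis 3: fastMNN

Eighteen Skills of Single-Cell Analysis 4: velocyto

Eighteen Skills of Single-Cell Analysis 5: monocle3

Eighteen Skills of Single-Cell Analysis 6: NicheNet

Eighteen Skills of Single-Cell Analysis 7: CellChat

Eighteen Skills of Single-Cell Analysis 8: Garnett

Eighteen Skills of Single-Cell Analysis 9: DoubletFinder

Eighteen Skills of Single-Cell Analysis 10: NMF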

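Introduction to xCell

xCell is an R package released in 2017 by the team that developed SingleR. It infers the cell type composition of bulk RNA samples, and Google Scholar listed 598 citations for it at the time of writing. xCell works as follows. Machine learning was used to extract signatures for 64 immune and stromal cell types. For an input bulk RNA dataset, ssGSEA first computes the enrichment score of every sample for each cell type signature. A purpose-built procedure then converts the enrichment scores into cell type scores. Finally, a compensation step corrects the scores of closely related cell types.

The 64 cell types supported by xCell

(Figure: the 64 cell types supported by xCell.)

Notes on using xCell

The bulk RNA input can be RNA-seq data or expression microarray data, but the two must not be mixed in one analysis. Microarray data need no processing; sequencing data must be converted to TPM, FPKM, or RPKM values.

If the cell composition of the input samples does not vary enough, xCell cannot detect any signal. The input data must therefore be heterogeneous. Do not split the samples across several xCell runs, because the outputs of different runs are not comparable.

The linear transformation in xCell uses calibration parameters to make the scores resemble percentages, but it is sometimes inaccurate. If a high score is assigned to a clearly wrong cell type, the calibration parameters can be adjusted manually.

xCell scores are enrichment scores based on signatures. They correlate linearly with the true cell proportions, but they must not be read as cell proportions.

In downstream analysis, the score of one cell type can be compared across samples, but scores of different cell types must not be compared within one sample.

Do not use xCell to identify cell types in single-cell data.

xCell usage

Example data

https://raw.githubusercontent.com/dviraran/xCell/master/vignettes/sdy420.rds

Installing xCell

devtools::install_github('dviraran/xCell')

Testing xCell

library(xCell)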

## Load the test data
sdy <- readRDS("sdy420.rds")
# sdy is the downloaded example data: expr holds bulk RNA expression microarray
# profiles of 104 samples, and fcs holds cell percentages from flow cytometry counts
summary(sdy)
#      Length Class      Mode
# expr 104    data.frame list
# fcs  104    data.frame list

## Optional: restrict the analysis to the cell types expected in the samples,
## which helps accuracy
cell.types.use = intersect(colnames(xCell.data$spill$K), rownames(sdy$fcs))

## xCell scoring; mind the rnaseq argument: FALSE for microarray data, TRUE for sequencing data
scores = xCellAnalysis(sdy$expr, rnaseq = FALSE, cell.types.use = cell.types.use)

## Accuracy assessment
library(psych)
library(ggplot2)
fcs = sdy$fcs[rownames(scores), colnames(scores)]
res = corr.test(t(scores), t(fcs), adjust = 'none')
qplot(x = rownames(res$r), y = diag(res$r),
      fill = diag(res$p) < 0.05, geom = 'col',
      main = 'SDY420 association with immunoprofiling',
      ylab = 'Pearson R', xlab = '') + labs(fill = "p-value < 0.05") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The test dataset contains 24 cell types; because of the limits of the signatures built into xCell, only 18 of them can be predicted. Of these, 13 predictions correlate significantly with the true cell proportions (p < 0.05), and 7 correlate strongly (p < 0.05 and R > 0.5).

A note on why xCell is covered here

xCell is not a tool for analysing single-cell data. It is included in this series because it is closely tied to single-cell analysis. The single-cell boom has lasted for several years, but the cost is still hard to bear, so a growing number of studies draw their conclusions from scRNA-seq of a few samples and then validate them on a large number of bulk RNA samples. Linking single-cell data to bulk RNA data usually requires deconvolution of the bulk data. Many good deconvolution tools have been developed in recent years, and several commonly used ones will be introduced in this series.

Originally published on 2021-04-30 by the WeChat public account 单细胞天地.
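The accuracy assessment above boils down to one Pearson correlation per cell type between the predicted scores and the measured proportions. A minimal base-R sketch of that idea follows; the helper name `correlate_scores` and the simulated matrices are assumptions, and `cor.test` stands in for `psych::corr.test` used in the post.

```r
# Per-cell-type Pearson correlation between predicted scores and measured fractions.
# Both matrices: cell types in rows, samples in columns, with row and column names.
correlate_scores <- function(scores, truth) {
  cells <- intersect(rownames(scores), rownames(truth))
  samples <- intersect(colnames(scores), colnames(truth))
  out <- lapply(cells, function(ct) {
    test <- cor.test(scores[ct, samples], truth[ct, samples], method = "pearson")
    data.frame(cell = ct, r = unname(test$estimate), p.value = test$p.value)
  })
  do.call(rbind, out)
}

# Toy data (made up): "B-cells" tracks the truth closely, "NK cells" is pure noise
set.seed(7)
samples <- paste0("S", 1:30)
truth <- rbind("B-cells" = runif(30), "NK cells" = runif(30))
colnames(truth) <- samples
scores <- rbind("B-cells" = truth["B-cells", ] * 0.5 + rnorm(30, sd = 0.02),
                "NK cells" = runif(30))
colnames(scores) <- samples

res <- correlate_scores(scores, truth)
res$significant <- res$p.value < 0.05
print(res)
```

Note that a score that is half the true fraction still correlates almost perfectly, which illustrates why a high correlation supports comparing one cell type across samples but not reading the score as a proportion.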

Sharing an R package: xCell for immune infiltration analysis


xCell was developed by the dviraran team in 2017. The original paper is titled "xCell: digitally portraying the tissue cellular heterogeneity landscape" and has more than 200 citations. The method combines the strengths of gene set enrichment analysis with deconvolution and scores 64 cell types, covering adaptive and innate immune cells, hematopoietic progenitors, epithelial cells, and extracellular matrix cells, including 48 cell types related to the tumor microenvironment. It applies to expression array profiles and conventional RNA-seq data, but not to single-cell data (for which SingleR, developed by the same team, is recommended).

The R code is hosted on GitHub at https://github.com/dviraran/xCell. Setup and installation:

rm(list = ls())
options(repos = c(CRAN = "https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
options(BioC_mirror = "https://mirrors.ustc.edu.cn/bioc/")
options(stringsAsFactors = FALSE)
Sys.setlocale("LC_ALL", "English")
devtools::install_github('dviraran/xCell')

We try the package on GSE76275. The input must have gene names as rows and samples as columns. The code is short:

library(xCell)
load("finalSet.Rdata")
# array data: select the array calibration with rnaseq = FALSE
xCell <- xCellAnalysis(exprSet, rnaseq = FALSE)
## for RNA-seq data (rnaseq = TRUE is the default):
## xCell_RNAseq <- xCellAnalysis(exprSet, rnaseq = TRUE)

The package vignette also provides reference code that scores immune infiltration in two datasets, sdy311 and sdy420:

### Data gathering
sdy311 <- readRDS("xCell-master/vignettes/sdy311.rds")
sdy311_expr <- sdy311$expr
sdy311_fcs <- sdy311$fcs
dim(sdy311_expr)
dim(sdy311_fcs)
sdy420 <- readRDS("xCell-master/vignettes/sdy420.rds")
sdy420_expr <- sdy420$expr
sdy420_fcs <- sdy420$fcs

### Generating xCell scores
library(xCell)
get.xCell.scores = function(sdy) {
  raw.scores = rawEnrichmentAnalysis(as.matrix(sdy$expr), xCell.data$signatures, xCell.data$genes)
  colnames(raw.scores) = gsub("\\.1", "", colnames(raw.scores))
  raw.scores = aggregate(t(raw.scores) ~ colnames(raw.scores), FUN = mean)
  rownames(raw.scores) = raw.scores[, 1]
  raw.scores = raw.scores[, -1]
  raw.scores = t(raw.scores)
  cell.types.use = intersect(rownames(raw.scores), rownames(sdy$fcs))
  transformed.scores = transformScores(raw.scores[cell.types.use, ], xCell.data$spill.array$fv)
  scores = spillOver(transformed.scores, xCell.data$spill.array$K)
  A = intersect(colnames(sdy$fcs), colnames(scores))
  scores = scores[, A]
  scores
}
sdy311$fcs = sdy311$fcs[, -which(colnames(sdy311$fcs) %in% c("SUB134240", "SUB134283"))]
scores311 = get.xCell.scores(sdy311)
scores420 = get.xCell.scores(sdy420)

The correlation between the xCell scores and the immunoprofiling data is then computed:

### Correlating xCell scores and CyTOF immunoprofilings
library(psych)
library(ggplot2)
correlateScoresFCS = function(scores, fcs, tit) {
  fcs = fcs[rownames(scores), colnames(scores)]
  res = corr.test(t(scores), t(fcs), adjust = 'none')
  df = data.frame(R = diag(res$r), p.value = diag(res$p), Cell.Types = rownames(res$r))
  ggplot(df) + geom_col(aes(y = R, x = Cell.Types, fill = p.value < 0.05)) + theme_classic() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    ylab('Pearson R') + ggtitle(tit)
}
correlateScoresFCS(scores311, sdy311$fcs, 'SDY311')
correlateScoresFCS(scores420, sdy420$fcs, 'SDY420')

The authors also provide a web version of xCell at http://xcell.ucsf.edu/. After the expression data are uploaded, the results are sent by email. A web tool for visualizing the results is listed at http://comphealth.ucsf.edu/xCellView/, although it did not appear to be working.
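The vignette code above ends with a spillover step that corrects scores of closely related cell types using a dependency matrix K. The sketch below illustrates the general idea only and is not the xCell implementation: it assumes the observed scores are a linear mix of the true scores, scales the leakage terms by a strength parameter `alpha`, inverts the mix with `solve`, and clips negative values. The function name `compensate`, the matrix values, and the cell type labels are all made up.

```r
# Simplified spillover compensation (an illustration, not the xCell implementation).
# observed: cell types x samples matrix of transformed scores
# K: square dependency matrix; K[i, j] = how much cell type j leaks into score i
compensate <- function(observed, K, alpha = 0.5) {
  K_alpha <- K * alpha            # scale the leakage terms
  diag(K_alpha) <- 1              # each cell type contributes fully to itself
  corrected <- solve(K_alpha, observed)
  corrected[corrected < 0] <- 0   # scores cannot be negative
  dimnames(corrected) <- dimnames(observed)
  corrected
}

K <- matrix(c(1.0, 0.4,
              0.2, 1.0), nrow = 2, byrow = TRUE,
            dimnames = list(c("CD4", "CD8"), c("CD4", "CD8")))
true_scores <- matrix(c(0.30, 0.00,
                        0.10, 0.20), nrow = 2, byrow = TRUE,
                      dimnames = list(c("CD4", "CD8"), c("S1", "S2")))
observed <- K %*% true_scores          # toy truth with full leakage applied

none <- compensate(observed, K, alpha = 0)    # no compensation
full <- compensate(observed, K, alpha = 1)    # full compensation
half <- compensate(observed, K, alpha = 0.5)  # partial compensation
print(half)
```

With `alpha = 0` the scores come back unchanged, and with `alpha = 1` the toy truth is recovered exactly; intermediate values trade off removing leakage against removing real signal.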


xCell: Detailed Usage and Result Analysis - 简书

Author: mope

Background theory:

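Tissues are complex environments made up of cell types of different lineages and subtypes, each with its own distinct transcriptome. A bulk transcriptome profile is therefore the sum of the cell-type-specific expression profiles, weighted by the proportion of each cell type in the sample. Deconvolving the gene expression profile makes it possible to reconstruct the cellular composition of the tissue. xCell is a powerful computational method that converts gene expression profiles into enrichment scores for 64 immune and stromal cell types across samples.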
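The weighted-sum view of a bulk profile can be illustrated with a small toy calculation in base R. All gene names, cell types, and numbers below are invented for illustration. The last step recovers the proportions by solving a linear system, which only works in this idealized noise-free case and is not what xCell does internally (xCell produces enrichment scores).

```r
# Toy cell-type-specific expression profiles (genes x cell types); values are made up
signature <- matrix(
  c(10, 0, 2,
     1, 8, 2,
     0, 1, 9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("Tcell", "Bcell", "Fibroblast"))
)

# Hypothetical cell-type proportions in one sample
proportions <- c(Tcell = 0.5, Bcell = 0.3, Fibroblast = 0.2)

# Bulk expression = proportion-weighted sum of the cell-type profiles
bulk <- drop(signature %*% proportions)

# Idealized deconvolution: solve signature %*% x = bulk for x
recovered <- solve(signature, bulk)
print(bulk)
print(recovered)
```

Real data add noise, unknown cell types, and correlated signatures, which is why practical methods rely on regression or enrichment scoring rather than an exact solve.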

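Differences in cell type composition between subjects can identify cellular targets of disease and suggest new therapeutic strategies. In addition, adjusting for this variation helps detect true differences in gene expression and improves the interpretation of downstream analyses.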

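1. Installing xCell

xCell is developed in R, and the R package for running xCell is available as open source code in a GitHub repository (https://github.com/dviraran/xCell).

Installing R is a prerequisite (https://www.r-project.org/), and RStudio is the recommended environment for running R scripts (https://www.rstudio.com/).

To install the xCell R package, enter all of the commands below in an R session:

If the devtools package has not been installed before, install it first: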

install.packages('devtools')

Install the current xCell version from GitHub:

devtools::install_github('dviraran/xCell')

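The xCell package depends on the following packages. Bioconductor packages: GSVA, GSEABase.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# The original instructions pinned version = "3.8"; omitting the pin installs
# from the Bioconductor release that matches the running R version.
BiocManager::install(c("GSVA", "GSEABase"))

CRAN packages: pracma, utils, stats, MASS, digest, curl, quadprog.

# install.packages() takes the package names as a single character vector;
# utils and stats ship with R and do not need to be installed.
install.packages(c('pracma', 'MASS', 'digest', 'curl', 'quadprog'))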

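The steps above are the installation instructions from the xCell documentation. The commands below are a method found online, and they are sufficient on their own.

install.packages('Rcpp')  # install required packages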

devtools::install_github('dviraran/xCell')

library(xCell)

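2. Input file

File format: the input to xCell is a gene expression matrix from human mixed samples. It must be read into R before the xCell functions are run. The matrix should have genes as row names and samples as columns. If the gene expression data are stored in a tab-delimited file, they can be read with the following call: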

expr = read.table(file.name, header=TRUE, row.names=1, as.is=TRUE, sep='\t')

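If the gene expression data come from microarrays, no normalization is needed. If they come from a sequencing platform, the values must be normalized to gene length (for example RPKM, TPM, or FPKM). xCell uses the ranking of expression levels rather than the actual values, so further normalization has no effect.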
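The reasoning behind this can be checked with a small base R example on invented numbers: within a sample, any order-preserving transformation (log, library-size scaling) leaves the gene ranking untouched, whereas dividing by gene length can reorder genes, which is why length normalization is still required for sequencing data. The gene names, counts, and lengths below are hypothetical.

```r
# Hypothetical read counts and gene lengths (in kb) for one sample
counts    <- c(geneA = 500, geneB = 300, geneC = 1200, geneD = 50)
length_kb <- c(geneA = 5,   geneB = 1,   geneC = 20,   geneD = 0.4)

# Order-preserving transformations do not change the within-sample ranking
rank_raw    <- rank(counts)
rank_log    <- rank(log2(counts + 1))
rank_scaled <- rank(counts / sum(counts) * 1e6)  # counts per million

# Gene-length normalization (TPM-style) does change the ranking
rate     <- counts / length_kb
tpm      <- rate / sum(rate) * 1e6
rank_tpm <- rank(tpm)

print(rbind(rank_raw, rank_log, rank_scaled, rank_tpm))
```

In this toy sample the longest gene (geneC) has the most reads but the lowest length-normalized expression, so a rank-based method would see it very differently before and after length normalization.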

3. xCell Pipeline

The xCell pipeline consists of three steps, each of which is also available as a function in the R package:

1. rawEnrichmentAnalysis

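xCell works with a fixed list of genes, which can be retrieved with:

xCell.data$genes

As a minimum requirement, the input expression matrix must share at least 5,000 genes with this list. A low number of shared genes may still reduce the accuracy of the results.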
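Before running the pipeline it can be useful to count the shared genes yourself. The helper below is a hypothetical convenience function, not part of the xCell package; it only assumes that the expression matrix has gene symbols as row names. The toy matrix and reference vector are invented so the example runs without xCell installed.

```r
# Hypothetical helper: count genes shared between an expression matrix and a reference list
count_shared_genes <- function(expr, ref_genes, min_shared = 5000) {
  shared <- intersect(rownames(expr), ref_genes)
  if (length(shared) < min_shared) {
    warning(sprintf("Only %d genes shared with the reference list (minimum %d).",
                    length(shared), min_shared))
  }
  length(shared)
}

# Toy data; with xCell loaded the call would be count_shared_genes(expr, xCell.data$genes)
toy_expr <- matrix(1:6, nrow = 3,
                   dimnames = list(c("CD3E", "CD19", "ACTB"), c("s1", "s2")))
toy_ref  <- c("CD3E", "CD19", "PTPRC")

n_shared <- count_shared_genes(toy_expr, toy_ref, min_shared = 2)
print(n_shared)
```

A low count often points to row names that are not gene symbols (for example probe IDs or Ensembl IDs) rather than to genuinely missing genes.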

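xCell uses multiple signatures for each cell type. In total there are 489 signatures covering 64 cell types. The full list of signatures is available in:

xCell.data$signatures

Scores are computed with single-sample gene set enrichment analysis (ssGSEA). For each cell type, the scores from its multiple signatures are averaged. Finally, the averaged scores are shifted so that the minimum score of each cell type across samples is zero.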
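The averaging and shifting described above can be sketched in base R on a made-up score matrix. The signature names, the mapping from signatures to cell types, and all values are assumptions for illustration; the package performs these steps internally in its own way.

```r
# Made-up per-signature enrichment scores (signatures x samples)
raw <- matrix(
  c(0.30, 0.50, 0.40,
    0.20, 0.60, 0.40,
    0.10, 0.10, 0.30,
    0.30, 0.10, 0.50),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Tcell_sig1", "Tcell_sig2", "Bcell_sig1", "Bcell_sig2"),
                  c("s1", "s2", "s3"))
)

# Hypothetical mapping from each signature (row) to its cell type
cell_type <- c("Tcell", "Tcell", "Bcell", "Bcell")

# Average the signatures belonging to the same cell type (cell types x samples)
avg <- t(sapply(split(seq_len(nrow(raw)), cell_type),
                function(i) colMeans(raw[i, , drop = FALSE])))

# Shift each cell type so that its minimum across samples is zero
shifted <- avg - apply(avg, 1, min)
print(shifted)
```

Because the shift depends on the minimum across the samples in the run, a cell type that is equally abundant in every sample ends up with scores near zero, and scores from separate runs are not directly comparable.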

scores = rawEnrichmentAnalysis(expr, signatures, genes, file.name, parallel.sz, parallel.type = "SOCK")

2. transformScores

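This function transforms the raw enrichment scores to a linear scale that resembles percentages, using pre-calculated calibration parameters. xCell uses different parameter sets for sequencing-based and microarray-based expression values (see Note 3 for information on adjusting the calibration parameters). The parameters for sequencing-based values are in:

xCell.data$spill$fv

For microarray-based data:

xCell.data$spill.array$fv

The function is called as follows:

tscores = transformScores(scores, fit.vals, scale, fn)

'scores' is the output of rawEnrichmentAnalysis; 'fit.vals' is the set of calibration parameters described above; 'scale' is a logical value indicating whether to scale the transformed scores (the default is TRUE, which is recommended); 'fn' can be used to name a tab-delimited text file in which the transformed scores are saved.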

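3. spillOver

This function performs spillover compensation to reduce dependencies between closely related cell types, using a pre-calculated dependency matrix K (xCell.data$spill$K for sequencing-based input, xCell.data$spill.array$K for microarray-based input). The alpha parameter calibrates how strongly the spillover is removed: 0 means no compensation, 1 means full compensation, and the default is 0.5.

spillOver(transformedScores, K, alpha = 0.5, file.name = NULL)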
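To get a feel for what the alpha parameter does, here is a deliberately simplified sketch of spillover compensation in base R. It is an assumption-laden illustration, not the package's implementation: the function name, the toy matrix K, and its orientation are invented, and a plain linear solve followed by clipping stands in for whatever solver the package actually uses.

```r
# Simplified sketch (hypothetical): scale the off-diagonal dependencies by alpha,
# keep the diagonal at 1, then solve K_alpha %*% x = scores for each sample.
compensate_sketch <- function(scores, K, alpha = 0.5) {
  K_alpha <- K * alpha
  diag(K_alpha) <- 1
  out <- solve(K_alpha, scores)
  out[out < 0] <- 0              # scores are not allowed to become negative
  dimnames(out) <- dimnames(scores)
  out
}

# Toy dependency matrix and transformed scores (cell types x samples); values are made up
types <- c("CD4 T", "CD8 T")
K_toy <- matrix(c(1, 0.4, 0.2, 1), nrow = 2, dimnames = list(types, types))
scores_toy <- matrix(c(0.5, 0.3, 0.1, 0.6), nrow = 2,
                     dimnames = list(types, c("s1", "s2")))

no_comp   <- compensate_sketch(scores_toy, K_toy, alpha = 0)    # identical to the input
half_comp <- compensate_sketch(scores_toy, K_toy, alpha = 0.5)  # partially compensated
print(half_comp)
```

With alpha = 0 the scaled matrix is the identity and nothing changes; larger values subtract more of the signal attributed to related cell types, which is why an overly aggressive setting can remove real signal.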

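Usage of the wrapper function xCellAnalysis, which runs the three steps in a single call:

xCellAnalysis(expr, signatures = NULL, genes = NULL, spill = NULL, rnaseq = TRUE, file.name = NULL, scale = TRUE, alpha = 0.5, save.raw = FALSE, parallel.sz = 4, parallel.type = "SOCK", cell.types.use = NULL)

With the default arguments this simplifies to:

exprMatrix = read.table("expr", header=TRUE, row.names=1, as.is=TRUE)

xCellAnalysis(exprMatrix)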

4. Example

1) Input sample

The example data are available at https://github.com/dviraran/xCell/tree/master/vignettes

sdy = readRDS('sdy420.rds')

sdy contains two kinds of data: the expression profile (sdy$expr) and the cell fractions (sdy$fcs).

[Screenshots: structure of the sdy object; preview of sdy$expr; preview of sdy$fcs]

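library(xCell)

# first generate the raw scores
raw.scores = rawEnrichmentAnalysis(as.matrix(sdy$expr), xCell.data$signatures, xCell.data$genes)

The next step is to transform the raw scores and apply spillover compensation. For best results, spillover compensation should be applied only to the relevant cell types (for example, if we know there are no macrophages in the mixture, it is better to remove them from the analysis). We therefore subset the score matrix to the cell types that were also measured in the CyTOF dataset:

cell.types.use = intersect(colnames(xCell.data$spill$K), rownames(sdy$fcs))

# transform the raw scores of the selected cell types (microarray calibration parameters)
transformed.scores = transformScores(raw.scores[cell.types.use,], xCell.data$spill.array$fv)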

The final step is to compensate the scores for spillover effects:

scores = spillOver(transformed.scores,xCell.data$spill.array$K)

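Note that we use the xCell.data$spill.array data here because the expression data were generated with microarrays. The pipeline detailed above can equally be run with the xCellAnalysis wrapper function:

scores = xCellAnalysis(sdy$expr, rnaseq = FALSE, cell.types.use = cell.types.use)

Using these scores, we can now compute the correlation between the measured cell fractions and the cell type scores: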

library(psych)

library(ggplot2)

# align the CyTOF fractions with the score matrix (same cell types and samples)
fcs = sdy$fcs[rownames(scores), colnames(scores)]

# Pearson correlation per cell type, without multiple-testing adjustment
res = corr.test(t(scores), t(fcs), adjust = 'none')

qplot(x = rownames(res$r), y = diag(res$r), fill = diag(res$p) < 0.05, geom = 'col',
      main = 'SDY420 association with immunoprofiling', ylab = 'Pearson R') +
  labs(fill = "p-value<0.05") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

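This code produces a bar plot (Figure 1a) showing how the xCell scores correlate with the expected values from immunoprofiling. We find a significant correlation in 13 of the 18 cell types (p-value < 0.05) and a high correlation in 7 cell types (R > 0.5). Note that xCell produces enrichment scores, not cell type proportions, so the scores are not expected to resemble the CyTOF proportions; only a linear correlation between the two measurements is expected.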
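The per-cell-type correlation can also be computed with base R alone, which makes it clear what the diagonal of the correlation matrix represents. The sketch below uses invented scores and fractions; the function name row_correlations is hypothetical and is not part of xCell or psych.

```r
# Hypothetical helper: Pearson correlation between matching rows of two matrices
row_correlations <- function(scores, reference) {
  stopifnot(identical(dimnames(scores), dimnames(reference)))
  res <- lapply(rownames(scores), function(ct) {
    test <- cor.test(scores[ct, ], reference[ct, ], method = "pearson")
    data.frame(cell_type = ct, R = unname(test$estimate), p_value = test$p.value)
  })
  do.call(rbind, res)
}

# Made-up enrichment scores and measured fractions (cell types x samples)
samples <- paste0("s", 1:5)
toy_scores <- rbind("B-cells"   = c(0.10, 0.25, 0.28, 0.42, 0.50),
                    "Monocytes" = c(0.50, 0.10, 0.40, 0.20, 0.30))
toy_fcs    <- rbind("B-cells"   = c(1, 2, 3, 4, 5),
                    "Monocytes" = c(2, 5, 1, 4, 3))
colnames(toy_scores) <- samples
colnames(toy_fcs)    <- samples

cor_df <- row_correlations(toy_scores, toy_fcs)
print(cor_df)
```

The scores and fractions are on different scales in this toy data, which mirrors the point above: what is evaluated is the linear association across samples, not agreement of the absolute values.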

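Figure 1a

In the analysis above we ran xCell on a subset of cell types rather than on all 64. In some cases this can improve accuracy, because the spillover compensation procedure may overcompensate. The same analysis can also be run on all cell types and correlated with the immunoprofiling.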

Last edited: 2021.12.02 14:52:29
