[Pipeline analysis] NBIS single-cell RNA-seq analysis in practice (Part 3): Multi-sample data integration

Posted on 2021.4.5 10:51:55
Last edited by bioinfomics on 2021.4.5 10:51
This article was first published on the WeChat official account "bioinfomics": https://mp.weixin.qq.com/s/4fcpa_EA4ik7ysPJFYMYFg


Part 3: Multi-sample data integration
In this tutorial, we explore different methods for integrating scRNA-seq data from multiple samples, using two different approaches to correct batch effects across datasets. We also apply a quantitative measure to assess how well the datasets are integrated. Seurat uses the integration method introduced in "Comprehensive Integration of Single-Cell Data", while Scran and Scanpy use the mutual nearest neighbors (MNN) approach. Commonly used methods for multi-sample integration include:
Method      Language    Library          Ref
CCA         R           Seurat           Cell
MNN         R/Python    Scater/Scanpy    Nat. Biotech.
Conos       R           conos            Nat. Methods
Scanorama   Python      scanorama        Nat. Biotech.
Load the required R packages and the post-QC, dimensionality-reduced data
# Load the required R packages
suppressPackageStartupMessages({
    library(Seurat)
    library(cowplot)
    library(ggplot2)
})

# Load the dataset after QC and dimensionality reduction
alldata <- readRDS("data/results/covid_qc_dr.rds")
alldata
#An object of class Seurat 
#18121 features across 5532 samples within 1 assay 
#Active assay: RNA (18121 features, 2000 variable features)
# 6 dimensional reductions calculated: pca, umap, tsne, UMAP10_on_PCA, UMAP_on_ScaleData, UMAP_on_Graph

print(names(alldata@reductions))
## [1] "pca"               "umap"              "tsne"             
## [4] "UMAP10_on_PCA"     "UMAP_on_ScaleData" "UMAP_on_Graph"


Split the data by sample into separate datasets
Next, we split the merged object by sample into separate datasets, then log-normalize each dataset and select highly variable genes for each sample individually.
# Split the Seurat object by sample with SplitObject
alldata.list <- SplitObject(alldata, split.by = "orig.ident")

# Normalize each sample and select highly variable genes per sample
for (i in 1:length(alldata.list)) {
    alldata.list[[i]] <- NormalizeData(alldata.list[[i]], verbose = FALSE)
    alldata.list[[i]] <- FindVariableFeatures(alldata.list[[i]], selection.method = "vst", 
        nfeatures = 2000, verbose = FALSE)
}

# Extract the highly variable genes selected in each dataset
hvgs_per_dataset <- lapply(alldata.list, function(x) {
    x@assays$RNA@var.features
})
head(hvgs_per_dataset)

# Draw a Venn diagram to see how the highly variable genes overlap across samples
venn::venn(hvgs_per_dataset, opacity = 0.4, zcolor = (scales::hue_pal())(3), cexsn = 1, 
    cexil = 1, lwd = 1, col = "white", frame = F, borders = NA)
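If you want a single number rather than a diagram, a quick check is how many highly variable genes are shared by all samples. A minimal sketch using base R on the hvgs_per_dataset list from above:

# Number of HVGs selected in every sample
shared_hvgs <- Reduce(intersect, hvgs_per_dataset)
length(shared_hvgs)

# Number of HVGs selected in at least one sample
length(unique(unlist(hvgs_per_dataset)))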

Identify integration anchors
We use the FindIntegrationAnchors function to identify the anchors used for multi-sample integration.
alldata.anchors <- FindIntegrationAnchors(object.list = alldata.list, dims = 1:30, 
                                                                    reduction = "cca")
# Computing 2000 integration features
# Scaling features for provided objects
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01s
# Finding all pairwise anchors
# Running CCA
# Merging objects
# Finding neighborhoods
# Finding anchors
# Found 2010 anchors
# Filtering anchors
# Retained 1720 anchors
# Extracting within-dataset neighbors
# ... (the same CCA / anchor-finding steps repeat for each of the 15 sample pairs,
#      finding 1823-3067 anchors and retaining 1463-2774 anchors per pair) ...
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=36s

Integrate the datasets
Next, we use the IntegrateData function to integrate the datasets with the anchors identified above.
alldata.int <- IntegrateData(anchorset = alldata.anchors, dims = 1:30, new.assay.name = "CCA")
Merging dataset 1 into 3
Merging dataset 6 into 5
Merging dataset 2 into 3 1
Merging dataset 5 6 into 3 1 2
Merging dataset 4 into 3 1 2 5 6
(each merge extracts anchors for the merged samples, finds integration vectors and weights, then integrates the data)
Warning: Adding a command log without an assay associated with it

We can observe that a new assay slot is now created under the name CCA.
[AppleScript] 纯文本查看 复制代码
names(alldata.int@assays)
## [1] "RNA" "CCA"

# by default, Seurat now sets the integrated assay as the default assay, so any
# operation you now perform will be on the ingegrated data.
[url=mailto:alldata.int@active.assay]alldata.int@active.assay[/url]
## [1] "CCA"

After running IntegrateData, the Seurat object contains a new integrated (or 'batch-corrected') assay. The original, uncorrected data are still stored in the "RNA" assay, and we use the integrated assay for the downstream dimensionality reduction and visualization.
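If you need the uncorrected values again, for example for differential expression or marker visualization, you can switch the default assay back at any time. A short sketch using Seurat's standard DefaultAssay accessor:

# Switch back to the uncorrected data
DefaultAssay(alldata.int) <- "RNA"

# ... and back to the integrated data for embedding and clustering
DefaultAssay(alldata.int) <- "CCA"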

Dimensionality reduction and visualization
Next, we scale the integrated data and run PCA and UMAP/t-SNE on it for visualization.
# Run Dimensionality reduction on integrated space
alldata.int <- ScaleData(alldata.int, verbose = FALSE)
alldata.int <- RunPCA(alldata.int, npcs = 30, verbose = FALSE)

alldata.int <- RunUMAP(alldata.int, dims = 1:30)
alldata.int <- RunTSNE(alldata.int, dims = 1:30)

# Compare the embeddings before and after batch-corrected integration
plot_grid(ncol = 3,
  DimPlot(alldata, reduction = "pca", group.by = "orig.ident")+NoAxes()+ggtitle("PCA raw_data"),
  DimPlot(alldata, reduction = "tsne", group.by = "orig.ident")+NoAxes()+ggtitle("tSNE raw_data"),
  DimPlot(alldata, reduction = "umap", group.by = "orig.ident")+NoAxes()+ggtitle("UMAP raw_data"),

  DimPlot(alldata.int, reduction = "pca", group.by = "orig.ident")+NoAxes()+ggtitle("PCA integrated"),
  DimPlot(alldata.int, reduction = "tsne", group.by = "orig.ident")+NoAxes()+ggtitle("tSNE integrated"),
  DimPlot(alldata.int, reduction = "umap", group.by = "orig.ident")+NoAxes()+ggtitle("UMAP integrated")
)
As the plots show, after batch correction the integrated data mix well by cell type, with no obvious sample-level batch effect remaining.
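To put a number on this impression, a simple quantitative measure is a kNN mixing score: for each cell, the fraction of its k nearest neighbors that come from the same sample. Values near 1 indicate strong batch separation; lower values indicate better mixing. Below is a minimal sketch; knn_mixing is a hypothetical helper written for this tutorial and assumes the FNN package is installed:

# kNN mixing score (hypothetical helper): mean fraction of same-sample neighbors
knn_mixing <- function(obj, reduction = "pca", dims = 1:30, k = 30, batch = "orig.ident") {
    emb <- Seurat::Embeddings(obj, reduction = reduction)[, dims]
    nn <- FNN::get.knn(emb, k = k)$nn.index  # k nearest neighbors of each cell
    batches <- obj@meta.data[[batch]]
    # For each cell, the fraction of its neighbors from the same sample
    same <- rowMeans(matrix(batches[nn] == batches[row(nn)], nrow = nrow(nn)))
    mean(same)
}

knn_mixing(alldata, reduction = "pca")      # uncorrected data
knn_mixing(alldata.int, reduction = "pca")  # CCA-integrated data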

Visualize cell type-specific marker genes
Markers                      Cell Type
CD3E                         T cells
CD3E, CD4                    CD4+ T cells
CD3E, CD8A                   CD8+ T cells
GNLY, NKG7                   NK cells
MS4A1                        B cells
CD14, LYZ, CST3, MS4A7       CD14+ Monocytes
FCGR3A, LYZ, CST3, MS4A7     FCGR3A+ Monocytes
FCER1A, CST3                 DCs
FeaturePlot(alldata.int, reduction = "umap", features = c("CD3E", "CD4", "CD8A", 
    "NKG7", "GNLY", "MS4A1", "CD14", "LYZ", "MS4A7", "FCGR3A", "CST3", "FCER1A"), 
    order = T, slot = "data", combine = T)
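A more compact way to review the same markers across samples is a dot plot. Note that marker expression should be read from the uncorrected "RNA" assay, since the integrated values are only intended for embedding and clustering. A sketch using the same gene list:

myfeatures <- c("CD3E", "CD4", "CD8A", "NKG7", "GNLY", "MS4A1", "CD14", "LYZ",
    "MS4A7", "FCGR3A", "CST3", "FCER1A")
DotPlot(alldata.int, assay = "RNA", features = myfeatures, group.by = "orig.ident") +
    RotatedAxis()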



Batch correction and integration with Harmony
# Load the harmony package
library(harmony)
## Loading required package: Rcpp

# Run batch correction with RunHarmony
alldata.harmony <- RunHarmony(alldata, group.by.vars = "orig.ident", reduction = "pca", 
                                                     dims.use = 1:50, assay.use = "RNA")
Harmony 1/10
Harmony 2/10
Harmony 3/10
Harmony 4/10
Harmony 5/10
Harmony 6/10
Harmony converged after 6 iterations
Warning: Invalid name supplied, making object name syntactically valid. New object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details on syntax validity

# Here we use all components computed by Harmony for the UMAP calculation
alldata.int[["harmony"]] <- alldata.harmony[["harmony"]]
# Run UMAP on the Harmony embedding
alldata.int <- RunUMAP(alldata.int, dims = 1:50, reduction = "harmony", reduction.name = "umap_harmony")
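Harmony returns its corrected embedding as a regular DimReduc object, so it can be inspected and plotted like any other reduction. A quick sanity check:

# One corrected component per input PC (cells x 50)
dim(Embeddings(alldata.int, reduction = "harmony"))

DimPlot(alldata.int, reduction = "umap_harmony", group.by = "orig.ident") +
    NoAxes() + ggtitle("UMAP on Harmony embedding")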


Batch correction and integration with Scanorama
# Load the required R packages
library(reticulate)
scanorama <- import("scanorama")

hvgs <- unique(unlist(hvgs_per_dataset))
head(hvgs)
# [1] "PTGDS"    "IGLC3"    "IGLC2"    "HLA-DQA1" "IGHM"     "CD79A"
assaylist <- list()
genelist <- list()

for (i in 1:length(alldata.list)) {
    assaylist[[i]] <- t(as.matrix(GetAssayData(alldata.list[[i]], "data")[hvgs, ]))
    genelist[[i]] <- hvgs
}

lapply(assaylist, dim)
## [[1]]
## [1]  540 5203
## 
## [[2]]
## [1]  837 5203
## 
## [[3]]
## [1]  976 5203
## 
## [[4]]
## [1] 1022 5203
## 
## [[5]]
## [1] 1129 5203
## 
## [[6]]
## [1] 1028 5203

integrated.data <- scanorama$integrate(datasets_full = assaylist, genes_list = genelist)

intdimred <- do.call(rbind, integrated.data[[1]])
colnames(intdimred) <- paste0("PC_", 1:100)
rownames(intdimred) <- colnames(alldata.int)

# Add standard deviations in order to draw Elbow Plots in Seurat
stdevs <- apply(intdimred, MARGIN = 2, FUN = sd)

alldata.int[["scanorama"]] <- CreateDimReducObject(embeddings = intdimred, stdev = stdevs, 
    key = "PC_", assay = "RNA")
## Warning: Cannot add objects with duplicate keys (offending key: PC_), setting
## key to 'scanorama_'

# Here we use all PCs computed from Scanorama for UMAP calculation
alldata.int <- RunUMAP(alldata.int, dims = 1:100, reduction = "scanorama", reduction.name = "umap_scanorama")
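Since we stored standard deviations with the reduction, we can draw an elbow plot over the Scanorama components, as the comment above suggests. A short sketch:

# Variance explained by the 100 Scanorama components
ElbowPlot(alldata.int, reduction = "scanorama", ndims = 100)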


Compare the results of the different integration methods
p1 <- DimPlot(alldata, reduction = "umap", group.by = "orig.ident") + ggtitle("UMAP raw_data")
p2 <- DimPlot(alldata.int, reduction = "umap", group.by = "orig.ident") + ggtitle("UMAP CCA")
p3 <- DimPlot(alldata.int, reduction = "umap_harmony", group.by = "orig.ident") + 
    ggtitle("UMAP Harmony")
p4 <- DimPlot(alldata.int, reduction = "umap_scanorama", group.by = "orig.ident") + 
    ggtitle("UMAP Scanorama")
leg <- get_legend(p1)

gridExtra::grid.arrange(gridExtra::arrangeGrob(p1 + NoLegend() + NoAxes(), p2 + NoLegend() + 
    NoAxes(), p3 + NoLegend() + NoAxes(), p4 + NoLegend() + NoAxes(), nrow = 2), 
    leg, ncol = 2, widths = c(8, 2))
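The kNN mixing score sketched earlier can also rank the three methods numerically; lower scores mean each cell's neighbors are drawn more evenly from all samples (this assumes the hypothetical knn_mixing helper defined above):

# Compare mixing across the three integrated embeddings (hypothetical helper)
sapply(c("pca", "harmony", "scanorama"), function(r)
    knn_mixing(alldata.int, reduction = r, dims = 1:30))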



Save the integrated results
saveRDS(alldata.int, "data/results/covid_qc_dr_int.rds")


