This protocol aims at guiding reasearcher how to employ deconvolution of methylomes obtained from complex tissues. It will start with data retrieval from a public resource, but is equally applicable to in-house generated data. We will furthermore focus on the Illumina BeadChip series as a data source, although the protocol is also compatible with bisulfite sequencing, or any technology yielding single base pair resolution. Deconvolution here refers to creating two matrices (proportion matrix A and methylation pattern matrix T) from a single matrix of input DNA methylation data (dimension CpGs x samples). Non-negative matrix factorization is employed for this task, and we will discuss some of the advantages and caveats of the methods.
2a. Linux To execute the protocol described here, several R-packages (RnBeads, DecompPipeline, MeDeCom, and FactorViz) are required. First, you should have a recent R version installed and all your packages updated. The installation of the software is appliable for most Linux distributions directly through R, while for MacOS the binary release of MeDeCom https://github.com/lutsik/MeDeCom/releases/download/v0.3.0/MeDeCom_0.3.0.tgz should be used. The protocol is also available as a Docker container https://hub.docker.com/r/mscherer/medecom, which is currently the only option for Windows operating systems. For the remainder of this protocol, we assume a common Linux distribution as the operating system.
install.packages(c("devtools","BiocManager"))
BiocManager::install(c("RnBeads","RnBeads.hg19","RnBeads.mm10","RnBeads.hg38"),dependencies=TRUE)
devtools::install_github(c("lutsik/MeDeCom","CompEpigen/DecompPipeline","CompEpigen/FactorViz"))
library(DecompPipeline)
2b. macOS In a first step, you will have to retrieve the binary version of MeDeCom from GitHub. Afterwards, you can install the remaining packages similar to the Linux case.
install.packages(c("devtools","BiocManager"))
BiocManager::install(c("RnBeads","RnBeads.hg19","RnBeads.mm10","RnBeads.hg38"),dependencies=TRUE)
install.packages("https://github.com/lutsik/MeDeCom/releases/download/v1.0.0/MeDeCom_1.0.0.tgz",repos=NULL,type="binary")
devtools::install_github(c("CompEpigen/DecompPipeline","CompEpigen/FactorViz"))
library(DecompPipeline)
2c. Windows A dedicated branch in the GitHub repository handling installation on Windows machines is available and can be installed similar to the Linux case.
BiocManager::install(c("RnBeads",
"RnBeads.hg19",
"RnBeads.mm10",
"RnBeads.hg38"),
dependencies=TRUE)
library(devtools)
devtools::install_github("lutsik/MeDeCom",ref="windows")
devtools::install_github(
c("CompEpigen/DecompPipeline","CompEpigen/FactorViz")
)
library(DecompPipeline)
2 d. Cross-platform using Docker If not yet available, install Docker on your machine. The Docker website provides detailed installation instructions for all major operating systems and computational environments, including Linux, Windows and MacOS. Specifically, on Windows follow these instructions. After successful installation, start Docker desktop, be sure that a X Server is running, open a new PowerShell window and type:
# Windows
docker run -e DISPLAY=<YOUR_IP>:0.0 -it mscherer/medecom
# Linux
xhost +"local:docker@"
docker run --env DISPLAY=$DISPLAY --privileged --volume $XAUTH:/root/.Xauthority --network=host --volume /tmp/.X11-unix:/tmp/.X11-unix --rm --rm -it mscherer/medecom
An interactive R-session will start which can be used to run this protocol.
gdc-client download –m gdc_manifest.2019-01-23.txt
Store the resulting files in a new directory idat.
library(tidyverse)
clinical.data <- read.table(
"annotation/clinical.tsv",
sep="\t", header = TRUE)
idat.files <- list.files("idat",full.names = TRUE)
meta.files <- list.files(idat.files[1],full.names = TRUE)
untar(meta.files[grepl(".tar.gz",meta.files)],exdir = idat.files[1])
meta.files <- untar(meta.files[grepl(".tar.gz",meta.files)], list = TRUE)
meta.info <- read.table(
file.path(idat.files[1],meta.files[grepl(".sdrf.txt",meta.files)]),
sep="\t",header=TRUE)
meta.info <- meta.info[
match(unique(meta.info$Comment..TCGA.Barcode.),
meta.info$Comment..TCGA.Barcode.
),
]
anno.frame <- meta.info %>%
mutate("submitter_id"=substr(Comment..TCGA.Barcode., 1, 12)) %>%
inner_join(x=clinical.data, by=c("submitter_id"="submitter_id")) %>%
distinct(submitter_id, submitter_id.4, .keep_all = TRUE) %>%
mutate(barcode=gsub("_Grn.idat|_Red.idat", "", Array.Data.File),
Sentrix_ID=sub("_.*$","", barcode),
Sentrix_Position=sub("^[^_]+_","",barcode),
healthy_cancer=ifelse(grepl("11A",Comment..TCGA.Barcode.),
"healthy","cancer"))
write.table(anno.frame,
"annotation/sample_annotation.tsv",
quote = FALSE, row.names = FALSE, sep= "\t")
#' write idat files to parent directory
lapply(idat.files,function(x){
is.idat <- list.files(x,pattern = ".idat", full.names = TRUE)
file.copy(is.idat,"idat/")
unlink(x,recursive = TRUE)
})
suppressPackageStartupMessages(library(RnBeads))
rnb.options(
assembly = "hg19",
identifiers.column = "submitter_id",
import = TRUE,
import.default.data.type = "idat.dir",
import.table.separator = "\t",
import.sex.prediction = TRUE,
qc = TRUE,
preprocessing = FALSE,
exploratory = FALSE,
inference = FALSE,
differential = FALSE,
export.to.bed = FALSE,
export.to.trackhub = NULL,
export.to.csv = FALSE
)
sample.anno <- "annotation/sample_annotation.tsv"
idat.folder <- "idat/"
dir.report <- paste0("report",Sys.Date(),"/")
temp.dir <- "/tmp"
options(fftempdir=temp.dir)
rnb.set <- rnb.run.analysis(dir.reports = dir.report,
sample.sheet = sample.anno,
data.dir = idat.folder)
PAUSE POINT We provide an example report containing the processed RnBSet object for further analysis. It can be obtained from http://epigenomics.dkfz.de/downloads/DecompProtocol/RnBeads_Report_TCGA_LUAD/index.html or directly loaded via:
download.file("http://epigenomics.dkfz.de/downloads/DecompProtocol/rnbSet_unnormalized.zip", destfile=paste0(tempdir(),"/RnBSet"))
rnb.set <- load.rnb.set(paste0(tempdir(),"/RnBSet"))
PAUSE POINT For reference, we provide a complete RnBeads report on the supplementary website http://epigenomics.dkfz.de/downloads/DecompProtocol/rnbSet_unnormalized.zip.
suppressPackageStartupMessages(library(DecompPipeline))
data.prep <- prepare.data(rnb.set = rnb.set,
analysis.name = "TCGA_LUAD",
normalization = "bmiq",
filter.beads = TRUE,
min.n.beads = 3,
filter.intensity = TRUE,
min.int.quant = 0.001,
max.int.quant = 0.999,
filter.na = TRUE,
filter.context = TRUE,
filter.snp = TRUE,
filter.sex.chromosomes = TRUE,
filter.cross.reactive = TRUE,
execute.lump=TRUE)
PAUSE POINT We provide a list of CpGs that are selected for the downstream analysis at http://epigenomics.dkfz.de/downloads/DecompProtocol/sites_passing_complete_filtering.csv. You can directly select those sites from you RnBSet object.
rem.sites <- rep(TRUE,nrow(meth(rnb.set)))
sel.sites <-
read.csv("http://epigenomics.dkfz.de/downloads/DecompProtocol/sites_passing_co
mplete_filtering.csv")
rem.sites[row.names(annotation(rnb.set))%in%as.character(sel.sites[,1])] <-
FALSE
rnb.set <- remove.sites(rnb.set,rem.sites)
data.prep <- list(rnb.set.filtered=rnb.set)
suppressPackageStartupMessages(library(DecompPipeline))
data.prep <- prepare.data(rnb.set = data.prep$rnb.set.filtered,
normalization = "none",
analysis.name = "TCGA_LUAD",
filter.beads=FALSE,
filter.intensity=FALSE,
remove.ICA=TRUE,
filter.na = FALSE,
filter.context = FALSE,
filter.snp = FALSE,
filter.sex.chromosomes = FALSE,
filter.cross.reactive = FALSE,
conf.fact.ICA=c("age_at_diagnosis","race","gender","ethnicity"),
ica.setting=c("alpha.fact"=1e-5,"save.report"=TRUE,
"ntry"=10,"nmin"=20,"nmax"=50,"ncores"=10))
cg_subset <- prepare.CG.subsets(rnb.set = data.prep$rnb.set.filtered,
marker.selection = "var",
n.markers = 5000)
names(cg_subset)
md.res <- start.medecom.analysis(
rnb.set=data.prep$rnb.set.filtered,
cg.groups = cg_subset,
Ks=2:15,
lambda.grid = c(0,10^-(2:5)),
factorviz.outputs = T,
analysis.name = "TCGA_LUAD",
cores = 15
)
PAUSE POINT The final MeDeCom result is stored in a format that can be directly imported with FactorViz. We provide the final result for exploration at http://epigenomics.dkfz.de/downloads/DecompProtocol/FactorViz_outputs.tar.gz.
download.file("http://epigenomics.dkfz.de/downloads/DecompProtocol/FactorViz_outputs.tar.gz",destfile=paste0(tempdir(),"/FactorViz_outputs.tar.gz"))
untar(paste0(tempdir(),"/FactorViz_outputs.tar.gz"))
suppressPackageStartupMessages(library(FactorViz))
startFactorViz(file.path(getwd(),"TCGA_LUAD","FactorViz_outputs"))
(Optional). Load the MeDeComSet object in an additional R session started in the working directory, where deconvolution has been performed. The object is stored in the FactorViz_outputs directory, and can be used for additional analysis.
Determine the number of LMCs (K) and the regularization parameter (λ) based on the cross-validation error. First, go to panel “K selection” to plot cross-validation error for the range of Ks specified earlier. We recommend selecting K values as a trade-off between too high complexity, i.e. fitting the noise in the data (higher K values) and insufficient degrees of freedom (lower K values). This typically corresponds to the saddle point where cross-validation error starts to level out.
For a fixed K, select a value for the regularization parameter (λ) by proceeding to panel “Lambda selection”. An optimal value for λ will often correspond to a local minimum of the cross-validation error. On the other hand, noticeable changes in other statistics, such as objective value or root mean squared error (RMSE), can point at a different λ value.
In the “Proportions” panel of FactorViz, visualize LMC proportions using heatmaps. Proportion heatmaps can be visually annotated with available qualitative and quantitative traits (see dropdown “Color samples by”). Further visualization options include stacked barplot and lineplots.
In the “Meta-analysis” panel, associate LMC proportions with quantitative and qualitative traits using correlation- and t-tests. Further sample annotations, such as mutational load (https://www.cbioportal.org/study/clinicalData?id=luad_tcga_pan_can_atlas_2018) and tumor purity scores can be obtained from public repositories and related publications (https://static-content.springer.com/esm/art%3A10.1038%2Fncomms3612/MediaObjects/41467_2013_BFncomms3612_MOESM489_ESM.xlsx). In addition, scripts to conduct such analysis are available on the Supplementary Website (http://epigenomics.dkfz.de/DecompProtocol/).
Explore the LMCs in the panel “LMCs”. Several visualization options are available such as Multidimensional Scaling and histograms. Reference profiles can be used for joint visualization in case those are available.
Determine DNA methylation sites that are specifically hypo- and hypermethylated in an LMC by comparing the methylation values in the LMC matrix for each LMC to the median of the remaining LMCs in the “Meta-analysis” panel. LMC-specific sites can be used for either a gene-centric Gene Ontology analysis (select ”Enrichments” and “GO Enrichments” in dropdown ”Analysis” and “Output type”) or for region-based enrichment analysis with the LOLA package (select “LOLA Enrichments” in dropdown ”Output type”). The differential sites can be exported for further analysis, and the resulting plots can be stored by clicking on the “PDF” button.
PAUSE POINT Additional built-in options for visualization of LMCs and proportions are available via the MeDeCom functions plotLMCs
and plotProportions
in R. Finally, the LMC and proportions matrix can be extracted from the MeDeComSet object, and the genomic annotation of CpGs (ann.C
) can be obtained from the FactorViz output for a custom downstream analysis:
LMCs <- getLMCs(medecom.set, K=7, lambda=0.001)
LMC.proportions <- getProportions(medecom.set, K=7, lambda=0.001)
load("FactorViz_outputs/ann_C.RData")
head(ann.C)