This document introduces the concepts of annotation, databank, meta table and platform. It also provides an overview of how annotations are stored in EpiAnnotator.
Enrichment analysis of selected (compared to background) regions is performed with respect to a genomic annotation. In general, an annotation is one of the following:
A non-overlapping set of genomic regions. Examples of such annotations are Ensembl genes and CpG islands. In this case, we can also think of the annotation as a classification of every base pair in the genome into yes (the base lies within a region of the annotation) or no (the base lies outside the annotation’s regions).
A partition of the genome into multiple different states. An example of such an annotation is the set of chromatin states for a particular cell line or cell type, as identified in the ENCODE project. In this case, we can also think of the annotation as a classification of every base pair in the genome into one of the available states.
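The two views of an annotation described above can be illustrated with a minimal Python sketch. It is not EpiAnnotator code: the coordinates, the state labels, the half-open interval convention, and the function names `in_annotation` and `state_at` are all assumptions made for illustration.

```python
from bisect import bisect_right

# Hypothetical region-set annotation on one chromosome: sorted,
# non-overlapping, half-open intervals [start, end). Coordinates are made up.
REGIONS = [(100, 200), (350, 400), (900, 1200)]
REGION_STARTS = [start for start, _ in REGIONS]

# Hypothetical state partition of the same chromosome: each segment runs
# from its start up to the start of the next segment. Labels are made up.
STATE_STARTS = [0, 150, 600]
STATE_LABELS = ["quiescent", "enhancer", "promoter"]


def in_annotation(position):
    """Yes/no view: does the base at `position` lie inside any region?"""
    # Find the last region that starts at or before the position.
    i = bisect_right(REGION_STARTS, position) - 1
    return i >= 0 and position < REGIONS[i][1]


def state_at(position):
    """Partition view: return the state label covering `position`."""
    if position < 0:
        raise ValueError("position must be non-negative")
    return STATE_LABELS[bisect_right(STATE_STARTS, position) - 1]
```

Because the regions do not overlap, a binary search over the region starts is enough to answer either question for a single base.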
In EpiAnnotator, annotations are never strand-specific. Strand information is also ignored in the sets of interest uploaded for running an enrichment analysis.
In EpiAnnotator, a databank contains a group of annotations on the same genome assembly. A databank’s name can contain only Latin letters, digits, and the underscore symbol (_). The suffix of the name indicates which assembly the databank targets. For example, EpiAnnotator provides the following databanks:
EpiAnnotator_hg38
LOLA_Core_hg38
EpiAnnotator_hg19
LOLA_Core_hg19
EpiAnnotator_mm10
EpiAnnotator_mm9
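The naming rule can be checked with a short Python sketch. This is an illustration, not part of EpiAnnotator: the function names are hypothetical, "Latin letters" is assumed to mean ASCII letters, and the assembly is assumed to be the text after the last underscore, which matches the names listed above.

```python
import re

# Assumption: "Latin letters" means ASCII A-Z and a-z.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_databank_name(name):
    """Check that the name uses only letters, digits, and underscores."""
    return NAME_PATTERN.fullmatch(name) is not None


def assembly_of(name):
    """Return the suffix after the last underscore (assumed to be the assembly)."""
    if not is_valid_databank_name(name):
        raise ValueError(f"invalid databank name: {name!r}")
    _, separator, suffix = name.rpartition("_")
    if not separator or not suffix:
        raise ValueError(f"no assembly suffix in: {name!r}")
    return suffix
```

For example, `assembly_of("LOLA_Core_hg19")` yields `"hg19"`, while a name containing a hyphen is rejected.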
Every databank is saved in a dedicated directory and contains the following components:
chromosomes.txt
.meta.csv
and it contains the columns “ID”, “Repository”, “Annotation”, “Class”, “Subclass”, “Tissue”, “Cell line”, “Disease”, “Version”, “Sex”, “Additional”.
The genome-wide methylation array MethylationEPIC by Illumina interrogates over 850,000 CpGs in the human genome. Studies based on this assay typically produce sets of selected and background probes. Examples of selected probes include the hypermethylated probes in a certain disease subtype [1], or the CpGs that change their methylation state with age [2]. The background set consists of those probes that first passed the filtering criteria and then appeared non-significant after being tested for differential methylation. An enrichment analysis with respect to a known annotation, say, gene promoters, would test whether the rate of overlap of gene promoters with CpGs in the selected set is significantly higher or lower than the corresponding rate in the background set.
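The comparison of overlap rates between the selected and background sets can be sketched as a test on a 2x2 table. The text does not state which statistical test EpiAnnotator applies, so the two-sided Fisher's exact test below is an assumption chosen for illustration, and the function names and probe IDs are hypothetical.

```python
from math import comb


def overlap_counts(probes, annotated):
    """Return (overlapping, non-overlapping) probe counts for one set."""
    hits = sum(1 for probe in probes if probe in annotated)
    return hits, len(probes) - hits


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Rows are selected and background; columns are overlap and no overlap.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x overlaps in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical probe IDs; `promoter_probes` stands for the probes that
# overlap the annotation of interest.
selected = ["cg01", "cg02", "cg03", "cg04"]
background = ["cg05", "cg06", "cg07", "cg08"]
promoter_probes = {"cg01", "cg02", "cg03", "cg05"}

a, b = overlap_counts(selected, promoter_probes)
c, d = overlap_counts(background, promoter_probes)
p_value = fisher_exact_two_sided(a, b, c, d)
```

A two-sided test fits the description because the overlap rate in the selected set may be either higher or lower than in the background set.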
The MethylationEPIC assay mentioned in the paragraph above is an example of a platform. Platforms in EpiAnnotator are genome-wide assays that target a limited, predefined set of genomic locations. Other examples of such assays are the Affymetrix gene and exon expression arrays. Every platform defines a universe of possible regions that can appear in the selected and background sets. Compared with working on arbitrary genomic regions, supporting a platform for a given databank provides two advantages: