Guide for running GO enrichment on Maize genes
This guide is heavily based on the David Winter’s guide available here. Please refer to the documentation for more details. The GO annotations (PANZZER) for all NAMs are available here (files with _GO.out
extension). We use the topGO
package in R to run the enrichment analysis and R
(tidyverse
) to plot the results.
1. Download the GO annotations
First, get the GO annotations you wish to run the enrichment on. For example, if you want to run the enrichment on the B73.v5 genes (eg: Zm00001eb123456
), you can download the GO annotations from MaizeGDB B73_GO.out
. The file looks something like this:
qpid | ontology | goid | desc | ARGOT_score | ARGOT_PPV | ARGOT_rank | goclasscount |
---|---|---|---|---|---|---|---|
Zm00001eb093920_P002 | BP | 10119 | regulation of stomatal movement | 11.68753 | 0.801096 | 1 | 1 |
Zm00001eb093920_P002 | CC | 5634 | nucleus | 0.89513 | 0.442076 | 1 | 1 |
Zm00001eb093920_P001 | CC | 5634 | nucleus | 4.110372 | 0.599079 | 1 | 8 |
Zm00001eb093920_P001 | MF | 976 | transcription cis-regulatory region binding | 1.176456 | 0.462191 | 1 | 1 |
Zm00001eb093920_P001 | BP | 30154 | cell differentiation | 0.9394 | 0.445432 | 1 | 1 |
Zm00001eb033650_P002 | MF | 43733 | DNA-3-methylbase glycosylase activity | 11.67336 | 0.800795 | 1 | 100 |
Zm00001eb033650_P002 | BP | 6284 | base-excision repair | 8.374165 | 0.724882 | 1 | 100 |
Zm00001eb033650_P001 | MF | 43733 | DNA-3-methylbase glycosylase activity | 11.67341 | 0.800796 | 1 | 100 |
Zm00001eb033650_P001 | BP | 6284 | base-excision repair | 8.374205 | 0.724883 | 1 | 100 |
(first 10 lines)
The qpid
column in the GO annotations is the geneid. The goid
column is the GO term id. The ontology
column is the GO term ontology (Biological Processes or BP, Molecular Function or MF, Cellular Component or CC). The ARGOT_score
is the score assigned by PANNZER. The ARGOT_PPV
is the precision score assigned by PANNZER. The ARGOT_rank
is the rank assigned by PANNZER. The goclasscount
is the number of genes in the GO term. The desc
is the description of the GO term.
to download:
wget https://download.maizegdb.org/GeneFunction_and_Expression/Pannzer_GO_Terms/B73_GO.out
2. Get the list of genes you want to run the enrichment on
The list of genes can be anything, for example, the genes that are differentially expressed between two conditions or genes that are in a group of interest. Most common reason you want to run the enrichment is to check if a collection of genes have something in common. The differentially expressed genes between two conditions, are usually obtained from DESeq2
output that is filtered based on log2fc
and p-value
. If you are running this on B73.v5 geneids should look something like this:
Zm00001eb389020_T001
Zm00001eb009030_T001
Zm00001eb013120_T001
Zm00001eb107030_T001
Zm00001eb115040_T001
Zm00001eb241190_T004
Zm00001eb370330_T001
Zm00001eb269130_T001
Zm00001eb295360_T001
The _T001
at the end of the geneid is the transcript id. Sometimes it can also be _P001
3. Format the GO annotations and filter them to run topGO
Required packages:
setwd("~/Desktop/GO_enrichment")
library(tidyverse)
library(scales)
library(topGO)
Read the GO annotations and filter them to only include the genes you want to run the enrichment on. The qpid
(the geneid column) should be similar to the geneids in the list of genes you want to run the enrichment on. Note that your gene list have _T00N
while the table has them _P00N
.
We also want to include only those geneids that have confident ARGOT_score
assigned by PANNZER. And finally, we want it in a format that topGO
can read. We can do this in R
as follows:
fname = 'B73_GO.out'
go <-
read.table(
fname,
quote = '"',
sep = "\t",
header = TRUE,
colClasses = c('goid' = 'character', 'qpid' = 'character')
)
# some filtering
go_filt <- go[(go$ARGOT_PPV > 0.5), ]
go_filt$goid <- paste0('GO:', go_filt$goid)
panzer_to_golist <- function(panzer_df){
go_df <- aggregate( goid ~ qpid, data=panzer_df, FUN=c)
structure(go_df$goid, .Names=go_df$qpid)
}
all_golist <- panzer_to_golist(go_filt)
4. Formatting input genelists
Since our genelist have transcript ids, we need to first convert them to protein ids.
myInput <-
read.table("mygenelist.txt",
quote = "\"",
comment.char = "")
colnames(myInput) <- "genenames"
myInput <- myInput %>%
mutate(genenames = str_replace(genenames, "_T", "_P")) %>%
mutate(myInput = "yes")
head(myInput)
the stdout should show:
genenames myInput
1 Zm00001eb196330_P001 yes
2 Zm00001eb034510_P001 yes
3 Zm00001eb001710_P001 yes
4 Zm00001eb115990_P002 yes
5 Zm00001eb281950_P001 yes
6 Zm00001eb329780_P001 yes
For topGO
we need the entire list of genes with yes
or no
for each gene. The yes
or no
indicates if the gene is in the list of genes you want to run the enrichment on. We can do this in R
as follows:
# create full list of genes
gene_names <- names(all_golist)
myGenes <- as.data.frame(gene_names)
colnames(myGenes) <- "genenames"
myGenes <- left_join(myGenes, myInput)
myGenes$myInput <- myGenes$myInput %>%
replace_na('no')
myGenesF <- factor(as.integer(myGenes$myInput == 'yes'))
names(myGenesF) <- gene_names
The stdout should show:
> head(myGenesF, 3)
Zm00001eb000010_P001 Zm00001eb000020_P001 Zm00001eb000020_P002
0 0 0
Now we are ready to run the enrichment analysis.
5. Run the enrichment analysis
If you have a large number of list to run the enrichment, you can create a function and use that to run the enrichment.
run_topGO <- function(gene_list, ontology, gene2GO_list){
topGO_data <- new("topGOdata", ontology = ontology, allGenes = gene_list,
annot = annFUN.gene2GO, gene2GO = gene2GO_list)
fishers_result <- runTest(topGO_data, algorithm = "elim", statistic = "fisher")
fishers_table <- GenTable(topGO_data, Fishers = fishers_result, useLevels = TRUE)
fishers_table$Ontology <- ontology
fishers_table
}
Now we can run the enrichment analysis on the list of genes we want to run the enrichment on.
myGenes_BP.table <- run_topGO(myGenesF, "BP", all_golist)
myGenes_MF.table <- run_topGO(myGenesF, "MF", all_golist)
myGenes_CC.table <- run_topGO(myGenesF, "CC", all_golist)
The stdout should show something like this:
> head(myGenes_BP.table)
GO.ID Term Level Annotated Significant Expected Fishers Ontology
1 GO:0000706 meiotic DNA double-strand break processi... 11 2 1 0.00 0.0010 BP
2 GO:0017126 nucleologenesis 8 5 1 0.00 0.0026 BP
3 GO:0010555 response to mannitol 6 11 1 0.01 0.0057 BP
4 GO:0016121 carotene catabolic process 9 13 1 0.01 0.0068 BP
5 GO:1905786 positive regulation of anaphase-promotin... 13 14 1 0.01 0.0073 BP
6 GO:0005975 carbohydrate metabolic process 4 2529 5 1.32 0.0086 BP
You can also combine them to a single table using rbind
if you wish:
myGenesGO.table <- (rbind(myGenes_BP.table, myGenes_MF.table, myGenes_CC.table))
myGenesGO.table <- myGenesGO.table[order(myGenesGO.table$Fishers),]
6. Plot the results
For plotting, we will use ggplot2
and tidyverse
packages. If you are running this on multiple lists, you can create a function and use that to plot the results. Two visualizations that are easy to interpret are Bubble plot and Bar plot.
Bubble plot
myBubblePlot <- function(table = myGenes_BP.table, term = "BP") {
ggplot(table,
aes(
x = Term,
y = -log10(Fishers),
size = Significant,
fill = -log10(Fishers)
)) +
expand_limits(y = 1) +
geom_point(shape = 21) +
# scale_size(range = c(2.5, 12.5)) +
scale_fill_continuous(low = 'royalblue', high = 'tomato') +
xlab('') + ylab('-log10 (p value') +
labs(title = paste0('GO Analysis (', term, ")")) +
geom_hline(
yintercept = c(-log10(0.05), -log10(0.01), -log10(0.001)),
linetype = c("dotted", "dotted", "dotted"),
colour = c("black", "black", "black"),
size = c(1, 1, 1)
) +
theme_bw(base_size = 12) +
theme(
legend.position = 'right',
legend.background = element_rect(),
plot.title = element_text(vjust = 1),
plot.subtitle = element_text(vjust = 1),
plot.caption = element_text(vjust = 1),
axis.text.x = element_text(hjust = 1.10),
axis.text.y = element_text(vjust = 0.5),
axis.line = element_line(colour = 'black'),
legend.key = element_blank(),
legend.key.size = unit(1, "cm"),
) +
coord_flip()
}
Bar plot
myBarPlot <- function(table = myGenes_BP.table, term = "BP") {
table <- table %>% mutate(word = case_when(Significant == 1 ~ "hit",
TRUE ~ "hits"))
ggplot(table, aes(reorder(Term, -log10(Fishers)),
-log10(Fishers), fill = Ontology)) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = c(
"BP" = "#9531b8",
"MF" = "#9dd745",
"CC" = "#da4218"
)) +
geom_text(aes(label = paste(Significant, word, "of", Annotated)),
hjust = 2,
color = "white") +
theme_classic(base_size = 14) +
ylab("- log10 p-value") +
xlab("") +
labs(title = paste0('GO Analysis (', term, ")")) +
scale_y_continuous(expand = expansion(mult = c(0, .1)),
breaks = scales::pretty_breaks()) +
theme(legend.position = "none") +
coord_flip()
}
To run them:
myBubblePlot(table = myGenes_BP.table, term = "CC")
myBarPlot(table = myGenes_BP.table, term = "CC")
You can export the plots as pdf or png using ggsave
function.
ggsave("mybarplot.png")
There are various other visualizations as well, but these two are the most common ones. I hope this guide was helpful. If you have any questions, please feel free to contact me.