Material created by Peter Mac Data Science.
ifelse()
, c()
, %in%
)mutate()
to add columns to tables.geom_point()
to create a volcano plotggplot_text_repel
() to repel overlapping labelsIn this tutorial, we will learn some R through creating volcano plots from an RNA-seq experiment.
Volcano plots are commonly used to display the results of RNA-seq or other omics experiments. A volcano plot is a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes. In a volcano plot, the most upregulated genes are towards the right, the most downregulated genes are towards the left, and the most statistically significant genes are towards the top.
We will again use the published RNA-seq data from the Nature Cell Biology paper by Fu et al. 2015. This study examined expression in basal and luminal cells from mice at different stages (virgin, pregnant and lactating).
To create the volcano plot we will work with the differential expression results. We will use the results from comparing the luminal cells from the mammary gland of the pregnant versus the luminal cells from the lactating mice. This is a file that contains the log fold changes (logFC) and P values for the genes. If a gene has a positive logFC value, it means it is upregulated in the luminal cells of the pregnant mice compared to the luminal cells from the lactating mice. If a gene has a negative logFC value, it means it is downregulated in the luminal cells of the pregnant mice compared to the luminal cells from the lactating mice.
We have already seen some tidyverse packages: readr for reading in files, ggplot2 for plotting. Today we will use those packages again aswell as another really useful tidyverse package called dplyr that can be used to manipulate tables. We will also use a non-tidyverse R package called ggrepel.
First let’s open a new R script. From the top menu in RStudio: File > New File > R Script
. Let’s save it as volcano_plots.R
.
We will begin by loading in the packages that we need. These are already installed for you but if you need to install them on your own computer you can use install.packages()
.
library(tidyverse)
library(ggrepel)
Next we load in our RNA-seq data. The file we will use is tab-separated, so we will use the read_tsv()
function from the tidyverse readr package to read it in. We will store the contents of our file in an object called de_results
.
de_results <- read_tsv("data/limma-voom_luminalpregnant-luminallactate.tsv.gz")
Parsed with column specification:
cols(
ENTREZID = col_double(),
SYMBOL = col_character(),
GENENAME = col_character(),
logFC = col_double(),
AveExpr = col_double(),
t = col_double(),
P.Value = col_double(),
adj.P.Val = col_double(),
B = col_double()
)
There should be 15,804 rows and 8 columns. We can see that in the Environment tab on the right.
We can type de_results
to have a look at the data.
de_results
We can’t see all the columns here but we can look at all the data with View()
or use the R function colnames()
if we just want to see the column names.
View(de_results)
colnames(de_results)
To make a volcano plot we make a scatterplot using geom_point()
and plot the log fold change (logFC) vs the negative log10 P value. Why do we use the negative log 10 P value and not just the P value? Let’s have a look at what happens if we plot the P value as is versus the logFC. We use the columns called logFC and P.Value.
ggplot(data=de_results, mapping=aes(x=logFC, y=P.Value)) +
geom_point()
Genes with small P values are the most significant and in this plot they are all squashed at the bottom of the plot, not very easy to see. Let’s see what happens when we use log10()
.
ggplot(data=de_results, mapping=aes(x=logFC, y=log10(P.Value))) +
geom_point()
Looks better but this still has the most significant genes at the bottom and the usual way is to plot the most significant at the top so we use the negative log10 (-log10). Let’s try that.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value))) +
geom_point()
We could shange the size of the points if we wanted, by adding size=
. We can add that into the geom_point().
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value))) +
geom_point(size=7)
If we want all the points to be the same size we don’t put size= inside aes()
. However if we wanted to map the size of the points to a column in our dataset then we would put it inside aes()
as shown below.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value))) +
geom_point(mapping=aes(size=logFC))
We could change the shape of the points with shape=
. For a triangle shape we add shape=2
. See here for the codes for the different possible shapes.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value))) +
geom_point(shape=2)
We could change the colour of all the points with colour=
.
ggplot(data=de_results, mapping=aes(logFC, -log10(P.Value))) +
geom_point(colour="red")
We could change the size and the shape and colour.
ggplot(data=de_results, mapping=aes(logFC, -log10(P.Value))) +
geom_point(size=7, shape=2, colour="red")
But maybe that doesn’t look so good.
Let’s colour genes that are significant. We will call genes significantly differentially expressed if they have an adjusted P Value below 0.05. An adjusted P value means the P value has been adjusted for multiple testing, as we are testing many thousands of genes. This is what we filter on in RNA-Seq.
Remember we saw in the previous tutorial that we can use a column to colour features of our data with fill=
or col=
. In this case we are colouring points so we use col=
.
We will add a column that says whether genes are significant or not. To add columns we use dplyr’s mutate()
.
Let’s have a look at mutate
first to see how it works. We give mutate our data (de_results
), then a name for the column we want to create.
Let’s make a column called signif (for significance). Then we put what we want to have in the column in brackets after the column name. For example, if we wanted to label every gene as significant we could write below. We’ll try running it and store it in an object called testing
as we’re just seeing how it works.
testing <- mutate(de_results, signif=("Signif"))
We can use View() to have a look at dataset. There should be a new column called signif at the end with “Signif” in every row. mutate always adds the column to the end of the table.
View(testing)
testing
object that contains the value “Labels” in every row.But obviously not all genes are significant. We want to only call genes significant if they have a P value below 0.05. We can use an R function called ifelse()
to say if our adj.P.Val is below 0.05 add “Signif” to the sig column, otherwise add “Not signif”.
We can take a look at the ifelse help to see how it works. The help tells us the Usage is ifelse(test, yes, no)
. Our “test” is whether our gene has an adj.P.Val < 0.05, if the answer is yes, we’ll add “Signif” into the column, if the answer is no, then we’ll add “Not signif”. We’ll save the output as de_results
(this overwrites the original de_results
object).
de_results <- mutate(de_results, signif=ifelse(adj.P.Val < 0.05, "Signif", "Not signif"))
Let’s take a look at de_results
again now. We should see we have a new column at the end called “signif”.
View(de_results)
In View() we can sort on the signif column to check we see both Signif and Not signif entries.
Now that we have a column that flags whether the genes are significant or not we can use that to colour the significant genes by adding col=signif
to our ggplot.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point()
We might want to change the colours.
There are built-in colour palettes, that can be handy to use, where the sets of colours are predefined. scale_colour_brewer()
is a popular one (there is also scale_fill_brewer()
). You can take a look at the help for scale_colour_brewer()
to see what pallettes are available. There is also an R colours cheatsheet that shows the colours of the palettes here.
There’s one called “Dark2”, let’s have a look at that.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point() +
scale_colour_brewer(palette = "Dark2")
There’s one called “Set1”, let’s have a look at that.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point() +
scale_colour_brewer(palette = "Set1")
Or we could choose to set the colours manually. We could decide to colour the non-significant genes grey and the significant genes red. We are using col=
so to specify our own colours we add + scale_colour_manual(values=))
(as we will see, we can keep adding layers to our plot with +
). If we were using fill=
we would use + scale_fill_manual(values=))
To use two colours we add + scale_colour_manual(values=c("red", "grey"))
. Note that here we see the function c()
for the first time. We use function extremely often in R when we have multiple items that we are combining. Here we have two colours we want to use, so we need to use c()
to combine them to give to values=
. Thus we need to add values=c("red", "grey")
.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point() +
scale_colour_manual(values=c("red", "grey"))
Hmm this is the wrong way around, our significant points are grey. We could change the order of the colours in values=
.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point() +
scale_colour_manual(values=c("grey", "red"))
Or we could specify which value in our signif column we want to map to each colour.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point() +
scale_colour_manual(values=c("Signif"="red", "Not signif"="grey"))
colours()
to see what colours are available. Choose two nice colours or two ugly ones. The colours can be seen in the R colours cheatsheet here.We could colour our significant genes that are downregulated and the genes that are upregulated using separate colours. These are the genes upregulated in the luminal cells from the pregnant mice compared to the luminal cells from the lactating mice. To do this we can change our signif column and have three values instead of two. We could have “Up”“,”Down" and “Not signif”.
Let’s colour significant genes > logFC of 1, red and < logFC -1, blue. To do this we change our signif column by adding another ifelse()
. We are asking: if genes are significantly up, add the value “Up” otherwise if the genes are significantly down, add the value “Down” otherwise add the value “Not signif”.
To ask if are genes have an adj.P.Value < 0.05 and also have a logFC > 1, we use the syntax adj.P.Val < 0.05 & logFC > 1
, note the &
. If they are < 0.05 and have a logFC < -1 we use adj.P.Val < 0.05 & logFC > 1
. We write this criteria as below and save as de_results
again.
de_results <- mutate(de_results, signif=ifelse((adj.P.Val < 0.05 & logFC > 1), "Up", ifelse((adj.P.Val < 0.05 & logFC < -1), "Down", "Not signif")))
Let’s take a look at the output.
View(de_results)
Now that we have our signif column with three values, “Up”, “Down”, “Not signif”, we can colour the volcano plot points, up - red, not signif - grey, and down - blue.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point() +
scale_colour_manual(values=c("Up"="red", "Not signif"="grey", "Down"="blue"))
We can label one or more genes of interest in a volcano plot. This enables us to visualize where these genes are in terms of significance and in comparison to the other genes. In the original paper using this dataset, there is a heatmap of 31 genes in Figure 6b. These genes are a set of 30 cytokines/growth factor identified as differentially expressed, and the authors’ main gene of interest, Mcl1. These genes are provided in the volcano genes file and shown below. We will label these genes in the volcano plot. We’ll read in the genes and store it in an object called goi
.
goi <- read_tsv("data/volcano_genes")
Parsed with column specification:
cols(
GeneID = col_character()
)
Let’s take a look at what’s in goi
goi
goi
contains gene symbols stored in a column. We need to get these symbols out of the column format. We can do that with dplyr’s pull()
. We give pull()
the goi
object and the name of the column we want to pull the values from. We’ll store the symbols in an object called goi_syms
.
goi_syms <- pull(goi, GeneID)
Take a look.
goi_syms
[1] "Mcl1" "Hbegf" "Tgfb2" "Cxcl16" "Csf1" "Pdgfb" "Edn1" "Lif" "Kitl" "Bmp1" "Pdgfa" "Cmtm3"
[13] "Cx3cl1" "Ctgf" "Wnt5a" "Ptn" "Spp1" "Bmp3" "Cmtm8" "Gmfg" "Cxcl2" "Cxcl3" "Il15" "Egf"
[25] "Cmtm7" "Il34" "Pdgfd" "Nov" "Cmtm6" "Ccl28" "Cxcl1"
We’ll add another columns for these labels, let’s call it mygenes. We do this similar to what we did for the top 10 genes.
de_results <- mutate(de_results, mygenes=ifelse(SYMBOL %in% goi_syms, SYMBOL, ""))
Let’s make the volcano plot and label this custom set of genes. We add + geom_text()
and add the labels into that. Note we are using +
again and adding another layer to our plot. This time we are adding a layer of text (labels). We put labels=mygenes
inside aes()
in geom_text()
as we are mapping to a column in our data.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point() +
scale_colour_manual(values=c("Up"="red", "Not signif"="grey", "Down"="blue")) +
geom_text(mapping=aes(label=mygenes))
The legend now has a legend for the labels (the letter “a”) overlapping the original legend but we can remove that by adding show.legend=FALSE
.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point() +
scale_colour_manual(values=c("Up"="red", "Not signif"="grey", "Down"="blue")) +
geom_text(mapping=aes(label=mygenes), show.legend = FALSE)
The labels also don’t look great as the labels are overlapping but we can fix that. We can replace + geom_text()
with + geom_text_repel()
from the package ggrepel.
Note that running geom_text_repel()
on the server can be very slow (take minutes), think it may be the same issue reported in Stack Overflow here, but it should run a lot quicker on your laptop/desktop.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point() +
scale_colour_manual(values=c("Up"="red", "Not signif"="grey", "Down"="blue")) +
geom_text_repel(mapping=aes(label=mygenes))
If we want, we can colour just the points red and blue, and leave the labels black. We do this by adding the col=signif
into the geom_point()
aes()
instead of in the ggplot()
aes()
.
ggplot(data=de_results, mapping=aes(logFC, -log10(P.Value))) +
geom_point(aes(col=signif)) +
scale_colour_manual(values=c("Up"="red", "Not signif"="grey", "Down"="blue")) +
geom_text_repel(mapping=aes(label=mygenes), show.legend = FALSE)
We could also label the top significant genes with the gene names so we can see what they are. We have gene symbols in our de_results
so we could use those.
Let’s label the top 10 most significant genes. To do this we first identify the top 10 genes by P.Value. The file is already sorted by P value and remember head() shows us the top 6 lines in a file. If we look at the help for ?head
we see that there is a default argument n = 6L
, this is why it returns 6 lines. We can change this to 10L to get the top 10 lines. We will store this at an object called top10
.
top10 <- head(de_results, n = 10L)
Take a look.
top10
goi
) e.g. you will need to extract the gene symbols from the top10
info and store them in an object called top10_syms
, then add a column called e.g. top10 to the de_results that says which genes to label.What if we want the categories in the legend in a different order, for example, Up, Not signif, Down. We can addbreaks=
with the order we want into scale_color_manual()
. Note that here we use c()
again to pass multiple values to breaks.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point() +
scale_colour_manual(values=c("Up"="red", "Not signif"="grey", "Down"="blue"), breaks=c("Up", "Down", "Not signif"))
We can change the name of the legend by adding name=
into scale_colour_manual
.
ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value), col=signif)) +
geom_point() +
scale_colour_manual(name="My new name", values=c("Up"="red", "Not signif"="grey", "Down"="blue"))
We can also make the plot look nicer, for example, adjusting the x axis limits and label, adding a title and changing the background.
First let’s store our current plot in an object called p
, to make it easier to see what we’re changing.
p <- ggplot(data=de_results, mapping=aes(x=logFC, y=-log10(P.Value))) +
geom_point(mapping=aes(col=signif)) +
scale_colour_manual(values=c("Up"="red", "Not signif"="grey", "Down"="blue"))
Note that when we store the plot in an object it doesn’t print out the plot. If we want to see the plot we type the name of the object p
.
p
We can adjust the x axis so it’s the same limit on the right and left with scale_x_continuous()
. If we had categories (discrete values) on the x axis we would use scale_x_discrete()
instead. There are also equivalents for the y axis (scale_y_continuous()
and scale_y_discrete()
).
p <- p + scale_x_continuous(limits=c(-10, 10))
p
We can change the axis labels with labs()
. To change the x axis label we use labs(x="New name")
. To change the y axis label we use labs(y="New name")
.
p <- p + labs(x="Log2 fold change")
p
We can add a title with labs()
.
p <- p + labs(title="Luminal pregnant vs lactating")
p
We can remove the grey background and grid lines. To do this we modify the ggplot theme. Themes are the non-data parts of the plot.
There are also a lot of built-in themes. Let’s have a look at a couple of the more widely used themes. We won’t save these (we won’t use p <-
) we’ll just print them to have a look. The default ggplot theme is theme_grey().
p + theme_bw()
p + theme_minimal()
There are many themes available, you can see some in the R graph gallery.
We can also modify parts of the theme individually. We can remove the grey background and grid lines with the code below.
p <- p + theme(panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
p
We can add axis lines.
p <- p + theme(axis.line = element_line(size=0.2, colour = "black"))
p
We can remove the legend completely.
p <- p + theme(legend.position = "none")
p
We can centre the title.
p <- p + theme(plot.title = element_text(hjust = 0.5))
p
geom_point()
by plotting the log fold change (logFC) vs the negative log10 P valuemutate()
ifelse()
to test if values meet specified conditionsc()
to combine multiple values%in%
to check if a value is in a set of valuespull()
to pull a set of values out of a columngeom_text()
to add text to a plot