Download materials from Q-Step ELE page (log-in required)
or from Github
https://github.com/YiLiu6240/exeter-qstep-data-visualisation-workshop
This is the Exeter Q-step workshop guide to data visualisation in R. We will be covering the basics of plotting in base R and using the package "ggplot2"
.
For the purpose of this workshop, we will use the titanic
dataset to demonstrate how data visualisation works in R.
It is recommended to use RStudio for this workshop.
Use the code below to initialise the working environment.
# If you need to install these packages
install.packages(c("tidyverse", "titanic"),
repos = "https://cran.rstudio.com")
library("tidyverse")
library("titanic")
The data set we are going to use is the training split of the titanic data.
# We create a tibble dataframe called `df` from `titanic_train`
df <- titanic_train %>% as_tibble()
df %>% glimpse()
## Observations: 891
## Variables: 12
## $ PassengerId <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Survived <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0,...
## $ Pclass <int> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3,...
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex <chr> "male", "female", "female", "female", "male", "mal...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare <dbl> 7.250, 71.283, 7.925, 53.100, 8.050, 8.458, 51.862...
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", ...
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...
The first 10 rows of the dataset:
df %>% head(10) %>% knitr::kable()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | S | |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | S | |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | S | |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.458 | Q | |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.862 | E46 | S |
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.075 | S | |
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.133 | S | |
10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.071 | C |
Meanings of categorical variables
Survived
: whether the passenger survived; 0: Did not survive, 1: Survivedpclass
: ticket class; 1st, 2nd, 3rdSibSp
: Number of siblings / spouses aboardParch
: Numebr of parents / children aboardEmbarked
: Port of Embarkation; C: Cherbourg, Q: Queenstown, S: SouthhamptonWe would also need to use a factor type for the categorical variables that are not numerical in nature:
df <- titanic_train %>% as_tibble() %>%
mutate_at(vars(PassengerId, Survived, Pclass), as.factor)
df %>% glimpse()
## Observations: 891
## Variables: 12
## $ PassengerId <fctr> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ Survived <fctr> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0...
## $ Pclass <fctr> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3...
## $ Name <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bra...
## $ Sex <chr> "male", "female", "female", "female", "male", "mal...
## $ Age <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, ...
## $ SibSp <int> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4,...
## $ Parch <int> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1,...
## $ Ticket <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "1138...
## $ Fare <dbl> 7.250, 71.283, 7.925, 53.100, 8.050, 8.458, 51.862...
## $ Cabin <chr> "", "C85", "", "C123", "", "", "E46", "", "", "", ...
## $ Embarked <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", ...
Base R plotting is done primarily by the plot
function:
plot(x = df$Age, y = df$Fare,
main = "Your plot title", sub = "Your plot subtitle",
type = "p", col = "#9d0006")
plot(x = df$Age[df$Survived == 1], y = df$Fare[df$Survived == 1],
main = "Scatter plot of 'Age ~ Fare'",
sub = "How can one survive the Titanic accident",
xlab = "Age", ylab = "Fare",
type = "p", col = "#458588")
points(x = df$Age[df$Survived == 0], y = df$Fare[df$Survived == 0],
type = "p", col = "#9d0006")
legend(x = "topright",
legend = c("Survived", "Did not survive"),
lty = 2,
col = c("#458588", "#9d0006"))
hist(df$Age)
density_age <- density(na.omit(df$Age))
plot(density_age)
Alternatively, you can chain the procedure using a %>%
pipe:
df$Age %>% na.omit() %>%
density() %>% plot()
png("base-r-density-plot.png",
width = 7.2, height = 4.8, units = "in", res = 300)
df$Age %>% na.omit() %>%
density() %>% plot()
dev.off()
ggplot(data = df,
mapping = aes(x = Age, y = Fare)) +
geom_point(aes(color = Survived))
Alternatively you can write the above code as
df %>% ggplot(aes(Age, Fare)) +
geom_point(aes(color = Survived))
What if we want a scatter plot of “Sex ~ Fare”?
df %>% ggplot(aes(x = Sex, y = Fare)) +
geom_point(aes(color = Survived))
We need to change the main geom (geometric object) to geom_jitter
when variables on both axes are categorical.
df %>% ggplot(aes(x = Sex, y = Fare)) +
geom_jitter(aes(color = Survived))
Alternatively we can do a boxplot:
df %>% ggplot(aes(x = Sex, y = Fare)) +
geom_boxplot()
df %>% ggplot(aes(x = Pclass)) +
geom_bar(stat = "count")
df %>% ggplot(aes(x = Pclass)) +
geom_bar(aes(fill = Survived),
stat = "count", position = "stack")
When you need to supply your own y
in a barchart:
df %>% group_by(Survived, Pclass) %>%
summarise(AverageFare = mean(Fare, na.rm = TRUE)) %>%
ggplot(aes(x = Pclass, y = AverageFare)) +
# `geom_bar(stat = "identity", ...)` is equivalent to `geom_col(..)`
geom_col(aes(fill = Survived), position = "dodge")
df %>% ggplot(aes(x = Fare)) +
geom_histogram(aes(fill = Survived), position = "stack")
df %>% ggplot(aes(x = Age)) +
geom_density(aes(fill = Survived), alpha = 0.6)
For line charts that represent connections, we ususally need to specify a “group” aesthetics.
df %>%
filter(Embarked != "") %>%
mutate(Embarked = Embarked %>% factor(levels = c("S", "C", "Q"))) %>%
group_by(Embarked, Survived) %>%
summarise(AverageFare = mean(Fare, na.rm = TRUE)) %>%
ggplot(aes(x = Embarked, y = AverageFare,
colour = Survived, group = Survived)) +
geom_point() + geom_line()
Practice with the ggplot2 geoms and aesthetics with the titanic data, using the examples above.
facet_wrap
and facet_grid
df %>% ggplot(aes(x = Age, y = Fare)) +
geom_point(aes(color = Survived)) +
facet_wrap(~ Sex)
facet_wrap
allows for flexible column layout:
df %>% ggplot(aes(x = Age, y = Fare)) +
geom_point(aes(color = Survived)) +
facet_wrap(Pclass ~ Sex, ncol = 2)
facet_grid
is more ideal for facetting with 2 factors:
df %>% ggplot(aes(x = Age, y = Fare)) +
geom_point(aes(color = Survived)) +
facet_grid(Pclass ~ Sex, scales = "free_y")
Configurations regarding the figure as a whole are provided by the theme()
function. Read all the available options here.
Change the position of the figure legend:
df %>% ggplot(aes(x = Age, y = Fare)) +
geom_point(aes(color = Survived)) +
facet_grid(Pclass ~ Sex, scales = "free_y") +
theme(legend.position = "bottom")
Add title and other elements:
df %>% ggplot(aes(x = Age, y = Fare)) +
geom_point(aes(color = Survived)) +
facet_grid(Pclass ~ Sex, scales = "free_y") +
labs(title = "Place your title here",
subtitle = "Place your subtitle here",
x = "Age of passengers",
y = "Trip fare") +
theme_classic()
Saving a ggplot object is done by the ggsave
function.
# Option 1: Assign the ggplot object to a variable
fig <- df %>% ggplot(aes(x = Age, y = Fare)) +
geom_point(aes(color = Survived)) +
facet_grid(Pclass ~ Sex, scales = "free_y") +
labs(title = "Place your title here",
subtitle = "Place your subtitle here",
x = "Age of passengers",
y = "Trip fare") +
theme_classic()
ggsave(filename = "ggplot-figure.png", plot = fig,
width = 7.2, height = 4.8, units = "in", dpi = 300)
# Option 2: Evaluate your plot within `()` then chain it
(
df %>% ggplot(aes(x = Age, y = Fare)) +
geom_point(aes(color = Survived)) +
facet_grid(Pclass ~ Sex, scales = "free_y") +
labs(title = "Place your title here",
subtitle = "Place your subtitle here",
x = "Age of passengers",
y = "Trip fare") +
theme_classic()
) %>%
ggsave(filename = "ggplot-figure.png",
width = 7.2, height = 4.8, units = "in", dpi = 300)
Plotting in ggplot2 is done by:
ggplot(data)
to initialise the plotting processaes(x = ..., y = ..., ...)
functiongeom_
functionsLet us practice what we learn today and see if you could reproduce one of the following figures.
Exploratory data analysis assisted by visualisation is only the first step in your analysis.
Reference manuals and websites:
Extensions to ggplot2
:
Interactive plots:
Other types of plots: