Introduction to ggplot2

Valerio Licursi

IBPM-CNR

2024-06-26

What is ggplot2?


  • Part of the tidyverse
  • Designed for creating complex plots from data in a data frame
  • Based on the Grammar of Graphics: a coherent system for describing and building graphs

The Grammar of Graphics

  • Visualisation concept created by Leland Wilkinson (1999)
    • to define the basic elements of a statistical graphic
  • Adapted for R by Hadley Wickham (2009)
    • consistent and compact syntax to describe statistical graphics
    • highly modular as it breaks up graphs into semantic components
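As a rough illustration of that modularity, the sketch below writes out the components that ggplot2 normally fills in with defaults. It assumes the built-in mtcars data, and the object name p is only illustrative.

```r
library(ggplot2)

# Each line contributes one semantic component of the graphic.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +  # data and aesthetic mappings
  geom_point(stat = "identity", position = "identity") +      # geometry, statistic, position
  scale_x_continuous() +                                       # scale for the x aesthetic
  scale_y_continuous() +                                       # scale for the y aesthetic
  coord_cartesian() +                                          # coordinate system
  facet_null() +                                               # faceting (a single panel)
  theme_gray()                                                 # non-data appearance

p  # typing the object name draws the plot
```

Because every component after the first line is a default, the much shorter ggplot(mtcars, aes(wt, mpg)) + geom_point() should draw the same plot.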

Why use ggplot2?


  • Highly customizable
  • Elegant and versatile plots
  • Consistent syntax

Basic ggplot2 Syntax

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +
  <GEOM_FUNCTION>()
  • data: The data frame containing the variables
  • aes: Aesthetic mappings, describing how variables are mapped to visual properties
  • geom: Geometric objects, representing what is actually displayed (e.g., points, lines)

Anatomy of a ggplot2 call

ggplot(
  data = [dataframe], 
  mapping = aes(
    x = [var_x], y = [var_y], 
    color = [var_for_color], 
    shape = [var_for_shape],
    ...
  )
) +
  geom_[some_geom](
    mapping = aes(
      color = [var_for_geom_color],
      ...
    )
  ) +
  ... # other geometries
  scale_[some_axis]_[some_scale]() +
  facet_[some_facet]([formula]) +
  ... # other options

Creating a Basic Plot

library(ggplot2)
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()
  • ggplot() initializes the plot
  • aes() maps the variables
  • geom_point() adds a layer of points to the plot

Adding Aesthetics

ggplot(data = mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point()
  • color: Differentiates points by cylinder count
  • Aesthetics can also include size, shape, and more
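A minimal sketch of mapping several aesthetics at once, assuming the same mtcars data. The choice of am for shape and hp for size is only an illustration.

```r
library(ggplot2)

# Colour and shape take discrete variables; size takes a continuous one.
p_aes <- ggplot(
  data = mtcars,
  mapping = aes(
    x = wt, y = mpg,
    color = as.factor(cyl),  # number of cylinders
    shape = as.factor(am),   # transmission: 0 = automatic, 1 = manual
    size = hp                # gross horsepower
  )
) +
  geom_point()

p_aes
```

Each mapped aesthetic gets its own legend, so mapping many variables at once can make a plot harder to read.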

Titles and Labels

ggplot(data = mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  ggtitle("MPG vs Weight of Cars") +
  xlab("Weight (1000 lbs)") +
  ylab("Miles per Gallon") +
  labs(color = "Cylinders")
  • ggtitle(): Adds a plot title
  • xlab() and ylab(): Label the axes
  • labs(): Sets other labels, such as the legend title
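The same labels can also be set in a single labs() call, which accepts title, x, y, and any mapped aesthetic as arguments. A sketch of the equivalent plot (the name p_labs is illustrative):

```r
library(ggplot2)

p_labs <- ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  labs(
    title = "MPG vs Weight of Cars",  # replaces ggtitle()
    x = "Weight (1000 lbs)",          # replaces xlab()
    y = "Miles per Gallon",           # replaces ylab()
    color = "Cylinders"               # legend title for the color aesthetic
  )

p_labs
```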

Boxplots and Histograms

Boxplot Example:

ggplot(data = mtcars, aes(x = as.factor(cyl), y = mpg)) +
  geom_boxplot()

Histogram Example:

ggplot(data = mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black")
  • geom_boxplot(): Creates boxplots
  • geom_histogram(): Creates histograms with specified bin width
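Instead of a bin width, geom_histogram() also accepts a target number of bins through bins. The sketch below, with illustrative object names, builds both variants and uses layer_data() to look at the bins that were actually computed.

```r
library(ggplot2)

h_width <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 2)
h_bins  <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

# layer_data() returns one row per computed bin
nrow(layer_data(h_width))
nrow(layer_data(h_bins))

# Whatever the binning, every car lands in exactly one bin
sum(layer_data(h_width)$count) == nrow(mtcars)
sum(layer_data(h_bins)$count) == nrow(mtcars)
```

The shape of a histogram can change noticeably with the binning, so it is usually worth trying a few values.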

Faceting

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~ cyl)
  • facet_wrap(~ variable): Creates a separate panel for each level of the variable
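Two common variations, sketched under the assumption that am (transmission) is a sensible second faceting variable: facet_wrap() with a custom layout and free axes, and facet_grid() for a rows-by-columns arrangement.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# One column of panels, each with its own y-axis range
p_wrap <- base + facet_wrap(~ cyl, ncol = 1, scales = "free_y")

# Rows by transmission (am), columns by cylinder count (cyl)
p_grid <- base + facet_grid(rows = vars(am), cols = vars(cyl))

p_wrap
p_grid
```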

Customizing Themes

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal()
  • theme_minimal(): Applies a minimal theme
  • Other themes include theme_classic(), theme_dark(), etc.
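Individual elements can be adjusted on top of a complete theme with theme(). A small sketch, where the chosen settings are only examples:

```r
library(ggplot2)

p_theme <- ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
  geom_point() +
  labs(title = "MPG vs Weight of Cars", color = "Cylinders") +
  theme_classic(base_size = 14) +            # complete theme with a larger base font
  theme(
    legend.position = "bottom",              # move the legend below the panel
    plot.title = element_text(face = "bold") # bold plot title
  )

p_theme
```

The theme() call comes after theme_classic() because a complete theme replaces any element settings added before it.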

Saving Plots

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()
ggsave("scatter_plot.png")
  • ggsave(): Saves the most recently displayed plot to a file, unless another plot is passed via the plot argument
  • Can specify file format (PDF, PNG, …), dimensions, and more
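A sketch of a more explicit call, passing the plot object and dimensions directly. The file name, the temporary directory, and the 6 x 4 inch size are assumptions for illustration.

```r
library(ggplot2)

p_save <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# The output format is inferred from the file extension (here PDF)
out_file <- file.path(tempdir(), "scatter_plot.pdf")

ggsave(
  filename = out_file,
  plot = p_save,       # explicit plot instead of relying on the last one displayed
  width = 6,
  height = 4,
  units = "in"
)

file.exists(out_file)
```

Passing plot explicitly avoids saving the wrong figure when several plots have been drawn in the same session.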

Diamonds

library(tidyverse)  # loads dplyr (sample_n), tidyr, tibble, and ggplot2, all used below
set.seed(2024)
diamonds <- sample_n(ggplot2::diamonds, 1000)
diamonds
# A tibble: 1,000 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  1.5  Very Good F     SI2      58.5    55  9236  7.51  7.56  4.41
 2  0.49 Fair      E     VVS2     65.5    58  1705  4.91  4.86  3.2 
 3  0.32 Very Good D     SI1      63      57   526  4.35  4.38  2.75
 4  1.44 Premium   I     VS1      62.6    59  8426  7.08  7.14  4.45
 5  1.02 Very Good G     SI2      62.9    59  4291  6.38  6.4   4.02
 6  0.32 Ideal     E     VVS2     62      55   842  4.38  4.4   2.72
 7  0.27 Ideal     E     VVS2     62.2    55   622  4.12  4.17  2.58
 8  0.91 Premium   E     SI1      62.6    58  4211  6.14  6.17  3.85
 9  1.02 Ideal     F     VS1      61.5    56  7916  6.47  6.5   3.99
10  0.7  Very Good G     VS1      59.2    58  2676  5.8   5.83  3.44
# ℹ 990 more rows
head(diamonds$cut)
[1] Very Good Fair      Very Good Premium   Very Good Ideal    
Levels: Fair < Good < Very Good < Premium < Ideal
head(diamonds$color)
[1] F E D I G E
Levels: D < E < F < G < H < I < J

Example 1

  • Which data are used as an input?
  • Are the variables transformed before plotting?
  • What geometric objects are used to represent the data?
  • What variables are mapped onto which aesthetic attributes?
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point()

Altering aesthetics

  • How did the plot change?
  • Are these changes based on data or are the changes based on stylistic choices for the geometric objects?
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.25, color = "blue")
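The difference between setting an aesthetic (outside aes()) and mapping it (inside aes()) can be checked directly. The sketch below uses the first 500 rows of ggplot2::diamonds as an arbitrary small subset.

```r
library(ggplot2)

d <- head(ggplot2::diamonds, 500)

# Setting: every point is drawn in the literal colour "blue"
p_set <- ggplot(d, aes(x = carat, y = price)) +
  geom_point(color = "blue")

# Mapping: "blue" is treated as a data value with a single level,
# so the colour scale assigns it a default colour and adds a legend
p_map <- ggplot(d, aes(x = carat, y = price)) +
  geom_point(aes(color = "blue"))

unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

Stylistic choices therefore belong outside aes(), while anything that should vary with the data belongs inside it.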

Example 2

  • Which data are used as an input?
  • Are the variables transformed before plotting?
  • What geometric objects are used to represent the data?
  • What variables are mapped onto which aesthetic attributes?
ggplot(data = diamonds, aes(x = carat, y = sqrt(price), color = color)) +
  geom_point()

Example 3

  • Which data are used as an input?
  • Are the variables transformed before plotting?
  • What geometric objects are used to represent the data?
  • What variables are mapped onto which aesthetic attributes?
ggplot(data = diamonds, aes(x = carat, y = sqrt(price), color = table)) +
  geom_point()

Example 4

  • Which data are used as an input?
  • Are the variables transformed before plotting?
  • What geometric objects are used to represent the data?
  • What variables are mapped onto which aesthetic attributes?
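# Option 1: log-transform the y scale; the axis keeps the original price units
ggplot(data = diamonds, aes(x = cut, y = price, fill = color)) +
  geom_boxplot() +
  scale_y_log10()
# Option 2: log-transform the variable; the axis now shows log10(price)
ggplot(data = diamonds, aes(x = cut, y = log10(price), fill = color)) +
  geom_boxplot() +
  scale_y_continuous()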
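Scale transformations are applied before the statistics are computed, so both versions should produce the same boxes and differ only in how the axis is labelled. A sketch that compares the computed medians, simplified to one box per cut and using the full ggplot2::diamonds data:

```r
library(ggplot2)

# Transform the scale
p_scale <- ggplot(ggplot2::diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +
  scale_y_log10()

# Transform the variable
p_var <- ggplot(ggplot2::diamonds, aes(x = cut, y = log10(price))) +
  geom_boxplot()

# "middle" holds the median of each box, on the log10 scale in both cases
all.equal(layer_data(p_scale)$middle, layer_data(p_var)$middle)
```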

Example 5

  • Which data are used as an input?
  • Are the variables transformed before plotting?
  • What geometric objects are used to represent the data?
  • What variables are mapped onto which aesthetic attributes?
  • What type of scales are used to map data to aesthetics?
ggplot(data = diamonds, aes(x = cut, fill=color)) +
  geom_bar(position = "dodge", color = "black") +
  coord_flip() +
  scale_fill_brewer(palette = "Blues")

Example 6

  • Which data are used as an input?
  • Are the variables transformed before plotting?
  • What geometric objects are used to represent the data?
  • What variables are mapped onto which aesthetic attributes?
  • What type of scales are used to map data to aesthetics?
ggplot(data = ggplot2::diamonds, aes(x = price/carat, fill=color)) +
  geom_density(alpha=0.5) +
  facet_grid(rows = vars(color), cols = vars(cut)) +
  scale_x_sqrt() + 
  labs(x = "Price per carat")

ggplot objects

g <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
class(g)
[1] "gg"     "ggplot"
g

g + geom_smooth(se=FALSE)
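Because a plot is an ordinary R object, one base plot can be reused for several variants. In this sketch the subset of diamonds and the names g_base, g_lm and g_log are illustrative.

```r
library(ggplot2)

d <- head(ggplot2::diamonds, 500)
g_base <- ggplot(d, aes(x = carat, y = price)) +
  geom_point()

# Adding to a stored plot returns a new object; g_base itself is not modified
g_lm  <- g_base + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
g_log <- g_base + scale_y_log10()

# The smoother is the second layer of g_lm; layer_data() can extract its fitted line
head(layer_data(g_lm, 2)[, c("x", "y")])
```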

Multi panel Plots

library(patchwork)

p1 <- ggplot(diamonds) + geom_point(aes(x = carat, y = price))

p2 <- ggplot(diamonds) + geom_boxplot(aes(x = cut, y = price))

p3 <- ggplot(diamonds) + geom_boxplot(aes(x = color, y = price))

p4 <- ggplot(diamonds) + geom_boxplot(aes(x = clarity, y = price))
p1 + p2 + p3 + p4
p1 + p2 + p3 + p4 + plot_layout(nrow=1)
p1 / (p2 + p3 + p4)
p1 + {
  p2 + {
    p3 + p4 + plot_layout(ncol = 1)
  }
} + plot_layout(ncol = 1)
p1 + p2 + p3 + p4 + plot_annotation(title = "Diamonds data", tag_levels = c("A","1"))
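patchwork also provides | for side-by-side placement and & for adding an element to every panel. A short sketch with its own small set of plots (the names q1 to q3 and the subset of diamonds are illustrative):

```r
library(ggplot2)
library(patchwork)

d <- head(ggplot2::diamonds, 500)
q1 <- ggplot(d) + geom_point(aes(x = carat, y = price))
q2 <- ggplot(d) + geom_boxplot(aes(x = cut, y = price))
q3 <- ggplot(d) + geom_boxplot(aes(x = color, y = price))

# | places plots side by side, / stacks them,
# and & applies theme_minimal() to every panel of the layout
combined <- (q1 | q2) / q3 & theme_minimal()

combined
```

The parentheses around q1 | q2 matter: in R, | has lower precedence than / and &, so the grouping has to be written out.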

Why do we visualize?

Anscombe’s Quartet

datasets::anscombe %>% as_tibble()
# A tibble: 11 × 8
      x1    x2    x3    x4    y1    y2    y3    y4
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1    10    10    10     8  8.04  9.14  7.46  6.58
 2     8     8     8     8  6.95  8.14  6.77  5.76
 3    13    13    13     8  7.58  8.74 12.7   7.71
 4     9     9     9     8  8.81  8.77  7.11  8.84
 5    11    11    11     8  8.33  9.26  7.81  8.47
 6    14    14    14     8  9.96  8.1   8.84  7.04
 7     6     6     6     8  7.24  6.13  6.08  5.25
 8     4     4     4    19  4.26  3.1   5.39 12.5 
 9    12    12    12     8 10.8   9.13  8.15  5.56
10     7     7     7     8  4.82  7.26  6.42  7.91
11     5     5     5     8  5.68  4.74  5.73  6.89

Anscombe’s Quartet

Anscombe’s quartet comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties.

from: https://en.wikipedia.org/wiki/Anscombe%27s_quartet

Tidy Anscombe

tidy_anscombe <- datasets::anscombe %>%
  pivot_longer(everything(), names_sep = 1, names_to = c("var", "group")) %>%
  pivot_wider(id_cols = group, names_from = var, 
              values_from = value, values_fn = list(value = list)) %>% 
  unnest(cols = c(x,y))
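A possibly simpler route to the same long format is the special ".value" name in pivot_longer(), which sends the first character of each column name (x or y) to an output column and the second to group. This is an alternative sketch, not the approach used in the slides, and the rows come out in a different order.

```r
library(tidyr)

tidy_alt <- pivot_longer(
  datasets::anscombe,
  cols = everything(),
  names_to = c(".value", "group"),  # ".value" keeps x and y as separate columns
  names_pattern = "(.)(.)"          # first character: variable, second: group
)

tidy_alt
```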

Tidy Anscombe

tidy_anscombe %>%
  group_by(group) %>%
  summarize(mean_x = mean(x), mean_y = mean(y), sd_x = sd(x), sd_y = sd(y), cor = cor(x,y))
# A tibble: 4 × 6
  group mean_x mean_y  sd_x  sd_y   cor
  <chr>  <dbl>  <dbl> <dbl> <dbl> <dbl>
1 1          9   7.50  3.32  2.03 0.816
2 2          9   7.50  3.32  2.03 0.816
3 3          9   7.5   3.32  2.03 0.816
4 4          9   7.50  3.32  2.03 0.817
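The similarity extends to the fitted regression lines, which are roughly y = 3 + 0.5x in all four groups. A base R sketch that works on the original wide anscombe data (the name fits is illustrative):

```r
# Fit y_i ~ x_i separately for each of the four datasets
fits <- sapply(1:4, function(i) {
  x <- datasets::anscombe[[paste0("x", i)]]
  y <- datasets::anscombe[[paste0("y", i)]]
  unname(coef(lm(y ~ x)))
})
rownames(fits) <- c("intercept", "slope")
colnames(fits) <- paste0("group_", 1:4)

round(fits, 2)
```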

Tidy Anscombe

ggplot(tidy_anscombe, aes(x = x, y = y, color = as.factor(group))) +
  geom_point(size=2) +
  facet_wrap(vars(group)) +
  geom_smooth(method="lm", se=FALSE, fullrange=TRUE) +
  guides(color = "none")

DatasauRus

datasauRus::datasaurus_dozen
# A tibble: 1,846 × 3
   dataset     x     y
   <chr>   <dbl> <dbl>
 1 dino     55.4  97.2
 2 dino     51.5  96.0
 3 dino     46.2  94.5
 4 dino     42.8  91.4
 5 dino     40.8  88.3
 6 dino     38.7  84.9
 7 dino     35.6  79.9
 8 dino     33.1  77.6
 9 dino     29.0  74.5
10 dino     26.2  71.4
# ℹ 1,836 more rows
datasauRus::datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(mean_x = mean(x), mean_y = mean(y), 
            sd_x = sd(x), sd_y = sd(y), 
            cor = cor(x,y))
# A tibble: 13 × 6
   dataset    mean_x mean_y  sd_x  sd_y     cor
   <chr>       <dbl>  <dbl> <dbl> <dbl>   <dbl>
 1 away         54.3   47.8  16.8  26.9 -0.0641
 2 bullseye     54.3   47.8  16.8  26.9 -0.0686
 3 circle       54.3   47.8  16.8  26.9 -0.0683
 4 dino         54.3   47.8  16.8  26.9 -0.0645
 5 dots         54.3   47.8  16.8  26.9 -0.0603
 6 h_lines      54.3   47.8  16.8  26.9 -0.0617
 7 high_lines   54.3   47.8  16.8  26.9 -0.0685
 8 slant_down   54.3   47.8  16.8  26.9 -0.0690
 9 slant_up     54.3   47.8  16.8  26.9 -0.0686
10 star         54.3   47.8  16.8  26.9 -0.0630
11 v_lines      54.3   47.8  16.8  26.9 -0.0694
12 wide_lines   54.3   47.8  16.8  26.9 -0.0666
13 x_shape      54.3   47.8  16.8  26.9 -0.0656
ggplot(datasauRus::datasaurus_dozen, aes(x = x, y = y, color = dataset)) +
  geom_point() +
  facet_wrap(vars(dataset)) +
  guides(color = "none")

DatasauRus

The Datasaurus dozen comprises thirteen data sets that have nearly identical simple descriptive statistics to two decimal places, yet have very different distributions and appear very different when graphed. It was inspired by the smaller Anscombe’s quartet that was created in 1973.

from: https://en.wikipedia.org/wiki/Datasaurus_dozen

Resources