00. What we’ll be doing:

  1. Review;
  2. Summary Statistics;
  3. ggplot2.

0. Review

  • R is a language RStudio is where we write it.
  • Directories are how your computer organizes files.
  • Change the working directory using setwd()
  • Data is R has a few basic form: 1) Vectors, 2) Matrices, 3) Data Frames, 4) Lists.
  • Data in R has a few basic classes: 1) Numeric; 2) Integer; 3) Character; 4) Logical; 5) Factor.
  • The pipe |> is used between operations to take the output of what comes before and use it as the input of what comes next.
  • Indexing (extracting a specific element of an object) can be done using:
    • [i] with vectors (one dimension);
    • [i,j] with matrices or data frames (two dimensions, row number i and column number j);
    • mydf$aVariableName in data frames, extracts the column of the corresponding name.
  • You can change the structure of you data frame using different functions from dplyr:
    • select(): which columns do you want;
    • rename(): change name of a column;
    • mutate(): add a new variable;
    • filter(): which rows do you want (based on a condition).

0.1 Last week’s exercises

  1. What is the fertility rate and ppgdp of the 4 country with highest female life expectancy (lifeExpF)? Make sure you are using no country where region is missing.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(UN, package="carData")
UN <- as_tibble(UN,rownames="country")

UN |> 
  drop_na(region) |> 
  arrange(-lifeExpF) |> 
  select(country,fertility,ppgdp) |> 
  slice_head(n=4)
## # A tibble: 4 × 3
##   country   fertility  ppgdp
##   <chr>         <dbl>  <dbl>
## 1 Japan          1.42 43141.
## 2 Hong Kong      1.14 31824.
## 3 France         1.99 39546.
## 4 Spain          1.50 30543.
  1. What is the fertility rate and ppgdp of the 2 most urbanised country (pctUrban) for each region? Make sure you are using no country where region is missing.
UN |> 
  drop_na(region) |> 
  group_by(region) |> 
  arrange(-pctUrban) |> 
  select(country,fertility,ppgdp) |> 
  slice_head(n=2)
## Adding missing grouping variables: `region`
## # A tibble: 15 × 4
## # Groups:   region [8]
##    region        country       fertility  ppgdp
##    <fct>         <chr>             <dbl>  <dbl>
##  1 Africa        Gabon              3.20 12469.
##  2 Africa        Libya              2.41 11321.
##  3 Asia          Hong Kong          1.14 31824.
##  4 Asia          Macao              1.16 49990.
##  5 Caribbean     Anguilla           2    13750.
##  6 Caribbean     Bermuda            1.76 92625.
##  7 Europe        Belgium            1.84 43815.
##  8 Europe        Malta              1.28 19599.
##  9 Latin Amer    Venezuela          2.39 13503.
## 10 Latin Amer    Argentina          2.17  9162.
## 11 North America United States      2.08 46546.
## 12 North America Canada             1.69 46361.
## 13 NorthAtlantic Greenland          2.22 35293.
## 14 Oceania       Nauru              3.3   6190.
## 15 Oceania       Australia          1.95 57119.
  1. Create a variable called fertility1000, the result of this operation fertility*1000-infantMortality. What is the highest country on fertility1000 in each region? Make sure to use only countries where both fertility and infantMortality are known.
UN |> 
  drop_na(fertility,infantMortality) |> 
  mutate(fertility1000=fertility*1000-infantMortality) |> 
  group_by(region) |> 
  arrange(-fertility1000) |> 
  slice_head(n=1)
## # A tibble: 7 × 9
## # Groups:   region [7]
##   country        region group fertility  ppgdp lifeExpF pctUrban infantMortality
##   <chr>          <fct>  <fct>     <dbl>  <dbl>    <dbl>    <dbl>           <dbl>
## 1 Niger          Africa afri…      6.92   358.     55.8       17           85.8 
## 2 Timor Leste    Asia   other      5.92   706.     64.2       29           56.5 
## 3 Haiti          Carib… other      3.16   613.     63.9       54           58.3 
## 4 Iceland        Europe other      2.10 39278      83.8       94            2.06
## 5 Guatemala      Latin… other      3.84  2882.     75.1       50           26.3 
## 6 United States  North… oecd       2.08 46546.     81.3       83            6.46
## 7 Marshall Isla… Ocean… other      4.38  3069.     70.6       72           21   
## # ℹ 1 more variable: fertility1000 <dbl>

1. Summary Statistics

There’s a bunch of them:

Statistic R function Formula
Mean mean() \(\bar Y=\frac{\sum_{i=1}^n Y_i}{n}\)
Variance var() \(sd^2=\frac{\sum_{i=1}^n(Y_i-\bar Y)^2}{n-1}\)
Standard Deviation sd() \(sd=\sqrt\frac{\sum_{i=1}^n(Y_i-\bar Y)^2}{n-1}\)
Median median() Middle value (\(Q_2\))
Interquartile Range IQR() \(iqr=Q_3-Q_1\)
Minimum min()
Maximum max()

Usefull functions:

  • summary(myVector) will give you most of what’s above.
  • DAMisc::sumStats(myVector) will give you all of what’s above in a formatted table.

1.1 Examples

  1. What is the average price of diamonds in the dataset diamonds that ships with the tidyverse?
#### Initialization ####
#install.packages("devtools")
#devtools::install_github("https://github.com/davidaarmstrong/damisc.git")
library(DAMisc)
## Registered S3 method overwritten by 'broom':
##   method        from      
##   nobs.multinom clarkeTest
mean(diamonds$price)
## [1] 3932.8
  1. What is the standard deviation of the carat of the diamonds?
sd(diamonds$carat)
## [1] 0.4740112
  1. What about the variance?
var(diamonds$carat)
## [1] 0.2246867
# You could also do this instead
sd(diamonds$carat)^2
## [1] 0.2246867
  1. What if you want all of the statistics for clarity?
class(diamonds$clarity)
## [1] "ordered" "factor"
table(diamonds$clarity)
## 
##    I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF 
##   741  9194 13065 12258  8171  5066  3655  1790
# Being a factor, summary will not be much help
summary(diamonds$clarity)
##    I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF 
##   741  9194 13065 12258  8171  5066  3655  1790
# Except if you do this
summary(as.numeric(diamonds$clarity))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   4.051   5.000   8.000
# But sumStats does better
sumStats(diamonds, "clarity")
## # A tibble: 1 × 11
##   variable  mean    sd   iqr   min   q25   q50   q75   max     n   nNA
##   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 clarity   4.05  1.65     2     1     3     4     5     8 53940     0
  1. What is the mean and standard deviation of the price of diamonds with a clarity of VVS2 and carat between 2.0 and 2.1?

2. ggplot2

  • ggplot2 is a package from the tidyverse suit, written as part of Hadley Wickham PhD dissertation.
  • gg stands for Grammar of Graphics.
  • It works in layers.
  • Best ressource here.

General Setup:

library(tidyverse)

ggplot(diamonds,                                # 1. Data;
       aes(x=cut))+                             # 2. Axis variables;
  geom_bar()+                                   # 3. What you want to draw;
  scale_x_discrete("\nDiamond Quality")+        # 4. Specify x axis text;
  scale_y_continuous("Frequency\n",             # 5. Specify y axis text;
                     breaks=seq(0,20000,2500))+ #
  theme_minimal()+                             # 6. General Style;
  theme(panel.grid.minor = element_blank())     # 7. Specific Style.

With these notions, you can be quite creative:

Sky is the limit bud!
Sky is the limit bud!

2.1 Examples

  1. Using the diamonds dataset, create a bar plot of the number of observations for each diamond quality (cut). Subset the data to only diamonds above the mean price from the dataset.
diamonds %>% 
  filter(price > mean(diamonds$price)) %>% 
  ggplot()+
  geom_bar(aes(x=cut))+
  xlab("\nQuality")+
  ylab("Frequency\n")+
  theme_light()

  1. Reorder the above from highest to lowest frequency.
# Option 1
diamonds %>% 
  filter(price > mean(diamonds$price)) %>% 
  mutate(cut_rev=factor(cut,
                        levels=c("Ideal","Premium","Very Good","Good","Fair"))) %>% 
  ggplot()+
  geom_bar(aes(x=cut_rev))+
  xlab("\nQuality")+
  ylab("Frequency\n")+
  theme_light()

# Option 2
diamonds %>% 
  filter(price > mean(diamonds$price)) %>% 
  mutate(ones = 1) %>% 
  ggplot(aes(x=reorder(cut,ones,sum,decreasing = T)))+
  geom_bar()+
  xlab("\nQuality")+
  ylab("Frequency\n")+
  theme_light()

# Option 3
diamonds %>% 
  filter(price > mean(diamonds$price)) %>% 
  group_by(cut) %>% 
  summarise(freq=n()) %>% 
  ggplot(aes(x=reorder(cut,freq,decreasing = T),y=freq)) +
  geom_col()+
  xlab("\nQuality")+
  ylab("Frequency\n")+
  theme_light()

  1. Plot a histogram of price per cut of diamonds.
ggplot(diamonds,aes(x=price))+
  geom_histogram()+
  facet_wrap(~cut)+
  xlab("\nPrice in $")+
  ylab("Frequency\n")+
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. Arrange the facets above in a single row in order of their respective medians. Make sure the x axis is readable.
ggplot(diamonds,aes(x=price))+
  geom_histogram()+
  facet_wrap(~reorder(cut,price,median),nrow=1)+
  xlab("\nPrice in $")+
  ylab("Frequency\n")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle=45,hjust=1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. Plot the average price of diamonds per carat as they relate to clarity
diamonds %>% 
  group_by(clarity,carat) %>% 
  summarise(averagePrice=mean(price)) %>% 
  ggplot(aes(x=carat,y=averagePrice,color=clarity))+
  geom_point()+
  scale_color_brewer("Diamond Clarity",palette = "RdBu")+
  xlab("\nCarat")+
  ylab("Price in $\n")+
  theme_minimal()+
  theme(legend.position = "top")+
  guides(colour = guide_legend(nrow = 1))
## `summarise()` has grouped output by 'clarity'. You can override using the
## `.groups` argument.

3. Try it yourself

From the carData::Salaries data set of 2008-2009 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S., create a graph that presents:

  1. The mean salary per years since PhD with regards to sex per discipline.
  2. Make sure that the axis, facets and legend are properly named.
  3. Rescale the y axis to have salaries in increments of 25000 and the x axis to have years in increments of 5.
  4. Make sure the legend is on top of the graph.