POLISCI 9590 - Lab2: Summary Statistics and ggplot2

00. What we’ll be doing:

Review;
Summary Statistics;
ggplot2.

0. Review

R is a language RStudio is where we write it.
Directories are how your computer organizes files.
Change the working directory using setwd()
Data is R has a few basic form: 1) Vectors, 2) Matrices, 3) Data Frames, 4) Lists.
Data in R has a few basic classes: 1) Numeric; 2) Integer; 3) Character; 4) Logical; 5) Factor.
The pipe |> is used between operations to take the output of what comes before and use it as the input of what comes next.
Indexing (extracting a specific element of an object) can be done using:
- [i] with vectors (one dimension);
- [i,j] with matrices or data frames (two dimensions, row number i and column number j);
- mydf$aVariableName in data frames, extracts the column of the corresponding name.
You can change the structure of you data frame using different functions from dplyr:
- select(): which columns do you want;
- rename(): change name of a column;
- mutate(): add a new variable;
- filter(): which rows do you want (based on a condition).

0.1 Last week’s exercises

What is the fertility rate and ppgdp of the 4 country with highest female life expectancy (lifeExpF)? Make sure you are using no country where region is missing.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data(UN, package="carData")
UN <- as_tibble(UN,rownames="country")

UN |> 
  drop_na(region) |> 
  arrange(-lifeExpF) |> 
  select(country,fertility,ppgdp) |> 
  slice_head(n=4)

## # A tibble: 4 × 3
##   country   fertility  ppgdp
##   <chr>         <dbl>  <dbl>
## 1 Japan          1.42 43141.
## 2 Hong Kong      1.14 31824.
## 3 France         1.99 39546.
## 4 Spain          1.50 30543.

What is the fertility rate and ppgdp of the 2 most urbanised country (pctUrban) for each region? Make sure you are using no country where region is missing.

UN |> 
  drop_na(region) |> 
  group_by(region) |> 
  arrange(-pctUrban) |> 
  select(country,fertility,ppgdp) |> 
  slice_head(n=2)

## Adding missing grouping variables: `region`

## # A tibble: 15 × 4
## # Groups:   region [8]
##    region        country       fertility  ppgdp
##    <fct>         <chr>             <dbl>  <dbl>
##  1 Africa        Gabon              3.20 12469.
##  2 Africa        Libya              2.41 11321.
##  3 Asia          Hong Kong          1.14 31824.
##  4 Asia          Macao              1.16 49990.
##  5 Caribbean     Anguilla           2    13750.
##  6 Caribbean     Bermuda            1.76 92625.
##  7 Europe        Belgium            1.84 43815.
##  8 Europe        Malta              1.28 19599.
##  9 Latin Amer    Venezuela          2.39 13503.
## 10 Latin Amer    Argentina          2.17  9162.
## 11 North America United States      2.08 46546.
## 12 North America Canada             1.69 46361.
## 13 NorthAtlantic Greenland          2.22 35293.
## 14 Oceania       Nauru              3.3   6190.
## 15 Oceania       Australia          1.95 57119.

Create a variable called fertility1000, the result of this operation fertility*1000-infantMortality. What is the highest country on fertility1000 in each region? Make sure to use only countries where both fertility and infantMortality are known.

UN |> 
  drop_na(fertility,infantMortality) |> 
  mutate(fertility1000=fertility*1000-infantMortality) |> 
  group_by(region) |> 
  arrange(-fertility1000) |> 
  slice_head(n=1)

## # A tibble: 7 × 9
## # Groups:   region [7]
##   country        region group fertility  ppgdp lifeExpF pctUrban infantMortality
##   <chr>          <fct>  <fct>     <dbl>  <dbl>    <dbl>    <dbl>           <dbl>
## 1 Niger          Africa afri…      6.92   358.     55.8       17           85.8 
## 2 Timor Leste    Asia   other      5.92   706.     64.2       29           56.5 
## 3 Haiti          Carib… other      3.16   613.     63.9       54           58.3 
## 4 Iceland        Europe other      2.10 39278      83.8       94            2.06
## 5 Guatemala      Latin… other      3.84  2882.     75.1       50           26.3 
## 6 United States  North… oecd       2.08 46546.     81.3       83            6.46
## 7 Marshall Isla… Ocean… other      4.38  3069.     70.6       72           21   
## # ℹ 1 more variable: fertility1000 <dbl>

1. Summary Statistics

There’s a bunch of them:

Statistic	R function	Formula
Mean	`mean()`	$\bar Y=\frac{\sum_{i=1}^n Y_i}{n}$
Variance	`var()`	$sd^2=\frac{\sum_{i=1}^n(Y_i-\bar Y)^2}{n-1}$
Standard Deviation	`sd()`	$sd=\sqrt\frac{\sum_{i=1}^n(Y_i-\bar Y)^2}{n-1}$
Median	`median()`	Middle value ($Q_2$)
Interquartile Range	`IQR()`	$iqr=Q_3-Q_1$
Minimum	`min()`
Maximum	`max()`

Usefull functions:

summary(myVector) will give you most of what’s above.
DAMisc::sumStats(myVector) will give you all of what’s above in a formatted table.

1.1 Examples

What is the average price of diamonds in the dataset diamonds that ships with the tidyverse?

#### Initialization ####
#install.packages("devtools")
#devtools::install_github("https://github.com/davidaarmstrong/damisc.git")
library(DAMisc)

## Registered S3 method overwritten by 'broom':
##   method        from      
##   nobs.multinom clarkeTest

mean(diamonds$price)

## [1] 3932.8

What is the standard deviation of the carat of the diamonds?

sd(diamonds$carat)

## [1] 0.4740112

What about the variance?

var(diamonds$carat)

## [1] 0.2246867

# You could also do this instead
sd(diamonds$carat)^2

## [1] 0.2246867

What if you want all of the statistics for clarity?

class(diamonds$clarity)

## [1] "ordered" "factor"

table(diamonds$clarity)

## 
##    I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF 
##   741  9194 13065 12258  8171  5066  3655  1790

# Being a factor, summary will not be much help
summary(diamonds$clarity)

##    I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF 
##   741  9194 13065 12258  8171  5066  3655  1790

# Except if you do this
summary(as.numeric(diamonds$clarity))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   4.051   5.000   8.000

# But sumStats does better
sumStats(diamonds, "clarity")

## # A tibble: 1 × 11
##   variable  mean    sd   iqr   min   q25   q50   q75   max     n   nNA
##   <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 clarity   4.05  1.65     2     1     3     4     5     8 53940     0

What is the mean and standard deviation of the price of diamonds with a clarity of VVS2 and carat between 2.0 and 2.1?

2. ggplot2

ggplot2 is a package from the tidyverse suit, written as part of Hadley Wickham PhD dissertation.
gg stands for Grammar of Graphics.
It works in layers.
Best ressource here.

General Setup:

library(tidyverse)

ggplot(diamonds,                                # 1. Data;
       aes(x=cut))+                             # 2. Axis variables;
  geom_bar()+                                   # 3. What you want to draw;
  scale_x_discrete("\nDiamond Quality")+        # 4. Specify x axis text;
  scale_y_continuous("Frequency\n",             # 5. Specify y axis text;
                     breaks=seq(0,20000,2500))+ #
  theme_minimal()+                             # 6. General Style;
  theme(panel.grid.minor = element_blank())     # 7. Specific Style.

With these notions, you can be quite creative:

Sky is the limit bud!

2.1 Examples

Using the diamonds dataset, create a bar plot of the number of observations for each diamond quality (cut). Subset the data to only diamonds above the mean price from the dataset.

diamonds %>% 
  filter(price > mean(diamonds$price)) %>% 
  ggplot()+
  geom_bar(aes(x=cut))+
  xlab("\nQuality")+
  ylab("Frequency\n")+
  theme_light()

Reorder the above from highest to lowest frequency.

# Option 1
diamonds %>% 
  filter(price > mean(diamonds$price)) %>% 
  mutate(cut_rev=factor(cut,
                        levels=c("Ideal","Premium","Very Good","Good","Fair"))) %>% 
  ggplot()+
  geom_bar(aes(x=cut_rev))+
  xlab("\nQuality")+
  ylab("Frequency\n")+
  theme_light()

# Option 2
diamonds %>% 
  filter(price > mean(diamonds$price)) %>% 
  mutate(ones = 1) %>% 
  ggplot(aes(x=reorder(cut,ones,sum,decreasing = T)))+
  geom_bar()+
  xlab("\nQuality")+
  ylab("Frequency\n")+
  theme_light()

# Option 3
diamonds %>% 
  filter(price > mean(diamonds$price)) %>% 
  group_by(cut) %>% 
  summarise(freq=n()) %>% 
  ggplot(aes(x=reorder(cut,freq,decreasing = T),y=freq)) +
  geom_col()+
  xlab("\nQuality")+
  ylab("Frequency\n")+
  theme_light()

Plot a histogram of price per cut of diamonds.

ggplot(diamonds,aes(x=price))+
  geom_histogram()+
  facet_wrap(~cut)+
  xlab("\nPrice in $")+
  ylab("Frequency\n")+
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Arrange the facets above in a single row in order of their respective medians. Make sure the x axis is readable.

ggplot(diamonds,aes(x=price))+
  geom_histogram()+
  facet_wrap(~reorder(cut,price,median),nrow=1)+
  xlab("\nPrice in $")+
  ylab("Frequency\n")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle=45,hjust=1))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Plot the average price of diamonds per carat as they relate to clarity

diamonds %>% 
  group_by(clarity,carat) %>% 
  summarise(averagePrice=mean(price)) %>% 
  ggplot(aes(x=carat,y=averagePrice,color=clarity))+
  geom_point()+
  scale_color_brewer("Diamond Clarity",palette = "RdBu")+
  xlab("\nCarat")+
  ylab("Price in $\n")+
  theme_minimal()+
  theme(legend.position = "top")+
  guides(colour = guide_legend(nrow = 1))

## `summarise()` has grouped output by 'clarity'. You can override using the
## `.groups` argument.

3. Try it yourself

From the carData::Salaries data set of 2008-2009 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S., create a graph that presents:

The mean salary per years since PhD with regards to sex per discipline.
Make sure that the axis, facets and legend are properly named.
Rescale the y axis to have salaries in increments of 25000 and the x axis to have years in increments of 5.
Make sure the legend is on top of the graph.