000. What we’ll be doing

  1. Review of last week’s exercise
  2. Review of data cleaning
  3. Reshaping data
  4. From wide to long
  5. From long to wide
  6. Exercise

00. Last week’s exercise

Using the carData::Salaries data set (2008-2009 nine-month academic salaries for Assistant Professors, Associate Professors and Professors at a college in the U.S.), create a graph that meets the following requirements:

  1. It presents the mean salary by years since PhD, broken down by sex and by discipline.
  2. The axes, facets and legend are properly named.
  3. The y axis shows salaries in increments of 25000 and the x axis shows years in increments of 5.
  4. The legend is on top of the graph.

library(tidyverse)
range(carData::Salaries$yrs.since.phd)
## [1]  1 56
range(carData::Salaries$salary)
## [1]  57800 231545
carData::Salaries |> 
  mutate(discipline=factor(discipline,
                           levels=c("A","B"),
                           labels=c("Theoretical","Applied"))) |> 
  group_by(yrs.since.phd,sex,discipline) |> 
  summarise(meanSal=mean(salary),.groups="drop") |> 
  ggplot(aes(x=yrs.since.phd,y=meanSal,color=sex))+
  geom_line()+
  scale_x_continuous("\nYears since PhD.",breaks=seq(0,56,5))+
  scale_y_continuous("Salaries in USD\n",breaks=seq(0,250000,25000))+
  scale_color_brewer("",palette = "Set1")+
  facet_wrap(~discipline)+
  theme_minimal()+
  theme(legend.position = "top")

0. Data Cleaning

## Setting up
library(rio)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
CES_raw <- import("2021 Canadian Election Study v2.0.dta") 

## A function to replace the missing-value code -99 with NA
remove99 <- function(x){
  out <- ifelse(x==-99,NA,x)
  return(out)
}

## Cleaning
CES <- factorize(CES_raw) |> 
  rename(UUID=cps21_ResponseId,
         yob=cps21_yob,
         gender=cps21_genderid,
         prov=cps21_province,
         vote=pes21_votechoice2021,
         pplFeel_Americans=cps21_groups_therm_7,
         pplFeel_Francophones=cps21_groups_therm_3,
         pplFeel_LPC=cps21_party_rating_23,
         pplFeel_CPC=cps21_party_rating_24,
         pplFeel_NDP=cps21_party_rating_25,
         pplFeel_BQ=cps21_party_rating_26,
         pplFeel_GPC=cps21_party_rating_27,
         pplFeel_PPC=cps21_party_rating_29) |> 
  select(UUID,yob,gender,prov,vote,
         pplFeel_Americans,pplFeel_Francophones,
         pplFeel_LPC,pplFeel_CPC,pplFeel_NDP,
         pplFeel_BQ,pplFeel_GPC,pplFeel_PPC) |> 
  filter(!(prov %in% c("Northwest Territories","Nunavut","Yukon"))) |> 
  mutate(age=2021-as.numeric(as.character(yob)),
         vt = case_when(
              vote == "Liberal Party" ~ "LPC",
              vote == "Conservative Party" ~ "CPC",
              vote == "ndp" ~ "NDP",
              vote == "Bloc Québécois" ~ "BQ",
              vote == "Green Party" ~ "GPC",
              vote == "People's Party" ~ "PPC",
              vote == "Another party (specify)" ~ "Other",
              vote %in% c("I spoiled my vote","Don't know / Prefer not to answer") ~ NA_character_
            ),
         prov=factor(prov,levels=c("British Columbia","Alberta","Saskatchewan","Manitoba",
                                   "Ontario","Quebec",
                                   "New Brunswick","Prince Edward Island","Nova Scotia",
                                   "Newfoundland and Labrador"))) |> 
  mutate(across(starts_with("pplFeel"),remove99))
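
To see what the last cleaning step does, here is a minimal sketch on a hypothetical toy data set (the `toy` tibble and its columns are made up for illustration, they are not CES variables). It applies `remove99` to every column whose name starts with "pplFeel", using `across()`, dplyr's current idiom for applying one function to several columns.

```r
library(tidyverse)

# Same helper as in the cleaning script: recode -99 as NA
remove99 <- function(x){
  out <- ifelse(x==-99,NA,x)
  return(out)
}

# Hypothetical toy data: an ID and two thermometer columns
toy <- tibble(id=1:3,
              pplFeel_A=c(50,-99,80),
              pplFeel_B=c(-99,20,30))

# Apply remove99 to every column whose name starts with "pplFeel"
toy_clean <- toy |> 
  mutate(across(starts_with("pplFeel"),remove99))

toy_clean
```

The `id` column is left untouched because it does not match the `starts_with()` selection.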

1. Reshaping data, the idea

  • Wide data: Each unit of analysis is found ONCE in the ID column.
  • Long data: The units of analysis are repeated in the ID column.
  • Going from wide to long: tidyr::pivot_longer()
  • Going from long to wide: tidyr::pivot_wider()

Wide: Each unit of analysis is found ONCE in the ID column

religion           $10-40k   $40-100k     >100k
Atheist             116.00     178.00    133.00
Catholic           2019.00    2703.00   1425.00
Evangelical Prot   2915.00    3316.00   1137.00

Long: The units of analysis are repeated in the ID column

religion           salary            n
Atheist            $10-40k      116.00
Atheist            $40-100k     178.00
Atheist            >100k        133.00
Catholic           $10-40k     2019.00
Catholic           $40-100k    2703.00
Catholic           >100k       1425.00
Evangelical Prot   $10-40k     2915.00
Evangelical Prot   $40-100k    3316.00
Evangelical Prot   >100k       1137.00

2. From wide to long

Using the CES data, create a graph that superimposes the histogram of feelings towards Francophones on top of the histogram of feelings towards Americans.

Long <- CES |> 
  pivot_longer(c("pplFeel_Americans","pplFeel_Francophones"),
               names_pattern = "pplFeel_(.*)",
               names_to = "group",values_to = "score")

ggplot(Long,aes(x=score,fill=group))+
  geom_histogram(position="identity",bins = 20,alpha=0.3)+
  theme_minimal()+
  theme(legend.position = "top")+
  labs(x="\nFeeling Thermometer Score (0-100)",
       y="Frequency\n",
       fill="")
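
The role of `names_pattern` is easier to see on a small example. The sketch below uses a hypothetical two-respondent tibble (`toy`, with made-up scores) in place of the CES. The part of each column name captured by the parentheses in the regular expression becomes the value of `group`. For a simple prefix like this one, `names_prefix` should give the same result.

```r
library(tidyverse)

# Hypothetical toy data standing in for the CES columns
toy <- tibble(UUID=c("a","b"),
              pplFeel_Americans=c(60,40),
              pplFeel_Francophones=c(70,NA))

# names_pattern keeps only what the (.*) group captures
withPattern <- toy |> 
  pivot_longer(c("pplFeel_Americans","pplFeel_Francophones"),
               names_pattern = "pplFeel_(.*)",
               names_to = "group",values_to = "score")

# names_prefix strips the matching text from the start of each name
withPrefix <- toy |> 
  pivot_longer(c("pplFeel_Americans","pplFeel_Francophones"),
               names_prefix = "pplFeel_",
               names_to = "group",values_to = "score")

withPattern
```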

2.1 Looking under the hood

Let’s do the same exercise, but without the help of pivot_longer.

Data_A <- CES |> 
  select(-pplFeel_Francophones) |> 
  rename(score=pplFeel_Americans) |> 
  mutate(group="Americans")

Data_F <- CES |> 
  select(-pplFeel_Americans) |> 
  rename(score=pplFeel_Francophones) |> 
  mutate(group="Francophones")

Long2 <- rbind(Data_A,Data_F)

ggplot(Long2,aes(x=score,fill=group))+
  geom_histogram(position="identity",bins = 20,alpha=0.3)+
  theme_minimal()+
  theme(legend.position = "top")+
  labs(x="\nFeeling Thermometer Score (0-100)",
       y="Frequency\n",
       fill="")
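
The two approaches should produce the same rows, but not in the same order: `pivot_longer` keeps each respondent's rows together, while stacking puts all Americans rows first. The column order also differs. A minimal sketch on hypothetical toy data (`toy`, `viaPivot` and `viaStack` are illustrative names) sorts and reorders both results before comparing them.

```r
library(tidyverse)

# Hypothetical toy data standing in for the CES
toy <- tibble(UUID=c("a","b"),
              pplFeel_Americans=c(60,40),
              pplFeel_Francophones=c(70,NA))

# Approach 1: pivot_longer
viaPivot <- toy |> 
  pivot_longer(c("pplFeel_Americans","pplFeel_Francophones"),
               names_pattern = "pplFeel_(.*)",
               names_to = "group",values_to = "score")

# Approach 2: build one data set per group, then stack them
toy_A <- toy |> 
  select(-pplFeel_Francophones) |> 
  rename(score=pplFeel_Americans) |> 
  mutate(group="Americans")
toy_F <- toy |> 
  select(-pplFeel_Americans) |> 
  rename(score=pplFeel_Francophones) |> 
  mutate(group="Francophones")
viaStack <- bind_rows(toy_A,toy_F)

# Put rows and columns in a common order before comparing
tidyUp <- function(d){
  d |> arrange(group,UUID) |> select(UUID,group,score) |> as.data.frame()
}
same <- isTRUE(all.equal(tidyUp(viaPivot),tidyUp(viaStack)))
same
```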

3. From long to wide

From Long, go back to the original data.

Wide <- Long |> 
  pivot_wider(names_from = "group", names_prefix = "pplFeel_",values_from = score)
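
One detail worth checking after a round trip is the column order: the pivoted columns come back at the end of the data set. The sketch below illustrates this on hypothetical toy data (the `toy` tibble and its `age` values are made up) and then restores the original column order before comparing.

```r
library(tidyverse)

# Hypothetical toy data with a column placed after the thermometers
toy <- tibble(UUID=c("a","b"),
              pplFeel_Americans=c(60,40),
              pplFeel_Francophones=c(70,NA),
              age=c(30,55))

# Wide to long, then back to wide
roundTrip <- toy |> 
  pivot_longer(c("pplFeel_Americans","pplFeel_Francophones"),
               names_pattern = "pplFeel_(.*)",
               names_to = "group",values_to = "score") |> 
  pivot_wider(names_from = "group",names_prefix = "pplFeel_",values_from = score)

# The thermometer columns now come after age
names(roundTrip)

# Same content once the columns are put back in their original order
restored <- roundTrip |> select(all_of(names(toy)))
isTRUE(all.equal(as.data.frame(restored),as.data.frame(toy)))
```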

3.1 Looking under the hood

From Long2 back to the original data, without the use of pivot_wider.

Wide2 <- Long2 |> 
  filter(group=="Americans") |> 
  rename(pplFeel_Americans=score) |> 
  # Relies on both groups being stacked in the same respondent order in Long2
  mutate(pplFeel_Francophones=Long2$score[Long2$group=="Francophones"]) |> 
  select(-group)
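
The manual approach above pastes the Francophones scores next to the Americans scores by position, so it only works when both groups list the respondents in the same order. A possible alternative, sketched here on hypothetical toy data with deliberately shuffled rows, matches the scores by ID with a join instead. With the real data, the other respondent-level columns would also have to be kept in one of the two halves.

```r
library(tidyverse)

# Hypothetical long data whose rows are NOT in a consistent order
toyLong <- tibble(UUID=c("b","a","a","b"),
                  group=c("Francophones","Americans","Francophones","Americans"),
                  score=c(NA,60,70,40))

# One data set per group, keeping the ID and renaming the score
wideA <- toyLong |> 
  filter(group=="Americans") |> 
  select(UUID,pplFeel_Americans=score)
wideF <- toyLong |> 
  filter(group=="Francophones") |> 
  select(UUID,pplFeel_Francophones=score)

# Match the two halves on the ID rather than on row position
toyWide <- full_join(wideA,wideF,by="UUID")
toyWide
```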

4. Exercise

Using the data we just cleaned (CES):

  1. Compute the average score given to each party per province.
  2. Present these averages in a graph where the Y axis shows the provinces and the X axis shows the averages.