POLISCI 9590 - Lab3: Reshaping Data

000. What we’ll be doing

Review of last week’s exercise
Review of data cleaning
Reshaping data
From wide to long
From long to wide
Exercise

00. Last week’s exercise

From the carData::Salaries data set of 2008-2009 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S., create a graph that presents:

The mean salary by years since PhD, broken down by sex and by discipline.
Make sure that the axes, facets and legend are properly named.
Rescale the y axis to have salaries in increments of 25000 and the x axis to have years in increments of 5.
Make sure the legend is on top of the graph.

library(tidyverse)
range(carData::Salaries$yrs.since.phd)

## [1]  1 56

range(carData::Salaries$salary)

## [1]  57800 231545

carData::Salaries |> 
  mutate(discipline=factor(discipline,
                           levels=c("A","B"),
                           labels=c("Theoretical","Applied"))) |> 
  group_by(yrs.since.phd,sex,discipline) |> 
  summarise(meanSal=mean(salary),.groups="drop") |> 
  ggplot(aes(x=yrs.since.phd,y=meanSal,color=sex))+
  geom_line()+
  scale_x_continuous("\nYears since PhD.",breaks=seq(0,56,5))+
  scale_y_continuous("Salaries in USD\n",breaks=seq(0,250000,25000))+
  scale_color_brewer("",palette = "Set1")+
  facet_wrap(~discipline)+
  theme_minimal()+
  theme(legend.position = "top")

0. Data Cleaning

## Setting up
library(rio)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
CES_raw <- import("2021 Canadian Election Study v2.0.dta") 

## A function to replace -99 by NA
remove99 <- function(x){
  out <- ifelse(x==-99,NA,x)
  return(out)
}

## Cleaning
CES <- factorize(CES_raw) |> 
  rename(UUID=cps21_ResponseId,
         yob=cps21_yob,
         gender=cps21_genderid,
         prov=cps21_province,
         vote=pes21_votechoice2021,
         pplFeel_Americans=cps21_groups_therm_7,
         pplFeel_Francophones=cps21_groups_therm_3,
         pplFeel_LPC=cps21_party_rating_23,
         pplFeel_CPC=cps21_party_rating_24,
         pplFeel_NDP=cps21_party_rating_25,
         pplFeel_BQ=cps21_party_rating_26,
         pplFeel_GPC=cps21_party_rating_27,
         pplFeel_PPC=cps21_party_rating_29) |> 
  select(UUID,yob,gender,prov,vote,
         pplFeel_Americans,pplFeel_Francophones,
         pplFeel_LPC,pplFeel_CPC,pplFeel_NDP,
         pplFeel_BQ,pplFeel_GPC,pplFeel_PPC) |> 
  filter(!(prov %in% c("Northwest Territories","Nunavut","Yukon"))) |> 
  mutate(age=2021-as.numeric(as.character(yob)),
         vt = case_when(
              vote == "Liberal Party" ~ "LPC",
              vote == "Conservative Party" ~ "CPC",
              vote == "ndp" ~ "NDP",
              vote == "Bloc Québécois" ~ "BQ",
              vote == "Green Party" ~ "GPC",
              vote == "People's Party" ~ "PPC",
              vote == "Another party (specify)" ~ "Other",
              vote %in% c("I spoiled my vote","Don't know / Prefer not to answer") ~ NA_character_
            ),
         prov=factor(prov,levels=c("British Columbia","Alberta","Saskatchewan","Manitoba",
                                   "Ontario","Quebec",
                                   "New Brunswick","Prince Edward Island","Nova Scotia",
                                   "Newfoundland and Labrador"))) |> 
  ## across() replaces the superseded mutate_at(); same result
  mutate(across(starts_with("pplFeel"),remove99))
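To see what the remove99 step does without loading the CES file, here is a minimal sketch on a made-up three-row data set. The object names toy and cleaned and the values are hypothetical; only remove99 and the pplFeel_ prefix come from the code above. The sketch uses across(), the current dplyr idiom for applying a function to several columns.

```r
library(dplyr)

# Same helper as in the cleaning code above
remove99 <- function(x){
  out <- ifelse(x==-99,NA,x)
  return(out)
}

# Hypothetical toy data: -99 marks a missing answer
toy <- tibble(
  UUID = c("a", "b", "c"),
  pplFeel_LPC = c(60, -99, 35),
  pplFeel_CPC = c(-99, 20, 75)
)

# Apply remove99 to every column whose name starts with "pplFeel"
cleaned <- toy |>
  mutate(across(starts_with("pplFeel"), remove99))

cleaned
```

Each pplFeel column ends up with one NA where the -99 was, and UUID is left untouched. For a single sentinel value, dplyr::na_if(x, -99) would be a possible alternative to the custom helper.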

1. Reshaping data, the idea

Wide data: Each unit of analysis is found ONCE in the ID column.
Long data: The units of analysis are repeated in the ID column.
Going from wide to long: tidyr::pivot_longer()
Going from long to wide: tidyr::pivot_wider()

Wide: Each unit of analysis is found ONCE in the ID column
religion	$10-40k	$40-100k	>100k
Atheist	116.00	178.00	133.00
Catholic	2019.00	2703.00	1425.00
Evangelical Prot	2915.00	3316.00	1137.00

Long: The units of analysis are repeated in the ID column
religion	salary	n
Atheist	$10-40k	116.00
Atheist	$40-100k	178.00
Atheist	>100k	133.00
Catholic	$10-40k	2019.00
Catholic	$40-100k	2703.00
Catholic	>100k	1425.00
Evangelical Prot	$10-40k	2915.00
Evangelical Prot	$40-100k	3316.00
Evangelical Prot	>100k	1137.00
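
The two tables above can be reproduced with a short sketch. The object names wide, long and wide_again are chosen here for illustration; the numbers are the ones shown in the tables.

```r
library(tidyr)
library(tibble)

# Wide table from the example above: one row per religion
wide <- tribble(
  ~religion,          ~`$10-40k`, ~`$40-100k`, ~`>100k`,
  "Atheist",                 116,         178,      133,
  "Catholic",               2019,        2703,     1425,
  "Evangelical Prot",       2915,        3316,     1137
)

# Wide to long: every column except the ID column becomes a (salary, n) pair
long <- wide |>
  pivot_longer(-religion, names_to = "salary", values_to = "n")

# Long to wide: spread the salary brackets back into columns
wide_again <- long |>
  pivot_wider(names_from = salary, values_from = n)

nrow(long)   # 9 rows: 3 religions x 3 brackets
all.equal(as.data.frame(wide), as.data.frame(wide_again))
```

Nothing is lost in either direction: the same nine counts are stored either as a 3 x 3 grid or as nine rows.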

2. From wide to long

Using the CES data, create a graph that superimposes the histogram of feelings towards Francophones on top of the histogram of feelings towards Americans.

Long <- CES |> 
  pivot_longer(c("pplFeel_Americans","pplFeel_Francophones"),
               names_pattern = "pplFeel_(.*)",
               names_to = "group",values_to = "score")

ggplot(Long,aes(x=score,fill=group))+
  geom_histogram(position="identity",bins = 20,alpha=0.3)+
  theme_minimal()+
  theme(legend.position = "top")+
  labs(x="\nFeeling Thermometer Score (0-100)",
       y="Frequency\n",
       fill="")

2.1 Looking under the hood

Let’s do the same exercise, but without the help of pivot_longer.

Data_A <- CES |> 
  select(-pplFeel_Francophones) |> 
  rename(score=pplFeel_Americans) |> 
  mutate(group="Americans")

Data_F <- CES |> 
  select(-pplFeel_Americans) |> 
  rename(score=pplFeel_Francophones) |> 
  mutate(group="Francophones")

Long2 <- rbind(Data_A,Data_F)

ggplot(Long2,aes(x=score,fill=group))+
  geom_histogram(position="identity",bins = 20,alpha=0.3)+
  theme_minimal()+
  theme(legend.position = "top")+
  labs(x="\nFeeling Thermometer Score (0-100)",
       y="Frequency\n",
       fill="")
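
One way to check that the manual approach and pivot_longer agree is to compare them on a small made-up data set. The sketch below assumes a hypothetical three-respondent stand-in for CES (toy) and uses bind_rows in place of rbind; the row order differs between the two results, so both are sorted before comparing.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for CES: 3 respondents, 2 thermometers
toy <- tibble(
  UUID = c("a", "b", "c"),
  pplFeel_Americans    = c(50, 70, NA),
  pplFeel_Francophones = c(80, 60, 90)
)

# pivot_longer: rows come out respondent by respondent
long_pivot <- toy |>
  pivot_longer(starts_with("pplFeel"),
               names_pattern = "pplFeel_(.*)",
               names_to = "group", values_to = "score")

# Manual version: one block per group, stacked on top of each other
long_manual <- bind_rows(
  toy |> select(UUID, score = pplFeel_Americans)    |> mutate(group = "Americans"),
  toy |> select(UUID, score = pplFeel_Francophones) |> mutate(group = "Francophones")
)

# Same rows, different row and column order: align both before comparing
a <- long_pivot  |> select(UUID, group, score) |> arrange(UUID, group)
b <- long_manual |> select(UUID, group, score) |> arrange(UUID, group)
all.equal(as.data.frame(a), as.data.frame(b))
```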

3. From long to wide

From Long, go back to the original data.

Wide <- Long |> 
  pivot_wider(names_from = "group", names_prefix = "pplFeel_",values_from = score)

3.1 Looking under the hood

Go from Long2 back to the original data without using pivot_wider.

Wide2 <- Long2 |> 
  filter(group=="Americans") |> 
  rename(pplFeel_Americans=score) |> 
  ## Positional match: valid only because both halves of Long2 keep the row order of CES
  mutate(pplFeel_Francophones=Long2$score[Long2$group=="Francophones"]) |> 
  select(-group)
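
The manual approach above pairs the two scores by row position, which works only when both groups are stored in the same respondent order. A possibly safer variant matches on the ID column instead. The sketch below is an illustration under assumptions: long_toy is a hypothetical data set with deliberately shuffled rows, and full_join is swapped in for the positional assignment.

```r
library(dplyr)

# Hypothetical long data with the rows deliberately shuffled
long_toy <- tibble(
  UUID  = c("b", "a", "c", "a", "c", "b"),
  group = c("Americans", "Americans", "Francophones",
            "Francophones", "Americans", "Francophones"),
  score = c(70, 50, 90, 80, 40, 60)
)

# One data set per group, keeping the ID column in each
amer <- long_toy |>
  filter(group == "Americans") |>
  select(UUID, pplFeel_Americans = score)

fran <- long_toy |>
  filter(group == "Francophones") |>
  select(UUID, pplFeel_Francophones = score)

# Match on the ID column instead of on row position
wide_toy <- full_join(amer, fran, by = "UUID") |>
  arrange(UUID)

wide_toy
```

With these shuffled rows, a positional assignment would pair respondent b's score for Americans with respondent c's score for Francophones; joining on UUID keeps each respondent's two scores together.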

4. Exercise

Using the data we just cleaned (CES):

Compute the average score given to each party per province.
Present these averages in a graph where the Y axis shows the provinces and the X axis shows the averages.