R
is a language RStudio
is where we write
it.setwd()
R
has a few basic form: 1) Vectors, 2)
Matrices, 3) Data Frames, 4) Lists.R
has a few basic classes: 1) Numeric; 2)
Integer; 3) Character; 4) Logical; 5) Factor.|>
is used between operations to take the
output of what comes before and use it as the input of what comes
next.[i]
with vectors (one dimension);[i,j]
with matrices or data frames (two dimensions, row
number i and column number j);mydf$aVariableName
in data frames, extracts the column
of the corresponding name.dplyr
:
select()
: which columns do you want;rename()
: change name of a column;mutate()
: add a new variable;filter()
: which rows do you want (based on a
condition).fertility
rate and ppgdp
of
the 4 country with highest female life expectancy
(lifeExpF
)? Make sure you are using no country where
region
is missing.library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(UN, package="carData")
UN <- as_tibble(UN,rownames="country")
UN |>
drop_na(region) |>
arrange(-lifeExpF) |>
select(country,fertility,ppgdp) |>
slice_head(n=4)
## # A tibble: 4 × 3
## country fertility ppgdp
## <chr> <dbl> <dbl>
## 1 Japan 1.42 43141.
## 2 Hong Kong 1.14 31824.
## 3 France 1.99 39546.
## 4 Spain 1.50 30543.
fertility
rate and ppgdp
of
the 2 most urbanised country (pctUrban
) for each
region
? Make sure you are using no country where
region
is missing.UN |>
drop_na(region) |>
group_by(region) |>
arrange(-pctUrban) |>
select(country,fertility,ppgdp) |>
slice_head(n=2)
## Adding missing grouping variables: `region`
## # A tibble: 15 × 4
## # Groups: region [8]
## region country fertility ppgdp
## <fct> <chr> <dbl> <dbl>
## 1 Africa Gabon 3.20 12469.
## 2 Africa Libya 2.41 11321.
## 3 Asia Hong Kong 1.14 31824.
## 4 Asia Macao 1.16 49990.
## 5 Caribbean Anguilla 2 13750.
## 6 Caribbean Bermuda 1.76 92625.
## 7 Europe Belgium 1.84 43815.
## 8 Europe Malta 1.28 19599.
## 9 Latin Amer Venezuela 2.39 13503.
## 10 Latin Amer Argentina 2.17 9162.
## 11 North America United States 2.08 46546.
## 12 North America Canada 1.69 46361.
## 13 NorthAtlantic Greenland 2.22 35293.
## 14 Oceania Nauru 3.3 6190.
## 15 Oceania Australia 1.95 57119.
fertility1000
, the result of
this operation fertility*1000-infantMortality
. What is the
highest country on fertility1000
in each region? Make sure
to use only countries where both fertility
and
infantMortality
are known.UN |>
drop_na(fertility,infantMortality) |>
mutate(fertility1000=fertility*1000-infantMortality) |>
group_by(region) |>
arrange(-fertility1000) |>
slice_head(n=1)
## # A tibble: 7 × 9
## # Groups: region [7]
## country region group fertility ppgdp lifeExpF pctUrban infantMortality
## <chr> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Niger Africa afri… 6.92 358. 55.8 17 85.8
## 2 Timor Leste Asia other 5.92 706. 64.2 29 56.5
## 3 Haiti Carib… other 3.16 613. 63.9 54 58.3
## 4 Iceland Europe other 2.10 39278 83.8 94 2.06
## 5 Guatemala Latin… other 3.84 2882. 75.1 50 26.3
## 6 United States North… oecd 2.08 46546. 81.3 83 6.46
## 7 Marshall Isla… Ocean… other 4.38 3069. 70.6 72 21
## # ℹ 1 more variable: fertility1000 <dbl>
There’s a bunch of them:
Statistic | R function | Formula |
---|---|---|
Mean | mean() |
\(\bar Y=\frac{\sum_{i=1}^n Y_i}{n}\) |
Variance | var() |
\(sd^2=\frac{\sum_{i=1}^n(Y_i-\bar Y)^2}{n-1}\) |
Standard Deviation | sd() |
\(sd=\sqrt\frac{\sum_{i=1}^n(Y_i-\bar Y)^2}{n-1}\) |
Median | median() |
Middle value (\(Q_2\)) |
Interquartile Range | IQR() |
\(iqr=Q_3-Q_1\) |
Minimum | min() |
|
Maximum | max() |
Usefull functions:
summary(myVector)
will give you most of what’s
above.DAMisc::sumStats(myVector)
will give you all of what’s
above in a formatted table.price
of diamonds in the dataset
diamonds
that ships with the tidyverse?#### Initialization ####
#install.packages("devtools")
#devtools::install_github("https://github.com/davidaarmstrong/damisc.git")
library(DAMisc)
## Registered S3 method overwritten by 'broom':
## method from
## nobs.multinom clarkeTest
mean(diamonds$price)
## [1] 3932.8
carat
of the
diamonds?sd(diamonds$carat)
## [1] 0.4740112
var(diamonds$carat)
## [1] 0.2246867
# You could also do this instead
sd(diamonds$carat)^2
## [1] 0.2246867
clarity
?class(diamonds$clarity)
## [1] "ordered" "factor"
table(diamonds$clarity)
##
## I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
## 741 9194 13065 12258 8171 5066 3655 1790
# Being a factor, summary will not be much help
summary(diamonds$clarity)
## I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
## 741 9194 13065 12258 8171 5066 3655 1790
# Except if you do this
summary(as.numeric(diamonds$clarity))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 4.051 5.000 8.000
# But sumStats does better
sumStats(diamonds, "clarity")
## # A tibble: 1 × 11
## variable mean sd iqr min q25 q50 q75 max n nNA
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 clarity 4.05 1.65 2 1 3 4 5 8 53940 0
General Setup:
library(tidyverse)
ggplot(diamonds, # 1. Data;
aes(x=cut))+ # 2. Axis variables;
geom_bar()+ # 3. What you want to draw;
scale_x_discrete("\nDiamond Quality")+ # 4. Specify x axis text;
scale_y_continuous("Frequency\n", # 5. Specify y axis text;
breaks=seq(0,20000,2500))+ #
theme_minimal()+ # 6. General Style;
theme(panel.grid.minor = element_blank()) # 7. Specific Style.
With these notions, you can be quite creative:
diamonds
dataset, create a bar plot of the
number of observations for each diamond quality (cut
).
Subset the data to only diamonds above the mean price
from
the dataset.diamonds %>%
filter(price > mean(diamonds$price)) %>%
ggplot()+
geom_bar(aes(x=cut))+
xlab("\nQuality")+
ylab("Frequency\n")+
theme_light()
# Option 1
diamonds %>%
filter(price > mean(diamonds$price)) %>%
mutate(cut_rev=factor(cut,
levels=c("Ideal","Premium","Very Good","Good","Fair"))) %>%
ggplot()+
geom_bar(aes(x=cut_rev))+
xlab("\nQuality")+
ylab("Frequency\n")+
theme_light()
# Option 2
diamonds %>%
filter(price > mean(diamonds$price)) %>%
mutate(ones = 1) %>%
ggplot(aes(x=reorder(cut,ones,sum,decreasing = T)))+
geom_bar()+
xlab("\nQuality")+
ylab("Frequency\n")+
theme_light()
# Option 3
diamonds %>%
filter(price > mean(diamonds$price)) %>%
group_by(cut) %>%
summarise(freq=n()) %>%
ggplot(aes(x=reorder(cut,freq,decreasing = T),y=freq)) +
geom_col()+
xlab("\nQuality")+
ylab("Frequency\n")+
theme_light()
price
per cut
of
diamonds.ggplot(diamonds,aes(x=price))+
geom_histogram()+
facet_wrap(~cut)+
xlab("\nPrice in $")+
ylab("Frequency\n")+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(diamonds,aes(x=price))+
geom_histogram()+
facet_wrap(~reorder(cut,price,median),nrow=1)+
xlab("\nPrice in $")+
ylab("Frequency\n")+
theme_minimal()+
theme(axis.text.x = element_text(angle=45,hjust=1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
price
of diamonds per
carat
as they relate to clarity
diamonds %>%
group_by(clarity,carat) %>%
summarise(averagePrice=mean(price)) %>%
ggplot(aes(x=carat,y=averagePrice,color=clarity))+
geom_point()+
scale_color_brewer("Diamond Clarity",palette = "RdBu")+
xlab("\nCarat")+
ylab("Price in $\n")+
theme_minimal()+
theme(legend.position = "top")+
guides(colour = guide_legend(nrow = 1))
## `summarise()` has grouped output by 'clarity'. You can override using the
## `.groups` argument.
From the carData::Salaries
data set of 2008-2009
nine-month academic salary for Assistant Professors, Associate
Professors and Professors in a college in the U.S., create a graph that
presents: