Data Wrangling with Dplyr

class: center, middle, inverse, title-slide

.title[
# Data Wrangling with Dplyr
]

---

class: inverse

---
class: inverse, center, middle
# tidyverse

---
class: inverse

# `tidyverse` package

- The tidyverse is an opinionated collection of R packages designed for data science.

- All packages share an underlying design philosophy, grammar, and data structures.

---
class: inverse

# Core Packages

- **ggplot2**: Visualization

- **dplyr**: Data Wrangling

- **purrr**: Functional Programming

---
class: inverse

# Install `tidyverse`

```r
install.packages('tidyverse')
library(tidyverse)
```

```r
library(tidyverse)
```

---
class: inverse, center, middle

# read_csv

---
class: inverse

# read_csv

Use `read_csv` instead of `read.csv` to read the data

- `read_csv`

```r
library(tidyverse)
df <- read_csv('https://covidtracking.com/data/download/all-states-history.csv')
```

---
class: inverse

# read_csv vs. read.csv

- read_csv is faster

```r
start_time <- Sys.time()
df <- read.csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')
end_time <- Sys.time()

end_time - start_time
```

```
## Time difference of 2.413997 mins
```

---
class: inverse

# read_csv vs. read.csv

- read_csv is faster

```r
start_time <- Sys.time()
df <- read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')

end_time <- Sys.time()

end_time - start_time
```

```
## Time difference of 7.581324 secs
```

---
class: inverse

- read_csv recognize the types of data better

```r
df <- read.csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')

str(df)
```

```
## 'data.frame':	2502832 obs. of  6 variables:
##  $ date  : chr  "2020-01-21" "2020-01-22" "2020-01-23" "2020-01-24" ...
##  $ county: chr  "Snohomish" "Snohomish" "Snohomish" "Cook" ...
##  $ state : chr  "Washington" "Washington" "Washington" "Illinois" ...
##  $ fips  : int  53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
##  $ cases : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ deaths: int  0 0 0 0 0 0 0 0 0 0 ...
```

---
class: inverse

- read_csv recognize the types of data better

```r
df <- read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')

str(df)
```

```
## spec_tbl_df [2,502,832 x 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ date  : Date[1:2502832], format: "2020-01-21" "2020-01-22" ...
##  $ county: chr [1:2502832] "Snohomish" "Snohomish" "Snohomish" "Cook" ...
##  $ state : chr [1:2502832] "Washington" "Washington" "Washington" "Illinois" ...
##  $ fips  : chr [1:2502832] "53061" "53061" "53061" "17031" ...
##  $ cases : num [1:2502832] 1 1 1 1 1 1 1 1 1 1 ...
##  $ deaths: num [1:2502832] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   date = col_date(format = ""),
##   ..   county = col_character(),
##   ..   state = col_character(),
##   ..   fips = col_character(),
##   ..   cases = col_double(),
##   ..   deaths = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
```

---
class: inverse

- Data prints out nicely with read_csv

```r
df <- read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')

df
```

```
## # A tibble: 2,502,832 x 6
##    date       county      state      fips  cases deaths
##    <date>     <chr>       <chr>      <chr> <dbl>  <dbl>
##  1 2020-01-21 Snohomish   Washington 53061     1      0
##  2 2020-01-22 Snohomish   Washington 53061     1      0
##  3 2020-01-23 Snohomish   Washington 53061     1      0
##  4 2020-01-24 Cook        Illinois   17031     1      0
##  5 2020-01-24 Snohomish   Washington 53061     1      0
##  6 2020-01-25 Orange      California 06059     1      0
##  7 2020-01-25 Cook        Illinois   17031     1      0
##  8 2020-01-25 Snohomish   Washington 53061     1      0
##  9 2020-01-26 Maricopa    Arizona    04013     1      0
## 10 2020-01-26 Los Angeles California 06037     1      0
## # ... with 2,502,822 more rows
```

---
class: inverse, center, middle

# Pipe (%>%)

# Hotkey: Ctrl + Shift + M

---
class: inverse

# Pipe operator (%>%)

- `$x$` %>% `$f$` is the same `$f(x)$`

```r
x <- c(1:10)

# conventional way
sum(x)
```

```
## [1] 55
```

```r
# Pipe
x %>% sum
```

```
## [1] 55
```

---
class: inverse

# Pipe operator (%>%)

- `$x$` %>% `$f$` %>% `$g$` is the same `$g(f(x))$`

```r
x <- c(1:10)

# conventional way
log(sum(x))
```

```
## [1] 4.007333
```

```r
# Pipe
x %>% sum %>% log
```

```
## [1] 4.007333
```

---
class: inverse, center, middle

# dplyr

---
class: inverse

# Five verbs of dplyr

- `select()` selects columns from data
- `filter()` filter rows of data
- `summarise()` summarises data (calculating summary statistics)
- `arrange()` arrange data
- `mutate()` creates new variables

---
class: inverse, center, middle
# select

---
class: inverse

# select: select columns

```r
df <- read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv')

df1 <-  select(df, state, date)
```

---
class: inverse

# select with pipe

```r
df %>% select(state, date)
```

```
## # A tibble: 2,502,832 x 2
##    state      date      
##    <chr>      <date>    
##  1 Washington 2020-01-21
##  2 Washington 2020-01-22
##  3 Washington 2020-01-23
##  4 Illinois   2020-01-24
##  5 Washington 2020-01-24
##  6 California 2020-01-25
##  7 Illinois   2020-01-25
##  8 Washington 2020-01-25
##  9 Arizona    2020-01-26
## 10 California 2020-01-26
## # ... with 2,502,822 more rows
```

---
class: inverse

```r
# Deselect date from the data
df %>% select(-date)
```

```
## # A tibble: 2,502,832 x 5
##    county      state      fips  cases deaths
##    <chr>       <chr>      <chr> <dbl>  <dbl>
##  1 Snohomish   Washington 53061     1      0
##  2 Snohomish   Washington 53061     1      0
##  3 Snohomish   Washington 53061     1      0
##  4 Cook        Illinois   17031     1      0
##  5 Snohomish   Washington 53061     1      0
##  6 Orange      California 06059     1      0
##  7 Cook        Illinois   17031     1      0
##  8 Snohomish   Washington 53061     1      0
##  9 Maricopa    Arizona    04013     1      0
## 10 Los Angeles California 06037     1      0
## # ... with 2,502,822 more rows
```

---
class: inverse

```r
# select only numeric variables
df %>% select_if(is.numeric)
```

```
## # A tibble: 2,502,832 x 2
##    cases deaths
##    <dbl>  <dbl>
##  1     1      0
##  2     1      0
##  3     1      0
##  4     1      0
##  5     1      0
##  6     1      0
##  7     1      0
##  8     1      0
##  9     1      0
## 10     1      0
## # ... with 2,502,822 more rows
```

---
class: inverse

```r
# select only non-numeric variables
df %>% select_if(~!is.numeric(.))
```

```
## # A tibble: 2,502,832 x 4
##    date       county      state      fips 
##    <date>     <chr>       <chr>      <chr>
##  1 2020-01-21 Snohomish   Washington 53061
##  2 2020-01-22 Snohomish   Washington 53061
##  3 2020-01-23 Snohomish   Washington 53061
##  4 2020-01-24 Cook        Illinois   17031
##  5 2020-01-24 Snohomish   Washington 53061
##  6 2020-01-25 Orange      California 06059
##  7 2020-01-25 Cook        Illinois   17031
##  8 2020-01-25 Snohomish   Washington 53061
##  9 2020-01-26 Maricopa    Arizona    04013
## 10 2020-01-26 Los Angeles California 06037
## # ... with 2,502,822 more rows
```

---
class: inverse, center, middle
# filter

---
class: inverse

# filter

```r
# Select rows or days with no deaths
df %>% filter(deaths==0)
```

```
## # A tibble: 262,872 x 6
##    date       county      state      fips  cases deaths
##    <date>     <chr>       <chr>      <chr> <dbl>  <dbl>
##  1 2020-01-21 Snohomish   Washington 53061     1      0
##  2 2020-01-22 Snohomish   Washington 53061     1      0
##  3 2020-01-23 Snohomish   Washington 53061     1      0
##  4 2020-01-24 Cook        Illinois   17031     1      0
##  5 2020-01-24 Snohomish   Washington 53061     1      0
##  6 2020-01-25 Orange      California 06059     1      0
##  7 2020-01-25 Cook        Illinois   17031     1      0
##  8 2020-01-25 Snohomish   Washington 53061     1      0
##  9 2020-01-26 Maricopa    Arizona    04013     1      0
## 10 2020-01-26 Los Angeles California 06037     1      0
## # ... with 262,862 more rows
```

---
class: inverse

# filter

```r
# Select rows or days with no deaths
df %>% filter(state=='Rhode Island',deaths==0)
```

```
## # A tibble: 182 x 6
##    date       county  state        fips  cases deaths
##    <date>     <chr>   <chr>        <chr> <dbl>  <dbl>
##  1 2020-03-01 Unknown Rhode Island <NA>      2      0
##  2 2020-03-02 Unknown Rhode Island <NA>      2      0
##  3 2020-03-03 Unknown Rhode Island <NA>      2      0
##  4 2020-03-04 Unknown Rhode Island <NA>      2      0
##  5 2020-03-05 Unknown Rhode Island <NA>      2      0
##  6 2020-03-06 Unknown Rhode Island <NA>      3      0
##  7 2020-03-07 Unknown Rhode Island <NA>      3      0
##  8 2020-03-08 Unknown Rhode Island <NA>      3      0
##  9 2020-03-09 Unknown Rhode Island <NA>      3      0
## 10 2020-03-10 Unknown Rhode Island <NA>      5      0
## # ... with 172 more rows
```

---
class: inverse, center, middle
# mutate

---
class: inverse

# mutate:  Create new columns

```r
df %>% mutate(cases_minus_deaths = cases - deaths)
```

```
## # A tibble: 2,502,832 x 7
##    date       county      state      fips  cases deaths cases_minus_deaths
##    <date>     <chr>       <chr>      <chr> <dbl>  <dbl>              <dbl>
##  1 2020-01-21 Snohomish   Washington 53061     1      0                  1
##  2 2020-01-22 Snohomish   Washington 53061     1      0                  1
##  3 2020-01-23 Snohomish   Washington 53061     1      0                  1
##  4 2020-01-24 Cook        Illinois   17031     1      0                  1
##  5 2020-01-24 Snohomish   Washington 53061     1      0                  1
##  6 2020-01-25 Orange      California 06059     1      0                  1
##  7 2020-01-25 Cook        Illinois   17031     1      0                  1
##  8 2020-01-25 Snohomish   Washington 53061     1      0                  1
##  9 2020-01-26 Maricopa    Arizona    04013     1      0                  1
## 10 2020-01-26 Los Angeles California 06037     1      0                  1
## # ... with 2,502,822 more rows
```

---
class: inverse

# mutate

- If you want to add the new columns to the original data

```r
df <- df %>% mutate(non_death = cases - deaths)
```

---
class: inverse

# mutate

```r
df %>% mutate(deaths2 = ifelse(deaths==0,'No_deaths','Has_deaths'))
```

```
## # A tibble: 2,502,832 x 8
##    date       county      state      fips  cases deaths non_death deaths2  
##    <date>     <chr>       <chr>      <chr> <dbl>  <dbl>     <dbl> <chr>    
##  1 2020-01-21 Snohomish   Washington 53061     1      0         1 No_deaths
##  2 2020-01-22 Snohomish   Washington 53061     1      0         1 No_deaths
##  3 2020-01-23 Snohomish   Washington 53061     1      0         1 No_deaths
##  4 2020-01-24 Cook        Illinois   17031     1      0         1 No_deaths
##  5 2020-01-24 Snohomish   Washington 53061     1      0         1 No_deaths
##  6 2020-01-25 Orange      California 06059     1      0         1 No_deaths
##  7 2020-01-25 Cook        Illinois   17031     1      0         1 No_deaths
##  8 2020-01-25 Snohomish   Washington 53061     1      0         1 No_deaths
##  9 2020-01-26 Maricopa    Arizona    04013     1      0         1 No_deaths
## 10 2020-01-26 Los Angeles California 06037     1      0         1 No_deaths
## # ... with 2,502,822 more rows
```

---
class: inverse, middle, center
# arrange

---
class: inverse

# arrange

```r
df %>% arrange(deaths)
```

```
## # A tibble: 2,502,832 x 7
##    date       county      state      fips  cases deaths non_death
##    <date>     <chr>       <chr>      <chr> <dbl>  <dbl>     <dbl>
##  1 2020-01-21 Snohomish   Washington 53061     1      0         1
##  2 2020-01-22 Snohomish   Washington 53061     1      0         1
##  3 2020-01-23 Snohomish   Washington 53061     1      0         1
##  4 2020-01-24 Cook        Illinois   17031     1      0         1
##  5 2020-01-24 Snohomish   Washington 53061     1      0         1
##  6 2020-01-25 Orange      California 06059     1      0         1
##  7 2020-01-25 Cook        Illinois   17031     1      0         1
##  8 2020-01-25 Snohomish   Washington 53061     1      0         1
##  9 2020-01-26 Maricopa    Arizona    04013     1      0         1
## 10 2020-01-26 Los Angeles California 06037     1      0         1
## # ... with 2,502,822 more rows
```
---
class: inverse

# arrange

```r
df %>% arrange(-deaths)
```

```
## # A tibble: 2,502,832 x 7
##    date       county        state    fips    cases deaths non_death
##    <date>     <chr>         <chr>    <chr>   <dbl>  <dbl>     <dbl>
##  1 2022-05-13 New York City New York <NA>  2422658  40267   2382391
##  2 2022-05-12 New York City New York <NA>  2417909  40261   2377648
##  3 2022-05-11 New York City New York <NA>  2411431  40256   2371175
##  4 2022-05-10 New York City New York <NA>  2408357  40248   2368109
##  5 2022-05-09 New York City New York <NA>  2406070  40240   2365830
##  6 2022-05-06 New York City New York <NA>  2392182  40232   2351950
##  7 2022-05-07 New York City New York <NA>  2392182  40232   2351950
##  8 2022-05-08 New York City New York <NA>  2392903  40232   2352671
##  9 2022-05-05 New York City New York <NA>  2387684  40227   2347457
## 10 2022-05-04 New York City New York <NA>  2383719  40220   2343499
## # ... with 2,502,822 more rows
```

---
class: inverse, center, middle
# summarise

---
class: inverse

# summarise: Calculations on continuous variables

```r
df %>% summarise(mean_death = mean(deaths, na.rm=TRUE))
```

```
## # A tibble: 1 x 1
##   mean_death
##        <dbl>
## 1       162.
```

---
class: inverse

# summarise: Calculations on continuous variables

```r
df %>% summarise(mean_death = mean(deaths, na.rm=TRUE), 
                 median_cases=median(cases, na.rm=TRUE), 
                 max_death=max(deaths, na.rm=TRUE))
```

```
## # A tibble: 1 x 3
##   mean_death median_cases max_death
##        <dbl>        <dbl>     <dbl>
## 1       162.         1773     40267
```

---
class: inverse, middle, center
# Some Other Functions

---
class: inverse

# Use `count` for categorical variables

```r
df %>% count(state)
```

```
## # A tibble: 56 x 2
##    state                    n
##    <chr>                <int>
##  1 Alabama              52312
##  2 Alaska               20568
##  3 American Samoa         239
##  4 Arizona              11934
##  5 Arkansas             59032
##  6 California           45693
##  7 Colorado             49527
##  8 Connecticut           7091
##  9 Delaware              3131
## 10 District of Columbia   798
## # ... with 46 more rows
```

---
class: inverse

```r
df %>% 
  count(state) %>% 
  arrange(-n)
```

```
## # A tibble: 56 x 2
##    state               n
##    <chr>           <int>
##  1 Texas          193519
##  2 Georgia        124889
##  3 Virginia       103128
##  4 Kentucky        92572
##  5 Missouri        89600
##  6 Illinois        79502
##  7 Kansas          79294
##  8 North Carolina  77809
##  9 Iowa            77070
## 10 Tennessee       74719
## # ... with 46 more rows
```

---
class: inverse

# n_distinct

```r
df %>% 
  summarise(n_distinct(state))
```

```
## # A tibble: 1 x 1
##   `n_distinct(state)`
##                 <int>
## 1                  56
```

```r
# OR
n_distinct(df$state)
```

```
## [1] 56
```

---
class: inverse

# Combo 1: group_by + summarise

```r
df %>% 
  group_by(state) %>% 
  summarise(mean_deaths = mean(deaths))
```

```
## # A tibble: 56 x 2
##    state                mean_deaths
##    <chr>                      <dbl>
##  1 Alabama                   139.  
##  2 Alaska                     15.8 
##  3 American Samoa              2.61
##  4 Arizona                   951.  
##  5 Arkansas                   68.7 
##  6 California                808.  
##  7 Colorado                   95.5 
##  8 Connecticut               772.  
##  9 Delaware                  362.  
## 10 District of Columbia      907.  
## # ... with 46 more rows
```

---
class: inverse

# Combo 2: filter + group_by + summarise

```r
df %>% 
  filter(date>'01-01-2021') %>% 
  group_by(state) %>% 
  summarise(mean_deaths = mean(deaths))
```

---
class: inverse

# Combo 3: filter + group_by + summarise + arrange

```r
df %>% 
  filter(date>'01-01-2021') %>% 
  group_by(state) %>% 
  summarise(average_deaths = mean(deaths)) %>% 
  arrange(-average_deaths)
```

```
## # A tibble: 56 x 2
##    state                average_deaths
##    <chr>                         <dbl>
##  1 New Jersey                    1006.
##  2 Massachusetts                  964.
##  3 Arizona                        951.
##  4 District of Columbia           907.
##  5 California                     808.
##  6 New York                       787.
##  7 Connecticut                    772.
##  8 Florida                        510.
##  9 Delaware                       362.
## 10 Rhode Island                   359.
## # ... with 46 more rows
```