Data Wrangling with Base R

class: center, middle, inverse, title-slide

.title[
# Data Wrangling with Base R
]

---

class: inverse, middle, center
# Data Frame

---
# What's a data frame?

- A data frame is a **table** of data

- Each **column** contains values of one variable

- Each **row** contains one set of values from each column.

- Example

| Name    | Salary | Start_Date | Gender |
|---------|--------|------------|--------|
| Rick    | 623.30 | 2012-01-01 | Male   |
| Dan     | 515.20 | 2013-09-23 | Male   |
| Jessica | 611.00 | 2014-11-15 | Female |
| Julia   | 729.00 | 2014-05-11 | Female |

---
# Creating a Data Frame

``` r
df = data.frame(Name = c('Rick','Dan','Jessica','Julia','Sophia'),
                Salary = c(623.30,515.20,611.00,729.00,843.25),
                Start_Date = c('2012-01-01',
                                       '2013-09-23',
                                       '2014-11-15',
                                       '2014-05-11',
                                       '2015-03-27'),
                Gender = c('Male', 'Male', 'Female','Female','Female' ))

df
```

```
##      Name Salary Start_Date Gender
## 1    Rick 623.30 2012-01-01   Male
## 2     Dan 515.20 2013-09-23   Male
## 3 Jessica 611.00 2014-11-15 Female
## 4   Julia 729.00 2014-05-11 Female
## 5  Sophia 843.25 2015-03-27 Female
```

---
class: middle, center

Import a data frame using `read.csv`

Use `str` function to give an overview

Correct data type using `as.factor`, `as.numeric`, `as.Date`, `as.character`

Check missing data with `sum(is.na())`, `colSums((is.na))`

---
# Read (Import) a Data Frame using `read.csv`

- Method 1: Download the file to your computer and read it. Usually the code and the data are in the same folder.

``` r
setwd("C:/Users/sonou/Downloads")
df <-  read.csv('WHO-COVID-19-global-data.csv')
```

---
# Read (Import) a Data Frame using `read.csv`

- Method 2: Read the data directly from its URL

``` r
df <-  read.csv('https://bryantstats.github.io/math421/data/WHO-COVID-19-global-data.csv')
```

---
# Check the variables (columns) of the data

``` r
names(df)
```

```
## [1] "Date_reported"     "Country_code"      "Country"          
## [4] "WHO_region"        "New_cases"         "Cumulative_cases" 
## [7] "New_deaths"        "Cumulative_deaths"
```

---
# Quick summary of the data

``` r
str(df)
```

```
## 'data.frame':	232023 obs. of  8 variables:
##  $ Date_reported    : chr  "2020-01-03" "2020-01-04" "2020-01-05" "2020-01-06" ...
##  $ Country_code     : chr  "AF" "AF" "AF" "AF" ...
##  $ Country          : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ WHO_region       : chr  "EMRO" "EMRO" "EMRO" "EMRO" ...
##  $ New_cases        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cumulative_cases : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ New_deaths       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cumulative_deaths: int  0 0 0 0 0 0 0 0 0 0 ...
```

---
# Quick view of the data

``` r
head(df)
```

```
##   Date_reported Country_code     Country WHO_region New_cases Cumulative_cases
## 1    2020-01-03           AF Afghanistan       EMRO         0                0
## 2    2020-01-04           AF Afghanistan       EMRO         0                0
## 3    2020-01-05           AF Afghanistan       EMRO         0                0
## 4    2020-01-06           AF Afghanistan       EMRO         0                0
## 5    2020-01-07           AF Afghanistan       EMRO         0                0
## 6    2020-01-08           AF Afghanistan       EMRO         0                0
##   New_deaths Cumulative_deaths
## 1          0                 0
## 2          0                 0
## 3          0                 0
## 4          0                 0
## 5          0                 0
## 6          0                 0
```

---
# Check missing values

``` r
sum(is.na(df))
```

```
## [1] 979
```

``` r
colSums(is.na(df))
```

```
##     Date_reported      Country_code           Country        WHO_region 
##                 0               979                 0                 0 
##         New_cases  Cumulative_cases        New_deaths Cumulative_deaths 
##                 0                 0                 0                 0
```

``` r
colMeans(is.na(df))
```

```
##     Date_reported      Country_code           Country        WHO_region 
##       0.000000000       0.004219409       0.000000000       0.000000000 
##         New_cases  Cumulative_cases        New_deaths Cumulative_deaths 
##       0.000000000       0.000000000       0.000000000       0.000000000
```

---
# Change the name of a variable

``` r
names(df)[1] <- 'Date_report'
```

---
# Change the type of a variable

``` r
# Check type
class(df$Date_report)
```

```
## [1] "character"
```

``` r
str(df)
```

```
## 'data.frame':	232023 obs. of  8 variables:
##  $ Date_report      : chr  "2020-01-03" "2020-01-04" "2020-01-05" "2020-01-06" ...
##  $ Country_code     : chr  "AF" "AF" "AF" "AF" ...
##  $ Country          : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ WHO_region       : chr  "EMRO" "EMRO" "EMRO" "EMRO" ...
##  $ New_cases        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cumulative_cases : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ New_deaths       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Cumulative_deaths: int  0 0 0 0 0 0 0 0 0 0 ...
```

``` r
# Change type
df$Date_report = as.Date(df$Date_report)
```

---
class: inverse, middle, center
# Simple Data Exploration

---
# Statistics of continuous/date variable

``` r
mean(df$Cumulative_deaths, na.rm=TRUE)
```

```
## [1] 13768.36
```

``` r
max(df$Cumulative_deaths, na.rm=TRUE)
```

```
## [1] 1036949
```

``` r
# Other operation on one variable
summary(df$Date_report)
```

```
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2020-01-03" "2020-09-03" "2021-05-06" "2021-05-06" "2022-01-06" "2022-09-07"
```

---
# Frequency of Categorical variable

``` r
table(df$WHO_region)
```

```
## 
##  AFRO  AMRO  EMRO  EURO Other SEARO  WPRO 
## 48950 54824 21538 60698   979 10769 34265
```

``` r
prop.table(table(df$WHO_region))
```

```
## 
##        AFRO        AMRO        EMRO        EURO       Other       SEARO 
## 0.210970464 0.236286920 0.092827004 0.261603376 0.004219409 0.046413502 
##        WPRO 
## 0.147679325
```

---
# Correlation of two continuous variables

``` r
# Correlation

cor(df$Cumulative_cases, df$Cumulative_deaths)
```

```
## [1] 0.8981613
```

---
# Calculate a continuous by a categorical

``` r
by(df$New_deaths, df$WHO_region, mean)
```

```
## df$WHO_region: AFRO
## [1] 3.561798
## ------------------------------------------------------------ 
## df$WHO_region: AMRO
## [1] 51.46844
## ------------------------------------------------------------ 
## df$WHO_region: EMRO
## [1] 16.1502
## ------------------------------------------------------------ 
## df$WHO_region: EURO
## [1] 34.27919
## ------------------------------------------------------------ 
## df$WHO_region: Other
## [1] 0.01327886
## ------------------------------------------------------------ 
## df$WHO_region: SEARO
## [1] 73.93453
## ------------------------------------------------------------ 
## df$WHO_region: WPRO
## [1] 7.685539
```

---
# Calculate a continuous by a categorical

``` r
by(data = df$New_deaths, INDICES = df$WHO_region, FUN = mean)
```

---
class: inverse, middle, center

# Create new variables

*It is common that to create new variables to analyze the data better*

---
# Binning Continuous variable to a categorical variable

Create the categorical variable `New_cases2` variable taking the values as follows

- `No_new_cases` if there is a no new cases that day
-  `Has_new_cases` if there is at least a new case that day

---

``` r
df$New_cases2 <- ifelse(df$New_cases==0, 
                        'No_new_cases','Has_new_cases') 
```

---

``` r
library(dplyr)
```

```
## 
## Attaching package: 'dplyr'
```

```
## The following objects are masked from 'package:stats':
## 
##     filter, lag
```

```
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

``` r
df$New_cases2 <- case_when(df$New_cases==0 ~ 'No_new_cases', 
                            TRUE ~ 'Has_new_cases') 
```

---
# Binning Continuous variable

Create the categorical variable `New_deaths2` variable taking the values as follows

- `low_death` if the number of `New_deaths` smaller 2

- `mid_death` if the number of `New_deaths` from 2 to 5

- `high_death` if the number of `New_deaths` greater than 5

---

``` r
df$New_deaths2 <- case_when(df$New_deaths<2 ~ 'low_death',
                             df$New_deaths<5 ~ 'mid_death',
                             TRUE~'high_death')
table(df$New_deaths2)
```

```
## 
## high_death  low_death  mid_death 
##      61023     151654      19346
```

---
# Binning categories of a categorical variable

- If a variable has too many categories, we may want to group those categories to fewer categories

``` r
# Create a variable month
library(lubridate)
```

```
## 
## Attaching package: 'lubridate'
```

```
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
```

``` r
df$month <- month(df$Date_report, label = TRUE)

# group months into fewer categories
df$month2 <- case_when(df$month %in% c('Sep','Oct','Nov','Dec') ~ 'fall_semester',
                       df$month %in% c('Feb','Mar','Apr','May') ~ 'spring_semester',
                       TRUE~'break')
```

---
class: inverse, middle, center

# Subsetting

---
# Subsetting by columns

``` r
# by columns
df1 <- df[,c(1,3)]
head(df1)
```

```
##   Date_report     Country
## 1  2020-01-03 Afghanistan
## 2  2020-01-04 Afghanistan
## 3  2020-01-05 Afghanistan
## 4  2020-01-06 Afghanistan
## 5  2020-01-07 Afghanistan
## 6  2020-01-08 Afghanistan
```

---
# Subsetting by columns

``` r
# by columns
df1 <- df[, c('Date_report','New_cases')]
head(df1)
```

```
##   Date_report New_cases
## 1  2020-01-03         0
## 2  2020-01-04         0
## 3  2020-01-05         0
## 4  2020-01-06         0
## 5  2020-01-07         0
## 6  2020-01-08         0
```

---
# Subsetting by rows

``` r
# by rows
df1 <- df[c(1:5),]
head(df1)
```

```
##   Date_report Country_code     Country WHO_region New_cases Cumulative_cases
## 1  2020-01-03           AF Afghanistan       EMRO         0                0
## 2  2020-01-04           AF Afghanistan       EMRO         0                0
## 3  2020-01-05           AF Afghanistan       EMRO         0                0
## 4  2020-01-06           AF Afghanistan       EMRO         0                0
## 5  2020-01-07           AF Afghanistan       EMRO         0                0
##   New_deaths Cumulative_deaths   New_cases2 New_deaths2 month month2
## 1          0                 0 No_new_cases   low_death   Jan  break
## 2          0                 0 No_new_cases   low_death   Jan  break
## 3          0                 0 No_new_cases   low_death   Jan  break
## 4          0                 0 No_new_cases   low_death   Jan  break
## 5          0                 0 No_new_cases   low_death   Jan  break
```

---
# Subsetting by columns and rows

``` r
# by rows
df1 <- df[c(1:5), c('Date_report','New_cases')]
head(df1)
```

```
##   Date_report New_cases
## 1  2020-01-03         0
## 2  2020-01-04         0
## 3  2020-01-05         0
## 4  2020-01-06         0
## 5  2020-01-07         0
```

---
# Subsetting by condition

``` r
df1 <- df[df$New_deaths>10,]
df2 <- df[df$Country=='United States of America',]
df3 <- df[df$Date_report>='2021-01-01',]

df4 <- df[(df$Country=='United States of America')&
            (df$Date_report>='2021-01-01'),]
```

---
# Logical Operators

| Operator  	| Description              	|
|-----------	|--------------------------	|
| <         	| less than                	|
| <=        	| less than or equal to    	|
| >         	| greater than             	|
| >=        	| greater than or equal to 	|
| ==        	| exactly equal to         	|
| !=        	| not equal to             	|
| !x        	| Not x                    	|
| x &#124; y    	| x OR y                   	|
| x & y     	| x AND y                  	|
| isTRUE(x) 	| test if X is TRUE        	|

---
class: inverse, middle, center
# Some Examples

---
class: middle, center
Find the average new cases by region in 2021.

---

``` r
df1 <- df[df$Date_report>='2021-01-01',]
by(df1$New_cases, df1$WHO_region, mean)
```

```
## df1$WHO_region: AFRO
## [1] 240.5921
## ------------------------------------------------------------ 
## df1$WHO_region: AMRO
## [1] 4080.924
## ------------------------------------------------------------ 
## df1$WHO_region: EMRO
## [1] 1337.705
## ------------------------------------------------------------ 
## df1$WHO_region: EURO
## [1] 5817.783
## ------------------------------------------------------------ 
## df1$WHO_region: Other
## [1] 0.03089431
## ------------------------------------------------------------ 
## df1$WHO_region: SEARO
## [1] 7111.744
## ------------------------------------------------------------ 
## df1$WHO_region: WPRO
## [1] 3938.715
```

---
class: middle, center
Find the median new deaths by weekday in the US

---

``` r
df$weekdays <- weekdays(df$Date_report)
df1 = df[df$Country=='United States of America',]
by(data = df1$New_deaths, INDICES = df1$weekdays, FUN = median)
```

```
## df1$weekdays: Friday
## [1] 1077.5
## ------------------------------------------------------------ 
## df1$weekdays: Monday
## [1] 634
## ------------------------------------------------------------ 
## df1$weekdays: Saturday
## [1] 948.5
## ------------------------------------------------------------ 
## df1$weekdays: Sunday
## [1] 927.5
## ------------------------------------------------------------ 
## df1$weekdays: Thursday
## [1] 879
## ------------------------------------------------------------ 
## df1$weekdays: Tuesday
## [1] 479
## ------------------------------------------------------------ 
## df1$weekdays: Wednesday
## [1] 667
```