Assignment 4 - Data Wrangling with dplyr

How to do it:

Open the R Markdown file for this assignment (link) in RStudio.
Right under each question, insert a code chunk (you can use the hotkey Ctrl + Alt + I to add a code chunk) and code the solution for the question.
Note that if eval=FALSE appears in the first line of a code chunk, the chunk will not be executed.
Knit the R Markdown file (hotkey: Ctrl + Shift + K) to export an HTML file.
Publish the HTML file to your GitHub Pages site.

Submission: Submit the GitHub link to your assignment on Canvas under Assignment 4.

1. Install the `tidyverse` package

An R package can be installed with the install.packages function. Install tidyverse if you have not done so already.

install.packages('tidyverse')

2. Read the data using `read_csv`

Use the read_csv function to import the US COVID-19 data at link. Don’t forget to load tidyverse (library(tidyverse)) so that you can use read_csv.
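As a minimal sketch of how read_csv behaves, the example below writes a tiny made-up CSV to a temporary file and reads it back. The file contents and the names csv_path and toy are assumptions for illustration only; they are not the course data, and the real link should be used in the assignment.

```r
library(readr)  # also attached by library(tidyverse)

# Hypothetical stand-in for the course data: a tiny CSV written to a temp file
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("date,state,positiveIncrease",
    "2020-10-01,RI,120",
    "2020-10-02,RI,95"),
  csv_path
)

# read_csv() accepts a local path or a URL and returns a tibble
toy <- read_csv(csv_path, show_col_types = FALSE)

# Inspect the column types that read_csv guessed
str(toy)
```

Checking the guessed column types right after import is a cheap way to catch a date column that was read as text or as a number.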

3. Fix the date and create some new variables

lubridate is one of the tidyverse packages. We will make use of lubridate in this question.

Use the code below to create the month, weekday, and monthday variables.

library(lubridate)

# month of the year
df$month <- month(df$date)

# day of the week
df$weekday = wday(df$date)

# day of the month
df$monthday <- mday(df$date)
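To see what these functions return, here is a small sketch on two made-up dates (the vector d is hypothetical, not taken from the data). It also shows ymd(), which may help if the date column was imported as text such as "20201003" rather than as a Date; whether that applies depends on the file you read.

```r
library(lubridate)

# Hypothetical dates: 2020-10-03 was a Saturday, 2020-10-05 a Monday
d <- as.Date(c("2020-10-03", "2020-10-05"))

month(d)               # 10 10
mday(d)                # 3 5
wday(d)                # 7 2 (by default the week starts on Sunday, so Sunday = 1)
wday(d, label = TRUE)  # weekday names, e.g. Sat and Mon in an English locale

# If a date was read as text without separators, ymd() can parse it
fixed <- ymd("20201003")
fixed                  # "2020-10-03"
```

With the default numbering, Saturday is 7 and Sunday is 1, which matters when building a weekend indicator.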

4. Create new variables with `case_when`

The function case_when is a good option for creating a new variable from an existing variable. For example, the code below creates a new variable, daily_death, from the deathIncrease variable, which is the number of new COVID-19 deaths per day. The new variable daily_death takes three values: low (deathIncrease less than 3), medium (deathIncrease from 3 to 14), and high (deathIncrease more than 14). Conditions are checked in order and the first match wins, so the second condition only applies to values of 3 or more. Note that this can also be done in a different way, as shown in Assignment 3.

df$daily_death <- case_when(
  df$deathIncrease <3 ~ 'low',
  df$deathIncrease <=14 ~ 'medium',
  TRUE ~ 'high'
)
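One detail worth checking with case_when is how missing values are handled: a comparison with NA does not match, so an NA would fall through to the final TRUE branch and be labeled high. The sketch below uses a made-up vector (deaths is hypothetical) and adds an explicit NA branch first; whether your data needs this depends on whether deathIncrease contains missing values.

```r
library(dplyr)  # also attached by library(tidyverse)

# Hypothetical values, including the boundaries 3 and 14 and a missing value
deaths <- c(0, 3, 14, 15, NA)

level <- case_when(
  is.na(deaths) ~ NA_character_,  # handle missing values before the comparisons
  deaths < 3    ~ "low",
  deaths <= 14  ~ "medium",       # only reached when deaths >= 3
  TRUE          ~ "high"          # everything else
)

level  # "low" "medium" "medium" "high" NA
```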

Create a variable month2 that takes three values: early_month (day of the month from 1 to 10), mid_month (day of the month from 11 to 20), and end_month (day of the month greater than 20).
Create a variable weekend that takes two values: 1 if it’s Saturday or Sunday and 0 otherwise.

5. Select function

Use the select function to deselect the column totalTestsViral from the data.
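A minimal sketch of deselecting with select, using a made-up tibble (toy and its columns are hypothetical): a minus sign in front of a column name drops that column and keeps the rest.

```r
library(dplyr)  # also attached by library(tidyverse)

# Hypothetical toy data with one column we want to drop
toy <- tibble(
  state            = c("RI", "MA"),
  positiveIncrease = c(120, 300),
  note             = c("a", "b")
)

# A minus sign drops the named column and keeps all others
toy_small <- select(toy, -note)

names(toy_small)  # "state" "positiveIncrease"
```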

6. Pipe Operator ( %>% )

The pipe operator offers another way to write R code and often makes it more readable. The pipe works well with all the tidyverse packages. Refer to these slides (slides 15, 16, 17, and 18) to rewrite the code below using the pipe operator.

x <- c(1:10)

# square root of x
sqrt(x)

sum(sqrt(x))

log(sum(sqrt(x)))

# log base 2 of 16
log(16, 2)
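As an illustration on a different calculation, the sketch below compares a nested call with its piped form. It assumes the %>% pipe as made available by dplyr (or library(tidyverse)); the vector y and the names nested and piped are made up for the example. The value on the left of %>% becomes the first argument of the function on the right.

```r
library(dplyr)  # makes the %>% pipe available

y <- c(-4, 9, -16)  # hypothetical values

# Nested form: read from the inside out
nested <- round(mean(abs(y)), 1)

# Piped form: read from left to right
piped <- y %>%
  abs() %>%
  mean() %>%
  round(1)

nested  # 9.7
piped   # 9.7

# The left-hand side fills the first argument; other arguments follow as usual
1000 %>% log(10)  # same as log(1000, 10), which is 3
```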

7. Combo 1: group_by + summarise

This combo is used when you want to apply a function or calculation to different groups of the data. For example, to calculate the average number of cases (positiveIncrease) by weekday, we use:

df %>% 
  group_by(weekday) %>% 
  summarise(mean(positiveIncrease))
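The same pattern on a made-up tibble (toy, grp, and cases are hypothetical) shows two optional refinements: naming the summary column, and using na.rm = TRUE so that a missing value does not turn the whole group mean into NA. Whether na.rm is needed depends on your data.

```r
library(dplyr)  # also attached by library(tidyverse)

# Hypothetical toy data with one missing value in group "b"
toy <- tibble(
  grp   = c("a", "a", "b", "b", "b"),
  cases = c(10, 20, 5, NA, 15)
)

by_grp <- toy %>%
  group_by(grp) %>%
  summarise(
    mean_cases = mean(cases, na.rm = TRUE),  # named column, NA ignored
    n_rows     = n()                         # number of rows in each group
  )

by_grp
```

Here group a has a mean of 15 over 2 rows, and group b has a mean of 10 over 3 rows (one of them missing).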

Calculate the median number of cases (positiveIncrease) by month
Calculate the average number of cases (positiveIncrease) by month2
Calculate the median number of cases (positiveIncrease) by weekend

8. Combo 2: filter + group_by + summarise

An example: to calculate the average number of cases (positiveIncrease) in January and February separately, we use:

df %>% 
  filter(month==1|month==2) %>% 
  group_by(month) %>% 
  summarise(positive_increase = mean(positiveIncrease))
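When a filter involves several months and a specific year, %in% and a comma-separated second condition can be tidier than chaining | and &. The sketch below uses a made-up tibble (toy and its values are hypothetical); conditions separated by commas inside filter must all be true.

```r
library(dplyr)      # also attached by library(tidyverse)
library(lubridate)

# Hypothetical toy data spanning two years
toy <- tibble(
  date  = as.Date(c("2020-10-15", "2020-11-15", "2021-10-15", "2020-12-15")),
  cases = c(100, 200, 300, 400)
)

toy_subset <- toy %>%
  mutate(month = month(date)) %>%
  # keep rows where month is 10 or 11 AND the year is 2020
  filter(month %in% c(10, 11), year(date) == 2020)

toy_subset$cases  # 100 200
```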

Calculate the median number of cases (positiveIncrease) on the weekend by month in October and November 2020.
Calculate the average number of deaths at different periods of a month (month2 variable) in Fall 2020
Compare the average number of hospitalizations between weekdays and weekends in Summer 2020
Redo Questions 14 and 15 in Assignment 3 using the combos. Note: you also need the data used in Assignment 3.

9. Combo 3: filter + group_by + summarise + arrange

Use the arrange function to find the month with the highest number of deaths on the weekend.
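A minimal sketch of sorting a grouped summary, on made-up data (toy, deaths, and total_deaths are hypothetical names): arrange sorts in ascending order by default, and wrapping a column in desc() puts the largest value in the first row.

```r
library(dplyr)  # also attached by library(tidyverse)

# Hypothetical toy data: several rows per month
toy <- tibble(
  month  = c(1, 1, 2, 3),
  deaths = c(5, 7, 20, 9)
)

ranked <- toy %>%
  group_by(month) %>%
  summarise(total_deaths = sum(deaths)) %>%
  arrange(desc(total_deaths))  # largest total first

ranked$month  # 2 1 3
```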

10. Use your own dataset and implement the following functions or combos. You can use the Adult Census Income or Titanic data.

select
filter
mutate
summarise
arrange
count
count + arrange
filter + count + arrange
group_by + summarise
filter + group_by + summarise
filter + group_by + summarise + arrange
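For the count-based combos, here is a small sketch on a made-up column (toy and class are hypothetical, not the Titanic data itself). count() returns one row per distinct value with the frequency in a column named n, which can then be sorted with arrange.

```r
library(dplyr)  # also attached by library(tidyverse)

# Hypothetical toy data: one categorical column
toy <- tibble(class = c("1st", "3rd", "3rd", "2nd", "3rd", "1st"))

# count + arrange: frequencies sorted from most to least common
counts <- toy %>%
  count(class) %>%
  arrange(desc(n))

counts$class  # "3rd" "1st" "2nd"
counts$n      # 3 2 1

# Shorthand that should give the same ordering: sort inside count()
counts_short <- toy %>% count(class, sort = TRUE)
```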