dplyr
How to do it?
Open the R Markdown file of this assignment (link) in RStudio.
Right under each question, insert a code chunk (hotkey: Ctrl + Alt + I) and code the solution for the question.
Notice that if eval=FALSE appears in the first line of a code chunk, the chunk will not be executed.
Knit the R Markdown file (hotkey: Ctrl + Shift + K) to export an HTML file.
Publish the HTML file to your GitHub Page.
Submission: Submit the GitHub link of the assignment on Canvas under Assignment 4.
tidyverse package
An R package can be installed with the install.packages function. Install tidyverse if you have not done so.
install.packages('tidyverse')
read_csv
Use the read_csv function to import the US Covid 19 data at link.
Don’t forget to load tidyverse (library(tidyverse)) so that you can use read_csv.
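A minimal sketch of the import step. The data link is not reproduced here, so the example writes a tiny stand-in file to a temporary path. The file contents and the names `toy_path` and `df_toy` are hypothetical; only the column names `date`, `positiveIncrease` and `deathIncrease` come from this assignment.

```r
library(tidyverse)

# Hypothetical stand-in for the real data: a tiny CSV written to a temp file.
toy_path <- tempfile(fileext = ".csv")
write_lines(
  c("date,positiveIncrease,deathIncrease",
    "2020-10-03,120,2",
    "2020-10-04,95,5"),
  toy_path
)

# read_csv returns a tibble and guesses each column's type from its values.
df_toy <- read_csv(toy_path, show_col_types = FALSE)
df_toy
```

With the real data, the URL or file path behind the link takes the place of `toy_path`.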
lubridate is one of the tidyverse packages. We will make use of lubridate in this question.
month, weekday and monthday variables
library(lubridate)
# month of the year
df$month <- month(df$date)
# day of the week
df$weekday <- wday(df$date)
# day of the month
df$monthday <- mday(df$date)
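To see what these lubridate functions return, here is a small sketch on three hand-picked dates (the vector `dates` is made up for illustration). It assumes lubridate's default week start, under which wday numbers the days from 1 = Sunday to 7 = Saturday.

```r
library(lubridate)

# Three consecutive dates: a Saturday, a Sunday and a Monday.
dates <- as.Date(c("2020-10-03", "2020-10-04", "2020-10-05"))

month(dates)               # month of the year: 10 10 10
mday(dates)                # day of the month: 3 4 5
wday(dates)                # day of the week as a number: 7 1 2
wday(dates, label = TRUE)  # day of the week as a label (text depends on locale)
```

Checking a known Saturday and Sunday like this is a quick way to confirm which numbers mark the weekend before recoding the variable.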
case_when
The function case_when is a good option for creating a new variable from an existing variable. For example, the code below creates a new variable, daily_death, from the deathIncrease variable. deathIncrease is the number of new daily deaths from Covid-19. The new variable daily_death takes three values: low (if deathIncrease is less than 3), medium (deathIncrease from 3 to 14), and high (deathIncrease more than 14). Please note that this can also be done in a different way, as shown in Assignment 3.
df$daily_death <- case_when(
  df$deathIncrease < 3 ~ 'low',
  df$deathIncrease <= 14 ~ 'medium',
  TRUE ~ 'high'
)
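The sketch below runs the same three rules on a small made-up vector `x` to show how case_when evaluates them: conditions are checked in order and the first one that is TRUE wins. A condition that evaluates to NA is treated as FALSE, so a missing value falls through to the TRUE catch-all unless it is handled first.

```r
library(dplyr)

# Toy values, including the boundaries 3 and 14 and a missing value.
x <- c(0, 3, 14, 15, NA)

# Conditions are checked in order; the first one that is TRUE wins.
level <- case_when(
  x < 3   ~ "low",
  x <= 14 ~ "medium",
  TRUE    ~ "high"
)
level     # "low" "medium" "medium" "high" "high"

# Handle missing values explicitly by testing for them first.
level_na <- case_when(
  is.na(x) ~ NA_character_,
  x < 3    ~ "low",
  x <= 14  ~ "medium",
  TRUE     ~ "high"
)
level_na  # "low" "medium" "medium" "high" NA
```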
Create variable month2 that takes three values: early_month (day of the month from 1-10), mid_month (day of the month from 11-20), and end_month (day of the month > 20).
Create variable weekend that takes two values: 1 if it’s Saturday or Sunday and 0 otherwise.
Use the select function to deselect the column totalTestsViral from the data.
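As a small illustration of dropping a column with select, the sketch below uses a made-up tibble `toy`; the column names `state` and `notes` are hypothetical and are not taken from the Covid data.

```r
library(dplyr)

# Toy data; `state` and `notes` are placeholder column names.
toy <- tibble(
  state = c("RI", "MA"),
  positiveIncrease = c(10, 20),
  notes = c("a", "b")
)

# A minus sign in front of a column name drops that column and keeps the rest.
toy_small <- toy %>% select(-notes)
names(toy_small)  # "state" "positiveIncrease"
```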
The pipe operator offers another way to write R code. It often makes the code more readable. The pipe works very well with all the tidyverse packages. Refer to these slides (slides 15, 16, 17 and 18) to rewrite the code below using the pipe operator.
x <- c(1:10)
# square root of x
sqrt(x)
sum(sqrt(x))
log(sum(sqrt(x)))
# log base 2 of 16
log(16, 2)
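For orientation, here is a short sketch of how a nested call maps onto a piped one, using a different made-up vector `v` so that it does not replace the exercise. The pipe `%>%` passes the value on its left as the first argument of the function on its right.

```r
library(dplyr)  # loads the %>% pipe

v <- c(4, 9, 16)

# Nested call: read from the inside out.
nested <- sum(sqrt(v))            # 2 + 3 + 4 = 9

# Piped call: read from left to right.
piped <- v %>% sqrt() %>% sum()   # 9

# Extra arguments follow the piped value, so this is round(3.14159, 2).
rounded <- 3.14159 %>% round(2)   # 3.14
```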
The group_by + summarise combo is used when you want to apply a function or calculation to different groups of the data. For example, to calculate the average number of cases (positiveIncrease) by weekday, we use:
df %>%
  group_by(weekday) %>%
  summarise(mean(positiveIncrease))
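The same pattern on a five-row made-up tibble makes the result easy to check by hand. The names `toy`, `by_weekday`, `mean_cases` and `n_days` are chosen for illustration; naming the summary columns inside summarise is optional but gives tidier output.

```r
library(dplyr)

toy <- tibble(
  weekday = c(1, 1, 2, 2, 2),
  positiveIncrease = c(10, 20, 30, 40, 50)
)

# One output row per group; each summary is computed within its group.
by_weekday <- toy %>%
  group_by(weekday) %>%
  summarise(
    mean_cases = mean(positiveIncrease),
    n_days = n()
  )
by_weekday
# weekday 1: mean_cases 15, n_days 2
# weekday 2: mean_cases 40, n_days 3
```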
Calculate the median number of cases (positiveIncrease) by month.
Calculate the average number of cases (positiveIncrease) by month2.
Calculate the median number of cases (positiveIncrease) by weekend.
An example: to calculate the average number of cases (positiveIncrease) in January and February separately, we use:
df %>%
  filter(month == 1 | month == 2) %>%
  group_by(month) %>%
  summarise(positive_increase = mean(positiveIncrease))
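A sketch of the filter + group_by + summarise pattern on made-up data (the tibble `toy` and the result name `jan_feb` are illustrative). It uses `%in%` as an equivalent way to write the "month 1 or month 2" condition; several conditions separated by commas inside filter are combined with AND.

```r
library(dplyr)

toy <- tibble(
  month = c(1, 1, 2, 2, 3),
  positiveIncrease = c(10, 30, 20, 40, 99)
)

jan_feb <- toy %>%
  filter(month %in% c(1, 2)) %>%   # same rows as month == 1 | month == 2
  group_by(month) %>%
  summarise(positive_increase = mean(positiveIncrease))
jan_feb
# month 1: positive_increase 20
# month 2: positive_increase 30
```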
Calculate the median number of cases (positiveIncrease) on the weekend by month in October and November 2020.
Calculate the average number of deaths at different periods of a month (month2 variable) in Fall 2020.
Compare the average number of hospitalizations between weekdays and weekends in Summer 2020
Redo Questions 14 and 15 in Assignment 3 using the combos. Note: you also need to use the data from Assignment 3.
Use the arrange function to find a month that has the highest number of deaths on the weekend.
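As a reminder of how arrange orders rows, here is a sketch on a made-up summary table; the tibble `toy_summary` and its column `total_deaths` are hypothetical. arrange sorts in ascending order by default, and wrapping a column in desc() puts the largest value first.

```r
library(dplyr)

toy_summary <- tibble(
  month = c(10, 11, 12),
  total_deaths = c(50, 80, 65)
)

# Sort from largest to smallest so the top row holds the maximum.
sorted <- toy_summary %>% arrange(desc(total_deaths))
sorted
# first row: month 11, total_deaths 80
```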