How to do it:

Submission: Submit the GitHub link to your assignment on Canvas under Assignment 5 - Extra Credits.


  1. Download the c2015 dataset to your computer at this link. Load the readxl library (library(readxl)), then use the function read_excel() to read the c2015 dataset. The data comes from the Fatality Analysis Reporting System (FARS). It includes vital information about each accident, such as when, where, and how the accident happened. FARS also includes information about the drivers and passengers, such as age and gender. Some of the fatal accidents involved multiple vehicles. More information about FARS can be found at: https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars

  2. Let’s study the variable SEX. How many missing values are recorded as NA?

  3. Still with the variable SEX. Some missing values in this variable are not recorded as NA. Identify the other forms that missing values take in this variable, then change all of them to NA.

  4. Still with the variable SEX. Once all the missing values are recorded as NA, replace them with the majority sex (the most frequent value).

  5. Let’s study the variable AGE. Use the table function to check the values of this variable and the forms its missing values take. Use na_if to change all the forms of missing values to NA.

  6. Still with the variable AGE. Use str_replace to replace Less than 1 with ‘0’ (character 0, not number 0).

  7. Still with variable AGE. Use the class function to check the type of this variable. Use the as.numeric function to change the type of the variable to numeric.

  8. Still with the variable AGE. Replace the missing values (NAs) with the mean of the variable.

  9. Let’s fix the variable TRAV_SP. Do the following.

  1. Find the correlation between Age of the drivers and Travel speed (TRAV_SP). Hint: You want to look at the seat positions (SEAT_POS variable) to filter out the observations about the drivers, then calculate the correlation.
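
As a rough illustration of steps 2 to 9, here is a minimal sketch in R. It runs on a small made-up data frame rather than the real c2015 file, so every value in it is an assumption: the missing-value codes "Unknown" and "Not Rep", the driver label "Front Seat, Left Side", and a TRAV_SP column that is already numeric are hypothetical stand-ins. Check the output of table() on your own data before reusing any of them.

```r
library(dplyr)
library(stringr)

# With the real file you would start from something like
#   c2015 <- readxl::read_excel("c2015.xlsx")   # file name is an assumption
# Toy stand-in for c2015; all values below are made up for illustration.
df <- tibble(
  SEX      = c("Male", "Female", "Unknown", NA, "Male", "Not Rep"),
  AGE      = c("34", "Less than 1", "Unknown", "52", NA, "20"),
  TRAV_SP  = c(55, 40, 30, 65, 45, 70),   # assumed already cleaned to numeric
  SEAT_POS = c("Front Seat, Left Side", "Second Seat, Right Side",
               "Front Seat, Left Side", "Front Seat, Left Side",
               "Front Seat, Right Side", "Front Seat, Left Side")
)

# Step 2: count the values already stored as NA.
n_na_sex <- sum(is.na(df$SEX))

# Step 3: inspect the values, then turn the other missing-value codes into NA.
table(df$SEX, useNA = "ifany")
df <- df %>%
  mutate(SEX = na_if(SEX, "Unknown"),
         SEX = na_if(SEX, "Not Rep"))

# Step 4: replace NA with the most frequent remaining value.
majority_sex <- names(which.max(table(df$SEX)))
df <- df %>%
  mutate(SEX = replace(SEX, is.na(SEX), majority_sex))

# Steps 5 to 8: clean AGE, convert it to numeric, and impute the mean.
table(df$AGE, useNA = "ifany")
class(df$AGE)
df <- df %>%
  mutate(AGE = na_if(AGE, "Unknown"),
         AGE = str_replace(AGE, "Less than 1", "0"),
         AGE = as.numeric(AGE),
         AGE = replace(AGE, is.na(AGE), mean(AGE, na.rm = TRUE)))

# Step 9.1: keep the drivers only, then correlate age with travel speed.
drivers <- df %>% filter(SEAT_POS == "Front Seat, Left Side")
age_speed_cor <- cor(drivers$AGE, drivers$TRAV_SP, use = "complete.obs")
```

The imputation in step 8 is written after the as.numeric conversion on purpose: the mean cannot be computed while AGE is still a character vector.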