class: center, middle, inverse, title-slide .title[ # Data Wrangling with Base R ] --- class: inverse, middle, center # Data Frame --- # What's a data frame? - A data frame is a **table** of data - Each **column** contains values of one variable - Each **row** contains one set of values from each column. -- - Example | Name | Salary | Start_Date | Gender | |---------|--------|------------|--------| | Rick | 623.30 | 2012-01-01 | Male | | Dan | 515.20 | 2013-09-23 | Male | | Jessica | 611.00 | 2014-11-15 | Female | | Julia | 729.00 | 2014-05-11 | Female | --- # Creating a Data Frame ``` r df = data.frame(Name = c('Rick','Dan','Jessica','Julia','Sophia'), Salary = c(623.30,515.20,611.00,729.00,843.25), Start_Date = c('2012-01-01', '2013-09-23', '2014-11-15', '2014-05-11', '2015-03-27'), Gender = c('Male', 'Male', 'Female','Female','Female' )) df ``` ``` ## Name Salary Start_Date Gender ## 1 Rick 623.30 2012-01-01 Male ## 2 Dan 515.20 2013-09-23 Male ## 3 Jessica 611.00 2014-11-15 Female ## 4 Julia 729.00 2014-05-11 Female ## 5 Sophia 843.25 2015-03-27 Female ``` --- class: middle, center Import a data frame using `read.csv` Use `str` function to give an overview Correct data type using `as.factor`, `as.numeric`, `as.Date`, `as.character` Check missing data with `sum(is.na())`, `colSums((is.na))` --- # Read (Import) a Data Frame using `read.csv` - Method 1: Download the file to your computer and read it. Usually the code and the data are in the same folder. ``` r setwd("C:/Users/sonou/Downloads") df <- read.csv('WHO-COVID-19-global-data.csv') ``` --- # Read (Import) a Data Frame using `read.csv` - Method 2: Read the data directly from its URL ``` r df <- read.csv('https://bryantstats.github.io/math421/data/WHO-COVID-19-global-data.csv') ``` --- # Check the variables (columns) of the data ``` r names(df) ``` ``` ## [1] "Date_reported" "Country_code" "Country" ## [4] "WHO_region" "New_cases" "Cumulative_cases" ## [7] "New_deaths" "Cumulative_deaths" ``` --- # Quick summary of the data ``` r str(df) ``` ``` ## 'data.frame': 232023 obs. of 8 variables: ## $ Date_reported : chr "2020-01-03" "2020-01-04" "2020-01-05" "2020-01-06" ... ## $ Country_code : chr "AF" "AF" "AF" "AF" ... ## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ... ## $ WHO_region : chr "EMRO" "EMRO" "EMRO" "EMRO" ... ## $ New_cases : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Cumulative_cases : int 0 0 0 0 0 0 0 0 0 0 ... ## $ New_deaths : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Cumulative_deaths: int 0 0 0 0 0 0 0 0 0 0 ... ``` --- # Quick view of the data ``` r head(df) ``` ``` ## Date_reported Country_code Country WHO_region New_cases Cumulative_cases ## 1 2020-01-03 AF Afghanistan EMRO 0 0 ## 2 2020-01-04 AF Afghanistan EMRO 0 0 ## 3 2020-01-05 AF Afghanistan EMRO 0 0 ## 4 2020-01-06 AF Afghanistan EMRO 0 0 ## 5 2020-01-07 AF Afghanistan EMRO 0 0 ## 6 2020-01-08 AF Afghanistan EMRO 0 0 ## New_deaths Cumulative_deaths ## 1 0 0 ## 2 0 0 ## 3 0 0 ## 4 0 0 ## 5 0 0 ## 6 0 0 ``` --- # Check missing values ``` r sum(is.na(df)) ``` ``` ## [1] 979 ``` ``` r colSums(is.na(df)) ``` ``` ## Date_reported Country_code Country WHO_region ## 0 979 0 0 ## New_cases Cumulative_cases New_deaths Cumulative_deaths ## 0 0 0 0 ``` ``` r colMeans(is.na(df)) ``` ``` ## Date_reported Country_code Country WHO_region ## 0.000000000 0.004219409 0.000000000 0.000000000 ## New_cases Cumulative_cases New_deaths Cumulative_deaths ## 0.000000000 0.000000000 0.000000000 0.000000000 ``` --- # Change the name of a variable ``` r names(df)[1] <- 'Date_report' ``` --- # Change the type of a variable ``` r # Check type class(df$Date_report) ``` ``` ## [1] "character" ``` ``` r str(df) ``` ``` ## 'data.frame': 232023 obs. of 8 variables: ## $ Date_report : chr "2020-01-03" "2020-01-04" "2020-01-05" "2020-01-06" ... ## $ Country_code : chr "AF" "AF" "AF" "AF" ... ## $ Country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ... ## $ WHO_region : chr "EMRO" "EMRO" "EMRO" "EMRO" ... ## $ New_cases : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Cumulative_cases : int 0 0 0 0 0 0 0 0 0 0 ... ## $ New_deaths : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Cumulative_deaths: int 0 0 0 0 0 0 0 0 0 0 ... ``` ``` r # Change type df$Date_report = as.Date(df$Date_report) ``` --- class: inverse, middle, center # Simple Data Exploration --- # Statistics of continuous/date variable ``` r mean(df$Cumulative_deaths, na.rm=TRUE) ``` ``` ## [1] 13768.36 ``` ``` r max(df$Cumulative_deaths, na.rm=TRUE) ``` ``` ## [1] 1036949 ``` ``` r # Other operation on one variable summary(df$Date_report) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## "2020-01-03" "2020-09-03" "2021-05-06" "2021-05-06" "2022-01-06" "2022-09-07" ``` --- # Frequency of Categorical variable ``` r table(df$WHO_region) ``` ``` ## ## AFRO AMRO EMRO EURO Other SEARO WPRO ## 48950 54824 21538 60698 979 10769 34265 ``` ``` r prop.table(table(df$WHO_region)) ``` ``` ## ## AFRO AMRO EMRO EURO Other SEARO ## 0.210970464 0.236286920 0.092827004 0.261603376 0.004219409 0.046413502 ## WPRO ## 0.147679325 ``` --- # Correlation of two continuous variables ``` r # Correlation cor(df$Cumulative_cases, df$Cumulative_deaths) ``` ``` ## [1] 0.8981613 ``` --- # Calculate a continuous by a categorical ``` r by(df$New_deaths, df$WHO_region, mean) ``` ``` ## df$WHO_region: AFRO ## [1] 3.561798 ## ------------------------------------------------------------ ## df$WHO_region: AMRO ## [1] 51.46844 ## ------------------------------------------------------------ ## df$WHO_region: EMRO ## [1] 16.1502 ## ------------------------------------------------------------ ## df$WHO_region: EURO ## [1] 34.27919 ## ------------------------------------------------------------ ## df$WHO_region: Other ## [1] 0.01327886 ## ------------------------------------------------------------ ## df$WHO_region: SEARO ## [1] 73.93453 ## ------------------------------------------------------------ ## df$WHO_region: WPRO ## [1] 7.685539 ``` --- # Calculate a continuous by a categorical ``` r by(data = df$New_deaths, INDICES = df$WHO_region, FUN = mean) ``` ``` ## df$WHO_region: AFRO ## [1] 3.561798 ## ------------------------------------------------------------ ## df$WHO_region: AMRO ## [1] 51.46844 ## ------------------------------------------------------------ ## df$WHO_region: EMRO ## [1] 16.1502 ## ------------------------------------------------------------ ## df$WHO_region: EURO ## [1] 34.27919 ## ------------------------------------------------------------ ## df$WHO_region: Other ## [1] 0.01327886 ## ------------------------------------------------------------ ## df$WHO_region: SEARO ## [1] 73.93453 ## ------------------------------------------------------------ ## df$WHO_region: WPRO ## [1] 7.685539 ``` --- class: inverse, middle, center # Create new variables *It is common that to create new variables to analyze the data better* --- # Binning Continuous variable to a categorical variable Create the categorical variable `New_cases2` variable taking the values as follows - `No_new_cases` if there is a no new cases that day - `Has_new_cases` if there is at least a new case that day --- ``` r df$New_cases2 <- ifelse(df$New_cases==0, 'No_new_cases','Has_new_cases') ``` --- ``` r library(dplyr) ``` ``` ## ## Attaching package: 'dplyr' ``` ``` ## The following objects are masked from 'package:stats': ## ## filter, lag ``` ``` ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union ``` ``` r df$New_cases2 <- case_when(df$New_cases==0 ~ 'No_new_cases', TRUE ~ 'Has_new_cases') ``` --- # Binning Continuous variable Create the categorical variable `New_deaths2` variable taking the values as follows - `low_death` if the number of `New_deaths` smaller 2 - `mid_death` if the number of `New_deaths` from 2 to 5 - `high_death` if the number of `New_deaths` greater than 5 --- ``` r df$New_deaths2 <- case_when(df$New_deaths<2 ~ 'low_death', df$New_deaths<5 ~ 'mid_death', TRUE~'high_death') table(df$New_deaths2) ``` ``` ## ## high_death low_death mid_death ## 61023 151654 19346 ``` --- # Binning categories of a categorical variable - If a variable has too many categories, we may want to group those categories to fewer categories ``` r # Create a variable month library(lubridate) ``` ``` ## ## Attaching package: 'lubridate' ``` ``` ## The following objects are masked from 'package:base': ## ## date, intersect, setdiff, union ``` ``` r df$month <- month(df$Date_report, label = TRUE) # group months into fewer categories df$month2 <- case_when(df$month %in% c('Sep','Oct','Nov','Dec') ~ 'fall_semester', df$month %in% c('Feb','Mar','Apr','May') ~ 'spring_semester', TRUE~'break') ``` --- class: inverse, middle, center # Subsetting --- # Subsetting by columns ``` r # by columns df1 <- df[,c(1,3)] head(df1) ``` ``` ## Date_report Country ## 1 2020-01-03 Afghanistan ## 2 2020-01-04 Afghanistan ## 3 2020-01-05 Afghanistan ## 4 2020-01-06 Afghanistan ## 5 2020-01-07 Afghanistan ## 6 2020-01-08 Afghanistan ``` --- # Subsetting by columns ``` r # by columns df1 <- df[, c('Date_report','New_cases')] head(df1) ``` ``` ## Date_report New_cases ## 1 2020-01-03 0 ## 2 2020-01-04 0 ## 3 2020-01-05 0 ## 4 2020-01-06 0 ## 5 2020-01-07 0 ## 6 2020-01-08 0 ``` --- # Subsetting by rows ``` r # by rows df1 <- df[c(1:5),] head(df1) ``` ``` ## Date_report Country_code Country WHO_region New_cases Cumulative_cases ## 1 2020-01-03 AF Afghanistan EMRO 0 0 ## 2 2020-01-04 AF Afghanistan EMRO 0 0 ## 3 2020-01-05 AF Afghanistan EMRO 0 0 ## 4 2020-01-06 AF Afghanistan EMRO 0 0 ## 5 2020-01-07 AF Afghanistan EMRO 0 0 ## New_deaths Cumulative_deaths New_cases2 New_deaths2 month month2 ## 1 0 0 No_new_cases low_death Jan break ## 2 0 0 No_new_cases low_death Jan break ## 3 0 0 No_new_cases low_death Jan break ## 4 0 0 No_new_cases low_death Jan break ## 5 0 0 No_new_cases low_death Jan break ``` --- # Subsetting by columns and rows ``` r # by rows df1 <- df[c(1:5), c('Date_report','New_cases')] head(df1) ``` ``` ## Date_report New_cases ## 1 2020-01-03 0 ## 2 2020-01-04 0 ## 3 2020-01-05 0 ## 4 2020-01-06 0 ## 5 2020-01-07 0 ``` --- # Subsetting by condition ``` r df1 <- df[df$New_deaths>10,] df2 <- df[df$Country=='United States of America',] df3 <- df[df$Date_report>='2021-01-01',] df4 <- df[(df$Country=='United States of America')& (df$Date_report>='2021-01-01'),] ``` --- # Logical Operators | Operator | Description | |----------- |-------------------------- | | < | less than | | <= | less than or equal to | | > | greater than | | >= | greater than or equal to | | == | exactly equal to | | != | not equal to | | !x | Not x | | x | y | x OR y | | x & y | x AND y | | isTRUE(x) | test if X is TRUE | --- class: inverse, middle, center # Some Examples --- class: middle, center Find the average new cases by region in 2021. --- ``` r df1 <- df[df$Date_report>='2021-01-01',] by(df1$New_cases, df1$WHO_region, mean) ``` ``` ## df1$WHO_region: AFRO ## [1] 240.5921 ## ------------------------------------------------------------ ## df1$WHO_region: AMRO ## [1] 4080.924 ## ------------------------------------------------------------ ## df1$WHO_region: EMRO ## [1] 1337.705 ## ------------------------------------------------------------ ## df1$WHO_region: EURO ## [1] 5817.783 ## ------------------------------------------------------------ ## df1$WHO_region: Other ## [1] 0.03089431 ## ------------------------------------------------------------ ## df1$WHO_region: SEARO ## [1] 7111.744 ## ------------------------------------------------------------ ## df1$WHO_region: WPRO ## [1] 3938.715 ``` --- class: middle, center Find the median new deaths by weekday in the US --- ``` r df$weekdays <- weekdays(df$Date_report) df1 = df[df$Country=='United States of America',] by(data = df1$New_deaths, INDICES = df1$weekdays, FUN = median) ``` ``` ## df1$weekdays: Friday ## [1] 1077.5 ## ------------------------------------------------------------ ## df1$weekdays: Monday ## [1] 634 ## ------------------------------------------------------------ ## df1$weekdays: Saturday ## [1] 948.5 ## ------------------------------------------------------------ ## df1$weekdays: Sunday ## [1] 927.5 ## ------------------------------------------------------------ ## df1$weekdays: Thursday ## [1] 879 ## ------------------------------------------------------------ ## df1$weekdays: Tuesday ## [1] 479 ## ------------------------------------------------------------ ## df1$weekdays: Wednesday ## [1] 667 ```