class: center, middle, inverse, title-slide

.title[
# Predictive Modeling - Part 1
]
.author[
### Son Nguyen
]

---

<style>
.remark-slide-content { background-color: #FFFFFF; border-top: 80px solid #F9C389; font-size: 17px; font-weight: 300; line-height: 1.5; padding: 1em 2em 1em 2em }
.inverse { background-color: #696767; border-top: 80px solid #696767; text-shadow: none; background-image: url(https://github.com/goodekat/presentations/blob/master/2019-isugg-gganimate-spooky/figures/spider.png?raw=true); background-position: 50% 75%; background-size: 150px; }
.your-turn{ background-color: #8C7E95; border-top: 80px solid #F9C389; text-shadow: none; background-image: url(https://github.com/goodekat/presentations/blob/master/2019-isugg-gganimate-spooky/figures/spider.png?raw=true); background-position: 95% 90%; background-size: 75px; }
.title-slide { background-color: #F9C389; border-top: 80px solid #F9C389; background-image: none; }
.title-slide > h1 { color: #111111; font-size: 40px; text-shadow: none; font-weight: 400; text-align: left; margin-left: 15px; padding-top: 80px; }
.title-slide > h2 { margin-top: -25px; padding-bottom: -20px; color: #111111; text-shadow: none; font-weight: 300; font-size: 35px; text-align: left; margin-left: 15px; }
.title-slide > h3 { color: #111111; text-shadow: none; font-weight: 300; font-size: 25px; text-align: left; margin-left: 15px; margin-bottom: -30px; }
</style>

<style type="text/css">
.left-code { color: #777; width: 40%; height: 92%; float: left; }
.right-plot { width: 59%; float: right; padding-left: 1%; }
</style>

# Data

|target |Pclass |Sex    |      Age| SibSp| Parch|    Fare|Embarked |
|:------|:------|:------|--------:|-----:|-----:|-------:|:--------|
|0      |3      |male   | 22.00000|     1|     0|  7.2500|S        |
|1      |1      |female | 38.00000|     1|     0| 71.2833|C        |
|1      |3      |female | 26.00000|     0|     0|  7.9250|S        |
|1      |1      |female | 35.00000|     1|     0| 53.1000|S        |
|0      |3      |male   | 35.00000|     0|     0|  8.0500|S        |
|0      |3      |male   | 29.69912|     0|     0|  8.4583|Q        |

- Passengers on the Titanic
- `target = 1` means the passenger survived
- `target = 0` means the passenger did not survive

---

# Prediction Problem

|target |Pclass |Sex    |      Age| SibSp| Parch|    Fare|Embarked |
|:------|:------|:------|--------:|-----:|-----:|-------:|:--------|
|0      |3      |male   | 22.00000|     1|     0|  7.2500|S        |
|1      |1      |female | 38.00000|     1|     0| 71.2833|C        |
|1      |3      |female | 26.00000|     0|     0|  7.9250|S        |
|1      |1      |female | 35.00000|     1|     0| 53.1000|S        |
|0      |3      |male   | 35.00000|     0|     0|  8.0500|S        |
|0      |3      |male   | 29.69912|     0|     0|  8.4583|Q        |

- We want to predict `target` from the values of the other variables.

---

# Import and Clean the data

```r
# Read in the data
library(tidyverse)
df <- read_csv("https://bryantstats.github.io/math421/data/titanic.csv")
```

---

# Set the Target Variable

- It is common practice to name the target variable `target`.

```r
# Take out some columns
df <- df %>% select(-PassengerId, -Ticket, -Name, -Cabin)

# Set the target variable
df <- df %>% rename(target = Survived)
```

---

# Correct Variables' Types

- Make sure all categorical variables are factors.

```r
# Correct variables' types
df <- df %>%
  mutate(target = as.factor(target),
         Pclass = as.factor(Pclass),
         Embarked = as.factor(Embarked),
         Sex = as.factor(Sex)
         )
```

---

# Handle Missing Values

- Make sure there are no missing values.

```r
# Replace NA values of Age with the mean age
mean_age <- mean(df$Age, na.rm = TRUE)
df$Age <- replace_na(df$Age, mean_age)

# Drop all rows that still have an NA
df <- drop_na(df)
```

---

# Split the data into training and testing sets

- Make sure to call `set.seed()` so that the results are reproducible.
```r
library(caret)
set.seed(2020)
splitIndex <- createDataPartition(df$target, p = .70,
                                  list = FALSE)
df_train <- df[ splitIndex,]
df_test <- df[-splitIndex,]
```

---

# Create a tree

```r
library(rpart) # load the rpart package

# Create a tree
tree_model <- rpart(target ~ ., data = df_train,
                    control = rpart.control(maxdepth = 3))
```

---

# Plot the tree

```r
library(rattle)
fancyRpartPlot(tree_model)
```

<img src="10_predictive_modeling_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

# Variable importances

```r
tree_model$variable.importance
```

```
##       Sex      Fare    Pclass       Age     SibSp     Parch  Embarked 
## 83.761904 32.854400 28.807881 19.954127 19.287469 14.431109  7.625484
```

---

# Variable importances

```r
barplot(tree_model$variable.importance)
```

<img src="10_predictive_modeling_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

# Evaluate the tree

```r
# Predict on the testing data
pred <- predict(tree_model, df_test, type = "class")

# Evaluate the predictions
cm <- confusionMatrix(data = pred, reference = df_test$target, positive = "1")
cm$overall[1]
```

```
##  Accuracy 
## 0.8383459
```

---

# Evaluate the tree

|               |    Metric|
|:--------------|---------:|
|Accuracy       | 0.8383459|
|Kappa          | 0.6483429|
|AccuracyLower  | 0.7884950|
|AccuracyUpper  | 0.8804722|
|AccuracyNull   | 0.6165414|
|AccuracyPValue | 0.0000000|
|McnemarPValue  | 0.0327626|

---

# Evaluate the tree

|                     |    Metric|
|:--------------------|---------:|
|Sensitivity          | 0.7156863|
|Specificity          | 0.9146341|
|Pos Pred Value       | 0.8390805|
|Neg Pred Value       | 0.8379888|
|Precision            | 0.8390805|
|Recall               | 0.7156863|
|F1                   | 0.7724868|
|Prevalence           | 0.3834586|
|Detection Rate       | 0.2744361|
|Detection Prevalence | 0.3270677|
|Balanced Accuracy    | 0.8151602|

---

# Random Forest

- A Random Forest is a collection of decision trees
- A Random Forest predicts by majority vote among its trees
  - For example, if 51 of the 100 trees in a forest predict that passenger A `survived`, then the forest also predicts that passenger A `survived`
- Each tree is trained on only a random subset of the original data
- Only a few randomly chosen variables are considered at each split

---

# Random Forest

```r
library(randomForest)
forest_model <- randomForest(target ~ ., data = df_train, ntree = 500)
pred <- predict(forest_model, df_test, type = "class")
cm <- confusionMatrix(data = pred, reference = df_test$target, positive = "1")
cm$overall[1]
```

```
##  Accuracy 
## 0.8270677
```

---

# Variable importances

```r
importance(forest_model)
```

```
##          MeanDecreaseGini
## Pclass          23.497310
## Sex             69.014066
## Age             38.661926
## SibSp           10.796182
## Parch            9.727980
## Fare            49.767591
## Embarked         7.816245
```

---

# Evaluate the Forest

|               |    Metric|
|:--------------|---------:|
|Accuracy       | 0.8270677|
|Kappa          | 0.6187212|
|AccuracyLower  | 0.7761558|
|AccuracyUpper  | 0.8705225|
|AccuracyNull   | 0.6165414|
|AccuracyPValue | 0.0000000|
|McnemarPValue  | 0.0019596|

---

# Evaluate the Forest

|                     |    Metric|
|:--------------------|---------:|
|Sensitivity          | 0.6666667|
|Specificity          | 0.9268293|
|Pos Pred Value       | 0.8500000|
|Neg Pred Value       | 0.8172043|
|Precision            | 0.8500000|
|Recall               | 0.6666667|
|F1                   | 0.7472527|
|Prevalence           | 0.3834586|
|Detection Rate       | 0.2556391|
|Detection Prevalence | 0.3007519|
|Balanced Accuracy    | 0.7967480|
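
---

# Majority vote: a small sketch

The majority-vote rule described on the Random Forest slide can be illustrated in base R. This is only a sketch under stated assumptions: the `votes` vector is hypothetical and simply reproduces the 51-versus-49 example, and the code is not how `randomForest` is implemented internally.

```r
# Hypothetical votes from 100 trees for one passenger (an assumption for
# illustration): 51 trees vote "1" (survived), 49 vote "0" (did not survive).
votes <- c(rep("1", 51), rep("0", 49))

# Count the votes for each class
tally <- table(votes)

# The forest's prediction is the class with the most votes
forest_prediction <- names(tally)[which.max(tally)]
forest_prediction  # "1"
```

A 51-to-49 split is enough to decide the class, so a narrow vote gives the same predicted label as a unanimous one.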
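
---

# Where the metrics come from

The per-class metrics reported for the forest can be reproduced by hand from four confusion-matrix counts. The slides do not print the confusion matrix, so the counts below are an assumption: they were back-computed from the reported prevalence, sensitivity, and specificity, taking the test set to have 266 rows. The formulas themselves are the standard definitions.

```r
# Assumed counts (back-computed from the reported forest metrics,
# not printed in the slides). Positive class is "1" (survived).
tp <- 68   # predicted survived, actually survived
fn <- 34   # predicted not survived, actually survived
fp <- 12   # predicted survived, actually not survived
tn <- 152  # predicted not survived, actually not survived

sensitivity <- tp / (tp + fn)   # also called recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # also called positive predictive value
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
accuracy    <- (tp + tn) / (tp + fn + fp + tn)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(Sensitivity = sensitivity, Specificity = specificity,
        Precision = precision, F1 = f1,
        Accuracy = accuracy, BalancedAccuracy = balanced_accuracy), 7)
```

With these counts, the computed values agree with the forest's tables; for example, the accuracy is 220/266, about 0.827.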