Random Forest - Classification

This notebook has two parts: Sample Codes and Practice.

1. Sample Codes

Installation: You will need to install python and jupyter. The easiest way is to install the package Anaconda as follows.

  • Download Anaconda from this link

  • Install Anaconda from the downloaded file

  • Open Jupyter Lab by

      1. Click to the Start Windows Logo and Type in Anconda Promp. Open Anaconda Promp
      1. In Anaconda Promp, type in: jupyter lab and hit Enter

Data: The data should be in the same folder as the notebook.

A procedure of training and tuning predictive models runs as follows.

  • Step 1: Import some packages
  • Step 2: Import data and do some cleaning
  • Step 3: Encode Categorical Variables, i.e. change a categorical variable to (multiple) numeric variables.
  • Step 4: Split the data into training and testing
  • Step 5: Train a first model
  • Step 6: Test the model
  • Step 7: Hyperparameters Tuning and redo Step 5 and 6

We will go over the above steps to train random forest with the titanic dataset. Notice that these codes can be reused for other dataset. The codes for Step 1, 4, 5, and 6 should be the same or at least similar when applied to other data. Only Step 2 and Step 3 will be different from data to data.

Step 1: Import some packages

import pandas as pd
import numpy as np
np.random.seed(12356)

Step 2: Import data and do some cleaning

# Import the data
df = pd.read_csv('titanic.csv')
# See all variables of the data
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
# Assign input variables
X = df.loc[:,['Pclass','Sex','Age','Fare','Embarked','SibSp','Parch']]

# Assign target variable
y = df['Survived']

# Replace missing values by the median
X["Age"] = X["Age"].fillna(X["Age"].median())

# Impute the Embarked variable
X["Embarked"] = X["Embarked"].fillna("S")

Step 3: Encode Categorical Variables

Encode a Categorical Variable = Turn it into multiple numeric variables

sklearn does not work directly with categorical variables. It requires the categorical variables to be encoded into numeric variables. There are multiple way to encode categorical variables. Here, we implement the simplest way of encoding: one-hot encoding or dummy encoding.

# Show the types of the variables
X.dtypes
Pclass        int64
Sex          object
Age         float64
Fare        float64
Embarked     object
SibSp         int64
Parch         int64
dtype: object
# Change Pclass to categorical variable
X['Pclass'] = X['Pclass'].astype(object)

# Encode categorical variable
X = pd.get_dummies(X)

Step 4: Split the data into training and testing

test_size =.3 means 30% of the data is saved for testing.

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3)

Step 5: Train a model

n_estimators is the number of trees in the forest. max_features is the number of variables considered at each split.

from sklearn.ensemble import RandomForestClassifier
r1 = RandomForestClassifier(n_estimators=10, max_features=2)
r1.fit(x_train, y_train)
RandomForestClassifier(max_features=2, n_estimators=10)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Step 6: Test the model

# Accuracry on test data
r1.score(x_test, y_test)
0.832089552238806

Step 7: Hyperparameters Tuning and redo Step 5 and 6

How do we know the selection of the selected values of n_estimators and max_features above are the best selection? We actually do not know!

Tuning hyperparameters or Tuning a model is to search for the set of hyperparameters that works the best. To tune a model, one first needs to know what the hyperparameters/tuning parameters that the model has. A model may have several hyperparamters that sometime it is not practical to tune all the hyperparameters.

Our model here is random forest. To see the list of tuning parameters of random forest, one can check at the sklearn document of the model. One way to find out is to google: RandomForestClassifier and sklearn. This search brings us to this link.

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

From the link, you can see the list of (hyper)parameters. In this example, we will tune hyperparameter n_estimators and max_features using grid search (GridSearchCV)

import warnings
warnings.filterwarnings("ignore")


# Decide what hyperparameter to tune then decide the searching range
param_grid = {'n_estimators': range(5,15), 'max_features':range(2, 5)}

# Create a list of trees
from sklearn.model_selection import GridSearchCV
r2 = GridSearchCV(RandomForestClassifier(), param_grid, cv = 3)
r2.fit(x_train, y_train)
GridSearchCV(cv=3, estimator=RandomForestClassifier(),
             param_grid={'max_features': range(2, 5),
                         'n_estimators': range(5, 15)})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
cv_result = pd.concat([pd.DataFrame(r2.cv_results_["params"]),pd.DataFrame(r2.cv_results_["mean_test_score"], columns=["Accuracy"])],axis=1)
cv_result['Rank'] = (-cv_result['Accuracy']).argsort().argsort()
cv_result
max_features n_estimators Accuracy Rank
0 2 5 0.773659 24
1 2 6 0.789700 10
2 2 7 0.755992 28
3 2 8 0.791326 8
4 2 9 0.776864 22
5 2 10 0.772064 25
6 2 11 0.768828 26
7 2 12 0.780085 19
8 2 13 0.780062 20
9 2 14 0.802544 2
10 3 5 0.799339 3
11 3 6 0.783290 15
12 3 7 0.784916 12
13 3 8 0.794539 5
14 3 9 0.781687 18
15 3 10 0.791342 6
16 3 11 0.788113 11
17 3 12 0.791303 9
18 3 13 0.784900 13
19 3 14 0.791334 7
20 4 5 0.748026 29
21 4 6 0.776856 23
22 4 7 0.764005 27
23 4 8 0.809001 0
24 4 9 0.781703 16
25 4 10 0.797760 4
26 4 11 0.781695 17
27 4 12 0.784877 14
28 4 13 0.776880 21
29 4 14 0.804178 1

What are the best values for n_estimators and max_features?

r2.best_params_
{'max_features': 4, 'n_estimators': 8}

Train and test the random forest with the best parameters found above

r3 = RandomForestClassifier(**r2.best_params_)
r3.fit(x_train, y_train)
r3.score(x_test, y_test)
0.8544776119402985

Variable Importance

import warnings
warnings.filterwarnings('ignore')

sorted_idx = (-r3.feature_importances_).argsort()

feature_importance = pd.DataFrame({'Variables':x_train.columns[sorted_idx], 'Importance':r1.feature_importances_[sorted_idx]})
dff = feature_importance[:10]
dff.sort_values('Importance',inplace=True)

dff.plot(kind='barh',y='Importance',x='Variables', legend=False)
<Axes: ylabel='Variables'>

Training Errors vs. Testing Errors

We plot the accuracy of random forest when changing the number of trees. The plot shows increasing the number of trees does not necessarily overfit the data (the testing accuracy does not decrease).

n_estimators = range(2, 100)
max_features = range(2, 3)

#erros_plot = function(criterion, )

rs = pd.DataFrame(columns = ['n_estimators','max_features', 'Data','Accuracy'])

for n_estimators1 in n_estimators:
    for max_features1 in max_features:
        r1 = RandomForestClassifier(n_estimators=n_estimators1, max_features=max_features1, oob_score=True)
        r1.fit(x_train, y_train)
        new_row={'n_estimators':n_estimators1,'max_features':max_features1, 'Data':'Train','Accuracy':r1.score(x_train, y_train)}
        
        #rs=rs.append(new_row, ignore_index=True)
        
        rs = pd.concat([rs, pd.DataFrame([new_row])], ignore_index=True)
        
        new_row={'n_estimators':n_estimators1,'max_features':max_features1, 'Data':'Test','Accuracy':r1.score(x_test, y_test)}
        #rs=rs.append(new_row, ignore_index=True)
        rs = pd.concat([rs, pd.DataFrame([new_row])], ignore_index=True)
        
        new_row={'n_estimators':n_estimators1,'max_features':max_features1, 'Data':'OOB','Accuracy':r1.oob_score_}
        #rs=rs.append(new_row, ignore_index=True)
        rs = pd.concat([rs, pd.DataFrame([new_row])], ignore_index=True)
        
        
import seaborn as sns
import matplotlib.pyplot as plt
sns.lineplot(data=rs, y="Accuracy", x="n_estimators", hue='Data', ci=None)     
<Axes: xlabel='n_estimators', ylabel='Accuracy'>

2. Practice

Following the sample codes above to do/answer the below.

  • Import the breast cancer dataset. The data can be downloaded at this link
  • Check out the missing values in each columns
  • Set the input (X) and output (y). Split the data into 70% training and 30% testing
  • Train a random forest of 100 trees. Consider 3 variables at each split. What is the training accuracy and testing accuracy of the forest?
  • Consider a collection of random forest where the number of trees run from 20 to 200 and the variable considered at each split runs from 3 to 10. What is the best random forest in this collection in term of test accuracy?
  • Train a random forest using the best hyperparameters found above then calculate the testing error of this tree.