Random Forest - Regression

1. Predicting Age

1.1 Import the data
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
np.random.seed(12356)
df = pd.read_csv('titanic.csv')
df = df.dropna()
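
Note that dropna() removes every row containing any missing value, including rows where only Age is missing. A quick check of how much data this costs, as a minimal sketch on the freshly read file:

# Inspect missingness before dropping rows wholesale
raw = pd.read_csv('titanic.csv')   # re-read the raw file for inspection
print(raw.isna().sum())            # missing count per column
print('Rows kept after dropna():', len(raw.dropna()), 'of', len(raw))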
1.2 Assign input and output variables
# Assign input variables
X = df.loc[:,['Pclass','Sex','Fare','Embarked','SibSp','Parch','Survived']]

# Assign target variable
y = df['Age']
1.3 Handle missing values and fix variable types
# Impute the Embarked variable
X["Embarked"] = X["Embarked"].fillna("S")
# Change Pclass to categorical variable
X['Pclass'] = X['Pclass'].astype(object)
X['Survived'] = X['Survived'].astype(object)
1.4 Encode categorical variable
X = pd.get_dummies(X)
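
get_dummies replaces each categorical column with one indicator column per level. Checking the expanded design matrix is a quick sanity check; a minimal sketch:

# Inspect the one-hot encoded design matrix
print(X.shape)
print(X.columns.tolist())   # e.g. Pclass_1, ..., Sex_female, Sex_male, Embarked_C, ...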
1.5 Split the data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
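
Without a random_state, train_test_split shuffles using NumPy's global random state (seeded above), so the split still depends on cell execution order. A sketch of the check and a fully reproducible alternative:

# Check split sizes; fixing random_state makes the split reproducible
# regardless of earlier np.random calls
print(x_train.shape, x_test.shape)
# x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12356)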
1.6 Set up and train the random forest
from sklearn.ensemble import RandomForestRegressor
r1 = RandomForestRegressor(n_estimators=100, max_features=3)
r1.fit(x_train, y_train)

# Rsquared
from sklearn.metrics import r2_score
print('Rsquared on Testing: ', r2_score(y_test, r1.predict(x_test)))
Rsquared on Testing:  -0.10467825143722398
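
A negative testing Rsquared means the forest predicts worse than always guessing the mean age. Comparing it with the training Rsquared shows how strongly the model overfits; a quick check:

# score() returns Rsquared for a regressor; a large train/test gap
# signals overfitting
print('Rsquared on Training:', r1.score(x_train, y_train))
print('Rsquared on Testing: ', r1.score(x_test, y_test))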

1.7 Variable Importance

sorted_idx = (-r1.feature_importances_).argsort()

feature_importance = pd.DataFrame({'Variables': x_train.columns[sorted_idx],
                                   'Importance': r1.feature_importances_[sorted_idx]})

# Plot the 10 most important variables; work on a copy so the
# original data frame df is not overwritten
top10 = feature_importance[:10].copy()
top10 = top10.sort_values('Importance')

top10.plot(kind='barh', y='Importance', x='Variables', legend=False)
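Impurity-based feature_importances_ can be biased toward continuous and high-cardinality variables. Permutation importance, computed on the test set, is a model-agnostic alternative; a minimal sketch:

# Permutation importance: how much the testing Rsquared drops when
# each column is shuffled
from sklearn.inspection import permutation_importance

perm = permutation_importance(r1, x_test, y_test, n_repeats=10, random_state=0)
perm_df = (pd.DataFrame({'Variables': x_test.columns,
                         'Importance': perm.importances_mean})
           .sort_values('Importance'))
perm_df.plot(kind='barh', y='Importance', x='Variables', legend=False)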

1.8 Training Rsquared across number of trees and max features

n1 = 1
n2 = 100
l1 = 1
l2 = 7

ac = pd.DataFrame([], columns=['Number of Trees', 'Max Features', 'Train Rsquared'])

for n_trees in range(n1, n2):
    for mf in [l1, l2]:
        rf = RandomForestRegressor(n_estimators=n_trees, max_features=mf)
        rf.fit(x_train, y_train)

        ac = pd.concat([ac, pd.DataFrame([[n_trees, mf, rf.score(x_train, y_train)]],
                                         columns=['Number of Trees', 'Max Features', 'Train Rsquared'])],
                       ignore_index=True)
        

import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
ax = sns.lineplot(x="Number of Trees", y="Train Rsquared",
                  hue=ac['Max Features'].astype('category'), data=ac)
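
The training Rsquared climbs toward 1 because each tree can memorize the bootstrap sample it was grown on, so it says little about generalization. The out-of-bag (OOB) score, which evaluates each tree on the rows it did not see, gives a test-like estimate without touching the test set; a sketch:

# oob_score=True evaluates each tree on its out-of-bag rows
rf_oob = RandomForestRegressor(n_estimators=100, max_features=3, oob_score=True)
rf_oob.fit(x_train, y_train)
print('OOB Rsquared:', rf_oob.oob_score_)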

1.9 Testing Rsquared across number of trees and max features

n1 = 1
n2 = 100
l1 = 1
l2 = 7

ac = pd.DataFrame([], columns=['Number of Trees', 'Max Features', 'Testing Rsquared'])

for n_trees in range(n1, n2):
    for mf in [l1, l2]:
        rf = RandomForestRegressor(n_estimators=n_trees, max_features=mf)
        rf.fit(x_train, y_train)

        ac = pd.concat([ac, pd.DataFrame([[n_trees, mf, rf.score(x_test, y_test)]],
                                         columns=['Number of Trees', 'Max Features', 'Testing Rsquared'])],
                       ignore_index=True)
        

ax = sns.lineplot(x="Number of Trees", y="Testing Rsquared",
                  hue=ac['Max Features'].astype('category'), data=ac)
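
The two loops above amount to a hand-rolled grid search scored on a single split. scikit-learn's GridSearchCV runs the same search with cross-validation; a sketch with an illustrative grid:

from sklearn.model_selection import GridSearchCV

# The grid values here are illustrative, not tuned
grid = GridSearchCV(RandomForestRegressor(),
                    param_grid={'n_estimators': [50, 100, 200],
                                'max_features': [1, 3, 7]},
                    cv=5, scoring='r2')
grid.fit(x_train, y_train)
print(grid.best_params_)
print('Best CV Rsquared:', grid.best_score_)
print('Testing Rsquared:', grid.score(x_test, y_test))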

2. Practice

Predicting NBA Salary. Download the data at: https://bryantstats.github.io/math460/python/nba_salary.csv

  1. Import the data and drop all the missing values

  2. Set the input (X) and output (y). (Use df.columns to list the columns for easier copy/paste.) Split the data into 60% training and 40% testing. There is no need to repeat steps 1.3 and 1.4, as all variables are numeric and already have the correct types.

  3. Train a random forest of 200 trees that considers only 2 variables at each split. What is the testing Rsquared of the model?

  4. What is the most important variable according to the model?

  5. Find a random forest that has a higher testing Rsquared than the first forest. (A starter sketch for these steps follows below.)
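
A starter sketch for the practice steps. The target column name 'Salary' is an assumption; check df.columns for the exact name in the downloaded file:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv('nba_salary.csv')
df = df.dropna()

y = df['Salary']                    # assumed target column name; verify with df.columns
X = df.drop(columns=['Salary'])

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

rf = RandomForestRegressor(n_estimators=200, max_features=2)
rf.fit(x_train, y_train)
print('Testing Rsquared:', rf.score(x_test, y_test))

# Most important variable according to the model
print(X.columns[rf.feature_importances_.argmax()])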