Part 1: Use Machine Learning to Predict a Death Event Due to Heart Failure
The goal of this analysis is to compare machine learning methods when predicting a death event due to heart failure. The predictor variables of a death event include age, anaemia, creatinine phosphokinase, diabetes, ejection fraction, high blood pressure, platelets, serum creatinine, serum sodium, sex, smoking, and time.
library(rpart)
library(gbm)
library(randomForest)
library(ISLR)
heart_failure<-read.csv(“heart_failure.csv”)
hf <- heart_failure
head(hf)
Split data into training and test sets.
set.seed(21)
test.d <- sample(1:nrow(hf), .20*nrow(hf))
test <- hf[test.d, ]
training <- hf[-test.d, ]
y.true <- as.numeric(test$DEATH_EVENT)
y.true
Grow a single tree using 10 for cross validation.
model.control <- rpart.control(minsplit = 5, xval = 10, cp = 0)
fit <- rpart(DEATH_EVENT~., data = training, method = “class”, control = model.control)
plot(fit)
text(fit, use.n = TRUE, cex = .5)

The tree is overfit. I am going to look at the complexity parameter table. The complexity parameter gives the needed improvement at each node. If adding another variable increases the cost above the complexity parameter, the variable will not be added. This will help me prune the tree.
min.cp <-which.min(fit$cptable[,4])
plot(fit$cptable[,4], main = “Cp for model selection”, ylab = “cv error”)

This complexity parameter table shows that 2 is optimal.
pruned.fit <- prune(fit, cp = fit$cptable[min.cp, 1])
plot(pruned.fit)
text(pruned.fit, use.n = TRUE, cex = .5)

Compute test error for a single tree.
my.pred <- predict(pruned.fit, newdata = test, type = “class”)
my.pred
y.hat <- as.numeric(my_pred)-1
y.hat
misclass.tree <- sum(abs(y.true- y.hat))/length(y.hat)
misclass.tree
Test error for a single tree: 0.1355932.
Build a random forest model.
rf.fit <- randomForest(DEATH_EVENT~., data = training, n.tree = 10000)
varImpPlot(rf.fit)

importance(rf.fit)
IncNodePurity
age 4.5938688
anaemia 0.3684661
creatinine_phosphokinase 2.9997713
diabetes 0.3819893
ejection_fraction 7.3631650
high_blood_pressure 0.5095131
platelets 3.3403799
serum_creatinine 6.2360515
serum_sodium 2.5012488
sex 0.4585470
smoking 0.4557590
time 19.6994486
Compute the test error for the random forest.
y.hat <- predict(rf.fit, newdata = test, type = “response”)
y.hat <- as.numeric(y.hat)
y.hat
misclass.rf <- sum(abs(y.true- y.hat))/length(y.hat)
misclass.rf
Random forest test error: 0.2446791.
Bagging
bag.fit <- randomForest(DEATH_EVENT~., data = training, n.tree = 10000, mtry = 10)
varImpPlot(bag.fit)

importance(bag.fit)
IncNodePurity
age 3.1057867
anaemia 0.2509914
creatinine_phosphokinase 2.9177911
diabetes 0.2167239
ejection_fraction 6.2159849
high_blood_pressure 0.3778630
platelets 2.8023183
serum_creatinine 6.5530256
serum_sodium 1.8748913
sex 0.3664596
smoking 0.2428489
time 24.7588903
Compute test error for the bagging.
y.hat <- predict(bag.fit, newdata = test, type = “response”)
y.hat <- as.numeric(y.hat)
y.hat
misclass.bag <- sum(abs(y.true- y.hat))/length(y.hat)
misclass.bag
Bagging test error: 0.2385333.
Boosting
boost.train <- training;
boost.train$DEATH_EVENT <- as.numeric(training$DEATH_EVENT)
boost.train$DEATH_EVENT
boost.test <- test;
boost.test$DEATH_EVENT <- as.numeric(test$DEATH_EVENT)
boost.test$DEATH_EVENT
boost.fit <- gbm(DEATH_EVENT~., data = boost.train, n.trees = 1000, shrinkage <-.1, interaction.depth = 3, distribution = “adaboost”)
boost.fit2 <- gbm(DEATH_EVENT~., data = boost.train, n.trees = 1000, shrinkage <-.6, interaction.depth = 3, distribution = “adaboost”)
summary(boost.fit)

Variable relative influence:
time 37.9954489
ejection_fraction 14.0026115
serum_creatinine 11.6500200
platelets 10.4647959
age 8.7592157
creatinine_phosphokinase 8.0280420
serum_sodium 4.4700776
sex 1.4639108
diabetes 1.0710947
smoking 1.0005206
anaemia 0.5917193
high_blood_pressure 0.5025430
summary(boost.fit2)

Variable relative influence:
time 29.389490031
serum_creatinine 14.053466311
platelets 13.644517810
ejection_fraction 13.067936909
creatinine_phosphokinase 10.309038549
age 7.130393657
serum_sodium 5.015771226
anaemia 3.339362804
diabetes 2.694198896
sex 1.199288907
smoking 0.150299322
high_blood_pressure 0.006235579
Look at the test error for shrinkage = .1.
y.hat <- predict(boost.fit, newdata = boost.test, n.trees = 1000, type = “response”)
misclass.boost.1 <- sum(abs(y.hat — y.true))/ length(y.true)
misclass.boost.1
Test error for shrinkage = .1: 0.1677244
Look at the test error for shrinkage = .6.
y.hat <- predict(boost.fit2, newdata = boost.test, n.trees = 1000, type = “response”)
misclass.boost.6 <- sum(abs(y.hat — y.true))/ length(y.true)
misclass.boost.6
Test error for shrinkage = .6: 0.1865359
Error profiles for different shrinkage values:

Build a k nearest neighbor model.
library(class)
library(factoextra)
knn.total.data <- rbind(training, test)
set.seed(21)
Use an elbow function to compute total within-cluster sum of squares.
dev.off()
fviz_nbclust(knn.total.data, kmeans, method = “wss”, k.max = 20) + theme_minimal() + ggtitle(“the Elbow Method”)
k=5
knn<- knn(train= training, test= test, cl=training$DEATH_EVENT, k=5)
knn
knn.cm <- table(test$DEATH_EVENT, knn)
knn.error<- mean(knn != test$DEATH_EVENT)
print(paste(‘Accuracy =’, 1-knn.error))
Accuracy = 0.728813559322034
knn.error
Test error for k=5: 0.2711864
Build a logistic regression model.
glm.fit <- glm(DEATH_EVENT ~. , data = training, family = “binomial”)
summary(glm.fit)
names(glm.fit)
round(glm.fit$fitted.values)
glm.probs.test <- predict(glm.fit, newdata = test, type = “response”)
y.hat.test <- round(glm.probs.test)
log.tab <- table(y.hat.test, test$DEATH_EVENT)

.86441 accuracy
1- .86441
Logistic regression test error: 0.13559.
This is the lowest of all models but it is nearly identical to the single tree test error which was 0.1355932.
The variables that the logistic regression found important are:
time with a pvalue of 6.51e-09
ejection_fraction with a pvalue of 1.47e-06
serum_creatinine with a pvalue of 0.000592
age with a pvalue of 0.000911
The single tree error was 0.1355932. The random forest error was 0.2446791. The bagging error was 0.2385333. The lowest boosting error was 0.1677244 with .1 shrinkage and 1000 trees. The KNN error was the highest at 0.2711864, and the logistic regression test error was the lowest at 0.13559.
I was surprised that some of the simpler models performed better than bagging, boosting and random forests. The advantage of single tree model compared to bagging, boosting and random forests is that it is much more interpretable. The KNN did not do as well and the elbow method for selecting k was unreliable in this case, so I had to try a few different k values out. I found k=5 resulted in the lowest error at 27%, but this was not comparable to the results of bagging, boosting and random forests. The logistic regression had the lowest error rate of all at 13.56%. This was the easiest to implement and it identified several of the same important variables as boosting.
Part 2: Use Machine Learning to Predict Diabetes
The Pima dataset comes from the National Institute of Diabetes and Digestive and Kidney Diseases. My goal with this analysis is to predict diabetes based on 7 health predictors that include the number of pregnancies, glucose level, diastolic blood pressure, skin fold thickness, body mass index, pedigree, and age. I also want to compare the machine learning methods used for prediction.
pima
pima <- pima[-9] #remove redundant column
pima
head(pima)
Split data into training and test sets.
set.seed(21)
test.d <- sample(1:nrow(pima), .20*nrow(pima))
test <- pima[test.d, ]
test$classdigit
training <- pima[-test.d, ]
training$classdigit
y.true <- as.numeric(test$classdigit)-1
y.true
Grow a single tree using 10 for cross validation.
model.control <- rpart.control(minsplit = 5, xval = 10, cp = 0)
fit <- rpart(classdigit~., data = training, method = “class”, control = model.control)
plot(fit)
text(fit, use.n = TRUE, cex = .5)

The tree is overfit. I am going to look at the complexity parameter table. The complexity parameter gives the needed improvement at each node. If adding another variable increases the cost above the complexity parameter, the variable will not be added. This will help me prune the tree.
min.cp <- which.min(fit$cptable[,4]) #column 4 is the xerror
plot(fit$cptable[,4], main = “Cp for model selection”, ylab = “cv error”)

This complexity parameter table shows that 5 is optimal.
pruned.fit <- prune(fit, cp = fit$cptable[min.cp, 1])
quartz()
plot(pruned.fit)
text(pruned.fit, use.n = TRUE, cex = .5)

Compute test error for a single tree.
my.pred <- predict(pruned.fit, newdata = test, type = “class”)
my.pred
y.hat <- as.numeric(my.pred)-1
y.hat
y.hat
misclass.tree <- sum(abs(y.true- y.hat))/length(y.hat)
misclass.tree
Single tree test error: 0.1886792.
Build a random forest model.
rf.fit <- randomForest(classdigit~., data = training, n.tree = 10000)
varImpPlot(rf.fit)

importance(rf.fit)
MeanDecreaseGini
npregnant 17.61143
glucose 50.48042
diastolic.bp 16.43220
skinfold.thickness 19.49788
bmi 30.02280
pedigree 27.75305
age 27.51580
Compute test error for the random forest.
y.hat <- predict(rf.fit, newdata = test, type = “response”)
y.hat <- as.numeric(y.hat)-1
y.hat
misclass.rf <- sum(abs(y.true- y.hat))/length(y.hat)
misclass.rf
Random forest test error: 0.2075472
Bagging
bag.fit <- randomForest(classdigit~., data = training, n.tree = 10000, mtry = 10)
varImpPlot(bag.fit)

importance(bag.fit)
MeanDecreaseGini
npregnant 14.94415
glucose 64.91023
diastolic.bp 13.71250
skinfold.thickness 15.64271
bmi 26.32955
pedigree 27.82432
age 26.63570
Compute test error for bagging.
y.hat <- predict(bag.fit, newdata = test, type = “response”)
y.hat <- as.numeric(y.hat)-1
y.hat
misclass.bag <- sum(abs(y.true- y.hat))/length(y.hat)
misclass.bag
Bagging test error: 0.2264151
Boosting
boost.train <- training
boost.train
boost.train$classdigit <- as.numeric(training$classdigit)-1
boost.train$classdigit
boost.test <- test
boost.test$classdigit <- as.numeric(test$classdigit)-1
boost.test$classdigit
boost.fit <- gbm(classdigit~., data = boost.train, n.trees = 1000, shrinkage = .1, interaction.depth = 3, distribution = “adaboost”)
Try different levels for shrinkage.
eboost.fit2 <- gbm(classdigit~., data = boost.train, n.trees = 1000, shrinkage = .6, interaction.depth = 3, distribution = “adaboost”)
summary(boost.fit)

Variable relative influence:
glucose 23.613904
pedigree 19.515308
bmi 18.367461
age 14.766844
skinfold.thickness 8.672439
npregnant 8.059944
diastolic.bp 7.004100
summary(boost.fit2)

Variable relative influence:
glucose 24.144609
pedigree 22.611261
bmi 15.964700
age 12.981947
skinfold.thickness 11.247379
npregnant 6.664430
diastolic.bp 6.385674
Look at the test error for shrinkage = .1.
y.hat <- predict(boost.fit, newdata = boost.test, n.trees = 1000, type = “response”)
misclass.boost.1 <- sum(abs(y.hat — y.true))/ length(y.true)
misclass.boost.1
Test error for shrinkage = .1: 0.2382885
Look at the test error for shrinkage = .6.
y.hat <- predict(boost.fit2, newdata = boost.test, n.trees = 1000, type = “response”)
misclass.boost.6 <- sum(abs(y.hat — y.true))/ length(y.true)
misclass.boost.6
Test error for shrinkage = .6: 0.2516983
Error profiles for different shrinkage values:

The random forest important variables differ from the single tree model which identified glucose, age, and pedigree as important variables. The random forest identified glucose as the most important, then BMI, and then age and pedigree. Boosting at .1, with the lower error rate, identified glucose, pedigree, bmi, and then age in order of importance. Bagging on the other hang was more similar to the single tree model because it identified glucose, pedigree, and age as the most important with BMI being similar to age in importance.
Random forest had an increased test error of 0.2075472 in comparison to the single tree that had an error of 0.1886792. Boosting had a further increased test error of 0.2382885 with .1 shrinkage. Boosting with .6 shrinkage had an error of 0.2516983. Since boosting did not do well as well as random forest, I was curious to see how bagging would compare. The bagging error was 0.2264151. This is similar to the situation I found with the heart failure data in that the single tree model performed better than the random forest, boosting, and bagging models. The benefit of this is a reduction in computational cost and higher interpretability.
By Aspen Gulley on .

Leave a Reply