Aspen Gulley

Data Scientist | Behavior Analyst in Training


Predicting Chronic Kidney Disease with Machine Learning in R


My goal with this analysis was to predict chronic kidney disease from 24 attributes. The data are 400 by 25, including the dependent variable. There are 150 observations of patients who do not have chronic kidney disease and 250 of patients who do. I hoped to create a model that predicted kidney disease with at least 80% accuracy. I began by creating dummy variables for the 11 binary nominal attributes. For example, normal red blood cells were coded as 0 and abnormal as 1. Likewise, chronic kidney disease not present was coded as 0 and present as 1.
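The dummy coding described above can be sketched with one small helper. This is a minimal illustration on a made-up data frame, not the full script at the end of the post; `recode_binary` and the toy values are hypothetical names and data introduced only for this example.

```r
# Toy data with hypothetical values, mimicking three of the binary columns
toy <- data.frame(
  rbc = c("normal", "abnormal", NA, "normal"),
  htn = c("yes", "no", "yes", NA),
  classification = c("ckd", "notckd", "ckd", "ckd"),
  stringsAsFactors = FALSE
)

# Map labels to 0/1; anything unmatched (including NA) stays NA
recode_binary <- function(x, zero, one) {
  x <- trimws(x)                    # drop stray spaces or tabs such as "\tno"
  out <- rep(NA_real_, length(x))
  out[x %in% zero] <- 0
  out[x %in% one] <- 1
  out
}

toy$rbc <- recode_binary(toy$rbc, zero = "normal", one = "abnormal")
toy$htn <- recode_binary(toy$htn, zero = "no", one = "yes")
toy$classification <- recode_binary(toy$classification, zero = "notckd", one = "ckd")
toy
```

Trimming whitespace first means entries that differ only by a leading space or tab do not need their own recoding line.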

Data Set Information:
The dimensions of the data are 400 by 25.
The attributes include:
age — age 
bp — blood pressure 
sg — specific gravity 
al — albumin 
su — sugar 
rbc — red blood cells
pc — pus cell 
pcc — pus cell clumps 
ba — bacteria 
bgr — blood glucose random
bu — blood urea 
sc — serum creatinine
sod — sodium 
pot — potassium
hemo — hemoglobin 
pcv — packed cell volume
wc — white blood cell count 
rc — red blood cell count
htn — hypertension 
dm — diabetes mellitus
cad — coronary artery disease 
appet — appetite
pe — pedal edema 
ane — anemia
classification — chronic kidney disease

The data set contains 470 missing values. Omitting observations with missing values was not a reasonable option because it would cut the dataset almost in half, to 228 observations. To understand the missingness better, I visualized it and researched different imputation methods. Overall, 95.5% of the values are present and 4.5% are missing. There are 61 observations missing both sodium and potassium, and 29 missing both specific gravity and sugar. Sodium and potassium are missing the most datapoints, but hemoglobin, which ends up being an important predictor variable, is also missing several.
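The counts in this paragraph come from a handful of base R summaries. A minimal sketch on a made-up five-row data frame (the values are hypothetical and chosen only so the arithmetic is easy to follow):

```r
# Toy data with hypothetical values and a few missing cells
toy <- data.frame(
  sod  = c(140, NA, 135, NA, 138),
  pot  = c(4.5, NA, 4.0, NA, 5.1),
  hemo = c(15.2, 11.3, NA, 9.8, 13.0)
)

n_missing    <- sum(is.na(toy))                       # total missing cells
pct_missing  <- 100 * mean(is.na(toy))                # share of all cells that are missing
per_column   <- colSums(is.na(toy))                   # missing count per variable
n_complete   <- sum(complete.cases(toy))              # rows that would survive na.omit()
both_sod_pot <- sum(is.na(toy$sod) & is.na(toy$pot))  # rows missing both variables

n_missing
per_column
n_complete
both_sod_pot
```

Comparing `n_complete` with `nrow(toy)` shows how much data listwise deletion would cost before committing to it.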

I needed to research different methods of imputation. Mean imputation is one method of handling missing values: each missing value is replaced with the mean of its variable, or with the mode for categorical data. Its advantages are that it is simple to understand, implement, and explain to others, and that it works with both continuous and categorical variables. Its disadvantages are that it can be overly influenced by outliers or by a non-normal distribution of the variable, and that it can bias the standard errors and variance of the imputed variables.
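As a rough illustration of mean and mode imputation, here is a minimal sketch. `impute_simple` is a hypothetical helper written for this example, and the toy values are made up.

```r
# Replace NA with the mean (numeric columns) or the most frequent value (other columns)
impute_simple <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    counts <- table(x)                              # table() ignores NA by default
    x[is.na(x)] <- names(counts)[which.max(counts)]
  }
  x
}

toy <- data.frame(
  bu    = c(30, 40, NA, 50),
  appet = c("good", "poor", "good", NA),
  stringsAsFactors = FALSE
)

toy[] <- lapply(toy, impute_simple)   # apply column by column, keeping the data frame shape
toy
```

Note that applying the numeric branch to a 0/1 dummy variable fills in a fraction rather than 0 or 1, which is one reason to decide per column whether the mean or the mode is the right fill value.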

An alternative option was random forest imputation. This method uses proximity: a missing value is imputed as a weighted average of the observed values, weighted by the similarity between observations. It can handle non-linearity, interactions, and outliers effectively, and it also works with both continuous and categorical variables. On the other hand, it can be computationally costly and complex, it is not easily interpretable, and it can lead to bias when the missing data are not similar to the observed data. Because of its ability to better handle non-linearity, interactions, and outliers in the data, I decided to move forward with random forest imputation. To prepare for this, the binary variables that were classified as factors needed to be changed to numerical values. Because these variables are binary, this reclassification would not be an issue for future model performance. I split the data into training and test sets and then performed the imputations.
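One design choice when imputing after a split is where the fill-in values come from. The sketch below shows one option, which is an assumption of this example rather than what the script at the end of the post does: learn the fill values from the training rows only and reuse them on the test rows, so the test set does not inform the imputation. Simple mean imputation stands in for random forest imputation to keep the example in base R; `apply_fill` and the toy values are hypothetical.

```r
set.seed(21)

# Toy data with hypothetical values
toy <- data.frame(
  hemo = c(15.1, NA, 9.8, 12.4, 13.3, NA, 10.9, 16.0, 11.7, 14.2),
  sc   = c(1.1, 3.9, NA, 2.4, 0.9, 1.3, NA, 0.8, 5.6, 1.0)
)

# 70/30 split
train_idx <- sample(seq_len(nrow(toy)), size = round(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Learn fill-in values from the training rows only
fill_values <- vapply(train, function(col) mean(col, na.rm = TRUE), numeric(1))

# Apply the same values to any data frame with the same columns
apply_fill <- function(df, values) {
  for (name in names(values)) {
    df[[name]][is.na(df[[name]])] <- values[[name]]
  }
  df
}

train_imputed <- apply_fill(train, fill_values)
test_imputed  <- apply_fill(test, fill_values)
```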

One problem I encountered with this data is high multicollinearity. Multicollinearity occurs when variables are highly correlated with one another, which makes it difficult to distinguish the effect of one independent variable on the dependent variable from that of another, highly correlated independent variable. This dataset contains several pairs of variables with correlations greater than 0.5 in absolute value. Some of the multicollinearity arises because one independent variable depends on another. For example, in this data set pus cell clumps depend on the presence of pus cells. Additionally, albumin, red blood cell count, and packed cell volume all have several high correlations with other variables. Because this is medical data, and because there were so many correlations, I did not want to delete the highly correlated variables without knowing which ones would be important. Instead, I needed to consider which models could appropriately handle the multicollinearity.
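Scanning a full correlation matrix by eye gets tedious, so here is a minimal sketch of listing only the pairs above a threshold. The data are simulated (one column is deliberately built to track another), and the 0.5 cutoff mirrors the one mentioned above.

```r
set.seed(1)
n <- 200
hemo <- rnorm(n, mean = 12.5, sd = 2.5)

# Simulated data: pcv is built to track hemo closely, age is unrelated
toy <- data.frame(
  hemo = hemo,
  pcv  = 3 * hemo + rnorm(n, sd = 2),
  age  = rnorm(n, mean = 50, sd = 15)
)

cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Keep each pair once (upper triangle) when |r| exceeds the threshold
high <- which(abs(cor_mat) > 0.5 & upper.tri(cor_mat), arr.ind = TRUE)
pairs_df <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
pairs_df
```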

Keeping the multicollinearity issue in mind, I decided to try several different models and compare the results. The models I chose were: Principal components regression, k nearest neighbors, classification and regression tree, and random forest.

The accuracy of the first group of models was quite high. I became skeptical and wondered if the random forest imputation was inflating the results, so I re-ran the models using mean imputation and compared the results. The accuracy was still high. I then wondered if the multicollinearity of the data was responsible for the inflation. I re-ran the models a third time after removing the highly correlated variables pus cell clumps, albumin, red blood cell count, and packed cell volume, which greatly reduced the multicollinearity of the data. I used random forest imputation on this last set of models. Removing variables made me uncomfortable because many of the correlated variables are important. Even so, I was curious to see how removing some of the most highly correlated variables would affect model performance. I present the results in process order: random forest imputation, mean imputation, and then the reduced-variable model. I also ran the CART model with the missing values included, and a random forest model with the missing values removed.

Principal component regression uses the principal components as the predictor variables. It was a good place to start because it is known for working well with data that have high multicollinearity. I just needed to find the ideal number of principal components. For the dataset with random forest imputation, 11 principal components were optimal, with a cross-validated RMSE of 0.22. The principal components explained 81% of the variance in the response variable, and the test RMSE was 0.23. With mean imputation, the optimal number of principal components rose to 12 and the RMSE rose to 0.27; the variance explained fell to 72%, and the test RMSE was 0.29. After removing the most highly correlated variables, 10 principal components were selected with an RMSE of 0.24, 78% of the variance explained, and a test RMSE of 0.23. Of the three, the random forest imputation dataset gave the lowest RMSE and the highest share of variance explained, at 81%.
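To make the mechanics concrete, here is a minimal sketch of principal component regression done by hand with base R (`prcomp` plus `lm`) on simulated data. It is not the `pls::pcr` call used in the script at the end of the post; the variable names, the simulated data, and the choice of two components are all assumptions for illustration. Because the response is coded 0/1, the RMSE is a number on that scale rather than a percentage.

```r
set.seed(2)
n <- 150
z <- rnorm(n)                             # shared latent signal

# Simulated predictors: x1 and x2 are strongly correlated, x3 is independent
X <- data.frame(
  x1 = z + rnorm(n, sd = 0.3),
  x2 = z + rnorm(n, sd = 0.3),
  x3 = rnorm(n)
)
y <- as.numeric(z + 0.5 * X$x3 + rnorm(n, sd = 0.5) > 0)   # 0/1 response

train_idx <- sample(seq_len(n), size = round(0.7 * n))

# Step 1: principal components of the standardized training predictors
pca <- prcomp(X[train_idx, ], center = TRUE, scale. = TRUE)

# Step 2: regress the response on the first ncomp component scores
ncomp <- 2                                # illustrative choice
scores_train <- as.data.frame(pca$x[, 1:ncomp, drop = FALSE])
fit <- lm(y ~ ., data = cbind(y = y[train_idx], scores_train))

# Step 3: project the test rows with the training rotation, then predict
scores_test <- as.data.frame(
  predict(pca, newdata = X[-train_idx, ])[, 1:ncomp, drop = FALSE]
)
pred <- predict(fit, newdata = scores_test)
rmse <- sqrt(mean((pred - y[-train_idx])^2))
rmse
```

In practice the number of components would be chosen by cross-validation, as the validation plots in the script do, rather than fixed in advance.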

The next model I ran was k nearest neighbors. This classification method estimates the likelihood that a datapoint belongs to a group based on the nearest neighboring datapoints. I knew this was not a great choice given the multicollinearity in the data: k nearest neighbors does not handle correlated variables well because the datapoints become too close. Despite this, I wanted to explore and compare model results. I used the elbow method for selecting k, but also wanted to compare k = 5, 10, and 15. For the random forest imputation dataset, the elbow graph suggested that k = 10 or k = 11 might be a good choice. Accuracy rose about 3 percentage points from k = 10 to k = 11, and k = 15 added roughly 4 more. The mean imputation dataset performed better (for example, accuracy at k = 10 increased from 67.5% to 74.2%), but the largest improvement came when the most highly correlated variables were removed: k = 5 achieved 62.5% accuracy with random forest imputation and 85.8% with the correlated variables removed. This illustrates how poorly k nearest neighbors deals with multicollinearity.

KNN Results (test accuracy):

Random forest imputation (RFI):
 k = 5, Acc. = 0.625
 k = 10, Acc. = 0.675
 k = 11, Acc. = 0.708
 k = 15, Acc. = 0.750

Mean imputation (MI):
 k = 5, Acc. = 0.667
 k = 10, Acc. = 0.742
 k = 15, Acc. = 0.792

Correlations Removed:
 k = 5, Acc. = 0.858
 k = 10, Acc. = 0.858
 k = 15, Acc. = 0.850
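Two practical points affect numbers like these. KNN works on distances, so a column measured in thousands (white blood cell count, for instance) can dominate the result unless the predictors are standardized, and the class label should be passed only through `cl`, not left among the predictor columns. Also, an elbow plot built from k-means within-cluster sums of squares speaks to the number of clusters rather than the number of neighbors. The sketch below, on simulated data, shows one alternative under those assumptions: standardize the predictors and compare candidate k values by leave-one-out accuracy with `class::knn.cv`. The data and candidate values are made up for illustration.

```r
library(class)   # recommended package that ships with standard R installations

set.seed(3)
n <- 200
label <- factor(rep(c(0, 1), each = n / 2))

# Simulated predictors: hemo separates the classes, wc is large-scale noise
toy <- data.frame(
  hemo = rnorm(n, mean = ifelse(label == 1, 10.5, 14.5), sd = 1.5),
  wc   = rnorm(n, mean = 8000, sd = 2500)
)

# Predictors only, standardized so no single column dominates the distance
X <- scale(toy)

candidate_k <- c(5, 10, 15)
cv_accuracy <- sapply(candidate_k, function(k) {
  pred <- knn.cv(train = X, cl = label, k = k)   # leave-one-out predictions
  mean(pred == label)
})
names(cv_accuracy) <- candidate_k

best_k <- candidate_k[which.max(cv_accuracy)]
cv_accuracy
best_k
```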

Next, I ran a classification and regression tree (CART) model, a predictive model based on a decision tree. CART generally does not handle multicollinearity well because it is known to arbitrarily choose one closely correlated variable over another. Despite this, I wanted to compare CART with a random forest model. On the random forest imputation dataset, the CART model achieved 97.5% accuracy. I was a little surprised by the simplicity of the tree, but it is highly interpretable, with hemoglobin and red blood cells chosen as the important variables. Hemoglobin is the part of red blood cells that carries oxygen throughout the body. Low hemoglobin can cause anemia, which is a symptom of kidney disease. High hemoglobin is also associated with declines in renal function in chronic kidney disease. When someone has kidney disease, their body does not produce enough of the hormone EPO, so the bone marrow develops fewer red blood cells. It therefore makes sense that hemoglobin and red blood cells would be important predictor variables. In this model only three of the test observations were misclassified.

Confusion Matrix (rows are actual classes, columns are predicted classes):
    0   1
0  45   2
1   1  72
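The accuracy figure follows directly from this matrix. A minimal sketch that rebuilds it and also reports sensitivity and specificity, which are not stated in the post but are derived from the same four counts (rows are taken as actual classes and columns as predictions, matching the `table()` calls in the script):

```r
# Confusion matrix from the CART model on the random forest imputation dataset
cm <- matrix(c(45,  2,
                1, 72),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # (45 + 72) / 120
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # share of actual CKD cases that were caught
specificity <- cm["0", "0"] / sum(cm["0", ])  # share of non-CKD cases correctly cleared

accuracy      # 0.975
sensitivity
specificity
```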

Predictive accuracy fell to 95.8% with the mean imputation dataset. This model uses hemoglobin and specific gravity for prediction, as opposed to hemoglobin and red blood cells. Specific gravity is the ratio of a substance's density to the density of water. High or low specific gravity is a sign that the kidneys are not functioning correctly. Based on the data, I believe they are testing urine specific gravity, which measures the kidneys' ability to balance water and waste. Five observations were misclassified by this model.

Confusion Matrix (rows are actual classes, columns are predicted classes):
    0   1
0  43   4
1   1  72

CART is known for working well with missing data, and I wanted to explore what would happen if I left the missing values in the dataset. This CART model achieves the same predictive accuracy as the random forest imputation dataset at 97.5%. Similarly, only 3 observations are misclassified. What is different about this model in comparison to the random forest imputation CART model is that it is again choosing specific gravity as a predictor as opposed to red blood cells.

Confusion Matrix (rows are actual classes, columns are predicted classes):
    0   1
0  45   2
1   1  72

The final CART model I ran was with the highly correlated variables removed. This model achieves 97.5% predictive accuracy with only 3 observations misclassified. This model is using red blood cells as the second predicting variable, as opposed to specific gravity.

Confusion Matrix (rows are actual classes, columns are predicted classes):
    0   1
0  44   3
1   0  73

Lastly, I ran the random forest models. Tony Yiu at towardsdatascience.com describes random forests as using “bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.” (See link to his work below.) Random forest is known for working well on multicollinear data since it selects different features for different trees. The random forest imputation dataset achieved 99% predictive accuracy while the mean imputation dataset achieved 98% predictive accuracy. With correlations removed the random forest again achieved 99% predictive accuracy, and when I excluded the missing values from the data, the model achieved 98% predictive accuracy.
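The "bagging and feature randomness" idea can be sketched in a few lines with `rpart` trees on simulated data. This is a simplified illustration and not the `randomForest` algorithm itself: a real random forest draws a fresh random subset of features at every split, whereas this sketch draws one subset per tree to stay short. The data, the number of trees, and the subset size are assumptions made for the example.

```r
library(rpart)   # recommended package that ships with standard R installations

set.seed(4)
n <- 300
label <- factor(rep(c(0, 1), each = n / 2))

# Simulated data: hemo and sg carry signal, age does not
toy <- data.frame(
  hemo  = rnorm(n, mean = ifelse(label == 1, 10.5, 14.5), sd = 1.5),
  sg    = rnorm(n, mean = ifelse(label == 1, 1.013, 1.021), sd = 0.004),
  age   = rnorm(n, mean = 50, sd = 15),
  label = label
)

train_idx <- sample(seq_len(n), size = 210)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
predictors <- setdiff(names(toy), "label")

n_trees <- 25
votes <- sapply(seq_len(n_trees), function(i) {
  boot  <- train[sample(nrow(train), replace = TRUE), ]   # bagging: bootstrap the rows
  feats <- sample(predictors, 2)                          # feature randomness (per tree)
  tree  <- rpart(reformulate(feats, response = "label"),
                 data = boot, method = "class")
  as.character(predict(tree, newdata = test, type = "class"))
})

# Prediction by committee: majority vote across the trees for each test row
forest_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(forest_pred == as.character(test$label))
accuracy
```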

The feature importance graph shows specific gravity, hemoglobin, packed cell volume, albumin, red blood cell count, and then red blood cells as important features, among others. I think this is interesting because red blood cells were chosen as important in two of the CART models: the random forest imputation model and the correlations removed model. The random forest feature importance graph shows specific gravity as significantly more important than red blood cells for predicting chronic kidney disease, which is consistent with the mean imputation and missing values included CART models. Interestingly, many of the most important variables are the ones that had high correlations in the dataset. Remember, some of the variables with the highest correlations included albumin, red blood cell count, and packed cell volume. This illustrates a strength of the random forest in working with multicollinear data and why it can be problematic to delete variables with high correlations.

In conclusion, I accomplished my goal of finding a model that predicts chronic kidney disease with at least 80% accuracy based on the 24 predictor attributes listed. I read about several different imputation methods and learned how to implement random forest and mean imputation. I enjoyed comparing and exploring the datasets and seeing how they affected the model outcomes. Despite it not being representative, I liked seeing how the KNN improved when the highly correlated variables were removed. I also thought the comparison of the CART models was interesting, both for which predictors were chosen as important and for how this changed with the data. It was interesting to see that the random forest achieved similar accuracy on the correlations removed dataset and on the random forest imputation dataset. This illustrates well how the random forest deals with multicollinearity. If I had to choose a final model, it would be the random forest model on the random forest imputation dataset because it achieved 99% accuracy and handles the highly correlated variables well.

Note:
Here is the link to Tony Yiu’s great work on random forests that I referenced:
https://towardsdatascience.com/understanding-random-forest-58381e0602d2#:~:text=The%20random%20forest%20is%20a,that%20of%20any%20individual%20tree

R Code:
kid<-read.csv("kidney_disease.csv")
kid

#prepare and preprocess data

#set dummy variable 
kid$classification<- ifelse(kid$classification == 'ckd', 1, 0)

#replace abnormal red blood cells with 1 and normal red blood cells with 0 
kid$rbc[kid$rbc=="abnormal"]<-"1"
kid$rbc[kid$rbc=="normal"]<-"0"
kid$rbc 
kid$rbc<- as.numeric(kid$rbc)

#pus cell, normal 0 vs abnormal 1
kid$pc[kid$pc=="normal"]<-"0" 
kid$pc[kid$pc=="abnormal"]<-"1"
kid$pc<- as.numeric(kid$pc)

#pus cell clumps
kid$pcc[kid$pcc=="notpresent"]<-"0" 
kid$pcc[kid$pcc=="present"]<-"1"
kid$pcc<- as.numeric(kid$pcc) #convert pcc itself, not pc

#bacteria
kid$ba[kid$ba=="notpresent"]<-"0" 
kid$ba[kid$ba=="present"]<-"1"
kid$ba<- as.numeric(kid$ba)

#hypertension 
kid$htn[kid$htn=="no"]<-"0" 
kid$htn[kid$htn=="yes"]<-"1"
kid$htn<- as.numeric(kid$htn)

#Diabetes Mellitus
kid$dm[kid$dm=="no"]<-"0" 
kid$dm[kid$dm=="\tno"]<-"0" 
kid$dm[kid$dm=="yes"]<-"1"
kid$dm[kid$dm==" yes"]<-"1" #typo in row 31 of the data, recode " yes" to 1
kid$dm[kid$dm=="\tyes"]<-"1"
kid$dm<- as.numeric(kid$dm)

#Coronary Artery Disease
kid$cad[kid$cad=="no"]<-"0" 
kid$cad[kid$cad=="yes"]<-"1"
kid$cad<- as.numeric(kid$cad)

#appetite good 0 vs poor 1
kid$appet[kid$appet=="good"]<-"0" 
kid$appet[kid$appet=="poor"]<-"1"
kid$appet<- as.numeric(kid$appet)

#Pedal Edema
kid$pe[kid$pe=="no"]<-"0" 
kid$pe[kid$pe=="yes"]<-"1"
kid$pe<- as.numeric(kid$pe)

#anemia
kid$ane[kid$ane=="no"]<-"0" 
kid$ane[kid$ane=="yes"]<-"1"
kid$ane<- as.numeric(kid$ane)

kid$pcv<- as.numeric(kid$pcv)
kid$wc<- as.numeric(kid$wc)
kid$rc<- as.numeric(kid$rc)

sapply(kid, class) 
#all classes are now numeric for the random forest imputation

save(kid, file = "kidney.Rdata") #save data
load("kidney.Rdata")

dim(kid) #400 26, 400 x 25 though ignoring the ID column
sum(is.na(kid))#470 missing values
summary(kid)
sapply(kid, function(x) sum(is.na(x)))

#visualizing missing data vs not missing data on a barplot
library(ggplot2)
library(plyr)
library(tidyr)
library(dplyr)
kid %>%
 summarise_all(list(~is.na(.)))%>%
 pivot_longer(everything(),
 names_to = "variables", values_to="missing") %>%
 count(variables, missing) %>%
 ggplot(aes(y=variables,x=n,fill=missing))+
 geom_col()

library(naniar)
#save image of missing observations to computer, PNG device
png("missingobs.png")
# Code
vis_miss(kid)
# Close device
dev.off()
#4.5% of values are missing while 95.5% are present
#From these variables, I can see that sodium (22.75%) and potassium (22%) have several missing values,
#hemoglobin, blood glucose random, sugar, specific gravity, and albumin also have several missing values

png("missingobs2.png")
# Code
gg_miss_upset(kid)
# Close device
dev.off()
#An upset plot from the UpSetR package can be used to visualise the 
#patterns of missingness, or rather the combinations of missingness across cases. 
#specifically, there are 61 observations that are missing both potassium and sodium

kid.na.omit<-na.omit(kid)
dim(kid.na.omit) #228 25
#deletion is not the best option because it cuts the data set in half

#remove ID column from dataset
kid<- kid[-1] 
head(kid)

#make classification a factor to run rf
kid$classification<- as.factor(kid$classification)

#split data into training and testing sets
set.seed(21)
split.kid <- sample(1:nrow(kid), round(.70*nrow(kid)), replace = FALSE)
kid.train <- kid[split.kid, ]
kid.test <- kid[-split.kid, ]
dim(kid.train) #280 25
dim(kid.test) #120 25

#random forest imputation on training and testing sets 
library(missForest)
library(randomForest)
set.seed(16)
kid.train <- rfImpute(classification ~ ., data=kid.train) #random forest data imputation
kid.test <- rfImpute(classification ~ ., data=kid.test) #random forest data imputation
kid.y.train <- kid.train$classification
kid.y.test <- kid.test$classification

#find correlations in dataset, first fill in missing data by rf
kid.cor<- rfImpute(classification ~ ., data=kid)
cor(kid.cor[, c('age','bp','sg','al','bgr','bu','sc','sod','pot','hemo','pcv','wc','rc')])
pairs(kid.cor[, c('age','bp','sg','al','bgr','bu','sc','sod','pot','hemo','pcv','wc','rc')])
#Correlations worth noting:
#al-sg -0.71233145 
#hemo-sg 0.68208646 
#pcv-sg 0.67847179
#rc-sg 0.61909174
#al-hemo -0.7847450
#al-pcv -0.7755284
#al-rc -0.6400991
#al-bu 0.6619395
#al-sc 0.7028890
#bu-sc 0.8960939
#bu-hemo -0.7121959
#bu-pcv -0.7065823
#bu-rc -0.6214565
#sc-hemo -0.7239366 
#sc-pcv -0.7261869
#sc-rc -0.6390210
#hemo-rc 0.7439994
#pcv-rc 0.7390191

#a lot of the data is highly correlated

#Principal components regression
library(pls)
sapply(kid.train, class)
kid.train$classification<- as.numeric(kid.train$classification)
kid.test$classification<- as.numeric(kid.test$classification)
pcr.model <- pcr(classification~., data = kid.train, scale =TRUE, validation = "CV")
summary(pcr.model)
#Validation
#This table tells us the test RMSE calculated by the k-fold cross validation.
#10 PC:0.2327 11 PC: 0.2227 (choose 11 PC)

#Training
#This table tells us the percentage of the variance in the response variable explained by the principal components. 
#81.29

validationplot(pcr.model) #RMSE plot given number of principal components 
#11 principal components look good

validationplot(pcr.model, val.type="MSEP") #mean square error given number of principal components
#11

validationplot(pcr.model, val.type = "R2") #plot the r2, notice 11

pcr.pred <- predict(pcr.model, kid.test, ncomp=11) #use the 11 components chosen above
pcr.pred

#calculate RMSE
sqrt(mean((pcr.pred - kid.test$classification)^2)) #0.2323632 
#This is the average deviation between the predicted value for classification 
#and the observed value for classification for the observations in the testing set.

#KNN classification
library(class)

#k=5
kid.knn<- knn(train= kid.train, test= kid.test, cl=kid.train$classification, k=5)
kid.knn
knn.cm <- table(kid.test$classification, kid.knn)
knn.cm
knn.error<- mean(kid.knn != kid.test$classification)
knn.error #0.375
print(paste('Accuracy =', 1-knn.error)) #"Accuracy = 0.625"

#k=10
kid.knn<- knn(train= kid.train, test= kid.test, cl=kid.train$classification, k=10)
kid.knn
knn.cm <- table(kid.test$classification, kid.knn)
knn.cm
knn.error<- mean(kid.knn != kid.test$classification)
knn.error #0.3083333
print(paste('Accuracy =', 1-knn.error)) #"Accuracy = 0.675"

#k=15
kid.knn<- knn(train= kid.train, test= kid.test, cl=kid.train$classification, k=15)
kid.knn
knn.cm <- table(kid.test$classification, kid.knn)
knn.cm
knn.error<- mean(kid.knn != kid.test$classification)
knn.error #0.25
print(paste('Accuracy =', 1-knn.error)) #"Accuracy = 0.75"

#Find best k
library(factoextra)
kid.total.data <- rbind(kid.train, kid.test)
set.seed(21)
if (dev.cur() > 1) dev.off() #close any open graphics device; skip if none is open
# function to compute total within-cluster sum of squares
fviz_nbclust(kid.total.data, kmeans, method = "wss", k.max = 18) + theme_minimal() + ggtitle("the Elbow Method")

#try k=11
kid.knn<- knn(train= kid.train, test= kid.test, cl=kid.train$classification, k=11)
kid.knn
knn.cm <- table(kid.test$classification, kid.knn)
knn.cm
knn.error<- mean(kid.knn != kid.test$classification)
knn.error #0.2916667
print(paste('Accuracy =', 1-knn.error)) #"Accuracy = 0.708333333333333"

#Decision tree
library(rpart)
kid.tree<- rpart(classification ~., method="class", data=kid.train)
kid.tree

library(rpart.plot)
rpart.plot(kid.tree)
#n= 280

#node), split, n, loss, yval, (yprob)
#* denotes terminal node

#1) root 280 105 1 (0.375000000 0.625000000) 
#2) hemo>=13.05 116 14 0 (0.879310345 0.120689655) 
#4) rbc< 0.08937124 103 1 0 (0.990291262 0.009708738) *
#5) rbc>=0.08937124 13 0 1 (0.000000000 1.000000000) *
#3) hemo< 13.05 164 3 1 (0.018292683 0.981707317) *

pred.tree<- predict(kid.tree, kid.test, type="class")
kid.tree.table<- table(kid.test$classification, pred.tree)
sum(diag(kid.tree.table))/nrow(kid.test) #0.975 predictive accuracy

#Random forest 
library(randomForest)
rf.kid <- randomForest(classification ~ ., data=kid.train)
rf.kid

rf.pred<-predict(rf.kid, newdata=kid.test)
rf.pred.rounded<-round(rf.pred)
table(rf.pred.rounded, kid.test$classification) #.9917 predictive accuracy

#Partial dependence plot of the random forest
library(pdp)
library(tidyr)

#Try CART and RF without rf imputation on data
load("kidney.Rdata")
kid<- kid[-1]#delete ID column
head(kid)

set.seed(21)
split.kid <- sample(1:nrow(kid), round(.70*nrow(kid)), replace = FALSE)
kid.train <- kid[split.kid, ]
kid.test <- kid[-split.kid, ]
dim(kid.train) #280 25
dim(kid.test) #120 25

#Decision tree, handles missing data well 
library(rpart)
kid.tree<- rpart(classification ~., method="class", data=kid.train)
kid.tree
#n= 280 
#node), split, n, loss, yval, (yprob)
#* denotes terminal node
#1) root 280 105 1 (0.37500000 0.62500000) 
#2) hemo>=13.05 129 27 0 (0.79069767 0.20930233) 
#4) sg>=1.0175 107 5 0 (0.95327103 0.04672897) *
#5) sg< 1.0175 22 0 1 (0.00000000 1.00000000) *
#3) hemo< 13.05 151 3 1 (0.01986755 0.98013245) *

library(rpart.plot)
rpart.plot(kid.tree)

pred.tree<- predict(kid.tree, kid.test, type="class")
kid.tree.table<- table(kid.test$classification, pred.tree)
kid.tree.table
sum(diag(kid.tree.table))/nrow(kid.test) #0.975 predictive accuracy

#Random forest 
library(randomForest)
kid.train$classification
#randomForest will not run with NA values,
#have to add na.action to run model 
rf.kid <- randomForest(classification ~ ., data=kid.train, na.action = na.exclude)
rf.kid

rf.pred<-predict(rf.kid, newdata=kid.test)
rf.predictions.rounded<-round(rf.pred)

table(rf.predictions.rounded, kid.test$classification)

#Try mean imputation on data set
#to compare the differences in results

for(i in 1:ncol(kid.train)) {
 kid.train[ , i][is.na(kid.train[ , i])] <- mean(kid.train[ , i], na.rm = TRUE)
}
kid.train

for(i in 1:ncol(kid.test)) {
 kid.test[ , i][is.na(kid.test[ , i])] <- mean(kid.test[ , i], na.rm = TRUE)
}
kid.test

#Principal component regression
kid.train$classification<- as.numeric(kid.train$classification)
kid.test$classification<- as.numeric(kid.test$classification)
pcr.model <- pcr(classification~., data = kid.train, scale =TRUE, validation = "CV")
summary(pcr.model)

validationplot(pcr.model) #RMSE plot given number of principal components 
#12 principal components look good

validationplot(pcr.model, val.type="MSEP") #mean square error given number of principal components
#12

validationplot(pcr.model, val.type = "R2") #plot the r2

pcr.pred <- predict(pcr.model, kid.test, ncomp=12)
pcr.pred

#calculate RMSE
sqrt(mean((pcr.pred - kid.test$classification)^2)) #0.2969505

#KNN classification
library(class)

#k=5
kid.knn<- knn(train= kid.train, test= kid.test, cl=kid.train$classification, k=5)
kid.knn
knn.cm <- table(kid.test$classification, kid.knn)
knn.cm
knn.error<- mean(kid.knn != kid.test$classification)
knn.error #0.3333333 LESS than rf imputation
print(paste('Accuracy =', 1-knn.error)) #"Accuracy = 0.666666666666667"

#k=10
kid.knn<- knn(train= kid.train, test= kid.test, cl=kid.train$classification, k=10)
kid.knn
knn.cm <- table(kid.test$classification, kid.knn)
knn.cm
knn.error<- mean(kid.knn != kid.test$classification)
knn.error #0.2583333 less than rf imputation
print(paste('Accuracy =', 1-knn.error)) #"Accuracy = 0.741666666666667"

#k=15
kid.knn<- knn(train= kid.train, test= kid.test, cl=kid.train$classification, k=15)
kid.knn
knn.cm <- table(kid.test$classification, kid.knn)
knn.cm
knn.error<- mean(kid.knn != kid.test$classification)
knn.error #0.2083333
print(paste('Accuracy =', 1-knn.error)) #"Accuracy = 0.791666666666667"

#Find best k
library(factoextra)
kid.total.data <- rbind(kid.train, kid.test)
set.seed(21)
if (dev.cur() > 1) dev.off() #close any open graphics device; skip if none is open
# function to compute total within-cluster sum of squares
fviz_nbclust(kid.total.data, kmeans, method = "wss", k.max = 18) + theme_minimal() + ggtitle("the Elbow Method")

#try k=14
kid.knn<- knn(train= kid.train, test= kid.test, cl=kid.train$classification, k=14)
kid.knn
knn.cm <- table(kid.test$classification, kid.knn)
knn.cm
knn.error<- mean(kid.knn != kid.test$classification)
knn.error #0.325
print(paste('Accuracy =', 1-knn.error)) #"Accuracy = 0.675"
#big difference between 14 and 15, not made clear by the elbow method
#look into Elbow method pros and cons

#CART
kid.tree<- rpart(classification ~., method="class", data=kid.train)
kid.tree

library(rpart.plot)
rpart.plot(kid.tree)
#n= 280 
#node), split, n, loss, yval, (yprob)
#* denotes terminal node
#1) root 280 105 1 (0.37500000 0.62500000) 
#2) hemo>=13.05 112 14 0 (0.87500000 0.12500000) 
#4) sg>=1.01602 100 2 0 (0.98000000 0.02000000) *
#5) sg< 1.01602 12 0 1 (0.00000000 1.00000000) *
#3) hemo< 13.05 168 7 1 (0.04166667 0.95833333) *

pred.tree<- predict(kid.tree, kid.test, type="class")
kid.tree.table<- table(kid.test$classification, pred.tree)
kid.tree.table
#pred.tree #95.83% predictive accuracy

#Random forest
rf.kid <- randomForest(classification ~ ., data=kid.train)
rf.kid

rf.pred<-predict(rf.kid, newdata=kid.test)
rf.predictions.rounded<-round(rf.pred)

table(rf.predictions.rounded, kid.test$classification)
#rf.predictions.rounded #98.3% predictive accuracy

#Remove highly correlated variables and compare model results 
load("kidney.Rdata")
kid
#remove ID column from dataset
kid<- kid[-1] 
head(kid)
#make classification a factor to run rf
kid$classification<- as.factor(kid$classification) 
kid$classification

#take out highly correlated variables
#find correlations in dataset, first fill in missing data by rf
kid.cor<- rfImpute(classification ~ ., data=kid)
cor(kid.cor[, c('age','bp','sg','al','bgr','bu','sc','sod','pot','hemo','pcv','wc','rc')])
pairs(kid.cor[, c('age','bp','sg','al','bgr','bu','sc','sod','pot','hemo','pcv','wc','rc')])

#PC and PCC, remove PCC
#drop columns by name so the intended variables are removed regardless of their position
kid<- kid[, !(names(kid) %in% c('al', 'rc', 'pcc', 'pcv'))] #remove albumin, red count, PCC, and pcv
head(kid)
cor(kid.cor[, c('age','bp','sg','bgr','bu','sc','sod','pot','hemo','wc')])
pairs(kid.cor[, c('age','bp','sg','bgr','bu','sc','sod','pot','hemo','wc')])
#reduction in multicollinearity

#split data into training and testing sets
set.seed(21)
split.kid <- sample(1:nrow(kid), round(.70*nrow(kid)), replace = FALSE)
kid.train <- kid[split.kid, ]
kid.test <- kid[-split.kid, ]
dim(kid.train) #280 21
dim(kid.test) #120 21

#random forest imputation on training and testing sets 
library(missForest)
library(randomForest)
set.seed(16)
kid.train <- rfImpute(classification ~ ., data=kid.train) #random forest data imputation
kid.test <- rfImpute(classification ~ ., data=kid.test) #random forest data imputation
kid.y.train <- kid.train$classification
kid.y.test <- kid.test$classification

library(pls)
sapply(kid.train, class)
kid.train$classification<- as.numeric(kid.train$classification)
kid.test$classification<- as.numeric(kid.test$classification)
pcr.model <- pcr(classification~., data = kid.train, scale =TRUE, validation = "CV")
summary(pcr.model)
#Validation
#This table tells us the test RMSE calculated by the k-fold cross validation.
#12 PC

#Training
#This table tells us the percentage of the variance in the response variable explained by the principal components.

validationplot(pcr.model) #RMSE plot given number of principal components 
#12 principal components look good

validationplot(pcr.model, val.type="MSEP") #mean square error given number of principal components
#12

validationplot(pcr.model, val.type = "R2") #plot the r2, notice 12

pcr.pred <- predict(pcr.model, kid.test, ncomp=12)
pcr.pred

#calculate RMSE
sqrt(mean((pcr.pred - kid.test$classification)^2)) #0.2454366, INCREASED slightly
#not unexpected considering PCR does well with correlated variables
#This is the average deviation between the predicted value for classification 
#and the observed value for classification for the observations in the testing set.

#reset data here for KNN

#We know KNN does NOT do well with multicollinearity 
#Curious to see if there will be an improvement here
#KNN classification
library(class)

#k=5
kid.knn<- knn(train= kid.train, test= kid.test, cl=kid.train$classification, k=5)
kid.knn
knn.cm <- table(kid.test$classification, kid.knn)
knn.cm
knn.error<- mean(kid.knn != kid.test$classification)
knn.error #0.1416667
print(paste('Accuracy =', 1-knn.error)) #"Accuracy = 0.858333333333333"
#Significantly improved

#k=10
kid.knn<- knn(train= kid.train, test= kid.test, cl=kid.train$classification, k=10)
kid.knn
knn.cm <- table(kid.test$classification, kid.knn)
knn.cm
knn.error<- mean(kid.knn != kid.test$classification)
knn.error #0.1416667
print(paste('Accuracy =', 1-knn.error)) #"Accuracy = 0.858333333333333"

#k=15
kid.knn<- knn(train= kid.train, test= kid.test, cl=kid.train$classification, k=15)
kid.knn
knn.cm <- table(kid.test$classification, kid.knn)
knn.cm
knn.error<- mean(kid.knn != kid.test$classification)
knn.error #0.15
print(paste('Accuracy =', 1-knn.error)) #"Accuracy = 0.85"

#Find best k
library(factoextra)
kid.total.data <- rbind(kid.train, kid.test)
set.seed(21)
if (dev.cur() > 1) dev.off() #close any open graphics device; skip if none is open
# function to compute total within-cluster sum of squares
fviz_nbclust(kid.total.data, kmeans, method = "wss", k.max = 18) + theme_minimal() + ggtitle("the Elbow Method")

#Decision tree
library(rpart)
kid.tree<- rpart(classification ~., method="class", data=kid.train)
kid.tree

library(rpart.plot)
rpart.plot(kid.tree)
#n= 280 
#node), split, n, loss, yval, (yprob)
#* denotes terminal node

#1) root 280 105 1 (0.37500000 0.62500000) 
#2) hemo>=13.05 116 15 0 (0.87068966 0.12931034) 
#4) rbc< 0.007932388 103 2 0 (0.98058252 0.01941748) *
#5) rbc>=0.007932388 13 0 1 (0.00000000 1.00000000) *
#3) hemo< 13.05 164 4 1 (0.02439024 0.97560976) *

pred.tree<- predict(kid.tree, kid.test, type="class")
kid.tree.table<- table(kid.test$classification, pred.tree)
kid.tree.table
sum(diag(kid.tree.table))/nrow(kid.test) #0.975 predictive accuracy

#Random forest 
library(randomForest)
rf.kid <- randomForest(classification ~ ., data=kid.train)
rf.kid

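rf.pred<-predict(rf.kid, newdata=kid.test)
rf.pred.rounded<-round(rf.pred) #classification is numeric here, so round the regression output
table(rf.pred.rounded, kid.test$classification) #.9917 predictive accuracy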

By Aspen Gulley


