Aspen Gulley

Data Scientist | Behavior Analyst in Training


Exploring the State and Arrests Data with Hierarchical Clustering, Star Plots, and a Self-Organizing Map


In the first part of this analysis, I am going to show an example of hierarchical clustering and how correlation can aid in understanding dendrogram results. I will then explore some star plots.

State data released by the U.S. Department of Commerce, Bureau of the Census, is available in R. I will cluster this data using hierarchical clustering and focus on the variables Population, Income, Illiteracy, Life Exp, Murder, HS Grad, Frost, and Area. I will then interpret the dendrogram with the aid of correlation.

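data(state)
state <- state.x77
library(corrplot)

# Pairwise correlations between the eight state variables
cor.matrix <- cor(as.matrix(state))
cor.matrix

quartz()  # macOS graphics device; use dev.new() on other platforms
corrplot(cor.matrix)

?hclust
# Cluster the variables with average linkage on distances between rows of the correlation matrix
clust <- hclust(dist(cor.matrix), "average")
quartz()
plot(clust)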

Results:

The correlation plot aids in the interpretation of the relationships between variables in the dendrogram.

Starting on the dendrogram with HS Grad, I can see that it is closely related to Income. The correlation shows that these variables have a positive correlation of approximately 62%. Next closest to HS Grad on the dendrogram is Life Exp, with a positive correlation of approximately 58%. This makes sense because people with an education and income are likely to have better access to healthcare, healthier foods, etc., so higher life expectancy is to be expected. Interestingly, Frost and HS Grad have a positive correlation of 36% and are also on that same first large branch of the dendrogram. On the second branch, Illiteracy and Murder are positively correlated at 70%. Murder and Population, next closest on that branch, are positively correlated at 34%. Area and Murder have a positive correlation of 22%.

Negative correlations are revealed between the branches of the dendrogram. For example, HS Grad and Illiteracy have a negative correlation of approximately 66%. HS Grad and Murder are also negatively correlated at 48%, suggesting that education level could be an important variable in predicting homicides. Income and Illiteracy are negatively correlated at 43%, so as illiteracy drops, income rises and vice versa. Frost and Population have a negative correlation of 33%.
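The percentages quoted above are correlation coefficients read off the matrix (62% corresponds to r of about 0.62). As a minimal sketch, the coefficients for specific pairs can be pulled out directly; the names pairs.of.interest and quoted are only illustrative and are not part of the original analysis.

```r
# Correlation matrix of the eight state variables (same as cor.matrix in the post)
cor.matrix <- cor(state.x77)

# Variable pairs discussed in the text (illustrative selection)
pairs.of.interest <- rbind(
  c("HS Grad", "Income"),
  c("HS Grad", "Life Exp"),
  c("Illiteracy", "Murder"),
  c("HS Grad", "Illiteracy"),
  c("HS Grad", "Murder")
)

# Indexing a named matrix with a two-column character matrix
# picks one cell per row of the index
quoted <- data.frame(
  var1 = pairs.of.interest[, 1],
  var2 = pairs.of.interest[, 2],
  r    = round(cor.matrix[pairs.of.interest], 2)
)
quoted
```

A negative r in this table marks a pair that sits on opposite large branches of the dendrogram.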

The dendrogram/correlation findings can be visualized using star plots for each state.

?stars
quartz()
stars(state, key.loc = c(16, 1.25), draw.segments = T)

Add regions to the data and sort the states by their regions.

region<- as.data.frame(state.region)
state<-cbind(state, region)
head(state)

plot(state$state.region)

which(state$state.region == "South")
S<- state[c(1, 4, 8, 9, 10, 17, 18, 20, 24, 33, 36, 40, 42, 43, 46, 48),]
S<- S[-9] #delete region
stars(S, key.loc = c(.1, 10), draw.segments = T)

which(state$state.region == "West")
W<- state[c(2, 3, 5, 6, 11, 12, 26, 28, 31, 37, 44, 47, 50),]
W<- W[-9]
stars(W, key.loc = c(8, 2), draw.segments = T)

which(state$state.region == "Northeast")
NE<- state[c(7, 19, 21, 29, 30, 32, 38, 39, 45),]
NE<- NE[-9]
stars(NE, key.loc = c(.03, 7), draw.segments = T)

which(state$state.region == "North Central")
NC<- state[c(13, 14, 15, 16, 22, 23, 25, 27, 34, 35, 41, 49),]
NC<- NC[-9]
stars(NC, key.loc = c(.05, 10), draw.segments = T)
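The regional subsets can also be built without typing row indices by hand. The following is a sketch of one alternative, assuming the built-in state.x77 and state.region objects; states.df and by.region are illustrative names, not objects from the code above.

```r
# Keep the original column names ("Life Exp", "HS Grad") intact
states.df <- data.frame(state.x77, check.names = FALSE)

# split() returns one data frame per level of the region factor
by.region <- split(states.df, state.region)
sapply(by.region, nrow)

# Star plot for one region; the other regions work the same way
stars(by.region$South, draw.segments = TRUE)
```

Because the region is used only as the grouping factor, there is no region column to delete before calling stars().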

Being a true crime junkie, I want to look more closely at the murder rates in the states. I will sort the rates in descending order.

murd <- state[order(-state$Murder),]
murd[1:10,]

Here are the top 10 states with the highest murder rates and their regions:

  1. Alabama, South
  2. Georgia, South
  3. Louisiana, South
  4. Mississippi, South
  5. Texas, South
  6. South Carolina, South
  7. Nevada, West
  8. Alaska, West
  9. Michigan, North Central
  10. North Carolina, South

Top 10 states with highest illiteracy rates and their regions:

  1. Louisiana, South
  2. Mississippi, South
  3. South Carolina, South
  4. New Mexico, West
  5. Texas, South
  6. Alabama, South
  7. Georgia, South
  8. Arkansas, South
  9. Hawaii, West
  10. Arizona, West
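The illiteracy ranking can be produced the same way as the murder ranking. This is a minimal sketch that rebuilds the data frame from the built-in objects so it runs on its own; states.df and top.illit are illustrative names.

```r
# State variables plus region, keeping the original column names
states.df <- data.frame(state.x77, region = state.region, check.names = FALSE)

# Sort by illiteracy in descending order and keep the first ten rows
top.illit <- head(
  states.df[order(-states.df$Illiteracy), c("Illiteracy", "region")],
  10
)
top.illit
```

Several states tie on illiteracy, so the order within a tie simply follows the original row order of the data.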

In the dendrogram murder and illiteracy were highly correlated; moreover, the star plots showed that the South has the highest murder and illiteracy rates. These lists confirm that finding.

Here are the states hierarchically clustered using complete linkage:

s <- scale(state.x77)  # standardize the eight numeric variables (region column excluded)
clust <- hclust(dist(s), "complete")
quartz()
plot(clust)

Five clusters of states can be identified:

library(aweSOM)  # provides aweSOMdendrogram()
quartz()
aweSOMdendrogram(clust, 5)
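Cluster membership can also be read off in code rather than from the plot. This is a sketch using base R's cutree() on the same scaled data and linkage; groups is an illustrative name.

```r
# Same inputs as above: standardized state variables, complete linkage
s <- scale(state.x77)
clust <- hclust(dist(s), "complete")

# Cut the tree into five groups; the result is a named integer vector
groups <- cutree(clust, k = 5)

# States in each group, and the group sizes
split(names(groups), groups)
table(groups)
```

Note that cutree() numbers the groups by the order in which states appear in the data, so the labels need not match the left-to-right order of the dendrogram.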

The first cluster consists of just Alaska.

a<- s[c("Alaska"),]
a<-as.data.frame(a)
a
quartz()
stars(a, draw.segments = T)

The second cluster includes Alabama, Georgia, Louisiana, Mississippi, South Carolina, New Mexico, West Virginia, Arkansas, North Carolina, Kentucky, and Tennessee.

While examining the star plots, you can see that the smaller branches within the clusters group states by variable similarity; the smaller branches identify more nuanced information. For example, within cluster two, Alabama and Georgia are on one branch, while Louisiana, Mississippi, and South Carolina are on another. Then there is New Mexico branching off, with West Virginia, Arkansas, North Carolina, Kentucky, and Tennessee all in their own little sub-group of the same branch.

c2<- s[c("Alabama", "Georgia", "Louisiana", "Mississippi", "South Carolina", "New Mexico",
 "West Virginia", "Arkansas", "North Carolina", "Kentucky", "Tennessee"),]
quartz()
stars(c2, key.loc= c(8,2), draw.segments = T)

The third cluster of states consists of Nevada, Rhode Island, Connecticut, North Dakota, South Dakota, Maine, New Hampshire, Vermont, Colorado, Montana, Wyoming, Minnesota, Wisconsin, Kansas, Iowa, Nebraska, Idaho, and Utah.

c3<- s[c("Nevada", "Rhode Island", "Connecticut", "North Dakota", "South Dakota",
 "Maine", "New Hampshire", "Vermont", "Colorado", "Montana", "Wyoming", "Minnesota",
 "Wisconsin", "Kansas", "Iowa", "Nebraska", "Idaho", "Utah"),]
quartz()
stars(c3, key.loc= c(8,1.8), draw.segments = T)

The fourth cluster includes New York, Ohio, Pennsylvania, Illinois, Michigan, California, and Texas.

c4<- s[c("New York", "Ohio", "Pennsylvania", "Illinois", "Michigan",
 "California", "Texas"),]
quartz()
stars(c4, key.loc= c(6,1.8), draw.segments = T)

The fifth cluster includes Hawaii, Oregon, Washington, Arizona, Florida, Virginia, Oklahoma, Indiana, Missouri, Delaware, Maryland, Massachusetts, and New Jersey.

c5<- s[c("Hawaii", "Oregon", "Washington", "Arizona", "Florida",
 "Virginia", "Oklahoma", "Indiana", "Missouri", "Delaware", "Maryland",
 "Massachusetts", "New Jersey"),]
quartz()
stars(c5, key.loc= c(6,1.8), draw.segments = T)

For the second part of this analysis, I am going to transition into examining the 1973 arrests data. This data includes all 50 states and their observations on 4 variables: murder arrests per 100,000, assault arrests per 100,000, urban population percent, and rape arrests per 100,000. First I will explore the arrests data through hierarchical clustering, and then I will examine it using a self-organizing map.

data("USArrests")
arrests<-USArrests

Hierarchical cluster dendrogram with complete linkage, computed with Euclidean distance:

d <- dist(arrests, method = "euclidean")
hc <- hclust(d, method = "complete")
hc
quartz()
plot(hc)

If I cut this tree at 150, I see three major clusters in this data.

c1<-arrests[c("Florida", "North Carolina", "Delaware", "Alabama", "Louisiana", "Alaska", "Mississippi", "South Carolina", "Maryland", "Arizona", "New Mexico", "California", "Illinois", "New York", "Michigan", "Nevada"),]

c2<-arrests[c("Missouri", "Arkansas", "Tennessee", "Georgia", "Colorado", "Texas", "Rhode Island", "Wyoming", "Oregon", "Oklahoma", "Virginia", "Washington", "Massachusetts",
 "New Jersey"),]

c3<- arrests[c("Ohio", "Utah", "Connecticut", "Pennsylvania",
 "Nebraska", "Kentucky", "Montana", "Idaho", "Indiana", "Kansas",
 "Hawaii", "Minnesota", "Wisconsin", "Iowa", "New Hampshire", "West Virginia", "Maine", "South Dakota", "North Dakota", "Vermont"),]
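The three groups can also be obtained by cutting the tree in code instead of typing the state names. This is a minimal sketch using cutree() with the height of 150 mentioned above; cut150 and members are illustrative names.

```r
# Same clustering as above: Euclidean distance, complete linkage
hc <- hclust(dist(USArrests, method = "euclidean"), method = "complete")

# Cut the dendrogram at height 150; returns a cluster label per state
cut150 <- cutree(hc, h = 150)

# States in each cluster and the cluster sizes
members <- split(names(cut150), cut150)
sapply(members, length)

# Subset the data for one cluster, ready for stars()
first.cluster <- USArrests[members[[1]], ]
```

The cluster labels follow the order of the rows in the data, so they may not line up with the c1, c2, c3 naming used in this post.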

The star plots really help illuminate why states were put into their respective clusters. Here is a look at the first cluster:

quartz()
stars(c1, key.loc = c(12, 2), draw.segments = T)

Florida and North Carolina are clustered on their own little branch on the far left because they both have high rates of assault and murder. Delaware, Alabama, and Louisiana have similar urban population percentages, but Alabama and Louisiana are on their own small branch because they are also similar in murder rates. Alaska, Mississippi, and South Carolina have similar assault rates. Maryland, Arizona, and New Mexico are similar in rape and assault rates. California, Illinois, New York, Michigan, and Nevada all have high urban populations.

The second cluster:

quartz()
stars(c2, key.loc = c(10, 2), draw.segments = T)

Missouri, Arkansas, Tennessee, Georgia, Colorado, and Texas all have similarities in assault rates. Colorado, Texas, and Rhode Island have similarities in urban population, but Rhode Island distinguishes itself with extremely low rape and murder rates, which is why it is on a different branch. Wyoming, Oregon, Oklahoma, and Virginia have some similarities in rape rates, with smaller urban populations. Washington, Massachusetts, and New Jersey all have higher urban populations, which results in that group being created in the dendrogram.

The third cluster:

quartz()
stars(c3, key.loc = c(13, 2), draw.segments = T)

There is a subsection in the dendrogram that consists of Ohio, Utah, Connecticut, Pennsylvania, Nebraska, Kentucky, Montana, Idaho, Indiana, and Kansas. These states all have similar assault rates. Overall, their urban populations are also comparable. I can break down the smaller branches and see more similarity. For example, Ohio and Utah are similar across the three variables of assault, urban population, and rape, while on a different branch Connecticut and Pennsylvania are similar in assault and urban population.

The second subsection includes Hawaii, Minnesota, Wisconsin, Iowa, New Hampshire, West Virginia, Maine, South Dakota, North Dakota, and Vermont. Hawaii, Minnesota, Wisconsin, Iowa, and New Hampshire have similarities in urban population, but the first three are unique enough to have their own subbranches, while Iowa and New Hampshire are branched together because they are more similar across variables. West Virginia, Maine, and South Dakota are branched together due to similarities in assault rates, while North Dakota and Vermont have their own little subbranch on the far right.

I am going to fit a SOM to the data now.

library(kohonen)  # provides somgrid() and som()

arrests.scaled <- scale(arrests)
set.seed(7)
som.grid <- somgrid(xdim = 4, ydim = 4, topo = "hexagonal")
arrests.som <- som(arrests.scaled, grid = som.grid, rlen = 1000)
arrests.som$codes
codes <- arrests.som$codes[[1]]
codes
arrests.som$unit.classif
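The SOM is fit on standardized data, while the hierarchical clustering earlier used the raw arrests values. The sketch below, in base R only, shows what scale() changes; raw.sd and scaled.sd are illustrative names. It is worth keeping in mind when comparing the two sets of results, since Euclidean distance on raw values is driven mostly by the variable with the largest spread.

```r
# Spread of each variable on its original scale
raw.sd <- sapply(USArrests, sd)
round(raw.sd, 1)

# scale() centers each column at 0 and rescales it to unit standard deviation
arrests.scaled <- scale(USArrests)
scaled.sd <- apply(arrests.scaled, 2, sd)
round(scaled.sd, 1)
round(colMeans(arrests.scaled), 10)

# The variable that dominates unscaled Euclidean distances
names(which.max(raw.sd))
```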

quartz()
plot(arrests.som, main = "Arrest Data")

The iterations reach convergence around 800.

quartz()
plot(arrests.som, type = "changes", main = "Arrest Data")

quartz()
plot(arrests.som, type = "count")

quartz()
plot(arrests.som, type = "mapping")

The following U-matrix shows that four groups might be appropriate. The red node is being flagged as very dissimilar from the group. This likely represents the node that Alaska is mapped to because Alaska is a unique state.

coolBlueHotRed <- function(n, alpha = 1){rainbow(n, end=4/6, alpha = alpha)[n:1]}
quartz()
plot(arrests.som, type = "dist.neighbours", palette.name = coolBlueHotRed)

Here are the variables broken down through a heat map:

for (i in 1:4){
 quartz()
 plot(arrests.som, type = "property", property=codes[,i], main = colnames(codes)[i])
}

In the bottom right of these heat maps, higher urban population is associated with higher assault, murder, and rape rates; the inverse is also true, as can be seen in the upper left. The top right has high murder and assault rates, while the bottom left has higher urban population.

Hierarchical clustering of the SOM codebook vectors:
d <- dist(codes)
hc <- hclust(d)

quartz()
plot(hc)

In order of appearance on the dendrogram from left to right, here are the groups of states and which node those states are being mapped to:

Maine 13 
North Dakota 13
South Dakota 13 
Vermont 13 
West Virginia 13

Iowa 9 
Minnesota 9 
New Hampshire 9 
Wisconsin 9

Idaho 14 
Montana 14 
Nebraska 14

Rhode Island 1 
Massachusetts 1 
New Jersey 1

Connecticut 5 
Hawaii 5 
Utah 5

Delaware 2

Missouri 7 
Oregon 7 
Washington 7

Ohio 6 
Pennsylvania 6

Indiana 10 
Kansas 10
Oklahoma 10 
Virginia 10 
Wyoming 10

Arkansas 15 
Kentucky 15

Alabama 11 
Georgia 11 
Louisiana 11 
Tennessee 11

Mississippi 16 
North Carolina 16 
South Carolina 16

Alaska 12

California 4 
Colorado 4 
Nevada 4

Texas 3 
Arizona 3 
Illinois 3 
New York 3

Florida 8 
Maryland 8 
Michigan 8 
New Mexico 8
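A listing like the one above can be generated from the unit.classif element of the fitted SOM. This is a sketch under the assumption that the kohonen package is installed (the code is skipped otherwise); fit and node.members are illustrative names, and the node numbers depend on the seed and package version, so they may not match the listing exactly.

```r
# Only runs when the kohonen package is available
if (requireNamespace("kohonen", quietly = TRUE)) {
  set.seed(7)
  arrests.scaled <- scale(USArrests)
  fit <- kohonen::som(
    arrests.scaled,
    grid = kohonen::somgrid(xdim = 4, ydim = 4, topo = "hexagonal"),
    rlen = 1000
  )

  # unit.classif holds the winning node for each row (state), in row order
  node.members <- split(rownames(arrests.scaled), fit$unit.classif)
  node.members
}
```

Nodes with no states mapped to them do not appear in the result, which is why a 4 by 4 grid can yield fewer than 16 entries.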

Comparing the SOM to the hierarchical clustering results, group 13 is similar to the bottom-right group on the dendrogram. Next, on the bottom left, is group 9, which also maps similarly. Idaho, Montana, and Nebraska are grouped together on the SOM dendrogram as group 14. This makes sense because these three states are very similar in assault, rape, and urban population. Rhode Island, Massachusetts, and New Jersey are on the same branch of the complete linkage dendrogram but are separated within it; in the SOM they are grouped together due to similarities in urban population. Connecticut, Utah, and Hawaii, group 5 in the SOM, are grouped together based on urban population. Hawaii is the oddball in this group when considering the other variables, and the complete linkage clustering separated Hawaii out with respect to those variables in this case. Delaware has low crime, so it maps onto its own little group.

Oregon, Missouri, and Washington are on the same branch but are fairly separated in the cluster dendrogram; the SOM maps these states into one group based on similar rape rates. The SOM maps Ohio and Pennsylvania together based on assault and urban population. Group 10 includes Indiana, Kansas, Oklahoma, Virginia, and Wyoming. Indiana and Kansas are very similar with high assault rates. Oklahoma and Virginia are also very similar with low assault rates. The complete hierarchical clustering differentiates these states better than the SOM grouping does. Arkansas and Kentucky are grouped together by the SOM based on similar assault and rape rates, which differs from the complete dendrogram results. Alabama, Georgia, Louisiana, and Tennessee are grouped together by the SOM based on murder rates. Mississippi, North Carolina, and South Carolina, group 16, are also grouped together based on murder rates.

Alaska is in its own group because it is a very unique state; it has a high rape rate with a very low urban population. Group 4 (California, Colorado, and Nevada) is grouped based on high urban population and rape rates. Texas, Arizona, Illinois, and New York are grouped together based on urban population, while group 8 (Florida, Maryland, Michigan, and New Mexico) is grouped together based on urban population, rape, and murder rates.

Overall, I see consistency between the cluster each state was assigned to in the complete hierarchical clustering dendrogram and the group the SOM put that state into. This exercise emphasizes how the SOM simplifies the state groups, sometimes by focusing on just one dominant variable. The hierarchical clustering dendrogram was a little more nuanced in its results. The nuance is where the inconsistency between results occurs; it does not mean that either is incorrect. Because the complete hierarchical clustering results were more nuanced, the star plots were necessary to help me understand them. This demonstrates both an advantage and a limitation of hierarchical clustering relative to the SOM.

I would prefer the SOM when data reduction is necessary. If the data is too noisy, the complete hierarchical clustering will become too complicated to understand, even with other resources to assist interpretation (and its performance would likely suffer). On the other hand, hierarchical clustering is easy to understand and interpret when the data does not have too many dimensions and there isn't too much data. The SOM version of hierarchical clustering takes a little more analysis to understand because the information within the node groups isn't as easily identifiable, but the model is beneficial for data reduction, and by combining the SOM with hierarchical clustering I can still get the benefit of seeing the relationships presented through a dendrogram.

By Aspen Gulley


