Project 2: Data Mining, Classification, Prediction
Introduction
The dataset is from Rossi et al. (1980) and investigates criminal recidivism. 432 convicts were followed up one year after release. The data includes the numeric variables of week of arrest, age, number of prior convictions, and education level. The binary variables include whether they were arrested, received financial aid, had prior work experience, were black, were married, released on parole, and job status per week. The main binary variable I will be looking at is whether they were arrested and in the dataset, 318 people were not arrested while 114 were arrested again.
library(tidyverse)
library(dplyr)
library(readr)
<- read_csv("Rossi.csv")
Rossi $employment <- rowSums(Rossi[, 12:63] == "yes", na.rm = T)
Rossi<- Rossi %>%
Rossi select(-(12:63))
%>%
Rossi group_by(arrest) %>%
summarize(n())
## # A tibble: 2 x 2
## arrest `n()`
## <dbl> <int>
## 1 0 318
## 2 1 114
Additionally, I tidied the dataset by aggregating the job status per week into total weeks of employment. Also, if the person was not arrested again, their week of arrest is listed at 52 weeks.
Cluster Analysis
library(cluster)
library(GGally)
<- Rossi %>%
RossiNum select(week, age, prio, educ, employment)
<- vector()
sil_width for (i in 2:10) {
<- kmeans(RossiNum, centers = i)
kms <- silhouette(kms$cluster, dist(RossiNum))
sil <- mean(sil[, 3])
sil_width[i]
}ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k",
breaks = 1:10)
# 3 clusters
<- RossiNum %>%
pam1 pam(k = 3)
plot(pam1, which = 2, border = NA, col = "red")
# Average sil width is .5 which is weak and could be
# artificial.
<- RossiNum %>%
RossiPam pam(3)
RossiPam
## Medoids:
## ID week age prio educ employment
## [1,] 357 19 22 4 3 1
## [2,] 340 52 26 2 3 40
## [3,] 194 52 22 2 3 8
## Clustering vector:
## [1] 1 1 1 2 2 3 1 3 3 3 3 2 3 2 1 2 1 3 2 3 3 2 1 2 3 2 2 3 2 2 3 2 2 3 2 2 2
## [38] 3 3 3 2 3 1 2 3 2 3 1 2 2 2 3 2 2 2 3 2 2 2 2 1 3 3 3 3 2 2 3 3 3 3 3 2 2
## [75] 2 2 3 2 3 1 1 3 3 3 1 3 3 1 3 2 2 3 2 2 3 1 2 3 3 2
## [ reached getOption("max.print") -- omitted 332 entries ]
## Objective function:
## build swap
## 11.52252 10.58172
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
%>%
RossiNum mutate(cluster = as.factor(pam1$clustering)) %>%
ggpairs(aes(color = cluster))
I examined the numerical variables of week arrested, age, prior convictions, education level, and total weeks of employment. I computed the k-means solution for each potential number of clusters in order to find the highest silhouette width since a higher width means the clusters more cohesive and separated. I found that 3 clusters had the highest silhouette width; however, the 3-cluster solution had a silhouette width of .5 which is weak and could be artificial. The first cluster had an early arrest and less weeks of employment. The second cluster had a late/no arrest, higher age range, and more weeks of employment. The third cluster had a late/no arrest and less weeks of employment.
Dimensionality Reduction with PCA
<- princomp(RossiNum, cor = T)
pca1 summary(pca1, loadings = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.3067271 1.0236803 0.9597713 0.8850597 0.7348818
## Proportion of Variance 0.3415072 0.2095843 0.1842322 0.1566661 0.1080103
## Cumulative Proportion 0.3415072 0.5510914 0.7353236 0.8919897 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## week 0.530 0.296 0.406 0.319 0.604
## age 0.380 -0.864 -0.140 0.293
## prio -0.413 0.425 -0.291 0.750
## educ 0.229 -0.798 0.551
## employment 0.593 0.303 0.114 -0.737
$loadings[1:5, 1:2] %>%
pca1%>%
as.data.frame %>%
rownames_to_column ggplot() + geom_hline(aes(yintercept = 0), lty = 2) + geom_vline(aes(xintercept = 0),
lty = 2) + ylab("PC2") + xlab("PC1") + geom_segment(aes(x = 0,
y = 0, xend = Comp.1, yend = Comp.2), arrow = arrow(), col = "red") +
geom_label(aes(x = Comp.1 * 1.1, y = Comp.2 * 1.1, label = rowname))
library(factoextra)
fviz_pca_biplot(pca1)
$scores %>%
pca1%>%
as.data.frame mutate(week = RossiNum$week) %>%
ggplot(aes(x = Comp.1, y = Comp.2)) + geom_point(aes(color = week)) +
geom_smooth(method = "lm")
# Variance
<- pca1$sdev^2
eigval <- round(eigval/sum(eigval), 2)
varprop round(cumsum(eigval)/sum(eigval), 2)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 0.34 0.55 0.74 0.89 1.00
When running the Principal Component Analysis (PCA), I calculated the relationship between variables into separate components in order to reduce the dimensions of the dataset. I chose to retain the first 4 principal components since they explain 89% of the total variance. A strong first principal component score means a late/no arrest, young age, low number of prior convictions, and a high total weeks of employment. A strong second principal component score means a low education level and high number of prior convictions. A strong third principal component score means a young age and a late/no arrest. A strong fourth principal component means a high number of prior convictions and high education level.
Linear Classifier
# Logistic Regression
<- glm(arrest ~ age, data = Rossi, family = "binomial")
logistic_fit <- predict(logistic_fit, type = "response")
prob_reg class_diag(prob_reg, truth = Rossi$arrest, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.7361 0 1 NaN NaN 0.5 0.6403
# .6403
<- glm(arrest ~ prio, data = Rossi, family = "binomial")
logistic_fit2 <- predict(logistic_fit2, type = "response")
prob_reg2 class_diag(prob_reg2, truth = Rossi$arrest, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.7384 0.0351 0.9906 0.5714 0.0661 0.5128 0.5964
# .5964
<- glm(arrest ~ educ, data = Rossi, family = "binomial")
logistic_fit3 <- predict(logistic_fit3, type = "response")
prob_reg3 class_diag(prob_reg3, truth = Rossi$arrest, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.7361 0 1 NaN NaN 0.5 0.5658
# .5658
<- glm(arrest ~ employment, data = Rossi, family = "binomial")
logistic_fit4 <- predict(logistic_fit4, type = "response")
prob_reg4 class_diag(prob_reg4, truth = Rossi$arrest, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.7454 0.3333 0.8931 0.5278 0.4086 0.6132 0.7619
# .7619
<- glm(arrest ~ age + employment + prio + educ,
logistic_fit5 data = Rossi, family = "binomial")
<- predict(logistic_fit5, type = "response")
prob_reg5 class_diag(prob_reg5, truth = Rossi$arrest, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.7616 0.3684 0.9025 0.5753 0.4492 0.6355 0.7794
# .7794
# Confusion Matrix
<- Rossi$arrest
y <- factor(y, levels = c(1, 0))
y <- prob_reg5 * 100
x <- vector()
accuracy <- 1:100
cutoff for (i in cutoff) {
<- ifelse(x > i, 1, 0)
y_hat <- mean(y == y_hat)
accuracy[i]
}qplot(y = accuracy) + geom_line() + scale_x_continuous(n.breaks = 10)
max(accuracy)
## [1] 0.7638889
which.max(accuracy)] cutoff[
## [1] 47
<- ifelse(x > 47, 1, 0)
y_hat <- factor(y_hat, levels = c(1, 0))
y_hat table(actual = y, predicted = y_hat) %>%
addmargins
## predicted
## actual 1 0 Sum
## 1 51 63 114
## 0 39 279 318
## Sum 90 342 432
# k-fold CV
= 10
k <- Rossi[sample(nrow(Rossi)), ]
data <- cut(seq(1:nrow(Rossi)), breaks = k, labels = F)
folds <- NULL
diags for (i in 1:k) {
<- Rossi[folds != i, ]
train <- Rossi[folds == i, ]
test <- test$arrest
truth <- glm(arrest ~ age + employment + prio + educ, data = train,
fit family = "binomial")
<- predict(fit, newdata = test, type = "response")
probs <- rbind(diags, class_diag(probs, truth, positive = 1))
diags
}summarize_all(diags, mean)
## acc sens spec ppv f1 ba auc
## 1 0.75011 0.32855 0.90057 0.52865 0.39289 0.61457 0.76415
After running a logistic regression on each numeric variables against the binary variable of arrest, the variables with the highest area under the curve was age and total weeks of employment which means they are the best factors in predicting whether someone gets arrested again. I ran another logistic regression with all the numeric variables and the area under the curve was .7794 which means the model is fairly good. The confusion matrix shows that 330 participants out of 432 were accurately identified as arrested or not. After performing a k-fold cross-validation with 10 folds, the area under the curve is .76415 which is similar to the logistic regression model in performing fairly. There does not seem to be signs of overfitting since the cross-validation did not perform significantly worse than the logistic regression model.
Non-Parametric Classifier
library(caret)
<- knn3(arrest ~ age + employment + prio + educ, data = Rossi)
knn_fit <- predict(knn_fit, Rossi)[, 2]
prob_knn class_diag(prob_knn, Rossi$arrest, positive = 1)
## acc sens spec ppv f1 ba auc
## Metrics 0.8102 0.5175 0.9151 0.686 0.59 0.7163 0.8527
# Confusion Matrix
<- Rossi$arrest
y2 <- factor(y2, levels = c(1, 0))
y2 <- prob_reg5 * 100
x2 <- vector()
accuracy <- 1:100
cutoff for (i in cutoff) {
<- ifelse(x2 > i, 1, 0)
y_hat2 <- mean(y2 == y_hat2)
accuracy[i]
}qplot(y = accuracy) + geom_line() + scale_x_continuous(n.breaks = 10)
max(accuracy)
## [1] 0.7638889
which.max(accuracy)] cutoff[
## [1] 47
<- ifelse(x2 > 47, 1, 0)
y_hat2 <- factor(y_hat2, levels = c(1, 0))
y_hat2 table(actual = y2, predicted = y_hat2) %>%
addmargins
## predicted
## actual 1 0 Sum
## 1 51 63 114
## 0 39 279 318
## Sum 90 342 432
# k-fold CV
= 10
k2 <- Rossi[sample(nrow(Rossi)), ]
data2 <- cut(seq(1:nrow(Rossi)), breaks = k2, labels = F)
folds2 <- NULL
diags2 for (i in 1:k2) {
<- Rossi[folds2 != i, ]
train2 <- Rossi[folds2 == i, ]
test2 <- test2$arrest
truth2 <- glm(arrest ~ age + employment + prio + educ, data = train,
fit2 family = "binomial")
<- predict(fit2, newdata = test2, type = "response")
probs2 <- rbind(diags2, class_diag(probs2, truth2, positive = 1))
diags2
}summarize_all(diags2, mean)
## acc sens spec ppv f1 ba auc
## 1 0.7571 0.31599 0.91234 0.54214 0.38802 0.61415 0.77712
Using k-nearest-neighbors with all numeric variables, the area under the curve was .8527 which means a good performance from the model. The confusion matrix shows that the model identified 330 people correctly as arrested or not. However, the cross-validation shows that the area under the curve dropped to .77712 which means there is likely overfitting. Both cross-validations from the nonparametric model and the linear model performed similarly.
Regression/Numeric Prediction
# Linear Regression
<- lm(arrest ~ age + employment + prio + educ, data = Rossi)
fit3 <- predict(fit3)
yhat3 mean((Rossi$arrest - yhat3)^2)
## [1] 0.1603791
# k-fold CV
= 10
k3 <- Rossi[sample(nrow(Rossi)), ]
data3 <- cut(seq(1:nrow(Rossi)), breaks = k3, labels = F)
folds3 <- NULL
diags3 for (i in 1:k3) {
<- data3[folds3 != i, ]
train3 <- data3[folds3 == i, ]
test3 <- lm(arrest ~ ., data = train3)
fit3k <- predict(fit3k, newdata = test3)
yhat3k <- mean((test3$arrest - yhat3k)^2)
diags3
}mean(diags3)
## [1] 0.08572541
The mean squared error (MSE) of the linear regression was 0.16 which means it is a very good fit. The MSE for the cross-validation of the linear regression was even smaller at .03 which is good because it means overfitting is unlikely.
Python
library(reticulate)
# use_python('/usr/local/bin/python3')
py_install(packages = "matplotlib")
py_install(packages = "numpy")
py_install(packages = "pandas")
py_install(packages = "seaborn")
<- import("matplotlib")
matplotlib $use("Agg", force = TRUE) matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
"age", "employment", c="arrest", data=r.Rossi)
plt.scatter('Age')
plt.xlabel('Weeks of Employment')
plt.ylabel('Arrests for Age vs Employment')
plt.title( plt.show()
=r.Rossi
pRossi"pRossi.csv")
pRossi.to_csv(= pd.read_csv("pRossi.csv", index_col=0)
pRossi
= (pRossi.query('arrest == 1').filter(['age', 'employment']))
aRossi = (pRossi.query('arrest == 0').filter(['age', 'employment'])) xRossi
library(gt)
$aRossi %>%
pysummarize(Age = mean(age), Employment = mean(employment)) %>%
gt() %>%
fmt_number(columns = 1:2, decimals = 2) %>%
tab_header(title = "Mean Statistics of Arrested")
Mean Statistics of Arrested | |
---|---|
Age | Employment |
22.76 | 9.78 |
$xRossi %>%
pysummarize(Age = mean(age), Employment = mean(employment)) %>%
gt() %>%
fmt_number(columns = 1:2, decimals = 2) %>%
tab_header(title = "Mean Statistics of Not Arrested")
Mean Statistics of Not Arrested | |
---|---|
Age | Employment |
25.25 | 25.67 |
To demonstrate sharing objects between R and Python, I called the Rossi datset from R in Python in order to use matplotlib to create a scatterplot of age vs employment. Then, I wrangled the data in Python and called the subsets of data in R where I continued my analysis of age and employment for arrest.
Concluding Remarks
With various machine learning models, I was able to analyze factors in recidivism. My cluster analysis shows that people who are not employed and young are more likely to get arrested again. This shows the importance of reducing stigma against convicts and encourage companies to hire people previously convicted. My principal component analysis shows that young, employed people with low prior convictions and a late/no arrest consisted of over a third of the variance in my data. My models for the linear and nonparametric classifiers performed fairly in predicting whether a person will get arrested again and if I were to repeat this project, I would like to explore more types of models to see if I can maximize accuracy. With Python and R, I was able to analyze how the major factors of age and total weeks of employment correlated with getting arrested again. As seen in the tables, people who are arrested again tend to have a younger age and significantly less weeks of employment. Overall, I learned the young and unemployed are at a higher risk of getting arrested again and it is important to provide career services to convicts in order to reduce recidivism.