PSY341K Machine Learning: Ecommerce Project
Introduction
Due to the outbreak of COVID-19 and lockdown, e-commerce increased in popularity. Retailers and manufacturers had to be adaptable to fulfill changing demands, manage inventory levels, and improve their online shopping experience as buyers stayed at homes and minimized their travels to physical locations. According to the US Census Bureau (2022), e-commerce accounted for around 11% of all retail sales in 2019. By the end of the second quarter of 2020, e-commerce had risen to more than 16% of overall retail sales.
In order to keep up with demand and sustain inventory levels fueled by the e-commerce boom, retailers generated different strategies for efficient, dependable, and cost-effective means to move a massive volume of freight. As a result, discovering the factors influencing whether a product will arrive on time is crucial. In this project, we aim to investigate the factors influencing whether a package will arrive on time utilizing machine learning algorithms. We conducted exploratory analysis, feature selection, dimensionality reduction, predictive modeling, and boosting to improve algorithm accuracy.
Research Question: What factors influence whether a package arrives on time or not?
Machine Learning Solution: Create a model predicting whether a package will arrive on time
Data Wrangling
library(readr)
<- read_csv("DataSets/Train.csv") # original dataset
ecom head(ecom)
## # A tibble: 6 x 12
## ID Warehouse_block Mode_of_Shipment Customer_care_calls Customer_rating
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 D Flight 4 2
## 2 2 F Flight 4 5
## 3 3 A Flight 2 2
## 4 4 B Flight 3 3
## 5 5 C Flight 2 2
## 6 6 F Flight 3 1
## # ... with 7 more variables: Cost_of_the_Product <dbl>, Prior_purchases <dbl>,
## # Product_importance <chr>, Gender <chr>, Discount_offered <dbl>,
## # Weight_in_gms <dbl>, Reached.on.Time_Y.N <dbl>
The dataset consists of 12 variables. First, there is the ID variable identifying the package order. There are the categorical variables of warehouse where the package was stored (labeled A to F), mode of shipment (Flight, Ship, or Road), gender of customer, and product importance rated by customer (Low, Medium, High). Then, there are the numerical variables of the number of customer care calls received about the package, customer rating of the package, cost of the product, number of prior purchases of the customer, discount offered on the product, and weight of the package. Lastly, there is the binary variable of whether the package arrived on time and this is the variable we will be conducting our binary classification.
The data wrangling was conducted in R and Excel. The ordinal categorical variables were converted into numerical variables. For example, product importance ranges from low, medium, to high and we converted all lows to 0, mediums to 1, and highs to 2. Nominal categorical variables such as warehouse and mode were removed for models analyzing numerical variables.
The dataset was also randomized and split into a training dataset of 7000 points and a testing set of 3999 points for the predictive models.
Exploratory Analysis: Cluster Analysis
I ran a cluster analysis with numerical variables and ordinal categorical variables (so removing warehouse and mode).
library(gt)
library(cluster)
library(GGally)
library(dplyr)
<- read_csv("DataSets/numerEcom.csv") # dataset with vars converted to numeric
ecomNum <- ecomNum %>% select(calls, rating, cost, purchases, importance, Gender, discount, weight, onTime) ecomNum
<- vector()
sil_width for (i in 2:10) {
<- kmeans(ecomNum, centers = i)
kms <- silhouette(kms$cluster, dist(ecomNum))
sil <- mean(sil[, 3])
sil_width[i]
}ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k", breaks = 1:10)
We can see that 2 clusters is most optimal by measuring the silhouette width.
= 2
k <- kmeans(ecomNum, k)
myclusters
$size myclusters
## [1] 6596 4401
$centers myclusters
## calls rating cost purchases importance Gender discount weight
## 1 3.811552 2.992116 204.8560 3.374318 0.6517586 0.5001516 8.214827 4860.131
## 2 4.418087 2.987957 218.1934 3.857532 0.5339696 0.5098841 21.106567 1796.336
## onTime
## 1 0.4860522
## 2 0.7623267
Here we can see that one cluster has 4,402 observations while the other cluster has 6597 observations. The center of the clusters vary most with discount offered, weight in grams, and whether the package reached on time.
Feature Selection: Scatterplot Correlation Matrix
%>% mutate(cluster = as.factor(myclusters$cluster)) %>%
ecomNum ggpairs(aes(color = cluster), upper = list(continuous = wrap("cor", size = 2.5))) +
theme(axis.line=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank()) + labs(title = "Scatterplot Correlation Matrix")
Whether the package arrived on time is significantly correlated with calls, cost, past purchases, discount, and weight. From the correlation with whether the package reached on time, it seems that rating and gender have an insignificant correlation and importance of the package has a weak correlation.
Findings
After running the cluster analysis, cluster 2 stood out because it was correlated with high percentage of the package being received on time and has the characteristics of high discount and low weight compared to cluster 1.
To keep the variance, a Principal Component Analysis will be conducted. However, the rating and gender variables will be removed because there was no correlation with the onTime variable or between clusters.
Other significant findings include that cost and calls are positively correlated, weight and calls are negatively correlated, and weight and discount are negatively correlated.
Dimensionality Reduction: Principal Component Analysis
Calculating PCA in Python
To view the principal component analysis, view the Python Colab Notebook here:
https://colab.research.google.com/drive/1kizyWORPAvHBI1VIdyWItLIV20FMN0Rc?usp=sharing
PCA Loadings
<- ecomNum %>% select(calls, cost, purchases, importance, discount, weight)
ecomNum2 <- princomp(ecomNum2, cor = T)
pcas summary(pcas, loadings = T)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.2772097 1.1634978 1.0096215 0.9334929 0.8226014
## Proportion of Variance 0.2718774 0.2256212 0.1698893 0.1452348 0.1127788
## Cumulative Proportion 0.2718774 0.4974986 0.6673879 0.8126227 0.9254016
## Comp.6
## Standard deviation 0.66902218
## Proportion of Variance 0.07459845
## Cumulative Proportion 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## calls 0.589 0.163 0.216 0.651 0.394
## cost 0.490 0.279 0.452 -0.687
## purchases 0.388 0.106 -0.440 -0.757 -0.213 0.163
## importance -0.146 -0.893 0.403 -0.106
## discount -0.768 0.118 -0.207 0.591
## weight -0.490 0.542 0.676
Here we can see that the first component consists of a high number of calls, costs, and low weight, the second component consists of low discount and high weight, the third component consists of low purchases and low importance, and the fourth component consists of high ocst, low purchases, and high importance.
$scores %>%
pcas%>%
as.data.frame mutate(onTime = ecomNum$onTime) %>%
ggplot(aes(x = Comp.1, y = onTime)) + geom_point() +
geom_smooth(method = "glm")
$scores %>%
pcas%>%
as.data.frame mutate(onTime = ecomNum$onTime) %>%
ggplot(aes(x = Comp.2, y = onTime)) + geom_point() +
geom_smooth(method = "glm")
$scores %>%
pcas%>%
as.data.frame mutate(onTime = ecomNum$onTime) %>%
ggplot(aes(x = Comp.3, y = onTime)) + geom_point() +
geom_smooth(method = "glm")
$scores %>%
pcas%>%
as.data.frame mutate(onTime = ecomNum$onTime) %>%
ggplot(aes(x = Comp.4, y = onTime)) + geom_point() +
geom_smooth(method = "glm")
After analyzing the effect of each component with the onTime variable, the second component stands out. The linear model shows that a stronger second component leads to the package becoming less likely to arrive on time. These results mirror the cluster analysis as the second component is correlated with a low discount, high weight, and the package not arriving on time.
Predictive Modeling: KNN
We used the training dataset to build the classification system and then used the knn to predict/classify the testing dataset as “on time” or not “on time” based on the neighbors in the training dataset.
library(class)
library(gmodels)
<- read.csv("DataSets/ecommerceconverted_1.csv", header = TRUE) ecommerce
We tested different k numbers to figure out which would allow the highest accuracy for our system which we found was k = 3.
Non-Normalized Data
# wrangling
<- ecommerce[1:7000, ]
ecommerce_train <- ecommerce[7001:10999, ]
ecommerce_test <- ecommerce[1:7000, 9]
ecommerce_train_labels <- ecommerce[7001:10999, 9]
ecommerce_test_labels
# finding k
<- vector()
accuracy for (i in seq(1, 11, by = 2)) {
<- knn(train = ecommerce_train, test = ecommerce_test, cl = ecommerce_train_labels, k = i)
ecommerce_pred <- length(which(ecommerce_test_labels == ecommerce_pred)) /length(ecommerce_pred)
accuracy[i]
}<- data.frame(x= 1:11, y = accuracy)
df ggplot(df, aes(x = x, y = y)) + geom_point() + geom_line(data=df[!is.na(df$y),]) + labs(x = "K Neighbors", y = "Accuracy")
When testing the model, it performed with 65.77% accuracy.
# testing
<- knn(train = ecommerce_train, test = ecommerce_test, cl = ecommerce_train_labels, k = 3)
ecommerce_pred CrossTable(x = ecommerce_test_labels, y= ecommerce_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3999
##
##
## | ecommerce_pred
## ecommerce_test_labels | 0 | 1 | Row Total |
## ----------------------|-----------|-----------|-----------|
## 0 | 997 | 638 | 1635 |
## | 0.610 | 0.390 | 0.409 |
## | 0.578 | 0.281 | |
## | 0.249 | 0.160 | |
## ----------------------|-----------|-----------|-----------|
## 1 | 729 | 1635 | 2364 |
## | 0.308 | 0.692 | 0.591 |
## | 0.422 | 0.719 | |
## | 0.182 | 0.409 | |
## ----------------------|-----------|-----------|-----------|
## Column Total | 1726 | 2273 | 3999 |
## | 0.432 | 0.568 | |
## ----------------------|-----------|-----------|-----------|
##
##
Normalized Data
# wrangling and normalizing
<- read.csv("DataSets/ecommerceconverted_1.csv", header = TRUE)
ecommerce <- function(x) {
normalize return ((x - min(x)) / (max(x) - min(x)))
}<- as.data.frame(lapply(ecommerce[1:8], normalize))
ecommerce_n <- cbind(ecommerce_n,ecommerce[c(9)])
ecommerce_n_with_time
<- ecommerce_n[1:7000, ]
ecommerce_train <- ecommerce_n[7001:10999, ]
ecommerce_test <- ecommerce_n_with_time[1:7000, 9]
ecommerce_train_labels <- ecommerce_n_with_time[7001:10999, 9]
ecommerce_test_labels
# finding k
<- vector()
accuracy for (i in seq(1, 11, by = 2)) {
<- knn(train = ecommerce_train, test = ecommerce_test, cl = ecommerce_train_labels, k = i)
ecommerce_pred <- length(which(ecommerce_test_labels == ecommerce_pred)) /length(ecommerce_pred)
accuracy[i]
}<- data.frame(x= 1:11, y = accuracy)
df ggplot(df, aes(x = x, y = y)) + geom_point() + geom_line(data=df[!is.na(df$y),]) + labs(x = "K Neighbors", y = "Accuracy")
# testing
<- knn(train = ecommerce_train, test = ecommerce_test, cl = ecommerce_train_labels, k = 3)
ecommerce_pred CrossTable(x = ecommerce_test_labels, y= ecommerce_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3999
##
##
## | ecommerce_pred
## ecommerce_test_labels | 0 | 1 | Row Total |
## ----------------------|-----------|-----------|-----------|
## 0 | 928 | 707 | 1635 |
## | 0.568 | 0.432 | 0.409 |
## | 0.549 | 0.306 | |
## | 0.232 | 0.177 | |
## ----------------------|-----------|-----------|-----------|
## 1 | 761 | 1603 | 2364 |
## | 0.322 | 0.678 | 0.591 |
## | 0.451 | 0.694 | |
## | 0.190 | 0.401 | |
## ----------------------|-----------|-----------|-----------|
## Column Total | 1689 | 2310 | 3999 |
## | 0.422 | 0.578 | |
## ----------------------|-----------|-----------|-----------|
##
##
Although the data was normalized, the KNN performed worse at 63.3% accuracy.
Principal Components
# wrangling
<- read.csv("DataSets/trainPCA.csv", header = TRUE)
trainPCA <- read.csv("DataSets/testPCA.csv", header = TRUE)
testPCA
<- trainPCA %>% select(-1)
trainPCA <- testPCA %>% select(-1)
testPCA
<- cbind(trainPCA, onTime = ecommerce[1:7000, 9])
ecommerce_trainpca_with_time <- cbind(testPCA, onTime = ecommerce[7001:10999, 9])
ecommerce_testpca_with_time <- ecommerce_trainpca_with_time[, 5]
ecommerce_pca_train_labels <- ecommerce_testpca_with_time[, 5]
ecommerce_pca_test_labels
# finding k
<- vector()
accuracy for (i in seq(1, 29, by = 2)) {
<- knn(train = trainPCA, test = testPCA, cl = ecommerce_pca_train_labels, k = i)
ecommerce_pred <- length(which(ecommerce_pca_test_labels == ecommerce_pred)) /length(ecommerce_pred)
accuracy[i]
}<- data.frame(x= 1:29, y = accuracy)
df ggplot(df, aes(x = x, y = y)) + geom_point() + geom_line(data=df[!is.na(df$y),]) + labs(x = "K Neighbors", y = "Accuracy")
# testing
<- knn(train = trainPCA, test = testPCA, cl = ecommerce_pca_train_labels, k = 25)
ecommerce_pca_pred CrossTable(x = ecommerce_pca_test_labels, y= ecommerce_pca_pred, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3999
##
##
## | ecommerce_pca_pred
## ecommerce_pca_test_labels | 0 | 1 | Row Total |
## --------------------------|-----------|-----------|-----------|
## 0 | 1150 | 485 | 1635 |
## | 0.703 | 0.297 | 0.409 |
## | 0.576 | 0.242 | |
## | 0.288 | 0.121 | |
## --------------------------|-----------|-----------|-----------|
## 1 | 847 | 1517 | 2364 |
## | 0.358 | 0.642 | 0.591 |
## | 0.424 | 0.758 | |
## | 0.212 | 0.379 | |
## --------------------------|-----------|-----------|-----------|
## Column Total | 1997 | 2002 | 3999 |
## | 0.499 | 0.501 | |
## --------------------------|-----------|-----------|-----------|
##
##
The model using the principal components performed better at 66.69% accuracy.
Predictive Modeling: Decision Tree
To see if there is a model that performs better than the KNN, we ran a decision tree model and generated a rule set from the decision tree. Especially since the KNN only considers numerical variables, we aimed to include the categorical variables we have removed initially such as warehouse and mode.
library(C50)
library(dplyr)
library(gmodels)
<- read_csv("DataSets/ecomTrain.csv")
ecomTrain <- read_csv("DataSets/ecomTest.csv") ecomTest
Decision Tree
<- ecomTrain %>% select(-rating, -gender, -onTime)
ecomTrain2 # create decision tree model and classify as onTime
<- C5.0(ecomTrain2, as.factor(ecomTrain$onTime))
model summary(model)
##
## Call:
## C5.0.default(x = ecomTrain2, y = as.factor(ecomTrain$onTime))
##
##
## C5.0 [Release 2.07 GPL Edition] Wed May 18 18:28:50 2022
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 6999 cases (9 attributes) from undefined.data
##
## Decision tree:
##
## discount > 10: 1 (1690)
## discount <= 10:
## :...weight > 4130:
## :...warehouse in {A,F}: 0 (1751/732)
## : warehouse = C:
## : :...calls <= 4: 0 (461/189)
## : : calls > 4:
## : : :...mode = Road: 0 (31/13)
## : : mode = Ship: 1 (89/36)
## : : mode = Flight:
## : : :...importance = medium: 1 (6/2)
## : : importance = high: 0 (1)
## : : importance = low:
## : : :...weight <= 5205: 0 (8/1)
## : : weight > 5205: 1 (3)
## : warehouse = B:
## : :...importance = medium: 0 (278/131)
## : : importance = high:
## : : :...mode = Flight: 0 (9/2)
## : : : mode = Road:
## : : : :...calls <= 4: 0 (6/2)
## : : : : calls > 4: 1 (4)
## : : : mode = Ship:
## : : : :...calls <= 3: 0 (21/8)
## : : : calls > 3: 1 (14/3)
## : : importance = low:
## : : :...mode in {Road,Ship}: 0 (229/88)
## : : mode = Flight:
## : : :...calls <= 2: 0 (3)
## : : calls > 2:
## : : :...calls <= 5: 1 (39/13)
## : : calls > 5: 0 (2)
## : warehouse = D:
## : :...purchases > 5: 0 (54/13)
## : purchases <= 5:
## : :...discount > 9:
## : :...calls <= 2: 1 (3)
## : : calls > 2: 0 (51/10)
## : discount <= 9:
## : :...calls <= 4: 0 (369/153)
## : calls > 4:
## : :...purchases <= 2: 0 (24/8)
## : purchases > 2:
## : :...cost <= 125: 0 (5)
## : cost > 125: 1 (67/22)
## weight <= 4130:
## :...purchases <= 3:
## :...calls <= 5: 1 (419/91)
## : calls > 5:
## : :...cost > 191: 1 (6)
## : cost <= 191:
## : :...cost <= 135: 1 (2)
## : cost > 135: 0 (8)
## purchases > 3:
## :...weight > 2020:
## :...weight <= 4002: 1 (41)
## : weight > 4002:
## : :...warehouse in {C,F,D}: 1 (53/20)
## : warehouse = A: 0 (10/3)
## : warehouse = B:
## : :...purchases <= 5: 0 (9/1)
## : purchases > 5: 1 (5/1)
## weight <= 2020:
## :...cost <= 153: 1 (34/5)
## cost > 153:
## :...importance = low: 0 (726/358)
## importance = high:
## :...calls <= 5: 1 (6/2)
## : calls > 5:
## : :...cost <= 290: 0 (8)
## : cost > 290: 1 (2)
## importance = medium:
## :...warehouse = C: 0 (77/27)
## warehouse = B: 1 (80/36)
## warehouse = D:
## :...calls <= 4: 1 (16/5)
## : calls > 4: 0 (56/21)
## warehouse = F:
## :...purchases > 5:
## : :...cost <= 256: 0 (10/4)
## : : cost > 256: 1 (8)
## : purchases <= 5:
## : :...purchases > 4: 0 (65/22)
## : purchases <= 4:
## : :...cost <= 286: 0 (49/20)
## : cost > 286: 1 (17/5)
## warehouse = A:
## :...mode = Road:
## :...cost <= 257: 1 (3)
## : cost > 257: 0 (10/1)
## mode = Flight:
## :...cost <= 279: 1 (11/3)
## : cost > 279: 0 (3)
## mode = Ship:
## :...purchases > 4: 0 (19/5)
## purchases <= 4:
## :...calls <= 5: 1 (3)
## calls > 5:
## :...discount <= 6: 0 (12/2)
## discount > 6: 1 (13/5)
##
##
## Evaluation on training data (6999 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 57 2063(29.5%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 2551 249 (a): class 0
## 1814 2385 (b): class 1
##
##
## Attribute usage:
##
## 100.00% discount
## 75.85% weight
## 57.97% warehouse
## 33.63% purchases
## 25.96% importance
## 25.12% calls
## 18.80% cost
## 7.70% mode
##
##
## Time: 0.0 secs
# Algorithm Accuracy: 70.5%
# predict with test data set
<- ecomTest %>% select(-rating, -gender, -onTime)
ecomTest2 <- predict(model, ecomTest2)
pred # Tested Accuracy : 68.7%
CrossTable(ecomTest$onTime, pred, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c('Actual', 'Predicted'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3999
##
##
## | Predicted
## Actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 1417 | 218 | 1635 |
## | 0.354 | 0.055 | |
## -------------|-----------|-----------|-----------|
## 1 | 1032 | 1332 | 2364 |
## | 0.258 | 0.333 | |
## -------------|-----------|-----------|-----------|
## Column Total | 2449 | 1550 | 3999 |
## -------------|-----------|-----------|-----------|
##
##
# adaptive boosting by 10 trials
<- C5.0(ecomTrain2, as.factor(ecomTrain$onTime), trials = 10)
boost10 #summary(boost10)
$boostResults boost10
## Trial Size Errors Percent Data
## 1 1 57 2063 29.5 Training Set
## 2 2 16 2358 33.7 Training Set
## 3 3 11 2554 36.5 Training Set
## 4 4 15 2493 35.6 Training Set
## 5 5 11 2360 33.7 Training Set
## 6 6 3 2639 37.7 Training Set
## 7 7 4 2684 38.3 Training Set
## 8 8 5 2217 31.7 Training Set
## 9 9 21 2474 35.3 Training Set
## 10 10 7 2272 32.5 Training Set
# Algorithm Accuracy: 70.8%
<- predict(boost10, ecomTest2)
predboost10 CrossTable(ecomTest$onTime, predboost10, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c('Actual', 'Predicted'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3999
##
##
## | Predicted
## Actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 1245 | 390 | 1635 |
## | 0.311 | 0.098 | |
## -------------|-----------|-----------|-----------|
## 1 | 886 | 1478 | 2364 |
## | 0.222 | 0.370 | |
## -------------|-----------|-----------|-----------|
## Column Total | 2131 | 1868 | 3999 |
## -------------|-----------|-----------|-----------|
##
##
# Result: 68.1% Accuracy
The decision tree performed slightly better than the KNN model with 70.5% accuracy of the model and when using Adaptive Boosting, the model accurately predicted 70.8% of the data. However, when testing the performance of the models with the test dataset, the decision tree without boosting performed the best at 68.7% accuracy.
The variables with the most information gain was the discount, weight, and warehouse of the package.
It was odd that the boosting algorithm only improved the model slightly. Taking a closer look at the decision tree we created, there were 57 nodes but only 16 of those nodes were supported by data. The rest of the nodes used a default value. In order to improve this model, more data is needed as our dataset consists of packages with similar characteristics.
Rule Set
To try a variation of the decision tree, we created a rule set that was derived from the structure of the tree. Compared to the decision tree that follows one rule and goes down one branch, rule sets apply multiples rules to the data point and weighs each rule to classify the data.
<- ecomTrain %>% select(-rating, -gender, -onTime, -mode)
ecomTrain3
# create decision tree model and classify as onTime
<- C5.0(ecomTrain3, as.factor(ecomTrain$onTime), rules = TRUE)
model3 summary(model3)
##
## Call:
## C5.0.default(x = ecomTrain3, y = as.factor(ecomTrain$onTime), rules = TRUE)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed May 18 18:28:51 2022
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 6999 cases (8 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (8, lift 2.2)
## calls > 5
## cost > 135
## cost <= 191
## purchases <= 3
## discount <= 10
## weight <= 4130
## -> class 0 [0.900]
##
## Rule 2: (15/2, lift 2.1)
## calls > 4
## cost > 153
## purchases > 5
## importance = medium
## discount <= 10
## weight <= 1632
## -> class 0 [0.824]
##
## Rule 3: (18/3, lift 2.0)
## warehouse = B
## calls <= 4
## purchases > 3
## purchases <= 5
## importance = medium
## discount <= 10
## weight <= 4130
## -> class 0 [0.800]
##
## Rule 4: (5309/2509, lift 1.3)
## discount <= 10
## -> class 0 [0.527]
##
## Rule 5: (1690, lift 1.7)
## discount > 10
## -> class 1 [0.999]
##
## Rule 6: (1468/18, lift 1.6)
## calls <= 5
## purchases <= 3
## weight <= 3989
## -> class 1 [0.987]
##
## Rule 7: (444/11, lift 1.6)
## cost <= 153
## weight <= 2020
## -> class 1 [0.973]
##
## Rule 8: (27, lift 1.6)
## warehouse = D
## calls <= 2
## purchases <= 5
## discount > 9
## -> class 1 [0.966]
##
## Rule 9: (77/4, lift 1.6)
## cost > 153
## purchases > 5
## importance = medium
## weight > 1632
## weight <= 4130
## -> class 1 [0.937]
##
## Rule 10: (26/1, lift 1.5)
## warehouse = B
## purchases > 4
## importance = medium
## discount > 9
## -> class 1 [0.929]
##
## Rule 11: (1382/112, lift 1.5)
## weight > 2020
## weight <= 4130
## -> class 1 [0.918]
##
## Rule 12: (70/5, lift 1.5)
## warehouse = B
## calls <= 3
## purchases <= 4
## importance = medium
## weight <= 4260
## -> class 1 [0.917]
##
## Rule 13: (136/12, lift 1.5)
## warehouse = D
## calls <= 4
## purchases <= 5
## importance = medium
## weight <= 4130
## -> class 1 [0.906]
##
## Rule 14: (258/52, lift 1.3)
## warehouse = B
## importance = medium
## weight <= 4130
## -> class 1 [0.796]
##
## Rule 15: (95/24, lift 1.2)
## calls <= 3
## cost > 214
## purchases > 3
## importance = low
## weight <= 5272
## -> class 1 [0.742]
##
## Rule 16: (252/80, lift 1.1)
## warehouse = B
## calls > 3
## purchases <= 4
## importance = medium
## -> class 1 [0.681]
##
## Default class: 1
##
##
## Evaluation on training data (6999 cases):
##
## Rules
## ----------------
## No Errors
##
## 16 2128(30.4%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 2559 241 (a): class 0
## 1887 2312 (b): class 1
##
##
## Attribute usage:
##
## 100.00% discount
## 33.72% weight
## 27.79% purchases
## 26.46% calls
## 10.03% importance
## 9.07% cost
## 7.92% warehouse
##
##
## Time: 0.0 secs
# Algorithm Accuracy: 69.6%
# predict with test dataset
<- ecomTest %>% select(-rating, -gender, -onTime, -mode)
ecomTest3 <- predict(model3, ecomTest3)
pred3
# Tested Accuracy: 68.1%
CrossTable(ecomTest$onTime, pred3, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c('Actual', 'Predicted'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3999
##
##
## | Predicted
## Actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 1442 | 193 | 1635 |
## | 0.361 | 0.048 | |
## -------------|-----------|-----------|-----------|
## 1 | 1084 | 1280 | 2364 |
## | 0.271 | 0.320 | |
## -------------|-----------|-----------|-----------|
## Column Total | 2526 | 1473 | 3999 |
## -------------|-----------|-----------|-----------|
##
##
# adaptive boosting by 14 trials
<- C5.0(ecomTrain3, as.factor(ecomTrain$onTime), trials = 14, rules = TRUE)
boost14 #summary(boost14)
$boostResults boost14
## Trial Size Errors Percent Data
## 1 1 16 2128 30.4 Training Set
## 2 2 8 2614 37.3 Training Set
## 3 3 6 2247 32.1 Training Set
## 4 4 13 2638 37.7 Training Set
## 5 5 7 2502 35.7 Training Set
## 6 6 9 2530 36.1 Training Set
## 7 7 3 2675 38.2 Training Set
## 8 8 6 2715 38.8 Training Set
## 9 9 2 2509 35.8 Training Set
## 10 10 2 2800 40.0 Training Set
## 11 11 4 2503 35.8 Training Set
## 12 12 6 2219 31.7 Training Set
## 13 13 10 2674 38.2 Training Set
## 14 14 16 2171 31.0 Training Set
# Boosted Algorithm Accuracy: 70%
<- predict(boost14, ecomTest3)
predboost14 CrossTable(ecomTest$onTime, predboost14, prop.chisq = FALSE, prop.c = FALSE,
prop.r = FALSE, dnn = c('Actual', 'Predicted'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 3999
##
##
## | Predicted
## Actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 1173 | 462 | 1635 |
## | 0.293 | 0.116 | |
## -------------|-----------|-----------|-----------|
## 1 | 837 | 1527 | 2364 |
## | 0.209 | 0.382 | |
## -------------|-----------|-----------|-----------|
## Column Total | 2010 | 1989 | 3999 |
## -------------|-----------|-----------|-----------|
##
##
# Result: 67.5% Accuracy
The rule set did not perform significantly better at 67.5% accuracy.
Results
Of the three predictive models we created - KNN, Decision Tree, and Rule Set, the decision tree model performed the best in testing at 68.7% accuracy. From this model and the analysis we ran, we can see that discount, weight, and warehouse most signficantly impact whether a package arrives on time. We were surprised to see that mode of shipment had little effect on the arrival of the package as we expected more variation between the different modes.
In more detail, a package is likely not to arrive on time if it there was less than a 10% discount offered on the package, the weight was less than 2000 or greater than 4000, and the package came from warehouse F.
# warehouse
<- data.frame(table(warehouse = ecom$Warehouse_block, onTime = as.logical(ecom$Reached.on.Time_Y.N)))
tab %>% ggplot() + geom_bar(aes(x = warehouse, y = Freq, fill = onTime), stat = "identity", position = "dodge") + ggtitle("Distribution of Packages per Warehouse") + xlab("Warehouses") + ylab("Total Packages") + labs(fill= "Package Arrived on Time") tab
# discount v weight
%>% ggplot(aes(x = Discount_offered, y = Weight_in_gms)) + geom_point(aes(color = as.logical(Reached.on.Time_Y.N))) + ggtitle("Discount Vs Weight") + xlab("Discount Offered") + ylab("Weight (grams)") + labs(color = "Package Arrived on Time") ecom
The implications of research suggest that warehouse F is overloaded with packages compared to other warehouses but does not necessarily have a higher rate of packages not arriving on time. Moreover, a little to no discount on an item most likely suggests that there is not much of the item in supply compared to an item with a high discount offered. It is possibly that a lower discount signifies less of the product available, leading to the package not arriving on time. Additionally, packages that are extremely light could have been misplaced while packages that are extremely heavy could have had difficulties in transportation which lead to the package not arriving on time.
Overall, it is the interaction of discount, weight, and warehouse among many variables that account for the arrival of the package. With the predictive models created, the company can identify packages that are at risk and take more care in seeing that it arrives on time.
Reflection
Through this project, we were able to see the strengths and weaknesses of each algorithm. With cluster analysis and principal component analysis, we were limited to numerical variables but able to understand more about the patterns in the data. With the KNN and decision tree model, more data is needed to improve the accuracy of the classification algorithm and our highest performance was 68.7% in accuracy which is slightly better than random guessing of 50%. The decision tree model performed slightly better than the KNN model and this could be due to the inclusion of categorical variables in analysis.
We were surprised to see that techniques such as normalization of the data in the KNN and AdaBoost for the decision tree did not lead to a major improvement in algorithm accuracy. This could be due to human error or that our dataset was too small.
For future study, we would like to gather more data to train our model and investigate confounding variables such as weather or location of customer. It would also be interesting to see time-series data of the packages as online ordering tends to increase around the holiday season which could impact arrival time.
In this project, we learned the applications of Big Data in machine learning and the importance of trying different algorithms to find the highest accuracy in predictive modeling.