PSY341K Machine Learning: Ecommerce Project

Introduction

Due to the outbreak of COVID-19 and lockdown, e-commerce increased in popularity. Retailers and manufacturers had to be adaptable to fulfill changing demands, manage inventory levels, and improve their online shopping experience as buyers stayed at homes and minimized their travels to physical locations. According to the US Census Bureau (2022), e-commerce accounted for around 11% of all retail sales in 2019. By the end of the second quarter of 2020, e-commerce had risen to more than 16% of overall retail sales.

In order to keep up with demand and sustain inventory levels fueled by the e-commerce boom, retailers generated different strategies for efficient, dependable, and cost-effective means to move a massive volume of freight. As a result, discovering the factors influencing whether a product will arrive on time is crucial. In this project, we aim to investigate the factors influencing whether a package will arrive on time utilizing machine learning algorithms. We conducted exploratory analysis, feature selection, dimensionality reduction, predictive modeling, and boosting to improve algorithm accuracy.

Research Question: What factors influence whether a package arrives on time or not?

Machine Learning Solution: Create a model predicting whether a package will arrive on time

Data Wrangling

library(readr)

ecom <- read_csv("DataSets/Train.csv") # original dataset
head(ecom)

## # A tibble: 6 x 12
##      ID Warehouse_block Mode_of_Shipment Customer_care_calls Customer_rating
##   <dbl> <chr>           <chr>                          <dbl>           <dbl>
## 1     1 D               Flight                             4               2
## 2     2 F               Flight                             4               5
## 3     3 A               Flight                             2               2
## 4     4 B               Flight                             3               3
## 5     5 C               Flight                             2               2
## 6     6 F               Flight                             3               1
## # ... with 7 more variables: Cost_of_the_Product <dbl>, Prior_purchases <dbl>,
## #   Product_importance <chr>, Gender <chr>, Discount_offered <dbl>,
## #   Weight_in_gms <dbl>, Reached.on.Time_Y.N <dbl>

The dataset consists of 12 variables. First, there is the ID variable identifying the package order. There are the categorical variables of warehouse where the package was stored (labeled A to F), mode of shipment (Flight, Ship, or Road), gender of customer, and product importance rated by customer (Low, Medium, High). Then, there are the numerical variables of the number of customer care calls received about the package, customer rating of the package, cost of the product, number of prior purchases of the customer, discount offered on the product, and weight of the package. Lastly, there is the binary variable of whether the package arrived on time and this is the variable we will be conducting our binary classification.

The data wrangling was conducted in R and Excel. The ordinal categorical variables were converted into numerical variables. For example, product importance ranges from low, medium, to high and we converted all lows to 0, mediums to 1, and highs to 2. Nominal categorical variables such as warehouse and mode were removed for models analyzing numerical variables.

The dataset was also randomized and split into a training dataset of 7000 points and a testing set of 3999 points for the predictive models.

Exploratory Analysis: Cluster Analysis

I ran a cluster analysis with numerical variables and ordinal categorical variables (so removing warehouse and mode).

library(gt)
library(cluster)
library(GGally)
library(dplyr)

ecomNum <- read_csv("DataSets/numerEcom.csv") # dataset with vars converted to numeric
ecomNum <- ecomNum %>% select(calls, rating, cost, purchases, importance, Gender, discount, weight, onTime)

sil_width <- vector()
for (i in 2:10) {
  kms <- kmeans(ecomNum, centers = i)
  sil <- silhouette(kms$cluster, dist(ecomNum))
  sil_width[i] <- mean(sil[, 3])
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k", breaks = 1:10)

We can see that 2 clusters is most optimal by measuring the silhouette width.

k = 2
myclusters <- kmeans(ecomNum, k)

myclusters$size

## [1] 6596 4401

myclusters$centers

##      calls   rating     cost purchases importance    Gender  discount   weight
## 1 3.811552 2.992116 204.8560  3.374318  0.6517586 0.5001516  8.214827 4860.131
## 2 4.418087 2.987957 218.1934  3.857532  0.5339696 0.5098841 21.106567 1796.336
##      onTime
## 1 0.4860522
## 2 0.7623267

Here we can see that one cluster has 4,402 observations while the other cluster has 6597 observations. The center of the clusters vary most with discount offered, weight in grams, and whether the package reached on time.

Feature Selection: Scatterplot Correlation Matrix

ecomNum %>% mutate(cluster = as.factor(myclusters$cluster)) %>% 
    ggpairs(aes(color = cluster), upper = list(continuous = wrap("cor", size = 2.5)))  + 
  theme(axis.line=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank()) + labs(title = "Scatterplot Correlation Matrix")

Whether the package arrived on time is significantly correlated with calls, cost, past purchases, discount, and weight. From the correlation with whether the package reached on time, it seems that rating and gender have an insignificant correlation and importance of the package has a weak correlation.

Findings

After running the cluster analysis, cluster 2 stood out because it was correlated with high percentage of the package being received on time and has the characteristics of high discount and low weight compared to cluster 1.

To keep the variance, a Principal Component Analysis will be conducted. However, the rating and gender variables will be removed because there was no correlation with the onTime variable or between clusters.

Other significant findings include that cost and calls are positively correlated, weight and calls are negatively correlated, and weight and discount are negatively correlated.

Dimensionality Reduction: Principal Component Analysis

Calculating PCA in Python

To view the principal component analysis, view the Python Colab Notebook here:

https://colab.research.google.com/drive/1kizyWORPAvHBI1VIdyWItLIV20FMN0Rc?usp=sharing

PCA Loadings

ecomNum2 <- ecomNum %>% select(calls, cost, purchases, importance, discount, weight)
pcas <- princomp(ecomNum2, cor = T)
summary(pcas, loadings = T)

## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     1.2772097 1.1634978 1.0096215 0.9334929 0.8226014
## Proportion of Variance 0.2718774 0.2256212 0.1698893 0.1452348 0.1127788
## Cumulative Proportion  0.2718774 0.4974986 0.6673879 0.8126227 0.9254016
##                            Comp.6
## Standard deviation     0.66902218
## Proportion of Variance 0.07459845
## Cumulative Proportion  1.00000000
## 
## Loadings:
##            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## calls       0.589  0.163         0.216  0.651  0.394
## cost        0.490  0.279         0.452 -0.687       
## purchases   0.388  0.106 -0.440 -0.757 -0.213  0.163
## importance -0.146        -0.893  0.403        -0.106
## discount          -0.768         0.118 -0.207  0.591
## weight     -0.490  0.542                       0.676

Here we can see that the first component consists of a high number of calls, costs, and low weight, the second component consists of low discount and high weight, the third component consists of low purchases and low importance, and the fourth component consists of high ocst, low purchases, and high importance.

pcas$scores %>%
  as.data.frame %>%
  mutate(onTime = ecomNum$onTime) %>%
  ggplot(aes(x = Comp.1, y = onTime)) + geom_point() +
  geom_smooth(method = "glm")

pcas$scores %>%
  as.data.frame %>%
  mutate(onTime = ecomNum$onTime) %>%
  ggplot(aes(x = Comp.2, y = onTime)) + geom_point() +
  geom_smooth(method = "glm")

pcas$scores %>%
  as.data.frame %>%
  mutate(onTime = ecomNum$onTime) %>%
  ggplot(aes(x = Comp.3, y = onTime)) + geom_point() +
  geom_smooth(method = "glm")

pcas$scores %>%
  as.data.frame %>%
  mutate(onTime = ecomNum$onTime) %>%
  ggplot(aes(x = Comp.4, y = onTime)) + geom_point() +
  geom_smooth(method = "glm")

After analyzing the effect of each component with the onTime variable, the second component stands out. The linear model shows that a stronger second component leads to the package becoming less likely to arrive on time. These results mirror the cluster analysis as the second component is correlated with a low discount, high weight, and the package not arriving on time.

Predictive Modeling: KNN

We used the training dataset to build the classification system and then used the knn to predict/classify the testing dataset as “on time” or not “on time” based on the neighbors in the training dataset.

library(class)
library(gmodels)

ecommerce <- read.csv("DataSets/ecommerceconverted_1.csv", header = TRUE)

We tested different k numbers to figure out which would allow the highest accuracy for our system which we found was k = 3.

Non-Normalized Data

# wrangling
ecommerce_train <- ecommerce[1:7000, ]
ecommerce_test <- ecommerce[7001:10999, ]
ecommerce_train_labels <- ecommerce[1:7000, 9]
ecommerce_test_labels <- ecommerce[7001:10999, 9]

# finding k
accuracy <- vector()
for (i in seq(1, 11, by = 2)) {
  ecommerce_pred <- knn(train = ecommerce_train, test = ecommerce_test, cl = ecommerce_train_labels, k = i)
  accuracy[i] <- length(which(ecommerce_test_labels == ecommerce_pred)) /length(ecommerce_pred)
}
df <- data.frame(x= 1:11, y = accuracy)
ggplot(df, aes(x = x, y = y)) + geom_point() + geom_line(data=df[!is.na(df$y),]) + labs(x = "K Neighbors", y = "Accuracy")

When testing the model, it performed with 65.77% accuracy.

# testing
ecommerce_pred <- knn(train = ecommerce_train, test = ecommerce_test, cl = ecommerce_train_labels, k = 3)
CrossTable(x = ecommerce_test_labels, y= ecommerce_pred, prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3999 
## 
##  
##                       | ecommerce_pred 
## ecommerce_test_labels |         0 |         1 | Row Total | 
## ----------------------|-----------|-----------|-----------|
##                     0 |       997 |       638 |      1635 | 
##                       |     0.610 |     0.390 |     0.409 | 
##                       |     0.578 |     0.281 |           | 
##                       |     0.249 |     0.160 |           | 
## ----------------------|-----------|-----------|-----------|
##                     1 |       729 |      1635 |      2364 | 
##                       |     0.308 |     0.692 |     0.591 | 
##                       |     0.422 |     0.719 |           | 
##                       |     0.182 |     0.409 |           | 
## ----------------------|-----------|-----------|-----------|
##          Column Total |      1726 |      2273 |      3999 | 
##                       |     0.432 |     0.568 |           | 
## ----------------------|-----------|-----------|-----------|
## 
##

Normalized Data

# wrangling and normalizing
ecommerce <- read.csv("DataSets/ecommerceconverted_1.csv", header = TRUE)
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
ecommerce_n <- as.data.frame(lapply(ecommerce[1:8], normalize))
ecommerce_n_with_time <- cbind(ecommerce_n,ecommerce[c(9)])

ecommerce_train <- ecommerce_n[1:7000, ]
ecommerce_test <- ecommerce_n[7001:10999, ]
ecommerce_train_labels <- ecommerce_n_with_time[1:7000, 9]
ecommerce_test_labels <- ecommerce_n_with_time[7001:10999, 9]

# finding k
accuracy <- vector()
for (i in seq(1, 11, by = 2)) {
  ecommerce_pred <- knn(train = ecommerce_train, test = ecommerce_test, cl = ecommerce_train_labels, k = i)
  accuracy[i] <- length(which(ecommerce_test_labels == ecommerce_pred)) /length(ecommerce_pred)
}
df <- data.frame(x= 1:11, y = accuracy)
ggplot(df, aes(x = x, y = y)) + geom_point() + geom_line(data=df[!is.na(df$y),]) + labs(x = "K Neighbors", y = "Accuracy")

# testing
ecommerce_pred <- knn(train = ecommerce_train, test = ecommerce_test, cl = ecommerce_train_labels, k = 3)
CrossTable(x = ecommerce_test_labels, y= ecommerce_pred, prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3999 
## 
##  
##                       | ecommerce_pred 
## ecommerce_test_labels |         0 |         1 | Row Total | 
## ----------------------|-----------|-----------|-----------|
##                     0 |       928 |       707 |      1635 | 
##                       |     0.568 |     0.432 |     0.409 | 
##                       |     0.549 |     0.306 |           | 
##                       |     0.232 |     0.177 |           | 
## ----------------------|-----------|-----------|-----------|
##                     1 |       761 |      1603 |      2364 | 
##                       |     0.322 |     0.678 |     0.591 | 
##                       |     0.451 |     0.694 |           | 
##                       |     0.190 |     0.401 |           | 
## ----------------------|-----------|-----------|-----------|
##          Column Total |      1689 |      2310 |      3999 | 
##                       |     0.422 |     0.578 |           | 
## ----------------------|-----------|-----------|-----------|
## 
##

Although the data was normalized, the KNN performed worse at 63.3% accuracy.

Principal Components

# wrangling
trainPCA <- read.csv("DataSets/trainPCA.csv", header = TRUE)
testPCA <- read.csv("DataSets/testPCA.csv", header = TRUE)

trainPCA <- trainPCA %>% select(-1)
testPCA <- testPCA %>% select(-1)

ecommerce_trainpca_with_time <- cbind(trainPCA, onTime = ecommerce[1:7000, 9])
ecommerce_testpca_with_time <- cbind(testPCA, onTime = ecommerce[7001:10999, 9])
ecommerce_pca_train_labels <- ecommerce_trainpca_with_time[, 5]
ecommerce_pca_test_labels <- ecommerce_testpca_with_time[, 5]

# finding k
accuracy <- vector()
for (i in seq(1, 29, by = 2)) {
  ecommerce_pred <- knn(train = trainPCA, test = testPCA, cl = ecommerce_pca_train_labels, k = i)
  accuracy[i] <- length(which(ecommerce_pca_test_labels == ecommerce_pred)) /length(ecommerce_pred)
}
df <- data.frame(x= 1:29, y = accuracy)
ggplot(df, aes(x = x, y = y)) + geom_point() + geom_line(data=df[!is.na(df$y),]) + labs(x = "K Neighbors", y = "Accuracy")

# testing
ecommerce_pca_pred <- knn(train = trainPCA, test = testPCA, cl = ecommerce_pca_train_labels, k = 25)
CrossTable(x = ecommerce_pca_test_labels, y= ecommerce_pca_pred, prop.chisq = FALSE)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3999 
## 
##  
##                           | ecommerce_pca_pred 
## ecommerce_pca_test_labels |         0 |         1 | Row Total | 
## --------------------------|-----------|-----------|-----------|
##                         0 |      1150 |       485 |      1635 | 
##                           |     0.703 |     0.297 |     0.409 | 
##                           |     0.576 |     0.242 |           | 
##                           |     0.288 |     0.121 |           | 
## --------------------------|-----------|-----------|-----------|
##                         1 |       847 |      1517 |      2364 | 
##                           |     0.358 |     0.642 |     0.591 | 
##                           |     0.424 |     0.758 |           | 
##                           |     0.212 |     0.379 |           | 
## --------------------------|-----------|-----------|-----------|
##              Column Total |      1997 |      2002 |      3999 | 
##                           |     0.499 |     0.501 |           | 
## --------------------------|-----------|-----------|-----------|
## 
##

The model using the principal components performed better at 66.69% accuracy.

Predictive Modeling: Decision Tree

To see if there is a model that performs better than the KNN, we ran a decision tree model and generated a rule set from the decision tree. Especially since the KNN only considers numerical variables, we aimed to include the categorical variables we have removed initially such as warehouse and mode.

library(C50)
library(dplyr)
library(gmodels)

ecomTrain <- read_csv("DataSets/ecomTrain.csv")
ecomTest <- read_csv("DataSets/ecomTest.csv")

Decision Tree

ecomTrain2 <- ecomTrain %>% select(-rating, -gender, -onTime)
# create decision tree model and classify as onTime
model <- C5.0(ecomTrain2, as.factor(ecomTrain$onTime))
summary(model)

## 
## Call:
## C5.0.default(x = ecomTrain2, y = as.factor(ecomTrain$onTime))
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed May 18 18:28:50 2022
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 6999 cases (9 attributes) from undefined.data
## 
## Decision tree:
## 
## discount > 10: 1 (1690)
## discount <= 10:
## :...weight > 4130:
##     :...warehouse in {A,F}: 0 (1751/732)
##     :   warehouse = C:
##     :   :...calls <= 4: 0 (461/189)
##     :   :   calls > 4:
##     :   :   :...mode = Road: 0 (31/13)
##     :   :       mode = Ship: 1 (89/36)
##     :   :       mode = Flight:
##     :   :       :...importance = medium: 1 (6/2)
##     :   :           importance = high: 0 (1)
##     :   :           importance = low:
##     :   :           :...weight <= 5205: 0 (8/1)
##     :   :               weight > 5205: 1 (3)
##     :   warehouse = B:
##     :   :...importance = medium: 0 (278/131)
##     :   :   importance = high:
##     :   :   :...mode = Flight: 0 (9/2)
##     :   :   :   mode = Road:
##     :   :   :   :...calls <= 4: 0 (6/2)
##     :   :   :   :   calls > 4: 1 (4)
##     :   :   :   mode = Ship:
##     :   :   :   :...calls <= 3: 0 (21/8)
##     :   :   :       calls > 3: 1 (14/3)
##     :   :   importance = low:
##     :   :   :...mode in {Road,Ship}: 0 (229/88)
##     :   :       mode = Flight:
##     :   :       :...calls <= 2: 0 (3)
##     :   :           calls > 2:
##     :   :           :...calls <= 5: 1 (39/13)
##     :   :               calls > 5: 0 (2)
##     :   warehouse = D:
##     :   :...purchases > 5: 0 (54/13)
##     :       purchases <= 5:
##     :       :...discount > 9:
##     :           :...calls <= 2: 1 (3)
##     :           :   calls > 2: 0 (51/10)
##     :           discount <= 9:
##     :           :...calls <= 4: 0 (369/153)
##     :               calls > 4:
##     :               :...purchases <= 2: 0 (24/8)
##     :                   purchases > 2:
##     :                   :...cost <= 125: 0 (5)
##     :                       cost > 125: 1 (67/22)
##     weight <= 4130:
##     :...purchases <= 3:
##         :...calls <= 5: 1 (419/91)
##         :   calls > 5:
##         :   :...cost > 191: 1 (6)
##         :       cost <= 191:
##         :       :...cost <= 135: 1 (2)
##         :           cost > 135: 0 (8)
##         purchases > 3:
##         :...weight > 2020:
##             :...weight <= 4002: 1 (41)
##             :   weight > 4002:
##             :   :...warehouse in {C,F,D}: 1 (53/20)
##             :       warehouse = A: 0 (10/3)
##             :       warehouse = B:
##             :       :...purchases <= 5: 0 (9/1)
##             :           purchases > 5: 1 (5/1)
##             weight <= 2020:
##             :...cost <= 153: 1 (34/5)
##                 cost > 153:
##                 :...importance = low: 0 (726/358)
##                     importance = high:
##                     :...calls <= 5: 1 (6/2)
##                     :   calls > 5:
##                     :   :...cost <= 290: 0 (8)
##                     :       cost > 290: 1 (2)
##                     importance = medium:
##                     :...warehouse = C: 0 (77/27)
##                         warehouse = B: 1 (80/36)
##                         warehouse = D:
##                         :...calls <= 4: 1 (16/5)
##                         :   calls > 4: 0 (56/21)
##                         warehouse = F:
##                         :...purchases > 5:
##                         :   :...cost <= 256: 0 (10/4)
##                         :   :   cost > 256: 1 (8)
##                         :   purchases <= 5:
##                         :   :...purchases > 4: 0 (65/22)
##                         :       purchases <= 4:
##                         :       :...cost <= 286: 0 (49/20)
##                         :           cost > 286: 1 (17/5)
##                         warehouse = A:
##                         :...mode = Road:
##                             :...cost <= 257: 1 (3)
##                             :   cost > 257: 0 (10/1)
##                             mode = Flight:
##                             :...cost <= 279: 1 (11/3)
##                             :   cost > 279: 0 (3)
##                             mode = Ship:
##                             :...purchases > 4: 0 (19/5)
##                                 purchases <= 4:
##                                 :...calls <= 5: 1 (3)
##                                     calls > 5:
##                                     :...discount <= 6: 0 (12/2)
##                                         discount > 6: 1 (13/5)
## 
## 
## Evaluation on training data (6999 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      57 2063(29.5%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    2551   249    (a): class 0
##    1814  2385    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% discount
##   75.85% weight
##   57.97% warehouse
##   33.63% purchases
##   25.96% importance
##   25.12% calls
##   18.80% cost
##    7.70% mode
## 
## 
## Time: 0.0 secs

# Algorithm Accuracy: 70.5%

# predict with test data set
ecomTest2 <- ecomTest %>% select(-rating, -gender, -onTime)
pred <- predict(model, ecomTest2)
# Tested Accuracy : 68.7%
CrossTable(ecomTest$onTime, pred, prop.chisq = FALSE, prop.c = FALSE, 
           prop.r = FALSE, dnn = c('Actual', 'Predicted'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3999 
## 
##  
##              | Predicted 
##       Actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      1417 |       218 |      1635 | 
##              |     0.354 |     0.055 |           | 
## -------------|-----------|-----------|-----------|
##            1 |      1032 |      1332 |      2364 | 
##              |     0.258 |     0.333 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      2449 |      1550 |      3999 | 
## -------------|-----------|-----------|-----------|
## 
##

# adaptive boosting by 10 trials
boost10 <- C5.0(ecomTrain2, as.factor(ecomTrain$onTime), trials = 10)
#summary(boost10)
boost10$boostResults

##    Trial Size Errors Percent         Data
## 1      1   57   2063    29.5 Training Set
## 2      2   16   2358    33.7 Training Set
## 3      3   11   2554    36.5 Training Set
## 4      4   15   2493    35.6 Training Set
## 5      5   11   2360    33.7 Training Set
## 6      6    3   2639    37.7 Training Set
## 7      7    4   2684    38.3 Training Set
## 8      8    5   2217    31.7 Training Set
## 9      9   21   2474    35.3 Training Set
## 10    10    7   2272    32.5 Training Set

# Algorithm Accuracy: 70.8%
predboost10 <- predict(boost10, ecomTest2)
CrossTable(ecomTest$onTime, predboost10, prop.chisq = FALSE, prop.c = FALSE, 
           prop.r = FALSE, dnn = c('Actual', 'Predicted'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3999 
## 
##  
##              | Predicted 
##       Actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      1245 |       390 |      1635 | 
##              |     0.311 |     0.098 |           | 
## -------------|-----------|-----------|-----------|
##            1 |       886 |      1478 |      2364 | 
##              |     0.222 |     0.370 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      2131 |      1868 |      3999 | 
## -------------|-----------|-----------|-----------|
## 
##

# Result: 68.1% Accuracy

The decision tree performed slightly better than the KNN model with 70.5% accuracy of the model and when using Adaptive Boosting, the model accurately predicted 70.8% of the data. However, when testing the performance of the models with the test dataset, the decision tree without boosting performed the best at 68.7% accuracy.

The variables with the most information gain was the discount, weight, and warehouse of the package.

It was odd that the boosting algorithm only improved the model slightly. Taking a closer look at the decision tree we created, there were 57 nodes but only 16 of those nodes were supported by data. The rest of the nodes used a default value. In order to improve this model, more data is needed as our dataset consists of packages with similar characteristics.

Rule Set

To try a variation of the decision tree, we created a rule set that was derived from the structure of the tree. Compared to the decision tree that follows one rule and goes down one branch, rule sets apply multiples rules to the data point and weighs each rule to classify the data.

ecomTrain3 <- ecomTrain %>% select(-rating, -gender, -onTime, -mode)

# create decision tree model and classify as onTime
model3 <- C5.0(ecomTrain3, as.factor(ecomTrain$onTime), rules = TRUE)
summary(model3)

## 
## Call:
## C5.0.default(x = ecomTrain3, y = as.factor(ecomTrain$onTime), rules = TRUE)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed May 18 18:28:51 2022
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 6999 cases (8 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (8, lift 2.2)
##  calls > 5
##  cost > 135
##  cost <= 191
##  purchases <= 3
##  discount <= 10
##  weight <= 4130
##  ->  class 0  [0.900]
## 
## Rule 2: (15/2, lift 2.1)
##  calls > 4
##  cost > 153
##  purchases > 5
##  importance = medium
##  discount <= 10
##  weight <= 1632
##  ->  class 0  [0.824]
## 
## Rule 3: (18/3, lift 2.0)
##  warehouse = B
##  calls <= 4
##  purchases > 3
##  purchases <= 5
##  importance = medium
##  discount <= 10
##  weight <= 4130
##  ->  class 0  [0.800]
## 
## Rule 4: (5309/2509, lift 1.3)
##  discount <= 10
##  ->  class 0  [0.527]
## 
## Rule 5: (1690, lift 1.7)
##  discount > 10
##  ->  class 1  [0.999]
## 
## Rule 6: (1468/18, lift 1.6)
##  calls <= 5
##  purchases <= 3
##  weight <= 3989
##  ->  class 1  [0.987]
## 
## Rule 7: (444/11, lift 1.6)
##  cost <= 153
##  weight <= 2020
##  ->  class 1  [0.973]
## 
## Rule 8: (27, lift 1.6)
##  warehouse = D
##  calls <= 2
##  purchases <= 5
##  discount > 9
##  ->  class 1  [0.966]
## 
## Rule 9: (77/4, lift 1.6)
##  cost > 153
##  purchases > 5
##  importance = medium
##  weight > 1632
##  weight <= 4130
##  ->  class 1  [0.937]
## 
## Rule 10: (26/1, lift 1.5)
##  warehouse = B
##  purchases > 4
##  importance = medium
##  discount > 9
##  ->  class 1  [0.929]
## 
## Rule 11: (1382/112, lift 1.5)
##  weight > 2020
##  weight <= 4130
##  ->  class 1  [0.918]
## 
## Rule 12: (70/5, lift 1.5)
##  warehouse = B
##  calls <= 3
##  purchases <= 4
##  importance = medium
##  weight <= 4260
##  ->  class 1  [0.917]
## 
## Rule 13: (136/12, lift 1.5)
##  warehouse = D
##  calls <= 4
##  purchases <= 5
##  importance = medium
##  weight <= 4130
##  ->  class 1  [0.906]
## 
## Rule 14: (258/52, lift 1.3)
##  warehouse = B
##  importance = medium
##  weight <= 4130
##  ->  class 1  [0.796]
## 
## Rule 15: (95/24, lift 1.2)
##  calls <= 3
##  cost > 214
##  purchases > 3
##  importance = low
##  weight <= 5272
##  ->  class 1  [0.742]
## 
## Rule 16: (252/80, lift 1.1)
##  warehouse = B
##  calls > 3
##  purchases <= 4
##  importance = medium
##  ->  class 1  [0.681]
## 
## Default class: 1
## 
## 
## Evaluation on training data (6999 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##      16 2128(30.4%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##    2559   241    (a): class 0
##    1887  2312    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% discount
##   33.72% weight
##   27.79% purchases
##   26.46% calls
##   10.03% importance
##    9.07% cost
##    7.92% warehouse
## 
## 
## Time: 0.0 secs

# Algorithm Accuracy: 69.6%

# predict with test dataset
ecomTest3 <- ecomTest %>% select(-rating, -gender, -onTime, -mode)
pred3 <- predict(model3, ecomTest3)

# Tested Accuracy: 68.1%
CrossTable(ecomTest$onTime, pred3, prop.chisq = FALSE, prop.c = FALSE, 
           prop.r = FALSE, dnn = c('Actual', 'Predicted'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3999 
## 
##  
##              | Predicted 
##       Actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      1442 |       193 |      1635 | 
##              |     0.361 |     0.048 |           | 
## -------------|-----------|-----------|-----------|
##            1 |      1084 |      1280 |      2364 | 
##              |     0.271 |     0.320 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      2526 |      1473 |      3999 | 
## -------------|-----------|-----------|-----------|
## 
##

# adaptive boosting by 14 trials
boost14 <- C5.0(ecomTrain3, as.factor(ecomTrain$onTime), trials = 14, rules = TRUE)
#summary(boost14)
boost14$boostResults

##    Trial Size Errors Percent         Data
## 1      1   16   2128    30.4 Training Set
## 2      2    8   2614    37.3 Training Set
## 3      3    6   2247    32.1 Training Set
## 4      4   13   2638    37.7 Training Set
## 5      5    7   2502    35.7 Training Set
## 6      6    9   2530    36.1 Training Set
## 7      7    3   2675    38.2 Training Set
## 8      8    6   2715    38.8 Training Set
## 9      9    2   2509    35.8 Training Set
## 10    10    2   2800    40.0 Training Set
## 11    11    4   2503    35.8 Training Set
## 12    12    6   2219    31.7 Training Set
## 13    13   10   2674    38.2 Training Set
## 14    14   16   2171    31.0 Training Set

# Boosted Algorithm Accuracy: 70%
predboost14 <- predict(boost14, ecomTest3)
CrossTable(ecomTest$onTime, predboost14, prop.chisq = FALSE, prop.c = FALSE, 
           prop.r = FALSE, dnn = c('Actual', 'Predicted'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  3999 
## 
##  
##              | Predicted 
##       Actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |      1173 |       462 |      1635 | 
##              |     0.293 |     0.116 |           | 
## -------------|-----------|-----------|-----------|
##            1 |       837 |      1527 |      2364 | 
##              |     0.209 |     0.382 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      2010 |      1989 |      3999 | 
## -------------|-----------|-----------|-----------|
## 
##

# Result: 67.5% Accuracy

The rule set did not perform significantly better at 67.5% accuracy.

Results

Of the three predictive models we created - KNN, Decision Tree, and Rule Set, the decision tree model performed the best in testing at 68.7% accuracy. From this model and the analysis we ran, we can see that discount, weight, and warehouse most signficantly impact whether a package arrives on time. We were surprised to see that mode of shipment had little effect on the arrival of the package as we expected more variation between the different modes.

In more detail, a package is likely not to arrive on time if it there was less than a 10% discount offered on the package, the weight was less than 2000 or greater than 4000, and the package came from warehouse F.

# warehouse
tab <- data.frame(table(warehouse = ecom$Warehouse_block, onTime = as.logical(ecom$Reached.on.Time_Y.N)))
tab %>% ggplot() + geom_bar(aes(x = warehouse, y = Freq, fill = onTime), stat = "identity", position = "dodge") + ggtitle("Distribution of Packages per Warehouse") + xlab("Warehouses") + ylab("Total Packages") + labs(fill= "Package Arrived on Time")

# discount v weight
ecom %>% ggplot(aes(x = Discount_offered, y = Weight_in_gms)) + geom_point(aes(color = as.logical(Reached.on.Time_Y.N))) + ggtitle("Discount Vs Weight") + xlab("Discount Offered") + ylab("Weight (grams)") + labs(color = "Package Arrived on Time")

The implications of research suggest that warehouse F is overloaded with packages compared to other warehouses but does not necessarily have a higher rate of packages not arriving on time. Moreover, a little to no discount on an item most likely suggests that there is not much of the item in supply compared to an item with a high discount offered. It is possibly that a lower discount signifies less of the product available, leading to the package not arriving on time. Additionally, packages that are extremely light could have been misplaced while packages that are extremely heavy could have had difficulties in transportation which lead to the package not arriving on time.

Overall, it is the interaction of discount, weight, and warehouse among many variables that account for the arrival of the package. With the predictive models created, the company can identify packages that are at risk and take more care in seeing that it arrives on time.

Reflection

Through this project, we were able to see the strengths and weaknesses of each algorithm. With cluster analysis and principal component analysis, we were limited to numerical variables but able to understand more about the patterns in the data. With the KNN and decision tree model, more data is needed to improve the accuracy of the classification algorithm and our highest performance was 68.7% in accuracy which is slightly better than random guessing of 50%. The decision tree model performed slightly better than the KNN model and this could be due to the inclusion of categorical variables in analysis.

We were surprised to see that techniques such as normalization of the data in the KNN and AdaBoost for the decision tree did not lead to a major improvement in algorithm accuracy. This could be due to human error or that our dataset was too small.

For future study, we would like to gather more data to train our model and investigate confounding variables such as weather or location of customer. It would also be interesting to see time-series data of the packages as online ordering tends to increase around the holiday season which could impact arrival time.

In this project, we learned the applications of Big Data in machine learning and the importance of trying different algorithms to find the highest accuracy in predictive modeling.