DataHack - Upworthy
Research Question
- Did Upworthy really change their clickbait publishing practices?
- What was the impact of that change?
- Do you recommend an investment in Upworthy?
Read CSV files
= read.csv("country-data.csv") # aggregates user analytics by country
country_df = read.csv("daily-user-info.csv") # user analytics from 2013 to 2015
daily_user_df = read.csv("upworthy-archive.csv") # A/B testing results archive_df
Import packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(stringr)
Initial Analysis with Generalized Linear Models
We first defined clickbait as content that is attention-grabbing, emotionally arousing, and misleading. Since there are three datasets with a total of 31 variables, we were able to consider several clickbait metrics, such as clicks per impression, the language of the headline, and the bounce rate of the article.
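As a quick structural check (a supplementary sketch, not part of the original analysis), dplyr::glimpse() lists every column in the three data frames, which is how the 31-variable count can be verified:
glimpse(country_df)
glimpse(daily_user_df)
glimpse(archive_df)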
# Data Wrangling
# create numeric variable of winner for logistic regression
archive_df$bWinner <- as.numeric(as.logical(archive_df$winner))
# create clicks per impression variable
archive_df$click_imp <- archive_df$clicks / archive_df$impressions
Data Analysis on Clicks per Impression
click_glm = glm(bWinner ~ click_imp, data = archive_df)
summary(click_glm)
##
## Call:
## glm(formula = bWinner ~ click_imp, data = archive_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.45666 -0.05921 -0.04063 -0.02840 0.98741
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0125876 0.0009084 13.86 <2e-16 ***
## click_imp 2.4213110 0.0452874 53.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.04733763)
##
## Null deviance: 7274.5 on 150816 degrees of freedom
## Residual deviance: 7139.2 on 150815 degrees of freedom
## AIC: -32056
##
## Number of Fisher Scoring iterations: 2
score_glm = glm(bWinner ~ score, data = archive_df)
summary(score_glm)
##
## Call:
## glm(formula = bWinner ~ score, data = archive_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.14370 -0.09941 -0.01599 0.01108 1.01280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.280e-02 7.752e-04 -16.52 <2e-16 ***
## score 1.565e-03 1.363e-05 114.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.04435583)
##
## Null deviance: 7274.5 on 150816 degrees of freedom
## Residual deviance: 6689.5 on 150815 degrees of freedom
## AIC: -41868
##
## Number of Fisher Scoring iterations: 2
cor.test(archive_df$bWinner, archive_df$click_imp)
##
## Pearson's product-moment correlation
##
## data: archive_df$bWinner and archive_df$click_imp
## t = 53.465, df = 150815, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1314308 0.1413368
## sample estimates:
## cor
## 0.1363872
cor.test(archive_df$bWinner, archive_df$score)
##
## Pearson's product-moment correlation
##
## data: archive_df$bWinner and archive_df$score
## t = 114.84, df = 150815, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2789364 0.2882184
## sample estimates:
## cor
## 0.2835841
ggplot(archive_df, aes(x = created_at, y = click_imp)) + geom_point(aes(color = winner)) + geom_smooth() + labs(title = "Clicks per Impressions over Time", x= "Date of A/B Test", y = "Clicks per Impressions")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We regressed the binary winner indicator on clicks per impression and, separately, on the A/B test score; since glm() was used with its default Gaussian family, these are linear probability models rather than true logistic regressions. Both relationships are statistically significant, but the correlations are fairly low (about 0.14 and 0.28), which raises the question of why Upworthy did not simply publish the variant with the highest A/B test score or clicks per impression.
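For comparison, a proper logistic regression can be fit by specifying family = binomial; the sketch below is supplementary and not part of the original output:
# logistic regression of winner on clicks per impression (supplementary sketch)
logit_click <- glm(bWinner ~ click_imp, family = binomial, data = archive_df)
summary(logit_click)
exp(coef(logit_click)) # coefficients expressed as odds ratios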
Text Mining Headlines
# using punctuation to evaluate clickbait
<- c(question = "?",
punctuations.dict quote = "'",
dollar = "$",
exclaim = "!")
archive_df$created_at <- as.Date(archive_df$created_at)
archive_df$headline_nchar <- archive_df$package_headline %>% nchar()
archive_df$headline_nwords <- archive_df$package_headline %>% str_count(" ") + 1
for(i in names(punctuations.dict)){
  archive_df[[i]] <- grepl(paste0("[", punctuations.dict[i], "]"), archive_df$package_headline)
}
Data Visualization on Punctuation
archive_df %>% ggplot(aes(created_at, clicks/impressions, color = dollar)) + geom_smooth() + labs(title = "Clicks with Dollar Marks in Headline", x = "Date", y = "Total Clicks Per Impressions")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
archive_df %>% ggplot(aes(created_at, clicks/impressions, color = question)) + geom_smooth() + labs(title = "Clicks with Question Marks in Headline", x = "Date", y = "Total Clicks Per Impressions")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
archive_df %>% ggplot(aes(created_at, clicks/impressions, color = quote)) + geom_smooth() + labs(title = "Clicks with Quotes in Headline", x = "Date", y = "Total Clicks Per Impressions")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
archive_df %>% ggplot(aes(created_at, clicks/impressions, color = exclaim)) + geom_smooth() + labs(title = "Clicks with Exclamation Marks in Headline", x = "Date", y = "Total Clicks Per Impressions")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We analyzed the headline of each article in the A/B testing packages, using punctuation as a proxy for clickbait. Contrary to our original belief, headlines containing these punctuation marks actually garnered fewer clicks during A/B testing.
model.1 <- lm(click_imp ~ question + exclaim + dollar + quote + headline_nchar + headline_nwords, data = archive_df)
summary(model.1)
##
## Call:
## lm(formula = click_imp ~ question + exclaim + dollar + quote +
## headline_nchar + headline_nwords, data = archive_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.017568 -0.008247 -0.003231 0.004563 0.168565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.236e-02 1.827e-04 67.614 < 2e-16 ***
## questionTRUE -1.842e-03 8.843e-05 -20.827 < 2e-16 ***
## exclaimTRUE -3.056e-03 2.342e-04 -13.047 < 2e-16 ***
## dollarTRUE -3.694e-03 3.939e-04 -9.380 < 2e-16 ***
## quoteTRUE -5.645e-04 6.508e-05 -8.674 < 2e-16 ***
## headline_nchar 7.186e-05 4.141e-06 17.353 < 2e-16 ***
## headline_nwords -1.237e-04 1.980e-05 -6.250 4.12e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01232 on 150810 degrees of freedom
## Multiple R-squared: 0.008937, Adjusted R-squared: 0.008897
## F-statistic: 226.7 on 6 and 150810 DF, p-value: < 2.2e-16
After running a linear model, we confirmed that each punctuation mark is associated with significantly fewer clicks per impression; the number of characters in the headline is the only predictor with a positive coefficient.
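As a sanity check on these coefficient signs, the raw group means can be compared directly; a minimal dplyr sketch (supplementary, not from the original analysis) for the question-mark flag:
# compare mean clicks per impression for headlines with and without a question mark
archive_df %>%
  group_by(question) %>%
  summarise(mean_click_imp = mean(click_imp, na.rm = TRUE), n_headlines = n())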
Natural Language Processing Sentiment Analysis
The NLP sentiment analysis model is available in the Colab notebook here: https://colab.research.google.com/drive/11nhDAOWKevdZQ1J4-eo2g0YU_heuWFjF?usp=sharing
In summary, using pre-trained HuggingFace neural networks, we scored each headline with sentiment analysis models to judge whether the text is positive or negative.
We found that the overall strength of the sentiment does not decrease, but the headlines appear to become more positive around the date of the clickbait pledge.
Breaking the results down by emotion, most articles conveyed either joy or anger, and winner articles showed an overall increase in sentiment over time.
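The transformer-based analysis lives in the Colab notebook above; as a rough in-R cross-check, headlines could instead be scored with a lexicon-based approach. The sketch below uses tidytext and the Bing lexicon, and is an alternative technique for illustration rather than the method we actually used:
# lexicon-based sentiment sketch (tidytext + Bing lexicon), not the HuggingFace model
library(tidytext)
library(tidyr)

headline_sentiment <- archive_df %>%
  mutate(headline_id = row_number()) %>%
  select(headline_id, created_at, package_headline) %>%
  unnest_tokens(word, package_headline) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(headline_id, created_at, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net_sentiment = positive - negative)

# average net sentiment by month of the A/B test
headline_sentiment %>%
  group_by(month = format(created_at, "%Y-%m")) %>%
  summarise(mean_net_sentiment = mean(net_sentiment))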
Time Series Analysis
Loess Regression on Users
# Wrangle date variable and prepare loess
daily_user_df$dayX <- as.Date(daily_user_df$day, format = "%m/%d/%y")
daily_user_df$index <- 1:nrow(daily_user_df)
loess_users <- loess(users ~ index, data = daily_user_df, span = 0.25)
smoothed_users <- predict(loess_users)
# plot the raw series and overlay the loess fit
plot(y = daily_user_df$users, x = daily_user_df$dayX, type="l", main="Loess Smoothing and Prediction on Users", xlab="Date", ylab="Users")
lines(smoothed_users, x=daily_user_df$dayX, col="blue")
Loess Regression on Bounce Rate
loess_bounce <- loess(bounce_rate ~ index, data = daily_user_df, span = 0.25)
smoothed_bounce <- predict(loess_bounce)
# plot the raw series and overlay the loess fit
plot(y = daily_user_df$bounce_rate, x = daily_user_df$dayX, type="l", main="Loess Smoothing and Prediction on Bounce Rate", xlab="Date", ylab="Bounce Rate")
lines(smoothed_bounce, x=daily_user_df$dayX, col="blue")
After evaluating clickbait in the articles, we investigated its impact on the company. The loess fits suggest that the bounce rate of the articles will stay roughly constant while the number of users on the website will decline.
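A loess fit does not extrapolate beyond the observed dates, so the downward forecast for users can be illustrated by fitting a simple linear trend to the most recent stretch of the series and projecting it forward. The 90-day window and horizon below are illustrative assumptions, not values from the original analysis:
# project users forward with a linear trend on the last 90 observed days (illustrative)
recent <- tail(daily_user_df, 90)
trend_users <- lm(users ~ index, data = recent)
future <- data.frame(index = max(daily_user_df$index) + 1:90)
projected_users <- predict(trend_users, newdata = future)
head(projected_users)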
Conclusion
After analyzing different clickbait metrics such as clicks per impression, headline sentiment, and bounce rate, we determined that Upworthy has not changed its clickbait publishing practices. However, since the clickbait-free pledge, the company has seen declining user engagement, retention, and new users. Given this negative forecast for the site's user analytics, we do not recommend an investment in Upworthy.