Project 1: Wrangling, Exploration, Visualization

Introduction

library(ggplot2)
library(dplyr)
library(stringr)
library(gt)
library(tidyr)
library(coronavirus)
# My data
data("covid19_vaccine")
data("coronavirus")

These datasets are in the coronavirus R-package and I wanted to learn more about the current statistics of COVID-19 given the current pandemic that is affecting everyone. Having to start my college experience in a pandemic was not ideal but I want to explore the COVID-19 data and see if better times are to come. The first dataset includes the the amount of vaccinated people per country and other demographics. The second dataset includes the amount and type of COVID-19 cases per country.

Tidying: Reshaping

If your datasets are tidy already, demonstrate that you can reshape data with pivot wider/longer here (e.g., untidy and then retidy). Alternatively, it may be easier to wait until the wrangling section so you can reshape your summary statistics. Note here if you are going to do this.

The datasets are already tidy. I will be using the pivot functions to manipulate the summary statistics data tables. I used pivot wider to arrange the months as columns and pivot longer to arrange the statistics as rows. This way, I can read the change in data from left-to-right as time progresses.

Joining/Merging

vax <- covid19_vaccine %>% mutate(country = country_region) %>% 
    subset(select = -country_region) %>% select(date, people_fully_vaccinated, 
    country, population)
covid <- coronavirus %>% select(date, country, lat, long, type, 
    cases)
data <- inner_join(covid, vax, by = "country")
# 2.1
nrow(vax)

## [1] 89822

nrow(covid)

## [1] 518682

# 2.2
vax %>% select(country) %>% group_by(country) %>% summarize(n())

## # A tibble: 164 x 2
##    country             `n()`
##    <chr>               <int>
##  1 Afghanistan           216
##  2 Albania               275
##  3 Algeria               178
##  4 Andorra               255
##  5 Angola                210
##  6 Antigua and Barbuda   222
##  7 Argentina            3288
##  8 Armenia               314
##  9 Australia            1006
## 10 Austria               290
## # … with 154 more rows

covid %>% select(country) %>% group_by(country) %>% summarize(n())

## # A tibble: 195 x 2
##    country             `n()`
##    <chr>               <int>
##  1 Afghanistan          1893
##  2 Albania              1893
##  3 Algeria              1893
##  4 Andorra              1893
##  5 Angola               1893
##  6 Antigua and Barbuda  1893
##  7 Argentina            1893
##  8 Armenia              1893
##  9 Australia           15144
## 10 Austria              1893
## # … with 185 more rows

# 2.3
nrow(vax) - sum(vax$country %in% covid$country)

## [1] 304

nrow(covid) - sum(covid$country %in% vax$country)

## [1] 60576

setdiff(vax$country, covid$country)

## [1] "World"

# 2.4
data %>% select(country) %>% group_by(country) %>% summarize(n())

## # A tibble: 163 x 2
##    country                `n()`
##    <chr>                  <int>
##  1 Afghanistan           408888
##  2 Albania               520575
##  3 Algeria               336954
##  4 Andorra               482715
##  5 Angola                397530
##  6 Antigua and Barbuda   420246
##  7 Argentina            6224184
##  8 Armenia               594402
##  9 Australia           15234864
## 10 Austria               548970
## # … with 153 more rows

89822 + 518682

## [1] 608504

Both data sets use the date variable as their ID variable, but I did not join them by date. For the vaccine data set, it is the date of the vaccine record for the country while for the cases data set, it is the amount of cases for that day. I wanted to analyze the countries from both data sets, so I joined the data sets in that way. I also selected the variables that I wanted to analyze or use in visualization.

The vaccine data set has 89822 observations and 164 countries while the cases data set has 518682 observations and 195 countries. There are 304 records from countries in the vaccine data that are not in the cases data; additionally, there are 60576 records from countries in the cases data that are not in the vaccine data. The country not in common was ‘World’ which is probably a summary statistic from the data set. The 608504 records dropped after joining were from countries included in one data set but not the other and it leads to the potential problem of not being able to evaluate COVID-19 from a full global scale. After joining, there are 163 countries with 320700702 observations that I will be analyzing and each country has dated records of vaccines and COVID-19 cases from January 2020 to October 2021. The size of the joined data set is considerably larger than the original data sets, so I will be narrowing my focus as I wrangle and visualize the data.

Wrangling

# Analyzing US Cases

# function to turn numeric month into word for more visually
# appealing tables
fixMonth <- function(x) {
    x %>% mutate(Month = str_replace_all(Month, "01", "January")) %>% 
        mutate(Month = str_replace_all(Month, "02", "February")) %>% 
        mutate(Month = str_replace_all(Month, "03", "March")) %>% 
        mutate(Month = str_replace_all(Month, "04", "April")) %>% 
        mutate(Month = str_replace_all(Month, "05", "May")) %>% 
        mutate(Month = str_replace_all(Month, "06", "June")) %>% 
        mutate(Month = str_replace_all(Month, "07", "July")) %>% 
        mutate(Month = str_replace_all(Month, "08", "August")) %>% 
        mutate(Month = str_replace_all(Month, "09", "September")) %>% 
        mutate(Month = str_replace_all(Month, "10", "October")) %>% 
        mutate(Month = str_replace_all(Month, "11", "November")) %>% 
        mutate(Month = str_replace_all(Month, "12", "December"))
}

# function to pivot table for summary statistics
sumPiv <- function(x) {
    x %>% pivot_longer(cols = c(Mean, SD, Variance, Days, Minimum, 
        Maximum), names_to = "Statistics") %>% pivot_wider(names_from = Month, 
        values_from = value)
}

table2 <- data %>% filter(country == "US", type == "death") %>% 
    distinct(date.x, .keep_all = TRUE) %>% select(date.x, cases) %>% 
    filter(str_detect(date.x, "^2020(.+)")) %>% separate(col = "date.x", 
    sep = "-", into = c("Year", "Month", "Day")) %>% subset(select = -Year) %>% 
    fixMonth()

table2 %>% pivot_wider(names_from = Month, values_from = cases) %>% 
    arrange(Day) %>% gt() %>% fmt_number(columns = 2:13, sep_mark = ",", 
    decimals = 0) %>% tab_header(title = "2020 US Covid Deaths")

2020 US Covid Deaths

Day	January	February	March	April	May	June	July	August	September	October	November	December
01	NA	0	0	1,274	1,911	768	700	1,087	1,027	869	507	2,545
02	NA	0	5	1,515	1,735	981	710	438	1,073	854	572	2,818
03	NA	0	1	1,407	1,199	1,007	670	583	1,030	683	1,578	2,937
04	NA	0	4	1,609	1,377	998	315	1,284	957	374	1,107	2,697
05	NA	0	1	1,632	2,207	889	326	1,440	757	483	1,165	2,364
06	NA	0	2	1,737	2,321	642	377	1,237	469	685	1,225	1,363
07	NA	0	3	2,570	1,919	460	1,159	1,243	314	917	1,090	1,600
08	NA	0	4	2,154	1,759	508	822	1,070	457	991	574	2,621
09	NA	0	1	2,215	1,469	908	1,019	521	1,159	969	775	3,181
10	NA	0	6	2,188	991	877	814	634	922	659	1,416	2,994
11	NA	0	5	2,138	1,037	826	729	1,024	1,154	471	1,435	3,407
12	NA	0	10	1,892	1,597	827	472	1,504	710	399	1,209	2,478
13	NA	0	8	1,982	1,723	739	447	1,064	428	800	1,200	1,637
14	NA	0	7	2,453	1,775	344	915	1,337	434	987	1,347	1,650
15	NA	0	12	2,591	1,665	398	981	1,000	1,199	837	780	3,086
16	NA	0	27	2,198	1,201	810	952	615	957	944	835	3,678
17	NA	0	37	2,113	829	739	924	493	857	766	1,711	3,502
18	NA	0	60	1,940	1,228	691	857	1,227	889	491	1,915	2,966
19	NA	0	72	1,989	1,452	629	485	1,335	730	483	2,054	2,679
20	NA	0	106	2,244	1,499	556	562	1,105	267	925	1,960	1,732
21	NA	0	103	2,459	1,187	320	1,085	1,083	424	1,123	1,623	1,900
22	0	0	128	2,473	1,225	406	1,223	909	1,028	891	1,038	3,399
23	0	0	189	2,435	1,080	748	1,068	546	1,062	963	1,101	3,423
24	0	0	241	2,178	640	737	1,104	476	898	954	2,131	2,896
25	0	0	336	1,724	598	538	932	1,223	941	461	2,260	1,458
26	0	0	416	1,413	678	628	531	1,163	769	533	1,387	1,891
27	0	0	519	1,495	1,453	507	1,127	1,113	315	972	1,554	1,437
28	0	0	673	2,228	1,093	316	1,331	975	356	1,033	1,366	2,002
29	0	1	592	2,398	1,126	386	1,408	916	863	1,004	1,042	3,619
30	0	NA	711	2,181	954	555	1,236	441	954	1,057	1,344	3,792
31	0	NA	1,082	NA	618	NA	1,237	542	NA	925	NA	3,428

table2 %>% group_by(Month) %>% summarize(Mean = mean(cases), 
    SD = sd(cases), Variance = var(cases), Days = n(), Minimum = min(cases), 
    Maximum = max(cases)) %>% sumPiv() %>% subset(select = c(Statistics, 
    January, February, March, April, May, June, July, August, 
    September, October, November, December)) %>% gt() %>% fmt_number(columns = 2:13, 
    sep_mark = ",", decimals = 2) %>% tab_header(title = "2020 US Covid Deaths Monthly Statistics")

2020 US Covid Deaths Monthly Statistics

Statistics	January	February	March	April	May	June	July	August	September	October	November	December
Mean	0.00	0.03	172.94	2,027.50	1,340.19	657.93	855.42	955.74	780.00	790.42	1,310.03	2,618.71
SD	0.00	0.19	273.94	378.47	453.49	210.33	310.42	329.72	290.24	224.16	462.40	761.54
Variance	0.00	0.03	75,041.60	143,238.33	205,657.09	44,237.93	96,358.58	108,717.20	84,237.52	50,248.58	213,818.31	579,948.61
Days	10.00	29.00	31.00	30.00	31.00	30.00	31.00	31.00	30.00	31.00	30.00	31.00
Minimum	0.00	0.00	0.00	1,274.00	598.00	316.00	315.00	438.00	267.00	374.00	507.00	1,363.00
Maximum	0.00	1.00	1,082.00	2,591.00	2,321.00	1,007.00	1,408.00	1,504.00	1,199.00	1,123.00	2,260.00	3,792.00

# Analyzing Vaccines
vaxData <- data %>% filter(country == "US") %>% distinct(date.y, 
    .keep_all = TRUE) %>% select(date.y, people_fully_vaccinated, 
    population) %>% filter(str_detect(date.y, "^2021(.+)")) %>% 
    separate(col = "date.y", sep = "-", into = c("Year", "Month", 
        "Day")) %>% subset(select = -Year) %>% fixMonth() %>% 
    group_by(Month) %>% mutate(VaxPercentage = people_fully_vaccinated/population)

tableV <- vaxData %>% subset(select = -c(people_fully_vaccinated, 
    population)) %>% pivot_wider(names_from = Month, values_from = VaxPercentage)

tableV %>% gt() %>% fmt_number(columns = 2:11, decimals = 4) %>% 
    tab_header(title = "2021 US Covid-19 Percentage of Population Vaccinated")

2021 US Covid-19 Percentage of Population Vaccinated

Day	January	February	March	April	May	June	July	August	September	October
01	0.0000	0.0180	0.0773	0.1702	0.3139	0.4124	0.4731	0.5001	0.5299	0.5611
02	0.0000	0.0184	0.0794	0.1760	0.3180	0.4133	0.4743	0.5006	0.5311	0.5620
03	0.0000	0.0195	0.0818	0.1817	0.3203	0.4147	0.4765	0.5011	0.5328	0.5630
04	0.0000	0.0210	0.0844	0.1864	0.3222	0.4172	0.4775	0.5018	0.5341	0.5639
05	0.0000	0.0228	0.0871	0.1894	0.3258	0.4192	0.4775	0.5027	0.5341	0.5647
06	0.0000	0.0252	0.0904	0.1913	0.3306	0.4218	0.4785	0.5036	0.5341	0.5657
07	0.0000	0.0278	0.0931	0.1955	0.3365	0.4242	0.4793	0.5045	0.5362	0.5664
08	0.0000	0.0289	0.0956	0.2009	0.3418	0.4263	0.4804	0.5053	0.5376	0.5673
09	0.0000	0.0299	0.0974	0.2070	0.3468	0.4279	0.4815	0.5058	0.5385	0.5682
10	0.0000	0.0318	0.0999	0.2146	0.3507	0.4297	0.4825	0.5065	0.5400	0.5682
11	0.0000	0.0340	0.1028	0.2205	0.3538	0.4313	0.4834	0.5072	0.5413	0.5682
12	0.0000	0.0367	0.1062	0.2248	0.3571	0.4344	0.4841	0.5080	0.5424	0.5698
13	0.0000	0.0397	0.1121	0.2286	0.3612	0.4368	0.4846	0.5090	0.5433	0.5704
14	0.0041	0.0427	0.1137	0.2327	0.3650	0.4399	0.4860	0.5102	0.5442	NA
15	0.0049	0.0427	0.1164	0.2383	0.3696	0.4424	0.4869	0.5110	0.5454	NA
16	0.0049	0.0456	0.1185	0.2447	0.3742	0.4445	0.4877	0.5120	0.5466	NA
17	0.0049	0.0470	0.1214	0.2503	0.3758	0.4485	0.4887	0.5126	0.5481	NA
18	0.0049	0.0491	0.1244	0.2558	0.3777	0.4506	0.4894	0.5135	0.5495	NA
19	0.0061	0.0517	0.1273	0.2591	0.3808	0.4526	0.4901	0.5148	0.5505	NA
20	0.0066	0.0543	0.1306	0.2617	0.3843	0.4543	0.4906	0.5160	0.5516	NA
21	0.0073	0.0573	0.1340	0.2659	0.3878	0.4554	0.4914	0.5172	0.5524	NA
22	0.0084	0.0590	0.1363	0.2709	0.3916	0.4566	0.4922	0.5185	0.5536	NA
23	0.0092	0.0603	0.1382	0.2767	0.3946	0.4577	0.4930	0.5193	0.5542	NA
24	0.0098	0.0625	0.1407	0.2825	0.3964	0.4591	0.4939	0.5201	0.5553	NA
25	0.0102	0.0654	0.1439	0.2877	0.3979	0.4602	0.4948	0.5214	0.5565	NA
26	0.0106	0.0686	0.1478	0.2910	0.4002	0.4619	0.4953	0.5226	0.5575	NA
27	0.0115	0.0719	0.1522	0.2936	0.4030	0.4645	0.4957	0.5240	0.5581	NA
28	0.0129	0.0752	0.1566	0.2976	0.4053	0.4667	0.4965	0.5254	0.5623	NA
29	0.0145	NA	0.1597	0.3025	0.4080	0.4680	0.4974	0.5267	0.5595	NA
30	0.0160	NA	0.1622	0.3078	0.4080	0.4701	0.4983	0.5276	0.5603	NA
31	0.0172	NA	0.1657	NA	0.4100	NA	0.4991	0.5285	NA	NA

vaxData %>% group_by(Month) %>% summarize(Mean = mean(VaxPercentage), 
    SD = sd(VaxPercentage), Variance = var(VaxPercentage), Days = n(), 
    Minimum = min(VaxPercentage), Maximum = max(VaxPercentage)) %>% 
    sumPiv() %>% subset(select = c(Statistics, January, February, 
    March, April, May, June, July, August, September, October)) %>% 
    gt() %>% fmt_number(columns = 2:11, sep_mark = ",", decimals = 2) %>% 
    tab_header(title = "2021 Percent of US Population Vaccinated Statistics")

2021 Percent of US Population Vaccinated Statistics

Statistics	January	February	March	April	May	June	July	August	September	October
Mean	0.01	0.04	0.12	0.24	0.37	0.44	0.49	0.51	0.55	0.57
SD	0.01	0.02	0.03	0.04	0.03	0.02	0.01	0.01	0.01	0.00
Variance	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Days	31.00	28.00	31.00	30.00	31.00	30.00	31.00	31.00	30.00	13.00
Minimum	0.00	0.02	0.08	0.17	0.31	0.41	0.47	0.50	0.53	0.56
Maximum	0.02	0.08	0.17	0.31	0.41	0.47	0.50	0.53	0.56	0.57

# Table of Top 10 Countries with COVID-19 Cases
data %>% filter(type == "confirmed") %>% group_by(country) %>% 
    summarize(total = sum(cases)) %>% arrange(-total) %>% slice(1:10) %>% 
    gt() %>% cols_label(total = "Total Cases", country = "Country") %>% 
    fmt_number(columns = 2, sep_mark = ",", decimals = 0) %>% 
    tab_header(title = "Top 10 Countries with COVID-19 Cases")

Top 10 Countries with COVID-19 Cases

Country	Total Cases
India	203,682,110,510
Brazil	96,715,615,622
Colombia	27,042,690,360
United Kingdom	17,912,038,905
Argentina	17,323,331,064
Italy	16,790,179,329
Indonesia	13,458,957,326
US	13,315,876,700
Ukraine	8,722,667,184
Poland	7,788,652,900

# Categorical variable with type of case
data %>% filter(country == c("US", "India", "Brazil")) %>% select(type, 
    cases, country) %>% group_by(country, type) %>% summarize(total = sum(cases))

## # A tibble: 9 x 3
## # Groups:   country [3]
##   country type            total
##   <chr>   <chr>           <dbl>
## 1 Brazil  confirmed 32238678261
## 2 Brazil  death       897947197
## 3 Brazil  recovery     11635260
## 4 India   confirmed 67894020791
## 5 India   death       900912364
## 6 India   recovery     20589877
## 7 US      confirmed  4438293986
## 8 US      death        71477928
## 9 US      recovery      2145291

data %>% select(type, cases) %>% mutate(cases = abs(cases)) %>% 
    group_by(type) %>% summarize(total = sum(cases)) %>% gt() %>% 
    fmt_number(columns = 2, sep_mark = ",", decimals = 0) %>% 
    tab_header(title = "Global Total of Cases")

Global Total of Cases

type	total
confirmed	473,816,203,792
death	9,287,821,994
recovery	754,925,335,092

My data has 9 variables: country, date recorded for vaccine, cumulative sum of vaccinated people, date recorded for case number, number of COVID-19 cases on that date, type of case, latitude, longitude, and population. Because it did not make sense to summarize all variables(such as dates) and because my data set is so large, I decided to narrow my data set to focus on the US for the numeric variables and summarized death cases for 2020 and percentage of population vaccinated in 2021. Additionally, for the categorical variables, I put the counts for the top ten countries with COVID-19 cases and the counts for type of cases- looking at the US and the top two countries on the previous table.

I thought that the top ten countries for COVID-19 cases was interesting. I expected more European countries since they are densely populated. I was also shocked to see India and Brazil at the top by a major gap. Another piece of data I found was that only 57% of the US population is vaccinated since October which is disappointing since the vaccine is more accessible in the US compared to other countries.

Visualizing

deathsD <- data %>% select(country, type, cases, date.x) %>% 
    filter(type == "death") %>% group_by(country) %>% summarize(Total = sum(cases))
casesD <- data %>% select(country, cases) %>% group_by(country) %>% 
    summarize(Total = sum(cases))
vaccinesD <- data %>% select(country, people_fully_vaccinated, 
    population) %>% group_by(country) %>% summarize(VaxPercent = max(people_fully_vaccinated/population), 
    Population = max(population)) %>% na.omit()
data2 <- inner_join(casesD, vaccinesD, by = "country") %>% na.omit()
data20 <- data2 %>% mutate(CasesPercent = (Total/Population)) %>% 
    subset(select = -c(Total, Population))


ggplot(data20, aes(x = VaxPercent, y = CasesPercent)) + geom_point(aes(color = country)) + 
    geom_smooth(method = "lm") + labs(title = "Percent of Population Vaccinated Vs Covid-19 Cases per Population", 
    x = "Percentage of Population Vaccinated ", y = "COVID-19 Cases per Population") + 
    scale_y_continuous(breaks = seq(0, 100, 10)) + theme_light() + 
    theme(legend.position = "none")

As COVID-19 cases per population increases, the country has a higher percentage of people vaccinated. This means a country is more likely to get vaccinated if they have more COVID-19 cases. There are also some countries along the x-axis meaning that they have a low amount of cases per population but a high vaccination rate; fortunately, there are none vice-versa. I was curious to see where the US lies which leads to me to plot 2. There is also a cluster of data around the origin which have a low amount of vaccinated people and COVID-19 cases which makes me investigate in plot 3.

dataP <- data %>% filter(country == "US") %>% select(date.x, 
    cases, date.y, people_fully_vaccinated)
dataP2 <- dataP %>% filter(cases > 0)

ggplot(dataP, aes(x = date.y, y = people_fully_vaccinated)) + 
    geom_point() + geom_line(color = "red") + scale_y_continuous(labels = scales::number) + 
    labs(title = "2021 US Population Vaccinated", x = "Month", 
        y = "Total Number of People Vaccinated") + theme_gray()

ggplot(dataP2, aes(x = date.x)) + geom_bar(aes(y = cases), stat = "summary", 
    fun = mean) + geom_line(aes(y = cases), stat = "summary", 
    fun = mean) + scale_y_continuous(breaks = seq(0, 150000, 
    25000)) + theme_bw() + labs(title = "US COVID-19 Cases", 
    x = "Date", y = "Number of Cases")

After seeing the global relationship of vaccination rate and COVID-19 cases, I wanted to specifically examine the United States. The graph of US COVID-19 cases display that there were major spikes around the inital March 2020 outbreak, beginning of school in August 2020 and 2021, holiday season around January 2021, and summer break of 2021. It is interesting to see that even after vaccines were released, there was the lowest amount of cases followed by a major spike in cases in fall 2021. It looks like we are currently in a downward trend, but I would not be surprised to see if we go up again based on past data for holiday season.

Additionally, I made another plot to depict the total number of people vaccinated over the course of 2021. I was interested to see the S-curve and how the country is slowing in vaccinating people. The age limit on COVID-19 vaccines is lowering, so I hope people continue getting vaccinated and achieve herd immunity.

data3 <- data %>% filter(type == "confirmed") %>% group_by(country) %>% 
    summarize(Longitude = max(long), Latitude = max(lat), Population = max(population), 
        Vax = max(people_fully_vaccinated)) %>% na.omit()

world <- map_data("world")

ggplot(data3) + geom_polygon(aes(long, lat, group = group), data = world, 
    fill = "black") + geom_point(aes(Longitude, Latitude, color = Vax/Population)) + 
    coord_map() + scale_color_gradient(low = "yellow", high = "green", 
    name = "Percentage of Population \n Vaccinated") + theme_dark() + 
    labs(title = "Vaccination Percentages of Countries Around the World", 
        x = "Longitude", y = "Latitude")

After the first plot, I wanted to see if location played a role in vaccination status. This plot depicts that European countries are high in percentage of population vaccinated while African countries are low in contrast. The US is also in the middle for percentage compared to other countries which is surprising since we were one of the first who developed a vaccine but not surprising considering the current vaccine-hesitancy. There are also islands that have high vaccinated percentages which is probably due to their low population.

Concluding Remarks

In conclusion, I still believe the United States has a long way to go in order to recover from the pandemic. We are in top 10 for COVID-19 cases yet we are lagging behind in percentage of vaccinated population. In comparison with other first-world countries, we are definitely behind and more needs to be done to encourage vaccination. Based on the cases data for the United States, I predict another wave of cases as we hit holiday season. In future studies, I believe more research can be done on how to flatten the bell curve for cases and this data could be used to identify countries that have more success as well as predict human behavior in spreading disease.