Data Wrangling, Exploration, Visualization

Jas Dinh jmd6289

Introduction

library(ggplot2)
library(dplyr)
library(stringr)
library(gt)
library(tidyr)
library(coronavirus)
# My data
data("covid19_vaccine")
data("coronavirus")

These datasets are in the coronavirus R-package and I wanted to learn more about the current statistics of COVID-19 given the current pandemic that is affecting everyone. Having to start my college experience in a pandemic was not ideal but I want to explore the COVID-19 data and see if better times are to come. The first dataset includes the the amount of vaccinated people per country and other demographics. The second dataset includes the amount and type of COVID-19 cases per country.

Tidying: Reshaping

If your datasets are tidy already, demonstrate that you can reshape data with pivot wider/longer here (e.g., untidy and then retidy). Alternatively, it may be easier to wait until the wrangling section so you can reshape your summary statistics. Note here if you are going to do this.

The datasets are already tidy. I will be using the pivot functions to manipulate the summary statistics data tables. I used pivot wider to arrange the months as columns and pivot longer to arrange the statistics as rows. This way, I can read the change in data from left-to-right as time progresses.

Joining/Merging

vax <- covid19_vaccine %>% mutate(country = country_region) %>% 
    subset(select = -country_region) %>% select(date, people_fully_vaccinated, 
    country, population)
covid <- coronavirus %>% select(date, country, lat, long, type, 
    cases)
data <- inner_join(covid, vax, by = "country")
# 2.1
nrow(vax)
## [1] 89822
nrow(covid)
## [1] 518682
# 2.2
vax %>% select(country) %>% group_by(country) %>% summarize(n())
## # A tibble: 164 x 2
##    country             `n()`
##    <chr>               <int>
##  1 Afghanistan           216
##  2 Albania               275
##  3 Algeria               178
##  4 Andorra               255
##  5 Angola                210
##  6 Antigua and Barbuda   222
##  7 Argentina            3288
##  8 Armenia               314
##  9 Australia            1006
## 10 Austria               290
## # … with 154 more rows
covid %>% select(country) %>% group_by(country) %>% summarize(n())
## # A tibble: 195 x 2
##    country             `n()`
##    <chr>               <int>
##  1 Afghanistan          1893
##  2 Albania              1893
##  3 Algeria              1893
##  4 Andorra              1893
##  5 Angola               1893
##  6 Antigua and Barbuda  1893
##  7 Argentina            1893
##  8 Armenia              1893
##  9 Australia           15144
## 10 Austria              1893
## # … with 185 more rows
# 2.3
nrow(vax) - sum(vax$country %in% covid$country)
## [1] 304
nrow(covid) - sum(covid$country %in% vax$country)
## [1] 60576
setdiff(vax$country, covid$country)
## [1] "World"
# 2.4
data %>% select(country) %>% group_by(country) %>% summarize(n())
## # A tibble: 163 x 2
##    country                `n()`
##    <chr>                  <int>
##  1 Afghanistan           408888
##  2 Albania               520575
##  3 Algeria               336954
##  4 Andorra               482715
##  5 Angola                397530
##  6 Antigua and Barbuda   420246
##  7 Argentina            6224184
##  8 Armenia               594402
##  9 Australia           15234864
## 10 Austria               548970
## # … with 153 more rows
89822 + 518682
## [1] 608504

Both data sets use the date variable as their ID variable, but I did not join them by date. For the vaccine data set, it is the date of the vaccine record for the country while for the cases data set, it is the amount of cases for that day. I wanted to analyze the countries from both data sets, so I joined the data sets in that way. I also selected the variables that I wanted to analyze or use in visualization.

The vaccine data set has 89822 observations and 164 countries while the cases data set has 518682 observations and 195 countries. There are 304 records from countries in the vaccine data that are not in the cases data; additionally, there are 60576 records from countries in the cases data that are not in the vaccine data. The country not in common was ‘World’ which is probably a summary statistic from the data set. The 608504 records dropped after joining were from countries included in one data set but not the other and it leads to the potential problem of not being able to evaluate COVID-19 from a full global scale. After joining, there are 163 countries with 320700702 observations that I will be analyzing and each country has dated records of vaccines and COVID-19 cases from January 2020 to October 2021. The size of the joined data set is considerably larger than the original data sets, so I will be narrowing my focus as I wrangle and visualize the data.

Wrangling

# Analyzing US Cases

# function to turn numeric month into word for more visually
# appealing tables
fixMonth <- function(x) {
    x %>% mutate(Month = str_replace_all(Month, "01", "January")) %>% 
        mutate(Month = str_replace_all(Month, "02", "February")) %>% 
        mutate(Month = str_replace_all(Month, "03", "March")) %>% 
        mutate(Month = str_replace_all(Month, "04", "April")) %>% 
        mutate(Month = str_replace_all(Month, "05", "May")) %>% 
        mutate(Month = str_replace_all(Month, "06", "June")) %>% 
        mutate(Month = str_replace_all(Month, "07", "July")) %>% 
        mutate(Month = str_replace_all(Month, "08", "August")) %>% 
        mutate(Month = str_replace_all(Month, "09", "September")) %>% 
        mutate(Month = str_replace_all(Month, "10", "October")) %>% 
        mutate(Month = str_replace_all(Month, "11", "November")) %>% 
        mutate(Month = str_replace_all(Month, "12", "December"))
}

# function to pivot table for summary statistics
sumPiv <- function(x) {
    x %>% pivot_longer(cols = c(Mean, SD, Variance, Days, Minimum, 
        Maximum), names_to = "Statistics") %>% pivot_wider(names_from = Month, 
        values_from = value)
}

table2 <- data %>% filter(country == "US", type == "death") %>% 
    distinct(date.x, .keep_all = TRUE) %>% select(date.x, cases) %>% 
    filter(str_detect(date.x, "^2020(.+)")) %>% separate(col = "date.x", 
    sep = "-", into = c("Year", "Month", "Day")) %>% subset(select = -Year) %>% 
    fixMonth()

table2 %>% pivot_wider(names_from = Month, values_from = cases) %>% 
    arrange(Day) %>% gt() %>% fmt_number(columns = 2:13, sep_mark = ",", 
    decimals = 0) %>% tab_header(title = "2020 US Covid Deaths")
2020 US Covid Deaths
Day January February March April May June July August September October November December
01 NA 0 0 1,274 1,911 768 700 1,087 1,027 869 507 2,545
02 NA 0 5 1,515 1,735 981 710 438 1,073 854 572 2,818
03 NA 0 1 1,407 1,199 1,007 670 583 1,030 683 1,578 2,937
04 NA 0 4 1,609 1,377 998 315 1,284 957 374 1,107 2,697
05 NA 0 1 1,632 2,207 889 326 1,440 757 483 1,165 2,364
06 NA 0 2 1,737 2,321 642 377 1,237 469 685 1,225 1,363
07 NA 0 3 2,570 1,919 460 1,159 1,243 314 917 1,090 1,600
08 NA 0 4 2,154 1,759 508 822 1,070 457 991 574 2,621
09 NA 0 1 2,215 1,469 908 1,019 521 1,159 969 775 3,181
10 NA 0 6 2,188 991 877 814 634 922 659 1,416 2,994
11 NA 0 5 2,138 1,037 826 729 1,024 1,154 471 1,435 3,407
12 NA 0 10 1,892 1,597 827 472 1,504 710 399 1,209 2,478
13 NA 0 8 1,982 1,723 739 447 1,064 428 800 1,200 1,637
14 NA 0 7 2,453 1,775 344 915 1,337 434 987 1,347 1,650
15 NA 0 12 2,591 1,665 398 981 1,000 1,199 837 780 3,086
16 NA 0 27 2,198 1,201 810 952 615 957 944 835 3,678
17 NA 0 37 2,113 829 739 924 493 857 766 1,711 3,502
18 NA 0 60 1,940 1,228 691 857 1,227 889 491 1,915 2,966
19 NA 0 72 1,989 1,452 629 485 1,335 730 483 2,054 2,679
20 NA 0 106 2,244 1,499 556 562 1,105 267 925 1,960 1,732
21 NA 0 103 2,459 1,187 320 1,085 1,083 424 1,123 1,623 1,900
22 0 0 128 2,473 1,225 406 1,223 909 1,028 891 1,038 3,399
23 0 0 189 2,435 1,080 748 1,068 546 1,062 963 1,101 3,423
24 0 0 241 2,178 640 737 1,104 476 898 954 2,131 2,896
25 0 0 336 1,724 598 538 932 1,223 941 461 2,260 1,458
26 0 0 416 1,413 678 628 531 1,163 769 533 1,387 1,891
27 0 0 519 1,495 1,453 507 1,127 1,113 315 972 1,554 1,437
28 0 0 673 2,228 1,093 316 1,331 975 356 1,033 1,366 2,002
29 0 1 592 2,398 1,126 386 1,408 916 863 1,004 1,042 3,619
30 0 NA 711 2,181 954 555 1,236 441 954 1,057 1,344 3,792
31 0 NA 1,082 NA 618 NA 1,237 542 NA 925 NA 3,428
table2 %>% group_by(Month) %>% summarize(Mean = mean(cases), 
    SD = sd(cases), Variance = var(cases), Days = n(), Minimum = min(cases), 
    Maximum = max(cases)) %>% sumPiv() %>% subset(select = c(Statistics, 
    January, February, March, April, May, June, July, August, 
    September, October, November, December)) %>% gt() %>% fmt_number(columns = 2:13, 
    sep_mark = ",", decimals = 2) %>% tab_header(title = "2020 US Covid Deaths Monthly Statistics")
2020 US Covid Deaths Monthly Statistics
Statistics January February March April May June July August September October November December
Mean 0.00 0.03 172.94 2,027.50 1,340.19 657.93 855.42 955.74 780.00 790.42 1,310.03 2,618.71
SD 0.00 0.19 273.94 378.47 453.49 210.33 310.42 329.72 290.24 224.16 462.40 761.54
Variance 0.00 0.03 75,041.60 143,238.33 205,657.09 44,237.93 96,358.58 108,717.20 84,237.52 50,248.58 213,818.31 579,948.61
Days 10.00 29.00 31.00 30.00 31.00 30.00 31.00 31.00 30.00 31.00 30.00 31.00
Minimum 0.00 0.00 0.00 1,274.00 598.00 316.00 315.00 438.00 267.00 374.00 507.00 1,363.00
Maximum 0.00 1.00 1,082.00 2,591.00 2,321.00 1,007.00 1,408.00 1,504.00 1,199.00 1,123.00 2,260.00 3,792.00
# Analyzing Vaccines
vaxData <- data %>% filter(country == "US") %>% distinct(date.y, 
    .keep_all = TRUE) %>% select(date.y, people_fully_vaccinated, 
    population) %>% filter(str_detect(date.y, "^2021(.+)")) %>% 
    separate(col = "date.y", sep = "-", into = c("Year", "Month", 
        "Day")) %>% subset(select = -Year) %>% fixMonth() %>% 
    group_by(Month) %>% mutate(VaxPercentage = people_fully_vaccinated/population)

tableV <- vaxData %>% subset(select = -c(people_fully_vaccinated, 
    population)) %>% pivot_wider(names_from = Month, values_from = VaxPercentage)

tableV %>% gt() %>% fmt_number(columns = 2:11, decimals = 4) %>% 
    tab_header(title = "2021 US Covid-19 Percentage of Population Vaccinated")
2021 US Covid-19 Percentage of Population Vaccinated
Day January February March April May June July August September October
01 0.0000 0.0180 0.0773 0.1702 0.3139 0.4124 0.4731 0.5001 0.5299 0.5611
02 0.0000 0.0184 0.0794 0.1760 0.3180 0.4133 0.4743 0.5006 0.5311 0.5620
03 0.0000 0.0195 0.0818 0.1817 0.3203 0.4147 0.4765 0.5011 0.5328 0.5630
04 0.0000 0.0210 0.0844 0.1864 0.3222 0.4172 0.4775 0.5018 0.5341 0.5639
05 0.0000 0.0228 0.0871 0.1894 0.3258 0.4192 0.4775 0.5027 0.5341 0.5647
06 0.0000 0.0252 0.0904 0.1913 0.3306 0.4218 0.4785 0.5036 0.5341 0.5657
07 0.0000 0.0278 0.0931 0.1955 0.3365 0.4242 0.4793 0.5045 0.5362 0.5664
08 0.0000 0.0289 0.0956 0.2009 0.3418 0.4263 0.4804 0.5053 0.5376 0.5673
09 0.0000 0.0299 0.0974 0.2070 0.3468 0.4279 0.4815 0.5058 0.5385 0.5682
10 0.0000 0.0318 0.0999 0.2146 0.3507 0.4297 0.4825 0.5065 0.5400 0.5682
11 0.0000 0.0340 0.1028 0.2205 0.3538 0.4313 0.4834 0.5072 0.5413 0.5682
12 0.0000 0.0367 0.1062 0.2248 0.3571 0.4344 0.4841 0.5080 0.5424 0.5698
13 0.0000 0.0397 0.1121 0.2286 0.3612 0.4368 0.4846 0.5090 0.5433 0.5704
14 0.0041 0.0427 0.1137 0.2327 0.3650 0.4399 0.4860 0.5102 0.5442 NA
15 0.0049 0.0427 0.1164 0.2383 0.3696 0.4424 0.4869 0.5110 0.5454 NA
16 0.0049 0.0456 0.1185 0.2447 0.3742 0.4445 0.4877 0.5120 0.5466 NA
17 0.0049 0.0470 0.1214 0.2503 0.3758 0.4485 0.4887 0.5126 0.5481 NA
18 0.0049 0.0491 0.1244 0.2558 0.3777 0.4506 0.4894 0.5135 0.5495 NA
19 0.0061 0.0517 0.1273 0.2591 0.3808 0.4526 0.4901 0.5148 0.5505 NA
20 0.0066 0.0543 0.1306 0.2617 0.3843 0.4543 0.4906 0.5160 0.5516 NA
21 0.0073 0.0573 0.1340 0.2659 0.3878 0.4554 0.4914 0.5172 0.5524 NA
22 0.0084 0.0590 0.1363 0.2709 0.3916 0.4566 0.4922 0.5185 0.5536 NA
23 0.0092 0.0603 0.1382 0.2767 0.3946 0.4577 0.4930 0.5193 0.5542 NA
24 0.0098 0.0625 0.1407 0.2825 0.3964 0.4591 0.4939 0.5201 0.5553 NA
25 0.0102 0.0654 0.1439 0.2877 0.3979 0.4602 0.4948 0.5214 0.5565 NA
26 0.0106 0.0686 0.1478 0.2910 0.4002 0.4619 0.4953 0.5226 0.5575 NA
27 0.0115 0.0719 0.1522 0.2936 0.4030 0.4645 0.4957 0.5240 0.5581 NA
28 0.0129 0.0752 0.1566 0.2976 0.4053 0.4667 0.4965 0.5254 0.5623 NA
29 0.0145 NA 0.1597 0.3025 0.4080 0.4680 0.4974 0.5267 0.5595 NA
30 0.0160 NA 0.1622 0.3078 0.4080 0.4701 0.4983 0.5276 0.5603 NA
31 0.0172 NA 0.1657 NA 0.4100 NA 0.4991 0.5285 NA NA
vaxData %>% group_by(Month) %>% summarize(Mean = mean(VaxPercentage), 
    SD = sd(VaxPercentage), Variance = var(VaxPercentage), Days = n(), 
    Minimum = min(VaxPercentage), Maximum = max(VaxPercentage)) %>% 
    sumPiv() %>% subset(select = c(Statistics, January, February, 
    March, April, May, June, July, August, September, October)) %>% 
    gt() %>% fmt_number(columns = 2:11, sep_mark = ",", decimals = 2) %>% 
    tab_header(title = "2021 Percent of US Population Vaccinated Statistics")
2021 Percent of US Population Vaccinated Statistics
Statistics January February March April May June July August September October
Mean 0.01 0.04 0.12 0.24 0.37 0.44 0.49 0.51 0.55 0.57
SD 0.01 0.02 0.03 0.04 0.03 0.02 0.01 0.01 0.01 0.00
Variance 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Days 31.00 28.00 31.00 30.00 31.00 30.00 31.00 31.00 30.00 13.00
Minimum 0.00 0.02 0.08 0.17 0.31 0.41 0.47 0.50 0.53 0.56
Maximum 0.02 0.08 0.17 0.31 0.41 0.47 0.50 0.53 0.56 0.57
# Table of Top 10 Countries with COVID-19 Cases
data %>% filter(type == "confirmed") %>% group_by(country) %>% 
    summarize(total = sum(cases)) %>% arrange(-total) %>% slice(1:10) %>% 
    gt() %>% cols_label(total = "Total Cases", country = "Country") %>% 
    fmt_number(columns = 2, sep_mark = ",", decimals = 0) %>% 
    tab_header(title = "Top 10 Countries with COVID-19 Cases")
Top 10 Countries with COVID-19 Cases
Country Total Cases
India 203,682,110,510
Brazil 96,715,615,622
Colombia 27,042,690,360
United Kingdom 17,912,038,905
Argentina 17,323,331,064
Italy 16,790,179,329
Indonesia 13,458,957,326
US 13,315,876,700
Ukraine 8,722,667,184
Poland 7,788,652,900
# Categorical variable with type of case
data %>% filter(country == c("US", "India", "Brazil")) %>% select(type, 
    cases, country) %>% group_by(country, type) %>% summarize(total = sum(cases))
## # A tibble: 9 x 3
## # Groups:   country [3]
##   country type            total
##   <chr>   <chr>           <dbl>
## 1 Brazil  confirmed 32238678261
## 2 Brazil  death       897947197
## 3 Brazil  recovery     11635260
## 4 India   confirmed 67894020791
## 5 India   death       900912364
## 6 India   recovery     20589877
## 7 US      confirmed  4438293986
## 8 US      death        71477928
## 9 US      recovery      2145291
data %>% select(type, cases) %>% mutate(cases = abs(cases)) %>% 
    group_by(type) %>% summarize(total = sum(cases)) %>% gt() %>% 
    fmt_number(columns = 2, sep_mark = ",", decimals = 0) %>% 
    tab_header(title = "Global Total of Cases")
Global Total of Cases
type total
confirmed 473,816,203,792
death 9,287,821,994
recovery 754,925,335,092

My data has 9 variables: country, date recorded for vaccine, cumulative sum of vaccinated people, date recorded for case number, number of COVID-19 cases on that date, type of case, latitude, longitude, and population. Because it did not make sense to summarize all variables(such as dates) and because my data set is so large, I decided to narrow my data set to focus on the US for the numeric variables and summarized death cases for 2020 and percentage of population vaccinated in 2021. Additionally, for the categorical variables, I put the counts for the top ten countries with COVID-19 cases and the counts for type of cases- looking at the US and the top two countries on the previous table.

I thought that the top ten countries for COVID-19 cases was interesting. I expected more European countries since they are densely populated. I was also shocked to see India and Brazil at the top by a major gap. Another piece of data I found was that only 57% of the US population is vaccinated since October which is disappointing since the vaccine is more accessible in the US compared to other countries.

Visualizing

deathsD <- data %>% select(country, type, cases, date.x) %>% 
    filter(type == "death") %>% group_by(country) %>% summarize(Total = sum(cases))
casesD <- data %>% select(country, cases) %>% group_by(country) %>% 
    summarize(Total = sum(cases))
vaccinesD <- data %>% select(country, people_fully_vaccinated, 
    population) %>% group_by(country) %>% summarize(VaxPercent = max(people_fully_vaccinated/population), 
    Population = max(population)) %>% na.omit()
data2 <- inner_join(casesD, vaccinesD, by = "country") %>% na.omit()
data20 <- data2 %>% mutate(CasesPercent = (Total/Population)) %>% 
    subset(select = -c(Total, Population))


ggplot(data20, aes(x = VaxPercent, y = CasesPercent)) + geom_point(aes(color = country)) + 
    geom_smooth(method = "lm") + labs(title = "Percent of Population Vaccinated Vs Covid-19 Cases per Population", 
    x = "Percentage of Population Vaccinated ", y = "COVID-19 Cases per Population") + 
    scale_y_continuous(breaks = seq(0, 100, 10)) + theme_light() + 
    theme(legend.position = "none")

As COVID-19 cases per population increases, the country has a higher percentage of people vaccinated. This means a country is more likely to get vaccinated if they have more COVID-19 cases. There are also some countries along the x-axis meaning that they have a low amount of cases per population but a high vaccination rate; fortunately, there are none vice-versa. I was curious to see where the US lies which leads to me to plot 2. There is also a cluster of data around the origin which have a low amount of vaccinated people and COVID-19 cases which makes me investigate in plot 3.

dataP <- data %>% filter(country == "US") %>% select(date.x, 
    cases, date.y, people_fully_vaccinated)
dataP2 <- dataP %>% filter(cases > 0)

ggplot(dataP, aes(x = date.y, y = people_fully_vaccinated)) + 
    geom_point() + geom_line(color = "red") + scale_y_continuous(labels = scales::number) + 
    labs(title = "2021 US Population Vaccinated", x = "Month", 
        y = "Total Number of People Vaccinated") + theme_gray()

ggplot(dataP2, aes(x = date.x)) + geom_bar(aes(y = cases), stat = "summary", 
    fun = mean) + geom_line(aes(y = cases), stat = "summary", 
    fun = mean) + scale_y_continuous(breaks = seq(0, 150000, 
    25000)) + theme_bw() + labs(title = "US COVID-19 Cases", 
    x = "Date", y = "Number of Cases")

After seeing the global relationship of vaccination rate and COVID-19 cases, I wanted to specifically examine the United States. The graph of US COVID-19 cases display that there were major spikes around the inital March 2020 outbreak, beginning of school in August 2020 and 2021, holiday season around January 2021, and summer break of 2021. It is interesting to see that even after vaccines were released, there was the lowest amount of cases followed by a major spike in cases in fall 2021. It looks like we are currently in a downward trend, but I would not be surprised to see if we go up again based on past data for holiday season.

Additionally, I made another plot to depict the total number of people vaccinated over the course of 2021. I was interested to see the S-curve and how the country is slowing in vaccinating people. The age limit on COVID-19 vaccines is lowering, so I hope people continue getting vaccinated and achieve herd immunity.

data3 <- data %>% filter(type == "confirmed") %>% group_by(country) %>% 
    summarize(Longitude = max(long), Latitude = max(lat), Population = max(population), 
        Vax = max(people_fully_vaccinated)) %>% na.omit()

world <- map_data("world")

ggplot(data3) + geom_polygon(aes(long, lat, group = group), data = world, 
    fill = "black") + geom_point(aes(Longitude, Latitude, color = Vax/Population)) + 
    coord_map() + scale_color_gradient(low = "yellow", high = "green", 
    name = "Percentage of Population \n Vaccinated") + theme_dark() + 
    labs(title = "Vaccination Percentages of Countries Around the World", 
        x = "Longitude", y = "Latitude")

After the first plot, I wanted to see if location played a role in vaccination status. This plot depicts that European countries are high in percentage of population vaccinated while African countries are low in contrast. The US is also in the middle for percentage compared to other countries which is surprising since we were one of the first who developed a vaccine but not surprising considering the current vaccine-hesitancy. There are also islands that have high vaccinated percentages which is probably due to their low population.

Concluding Remarks

In conclusion, I still believe the United States has a long way to go in order to recover from the pandemic. We are in top 10 for COVID-19 cases yet we are lagging behind in percentage of vaccinated population. In comparison with other first-world countries, we are definitely behind and more needs to be done to encourage vaccination. Based on the cases data for the United States, I predict another wave of cases as we hit holiday season. In future studies, I believe more research can be done on how to flatten the bell curve for cases and this data could be used to identify countries that have more success as well as predict human behavior in spreading disease.