library(ggplot2)
library(dplyr)
library(stringr)
library(gt)
library(tidyr)
library(coronavirus)
# My data
data("covid19_vaccine")
data("coronavirus")
These datasets are in the coronavirus R-package and I wanted to learn more about the current statistics of COVID-19 given the current pandemic that is affecting everyone. Having to start my college experience in a pandemic was not ideal but I want to explore the COVID-19 data and see if better times are to come. The first dataset includes the the amount of vaccinated people per country and other demographics. The second dataset includes the amount and type of COVID-19 cases per country.
If your datasets are tidy already, demonstrate that you can reshape data with pivot wider/longer here (e.g., untidy and then retidy). Alternatively, it may be easier to wait until the wrangling section so you can reshape your summary statistics. Note here if you are going to do this.
The datasets are already tidy. I will be using the pivot functions to manipulate the summary statistics data tables. I used pivot wider to arrange the months as columns and pivot longer to arrange the statistics as rows. This way, I can read the change in data from left-to-right as time progresses.
vax <- covid19_vaccine %>% mutate(country = country_region) %>%
subset(select = -country_region) %>% select(date, people_fully_vaccinated,
country, population)
covid <- coronavirus %>% select(date, country, lat, long, type,
cases)
data <- inner_join(covid, vax, by = "country")
# 2.1
nrow(vax)
## [1] 89822
nrow(covid)
## [1] 518682
# 2.2
vax %>% select(country) %>% group_by(country) %>% summarize(n())
## # A tibble: 164 x 2
## country `n()`
## <chr> <int>
## 1 Afghanistan 216
## 2 Albania 275
## 3 Algeria 178
## 4 Andorra 255
## 5 Angola 210
## 6 Antigua and Barbuda 222
## 7 Argentina 3288
## 8 Armenia 314
## 9 Australia 1006
## 10 Austria 290
## # … with 154 more rows
covid %>% select(country) %>% group_by(country) %>% summarize(n())
## # A tibble: 195 x 2
## country `n()`
## <chr> <int>
## 1 Afghanistan 1893
## 2 Albania 1893
## 3 Algeria 1893
## 4 Andorra 1893
## 5 Angola 1893
## 6 Antigua and Barbuda 1893
## 7 Argentina 1893
## 8 Armenia 1893
## 9 Australia 15144
## 10 Austria 1893
## # … with 185 more rows
# 2.3
nrow(vax) - sum(vax$country %in% covid$country)
## [1] 304
nrow(covid) - sum(covid$country %in% vax$country)
## [1] 60576
setdiff(vax$country, covid$country)
## [1] "World"
# 2.4
data %>% select(country) %>% group_by(country) %>% summarize(n())
## # A tibble: 163 x 2
## country `n()`
## <chr> <int>
## 1 Afghanistan 408888
## 2 Albania 520575
## 3 Algeria 336954
## 4 Andorra 482715
## 5 Angola 397530
## 6 Antigua and Barbuda 420246
## 7 Argentina 6224184
## 8 Armenia 594402
## 9 Australia 15234864
## 10 Austria 548970
## # … with 153 more rows
89822 + 518682
## [1] 608504
Both data sets use the date variable as their ID variable, but I did not join them by date. For the vaccine data set, it is the date of the vaccine record for the country while for the cases data set, it is the amount of cases for that day. I wanted to analyze the countries from both data sets, so I joined the data sets in that way. I also selected the variables that I wanted to analyze or use in visualization.
The vaccine data set has 89822 observations and 164 countries while the cases data set has 518682 observations and 195 countries. There are 304 records from countries in the vaccine data that are not in the cases data; additionally, there are 60576 records from countries in the cases data that are not in the vaccine data. The country not in common was ‘World’ which is probably a summary statistic from the data set. The 608504 records dropped after joining were from countries included in one data set but not the other and it leads to the potential problem of not being able to evaluate COVID-19 from a full global scale. After joining, there are 163 countries with 320700702 observations that I will be analyzing and each country has dated records of vaccines and COVID-19 cases from January 2020 to October 2021. The size of the joined data set is considerably larger than the original data sets, so I will be narrowing my focus as I wrangle and visualize the data.
# Analyzing US Cases
# function to turn numeric month into word for more visually
# appealing tables
fixMonth <- function(x) {
x %>% mutate(Month = str_replace_all(Month, "01", "January")) %>%
mutate(Month = str_replace_all(Month, "02", "February")) %>%
mutate(Month = str_replace_all(Month, "03", "March")) %>%
mutate(Month = str_replace_all(Month, "04", "April")) %>%
mutate(Month = str_replace_all(Month, "05", "May")) %>%
mutate(Month = str_replace_all(Month, "06", "June")) %>%
mutate(Month = str_replace_all(Month, "07", "July")) %>%
mutate(Month = str_replace_all(Month, "08", "August")) %>%
mutate(Month = str_replace_all(Month, "09", "September")) %>%
mutate(Month = str_replace_all(Month, "10", "October")) %>%
mutate(Month = str_replace_all(Month, "11", "November")) %>%
mutate(Month = str_replace_all(Month, "12", "December"))
}
# function to pivot table for summary statistics
sumPiv <- function(x) {
x %>% pivot_longer(cols = c(Mean, SD, Variance, Days, Minimum,
Maximum), names_to = "Statistics") %>% pivot_wider(names_from = Month,
values_from = value)
}
table2 <- data %>% filter(country == "US", type == "death") %>%
distinct(date.x, .keep_all = TRUE) %>% select(date.x, cases) %>%
filter(str_detect(date.x, "^2020(.+)")) %>% separate(col = "date.x",
sep = "-", into = c("Year", "Month", "Day")) %>% subset(select = -Year) %>%
fixMonth()
table2 %>% pivot_wider(names_from = Month, values_from = cases) %>%
arrange(Day) %>% gt() %>% fmt_number(columns = 2:13, sep_mark = ",",
decimals = 0) %>% tab_header(title = "2020 US Covid Deaths")
2020 US Covid Deaths | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Day | January | February | March | April | May | June | July | August | September | October | November | December |
01 | NA | 0 | 0 | 1,274 | 1,911 | 768 | 700 | 1,087 | 1,027 | 869 | 507 | 2,545 |
02 | NA | 0 | 5 | 1,515 | 1,735 | 981 | 710 | 438 | 1,073 | 854 | 572 | 2,818 |
03 | NA | 0 | 1 | 1,407 | 1,199 | 1,007 | 670 | 583 | 1,030 | 683 | 1,578 | 2,937 |
04 | NA | 0 | 4 | 1,609 | 1,377 | 998 | 315 | 1,284 | 957 | 374 | 1,107 | 2,697 |
05 | NA | 0 | 1 | 1,632 | 2,207 | 889 | 326 | 1,440 | 757 | 483 | 1,165 | 2,364 |
06 | NA | 0 | 2 | 1,737 | 2,321 | 642 | 377 | 1,237 | 469 | 685 | 1,225 | 1,363 |
07 | NA | 0 | 3 | 2,570 | 1,919 | 460 | 1,159 | 1,243 | 314 | 917 | 1,090 | 1,600 |
08 | NA | 0 | 4 | 2,154 | 1,759 | 508 | 822 | 1,070 | 457 | 991 | 574 | 2,621 |
09 | NA | 0 | 1 | 2,215 | 1,469 | 908 | 1,019 | 521 | 1,159 | 969 | 775 | 3,181 |
10 | NA | 0 | 6 | 2,188 | 991 | 877 | 814 | 634 | 922 | 659 | 1,416 | 2,994 |
11 | NA | 0 | 5 | 2,138 | 1,037 | 826 | 729 | 1,024 | 1,154 | 471 | 1,435 | 3,407 |
12 | NA | 0 | 10 | 1,892 | 1,597 | 827 | 472 | 1,504 | 710 | 399 | 1,209 | 2,478 |
13 | NA | 0 | 8 | 1,982 | 1,723 | 739 | 447 | 1,064 | 428 | 800 | 1,200 | 1,637 |
14 | NA | 0 | 7 | 2,453 | 1,775 | 344 | 915 | 1,337 | 434 | 987 | 1,347 | 1,650 |
15 | NA | 0 | 12 | 2,591 | 1,665 | 398 | 981 | 1,000 | 1,199 | 837 | 780 | 3,086 |
16 | NA | 0 | 27 | 2,198 | 1,201 | 810 | 952 | 615 | 957 | 944 | 835 | 3,678 |
17 | NA | 0 | 37 | 2,113 | 829 | 739 | 924 | 493 | 857 | 766 | 1,711 | 3,502 |
18 | NA | 0 | 60 | 1,940 | 1,228 | 691 | 857 | 1,227 | 889 | 491 | 1,915 | 2,966 |
19 | NA | 0 | 72 | 1,989 | 1,452 | 629 | 485 | 1,335 | 730 | 483 | 2,054 | 2,679 |
20 | NA | 0 | 106 | 2,244 | 1,499 | 556 | 562 | 1,105 | 267 | 925 | 1,960 | 1,732 |
21 | NA | 0 | 103 | 2,459 | 1,187 | 320 | 1,085 | 1,083 | 424 | 1,123 | 1,623 | 1,900 |
22 | 0 | 0 | 128 | 2,473 | 1,225 | 406 | 1,223 | 909 | 1,028 | 891 | 1,038 | 3,399 |
23 | 0 | 0 | 189 | 2,435 | 1,080 | 748 | 1,068 | 546 | 1,062 | 963 | 1,101 | 3,423 |
24 | 0 | 0 | 241 | 2,178 | 640 | 737 | 1,104 | 476 | 898 | 954 | 2,131 | 2,896 |
25 | 0 | 0 | 336 | 1,724 | 598 | 538 | 932 | 1,223 | 941 | 461 | 2,260 | 1,458 |
26 | 0 | 0 | 416 | 1,413 | 678 | 628 | 531 | 1,163 | 769 | 533 | 1,387 | 1,891 |
27 | 0 | 0 | 519 | 1,495 | 1,453 | 507 | 1,127 | 1,113 | 315 | 972 | 1,554 | 1,437 |
28 | 0 | 0 | 673 | 2,228 | 1,093 | 316 | 1,331 | 975 | 356 | 1,033 | 1,366 | 2,002 |
29 | 0 | 1 | 592 | 2,398 | 1,126 | 386 | 1,408 | 916 | 863 | 1,004 | 1,042 | 3,619 |
30 | 0 | NA | 711 | 2,181 | 954 | 555 | 1,236 | 441 | 954 | 1,057 | 1,344 | 3,792 |
31 | 0 | NA | 1,082 | NA | 618 | NA | 1,237 | 542 | NA | 925 | NA | 3,428 |
table2 %>% group_by(Month) %>% summarize(Mean = mean(cases),
SD = sd(cases), Variance = var(cases), Days = n(), Minimum = min(cases),
Maximum = max(cases)) %>% sumPiv() %>% subset(select = c(Statistics,
January, February, March, April, May, June, July, August,
September, October, November, December)) %>% gt() %>% fmt_number(columns = 2:13,
sep_mark = ",", decimals = 2) %>% tab_header(title = "2020 US Covid Deaths Monthly Statistics")
2020 US Covid Deaths Monthly Statistics | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Statistics | January | February | March | April | May | June | July | August | September | October | November | December |
Mean | 0.00 | 0.03 | 172.94 | 2,027.50 | 1,340.19 | 657.93 | 855.42 | 955.74 | 780.00 | 790.42 | 1,310.03 | 2,618.71 |
SD | 0.00 | 0.19 | 273.94 | 378.47 | 453.49 | 210.33 | 310.42 | 329.72 | 290.24 | 224.16 | 462.40 | 761.54 |
Variance | 0.00 | 0.03 | 75,041.60 | 143,238.33 | 205,657.09 | 44,237.93 | 96,358.58 | 108,717.20 | 84,237.52 | 50,248.58 | 213,818.31 | 579,948.61 |
Days | 10.00 | 29.00 | 31.00 | 30.00 | 31.00 | 30.00 | 31.00 | 31.00 | 30.00 | 31.00 | 30.00 | 31.00 |
Minimum | 0.00 | 0.00 | 0.00 | 1,274.00 | 598.00 | 316.00 | 315.00 | 438.00 | 267.00 | 374.00 | 507.00 | 1,363.00 |
Maximum | 0.00 | 1.00 | 1,082.00 | 2,591.00 | 2,321.00 | 1,007.00 | 1,408.00 | 1,504.00 | 1,199.00 | 1,123.00 | 2,260.00 | 3,792.00 |
# Analyzing Vaccines
vaxData <- data %>% filter(country == "US") %>% distinct(date.y,
.keep_all = TRUE) %>% select(date.y, people_fully_vaccinated,
population) %>% filter(str_detect(date.y, "^2021(.+)")) %>%
separate(col = "date.y", sep = "-", into = c("Year", "Month",
"Day")) %>% subset(select = -Year) %>% fixMonth() %>%
group_by(Month) %>% mutate(VaxPercentage = people_fully_vaccinated/population)
tableV <- vaxData %>% subset(select = -c(people_fully_vaccinated,
population)) %>% pivot_wider(names_from = Month, values_from = VaxPercentage)
tableV %>% gt() %>% fmt_number(columns = 2:11, decimals = 4) %>%
tab_header(title = "2021 US Covid-19 Percentage of Population Vaccinated")
2021 US Covid-19 Percentage of Population Vaccinated | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Day | January | February | March | April | May | June | July | August | September | October |
01 | 0.0000 | 0.0180 | 0.0773 | 0.1702 | 0.3139 | 0.4124 | 0.4731 | 0.5001 | 0.5299 | 0.5611 |
02 | 0.0000 | 0.0184 | 0.0794 | 0.1760 | 0.3180 | 0.4133 | 0.4743 | 0.5006 | 0.5311 | 0.5620 |
03 | 0.0000 | 0.0195 | 0.0818 | 0.1817 | 0.3203 | 0.4147 | 0.4765 | 0.5011 | 0.5328 | 0.5630 |
04 | 0.0000 | 0.0210 | 0.0844 | 0.1864 | 0.3222 | 0.4172 | 0.4775 | 0.5018 | 0.5341 | 0.5639 |
05 | 0.0000 | 0.0228 | 0.0871 | 0.1894 | 0.3258 | 0.4192 | 0.4775 | 0.5027 | 0.5341 | 0.5647 |
06 | 0.0000 | 0.0252 | 0.0904 | 0.1913 | 0.3306 | 0.4218 | 0.4785 | 0.5036 | 0.5341 | 0.5657 |
07 | 0.0000 | 0.0278 | 0.0931 | 0.1955 | 0.3365 | 0.4242 | 0.4793 | 0.5045 | 0.5362 | 0.5664 |
08 | 0.0000 | 0.0289 | 0.0956 | 0.2009 | 0.3418 | 0.4263 | 0.4804 | 0.5053 | 0.5376 | 0.5673 |
09 | 0.0000 | 0.0299 | 0.0974 | 0.2070 | 0.3468 | 0.4279 | 0.4815 | 0.5058 | 0.5385 | 0.5682 |
10 | 0.0000 | 0.0318 | 0.0999 | 0.2146 | 0.3507 | 0.4297 | 0.4825 | 0.5065 | 0.5400 | 0.5682 |
11 | 0.0000 | 0.0340 | 0.1028 | 0.2205 | 0.3538 | 0.4313 | 0.4834 | 0.5072 | 0.5413 | 0.5682 |
12 | 0.0000 | 0.0367 | 0.1062 | 0.2248 | 0.3571 | 0.4344 | 0.4841 | 0.5080 | 0.5424 | 0.5698 |
13 | 0.0000 | 0.0397 | 0.1121 | 0.2286 | 0.3612 | 0.4368 | 0.4846 | 0.5090 | 0.5433 | 0.5704 |
14 | 0.0041 | 0.0427 | 0.1137 | 0.2327 | 0.3650 | 0.4399 | 0.4860 | 0.5102 | 0.5442 | NA |
15 | 0.0049 | 0.0427 | 0.1164 | 0.2383 | 0.3696 | 0.4424 | 0.4869 | 0.5110 | 0.5454 | NA |
16 | 0.0049 | 0.0456 | 0.1185 | 0.2447 | 0.3742 | 0.4445 | 0.4877 | 0.5120 | 0.5466 | NA |
17 | 0.0049 | 0.0470 | 0.1214 | 0.2503 | 0.3758 | 0.4485 | 0.4887 | 0.5126 | 0.5481 | NA |
18 | 0.0049 | 0.0491 | 0.1244 | 0.2558 | 0.3777 | 0.4506 | 0.4894 | 0.5135 | 0.5495 | NA |
19 | 0.0061 | 0.0517 | 0.1273 | 0.2591 | 0.3808 | 0.4526 | 0.4901 | 0.5148 | 0.5505 | NA |
20 | 0.0066 | 0.0543 | 0.1306 | 0.2617 | 0.3843 | 0.4543 | 0.4906 | 0.5160 | 0.5516 | NA |
21 | 0.0073 | 0.0573 | 0.1340 | 0.2659 | 0.3878 | 0.4554 | 0.4914 | 0.5172 | 0.5524 | NA |
22 | 0.0084 | 0.0590 | 0.1363 | 0.2709 | 0.3916 | 0.4566 | 0.4922 | 0.5185 | 0.5536 | NA |
23 | 0.0092 | 0.0603 | 0.1382 | 0.2767 | 0.3946 | 0.4577 | 0.4930 | 0.5193 | 0.5542 | NA |
24 | 0.0098 | 0.0625 | 0.1407 | 0.2825 | 0.3964 | 0.4591 | 0.4939 | 0.5201 | 0.5553 | NA |
25 | 0.0102 | 0.0654 | 0.1439 | 0.2877 | 0.3979 | 0.4602 | 0.4948 | 0.5214 | 0.5565 | NA |
26 | 0.0106 | 0.0686 | 0.1478 | 0.2910 | 0.4002 | 0.4619 | 0.4953 | 0.5226 | 0.5575 | NA |
27 | 0.0115 | 0.0719 | 0.1522 | 0.2936 | 0.4030 | 0.4645 | 0.4957 | 0.5240 | 0.5581 | NA |
28 | 0.0129 | 0.0752 | 0.1566 | 0.2976 | 0.4053 | 0.4667 | 0.4965 | 0.5254 | 0.5623 | NA |
29 | 0.0145 | NA | 0.1597 | 0.3025 | 0.4080 | 0.4680 | 0.4974 | 0.5267 | 0.5595 | NA |
30 | 0.0160 | NA | 0.1622 | 0.3078 | 0.4080 | 0.4701 | 0.4983 | 0.5276 | 0.5603 | NA |
31 | 0.0172 | NA | 0.1657 | NA | 0.4100 | NA | 0.4991 | 0.5285 | NA | NA |
vaxData %>% group_by(Month) %>% summarize(Mean = mean(VaxPercentage),
SD = sd(VaxPercentage), Variance = var(VaxPercentage), Days = n(),
Minimum = min(VaxPercentage), Maximum = max(VaxPercentage)) %>%
sumPiv() %>% subset(select = c(Statistics, January, February,
March, April, May, June, July, August, September, October)) %>%
gt() %>% fmt_number(columns = 2:11, sep_mark = ",", decimals = 2) %>%
tab_header(title = "2021 Percent of US Population Vaccinated Statistics")
2021 Percent of US Population Vaccinated Statistics | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
Statistics | January | February | March | April | May | June | July | August | September | October |
Mean | 0.01 | 0.04 | 0.12 | 0.24 | 0.37 | 0.44 | 0.49 | 0.51 | 0.55 | 0.57 |
SD | 0.01 | 0.02 | 0.03 | 0.04 | 0.03 | 0.02 | 0.01 | 0.01 | 0.01 | 0.00 |
Variance | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Days | 31.00 | 28.00 | 31.00 | 30.00 | 31.00 | 30.00 | 31.00 | 31.00 | 30.00 | 13.00 |
Minimum | 0.00 | 0.02 | 0.08 | 0.17 | 0.31 | 0.41 | 0.47 | 0.50 | 0.53 | 0.56 |
Maximum | 0.02 | 0.08 | 0.17 | 0.31 | 0.41 | 0.47 | 0.50 | 0.53 | 0.56 | 0.57 |
# Table of Top 10 Countries with COVID-19 Cases
data %>% filter(type == "confirmed") %>% group_by(country) %>%
summarize(total = sum(cases)) %>% arrange(-total) %>% slice(1:10) %>%
gt() %>% cols_label(total = "Total Cases", country = "Country") %>%
fmt_number(columns = 2, sep_mark = ",", decimals = 0) %>%
tab_header(title = "Top 10 Countries with COVID-19 Cases")
Top 10 Countries with COVID-19 Cases | |
---|---|
Country | Total Cases |
India | 203,682,110,510 |
Brazil | 96,715,615,622 |
Colombia | 27,042,690,360 |
United Kingdom | 17,912,038,905 |
Argentina | 17,323,331,064 |
Italy | 16,790,179,329 |
Indonesia | 13,458,957,326 |
US | 13,315,876,700 |
Ukraine | 8,722,667,184 |
Poland | 7,788,652,900 |
# Categorical variable with type of case
data %>% filter(country == c("US", "India", "Brazil")) %>% select(type,
cases, country) %>% group_by(country, type) %>% summarize(total = sum(cases))
## # A tibble: 9 x 3
## # Groups: country [3]
## country type total
## <chr> <chr> <dbl>
## 1 Brazil confirmed 32238678261
## 2 Brazil death 897947197
## 3 Brazil recovery 11635260
## 4 India confirmed 67894020791
## 5 India death 900912364
## 6 India recovery 20589877
## 7 US confirmed 4438293986
## 8 US death 71477928
## 9 US recovery 2145291
data %>% select(type, cases) %>% mutate(cases = abs(cases)) %>%
group_by(type) %>% summarize(total = sum(cases)) %>% gt() %>%
fmt_number(columns = 2, sep_mark = ",", decimals = 0) %>%
tab_header(title = "Global Total of Cases")
Global Total of Cases | |
---|---|
type | total |
confirmed | 473,816,203,792 |
death | 9,287,821,994 |
recovery | 754,925,335,092 |
My data has 9 variables: country, date recorded for vaccine, cumulative sum of vaccinated people, date recorded for case number, number of COVID-19 cases on that date, type of case, latitude, longitude, and population. Because it did not make sense to summarize all variables(such as dates) and because my data set is so large, I decided to narrow my data set to focus on the US for the numeric variables and summarized death cases for 2020 and percentage of population vaccinated in 2021. Additionally, for the categorical variables, I put the counts for the top ten countries with COVID-19 cases and the counts for type of cases- looking at the US and the top two countries on the previous table.
I thought that the top ten countries for COVID-19 cases was interesting. I expected more European countries since they are densely populated. I was also shocked to see India and Brazil at the top by a major gap. Another piece of data I found was that only 57% of the US population is vaccinated since October which is disappointing since the vaccine is more accessible in the US compared to other countries.
deathsD <- data %>% select(country, type, cases, date.x) %>%
filter(type == "death") %>% group_by(country) %>% summarize(Total = sum(cases))
casesD <- data %>% select(country, cases) %>% group_by(country) %>%
summarize(Total = sum(cases))
vaccinesD <- data %>% select(country, people_fully_vaccinated,
population) %>% group_by(country) %>% summarize(VaxPercent = max(people_fully_vaccinated/population),
Population = max(population)) %>% na.omit()
data2 <- inner_join(casesD, vaccinesD, by = "country") %>% na.omit()
data20 <- data2 %>% mutate(CasesPercent = (Total/Population)) %>%
subset(select = -c(Total, Population))
ggplot(data20, aes(x = VaxPercent, y = CasesPercent)) + geom_point(aes(color = country)) +
geom_smooth(method = "lm") + labs(title = "Percent of Population Vaccinated Vs Covid-19 Cases per Population",
x = "Percentage of Population Vaccinated ", y = "COVID-19 Cases per Population") +
scale_y_continuous(breaks = seq(0, 100, 10)) + theme_light() +
theme(legend.position = "none")
As COVID-19 cases per population increases, the country has a higher percentage of people vaccinated. This means a country is more likely to get vaccinated if they have more COVID-19 cases. There are also some countries along the x-axis meaning that they have a low amount of cases per population but a high vaccination rate; fortunately, there are none vice-versa. I was curious to see where the US lies which leads to me to plot 2. There is also a cluster of data around the origin which have a low amount of vaccinated people and COVID-19 cases which makes me investigate in plot 3.
dataP <- data %>% filter(country == "US") %>% select(date.x,
cases, date.y, people_fully_vaccinated)
dataP2 <- dataP %>% filter(cases > 0)
ggplot(dataP, aes(x = date.y, y = people_fully_vaccinated)) +
geom_point() + geom_line(color = "red") + scale_y_continuous(labels = scales::number) +
labs(title = "2021 US Population Vaccinated", x = "Month",
y = "Total Number of People Vaccinated") + theme_gray()
ggplot(dataP2, aes(x = date.x)) + geom_bar(aes(y = cases), stat = "summary",
fun = mean) + geom_line(aes(y = cases), stat = "summary",
fun = mean) + scale_y_continuous(breaks = seq(0, 150000,
25000)) + theme_bw() + labs(title = "US COVID-19 Cases",
x = "Date", y = "Number of Cases")
After seeing the global relationship of vaccination rate and COVID-19 cases, I wanted to specifically examine the United States. The graph of US COVID-19 cases display that there were major spikes around the inital March 2020 outbreak, beginning of school in August 2020 and 2021, holiday season around January 2021, and summer break of 2021. It is interesting to see that even after vaccines were released, there was the lowest amount of cases followed by a major spike in cases in fall 2021. It looks like we are currently in a downward trend, but I would not be surprised to see if we go up again based on past data for holiday season.
Additionally, I made another plot to depict the total number of people vaccinated over the course of 2021. I was interested to see the S-curve and how the country is slowing in vaccinating people. The age limit on COVID-19 vaccines is lowering, so I hope people continue getting vaccinated and achieve herd immunity.
data3 <- data %>% filter(type == "confirmed") %>% group_by(country) %>%
summarize(Longitude = max(long), Latitude = max(lat), Population = max(population),
Vax = max(people_fully_vaccinated)) %>% na.omit()
world <- map_data("world")
ggplot(data3) + geom_polygon(aes(long, lat, group = group), data = world,
fill = "black") + geom_point(aes(Longitude, Latitude, color = Vax/Population)) +
coord_map() + scale_color_gradient(low = "yellow", high = "green",
name = "Percentage of Population \n Vaccinated") + theme_dark() +
labs(title = "Vaccination Percentages of Countries Around the World",
x = "Longitude", y = "Latitude")
After the first plot, I wanted to see if location played a role in vaccination status. This plot depicts that European countries are high in percentage of population vaccinated while African countries are low in contrast. The US is also in the middle for percentage compared to other countries which is surprising since we were one of the first who developed a vaccine but not surprising considering the current vaccine-hesitancy. There are also islands that have high vaccinated percentages which is probably due to their low population.
In conclusion, I still believe the United States has a long way to go in order to recover from the pandemic. We are in top 10 for COVID-19 cases yet we are lagging behind in percentage of vaccinated population. In comparison with other first-world countries, we are definitely behind and more needs to be done to encourage vaccination. Based on the cases data for the United States, I predict another wave of cases as we hit holiday season. In future studies, I believe more research can be done on how to flatten the bell curve for cases and this data could be used to identify countries that have more success as well as predict human behavior in spreading disease.