class: center, middle, inverse, title-slide # Data Analysis (Part 2) --- ## Last week - Tidyverse -- - `%>%` -- - Loading data -- - Cleaning data --- ## This week - Summarising data with `dplyr` -- - Plotting with `ggplot2` --- ## Setup - If you haven't already, install `ggplot2` and `dplyr` using the `install.packages()` function - Add the `library(dplyr)` and `library(ggplot2)` lines to the top of your script (if they aren't already there) - Make sure you've got your C19 dataset loaded -- - Filter and sort your dataset by date and by age group with: ```r c19 <- c19 %>% filter(cdc_report_dt >= as.Date("2020/03/01")) %>% arrange(desc(cdc_report_dt), age_group) ``` --- ## Summarising data - Our dataset currently has nearly 33,000 rows -- - This is a lot of data, and so extracting meaningful conclusions from this level can be difficult -- - Instead of analysing data at this level, you might want to aggregate it up - Then you can always investigate further at a more granular level in the future but this will point you in the right direction --- ## Summarising data - Using our C19 data, we're going to aggregate up to a daily level and for each age group, so we'll have a value for each combination of date and age group* -- - \* Unless there are no rows for a date and age group (e.g. there are no entries for "20-29 Years" on Feb 2nd) -- - This will take us down from ~33,000 rows to ~1,000 -- - In `dplyr`, we use the `group_by()` and `summarise()` functions to aggregate our data up -- - If you used SQL before, it's a similar approach --- ## `group_by` & `summarise()` - Before we use the `summarise()` function, we need to tell R what we want variable we want to group by. -- - This will then create a value for each unique value of that variable(s) that we grouped by -- - So in our example, we would have a summarised value for each date and age group if we grouped by the `cdc_report_dt` and `age_group` columns --- ## `group_by` - So we know that we want to group our data by the `cdc_report_dt` and `age_group` -- - To do so, we use the `group_by()` function like so: ```r c19_grouped <- c19 %>% group_by(cdc_report_dt, age_group) head(c19_grouped, 5) ``` ``` ## # A tibble: 5 x 11 ## # Groups: cdc_report_dt, age_group [1] ## cdc_report_dt pos_spec_dt onset_dt current_status sex age_group ethnicity ## <date> <date> <date> <chr> <fct> <chr> <fct> ## 1 2020-10-16 2020-09-30 2020-09-25 Laboratory-co… Male 0 - 9 Ye… White, N… ## 2 2020-10-16 2020-09-30 2020-09-30 Laboratory-co… Male 0 - 9 Ye… White, N… ## 3 2020-10-16 2020-10-01 2020-09-30 Laboratory-co… Male 0 - 9 Ye… White, N… ## 4 2020-10-16 2020-10-01 2020-10-05 Laboratory-co… Fema… 0 - 9 Ye… White, N… ## 5 2020-10-16 2020-10-02 2020-10-02 Laboratory-co… Male 0 - 9 Ye… White, N… ## # … with 4 more variables: hosp_yn <lgl>, icu_yn <lgl>, death_yn <lgl>, ## # medcond_yn <lgl> ``` -- - Now when we view our dataset, it looks like the same but we can see that it is grouped: `Groups: cdc_report_dt, age_group [3]` --- ## `summarise()` - The data still looks the same though because we haven't done our summarisation yet, we've just told R what we want to group by -- - To summarise our data, we use the `summarise()` function a bit like we used the `mutate()` function: ```r c19_grouped <- c19 %>% group_by(cdc_report_dt, age_group) %>% summarise(number_of_cases = n()) %>% ungroup() ``` ``` ## `summarise()` regrouping output by 'cdc_report_dt' (override with `.groups` argument) ``` ```r head(c19_grouped, 5) ``` ``` ## # A tibble: 5 x 3 ## cdc_report_dt age_group number_of_cases ## <date> <chr> <int> ## 1 2020-03-01 60 - 69 Years 4 ## 2 2020-03-01 70 - 79 Years 2 ## 3 2020-03-02 60 - 69 Years 1 ## 4 2020-03-02 70 - 79 Years 1 ## 5 2020-03-03 60 - 69 Years 2 ``` --- ## `summarise()` - In the `summarise()` function call, we say what we want our new column's name to be (number_of_cases here) and then we need to say what kind of summary function we want to apply -- - In this case, we've just counted the number of entries for that date and age group (that's what the `n()` function does) -- - But we could use any other summary function - For example, if we wanted to find the percentage of cases that resulted in death, we could use the `mean()` function on the `death_yn` column*: ```r c19_example <- c19 %>% group_by(cdc_report_dt, age_group) %>% summarise(mortality_rate = mean(death_yn) * 100) %>% ungroup() ``` ``` ## `summarise()` regrouping output by 'cdc_report_dt' (override with `.groups` argument) ``` ```r head(c19_example, 5) ``` ``` ## # A tibble: 5 x 3 ## cdc_report_dt age_group mortality_rate ## <date> <chr> <dbl> ## 1 2020-03-01 60 - 69 Years 0 ## 2 2020-03-01 70 - 79 Years 0 ## 3 2020-03-02 60 - 69 Years 0 ## 4 2020-03-02 70 - 79 Years 100 ## 5 2020-03-03 60 - 69 Years 50 ``` -- - \*Because logical values `TRUE`/`FALSE` are equivalent to numeric `1` and `0`, we can use numeric functions on logical value --- ## Plotting - Now that we have our summarised dataset, we want to actually see what's going on -- - The easiest way to do this would be to create a line plot the shows the number of cases by each age group and date - This is quite a common graphic that's displayed during the Government's press briefings -- - To create our plot, we're going to use the `ggplot2` package --- ## `ggplot2` - The `ggplot2` package is a plotting library based on The Grammer of Graphics -- - This is basically the idea that graphics are a bit like language in that they have a grammer that should be followed to make them understandable -- - As part of these grammatical rules, we build plots in `ggplot2` in stages. You define your dataset, then what geom you want (lines, points, etc.), what you want to group the lines by, then what you want the scales to look like, etc. -- - But rather than using the pipe, we use the `+` operator in `ggplot2` -- - `ggplot2` is a very extensive and useful library, and we're really only going to brush the surface today --- ## `ggplot2` - To start of with, we need to define what dataset we want to plot, what we want on each axis and what we want to group by - We do this using the `ggplot()` function to define our dataset, alongside the `aes()` function to define our axes: ```r #library(ggplot2) # Make sure ggplot2 is loaded! ggplot(data = c19_grouped, aes(x = cdc_report_dt, y = number_of_cases)) ``` <img src="./data_analysis_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> --- ## `ggplot2` - You'll notice that our plot is blank though -- - This because we need to define what type of geom we want to use (e.g. bars, lines, points, etc.) -- - In our case, we want to use a line, so we add the `geom_line()` function: ```r ggplot(data = c19_grouped, aes(x = cdc_report_dt, y = number_of_cases)) + geom_line() ``` <img src="./data_analysis_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- ## `ggplot2` - Better, but that's not really what we wanted... -- - We're seeing these weird squigly lines because our dataset has multiple values for each date (we've got a value for each age group) -- - So what we need to do is have separate lines for each of our age groups, and perhaps we should have those as different colours --- ## `ggplot2` - To do that, we need to change our original `aes()` call: ```r ggplot(data = c19_grouped, aes(x = cdc_report_dt, y = number_of_cases, colour = age_group)) + geom_line() ``` <img src="./data_analysis_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" /> --- ## `ggplot2` - Much better -- - But things are still messy... - For instance, our axes labels are our variable names which don't look good. And also we probably don't want the "Unknown" and "NA" groups --- ## `ggplot2` - Changing labels - To change our scales (`age_group` also counts as a scale), we use the `scale_[axis]_[type]` functions. For example, if we want to change the x axis and the x variable is a date, we'd use the `scale_x_date()` function: ```r ggplot(data = c19_grouped, aes(x = cdc_report_dt, y = number_of_cases,colour = age_group)) + geom_line() + scale_x_date(name = "CDC Report Date") + scale_y_continuous(name = "Number of Cases") + scale_color_discrete(name = "Age Group") ``` <img src="./data_analysis_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- ## Exercises - Remove the "Unknown" and "NA" groups from your dataset using the `filter()` function and recreate your plot - Change the date breaks so that there is a label every two months using the `date_breaks` parameter of the `scale_x_date()` function (use `?scale_x_date()` to find out how to use it) - Change your dataset and plot so that you're plotting the mortality rate instead of the number of cases