class: center, middle, inverse, title-slide # Tidyverse --- ## Last week - Data structures + Vectors + Lists + Dataframes -- - Functions + What is a function? + How to use a function --- ## This week - Tidyverse -- - `%>%` -- - Loading data -- - Cleaning data --- ## Tidyverse -- - A set of packages designed to make data cleaning, analysis and visualisation easier -- - The 'core' Tidyverse packages are: + `readr` (reading data) + `tidyr` (tidying or reshaping data) + `dplyr` (data manipulation and summarisation) + `ggplot2` (plotting and graphics) + `purrr` (functional programming tools) + `tibble` (a re-imagining of the data frame) + `stringr` (working with character strings) + `forcats` (working with factors) -- - In this course, we're just going to look at `readr`, `dplyr` and `ggplot2` --- ## Tidy data - Central to the Tidyverse is a common philosophy, built on the idea of tidy data -- - That is essentially a way of storing data such that each observation has its own row, and each variable has its own column --- ## Tidy data example Tidy: ``` ## # A tibble: 2 x 2 ## Country Population ## <chr> <dbl> ## 1 UK 66650000 ## 2 US 328200000 ``` Untidy: ``` ## # A tibble: 1 x 2 ## UK US ## <dbl> <dbl> ## 1 66650000 328200000 ``` --- ## Tidyverse and `%>%` - Another integral part of the Tidyverse philosophy is the pipe (`%>%`) -- - The pipe takes the result from the function on its left hand side, and passes it as the first parameter to the function on the right hand side (or whatever you specify with a full stop (`.`)) -- Example: ```r library(dplyr) ``` ``` ## ## Attaching package: 'dplyr' ``` ``` ## The following objects are masked from 'package:stats': ## ## filter, lag ``` ``` ## The following objects are masked from 'package:base': ## ## intersect, setdiff, setequal, union ``` ```r "hello world" %>% substr(1,5) %>% paste("The world is saying", .) ``` ``` ## [1] "The world is saying hello" ``` --- ## Tidyverse and `%>%` - The pipe allows continuous lines of logic that follow from left to right. Without using the pipe, our previous example would look like this: ```r paste("The world is saying", substr("hello world", 1, 5)) ``` ``` ## [1] "The world is saying hello" ``` -- - This is much harder to read because we have to read from inside to out (i.e. `"hello world"` to `substr()` then to `paste()`) -- - The pipe fits nicely into a data analysis workflow: ```r load() %>% clean() %>% summarise() %>% plot() ``` --- ## Data Loading - Time to load our dataset -- - In RStudio, install and load the `readr` package (i.e. `install.packages("readr")` then `library(readr)`) -- - Go to the raw csv file [here](https://raw.githubusercontent.com/ARawles/teacheR-extras/master/data/C19.csv), right click and press "Save as..." - Save it somewhere easily accessible --- ## Data Loading - Use the `readr::read_csv()` function to load the dataset into your R environment -- - Your command will look something like this: ```r library(readr) c19 <- read_csv("path_to_your_file") ``` ``` ## ## ── Column specification ──────────────────────────────────────────────────────── ## cols( ## cdc_report_dt = col_character(), ## pos_spec_dt = col_character(), ## onset_dt = col_character(), ## current_status = col_character(), ## sex = col_character(), ## age_group = col_character(), ## ethnicity = col_character(), ## hosp_yn = col_logical(), ## icu_yn = col_logical(), ## death_yn = col_logical(), ## medcond_yn = col_logical() ## ) ``` --- ## Data cleaning - Now we've got our data in R, we need to clean it up a bit -- - Install and load the `dplyr` package (i.e. `install.packages("dplyr")` then `library(dplyr)`) -- - Looking at our dataset, we can see that the columns that should be dates are currently character strings. Let's fix that -- - To create new columns or to overwrite existing columns, we use the `mutate` function from the `dplyr` package --- ## Data cleaning ```r library(dplyr) c19 <- c19 %>% mutate(cdc_report_dt = as.Date(cdc_report_dt, format = "%d/%m/%Y")) ``` -- - Using this as a base, do the same for the `pos_spec_dt` and `onset_dt` columns - Hint: You can alter multiple columns in a single `mutate` call -- ```r c19 <- c19 %>% mutate(pos_spec_dt = as.Date(pos_spec_dt, format = "%d/%m/%Y"), onset_dt = as.Date(onset_dt, format = "%d/%m/%Y")) ``` --- ## Data cleaning (part 2) -- - Convert the `sex` and `ethnicity` columns to factors using the `mutate` function as above -- ```r c19 <- c19 %>% mutate(sex = as.factor(sex), ethnicity = as.factor(ethnicity)) ``` --- ## Data cleaning (alternative approach) -- - We can actually use the `readr::read_csv()` function and the `col_types` argument to do this conversion in one step. -- - When we first loaded in our dataset, `readr` told us what column types we were being given. - To override those, we just need to provide a `cols()` specification to the `read_csv()` function -- - The `cols()` function requires us to specify each column type using the appropriate `col_[type]()` function. --- ## Data cleaning (alternative approach) - All together, using this approach would look like this: ```r c19 <- read_csv("path_to_your_file", col_types = cols(col_date(format = "%d/%m/%Y"), col_date(format = "%d/%m/%Y"), col_date(format = "%d/%m/%Y"), col_character(), col_factor(), col_character(), col_factor(), col_logical(), col_logical(), col_logical(), col_logical())) ``` --- ## Recap - Today we looked at: + Tidyverse and tidy data + The pipe and why it's used + How to load data in using the `readr::read_csv()` function + How to overwrite columns using the `mutate()` function + How to specify column types when importing data with `readr::read_csv()`