class: center, middle, hide-logo <style type="text/css"> pre { background: #F8F8F8; max-width: 100%; overflow-x: scroll; } </style> <style type="text/css"> .center2 { margin: 0; position: absolute; top: 50%; left: 50%; -ms-transform: translate(-50%, -50%); transform: translate(-50%, -50%); } </style> <style type="text/css"> .scroll-output { height: 80%; overflow-y: scroll; } </style> # Introduction to the Tidyverse (Part 1) ## by <img src="GraphicsSlides/Logo RUG hell.png" width="50%" /> ##### Presenter: Ruben Ernst ##### Author: Mathias Steilen ##### Last updated: _2022-10-15 11:46:28_ --- class: center, middle <img src="GraphicsSlides/components.png" width="80%" /> _[(Source)](https://hbctraining.github.io/Intro-to-R/lessons/08_intro_tidyverse.html)_ --- ### Goals for this session - Touch a little bit of everything by working live on a data set - Know where to look for an answer if something is not immediately clear - Important takeaway: Mastering the packages in the `tidyverse` alone takes years, it's completely normal to spend a large amount of time on stackoverflow looking for answers .center[ <img src="GraphicsSlides/stackoverflow meme.png" width="60%" /> ] --- ### Let's get settled in Installing and loading the `tidyverse` which in turn loads all packages contained in it: ```r install.packages("tidyverse") ``` ```r library(tidyverse) ``` Let's get the data and touch one of the first components of the `tidyverse` at the same time: ```r install.packages("palmerpenguins") ``` ```r penguins <- palmerpenguins::penguins ``` --- ### The Pipe: `%>%` The pipe from the package `magrittr` can be translated into **"take the thing from before and pipe it (put it) into what's coming after"**. ```r penguins %>% summary() ``` ``` ## species island bill_length_mm bill_depth_mm ## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10 ## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60 ## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30 ## Mean :43.92 Mean :17.15 ## 3rd Qu.:48.50 3rd Qu.:18.70 ## Max. :59.60 Max. :21.50 ## NA's :2 NA's :2 ## flipper_length_mm body_mass_g sex year ## Min. :172.0 Min. :2700 female:165 Min. :2007 ## 1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007 ## Median :197.0 Median :4050 NA's : 11 Median :2008 ## Mean :200.9 Mean :4202 Mean :2008 ## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009 ## Max. :231.0 Max. :6300 Max. :2009 ## NA's :2 NA's :2 ``` --- ### The Pipe: `%>%` When working with many variables: ```r penguins %>% glimpse() ``` ``` ## Rows: 344 ## Columns: 8 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel… ## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, … ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, … ## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186… ## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, … ## $ sex <fct> male, female, female, NA, female, male, female, male… ## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007… ``` --- ### The Pipe's Major Advantage Is Readability The pipe can make nested functions look much nicer and **easier to read**. The code below shows how to calculate the **mean body weight for the species _Adelie_ by island** without the pipe ([Source](https://cfss.uchicago.edu/notes/pipes/)): ```r summarize(group_by(filter(penguins, species == "Adelie"), island), body_mass = mean(body_mass_g, na.rm = TRUE)) ``` ``` ## # A tibble: 3 × 2 ## island body_mass ## <fct> <dbl> ## 1 Biscoe 3710. ## 2 Dream 3688. ## 3 Torgersen 3706. ``` --- ### The Pipe's Major Advantage Is Readability The same code using the pipe can be **written, executed, and read line by line**: ```r penguins %>% filter(species == "Adelie") %>% group_by(island) %>% summarize(body_mass = mean(body_mass_g, na.rm = TRUE)) ``` ``` ## # A tibble: 3 × 2 ## island body_mass ## <fct> <dbl> ## 1 Biscoe 3710. ## 2 Dream 3688. ## 3 Torgersen 3706. ``` This is **much** easier to read. --- ### So let's read it line-by-line ```r *penguins %>% filter(species == "Adelie") %>% group_by(island) %>% summarize(body_mass = mean(body_mass_g, na.rm = TRUE)) ``` ``` ## # A tibble: 3 × 2 ## island body_mass ## <fct> <dbl> ## 1 Biscoe 3710. ## 2 Dream 3688. ## 3 Torgersen 3706. ``` Line-By-Line: 1. **Take `penguins` and (pass it on)** --- ### So let's read it line-by-line ```r penguins %>% * filter(species == "Adelie") %>% group_by(island) %>% summarize(body_mass = mean(body_mass_g, na.rm = TRUE)) ``` ``` ## # A tibble: 3 × 2 ## island body_mass ## <fct> <dbl> ## 1 Biscoe 3710. ## 2 Dream 3688. ## 3 Torgersen 3706. ``` Line-By-Line: 1. Take `penguins` and (pass it on) 1. **Filter it so only species Adelie remains (and pass it on)** --- ### So let's read it line-by-line ```r penguins %>% filter(species == "Adelie") %>% * group_by(island) %>% summarize(body_mass = mean(body_mass_g, na.rm = TRUE)) ``` ``` ## # A tibble: 3 × 2 ## island body_mass ## <fct> <dbl> ## 1 Biscoe 3710. ## 2 Dream 3688. ## 3 Torgersen 3706. ``` Line-By-Line: 1. Take `penguins` and (pass it on) 1. Filter it so only species Adelie remains (and pass it on) 1. **Group the data by island (and pass it on)** --- ### So let's read it line-by-line ```r penguins %>% filter(species == "Adelie") %>% group_by(island) %>% * summarize(body_mass = mean(body_mass_g, na.rm = TRUE)) ``` ``` ## # A tibble: 3 × 2 ## island body_mass ## <fct> <dbl> ## 1 Biscoe 3710. ## 2 Dream 3688. ## 3 Torgersen 3706. ``` Line-By-Line: 1. Take `penguins` and (pass it on) 1. Filter it so only species Adelie remains (and pass it on) 1. Group the data by island (and pass it on) 1. **Create a summarised variable `body_mass` with `mean`** --- ### Let's deconstruct that: `filter()` The `filter` function will give you back what you ask it for. You can think about it as **including the observations matching your criterion**. ```r penguins %>% filter(species == "Adelie") ``` ``` ## # A tibble: 152 × 8 ## species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year ## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> ## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 ## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 ## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 ## 4 Adelie Torgersen NA NA NA NA <NA> 2007 ## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 ## # … with 147 more rows, and abbreviated variable names ¹flipper_length_mm, ## # ²body_mass_g ``` --- ### Let's deconstruct that: `filter()` `filter` replaces the native R equivalent... ```r penguins[penguins$species == "Adelie",] ``` ...and makes it part of pipe chain. ```r penguins %>% filter(species == "Adelie") ``` --- ### Let's deconstruct that: `filter()` Another neat trick: `%in%`. You can give the criterion more species. **It's your go**, please filter this data set. We **only** want to see the species: - Adelie - Gentoo Some guidance: ```r penguins %>% filter(species %in% c("Species 1 here", "Species 2 here")) ``` -- Solution: ```r penguins %>% filter(species %in% c("Adelie", "Gentoo")) ``` --- ### Let's deconstruct that: `filter()` How do we know which species are in the data set? Using the `count` function, which is really useful for categorical variables: ```r penguins %>% count(species) ``` ``` ## # A tibble: 3 × 2 ## species n ## <fct> <int> ## 1 Adelie 152 ## 2 Chinstrap 68 ## 3 Gentoo 124 ``` --- ### Let's deconstruct that: `filter()` How do we know which species are in the data set? You can also sort the results: ```r penguins %>% * count(species, sort = TRUE) ``` ``` ## # A tibble: 3 × 2 ## species n ## <fct> <int> ## 1 Adelie 152 ## 2 Gentoo 124 ## 3 Chinstrap 68 ``` --- ### Let's deconstruct that: `filter()` If you want to exclude something, you need to use the `NOT` operator: ```r TRUE ``` ``` ## [1] TRUE ``` ```r !TRUE ``` ``` ## [1] FALSE ``` --- ### Let's deconstruct that: `filter()` Just put it before your condition: ```r penguins %>% filter(!species %in% c("Chinstrap")) ``` Show everything that's **NOT** any of the values in the vector to the right. ``` ## # A tibble: 276 × 8 ## species island bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex year ## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> ## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007 ## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007 ## 3 Adelie Torgersen 40.3 18 195 3250 fema… 2007 ## 4 Adelie Torgersen NA NA NA NA <NA> 2007 ## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007 ## # … with 271 more rows, and abbreviated variable names ¹flipper_length_mm, ## # ²body_mass_g ``` --- ### Coming back to the example The goal is to calculate **mean body mass for the species Adelie by island**. ``` ## # A tibble: 3 × 2 ## island body_mass ## <fct> <dbl> ## 1 Biscoe 3710. ## 2 Dream 3688. ## 3 Torgersen 3706. ``` --- ### What's `group_by()`? It unlocks the ability to calculate a number for different categories: We have **three islands**. ```r penguins %>% count(island, sort = T) ``` ``` ## # A tibble: 3 × 2 ## island n ## <fct> <int> ## 1 Biscoe 168 ## 2 Dream 124 ## 3 Torgersen 52 ``` --- ### `group_by()` and `summarise()` Just summarising without grouping gives the **mean just for the species Adelie**. ```r penguins %>% filter(species == "Adelie") %>% summarize(body_mass = mean(body_mass_g, na.rm = TRUE)) ``` ``` ## # A tibble: 1 × 1 ## body_mass ## <dbl> ## 1 3701. ``` --- ### `group_by()` and `summarise()` Note that adding `na.rm = TRUE` is necessary to avoid running into errors when missing values are present. Observations with `NA` as values are discarded for the calculation, if `na.rm = TRUE` is specified. ```r penguins %>% filter(species == "Adelie") %>% summarize(body_mass = mean(body_mass_g)) ``` ``` ## # A tibble: 1 × 1 ## body_mass ## <dbl> ## 1 NA ``` --- ### `group_by()` and `summarise()` So let's add it again. Now comes the magic. By adding the `group_by` argument, the **value is calculated across all different elements in the group**. ```r penguins %>% filter(species == "Adelie") %>% group_by(island) %>% summarise(body_mass = mean(body_mass_g, na.rm = T)) ``` ``` ## # A tibble: 3 × 2 ## island body_mass ## <fct> <dbl> ## 1 Biscoe 3710. ## 2 Dream 3688. ## 3 Torgersen 3706. ``` `summarise` reduces the number of rows. --- ### `mutate()` Summarise **collapsed the observations in each group** for our calculations. If we want to **preserve the same number of observations** and add a new column containing a value for each existing value, we have to use **`mutate`**. ```r penguins %>% filter(species == "Adelie") %>% group_by(island) %>% summarize(body_mass = mean(body_mass_g, na.rm = T)) %>% * mutate(species = "Adelie") ``` ``` ## # A tibble: 3 × 3 ## island body_mass species ## <fct> <dbl> <chr> ## 1 Biscoe 3710. Adelie ## 2 Dream 3688. Adelie ## 3 Torgersen 3706. Adelie ``` --- ### It's your go Please calculate the mean bodyweight for each species. Instructions: 1. Group by the variable `species` 1. Use `summarise` to calculate a mean <br> .center[ <img src="GraphicsSlides/programmer.png" width="50%" /> ] --- ### Solution: ```r penguins %>% group_by(species) %>% summarise(body_mass = mean(body_mass_g, na.rm = T)) ``` ``` ## # A tibble: 3 × 2 ## species body_mass ## <fct> <dbl> ## 1 Adelie 3701. ## 2 Chinstrap 3733. ## 3 Gentoo 5076. ``` --- ### It's your go Please calculate the mean bodyweight for each species and each island (as a combination). Tip: You can **add more than one variable to the `group_by()` argument**. <br> .center[ <img src="GraphicsSlides/programmer.png" width="50%" /> ] --- ### Solution: ```r penguins %>% * group_by(island, species) %>% summarise(body_mass = mean(body_mass_g, na.rm = T)) ``` ``` ## # A tibble: 5 × 3 ## # Groups: island [3] ## island species body_mass ## <fct> <fct> <dbl> ## 1 Biscoe Adelie 3710. ## 2 Biscoe Gentoo 5076. ## 3 Dream Adelie 3688. ## 4 Dream Chinstrap 3733. ## 5 Torgersen Adelie 3706. ``` --- ### It's your go Please calculate the mean bodymass for each combination of species and sex. <br> .center[ <img src="GraphicsSlides/programmer.png" width="50%" /> ] --- ### Hold on .pull-left[ What is this? ```r penguins %>% group_by(species, sex) %>% summarise(body_mass = mean(body_mass_g, na.rm = T)) ``` ``` ## # A tibble: 8 × 3 ## # Groups: species [3] ## species sex body_mass ## <fct> <fct> <dbl> ## 1 Adelie female 3369. ## 2 Adelie male 4043. ## 3 Adelie <NA> 3540 ## 4 Chinstrap female 3527. ## 5 Chinstrap male 3939. ## 6 Gentoo female 4680. ## 7 Gentoo male 5485. ## 8 Gentoo <NA> 4588. ``` ] .center[.pull-right[ <img src="GraphicsSlides/afraid.png" width="90%" /> ]] --- ### Hold on .pull-left[ What is this? ```r penguins %>% group_by(species, sex) %>% summarise(body_mass = mean(body_mass_g, na.rm = T)) ``` ``` ## # A tibble: 8 × 3 ## # Groups: species [3] ## species sex body_mass ## <fct> <fct> <dbl> ## 1 Adelie female 3369. ## 2 Adelie male 4043. ## 3 Adelie <NA> 3540 ## 4 Chinstrap female 3527. ## 5 Chinstrap male 3939. ## 6 Gentoo female 4680. ## 7 Gentoo male 5485. ## 8 Gentoo <NA> 4588. ``` ] .pull-right[ **The NAs come from the `sex` variable.** ```r penguins %>% count(sex) ``` ``` ## # A tibble: 3 × 2 ## sex n ## <fct> <int> ## 1 female 165 ## 2 male 168 ## 3 <NA> 11 ``` ] --- ### `drop_na` and `select` To solve this task, we need to take care of the `NA` values in the columns we are working with. However, there is one problem: **If we drop all rows with NA values in the whole data frame, we will delete too many rows!** <img src="Intro-to-the-Tidyverse_files/figure-html/unnamed-chunk-45-1.png" width="100%" /> --- ### `drop_na` and `select` So we need to **select the columns we need first**: ```r penguins %>% select(species, sex, body_mass_g) ``` ``` ## # A tibble: 344 × 3 ## species sex body_mass_g ## <fct> <fct> <int> ## 1 Adelie male 3750 ## 2 Adelie female 3800 ## 3 Adelie female 3250 ## 4 Adelie <NA> NA ## 5 Adelie female 3450 ## 6 Adelie male 3650 ## 7 Adelie female 3625 ## 8 Adelie male 4675 ## 9 Adelie <NA> 3475 ## 10 Adelie <NA> 4250 ## # … with 334 more rows ``` --- ### `drop_na` and `select` **Then** drop. ```r penguins %>% select(species, sex, body_mass_g) %>% drop_na() ``` ``` ## # A tibble: 333 × 3 ## species sex body_mass_g ## <fct> <fct> <int> ## 1 Adelie male 3750 ## 2 Adelie female 3800 ## 3 Adelie female 3250 ## 4 Adelie female 3450 ## 5 Adelie male 3650 ## 6 Adelie female 3625 ## 7 Adelie male 4675 ## 8 Adelie female 3200 ## 9 Adelie male 3800 ## 10 Adelie male 4400 ## # … with 323 more rows ``` --- ### So let's try again Please calculate the mean for each species and each sex, dropping missing data in the necessary columns first (first select, then drop missing values, then summarise by groups). <br> .center[ <img src="GraphicsSlides/programmer.png" width="50%" /> ] --- ### Solution: ```r penguins %>% select(species, sex, body_mass_g) %>% drop_na() %>% group_by(species, sex) %>% summarise(body_mass = mean(body_mass_g, na.rm = T)) ``` ``` ## # A tibble: 6 × 3 ## # Groups: species [3] ## species sex body_mass ## <fct> <fct> <dbl> ## 1 Adelie female 3369. ## 2 Adelie male 4043. ## 3 Chinstrap female 3527. ## 4 Chinstrap male 3939. ## 5 Gentoo female 4680. ## 6 Gentoo male 5485. ``` --- ### Note: If you continue, ungroup: ```r penguins %>% select(species, sex, body_mass_g) %>% drop_na() %>% group_by(species, sex) %>% summarise(body_mass = mean(body_mass_g, na.rm = T)) %>% * ungroup() ``` ``` ## # A tibble: 6 × 3 ## species sex body_mass ## <fct> <fct> <dbl> ## 1 Adelie female 3369. ## 2 Adelie male 4043. ## 3 Chinstrap female 3527. ## 4 Chinstrap male 3939. ## 5 Gentoo female 4680. ## 6 Gentoo male 5485. ``` --- ### It's your go Please calculate the **mean bill length and depth for all the available years for all species**. <br> .center[ <img src="GraphicsSlides/programmer.png" width="50%" /> ] --- ### Solution: ```r penguins %>% select(species, year, bill_length_mm, bill_depth_mm) %>% group_by(species, year) %>% summarise(bill_length = mean(bill_length_mm, na.rm = T), bill_depth = mean(bill_depth_mm, na.rm = T)) ``` ``` ## # A tibble: 9 × 4 ## # Groups: species [3] ## species year bill_length bill_depth ## <fct> <int> <dbl> <dbl> ## 1 Adelie 2007 38.8 18.8 ## 2 Adelie 2008 38.6 18.2 ## 3 Adelie 2009 39.0 18.1 ## 4 Chinstrap 2007 48.7 18.5 ## 5 Chinstrap 2008 48.7 18.4 ## 6 Chinstrap 2009 49.1 18.3 ## 7 Gentoo 2007 47.0 14.7 ## 8 Gentoo 2008 46.9 14.9 ## 9 Gentoo 2009 48.5 15.3 ``` Now it would be really nice to be able to compare from left to right, over time. --- ### Enter: `pivot_wider()` ```r penguins %>% group_by(year, species) %>% summarise(mean_body_mass = mean(body_mass_g, na.rm = T)) %>% ungroup() %>% * pivot_wider( * names_from = year, # make this the column names * values_from = mean_body_mass # fill rows with these values * ) ``` ``` ## # A tibble: 3 × 4 ## species `2007` `2008` `2009` ## <fct> <dbl> <dbl> <dbl> ## 1 Adelie 3696. 3742 3665. ## 2 Chinstrap 3694. 3800 3725 ## 3 Gentoo 5071. 5020. 5141. ``` --- ### The opposite: `pivot_longer()` .scroll-output[ ```r penguins %>% group_by(year, species) %>% summarise(mean_body_mass = mean(body_mass_g, na.rm = T)) %>% ungroup() %>% pivot_wider(names_from = year, values_from = mean_body_mass) %>% * pivot_longer(- species, # Don't pivot this column * names_to = "year", # Column for column headers * values_to = "mean_body_mass") %>% # Column for values arrange(species) # sort descending: desc(species) ``` ``` ## # A tibble: 9 × 3 ## species year mean_body_mass ## <fct> <chr> <dbl> ## 1 Adelie 2007 3696. ## 2 Adelie 2008 3742 ## 3 Adelie 2009 3665. ## 4 Chinstrap 2007 3694. ## 5 Chinstrap 2008 3800 ## 6 Chinstrap 2009 3725 ## 7 Gentoo 2007 5071. ## 8 Gentoo 2008 5020. ## 9 Gentoo 2009 5141. ``` ] --- ### Now everything together We want to see the **percentage** of male or percentage of female penguins **for each species**. Here, pivoting wider will come in handy. Please **`count` the numbers of penguins for each sex and species**. Then, **pivot wider to have sex as the column headers**. Make sure to **drop missing values in the two variables**. Tip: No grouping is necessary for `count`, just pass the two variables to the function. With **`mutate`, you will be able to calculate the percentage**. Then use **`select` to remove unnecessary information**. (If you know it, you can use `transmute` as well.) <br> .center[ <img src="GraphicsSlides/programmer.png" width="30%" /> ] --- ### Solution: ```r penguins %>% select(sex, species) %>% drop_na() %>% count(sex, species) %>% pivot_wider(names_from = sex, values_from = n) %>% transmute(species, pct_male = male/(male + female)) ``` ``` ## # A tibble: 3 × 2 ## species pct_male ## <fct> <dbl> ## 1 Adelie 0.5 ## 2 Chinstrap 0.5 ## 3 Gentoo 0.513 ``` Note: Using **`transmute` does the same as first mutating and then selecting**. It is a `mutate` and a `select` in **one single step**. --- # That's it for today! There will be more advanced demos around the `tidyverse` which will enable you to work with larger data sets and more complicated operations. Make sure to stay updated on our socials and via our website, where all resources and dates are also published. **[Website](https://rusergroup-sg.ch/) | [Instagram](https://www.instagram.com/rusergroupstgallen/?hl=en) | [Twitter](https://twitter.com/rusergroupsg)** .center[ <img src="GraphicsSlides/palmer.png" width="70%" /> ] --- class: middle, inverse, hide-logo # Thank you for attending!
The material provided in this presentation including any information, tools, features, content and any images incorporated in the presentation, is solely for your lawful, personal, private use. You may not modify, republish, or post anything you obtain from this presentation, including anything you download from our website, unless you first obtain our written consent. You may not engage in systematic retrieval of data or other content from this website. We request that you not create any kind of hyperlink from any other site to ours unless you first obtain our written permission.