Intro-to-the-Tidyverse.knit

# Introduction to the Tidyverse (Part 1)

## by

##### Presenter: Ruben Ernst
##### Author: Mathias Steilen
##### Last updated: _2022-10-15 11:46:28_

---

_[(Source)](https://hbctraining.github.io/Intro-to-R/lessons/08_intro_tidyverse.html)_
---

### Goals for this session

- Touch a little bit of everything by working live on a data set
- Know where to look for an answer if something is not immediately clear
- Important takeaway: Mastering the packages in the `tidyverse` alone takes years, it's completely normal to spend a large amount of time on stackoverflow looking for answers

---

### Let's get settled in

Installing and loading the `tidyverse` which in turn loads all packages contained in it:

```r
install.packages("tidyverse")
```

```r
library(tidyverse)
```

Let's get the data and touch one of the first components of the `tidyverse` at the same time:

```r
install.packages("palmerpenguins")
```

```r
penguins <- palmerpenguins::penguins
```

---

### The Pipe: `%>%`

The pipe from the package `magrittr` can be translated into **"take the thing from before and pipe it (put it) into what's coming after"**.

```r
penguins %>% summary()
```

```
##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.15  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##                                  NA's   :2       NA's   :2      
##  flipper_length_mm  body_mass_g       sex           year     
##  Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
##  1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
##  Median :197.0     Median :4050   NA's  : 11   Median :2008  
##  Mean   :200.9     Mean   :4202                Mean   :2008  
##  3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
##  Max.   :231.0     Max.   :6300                Max.   :2009  
##  NA's   :2         NA's   :2
```

---

### The Pipe: `%>%`

When working with many variables:

```r
penguins %>% glimpse()
```

```
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
```

---

### The Pipe's Major Advantage Is Readability

The pipe can make nested functions look much nicer and **easier to read**.

The code below shows how to calculate the **mean body weight for the species _Adelie_ by island** without the pipe ([Source](https://cfss.uchicago.edu/notes/pipes/)):

```r
summarize(group_by(filter(penguins, species == "Adelie"), island), body_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 × 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

---

### The Pipe's Major Advantage Is Readability

The same code using the pipe can be **written, executed, and read line by line**:

```r
penguins %>%
  filter(species == "Adelie") %>%
  group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 × 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

This is **much** easier to read.

---

### So let's read it line-by-line

```r
*penguins %>%
  filter(species == "Adelie") %>%
  group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 × 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

Line-By-Line:

1. **Take `penguins` and (pass it on)**

---

### So let's read it line-by-line

```r
penguins %>% 
* filter(species == "Adelie") %>%
  group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 × 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

Line-By-Line:

1. Take `penguins` and (pass it on)
1. **Filter it so only species Adelie remains (and pass it on)**

---

### So let's read it line-by-line

```r
penguins %>% 
  filter(species == "Adelie") %>%
* group_by(island) %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 × 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

Line-By-Line:

1. Take `penguins` and (pass it on)
1. Filter it so only species Adelie remains (and pass it on)
1. **Group the data by island (and pass it on)**

---

### So let's read it line-by-line

```r
penguins %>% 
  filter(species == "Adelie") %>%
  group_by(island) %>%
* summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 3 × 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

Line-By-Line:

1. Take `penguins` and (pass it on)
1. Filter it so only species Adelie remains (and pass it on)
1. Group the data by island (and pass it on)
1. **Create a summarised variable `body_mass` with `mean`**

---

### Let's deconstruct that: `filter()`

The `filter` function will give you back what you ask it for. You can think about it as **including the observations matching your criterion**.

```r
penguins %>% 
  filter(species == "Adelie")
```

```
## # A tibble: 152 × 8
##   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
##   <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
## 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
## 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
## 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
## 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
## 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
## # … with 147 more rows, and abbreviated variable names ¹flipper_length_mm,
## #   ²body_mass_g
```
---

### Let's deconstruct that: `filter()`

`filter` replaces the native R equivalent...

```r
penguins[penguins$species == "Adelie",]
```

...and makes it part of pipe chain.

```r
penguins %>% 
  filter(species == "Adelie")
```

---

### Let's deconstruct that: `filter()`

Another neat trick: `%in%`. You can give the criterion more species.

**It's your go**, please filter this data set. We **only** want to see the species:

- Adelie
- Gentoo

Some guidance:

```r
penguins %>% 
  filter(species %in% c("Species 1 here", "Species 2 here"))
```

Solution:

```r
penguins %>% 
  filter(species %in% c("Adelie", "Gentoo"))
```

---

### Let's deconstruct that: `filter()`

How do we know which species are in the data set?

Using the `count` function, which is really useful for categorical variables:

```r
penguins %>% 
  count(species)
```

```
## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124
```

---

### Let's deconstruct that: `filter()`

How do we know which species are in the data set?

You can also sort the results:

```r
penguins %>% 
* count(species, sort = TRUE)
```

```
## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Gentoo      124
## 3 Chinstrap    68
```

---

### Let's deconstruct that: `filter()`

If you want to exclude something, you need to use the `NOT` operator:

```r
TRUE
```

```
## [1] TRUE
```

```r
!TRUE
```

```
## [1] FALSE
```

---

### Let's deconstruct that: `filter()`

Just put it before your condition:

```r
penguins %>% 
  filter(!species %in% c("Chinstrap"))
```

Show everything that's **NOT** any of the values in the vector to the right.

```
## # A tibble: 276 × 8
##   species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
##   <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
## 1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
## 2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
## 3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
## 4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
## 5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
## # … with 271 more rows, and abbreviated variable names ¹flipper_length_mm,
## #   ²body_mass_g
```

---

### Coming back to the example

The goal is to calculate **mean body mass for the species Adelie by island**.

```
## # A tibble: 3 × 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

---

### What's `group_by()`?

It unlocks the ability to calculate a number for different categories:

We have **three islands**.

```r
penguins %>% count(island, sort = T)
```

```
## # A tibble: 3 × 2
##   island        n
##   <fct>     <int>
## 1 Biscoe      168
## 2 Dream       124
## 3 Torgersen    52
```

---

### `group_by()` and `summarise()`

Just summarising without grouping gives the **mean just for the species Adelie**.

```r
penguins %>% 
  filter(species == "Adelie") %>%
  summarize(body_mass = mean(body_mass_g, na.rm = TRUE))
```

```
## # A tibble: 1 × 1
##   body_mass
##       <dbl>
## 1     3701.
```

---

### `group_by()` and `summarise()`

Note that adding `na.rm = TRUE` is necessary to avoid running into errors when missing values are present. Observations with `NA` as values are discarded for the calculation, if `na.rm = TRUE` is specified.

```r
penguins %>% 
  filter(species == "Adelie") %>%
  summarize(body_mass = mean(body_mass_g))
```

```
## # A tibble: 1 × 1
##   body_mass
##       <dbl>
## 1        NA
```

---

### `group_by()` and `summarise()`

So let's add it again. Now comes the magic. By adding the `group_by` argument, the **value is calculated across all different elements in the group**.

```r
penguins %>% 
  filter(species == "Adelie") %>%
  group_by(island) %>% 
  summarise(body_mass = mean(body_mass_g, na.rm = T))
```

```
## # A tibble: 3 × 2
##   island    body_mass
##   <fct>         <dbl>
## 1 Biscoe        3710.
## 2 Dream         3688.
## 3 Torgersen     3706.
```

`summarise` reduces the number of rows.

---

### `mutate()`

Summarise **collapsed the observations in each group** for our calculations. If we want to **preserve the same number of observations** and add a new column containing a value for each existing value, we have to use **`mutate`**.

```r
penguins %>% 
  filter(species == "Adelie") %>%
  group_by(island) %>% 
  summarize(body_mass = mean(body_mass_g, na.rm = T)) %>% 
* mutate(species = "Adelie")
```

```
## # A tibble: 3 × 3
##   island    body_mass species
##   <fct>         <dbl> <chr>  
## 1 Biscoe        3710. Adelie 
## 2 Dream         3688. Adelie 
## 3 Torgersen     3706. Adelie
```

---

### It's your go

Please calculate the mean bodyweight for each species.

Instructions:

1. Group by the variable `species`
1. Use `summarise` to calculate a mean

<br>

]

---

### Solution:

```r
penguins %>% 
  group_by(species) %>% 
  summarise(body_mass = mean(body_mass_g, na.rm = T))
```

```
## # A tibble: 3 × 2
##   species   body_mass
##   <fct>         <dbl>
## 1 Adelie        3701.
## 2 Chinstrap     3733.
## 3 Gentoo        5076.
```

---

### It's your go

Please calculate the mean bodyweight for each species and each island (as a combination).

Tip: You can **add more than one variable to the `group_by()` argument**.

<br>

]

---

### Solution:

```r
penguins %>% 
* group_by(island, species) %>%
  summarise(body_mass = mean(body_mass_g, na.rm = T))
```

```
## # A tibble: 5 × 3
## # Groups:   island [3]
##   island    species   body_mass
##   <fct>     <fct>         <dbl>
## 1 Biscoe    Adelie        3710.
## 2 Biscoe    Gentoo        5076.
## 3 Dream     Adelie        3688.
## 4 Dream     Chinstrap     3733.
## 5 Torgersen Adelie        3706.
```

---

### It's your go

Please calculate the mean bodymass for each combination of species and sex.

<br>

]

---

### Hold on

What is this?

```r
penguins %>% 
  group_by(species, sex) %>% 
  summarise(body_mass = mean(body_mass_g, na.rm = T))
```

```
## # A tibble: 8 × 3
## # Groups:   species [3]
##   species   sex    body_mass
##   <fct>     <fct>      <dbl>
## 1 Adelie    female     3369.
## 2 Adelie    male       4043.
## 3 Adelie    <NA>       3540 
## 4 Chinstrap female     3527.
## 5 Chinstrap male       3939.
## 6 Gentoo    female     4680.
## 7 Gentoo    male       5485.
## 8 Gentoo    <NA>       4588.
```

]

]]

---

### Hold on

What is this?

```r
penguins %>% 
  group_by(species, sex) %>% 
  summarise(body_mass = mean(body_mass_g, na.rm = T))
```

]

**The NAs come from the `sex` variable.**

```r
penguins %>% 
  count(sex)
```

```
## # A tibble: 3 × 2
##   sex        n
##   <fct>  <int>
## 1 female   165
## 2 male     168
## 3 <NA>      11
```

]

---

### `drop_na` and `select`

To solve this task, we need to take care of the `NA` values in the columns we are working with.

However, there is one problem:

**If we drop all rows with NA values in the whole data frame, we will delete too many rows!**

---

### `drop_na` and `select`

So we need to **select the columns we need first**:

```r
penguins %>% 
  select(species, sex, body_mass_g)
```

```
## # A tibble: 344 × 3
##    species sex    body_mass_g
##    <fct>   <fct>        <int>
##  1 Adelie  male          3750
##  2 Adelie  female        3800
##  3 Adelie  female        3250
##  4 Adelie  <NA>            NA
##  5 Adelie  female        3450
##  6 Adelie  male          3650
##  7 Adelie  female        3625
##  8 Adelie  male          4675
##  9 Adelie  <NA>          3475
## 10 Adelie  <NA>          4250
## # … with 334 more rows
```

---

### `drop_na` and `select`

**Then** drop.

```r
penguins %>% 
  select(species, sex, body_mass_g) %>% 
  drop_na()
```

```
## # A tibble: 333 × 3
##    species sex    body_mass_g
##    <fct>   <fct>        <int>
##  1 Adelie  male          3750
##  2 Adelie  female        3800
##  3 Adelie  female        3250
##  4 Adelie  female        3450
##  5 Adelie  male          3650
##  6 Adelie  female        3625
##  7 Adelie  male          4675
##  8 Adelie  female        3200
##  9 Adelie  male          3800
## 10 Adelie  male          4400
## # … with 323 more rows
```

---

### So let's try again

Please calculate the mean for each species and each sex, dropping missing data in the necessary columns first (first select, then drop missing values, then summarise by groups).

<br>

]

---

### Solution:

```r
penguins %>% 
  select(species, sex, body_mass_g) %>% 
  drop_na() %>% 
  group_by(species, sex) %>% 
  summarise(body_mass = mean(body_mass_g, na.rm = T))
```

```
## # A tibble: 6 × 3
## # Groups:   species [3]
##   species   sex    body_mass
##   <fct>     <fct>      <dbl>
## 1 Adelie    female     3369.
## 2 Adelie    male       4043.
## 3 Chinstrap female     3527.
## 4 Chinstrap male       3939.
## 5 Gentoo    female     4680.
## 6 Gentoo    male       5485.
```

---

### Note: If you continue, ungroup:

```r
penguins %>% 
  select(species, sex, body_mass_g) %>% 
  drop_na() %>% 
  group_by(species, sex) %>% 
  summarise(body_mass = mean(body_mass_g, na.rm = T)) %>% 
* ungroup()
```

```
## # A tibble: 6 × 3
##   species   sex    body_mass
##   <fct>     <fct>      <dbl>
## 1 Adelie    female     3369.
## 2 Adelie    male       4043.
## 3 Chinstrap female     3527.
## 4 Chinstrap male       3939.
## 5 Gentoo    female     4680.
## 6 Gentoo    male       5485.
```

---

### It's your go

Please calculate the **mean bill length and depth for all the available years for all species**.

<br>

]

---

### Solution:

```r
penguins %>% 
  select(species, year, bill_length_mm, bill_depth_mm) %>% 
  group_by(species, year) %>% 
  summarise(bill_length = mean(bill_length_mm, na.rm = T),
            bill_depth = mean(bill_depth_mm, na.rm = T))
```

```
## # A tibble: 9 × 4
## # Groups:   species [3]
##   species    year bill_length bill_depth
##   <fct>     <int>       <dbl>      <dbl>
## 1 Adelie     2007        38.8       18.8
## 2 Adelie     2008        38.6       18.2
## 3 Adelie     2009        39.0       18.1
## 4 Chinstrap  2007        48.7       18.5
## 5 Chinstrap  2008        48.7       18.4
## 6 Chinstrap  2009        49.1       18.3
## 7 Gentoo     2007        47.0       14.7
## 8 Gentoo     2008        46.9       14.9
## 9 Gentoo     2009        48.5       15.3
```

Now it would be really nice to be able to compare from left to right, over time.

---

### Enter: `pivot_wider()`

```r
penguins %>% 
  group_by(year, species) %>% 
  summarise(mean_body_mass = mean(body_mass_g, na.rm = T)) %>% 
  ungroup() %>% 
* pivot_wider(
*   names_from = year, # make this the column names
*   values_from = mean_body_mass # fill rows with these values
*   )
```

```
## # A tibble: 3 × 4
##   species   `2007` `2008` `2009`
##   <fct>      <dbl>  <dbl>  <dbl>
## 1 Adelie     3696.  3742   3665.
## 2 Chinstrap  3694.  3800   3725 
## 3 Gentoo     5071.  5020.  5141.
```

---

### The opposite: `pivot_longer()`

```r
penguins %>% 
  group_by(year, species) %>% 
  summarise(mean_body_mass = mean(body_mass_g, na.rm = T)) %>% 
  ungroup() %>% 
  pivot_wider(names_from = year, 
              values_from = mean_body_mass) %>% 
* pivot_longer(- species, # Don't pivot this column
*              names_to = "year", # Column for column headers
*              values_to = "mean_body_mass") %>% # Column for values
  arrange(species) # sort descending: desc(species)
```

```
## # A tibble: 9 × 3
##   species   year  mean_body_mass
##   <fct>     <chr>          <dbl>
## 1 Adelie    2007           3696.
## 2 Adelie    2008           3742 
## 3 Adelie    2009           3665.
## 4 Chinstrap 2007           3694.
## 5 Chinstrap 2008           3800 
## 6 Chinstrap 2009           3725 
## 7 Gentoo    2007           5071.
## 8 Gentoo    2008           5020.
## 9 Gentoo    2009           5141.
```

]

---

### Now everything together

We want to see the **percentage** of male or percentage of female penguins **for each species**. Here, pivoting wider will come in handy.

Please **`count` the numbers of penguins for each sex and species**. Then, **pivot wider to have sex as the column headers**. Make sure to **drop missing values in the two variables**. Tip: No grouping is necessary for `count`, just pass the two variables to the function.

With **`mutate`, you will be able to calculate the percentage**. Then use **`select` to remove unnecessary information**. (If you know it, you can use `transmute` as well.)

<br>

]

---

### Solution:

```r
penguins %>% 
  select(sex, species) %>% 
  drop_na() %>% 
  count(sex, species) %>% 
  pivot_wider(names_from = sex, values_from = n) %>% 
  transmute(species,
            pct_male = male/(male + female))
```

```
## # A tibble: 3 × 2
##   species   pct_male
##   <fct>        <dbl>
## 1 Adelie       0.5  
## 2 Chinstrap    0.5  
## 3 Gentoo       0.513
```

Note: Using **`transmute` does the same as first mutating and then selecting**. It is a `mutate` and a `select` in **one single step**.

---

# That's it for today!

There will be more advanced demos around the `tidyverse` which will enable you to work with larger data sets and more complicated operations. Make sure to stay updated on our socials and via our website, where all resources and dates are also published.

**[Website](https://rusergroup-sg.ch/) | [Instagram](https://www.instagram.com/rusergroupstgallen/?hl=en) | [Twitter](https://twitter.com/rusergroupsg)**

---

# Thank you for attending!

<em style="color:#404040">The material provided in this presentation including any information, tools, features, content and any images incorporated in the presentation, is solely for your lawful, personal, private use. You may not modify, republish, or post anything you obtain from this presentation, including anything you download from our website, unless you first obtain our written consent. You may not engage in systematic retrieval of data or other content from this website. We request that you not create any kind of hyperlink from any other site to ours unless you first obtain our written permission.</em>