Intro-to-Tidymodels.knit

class: center, middle, hide-logo

# Introduction to Tidymodels

## by

##### Author/Presenter: Mathias Steilen
##### Last updated: _2022-11-15 13:06:11_

---

# Tidymodels

<br>

.center[
<img src="GraphicsSlides/tidymodels.png" width="100%" />

_[(Source)](https://www.tidymodels.org/)_
]

---

# Goals for this session

- Know how to model with `tidymodels` by working live on a data set
- Be able to use that knowledge and the code to apply it to other data and with other models

<br>

.center[
<img src="GraphicsSlides/to do.png" width="50%" />
]

---

# Today's data set

.center[
<img src="GraphicsSlides/pumpkin.jpg" width="50%" />
]

> The Great Pumpkin Commonwealth's (GPC) mission cultivates the hobby of growing giant pumpkins throughout the world by establishing standards and regulations that ensure quality of fruit, fairness of competition, recognition of achievement, fellowship and education for all participating growers and weigh-off sites. ([TidyTuesday Link](https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-10-19/readme.md))

---

# Off we go

```r
pumpkins <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-19/pumpkins.csv')
```

```
## Rows: 28065 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): id, place, weight_lbs, grower_name, city, state_prov, country, gpc...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
```

```r
library(tidyverse)
library(tidymodels)
library(doParallel)
library(vip)
```

---

### Inspecting the data

```r
glimpse(pumpkins)
```

```
## Rows: 28,065
## Columns: 14
## $ id                <chr> "2013-F", "2013-F", "2013-F", "2013-F", "2013-F", "2…
## $ place             <chr> "1", "2", "3", "4", "5", "5", "7", "8", "9", "10", "…
## $ weight_lbs        <chr> "154.50", "146.50", "145.00", "140.80", "139.00", "1…
## $ grower_name       <chr> "Ellenbecker, Todd & Sequoia", "Razo, Steve", "Ellen…
## $ city              <chr> "Gleason", "New Middletown", "Glenson", "Combined Lo…
## $ state_prov        <chr> "Wisconsin", "Ohio", "Wisconsin", "Wisconsin", "Wisc…
## $ country           <chr> "United States", "United States", "United States", "…
## $ gpc_site          <chr> "Nekoosa Giant Pumpkin Fest", "Ohio Valley Giant Pum…
## $ seed_mother       <chr> "209 Werner", "150.5 Snyder", "209 Werner", "109 Mar…
## $ pollinator_father <chr> "Self", NA, "103 Mackinnon", "209 Werner '12", "open…
## $ ott               <chr> "184.0", "194.0", "177.0", "194.0", "0.0", "190.0", …
## $ est_weight        <chr> "129.00", "151.00", "115.00", "151.00", "0.00", "141…
## $ pct_chart         <chr> "20.0", "-3.0", "26.0", "-7.0", "0.0", "-1.0", "-4.0…
## $ variety           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
```

The target and many other numeric values are still categorical.

---

### Recoding variables

.scroll-output[

```r
pumpkins <- pumpkins %>% 
  mutate(across(c(place, weight_lbs, ott, est_weight, pct_chart),
                as.numeric))

glimpse(pumpkins)
```

```
## Rows: 28,065
## Columns: 14
## $ id                <chr> "2013-F", "2013-F", "2013-F", "2013-F", "2013-F", "2…
## $ place             <dbl> 1, 2, 3, 4, 5, 5, 7, 8, 9, 10, 11, 12, NA, 13, 14, 1…
## $ weight_lbs        <dbl> 154.5, 146.5, 145.0, 140.8, 139.0, 139.0, 136.5, 136…
## $ grower_name       <chr> "Ellenbecker, Todd & Sequoia", "Razo, Steve", "Ellen…
## $ city              <chr> "Gleason", "New Middletown", "Glenson", "Combined Lo…
## $ state_prov        <chr> "Wisconsin", "Ohio", "Wisconsin", "Wisconsin", "Wisc…
## $ country           <chr> "United States", "United States", "United States", "…
## $ gpc_site          <chr> "Nekoosa Giant Pumpkin Fest", "Ohio Valley Giant Pum…
## $ seed_mother       <chr> "209 Werner", "150.5 Snyder", "209 Werner", "109 Mar…
## $ pollinator_father <chr> "Self", NA, "103 Mackinnon", "209 Werner '12", "open…
## $ ott               <dbl> 184, 194, 177, 194, 0, 190, 190, 182, 0, 0, 0, 177, …
## $ est_weight        <dbl> 129, 151, 115, 151, 0, 141, 142, 124, 0, 0, 0, 115, …
## $ pct_chart         <dbl> 20, -3, 26, -7, 0, -1, -4, 10, 0, 0, 0, 14, 4, 8, 14…
## $ variety           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
```
]

---

### id

```r
pumpkins %>% 
* separate(id, sep = "-", into = c("year", "type")) %>%
  count(type)
```

```
## # A tibble: 6 × 2
##   type      n
##   <chr> <int>
## 1 F      2756
## 2 L      1965
## 3 P     15965
## 4 S      1686
## 5 T      3166
## 6 W      2527
```

The documentation on GitHub states:

> Types: F = "Field Pumpkin", P = "Giant Pumpkin", S = "Giant Squash", W = "Giant Watermelon", L = "Long Gourd" (length in inches, not weight in pounds), T = Tomato

---

### id

```r
pumpkins <- pumpkins %>% 
  separate(id, sep = "-", into = c("year", "type")) %>% 
* mutate(type = case_when(
*   type == "F" ~ "Field Pumpkin",
*   type == "P" ~ "Giant Pumpkin",
*   type == "S" ~ "Giant Squash",
*   type == "W" ~ "Giant Watermelon",
*   type == "L" ~ "Long Gourd",
*   type == "T" ~ "Tomato"
* ))
```

---

### id

```r
pumpkins %>% 
  count(type, sort = T)
```

```
## # A tibble: 6 × 2
##   type                 n
##   <chr>            <int>
## 1 Giant Pumpkin    15965
## 2 Tomato            3166
## 3 Field Pumpkin     2756
## 4 Giant Watermelon  2527
## 5 Long Gourd        1965
## 6 Giant Squash      1686
```

---

### id

For the purpose of this session, let's focus on predicting giant pumpkins' weights.

```r
pumpkins <- pumpkins %>% 
  filter(type %in% c("Giant Pumpkin"))
```

```r
pumpkins %>% 
  count(type, sort = T)
```

```
## # A tibble: 1 × 2
##   type              n
##   <chr>         <int>
## 1 Giant Pumpkin 15965
```

---

### grower_name

```r
pumpkins %>% 
  count(grower_name, sort = T)
```

```
## # A tibble: 6,508 × 2
##    grower_name              n
##    <chr>                <int>
##  1 Platte, Joe             47
##  2 Sherwood, Jim           42
##  3 Gansert, Norman         41
##  4 Wolf, Andy              41
##  5 Miller, Gary            38
##  6 Karkos, Udo             34
##  7 Melka, Friedrich        34
##  8 Werner, Quinn           33
##  9 Tobeck, Cindy           32
## 10 Stelts, Dave & Carol    31
## # … with 6,498 more rows
```

Some recurring names, less frequent should be lumped together.

---

### city

```r
pumpkins %>% 
  count(city, sort = T)
```

```
## # A tibble: 2,737 × 2
##    city            n
##    <chr>       <int>
##  1 <NA>         1569
##  2 Napa          152
##  3 Barnesville    93
##  4 Kasterlee      90
##  5 Paredes        73
##  6 Gentilly       68
##  7 Olympia        65
##  8 Auburn         64
##  9 Johnston       62
## 10 Kaukauna       60
## # … with 2,727 more rows
```

Same again.

---

### state_prov

```r
pumpkins %>% 
  count(state_prov, sort = T)
```

```
## # A tibble: 134 × 2
##    state_prov       n
##    <chr>        <int>
##  1 Other         1549
##  2 Wisconsin     1031
##  3 California     991
##  4 Ontario        836
##  5 Ohio           713
##  6 Washington     712
##  7 Michigan       633
##  8 Pennsylvania   540
##  9 New York       505
## 10 Oregon         464
## # … with 124 more rows
```

Again.

---

### country

```r
pumpkins %>% 
  count(country, sort = T)
```

```
## # A tibble: 30 × 2
##    country            n
##    <chr>          <int>
##  1 United States   9902
##  2 Canada          2163
##  3 Germany         1144
##  4 Italy            512
##  5 Austria          390
##  6 Japan            333
##  7 Belgium          291
##  8 United Kingdom   179
##  9 Spain            173
## 10 Slovenia         158
## # … with 20 more rows
```

Not that bad this time, likely fewer are enough.

---

### gpc_site

```r
pumpkins %>% 
  count(gpc_site, sort = T)
```

```
## # A tibble: 173 × 2
##    gpc_site                                         n
##    <chr>                                        <int>
##  1 Wiegemeisterschaft Berlin/Brandenburg          453
##  2 Ohio Valley Giant Pumpkin Growers Weigh-off    376
##  3 Elk Grove Giant Pumpkin Festival               359
##  4 Nippon Ichi Dodekabocha Taikai                 333
##  5 Baumans Farm Giant Pumpkin Weigh-off           307
##  6 Stillwater Harvestfest                         298
##  7 Austrian Weigh-off                             285
##  8 Central Great Lakes Weigh-off                  271
##  9 Safeway World Championship Pumpkin Weigh-Off   270
## 10 Nekoosa Giant Pumpkin Fest                     263
## # … with 163 more rows
```

Again.

---

### seed_mother

```r
pumpkins %>% 
  count(seed_mother, sort = T)
```

```
## # A tibble: 6,009 × 2
##    seed_mother       n
##    <chr>         <int>
##  1 <NA>           4424
##  2 unknown         169
##  3 Unknown         145
##  4 2145 McMullen   122
##  5 2009 Wallace    104
##  6 1985 Miller      89
##  7 1911 Urena       72
##  8 2363 Holland     61
##  9 2008 Neptune     60
## 10 1495 Stelts      52
## # … with 5,999 more rows
```

Way too many levels, also likely not useful if seeds cannot be used in the future.

---

### pollinator_father

```r
pumpkins %>% 
  count(pollinator_father, sort = T)
```

```
## # A tibble: 3,465 × 2
##    pollinator_father     n
##    <chr>             <int>
##  1 <NA>               5160
##  2 self               1127
##  3 Self               1101
##  4 open                773
##  5 Open                665
##  6 2009 Wallace        143
##  7 2145 McMullen       124
##  8 unknown              82
##  9 Unknown              77
## 10 1985 Miller          68
## # … with 3,455 more rows
```

Again, likely not useful.

---

### ott (over-the-top inches)

![](Intro-to-Tidymodels_files/figure-html/unnamed-chunk-24-1.png)

Zeros? Likely the same as missing values, only not encoded as `NA`.

---

### ott (over-the-top inches)

Also, let's make that metric (cm).

```r
pumpkins <- pumpkins %>% 
  mutate(ott = ott*2.54)
```

.center[
<img src="GraphicsSlides/download.png" width="40%" />
]

---

### est_weight

![](Intro-to-Tidymodels_files/figure-html/unnamed-chunk-27-1.png)

Zeros? Same as with `ott`, likely need to drop them.

---

### est_weight

Make that metric as well (kg).

```r
pumpkins <- pumpkins %>% 
  mutate(est_weight = est_weight/2.205)
```

The target as well:

```r
pumpkins <- pumpkins %>% 
  mutate(weight_lbs = weight_lbs/2.205) %>% 
  rename(weight_kg = weight_lbs)
```

---

### weight_kg (Target)

![](Intro-to-Tidymodels_files/figure-html/unnamed-chunk-30-1.png)

---

### variety

```r
pumpkins %>% 
  count(variety, sort = T)
```

```
## # A tibble: 1 × 2
##   variety     n
##   <chr>   <int>
## 1 <NA>    15965
```

It's clear what to do with this one.

---

### Looking for missing values

```r
*colMeans(is.na(pumpkins)) %>%
  tidy() %>%
  arrange(x) %>%
  rename(pct = x) %>%
  filter(pct > 0) %>%
  arrange(-pct)
```

```
## # A tibble: 9 × 2
##   names                  pct
##   <chr>                <dbl>
## 1 variety           1       
## 2 pollinator_father 0.323   
## 3 weight_kg         0.315   
## 4 est_weight        0.304   
## 5 seed_mother       0.277   
## 6 city              0.0983  
## 7 place             0.0857  
## 8 ott               0.000564
## 9 pct_chart         0.000564
```

---

### Dealing with `NA` and zeros

```r
pumpkins <- pumpkins %>%
  # remove vars with too many missing values or uninteresting
  select(-c(variety, pollinator_father, seed_mother, pct_chart)) %>% 
  # remove obs with target equal to zero, not possible
  # remove obs with target NA values
  filter(weight_kg > 0,
         !is.na(weight_kg)) %>% 
  # replace zero values within predictors with NA values
  mutate(across(c(ott, est_weight),
                ~ ifelse(.x == 0, NA, .x)))
```

---

### Dealing with `NA` and zeros

```r
colMeans(is.na(pumpkins))
```

```
##        year        type       place   weight_kg grower_name        city 
##  0.00000000  0.00000000  0.08744969  0.00000000  0.00000000  0.11553238 
##  state_prov     country    gpc_site         ott  est_weight 
##  0.00000000  0.00000000  0.00000000  0.22036224  0.25375046
```

There are still some `NA` values, but these are manageable with imputation.

---

### Converting all character variables to factors

```r
pumpkins <- pumpkins %>% 
  mutate(across(where(is.character), as.factor))

pumpkins %>% head(5)
```

```
## # A tibble: 5 × 11
##   year  type   place weigh…¹ growe…² city  state…³ country gpc_s…⁴   ott est_w…⁵
##   <fct> <fct>  <dbl>   <dbl> <fct>   <fct> <fct>   <fct>   <fct>   <dbl>   <dbl>
## 1 2013  Giant…   356    453. Magari… Goff… New Ha… United… Deerfi…  940.     NA 
## 2 2013  Giant…   356    453. Leland… Canby Oregon  United… Bauman…  904.    451.
## 3 2013  Giant…   358    453. Pappas… Mars… Massac… United… Deerfi…  917.     NA 
## 4 2013  Giant…   359    453. Northr… Suss… New Br… Canada  Journe…  884.    423.
## 5 2013  Giant…   359    453. Schult… Deco… Iowa    United… Ryan N…  892.    434.
## # … with abbreviated variable names ¹weight_kg, ²grower_name, ³state_prov,
## #   ⁴gpc_site, ⁵est_weight
```

---

### Let's get modelling!

.center[
<img src="GraphicsSlides/get in loser.png" width="60%" />
]

---

### Creating training and holdout splits

```r
dt_split <- initial_split(pumpkins, strata = weight_kg)

dt_split
```

```
## <Training/Testing/Total>
## <8198/2734/10932>
```

---

### Creating training and holdout splits

```r
dt_train <- training(dt_split)
dt_test <- testing(dt_split)

dim(dt_train)
```

```
## [1] 8198   11
```

```r
dim(dt_test)
```

```
## [1] 2734   11
```

```r
dt_split
```

```
## <Training/Testing/Total>
## <8198/2734/10932>
```

---

### Cross-Validation for stable computation of model metrics

```r
folds <- vfold_cv(dt_train, v = 5, strata = weight_kg)

folds
```

```
## #  5-fold cross-validation using stratification 
## # A tibble: 5 × 2
##   splits              id   
##   <list>              <chr>
## 1 <split [6557/1641]> Fold1
## 2 <split [6558/1640]> Fold2
## 3 <split [6559/1639]> Fold3
## 4 <split [6559/1639]> Fold4
## 5 <split [6559/1639]> Fold5
```

When tuning hyperparameters, these folds can be passed into the tuning function.

---

### Preprocessing with `recipes`

The recipe works similarly like the well known `lm(y ~ x)` function. You can add some variables...

```r
xg_rec <- recipe(weight_kg ~ ott + est_weight + country,
                 data = dt_train)

xg_rec
```

```
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          3
```

---

### Preprocessing with `recipes`

The recipe works similarly like the well known `lm(y ~ x)` function. You can add some variables...

...or all variables.

```r
xg_rec <- recipe(weight_kg ~ ., 
                 data = dt_train)

xg_rec
```

```
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         10
```

---

### Let's see what we still need to do

```r
glimpse(pumpkins)
```

```
## Rows: 10,932
## Columns: 11
## $ year        <fct> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
## $ type        <fct> Giant Pumpkin, Giant Pumpkin, Giant Pumpkin, Giant Pumpkin…
## $ place       <dbl> 356, 356, 358, 359, 359, 361, 362, 363, 363, 363, 366, 366…
## $ weight_kg   <dbl> 453.2880, 453.2880, 453.0612, 452.8345, 452.8345, 452.6077…
## $ grower_name <fct> "Magarian, Marc", "Leland, Neal", "Pappas, Ed", "Northrup,…
## $ city        <fct> "Goffstown", "Canby", "Marshfield", "Sussex", "Decorah", "…
## $ state_prov  <fct> New Hampshire, Oregon, Massachusetts, New Brunswick, Iowa,…
## $ country     <fct> United States, United States, United States, Canada, Unite…
## $ gpc_site    <fct> Deerfield Fair, Baumans Farm Giant Pumpkin Weigh-off, Deer…
## $ ott         <dbl> 939.80, 904.24, 916.94, 883.92, 891.54, 916.94, 939.80, NA…
## $ est_weight  <dbl> NA, 450.7937, NA, 423.1293, 433.5601, NA, NA, NA, 407.7098…
```

- Factors: allow for new factor levels
- Remove: `type`
- Lump together: `grower_name`, `city`, `state_prov`, `country`, `gpc_site`
- Impute: variables with missing observations

---

### Preprocessing with `recipes`

Removing type:

```r
xg_rec <- recipe(weight_kg ~ ., 
                 data = dt_train) %>% 
* step_rm(type)

xg_rec %>% prep() %>% juice() %>% head(5)
```

```
## # A tibble: 5 × 10
##   year  place grower_name    city  state…¹ country gpc_s…²   ott est_w…³ weigh…⁴
##   <fct> <dbl> <fct>          <fct> <fct>   <fct>   <fct>   <dbl>   <dbl>   <dbl>
## 1 2013   1229 Offermann, Ir… Koeln Other   Germany XXL-Ku…  605.    137.    125.
## 2 2013   1230 Zimmermann, A… <NA>  Other   Austria Austri…   NA      NA     125.
## 3 2013   1231 Giovanni, Riva <NA>  Other   Italy   Festa …  607.    139.    125.
## 4 2013   1232 Martel Junior… Beca… Quebec  Canada  Potiro…   NA      NA     125.
## 5 2013   1232 Strange, Mike  Dorv… Quebec  Canada  Vermon…  574.    119.    125.
## # … with abbreviated variable names ¹state_prov, ²gpc_site, ³est_weight,
## #   ⁴weight_kg
```

---

### Preprocessing with `recipes`

Imputing missing values with mode for categorical and median for numeric predictors:

```r
xg_rec <- recipe(weight_kg ~ ., 
                 data = dt_train) %>% 
  step_rm(type) %>% 
* step_impute_median(all_numeric_predictors()) %>%
* step_impute_mode(all_nominal_predictors())

colMeans(is.na(xg_rec %>% prep() %>% juice()))
```

```
##        year       place grower_name        city  state_prov     country 
##           0           0           0           0           0           0 
##    gpc_site         ott  est_weight   weight_kg 
##           0           0           0           0
```

---

### Preprocessing with `recipes`

.scroll-output[

Lumping factor levels together. Because the value "Other" is already a level in the factors, we need to recode the newly added "other" to "Other" to make it the same category.

```r
xg_rec <- recipe(weight_kg ~ ., 
                 data = dt_train) %>% 
  step_rm(type) %>% 
  step_impute_median(all_numeric_predictors()) %>% 
  step_impute_mode(all_nominal_predictors()) %>% 
* step_other(grower_name, city, state_prov,
*            country, gpc_site,
*            threshold = 0.01) %>%
* step_mutate(across(c(grower_name, city, state_prov,
*                      country, gpc_site),
*            ~ forcats::fct_recode(., "Other" = "other")))

xg_rec %>% prep() %>% juice() %>% head()
```

```
## # A tibble: 6 × 10
##   year  place grower_name city    state_…¹ country gpc_s…²   ott est_w…³ weigh…⁴
##   <fct> <dbl> <fct>       <fct>   <fct>    <fct>   <fct>   <dbl>   <dbl>   <dbl>
## 1 2013   1229 Other       Other   Other    Germany Other    605.    137.    125.
## 2 2013   1230 Other       Paredes Other    Austria Austri…  734.    243.    125.
## 3 2013   1231 Other       Paredes Other    Italy   Other    607.    139.    125.
## 4 2013   1232 Other       Other   Quebec   Canada  Potiro…  734.    243.    125.
## 5 2013   1232 Other       Other   Quebec   Canada  Vermon…  574.    119.    125.
## 6 2013   1234 Other       Other   Other    Germany Other    577.    120.    124.
## # … with abbreviated variable names ¹state_prov, ²gpc_site, ³est_weight,
## #   ⁴weight_kg
```
]

---

### Preprocessing with `recipes`

.scroll-output[

Let's normalize the numerical predictors as well to see undistorted variable importance:

```r
xg_rec <- recipe(weight_kg ~ ., 
                 data = dt_train) %>% 
  step_rm(type) %>% 
  step_impute_median(all_numeric_predictors()) %>% 
  step_impute_mode(all_nominal_predictors()) %>% 
  step_other(grower_name, city, state_prov, 
             country, gpc_site,
             threshold = 0.01) %>% 
  step_mutate(across(c(grower_name, city, state_prov, 
                       country, gpc_site),
             ~ forcats::fct_recode(., "Other" = "other"))) %>% 
* step_normalize(all_numeric_predictors())

xg_rec %>% prep() %>% juice() %>% head()
```

```
## # A tibble: 6 × 10
##   year  place grower_name city    state…¹ country gpc_s…²    ott est_w…³ weigh…⁴
##   <fct> <dbl> <fct>       <fct>   <fct>   <fct>   <fct>    <dbl>   <dbl>   <dbl>
## 1 2013  0.474 Other       Other   Other   Germany Other   -0.817 -1.04      125.
## 2 2013  0.477 Other       Paredes Other   Austria Austri…  0.156  0.0143    125.
## 3 2013  0.480 Other       Paredes Other   Italy   Other   -0.798 -1.03      125.
## 4 2013  0.483 Other       Other   Quebec  Canada  Potiro…  0.156  0.0143    125.
## 5 2013  0.483 Other       Other   Quebec  Canada  Vermon… -1.05  -1.23      125.
## 6 2013  0.489 Other       Other   Other   Germany Other   -1.03  -1.21      124.
## # … with abbreviated variable names ¹state_prov, ²gpc_site, ³est_weight,
## #   ⁴weight_kg
```

]

---

### Preprocessing with `recipes`

Let's allow for novel factor levels in the new data and one-hot encode the categorical variables as well:

```r
xg_rec <- recipe(weight_kg ~ ., 
                 data = dt_train) %>% 
  step_rm(type) %>% 
  step_impute_median(all_numeric_predictors()) %>% 
  step_impute_mode(all_nominal_predictors()) %>% 
  step_other(grower_name, city, state_prov, 
             country, gpc_site,
             threshold = 0.01) %>% 
  step_mutate(across(c(grower_name, city, state_prov, 
                       country, gpc_site),
             ~ forcats::fct_recode(., "Other" = "other"))) %>% 
  step_normalize(all_numeric_predictors()) %>% 
* step_novel(all_nominal_predictors()) %>%
* step_dummy(all_nominal_predictors(), one_hot = TRUE)
```

---

### Preprocessing with `recipes`

.scroll-output[

Prepping and baking will apply all recipe steps to the new data, i.e. the holdout used for evaluation at a later stage.

```r
xg_rec %>% prep() %>% bake(dt_train) %>% head()
```

```
## # A tibble: 6 × 95
##   place    ott est_wei…¹ weigh…² year_…³ year_…⁴ year_…⁵ year_…⁶ year_…⁷ year_…⁸
##   <dbl>  <dbl>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 0.474 -0.817   -1.04      125.       1       0       0       0       0       0
## 2 0.477  0.156    0.0143    125.       1       0       0       0       0       0
## 3 0.480 -0.798   -1.03      125.       1       0       0       0       0       0
## 4 0.483  0.156    0.0143    125.       1       0       0       0       0       0
## 5 0.483 -1.05    -1.23      125.       1       0       0       0       0       0
## 6 0.489 -1.03    -1.21      124.       1       0       0       0       0       0
## # … with 85 more variables: year_X2019 <dbl>, year_X2020 <dbl>,
## #   year_X2021 <dbl>, year_new <dbl>, grower_name_Karkos..Udo <dbl>,
## #   grower_name_Other <dbl>, grower_name_new <dbl>, city_Paredes <dbl>,
## #   city_Other <dbl>, city_new <dbl>, state_prov_Alberta <dbl>,
## #   state_prov_Antwerp <dbl>, state_prov_Brandenburg <dbl>,
## #   state_prov_California <dbl>, state_prov_Colorado <dbl>,
## #   state_prov_Connecticut <dbl>, state_prov_Indiana <dbl>, …
```
]

---

### Preprocessing with `recipes`

That's the recipe done! The recipe allows to

- prevent data leakage
- create stepwise data preprocessing that is easy to understand
- no manual column indexing, subsetting, and whatever shenanigans Python requires you to do

<br>

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Silly complaint: &quot;this tool keeps you from thinking.&quot; No one stops thinking when they use good tools. They think about more important things</p>&mdash; David Robinson (@drob) <a href="https://twitter.com/drob/status/769161059743756289?ref_src=twsrc%5Etfw">August 26, 2016</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

---

### Setting up the model

.scroll-output[

```r
xg_spec <- 
  boost_tree(
    trees = 750,
    tree_depth = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    sample_size = tune(),
    mtry = tune(),
    learn_rate = tune()
  ) %>%
  set_engine("xgboost", importance = "impurity") %>%
  set_mode("regression")

xg_spec
```

```
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 750
##   min_n = tune()
##   tree_depth = tune()
##   learn_rate = tune()
##   loss_reduction = tune()
##   sample_size = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: xgboost
```
]

---

### Setting up the workflow

.scroll-output[

```r
xg_wflow <- 
  workflow() %>% 
  add_recipe(xg_rec) %>% 
  add_model(xg_spec)

xg_wflow
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 8 Recipe Steps
## 
## • step_rm()
## • step_impute_median()
## • step_impute_mode()
## • step_other()
## • step_mutate()
## • step_normalize()
## • step_novel()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 750
##   min_n = tune()
##   tree_depth = tune()
##   learn_rate = tune()
##   loss_reduction = tune()
##   sample_size = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: xgboost
```
]

---

### Setting up a tuning grid

.scroll-output[

```r
xg_grid <- 
  grid_latin_hypercube(
    tree_depth(),
    min_n(),
    loss_reduction(),
    sample_size = sample_prop(),
    finalize(mtry(), dt_train),
    learn_rate(),
    size = 10
  )

xg_grid
```

```
## # A tibble: 10 × 6
##    tree_depth min_n loss_reduction sample_size  mtry learn_rate
##         <int> <int>          <dbl>       <dbl> <int>      <dbl>
##  1         10    26       3.64e- 4       0.811    10   2.13e- 2
##  2          9    29       1.40e- 8       0.202     5   3.03e- 3
##  3          4     9       1.89e- 3       0.835    10   7.75e- 9
##  4          7    23       3.35e-10       0.930     6   8.11e-10
##  5         11     3       2.39e- 8       0.562     2   2.22e- 7
##  6          2    21       1.16e- 1       0.106     7   3.45e-10
##  7          4    16       5.24e- 1       0.282     4   4.81e- 5
##  8         14    13       2.32e+ 0       0.665     8   1.42e- 3
##  9         13    36       2.02e- 6       0.535     8   1.23e- 5
## 10          6    37       4.58e- 6       0.432     1   2.80e- 6
```
]

---

### Tuning the model

.scroll-output[

This is the most time-intensive part with large amounts of data, so we're gonna use parallelisation here (using more of your CPU's cores). Make sure to edit this code according to your needs.

```r
start_time = Sys.time()

unregister_dopar <- function() {
  env <- foreach:::.foreachGlobals
  rm(list=ls(name=env), pos=env)
}

cl <- makePSOCKcluster(6)
registerDoParallel(cl)

xg_tune <- tune_grid(
  object = xg_wflow,
  resamples = folds,
  grid = xg_grid,
  control = control_grid(save_pred = TRUE)
  )

stopCluster(cl)
unregister_dopar()

end_time = Sys.time()
end_time - start_time
```

```
## Time difference of 1.136629 mins
```
]
---

### Looking at tuning results

.scroll-output[

```r
xg_tune %>% 
  show_best("rsq") %>% 
  glimpse()
```

```
## Rows: 5
## Columns: 12
## $ mtry           <int> 2, 9, 8, 10, 6
## $ min_n          <int> 7, 3, 24, 34, 13
## $ tree_depth     <int> 7, 11, 12, 6, 14
## $ learn_rate     <dbl> 1.545806e-02, 1.194447e-07, 3.635109e-05, 4.436419e-06,…
## $ loss_reduction <dbl> 1.125698e-09, 9.118926e-05, 1.658851e-09, 1.123321e+01,…
## $ sample_size    <dbl> 0.7507816, 0.6794007, 0.9374669, 0.4770123, 0.2252345
## $ .metric        <chr> "rsq", "rsq", "rsq", "rsq", "rsq"
## $ .estimator     <chr> "standard", "standard", "standard", "standard", "standa…
## $ mean           <dbl> 0.8665343, 0.8370460, 0.8226847, 0.8031926, 0.7912635
## $ n              <int> 5, 5, 5, 5, 5
## $ std_err        <dbl> 0.010287916, 0.004288253, 0.004602997, 0.004568200, 0.0…
## $ .config        <chr> "Preprocessor1_Model01", "Preprocessor1_Model03", "Prep…
```
]

---

### Selecting the best model configuration

.scroll-output[

```r
xg_final_wflow <- xg_wflow %>% 
  finalize_workflow(select_best(xg_tune, metric = "rsq"))

xg_final_wflow
```

```
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 8 Recipe Steps
## 
## • step_rm()
## • step_impute_median()
## • step_impute_mode()
## • step_other()
## • step_mutate()
## • step_normalize()
## • step_novel()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (regression)
## 
## Main Arguments:
##   mtry = 2
##   trees = 750
##   min_n = 7
##   tree_depth = 7
##   learn_rate = 0.0154580616226708
##   loss_reduction = 1.12569835788786e-09
##   sample_size = 0.750781629404519
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: xgboost
```
]

---

### Fitting the final model on the initial split (entire training data)

```r
xg_final_fit <- xg_final_wflow %>% 
  last_fit(dt_split)

xg_final_fit
```

```
## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
##   splits              id               .metrics .notes   .predictions .workflow 
##   <list>              <chr>            <list>   <list>   <list>       <list>    
## 1 <split [8198/2734]> train/test split <tibble> <tibble> <tibble>     <workflow>
```

Let's inspect what's in there.

---

### Inspecting our final fit

```r
xg_final_fit %>% 
  collect_metrics()
```

```
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard      44.1   Preprocessor1_Model1
## 2 rsq     standard       0.890 Preprocessor1_Model1
```

---

### Inspecting our final fit

```r
xg_final_fit %>% 
  collect_predictions()
```

```
## # A tibble: 2,734 × 5
##    id               .pred  .row weight_kg .config             
##    <chr>            <dbl> <int>     <dbl> <chr>               
##  1 train/test split  367.     1      453. Preprocessor1_Model1
##  2 train/test split  345.     6      453. Preprocessor1_Model1
##  3 train/test split  379.    18      449. Preprocessor1_Model1
##  4 train/test split  371.    22      447. Preprocessor1_Model1
##  5 train/test split  400.    26      445. Preprocessor1_Model1
##  6 train/test split  393.    30      444. Preprocessor1_Model1
##  7 train/test split  385.    31      444. Preprocessor1_Model1
##  8 train/test split  372.    33      442. Preprocessor1_Model1
##  9 train/test split  382.    40      439. Preprocessor1_Model1
## 10 train/test split  389.    41      438. Preprocessor1_Model1
## # … with 2,724 more rows
```

With this, we can make a useful plot.

---

### Inspecting our final fit

.scroll-output[

```r
xg_final_fit %>% 
  collect_predictions() %>% 
  ggplot(aes(weight_kg, .pred)) +
  geom_point(alpha = 0.5, colour = "midnightblue") +
  geom_abline(lty = "dashed", colour = "grey50", size = 0.75) +
  labs(title = "Predictions vs. Actuals of Holdout")
```

<img src="Intro-to-Tidymodels_files/figure-html/unnamed-chunk-59-1.png" width="100%" />
]

---

### Variable Importance Plot

.scroll-output[

```r
xg_final_fit %>%
  extract_workflow() %>% 
  extract_fit_parsnip() %>% 
  vi() %>%
  slice_max(order_by = Importance, n = 7) %>% 
  ggplot(aes(Importance, reorder(Variable, Importance))) +
  geom_col(fill = "midnightblue", colour = "white") +
  labs(title = "Variable Importance",
       subtitle = "Only the seven most important predictors are shown.",
       y = "Predictor",
       x = "Relative Variable Importance") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
```

<img src="Intro-to-Tidymodels_files/figure-html/unnamed-chunk-60-1.png" width="100%" />
]
---

### Making predictions on new data

.scroll-output[

```r
one_pumpkin <- dt_test %>% 
  select(-weight_kg) %>% 
  sample_n(1)

one_pumpkin %>% glimpse()
```

```
## Rows: 1
## Columns: 10
## $ year        <fct> 2016
## $ type        <fct> Giant Pumpkin
## $ place       <dbl> 1188
## $ grower_name <fct> "Marin, Courant"
## $ city        <fct> NA
## $ state_prov  <fct> Quebec
## $ country     <fct> Canada
## $ gpc_site    <fct> Potirothon de Gentilly
## $ ott         <dbl> 657.86
## $ est_weight  <dbl> 176.8707
```

]
---

### Making predictions on new data

Feeding new data (in this case from the holdout) into the model enables you to make predictions. All recipe steps are applied and the workflow equally executed.

```r
xg_final_fit %>% 
  extract_workflow() %>% 
  predict(one_pumpkin)
```

```
## # A tibble: 1 × 1
##   .pred
##   <dbl>
## 1  180.
```

---

### Making predictions on new data

.scroll-output[

If you want to keep the characteristics of the new observation you made a prediction on, use `augment()` instead of `predict()`.

```r
xg_final_fit %>% 
  extract_workflow() %>% 
  augment(one_pumpkin) %>% 
  glimpse()
```

```
## Rows: 1
## Columns: 11
## $ year        <fct> 2016
## $ type        <fct> Giant Pumpkin
## $ place       <dbl> 1188
## $ grower_name <fct> "Marin, Courant"
## $ city        <fct> NA
## $ state_prov  <fct> Quebec
## $ country     <fct> Canada
## $ gpc_site    <fct> Potirothon de Gentilly
## $ ott         <dbl> 657.86
## $ est_weight  <dbl> 176.8707
## $ .pred       <dbl> 180.2735
```

]

---

### That's all!

Of course you can fit a plethora of different models. These are just slightly different, the `tidymodels` framework stays the same, that's the superpower!

For instance, here is the specification for an elastic net:

```r
en_spec <- linear_reg(penalty = tune(),
                      mixture = tune()) %>% 
  set_engine("glmnet")

en_wflow <-
  workflow() %>% 
  add_recipe(en_rec) %>% 
  add_model(en_spec)

en_grid <- 
  grid_latin_hypercube(
    penalty(),
    mixture(),
    size = 100
    )
```

---

# That's all!

We hope this tutorial was useful! Feel free to make use of the code and don't despair if you encounter errors, that's totally normal. For further questions, feel free to reach out to us. Make sure to stay updated on our socials and via our website where all resources and dates are also published.

<br>

.center[
<img src="GraphicsSlides/Logo RUG hell.png" width="60%" />

**[Website](https://rusergroup-sg.ch/) | [Instagram](https://www.instagram.com/rusergroupstgallen/?hl=en) | [Twitter](https://twitter.com/rusergroupsg)**

]

---

class: middle, inverse, hide-logo

# Thank you for attending!

<em style="color:#404040">The material provided in this presentation including any information, tools, features, content and any images incorporated in the presentation, is solely for your lawful, personal, private use. You may not modify, republish, or post anything you obtain from this presentation, including anything you download from our website, unless you first obtain our written consent. You may not engage in systematic retrieval of data or other content from this website. We request that you not create any kind of hyperlink from any other site to ours unless you first obtain our written permission.</em>