class: center, middle, hide-logo <style type="text/css"> pre { background: #F8F8F8; max-width: 100%; overflow-x: scroll; } </style> <style type="text/css"> .scroll-output { height: 80%; overflow-y: scroll; } </style> # Introduction to R ## by <img src="GraphicsSlides/Logo RUG hell.png" width="50%" /> ##### Authors/Presenters: Sven, Mathias ##### Last updated: _2022-09-29 10:37:20_ --- class: hide-logo, center, middle <img src="GraphicsSlides/Logo RUG hell.png" width="50%" /> ### Who's in the room? --- class: center, hide-logo background-image: url("GraphicsSlides/Offering.png") background-size: contain --- ### Goals for today's session - Introduction / Welcome - Why should you learn R? - Installing R and RStudio on your machine - Getting a first overview of RStudio - Quick intro to Data Types and Objects - Getting to know Base R commands - Learning a bit about Penguins <img src="GraphicsSlides/palmer.png" width="35%" /> --- ### Why should you learn R? - Excel can be extremely messy, especially when working with multiple data sheets <br> <img src="GraphicsSlides/excel_mess.png" width="110%" /> --- ### Why should you learn R? .scroll-output[ .center[ <img src="GraphicsSlides/plot_paper.png" width="70%" /> ] .center[ <img src="GraphicsSlides/excel_data.png" width="120%" /> ] ] --- ### Why should you learn R? 
.scroll-output[ ```r trade_data <- read_csv("DataSets/trade_data.csv", col_types = cols(`Period 1,2` = "d")) out_plot <- trade_data |> rename(Country = `Commercial partner`, Period = `Period 1,2`, Tariff = `Kind of rate`, CHF_Value = `Value (CHF)`, Duties_CHF = `Amount of customs duties (CHF)`) |> select(Country, Period, Tariff, Duties_CHF, Product) |> mutate( Tariff = case_when(Tariff == "Total trade" ~ "Total_Trade", Tariff == "10 - Normal rate" ~ "Normal_Rate", Tariff == "20 - Reduced rate" ~ "Reduced_Rate", Tariff == "30 - Exemption" ~ "Exemption", TRUE ~ "Other")) |> group_by(Country, Period, Tariff, Product) |> summarize(Duties_CHF = max(Duties_CHF)) |> pivot_wider(names_from = Tariff, values_from = Duties_CHF) |> ungroup() |> replace_na(list(Normal_Rate = 0, Total_Trade = 0, Reduced_Rate = 0, Exemption = 0)) |> mutate(Util_Rate = (Exemption + Reduced_Rate) / Total_Trade) |> filter(Country == "India") |> ggplot(aes(x = Period, y = Util_Rate, color = as_factor(Product)))+ geom_point(size = 3)+ geom_line(size = 0.8)+ scale_y_continuous(breaks = seq(0,1.2, by = 0.2), labels = percent_format())+ scale_x_continuous(breaks = seq(2000,2018, by = 2))+ expand_limits(y = c(0, 1))+ theme_minimal()+ labs(title = "Adjusted Utilization Rate 2000 - 2018")+ theme(plot.title = element_text(size = 22, hjust = 0.5, face = "bold"), axis.title = element_blank(), axis.text = element_text(size = 14), panel.grid.minor.x = element_blank(), panel.grid.major.x = element_blank(), legend.position = "bottom", legend.title = element_blank(), legend.text = element_text(size = 12)) ``` ] --- .center[ <br> <br> <img src="Intro_to_R_files/figure-html/unnamed-chunk-10-1.png" width="100%" /> ] --- ### Why should you learn R? .pull-left[ <img src="Intro_to_R_files/figure-html/unnamed-chunk-11-1.png" width="100%" /> ] .pull-right[ - Most of all: Reproducible! - Work with Large Data Sets --> eg. 
[**US Election Donations Analysis**](https://github.com/svensglinz/Data_Analysis_Project) <br> <img src="GraphicsSlides/strong v weak.png" width="100%" /> ] --- ### Shiny Apps: Host your work online for external users <div> <center> <iframe src="https://svenglinz.shinyapps.io/InteractivePlot/?showcase=0" width="900px" height="610px" frameBorder="0"> </iframe> </center> </div> --- ### In brief: Data Types and Objects In our R environment, we can store all sorts of objects and data. There are various kinds of objects such as *vectors*, *matrices*, *lists* and *tibbles*. All objects are of a certain type (for example Integer, Double, Character, Logical) and it is important to always make sure that your data is stored as the right type. .pull-left[ .center[ <img src="GraphicsSlides/wrong_data_type.png" width="65%" /> ] ] .pull-right[ Example: Numbers and Strings ```r first_number <- "100" second_number <- 200 first_number + second_number ``` ``` ## Error in first_number + second_number: non-numeric argument to binary operator ``` ] --- ### Data Types and Objects - Logical Values can be represented by: - TRUE/FALSE (alternatively: T/F) - When performing mathematical operations, TRUE counts as 1 and FALSE as 0: ```r a <- TRUE b <- T c <- FALSE sum(a,b,c) ``` ``` ## [1] 2 ``` --- ### Data Types and Objects - Vectors contain multiple numbers, strings, logical values etc. - They can only contain data of **one type**: if you put a number and a string into the same vector, R converts all elements to one common type (here: strings)! 
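A minimal sketch of what "one type" means in practice: mixing types in `c()` does not raise an error, R quietly converts every element to the most flexible type present (logical, then numeric, then character). The object names `mixed_num`, `mixed_chr` and `back_to_num` are made up for this illustration.

```r
# Mixing types in c() does not fail; R coerces to a common type
mixed_num <- c(1, TRUE, FALSE)   # logical values become numbers
mixed_chr <- c(1, "two", TRUE)   # everything becomes a string

class(mixed_num)   # "numeric"
class(mixed_chr)   # "character"

# Converting back only works for strings that look like numbers;
# all other strings become NA (normally with a warning)
back_to_num <- suppressWarnings(as.numeric(c("100", "test")))
back_to_num
```

Checking `class()` after building a vector is a quick way to notice such a silent conversion before it causes an error further down the line.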
.pull-left[ ```r first_vector <- c(1,2,3,4) second_vector <- c(5,6,7,8) third_vector <- c("hello", 100, "test", FALSE) first_vector + second_vector ``` ``` ## [1] 6 8 10 12 ``` ```r third_vector ``` ``` ## [1] "hello" "100" "test" "FALSE" ``` ] .pull-right[ <br> <img src="GraphicsSlides/vectors.png" width="140%" /> ] --- ### Data Types and Objects .scroll-output[ - Usually, you will work with rectangular data in `tibbles` or `data frames` - `tibble`: two dimensional frame made of different columns - each column is a distinct vector (i.e. different data types possible) ```r tibble_test <- tibble(x = first_vector, y = second_vector, z = third_vector) tibble_test ``` ``` ## # A tibble: 4 × 3 ## x y z ## <dbl> <dbl> <chr> ## 1 1 5 hello ## 2 2 6 100 ## 3 3 7 test ## 4 4 8 FALSE ``` A list is like a vector but can contain different data types: ```r list_test <- list(vector = first_vector, tibble = tibble_test, string = first_number, number = as.numeric(first_number)) list_test ``` ``` ## $vector ## [1] 1 2 3 4 ## ## $tibble ## # A tibble: 4 × 3 ## x y z ## <dbl> <dbl> <chr> ## 1 1 5 hello ## 2 2 6 100 ## 3 3 7 test ## 4 4 8 FALSE ## ## $string ## [1] "100" ## ## $number ## [1] 100 ``` ] --- ## Working with Tibbles .pull-left[ <img src="GraphicsSlides/palmerpenguins_package.png" width="50%" /> ] .pull-right[ ```r #install.packages("palmerpenguins") penguins <- palmerpenguins::penguins ``` ] ```r head(penguins, 4) ``` ``` ## # A tibble: 4 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex ## <fct> <fct> <dbl> <dbl> <int> <int> <fct> ## 1 Adelie Torge… 39.1 18.7 181 3750 male ## 2 Adelie Torge… 39.5 17.4 186 3800 fema… ## 3 Adelie Torge… 40.3 18 195 3250 fema… ## 4 Adelie Torge… NA NA NA NA <NA> ## # … with 1 more variable: year <int> ``` --- ### Working with Tibbles - Simple Subsetting You can subset every tibble as follows: - row and column: - `tibble_name[row_index, column_index/name] ` - row only: - `tibble_name[row_index,]` - column 
only: - `tibble_name[column_index/name]` OR `tibble_name[,column_index/name]` Now let's get the **second row** and **third column** of the penguins data -- ```r penguins[2,3] penguins[2, "bill_length_mm"] ``` -- What if we want multiple rows/columns or if we want to drop some rows? -- ```r penguins[c(1:5, 200:202), c(2,5)] penguins[-c(1:4),1] ``` --- ### Subsetting with Logical Conditions If we are looking for specific values within our data set (column, row or both), we can subset the data frame by **logical conditions**. - `==` means **equal to** - `&` means **and** - `|` means **or** - `>`, `<`, `<=`, `>=` are used to **compare numerical values** ```r penguins[penguins$sex == "male" & penguins$island == "Biscoe",] ``` ``` ## # A tibble: 5 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex ## <fct> <fct> <dbl> <dbl> <int> <int> <fct> ## 1 Adelie Biscoe 37.7 18.7 180 3600 male ## 2 Adelie Biscoe 38.2 18.1 185 3950 male ## 3 Adelie Biscoe 38.8 17.2 180 3800 male ## 4 Adelie Biscoe 40.6 18.6 183 3550 male ## 5 Adelie Biscoe 40.5 18.9 180 3950 male ## # … with 1 more variable: year <int> ``` --- ### Task 1: Replace island "Dream" with "Unknown" **Tip:** Start by subsetting the tibble for the column *island* and, within this column, all values that equal *Dream*. Then, assign a new value to these values with the operator `<-`. ```r penguins[,"island"] ``` ``` ## # A tibble: 344 × 1 ## island ## <fct> ## 1 Torgersen ## 2 Torgersen ## 3 Torgersen ## 4 Torgersen ## 5 Torgersen ## 6 Torgersen ## 7 Torgersen ## 8 Torgersen ## 9 Torgersen ## 10 Torgersen ## # … with 334 more rows ``` --- ### Step 1: Let's filter our data set and replace the results with "Unknown" ```r penguins[penguins$island == "Dream", "island"] <- "Unknown" ``` -- .pull-left[ <img src="GraphicsSlides/no_work.png" width="70%" /> ] .pull-right[ ``` ## Error: ## ! Assigned data `"Unknown"` must be compatible with existing data. ## ℹ Error occurred for column `island`. 
## ✖ Can't convert from <character> to <factor<ccf33>> due to loss of generality. ## • Locations: 1. ``` ] --- ### Step 2: Let's convert the column from factor to character ```r penguins$island <- as.character(penguins$island) ``` -- Did it work? -- Yes! ```r str(penguins$island) ``` ``` ## chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" "Torgersen" ... ``` --- ### Back to Step 1: ```r penguins[penguins$island == "Dream", "island"] <- "Unknown" ``` -- This time, it worked: ```r penguins[penguins$island == "Unknown",] ``` ``` ## # A tibble: 3 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex ## <fct> <chr> <dbl> <dbl> <int> <int> <fct> ## 1 Adelie Unkno… 39.5 16.7 178 3250 fema… ## 2 Adelie Unkno… 37.2 18.1 178 3900 male ## 3 Adelie Unkno… 39.5 17.8 188 3300 fema… ## # … with 1 more variable: year <int> ``` -- And if needed, convert the column back to a factor. ```r penguins$island <- as.factor(penguins$island) ``` --- ### Task 2: How can we see the difference in body mass between female and male penguins? 
```r library(dplyr) glimpse(penguins) ``` ``` ## Rows: 344 ## Columns: 8 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel… ## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, … ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, … ## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186… ## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, … ## $ sex <fct> male, female, female, NA, female, male, female, male… ## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007… ``` --- ### Step 1: Subsetting - Subset the data frame for one sex ```r penguins[penguins$sex == "male",] ``` ``` ## # A tibble: 179 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Adelie Torgersen 39.1 18.7 181 3750 ## 2 <NA> <NA> NA NA NA NA ## 3 Adelie Torgersen 39.3 20.6 190 3650 ## 4 Adelie Torgersen 39.2 19.6 195 4675 ## 5 <NA> <NA> NA NA NA NA ## 6 <NA> <NA> NA NA NA NA ## 7 <NA> <NA> NA NA NA NA ## 8 <NA> <NA> NA NA NA NA ## 9 Adelie Torgersen 38.6 21.2 191 3800 ## 10 Adelie Torgersen 34.6 21.1 198 4400 ## # … with 169 more rows, and 2 more variables: sex <fct>, year <int> ``` --- ### Step 2: Get body mass - Extract the vector you are interested in with the $ sign ```r penguins[penguins$sex == "male",]$body_mass_g ``` ``` ## [1] 3750 NA 3650 4675 NA NA NA NA 3800 4400 4500 4200 3600 3950 3800 ## [16] 3550 3950 3900 3900 4150 3950 4650 3900 4400 4600 3425 NA 4150 4300 4050 ## [31] 3700 3800 3750 4400 4050 3950 4100 4450 3900 4150 4250 3900 4000 4700 4200 ## [46] 3550 3800 3950 4300 4450 4300 4350 4100 4725 4250 3550 3900 4775 4600 4275 ## [61] 4075 3775 3325 3500 3875 4000 4300 4000 3500 4475 3900 3975 4250 3475 3725 ## [76] 3650 4250 3750 4000 5700 5700 5400 5200 5150 5550 5850 5850 6300 5350 5700 ## [91] 5050 5100 NA 
5650 5550 5250 6050 5400 5250 5350 5700 4750 5550 5400 5300 ## [106] 5300 5000 5050 5000 5550 5300 5650 5700 NA 5800 5550 5000 5100 5800 6000 ## [121] 5950 5450 5350 5600 5300 5550 5400 5650 5200 4925 5250 5600 5500 NA 5500 ## [136] 5500 5500 5950 5500 5850 NA 6000 NA 5750 5400 3900 3650 3725 3750 3700 ## [151] 3775 4050 4050 3300 4400 3400 3800 4150 3800 4550 4300 4100 3600 4800 4500 ## [166] 3950 3550 4450 4300 3250 3950 4050 3450 4050 3800 3950 4000 3775 4100 ``` --- ### Step 3: Calculate the mean - use the native `mean()` function to calculate the mean body mass -- ```r mean(penguins[penguins$sex == "male",]$body_mass_g) ``` -- ``` ## [1] NA ``` What is this? .center[ <img src="GraphicsSlides/shock.jpg" width="40%" /> ] --- ### Back to Step 2: Get body mass There are `NA` in the data! ```r penguins[penguins$sex == "male",]$body_mass_g ``` ``` ## [1] 3750 NA 3650 4675 NA NA NA NA 3800 4400 4500 4200 3600 3950 3800 ## [16] 3550 3950 3900 3900 4150 3950 4650 3900 4400 4600 3425 NA 4150 4300 4050 ## [31] 3700 3800 3750 4400 4050 3950 4100 4450 3900 4150 4250 3900 4000 4700 4200 ## [46] 3550 3800 3950 4300 4450 4300 4350 4100 4725 4250 3550 3900 4775 4600 4275 ## [61] 4075 3775 3325 3500 3875 4000 4300 4000 3500 4475 3900 3975 4250 3475 3725 ## [76] 3650 4250 3750 4000 5700 5700 5400 5200 5150 5550 5850 5850 6300 5350 5700 ## [91] 5050 5100 NA 5650 5550 5250 6050 5400 5250 5350 5700 4750 5550 5400 5300 ## [106] 5300 5000 5050 5000 5550 5300 5650 5700 NA 5800 5550 5000 5100 5800 6000 ## [121] 5950 5450 5350 5600 5300 5550 5400 5650 5200 4925 5250 5600 5500 NA 5500 ## [136] 5500 5500 5950 5500 5850 NA 6000 NA 5750 5400 3900 3650 3725 3750 3700 ## [151] 3775 4050 4050 3300 4400 3400 3800 4150 3800 4550 4300 4100 3600 4800 4500 ## [166] 3950 3550 4450 4300 3250 3950 4050 3450 4050 3800 3950 4000 3775 4100 ``` --- ### Back to Step 3: Calculate the mean ```r mean(penguins[penguins$sex == "male",]$body_mass_g, * na.rm = T) ``` ``` ## [1] 4545.685 ``` .center[ <img 
src="GraphicsSlides/flex tape.png" width="40%" /> ] --- ### Step 4: Compare male and female penguins ```r tibble( mean_g = c(mean(penguins[penguins$sex == "male",]$body_mass_g, na.rm = T), mean(penguins[penguins$sex == "female",]$body_mass_g, na.rm = T)), sex = c("male", "female") ) ``` ``` ## # A tibble: 2 × 2 ## mean_g sex ## <dbl> <chr> ## 1 4546. male ## 2 3862. female ``` --- ### How do we know what arguments a function such as `mean()` can take? .pull-left[ <img src="GraphicsSlides/wait_a_minute.png" width="60%" /> ] .pull-right[ ```r args(mean.default) ``` ``` ## function (x, trim = 0, na.rm = FALSE, ...) ## NULL ``` [**R-Documentation**](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean) or the easier way: ```r ?mean() ``` ] --- ### Task 3 - Calculate the mean for all numeric variables Tip: the function **`sapply(tibble, function)`** applies the **function** to all vectors/columns in the **tibble** and returns the result as a vector (also think about `NA` again!) 
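As a rough sketch of how `sapply()` behaves, here is a small made-up data frame (`toy` and its columns are hypothetical, not the penguins data). Any extra arguments placed after the function, such as `na.rm = TRUE`, are passed on to that function for every column.

```r
# Hypothetical toy data with one missing value per column
toy <- data.frame(
  height = c(1.5, 2.5, NA),
  weight = c(10, NA, 30)
)

# Without na.rm, every column that contains an NA yields NA
means_with_na <- sapply(toy, mean)
means_with_na

# Additional arguments are forwarded to mean() for each column
col_means <- sapply(toy, mean, na.rm = TRUE)
col_means
```

The result is a named vector with one element per column, which is why it can be passed straight into `tibble()` afterwards.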
```r library(dplyr) glimpse(penguins) ``` ``` ## Rows: 344 ## Columns: 8 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel… ## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, … ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, … ## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186… ## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, … ## $ sex <fct> male, female, female, NA, female, male, female, male… ## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007… ``` --- ### Step 1: Filter out all numeric columns and store the result as a new object ```r data_male <- penguins[penguins$sex == "male", c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")] data_female <- penguins[penguins$sex == "female", c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")] ``` --- ### Step 2: .scroll-output[ Calculate the mean for all columns in the data frame with `sapply()`, store the resulting vectors and produce the tibble. ```r mean_male <- sapply( data_male, mean, * na.rm = T ) mean_female <- sapply( * na.omit(data_female), mean ) tibble(male = mean_male, female = mean_female) ``` ``` ## # A tibble: 4 × 2 ## male female ## <dbl> <dbl> ## 1 45.9 42.1 ## 2 17.9 16.4 ## 3 205. 197. ## 4 4546. 3862. ``` ] --- ### Task 2: How many penguins live on each island? - Also: How many islands are there? 
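If you are unsure where to start, the following sketch shows the general idea on a made-up factor (`place` is a hypothetical stand-in for a column such as `penguins$island`). It is one possible approach, not the only one.

```r
# Hypothetical factor standing in for a categorical column
place <- factor(c("north", "south", "north", "north"))

# Number of observations per category
place_counts <- table(place)
place_counts

# Number of levels defined for the factor (includes unused levels)
nlevels(place)

# Number of distinct values that actually occur; also works for non-factors
length(unique(place))
```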
```r library(dplyr) glimpse(penguins) ``` ``` ## Rows: 344 ## Columns: 8 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel… ## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, … ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, … ## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186… ## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, … ## $ sex <fct> male, female, female, NA, female, male, female, male… ## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007… ``` --- ### Step 1: Do we have 1 observation per penguin? .scroll-output[ - Look at documentation or know from data (in obvious cases like this) - Here: counting islands will be sufficient ```r penguins$island ``` ``` ## [1] Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen ## [8] Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen ## [15] Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Biscoe ## [22] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [29] Biscoe Biscoe Unknown Unknown Unknown Unknown Unknown ## [36] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [43] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [50] Unknown Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [57] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [64] Biscoe Biscoe Biscoe Biscoe Biscoe Torgersen Torgersen ## [71] Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen ## [78] Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen ## [85] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [92] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [99] Unknown Unknown Biscoe Biscoe Biscoe Biscoe Biscoe ## [106] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [113] Biscoe Biscoe Biscoe Biscoe Torgersen 
Torgersen Torgersen ## [120] Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen ## [127] Torgersen Torgersen Torgersen Torgersen Torgersen Torgersen Unknown ## [134] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [141] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [148] Unknown Unknown Unknown Unknown Unknown Biscoe Biscoe ## [155] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [162] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [169] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [176] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [183] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [190] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [197] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [204] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [211] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [218] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [225] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [232] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [239] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [246] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [253] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [260] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [267] Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe Biscoe ## [274] Biscoe Biscoe Biscoe Unknown Unknown Unknown Unknown ## [281] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [288] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [295] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [302] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [309] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [316] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [323] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [330] Unknown Unknown Unknown Unknown Unknown Unknown Unknown ## [337] Unknown Unknown Unknown Unknown Unknown Unknown 
Unknown ## [344] Unknown ## Levels: Biscoe Torgersen Unknown ``` ] --- ### Step 2: Use `table()` ```r table(penguins$island) ``` ``` ## ## Biscoe Torgersen Unknown ## 168 52 124 ``` -- - Quick outlook on `tidyverse` alternative ```r penguins %>% * count(island) ``` ``` ## # A tibble: 3 × 2 ## island n ## <fct> <int> ## 1 Biscoe 168 ## 2 Torgersen 52 ## 3 Unknown 124 ``` --- ### Task 3: What species live on island "Biscoe"? - Eventually: Calculate a percentage for each species on island Biscoe ```r glimpse(penguins) ``` ``` ## Rows: 344 ## Columns: 8 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel… ## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, … ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, … ## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186… ## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, … ## $ sex <fct> male, female, female, NA, female, male, female, male… ## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007… ``` --- ### Step 1: Subset for island ```r penguins[penguins$island == "Biscoe",] ``` ``` ## # A tibble: 168 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Adelie Biscoe 37.8 18.3 174 3400 ## 2 Adelie Biscoe 37.7 18.7 180 3600 ## 3 Adelie Biscoe 35.9 19.2 189 3800 ## 4 Adelie Biscoe 38.2 18.1 185 3950 ## 5 Adelie Biscoe 38.8 17.2 180 3800 ## 6 Adelie Biscoe 35.3 18.9 187 3800 ## 7 Adelie Biscoe 40.6 18.6 183 3550 ## 8 Adelie Biscoe 40.5 17.9 187 3200 ## 9 Adelie Biscoe 37.9 18.6 172 3150 ## 10 Adelie Biscoe 40.5 18.9 180 3950 ## # … with 158 more rows, and 2 more variables: sex <fct>, year <int> ``` --- ### Step 2: Get species vector ```r penguins[penguins$island == "Biscoe",]$species ``` ``` ## [1] Adelie Adelie Adelie Adelie Adelie Adelie Adelie 
Adelie Adelie Adelie ## [11] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie ## [21] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie ## [31] Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie Adelie ## [41] Adelie Adelie Adelie Adelie Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [51] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [61] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [71] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [81] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [91] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [101] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [111] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [121] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [131] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [141] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [151] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## [161] Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo Gentoo ## Levels: Adelie Chinstrap Gentoo ``` --- ### Step 3: Ol' reliable `table()` .scroll-output[ ```r table(penguins[penguins$island == "Biscoe",]$species) ``` ``` ## ## Adelie Chinstrap Gentoo ## 44 0 124 ``` - Can you calculate a percentage from this? - You can divide the table by its sum, but to store the percentage as a new column next to the counts, convert the table to a data frame first. --> A data frame is similar to a tibble; however, the functions available to coerce an object of class "table" differ between the two. 
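To make the conversion concrete, here is a sketch on a made-up vector (`animals` is a hypothetical example, not part of the penguins data). It also shows base R's `prop.table()`, which returns the shares directly from a table when no data frame is needed.

```r
# Hypothetical example vector
animals <- c("cat", "dog", "dog", "dog")

counts <- table(animals)

# A "table" object is not a data frame ...
class(counts)

# ... but as.data.frame() turns it into one.
# Because table() was called on a named object, the first column
# is called "animals"; the counts end up in "Freq".
counts_df <- as.data.frame(counts)
counts_df$share <- counts_df$Freq / sum(counts_df$Freq)
counts_df

# Alternative: prop.table() computes the shares on the table itself
shares <- prop.table(counts)
shares
```

The data frame route is handy when the counts and shares should sit side by side as columns; `prop.table()` is shorter when only the shares are needed.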
```r my_table <- table(penguins[penguins$island == "Biscoe",]$species) *as.data.frame(my_table) ``` ``` ## Var1 Freq ## 1 Adelie 44 ## 2 Chinstrap 0 ## 3 Gentoo 124 ``` ] --- ### Step 4: Create a percentage column ```r freq_table <- table(penguins[penguins$island == "Biscoe",]$species) freq_table <- as.data.frame(freq_table) ``` - Create a new column to the right dividing the observation by the sum of `Freq` -- ```r freq_table$percentage <- freq_table[,"Freq"]/sum(freq_table$Freq) freq_table ``` ``` ## Var1 Freq percentage ## 1 Adelie 44 0.2619048 ## 2 Chinstrap 0 0.0000000 ## 3 Gentoo 124 0.7380952 ``` --- ### Why did we use a Data Frame? - What about Tibbles? Alternatively, with `as_tibble()` ```r as_tibble(my_table, .name_repair = "unique") ``` --- ### Why did we use a Data Frame? - What about Tibbles? If you are interested, run the following commands! .pull-left[ ```r class(my_table) ``` ``` ## [1] "table" ``` ```r head(methods(as_tibble)) ``` ``` ## [1] "as_tibble.data.frame" ## [2] "as_tibble.default" ## [3] "as_tibble.dribble" ## [4] "as_tibble.googlesheets4_schema_GridRange" ## [5] "as_tibble.googlesheets4_schema_NamedRange" ## [6] "as_tibble.googlesheets4_schema_ProtectedRange" ``` ] .pull-right[ ```r methods(as.data.frame)[29:32] ``` ``` ## [1] "as.data.frame.resample" "as.data.frame.spec_tbl_df" ## [3] "as.data.frame.table" "as.data.frame.tbl_df" ``` [**Class Tibble Documentation**](https://tibble.tidyverse.org/reference/as_tibble.html) ] --- ### Outlook on `Tidyverse`: Join us in one of the next sessions - Not only faster to calculate percentages for one island... ```r penguins %>% filter(island == "Biscoe") %>% count(species) %>% mutate(percentage = n/sum(n)) ``` ``` ## # A tibble: 2 × 3 ## species n percentage ## <fct> <int> <dbl> ## 1 Adelie 44 0.262 ## 2 Gentoo 124 0.738 ``` --- ### Outlook on `Tidyverse`: Join us in one of the next sessions - ...but for all combinations as well. 
```r penguins %>% group_by(island) %>% count(species) %>% transmute(island, species, percentage = n/sum(n)) %>% pivot_wider(names_from = species, values_from = percentage) ``` ``` ## # A tibble: 3 × 4 ## # Groups: island [3] ## island Adelie Gentoo Chinstrap ## <fct> <dbl> <dbl> <dbl> ## 1 Biscoe 0.262 0.738 NA ## 2 Torgersen 1 NA NA ## 3 Unknown 0.452 NA 0.548 ``` --- ### Task 4: Linear relationship between bill length and body mass .scroll-output[ - Console output is boring to look at: Plot the linear relationship between bill length and body mass with base R - Tip: Check out how the function `plot()` works with `?plot` ```r penguins[, c("body_mass_g", "bill_length_mm")] ``` ``` ## # A tibble: 6 × 2 ## body_mass_g bill_length_mm ## <int> <dbl> ## 1 3750 39.1 ## 2 3800 39.5 ## 3 3250 40.3 ## 4 NA NA ## 5 3450 36.7 ## 6 3650 39.3 ``` ] --- ### Step 1: Which chart type? - For linear relationships, we can draw a regression line through a scatter plot - Independent (predictor) variable on the x-axis - Dependent (response) variable on the y-axis .center[ <img src="GraphicsSlides/scatter.png" width="45%" /> ] --- ### Step 2: Use `plot()` ```r plot(x = penguins$bill_length_mm, y = penguins$body_mass_g) ``` <img src="Intro_to_R_files/figure-html/unnamed-chunk-72-1.png" width="75%" /> --- ### Step 3: Add a regression line ```r plot(x = penguins$bill_length_mm, y = penguins$body_mass_g) *abline(lm(penguins$body_mass_g ~ penguins$bill_length_mm), * col = "blue", lwd = 2, lty = 2) ``` <img src="Intro_to_R_files/figure-html/unnamed-chunk-73-1.png" width="75%" /> --- ### Step 4: Your turn - Try it again with `bill_depth_mm` as the predictor and the same y variable -- ```r plot(x = penguins$bill_depth_mm, y = penguins$body_mass_g) abline(lm(penguins$body_mass_g ~ penguins$bill_depth_mm), col = "blue", lwd = 2, lty = 2) ``` <img src="Intro_to_R_files/figure-html/unnamed-chunk-74-1.png" width="75%" /> --- ### Are these two different species? 
```r plot(x = penguins$bill_depth_mm, y = penguins$body_mass_g) abline(lm(penguins$body_mass_g ~ penguins$bill_depth_mm), col = "blue", lwd = 2, lty = 2) ``` <img src="Intro_to_R_files/figure-html/unnamed-chunk-75-1.png" width="75%" /> --- ### Outlook on `Tidyverse` ```r penguins %>% ggplot(aes(x = bill_depth_mm, y = body_mass_g, colour = species)) + geom_point(alpha = 0.5) + geom_smooth(method = "lm", se = F, lty = 2) ``` <img src="Intro_to_R_files/figure-html/unnamed-chunk-76-1.png" width="90%" /> --- # That's it for today! This was an introduction to R. For more (machine learning, data visualisation, intro to packages, writing your thesis in RMarkdown, etc.), join our upcoming sessions. For further questions, feel free to reach out to us. Stay up to date via our social media channels and our website, where all resources and dates are published. <br> .center[ <img src="GraphicsSlides/Logo RUG hell.png" width="60%" /> **[Website](https://rusergroup-sg.ch/) | [Instagram](https://www.instagram.com/rusergroupstgallen/?hl=en) | [Twitter](https://twitter.com/rusergroupsg)** ] --- class: middle, inverse, hide-logo # Thank you for attending!