class: center, middle, hide-logo <style type="text/css"> pre { background: #F8F8F8; max-width: 100%; overflow-x: scroll; } </style> <style type="text/css"> .scroll-output { height: 80%; overflow-y: scroll; } </style> # Web Scraping with Selenium ## by <img src="GraphicsSlides/Logo RUG hell.png" width="50%" /> ##### Author/Presenter: Ruben ##### Last updated: _2022-12-01 13:23:49_ --- ## Goals for today's session Main goals: - Learn what websites are made of and how that helps us - Get to know HTML elements and their attributes - Learn how to conceptualize a web scraper and build it Less obvious ones: - Get to know tools and crutches to help scrape complex websites - Learn about potential pitfalls and hurdles when scraping --- ## The basic idea behind web scraping If interesting data is available openly on websites, why not just take it? <br> .center[ <img src="GraphicsSlides/free-real-estate.gif" width="50%" /> ] --- ## The basic idea behind web scraping Just a couple of possible roadblocks: - Data is spread across a lot of sites - It's unstructured And most importantly: why do something by hand when you can write code for hours to do it for you? .center[ <img src="GraphicsSlides/hacker-hackerman.gif" width="35%" /> ] --- ## Different approaches for webscraping You might have heard of these: - BeautifulSoup - Requests - Scrapy - Selenium As you might have guessed, today we're using Selenium! .center[ <img src="GraphicsSlides/scraping_packages.png" width="80%" /> ] --- ## Why use Selenium? .pull-left[ ### BeautifulSoup and Requests are mainly used to - GET web pages and - support you in the extraction of data from the DOM (you don't need to fully understand this right now) --> "static" web scraping ] .pull-right[ ### Selenium "drives" a browser and can therefore be used to scrape more complex websites with - buttons to press - inputs to fill out - errors and timeouts --> "dynamic" web scraping ] --- ## The setup First we need to install the RSelenium package (and tidyverse of course): ```r install.packages("RSelenium") library(RSelenium) library(tidyverse) ``` -- And choose our desired browser to drive (I like Chrome): .center[ <img src="GraphicsSlides/chrome.jpg" width="50%" /> ] --- ## Download chromedriver via RSelenium If you've never used chromedriver before: ```r chromeDr <- rsDriver(browser = "chrome", port = 4569L, chromever = "105.0.5195.52") ``` If you get this error: ```r 'Error in chrome_ver(chromecheck[["platform"]], chromever) : version requested doesnt match versions available = 101.0.4951.15,101.0.4951.41,102.0.5005.27,105.0.5195.19,105.0.5195.52,106.0.5249.21,106.0.5249.61,107.0.5304.18,107.0.5304.62,108.0.5359.22,108.0.5359.71' ``` Choose the highest available that matches the major version of your Chrome browser (e.g., 107.xyz...) --- ## Download chromedriver via RSelenium If everything worked, close the chromedriver for now, as we need some more settings. ```r chromeDr[["server"]]$stop() ``` --- ### If that didn't work: Download chromedriver manually Chromedriver is a development tool. You can download it here: https://chromedriver.chromium.org/downloads .pull-left[ On Windows put the file in<br> `C:\Windows\` ] .pull-right[ On MacOS put the file in<br> `/usr/bin/` <br> (you get there by pressing `cmd + ⇧ + G` and pasting the path above) ] .center[ <img src="GraphicsSlides/im_a_mac.jpg" width="60%" /> ] --- ## Today's target: homegate.ch .center[ <img src="GraphicsSlides/screenshot_homegate.jpg" width="100%" /> ] --- ## Some considerations before we start .pull-left[ - Why not just click the "search" button to have all the listings - We have to enter a region -> way too annoying to do manually ] .pull-right[ <img src="GraphicsSlides/screenshot_filters.jpg" width="100%" /> ] --- ## Some considerations before we start Having a look around the page reveals that we can nonetheless access all the listings: .center[ <img src="GraphicsSlides/screenshot_for_rent.jpg" width="80%" /> ] We can go by canton and make Selenium click all the buttons for us --- ## Let's fire up that robo-browser ```r chromeDr <- rsDriver(browser = "chrome", port = 4569L, chromever = "105.0.5195.52", extraCapabilities = list(chromeOptions = list(args = c('--disable-gpu', '--window-size=1280,800'), prefs = list( "profile.default_content_settings.popups" = 0L, "download.prompt_for_download" = FALSE, "directory_upgrade" = TRUE )))) remDr <- chromeDr[["client"]] ``` The extra options are used to ensure that: - We don't run into graphics driver issues - We suppress popups - We have adequate permissions --- ## Opening a webpage From here it's like regular web browsing - except controlled by your code ```r remDr$navigate("https://www.homegate.ch/mieten/immobilien/land-schweiz") ``` Interacting with the filters and buttons is just as easy: ```r e <- remDr$findElement(value = '//*[@id="app"]/main/div/div/div[2]/div[1]/div/div[2]/div/div[1]/div[2]/a/span') e$clickElement() ``` --- ## How to find stuff on a webpage We find elements (such as buttons and the data we want) via XPATHs (among others) Like you've seen previously:<br> ```r '//*[@id="app"]/main/div/div/div[2]/...' ``` .center[ <img src="GraphicsSlides/what-the.gif" width="50%" /> ] --- ## The DOM Webpages are built in the hypertext markup language (HTML):<br> ```html <!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph.</p> </body> </html> ``` XPATHs are just a way to express the "path" of any element in this structure -- In this case: to get the heading element:<br> ```r '//body/h1' ``` --- ## Extracting the actual data Depending on the element, we want different attributes of the elements: - text - href (a link) - value (with inputs) or even metadata: - whether the element exists - how many of the same elements exist - the child elements --- ## Example ```html <body> <div> <a href='https://example.com'>Link 1</a> <a href='https://example.com'>Link 2</a> <a href='https://example.com'>Link 3</a> </div> </body> ``` -- ```r '//body/div/a[2]' ``` gives us: .pull-left[ - text: <mark>"Link 2"</mark> - href: <mark>"https://example.com"</mark> - value: <mark>None</mark> ] .pull-right[ - element exists: <mark>yes</mark> - how many of the same elements exist: <mark>3</mark> - the child elements: <mark>None</mark> ] --- ## Supertrick: Getting XPATHS easily In your browser, open the developer console:<br> `F12` or `cmd + opt + i` Find the element with the picker tool (top left) and copy its XPATH .center[ <img src="GraphicsSlides/screenshot-xpath.jpg" width="50%" /> ] --- ### Task 1: Get the text of the pink button on homegate **Tip:** Start by the element and then extract its *text* attribute as follows: ```r remDr$findElement(value = '//some XPATH')$getElementText() ``` .center[ <img src="GraphicsSlides/programmer.png" width="50%" /> ] --- ### Solution ```r remDr$findElement(value = '//*[@id="app"]/main/div/div[2]/div/div/div[5]/button')$getElementText() ``` ```r [[1]] [1] "Suchen" ``` --- ## multiple elements Often it is useful to find multiple elements at once. .center[ <img src="GraphicsSlides/screenshot_for_rent.jpg" width="50%" /> ] Here we could extract the links to all cantons and save them to go through later. --> We need to find a common identifier. ```r e <- remDr$findElements(value = "//*/div[contains(@class, 'GeoDrillDownSRPLink')]/a") canton_links <- unlist(lapply(e, function(x){x$getElementAttribute("href")}))[2:27] ``` --- ## New concept: element attribute HTML elements carry more data than just a text. These can be useful for ** finding ** elements as well as ** extracting ** additional data. - Example of a `<div>` tag with a class: ```html <div class="my_class_name"> Text </div> ``` -> We can use the class to **find** this `<div>` element. - Example of an `<a>` tag with an href (a link): ```html <a href="https://example.com"> Link to example.com </a> ``` -> We might want to **extract** this link --- ## Homegate does not like to be scraped With the specific example of Homegate, we have a problem: **They randomize part of their class names, so we cannot easily grab them** **BUT:** we have a solution: Instead of searching for a specific class (with the randomized part in it), we just search for the constant part: ```html *<div class="GeoDrillDownSRPLink_srpLink_2Zztq col-md-5"> Text </div> ``` We can do that by specifying `contains(@class, 'my_class_name')` in square brackets: ```r e <- remDr$findElements(value = "//*/div[contains(@class, 'GeoDrillDownSRPLink')]/a") ``` --- ## Other attributes The class attribute is generally the most useful for finding elements, however you might need to use others: - id - css styles - title - value - etc. You can use the same syntax: - if looking for the exact attribute: ```r e <- remDr$findElements(value = "//*/div[@id='full_id_name']/a") ``` - if looking for partial attribute name: ```r e <- remDr$findElements(value = "//*/div[contains(@id, 'partial_id_name')]/a") ``` --- ## Let's get into the main scraping Now we have all the links of the Cantons, let's look at the actual listings. First, navigate to the first Canton's page: ```r e <- remDr$navigate(canton_links[1]) ``` .center[ <img src="GraphicsSlides/screenshot_listing_1.png" width="100%" /> ] --- ## What to scrape Thinking about potential analyses we might want to conduct, interesting values to scrape could be: - price - size of the appartment - number of rooms - location - description (maybe we can extract important keywords) Less obvious but useful: **the link** of the detailed listing. At a later time we might want to extract more data from the listing and saving the link will make this much easier. --- ### Task 2: Find the XPATH of all elements containing the price **Tip:** You might need to find the class of the parent elements <br> <br> <br> .center[ <img src="GraphicsSlides/programmer.png" width="50%" /> ] --- ### Solution The price is in a `<span>` element without a class. However, the parent of this `<span>` (which is another `<span>`) conveniently has the attribute `class='ListItemPrice_price_1o0i3'`. Again, Homegate includes a random part `'1o0i3'` (this might be different for you), but we can just search for the first part of the class name: ```r e <- remDr$findElements(value = "//*/span[contains(@class, 'ListItemPrice_price')]/span[2]") ``` So let's see what we have scraped: ```r lapply(e, function(x){x$getElementText()}) %>% unlist() ``` ```r [1] "790.–" "2,350.–" "1,810.–" "1,440.–" "1,340.–" "1,415.–" "1,535.–" "1,710.–" "1,915.–" "1,995.–" "2,100.–" "2,200.–" "2,330.–" "2,430.–" "3,120.–" "3,200.–" "1,615.–" [18] "1,920.–" "1,800.–" "1,680.–" ``` --- ### Task 3: Find all areas of the appartments (given in square meters) **Tip:** This time, the `<span>` you're looking for has a class, no need to go via the parent! <br> <br> <br> .center[ <img src="GraphicsSlides/gimme_that.jpeg" width="60%" /> ] --- ### Solution The span that has the area information has the class `'ListItemLivingSpace_value_2zFir'`. Again, we only look for the first (non-random) part. ```r e <- remDr$findElements(value = "//*/span[contains(@class, 'ListItemLivingSpace_value')]") ``` Which gives us the following values: ```r lapply(e, function(x){x$getElementText()}) %>% unlist() ``` ```r [1] "55m2" "114m2" "84m2" "69m2" "64m2" "78m2" "126m2" "85m2" "100m2" "130m2" "87m2" "119m2" "193m2" "140m2" "97m2" ``` --- ## A small hick-up: missing values Unfortunately, not every listing has the living area given, we only found 15 values in 20 listings: ```r [1] "55m2" "114m2" "84m2" "69m2" "64m2" "78m2" "126m2" "85m2" "100m2" "130m2" "87m2" "119m2" "193m2" "140m2" "97m2" ``` Why could this be a problem? --- ## A small hick-up: missing values .pull-left[ If we just fill this data into a table and assume the rest is `NA`, we get the following:<br> .center[ <img src="GraphicsSlides/screenshot_table_wrong.png" width="60%" /> ] ] .pull-right[ In reality, this is the correct data:<br><br><br> .center[ <img src="GraphicsSlides/screenshot_table_right.png" width="60%" /> ] ] --- ## How can we deal with this? Because `findElements` only lists all the elements that match with certain criteria and does not tell us anything about associated data (which listing it came from), we have to find a new approach. **Currently:** we get all the values on one page in vectors (of different lengths, unfortunately). **Idea:** we could go listing-by-listing and extract all values that are associated with that listing. --- ## How can we deal with this? Remember that we can specify the parent(s) of a certain element we are looking for? --> So let's find all the listing parents (the large boxes all the other stuff is contained in) ```r parents <- remDr$findElements(value = "//*/a[contains(@class, 'ListItem')]") ``` All listings are wrapped in a link that leads to the detailed listing page --> we can use that! --- ## A closer look at the listing wrappers Interestingly, all these `<a>` elements that wrap around the listings contain a link with the listing ID: ```html <a class="ListItem" href="https://www.homegate.ch/mieten/3002089277"></a> ``` This ID is useful to scrape, as this allows us to have a time-stable identifier for the listings. With this, we could for example scrape the page regularly and see: - which listings have come on the market recently - which ones no one wants - how prices might have changed .center[ <img src="GraphicsSlides/zucc.jpeg" width="30%" /> ] --- ## How can we go "listing-by-listing"? Because we were able to extract listing IDs from the href of all listing wrappers (the `<a>` elements), we can include them in our XPATHs. Because the parent elements are active bindings, not just static values, we can use `findChildElement` to find its "children", such as the `<span>` containing the price: ```r child <- parents[[1]]$findChildElement(value = paste0("//*/a[contains(@href, '", listing_id, "')]//span[contains(@class, 'ListItemPrice_price')]/span[2]")) ``` --- ## How can we go "listing-by-listing"? Now we are looking for two conditions: - listing ID in the `href` of the parent element - `class` in the child element We obtain the ID with ```r listing_id <- unlist(strsplit(unlist(listing$getElementAttribute("href")), "/(?!.*/)", perl = TRUE))[2] ``` *Note:* The strange looking string `"/(?!.*/)"` is a regular expression looking for any "/" that is not followed by another "/", allowing us to split the URL and to extract the listing ID. We will look at regular expressions in more detail later in this presentation. --- ### Task 4: Find the rest of the XPATHs for the information we wanted to scrape **Reminder:** Besides the price and area we also wanted to get: - number of rooms - location - description <br> .center[ <img src="GraphicsSlides/programmer.png" width="50%" /> ] --- ### Solution - The **number of rooms** is contained in: ```r "//*/span[contains(@class, 'ListItemRoomNumber_value')]" ``` - The **location** is in: ```r "//*/div[contains(@class, 'ListItem') and contains(@class, '_data_')]/p[2]/span" ``` *Note:* Some elements are harder to specify than others. You can use the logical operators **and** and **or** to chain multiple `contains` together. - The **description** is in: ```r "//*/div[contains(@class, 'ListItemDescription_description')]/p" ``` --- ## Let's put it all together We have: - a list of all parent elements `<a>` - the XPATHs to their children Now we can loop over all the parents: ```r for (parent in parents){ price <- parent$findChildElement(value = paste0("//*/a[contains(@href, '", listing_id, "')]//span[contains(@class, 'ListItemPrice_price')]/span[2]")) sq_m <- parent$findChildElement(value = paste0("//*/a[contains(@href, '", listing_id, "')]//span[contains(@class, 'ListItemLivingSpace_value')]")) nr_rooms <- parent$findChildElement(value = paste0("//*/a[contains(@href, '", listing_id, "')]//span[contains(@class, 'ListItemRoomNumber_value')]")) location <- parent$findChildElement(value = paste0("//*/a[contains(@href, '", listing_id, "')]//div[contains(@class, 'ListItem') and contains(@class, '_data_')]/p[2]/span")) description <- parent$findChildElement(value = paste0("//*/a[contains(@href, '", listing_id, "')]//div[contains(@class, 'ListItemDescription_description')]/p")) * listing_link <- parent$getElementAttribute("href") data.frame("price" = price, "sq_m" = sq_m, "nr_rooms" = nr_rooms, "location" = location, "description" = description, "listing_link" = listing_link) } ``` *Note:* don't forget to extract the link of the detailed listing, as we have done previously. --- ## We uncover the next problem The previous code will likely throw this error: ```r "Selenium message:no such element: Unable to locate element: {'method':'xpath','selector':'//*/span[contains(@class, 'ListItemLivingSpace_value')]'} (Session info: chrome=106.0.5249.119) For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10' System info: host: 'eduroamstud-130-82-245-171.unisg.ch', ip: 'fe80:0:0:0:10ba:f169:ddd9:af11%en0', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.16', java.version: '14.0.1' Driver info: driver.version: unknown Error: Summary: NoSuchElement Detail: An element could not be located on the page using the given search parameters. class: org.openqa.selenium.NoSuchElementException Further Details: run errorDetails method" ``` --- ## We uncover the next problem `NoSuchElement` is pretty self-explanatory and it's about a known issue: __*Some listings don't have an area value*__ How can we mitigate this? -- --> We **try** to find the element and assign `"NA"` if we can't ```r for (parent in parents){ price <- NA tryCatch( expr = { price <- listing$findChildElement(value = paste0("//*/a[contains(@href, '", listing_id, "')]//span[contains(@class, 'ListItemPrice_price')]/span[2]")) price <- unlist(price$getElementText()) }, error = function(err){ return(1) }) } ``` --- ## tryCatch In web scraping it is quite common to encounter unexpected (or expected) errors, but we nonetheless would like to keep going. `tryCatch()` offers exactly that functionality: we **try** some expression: ```r for (parent in parents){ price <- NA tryCatch( * expr = { * price <- listing$findChildElement(value = paste0("//*/a[contains(@href, '", listing_id, "')]//span[contains(@class, 'ListItemPrice_price')]/span[2]")) * price <- unlist(price$getElementText()) * }, error = function(err){ return(1) }) } ``` --- ## tryCatch In web scraping it is quite common to encounter unexpected (or expected) errors, but we nonetheless would like to keep going. and if it throws an error, we "catch" it: ```r for (parent in parents){ price <- NA tryCatch( expr = { price <- listing$findChildElement(value = paste0("//*/a[contains(@href, '", listing_id, "')]//span[contains(@class, 'ListItemPrice_price')]/span[2]")) price <- unlist(price$getElementText()) }, * error = function(err){ * return(1) * }) } ``` --- ## Putting it all together If we put the `tryCatch()` clauses on every statement where we look for an element that is potentially missing, we are covered against the unevitable errors and *our code is much more stable*. > We do not include the full code here but you can download it from our [GitHub](https://github.com/rusergroupstgallen/rusergroupstgallen.github.io/blob/main/Web%20Scraping%20with%20Selenium/scraper.R) --- ## Putting it all together To make our code run a bit faster and to increase code reusability, let's put the child-element-search in a convenient function with the following logic: ```r find_homegate_elements <- function(listing){ price <- NA tryCatch( expr = { price <- listing$findChildElement(value = paste0("//*/a[contains(@href, '", listing_id, "')]//span[contains(@class, 'ListItemPrice_price')]/span[2]")) price <- unlist(price$getElementText()) }, error = function(err){ return(1) }) # Same procedure with the other 4 values return(data.frame(data.frame("price" = price, "sq_m" = sq_m, "nr_rooms" = nr_rooms, "location" = location, "description" = description))) } ``` --- ## Putting it all together Now we can apply this function to the list of parent elements for each page: ```r page_result <- lapply(parents, find_homegate_elements) # Coerce the results into a data frame page_result <- as.data.frame(do.call(rbind, page_result)) ``` --- ## Putting it all together The resulting table for every page has all the information we wanted: ```r Rows: 20 Columns: 6 $ price <chr> "790.–", "2,350.–", "1,810.–", "1,440.–", "1,340.–", "1,415.–", "1,535.–", "1,710.–", "1,915.–", "1,995.–", "2,100.–", "2,200.–", "2,330.–", "2,430.–", "3,1…" $ sq_m <chr> "55m2", "114m2", "84m2", NA, NA, NA, "69m2", "64m2", "78m2", "126m2", "85m2", "100m2", "130m2", "87m2", "119m2", "193m2", NA, NA, "140m2", "97m2" $ nr_rooms <chr> "2rm", "3.5rm", "3.5rm", "3.5rm", "4rm", "3.5rm", "3rm", "2.5rm", "3.5rm", "5rm", "2.5rm", "4.5rm", "2rm", "2.5rm", "3.5rm", "4.5rm", "3.5rm", "5.5rm", "1rm…" $ location <chr> "5620 Bremgarten AG", "Habsburgerstrasse 54, 5200 Brugg", "General-Guisanstrasse, 5316 Leuggern", "Hauptstrasse 73C, 5737 Menziken", "Küttigerstrasse 28, 50…" $ description <chr> "• Ab sofort ist diese 2 Zi-Wohnung frei in der Altstadt/Unterstadt von Bremgarten mit Küche (neu), Wohnraum, Schlafzimmer und Badzimmer • Separater Zugang …" $ listing_id <chr> "3002143209", "3002078497", "3001740639", "3002051685", "3002008282", "3002139009", "3002137050", "3002090945", "3002095259", "3002083088", "3002115827", "3…" ``` <br> .center[ <img src="GraphicsSlides/great_success.gif" width="40%" /> ] --- ## Turning a page... Now that we have code to scrape all the info from one page, let's also scrape all the other pages! To do that, a human would click the "next page" button, so why don't we tell our robot to do the same?! First, we locate the "next page" button: ```r e <- remDr$findElement(value = "//*/a[contains(@aria-label, 'Go to next page')]") ``` And we click it: ```r e$clickElement() ``` Notice, here we are using the attribute `'aria-label'` which is quite uncommon, yet we can use the exact same syntax we already know. --- ## Turning a page... When we're on the next page, we use the same function as before and go to the next page. Until we've arrived on the last page, which we somehow need to register. One way to do that is to look for the "next page" button, once it's disabled, we know we have arrived on the last page. ```r tryCatch( next_page_button <- remDr$findElements(value = "//*/a[contains(@aria-label, 'Go to next page')]") ) ``` -- With a simple if statement, we can evaluate what we found: ```r if (length(next_page_button) < 1){ last_page <- TRUE } ``` --- ## Turning a page... To keep looping through the pages until we have arrived at the last one, we can use a `while` loop: ```r while (last_page == FALSE){ # Scrape the page } ``` -- If we now put all of this in a `for` loop that goes over all of the canton links we have extracted previously, we have a recipe to scrape all of the rental properties on Homegate: ```r for (c_link in canton_links){ while (last_page == FALSE){ # Scrape the page } } ``` --- ## Home stretch All that is left to do is to also scrape all of the properties listed to buy: ```r remDr$navigate("https://www.homegate.ch/buy/real-estate/switzerland") ``` -- We can apply the exact same recipe as for the rental properties as we wrote code that would be highly reusable: ```r for (c_link in canton_links){ while (last_page == FALSE){ # Scrape the page } } ``` --- ## Home stretch The only thing we need to adjust is to add a different value for the listing type: ```r # Add a column for the listing type page_result$listing_type <- "buy" ``` -- We now have a data set of about 45'000 real estate listings in Switzerland and the skills to scrape almost any website out there! Let's just save it really quick: ```r save(homegate_data, file = "Homegate_scrape.RData") ``` --- ## Data cleaning Now that we have all that data, there is still some work to do: <b> clean it! </b> - Some of our numeric data is saved as strings (e.g., listing IDs): ```r $ listing_id <chr> "3002143209", "3002078497", "..." ``` - Other data points include symbols that hinder further analysis (e.g., prices): ```r $ price <chr> "790.–", "2,350.–", "..." ``` - Some data is hidden within other data points (e.g., ZIP codes): ```r $ location <chr> "5620 Bremgarten AG", "Habsburgerstrasse 54, 5200 Brugg", "..." ``` --- ## Data cleaning Turning numbers that are saved as strings into numbers again is easy: ```r homegate_data$listing_id <- sapply(homegate_data$listing_id, as.numeric) ``` If they contain characters such as ".-" or ",", we first need to remove them, otherwise we just get NAs: ```r sapply(homegate_data$price, as.numeric) ``` ```r [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA [58] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ``` --- ## New concept: regular expressions (regex) Regular expressions help us to find strings of characters in other strings: ```r str_remove("69420.-", "\\.-") %>% as.numeric() ``` ``` ## [1] 69420 ``` *Note:* We need to "escape" the period character because it has a different meaning in regular expressions: it is a so-called wildcard for any other character. With the double backslashes it just selects a period though. --- ## New concept: regular expressions (regex) With regex, we can also specify patterns that we are looking for in a string (such as four digit numbers of ZIP codes). -- We do that by specifying: - The type of character we want: in this case digits: `\\d` - The number of consecutive occurrences: in this case four `{4}` ```r str_extract("9000 St. Gallen", "\\d{4}") %>% as.numeric() ``` ``` ## [1] 9000 ``` Here, it does not matter where the target substring occurs: ```r str_extract("St. Gallen (9000)", "\\d{4}") %>% as.numeric() ``` ``` ## [1] 9000 ``` --- ## Applying regex Let's apply these concepts to clean the data: ```r homegate_data$price <- sapply(homegate_data$price, function(x){str_remove(x, "\\.–")}) %>% sapply(., function(x){str_remove_all(x, ",")}) %>% unlist(.) %>% as.numeric() ``` ```r homegate_data$sq_m <- sapply(homegate_data$sq_m, function(x){str_remove(x, "m2")}) %>% sapply(., function(x){str_remove_all(x, ",")}) %>% unlist(.) %>% as.numeric() ``` ```r homegate_data$zip_code <- sapply(homegate_data$location, function(x){str_extract(x, "\\d{4}")}) %>% as.numeric() ``` ```r homegate_data$nr_rooms <- sapply(homegate_data$nr_rooms, function(x){str_remove(x, "rm")}) %>% as.numeric() ``` --- ## Applying regex Now things look much cleaner and more useable for future analysis: ```r glimpse(homegate_data) ``` ```r $ price <dbl> 790, 2350, 1810, 1440, 1340, ... $ sq_m <chr> 55, 114, 84, NA, NA, ... $ nr_rooms <dbl> 2.0, 3.5, 3.5, 3.5, 4.0, ... $ location <chr> "5620 Bremgarten AG", "Habsburgerstrasse 54, 5200 Brugg", ... $ description <chr> "• Ab sofort ist diese 2 Zi-Wohnung frei in der Altstadt/Unterstadt von ..." $ listing_id <dbl> 3002143209, 3002078497, 3001740639, 3002051685, 3002008282, ... $ listing_type <chr> "rent", "rent", "rent", "rent", "rent", ... $ zip_code <dbl> 5620, 5200, 5316, 5737, 5000, ... ``` .center[ <img src="GraphicsSlides/nice-smack.gif" width="30%" /> ] --- ## Great success! We now have a **clean** data set of over 45'000 real estate listings! <br> .center[ <img src="GraphicsSlides/thats_all.webp" width="70%" /> ] --- # So let's recap We learned: - How websites are structured - How to find stuff with XPATHs - How to drive a robo-browser - How to deal with missing values - How to clean scraped data And most importantly: We learned how to harvest some juicy, free data! --- ## Appendix: Implicit wait Sometimes you have to wait until a page is loaded fully or else you won't find certain elements. A function like this can be helpful: ```r implWait <- function(wait_s = 30){ counter <- 0 webElem <- NULL while(is.null(webElem) & counter < wait_s){ webElem <- tryCatch( expr = { suppressMessages({ remDr$findElement(value = "//XPATH to important element") }) }, error = function(err){ NULL }) Sys.sleep(1) counter <- counter + 1 print("waiting") } } ``` --- ## Appendix: Implicit wait We enter a `while` loop that looks for an element until found: ```r implWait <- function(wait_s = 30){ counter <- 0 webElem <- NULL while(is.null(webElem) & counter < wait_s){ * webElem <- tryCatch( * expr = { * suppressMessages({ * remDr$findElement(value = "//XPATH to important element") * }) * }, * error = function(err){ * NULL * }) Sys.sleep(1) counter <- counter + 1 print("waiting") } } ``` --- ## Appendix: Implicit wait And wait for one second while it's not found: ```r implWait <- function(wait_s = 30){ counter <- 0 webElem <- NULL while(is.null(webElem) & counter < wait_s){ webElem <- tryCatch( expr = { suppressMessages({ remDr$findElement(value = "//XPATH to important element") }) }, error = function(err){ NULL }) * Sys.sleep(1) * counter <- counter + 1 * print("waiting") } } ``` --- ## Appendix: Implicit wait You might want to add exit conditions as well, so you can specify what happens if the element is not found. ```r implWait <- function(wait_s = 30){ counter <- 0 webElem <- NULL # while loop * if (is.null(webElem)){ * return(1) * } else { * return(0) * } } ``` --- ## Appendix: Implicit wait For example, you could skip to the next page if the element is not found after the specified implicit wait: ```r if (implWait(30) == 1){ remDr$navigate("URL to next page") } ``` --- ## Appendix: Exit conditions After you are done scraping a website, it is good practice to release the binding to the remote driver instance: ```r remDr$close() ``` And to stop the webdriver server instance (in this case chromedriver): ```r chromeDr[["server"]]$stop() ``` Just add these statements to the end of your script. --- ## Appendix: Running the scraper Scraping a website can sometimes take hours, and RStudio can be a bit moody over this amount of time. We therefore highly recommend that you run your scraper from a terminal and **not within RStudio**. On Windows, open a Powershell, navigate to the directory of your script with the `cd` command and run: ```bash > Rscript scraper.R ``` On Mac, open a Terminal, navigate to the correct directory with `cd` and run: ```bash $ Rscript scraper.R ``` --- ## Appendix: Running the scraper If you wish to preserve the environment variables from your session (which is advisable), start an R session in your terminal and then run the script. In Windows Powershell: ```bash > R.exe > source("scraper.R") ``` And in Mac Terminal: ```bash $ R > source("scraper.R") ``` This way, you won't lose your data, should your scraper crash. --- # That's it for today! **Some closing words** For further questions, feel free to reach out to us. Make sure to stay updated on our socials and via our website where all resources and dates are also published. <br> .center[ <img src="GraphicsSlides/Logo RUG hell.png" width="60%" /> **[Website](https://rusergroup-sg.ch/) | [Instagram](https://www.instagram.com/rusergroupstgallen/?hl=en) | [Twitter](https://twitter.com/rusergroupsg)** ] --- class: middle, inverse, hide-logo # Thank you for attending!
The material provided in this presentation including any information, tools, features, content and any images incorporated in the presentation, is solely for your lawful, personal, private use. You may not modify, republish, or post anything you obtain from this presentation, including anything you download from our website, unless you first obtain our written consent. You may not engage in systematic retrieval of data or other content from this website. We request that you not create any kind of hyperlink from any other site to ours unless you first obtain our written permission.