Web scraping with RStudio and rvest

Pluriza
5 min read · Mar 3, 2020

If you are into Data Science, Data Analysis, Machine Learning or Artificial Intelligence projects, chances are that you will have to extract data from different sources at some point. In many cases that data may come as JSON (JavaScript Object Notation), as .csv (comma-separated values) files, or as plain content on a website, which is where web scraping comes in. In this blog we will take a look at the tools R has available to perform some simple web scraping tasks using the rvest package.

Our goal is to extract information about all the products listed on the following site: http://books.toscrape.com/. For now, this data will be gathered in a .csv file, which will be useful in case we want to perform data analysis, data visualization or machine learning on it in the future. To follow along you will need:

- R installed on your machine: https://www.r-project.org/

- RStudio: the most widely used IDE for Data Science with R. It comes with many useful tools and features for data/data frame visualization, a packages/dependencies checker, and the ability to interact with objects stored in your environment. It is open source and you can get it here: https://rstudio.com/

- Some basic HTML/CSS knowledge: most of the work behind web scraping is done handling HTML tags and CSS selectors. There is no need to be a web developer to scrape a website, but some basic knowledge will help a lot.

Navigating the website: To begin, you may want to go to the site we are going to scrape, http://books.toscrape.com/, and take a look at how the information is spread throughout it. As you can see, the products are paginated, and there is a bunch of data we may want to extract from each book, such as its name, price, review rating, and details.

To begin, let's import the rvest and purrr packages:
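
A minimal version of that setup might look like this (both packages are assumed to be installed already, e.g. via install.packages()):

```r
# Load the packages used throughout this guide
library(rvest)   # HTML parsing and scraping helpers
library(purrr)   # functional programming helpers (map_chr(), etc.)
```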

The purrr package enhances R's functional programming tools and makes data frame manipulation a bit easier.

Next, we are going to read the HTML from our site using the rvest function read_html():
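
Something along these lines, where main_url and main_page are just illustrative names:

```r
# Base URL of the site we want to scrape
main_url <- "http://books.toscrape.com/"

# Download and parse the page's HTML into a queryable document
main_page <- read_html(main_url)
```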

Let's now run some simple tests to verify how we will scrape this site. Assuming we want to extract the data from the first book, 'A Light in The Attic', we need to use the book's CSS selector to find its node and get the href that leads to the book's detail page. The following code does that using the rvest functions html_node(), html_text() and html_attr(). If you run this code you will see the product's href printed on the console:
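
A sketch of that test; the ".product_pod h3 a" selector is an assumption about the site's markup (each book sits inside an element with the product_pod class), so adjust it if the structure differs:

```r
# Grab the first book's title link node
first_book_node <- main_page %>% html_node(".product_pod h3 a")

# The link text gives the (possibly truncated) book name ...
first_book_name <- first_book_node %>% html_text()

# ... and the href attribute gives the relative URL of its detail page
first_book_href <- first_book_node %>% html_attr("href")

print(first_book_href)
```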

Next, let's prepend the main URL to the href we extracted to build the book's full URL with paste0():
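
Continuing the sketch with the names defined above:

```r
# Prepend the site's base URL to the relative href to get the full product URL
first_book_url <- paste0(main_url, first_book_href)
print(first_book_url)
```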

Pagination and generating all URLs:

By exploring the site you will notice that there are 50 pages containing books, with 20 books on each page. All the pages share the same URL structure; the only thing that changes is the "page-" section of the string. With the following code we will generate the URLs for all 50 pages of the site, using the stringr package for some simple string manipulation: str_replace() swaps "page-3" for a new string built from the prefix "page-" and each number from 1 to 50. By running this code you will have the 50 URLs stored in a variable called all_pages:
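
A possible version of that step; template_url is an illustrative name holding a pagination URL copied from the site:

```r
library(stringr)

# A catalogue page URL taken from the site's pagination (page 3 is arbitrary)
template_url <- "http://books.toscrape.com/catalogue/page-3.html"

# Replace "page-3" with "page-1" ... "page-50" to generate every page URL
all_pages <- map_chr(1:50, function(i) {
  str_replace(template_url, "page-3", paste0("page-", i))
})

length(all_pages)  # 50
head(all_pages)
```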

Extracting each book's page

Now that we have all the pages listed, we need to visit each page's URL and extract each book's URL. To do this we will perform a task similar to the one at the beginning of this guide: read the page's HTML, select the node that contains the href using its CSS selector, and extract that attribute. We will perform this operation a thousand times (20 books per page × 50 pages), so we will write it as a function. By executing the code you will get the 20 hrefs of the books on the first page:
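
A sketch of that function; get_book_hrefs is a hypothetical name, and the selector is the same assumption as before:

```r
# Given a catalogue page URL, return the hrefs of the 20 books listed on it
get_book_hrefs <- function(page_url) {
  page_url %>%
    read_html() %>%
    html_nodes(".product_pod h3 a") %>%
    html_attr("href")
}

# Test it on the first page: this should print 20 relative hrefs
get_book_hrefs(all_pages[1])
```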

We will now use the sapply() function to execute this code for every one of the 50 pages. This returns the 1000 hrefs as a "matrix"; however, to work with this data we need a plain vector of type "character", so we will flatten the matrix with the as.vector() function and use paste0() to prepend the main URL to the generated hrefs. By executing this code we will finally have a vector of class "character" containing the URL of every book on the site:
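
Roughly like this; note that the hrefs on the catalogue pages are relative to the /catalogue/ path, so that prefix is assumed when rebuilding the full URLs:

```r
# Apply the function to all 50 pages; sapply returns the hrefs as a 20 x 50 matrix
all_hrefs <- sapply(all_pages, get_book_hrefs)

# Flatten the matrix into a plain character vector of 1000 hrefs
all_hrefs <- as.vector(all_hrefs)

# Prepend the base path to turn each relative href into a full book URL
all_books <- paste0("http://books.toscrape.com/catalogue/", all_hrefs)

length(all_books)  # 1000
class(all_books)   # "character"
```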

Now that we are able to reach every book, let's extract some information. We are going to pick a random book URL to run some tests on. First, we will get the book's name and price, since their CSS selectors are simple; this task is similar to the first step of this guide, when we extracted data from 'A Light in The Attic':
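
For example (the index 42 is arbitrary, and the .product_main selectors are assumptions about the product page's markup):

```r
# Pick one book URL to experiment with
book_url  <- all_books[42]
book_page <- read_html(book_url)

# Name and price are plain text inside the product_main section
book_name  <- book_page %>% html_node(".product_main h1") %>% html_text()
book_price <- book_page %>% html_node(".product_main .price_color") %>% html_text()

book_name
book_price
```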

Extracting Review as a number:

This data is not displayed as text on the site; the rating is a component defined using HTML classes. To extract this information, after getting its HTML node using a CSS selector, we will use the substring() function to pull the rating word out of the class attribute (e.g. "One", "Two", etc.) and a switch to map it to a number.

By running this code you will get the review printed as a number on the console:
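
A sketch of that conversion, assuming the rating element carries a class such as "star-rating Three" (so the rating word starts at character 13):

```r
# The rating is encoded in the element's class attribute, e.g. "star-rating Three"
rating_class <- book_page %>%
  html_node("p.star-rating") %>%
  html_attr("class")

# Drop the "star-rating " prefix, keeping only the rating word
rating_word <- substring(rating_class, 13)

# Map the word to a number with switch()
book_review <- switch(rating_word,
  "One"   = 1,
  "Two"   = 2,
  "Three" = 3,
  "Four"  = 4,
  "Five"  = 5
)

book_review
```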

Extracting the product details table:

The rvest package can extract an HTML table and store its values as "character" data. We will use the table's second column ("X2") to get the detail values, using the t() and as.data.frame() functions to store this data as a one-row data frame; the first column ("X1") will be used to set the data frame's column names with the colnames() function. Running this code will produce the product's table as a data frame (note that the data frame values are of type factor; we will convert them to "character" in the next step):
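
One way this could look, assuming the details table is the first table element on the product page:

```r
# Read the product information table into a two-column data frame (X1 = labels, X2 = values)
details_table <- book_page %>%
  html_node("table") %>%
  html_table(header = FALSE)

# Transpose the values into a one-row data frame (stored as factors, as noted above)
book_details <- as.data.frame(t(details_table$X2), stringsAsFactors = TRUE)

# Use the labels (UPC, Price, Availability, ...) as column names
colnames(book_details) <- details_table$X1

book_details
```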

We will now define a vector containing the data we extracted:

- Name

- Price

- Review

- Availability

The value for availability will be extracted from the table we got from the previous step and converted to “character”:
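
Putting it together, with illustrative variable names:

```r
# Availability comes from the details table; convert the factor to plain character
book_availability <- as.character(book_details$Availability)

# Collect everything extracted for this book into a named character vector
book_info <- c(
  name         = book_name,
  price        = book_price,
  review       = book_review,
  availability = book_availability
)

book_info
```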

We will now create a function that receives a URL, performs all of this data extraction, and returns a vector containing the book's details. Running this code will execute the function and give us the details of the first book! We are almost done:
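
A sketch of what getDetails() might look like, combining the previous steps (the selectors and the substring offset are the same assumptions as before):

```r
# Given a book URL, scrape its name, price, review and availability
getDetails <- function(book_url) {
  book_page <- read_html(book_url)

  name  <- book_page %>% html_node(".product_main h1") %>% html_text()
  price <- book_page %>% html_node(".product_main .price_color") %>% html_text()

  # Rating word taken from the class attribute, mapped to a number
  rating_word <- book_page %>%
    html_node("p.star-rating") %>%
    html_attr("class") %>%
    substring(13)
  review <- switch(rating_word,
    "One" = 1, "Two" = 2, "Three" = 3, "Four" = 4, "Five" = 5
  )

  # Availability comes from the product details table
  details      <- book_page %>% html_node("table") %>% html_table(header = FALSE)
  availability <- details$X2[details$X1 == "Availability"]

  c(name = name, price = price, review = review, availability = availability)
}

# Try it on the first book
getDetails(all_books[1])
```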

To finish, we will execute our getDetails() function on each one of the thousand URLs with sapply() and store the results in a data frame. Finally, we will write the data to a .csv file (remember that the .csv file will be saved in your current working directory; you can change this in RStudio under Session -> Set Working Directory):
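
A closing sketch; the file name books.csv is just an example:

```r
# Scrape every book (about 1000 HTTP requests, so this takes a while)
results <- sapply(all_books, getDetails)

# sapply returns a 4 x 1000 matrix; transpose it so each row is one book
books_df <- as.data.frame(t(results), stringsAsFactors = FALSE)

# Write the data frame to a .csv in the current working directory
write.csv(books_df, "books.csv", row.names = FALSE)
```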

Conclusion: In this guide we have learned about the tools R has available for simple scraping tasks. You may want to try them out on other websites, and/or wait for the next guide, where we will use the data in our .csv file to do some data visualization and analysis.

Bibliography

https://cran.r-project.org/web/packages/rvest/rvest.pdf

https://www.datacamp.com/community/tutorials/r-web-scraping-rvest

https://www.w3schools.com/cssref/css_selectors.asp

https://www.agenty.com/docs/scraping-agent/scraping-data-using-css-selectors

Repository

https://gitlab.com/pluriza/web-scraping-with-rstudio-and-rvest
