Web scraping, web harvesting or web data extraction is data scraping used for extracting data from websites.

-Wikipedia

A couple of days ago, I was looking for project ideas on Medium and remembered having stumbled upon this post some time back, which gives advice on building data portfolio projects. At the end of the post, the author pitched a project idea: finding out the divorce rates of actors and actresses from Wikipedia. I decided to take up the challenge and see whether I could scrape the biography pages of actors and actresses on Wikipedia, get some interesting insights, and finally build a model around them.

This post highlights how I scraped this data using rvest, an R package that makes it easy to scrape data from the web. So, brace yourselves, technical post ahead!

1. Getting Started

Pre-requisites:
  • To get started with web scraping in R, you’ll need some working knowledge of the R programming language.
  • Throughout this post/tutorial we’ll be working with the rvest package which you can install using the following code:

install.packages("rvest")

  • Some knowledge of HTML and CSS is also an advantage. If you don’t know any HTML or CSS, worry not: you can use an open-source tool known as Selector Gadget. You can get it by downloading the Selector Gadget extension from this website. With the extension, you click on the part of a web page you’d like to scrape and it gives you the relevant CSS selectors.
  • Finally, you also need Google Chrome.

Awesome! Now, let’s get started on scraping Wikipedia:

2. Scraping Wikipedia using R

  • Step 1

After searching through Wikipedia, I came across this page, which lists around 1,500 American actresses and links to their Wikipedia pages, where you can find more information about them.

List of American actresses

Below is a screenshot of Jennifer Aniston’s Wikipedia page, which we’re going to scrape. The highlighted part is the part we need in this case. It contains information about the actress’s place and date of birth, occupation, etc.

Highlighted Wikipedia page using Selector Gadget

I highlighted that part using Selector Gadget, which helps in finding the specific CSS selectors needed to scrape the page. At the bottom middle of the screenshot above, you can see that the CSS selector for the highlighted biography table is “.vcard”, which we are going to need later. You could also use Google Chrome’s developer tools to look at the HTML behind the biography table, as highlighted in the screenshot below, and copy the CSS selector by right-clicking on the HTML code.

 

  • Step 2

To read the web page into R we will need the rvest package and also the magrittr package, which provides the pipe operator %>%. The pipe takes the output of the expression on its left and passes it as the first argument to the function on its right.


# load the rvest and magrittr packages
library(rvest)
library(magrittr)
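
To see what the pipe does in isolation, here is a minimal sketch that does not touch the web. The numbers and the names nested and piped are made up for illustration; only %>% itself comes from magrittr.

```r
library(magrittr)

# Nested call: read from the inside out.
nested <- sum(sqrt(c(1, 4, 9)))

# Piped call: read left to right; each result becomes the first
# argument of the next function.
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

# Both forms compute the same value.
identical(nested, piped)
```

The piped form is the one used in the rest of this post, because it lists the steps in the order they happen.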

The first function we’re going to use is read_html(), which reads an HTML page into R. You do this using the following code:

#read HTML code from the website
webpage <- read_html("https://en.wikipedia.org/wiki/Jennifer_Aniston")

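Next, since we have already identified our CSS selector “.vcard”, we use the following code to extract the information in the biography table. Writing the selector as “table.vcard” restricts the match to table elements that carry the vcard class.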

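# select the biography table and parse it; html_table() returns a list of tables
table <- webpage %>%
  html_nodes("table.vcard") %>%
  html_table(header = FALSE)
# keep the first matched table
table <- table[[1]]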

#add the table to a dataframe
dict <- as.data.frame(table)
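
If you want to try the same pipeline without hitting Wikipedia, the sketch below runs it on a small hand-written HTML snippet. The snippet and its values are invented for illustration and are much simpler than a real biography table; the names toy_page, toy_tables and toy_bio are hypothetical too.

```r
library(rvest)

# A made-up page: one table with class "vcard", one without.
toy_page <- read_html('<html><body>
  <table class="infobox vcard">
    <tr><th>Born</th><td>Example City</td></tr>
    <tr><th>Occupation</th><td>Actress</td></tr>
  </table>
  <table class="other">
    <tr><td>ignored</td><td>ignored</td></tr>
  </table>
</body></html>')

toy_tables <- toy_page %>%
  html_nodes("table.vcard") %>%   # only tables carrying the vcard class
  html_table(header = FALSE)      # list with one element per matched table

# Only the first table matched, so the list has a single element.
toy_bio <- as.data.frame(toy_tables[[1]])
toy_bio
```

Because header = FALSE treats every row as data, the labels such as “Born” end up in the first column and the values in the second.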

We then come up with the table below:

Easy, right?

Conclusion

So, you’ve just learnt how to scrape an HTML table from a web page using R. A lot more goes into the code when scraping every bio table from the list of actresses. You can access the code and data I extracted here.
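
As a rough idea of what that extra code involves, the first step is collecting the link to each actress’s page. The sketch below does this on an invented HTML fragment. The real list page has a more complex layout, so the selector “li a”, the names and the paths here are assumptions for illustration, not the code from the project itself.

```r
library(rvest)

# Invented stand-in for the list page (names and paths are placeholders).
toy_list <- read_html('<html><body><ul>
  <li><a href="/wiki/Example_Actress_A">Example Actress A</a></li>
  <li><a href="/wiki/Example_Actress_B">Example Actress B</a></li>
</ul></body></html>')

# Grab every link that sits inside a list item.
links <- toy_list %>% html_nodes("li a")

# Pair each link text with a full URL built from its relative href.
actresses <- data.frame(
  name = html_text(links),
  url  = paste0("https://en.wikipedia.org", html_attr(links, "href")),
  stringsAsFactors = FALSE
)
actresses
```

Each url could then be passed to read_html() and the “table.vcard” pipeline from Step 2, ideally with a short pause such as Sys.sleep(1) between requests to go easy on the server.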

References

Analytics Vidhya, Beginner’s guide on web scraping – https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/

 

Till next time 🙂