How to automate web scraping using Selenium in Python: IMDB case
There are several ways to scrape websites in Python. Well-known libraries such as BeautifulSoup and Scrapy are widely used for crawling the web. In this article, however, I will share my experience using the Selenium library for web scraping.
The Selenium library automates the manual actions of a user interacting with a website. Imagine you wanted to get some information from a website. What would you do?
Of course, you would start by opening the browser, typing in the website's address, going to the site, looking at the information, and finally collecting it.
Selenium follows the same steps as the manual process, but performs them automatically. Let's see how it works.
. . . . .
Preparation
We'll be using Python 3 and PyCharm. If you're using Python 2, your code may need slight adjustments.
• Selenium package:
• ChromeDriver:
We'll be using the Chrome browser, but the same can be done with other browsers such as Firefox, Opera, or Internet Explorer.
Install the WebDriver. At the time of writing, the latest ChromeDriver was available from: https://chromedriver.storage.googleapis.com/index.html?path=2.42/
Let's jump!
Let's try opening a URL: we'll head to the IMDB page.
The Chrome browser will open automatically, and you'll see the message "Chrome is being controlled by automated test software". Cool, that completes the first step. The next step is to navigate to the page where the information is located.
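Here is a minimal sketch of this first step. It assumes Selenium 4.6 or newer, which can locate a matching ChromeDriver on its own; with older versions you would point Selenium at the driver you downloaded. The helper name open_imdb is only illustrative.

```python
IMDB_URL = "https://www.imdb.com/"


def open_imdb(url=IMDB_URL):
    """Start Chrome under Selenium's control and load the given URL."""
    # Imported lazily so this sketch can be loaded without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # opens a new Chrome window
    driver.get(url)              # navigate to the page
    return driver


# Usage (opens a real browser window):
# driver = open_imdb()
# print(driver.title)
# driver.quit()
```

Remember to call driver.quit() when you are done, otherwise the browser process keeps running in the background.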
Our goal is to obtain the list of the most popular TV shows. To get there, we expect Selenium to perform these actions:
1. Click the "Menu" button at the top left, beside the IMDB home page button.
2. Find the "Most Popular TV Shows" link and click it.
3. After the list of popular TV shows appears, scrape it.
| Our objective: the list of shows to scrape |
Okay, now we know the steps needed to achieve our goal. The next step is to translate those actions into Selenium.
Locate the element
There are several ways to locate an element on a page: by ID, CSS selector, XPath, link text, class name, etc. For more detail, see "4. Locating Elements" in the Selenium Python Bindings documentation (selenium-python.readthedocs.io).
In this example, I will show you how to use XPath, link text, and class name to locate elements. To get the XPath: right-click the "Menu" button and select "Inspect", then right-click the highlighted element > Copy > Copy XPath.
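Clicking the menu button through its XPath might look like the sketch below. The XPath value is a hypothetical placeholder, not taken from the article: IMDB's markup changes over time, so paste in the one you copied from the inspector. The code uses the Selenium 4 style find_element(By.XPATH, ...); older tutorials use find_element_by_xpath, which recent Selenium releases no longer provide.

```python
# Hypothetical XPath for the "Menu" button; replace it with the value
# copied from your browser's inspector.
MENU_XPATH = '//*[@id="imdbHeader-navDrawerOpen"]'


def click_menu(driver, xpath=MENU_XPATH):
    """Find the element matching the XPath and click it."""
    # Imported lazily so this sketch can be loaded without Selenium installed.
    from selenium.webdriver.common.by import By

    menu_button = driver.find_element(By.XPATH, xpath)
    menu_button.click()
    return menu_button


# Usage, with a driver that already has the IMDB home page open:
# click_menu(driver)
```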
We've succeeded in clicking the menu button and are ready to click the next link, "Most Popular TV Shows".
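A sketch of this second click, locating the link by its visible text. The function name and the two-second pause are my own assumptions; the pause is a crude way to give the opened menu time to render before we look for the link.

```python
import time


def click_link_by_text(driver, link_text="Most Popular TV Shows", pause=2):
    """Click the first link whose visible text equals link_text."""
    # Imported lazily so this sketch can be loaded without Selenium installed.
    from selenium.webdriver.common.by import By

    time.sleep(pause)  # fixed pause so the menu has time to appear
    link = driver.find_element(By.LINK_TEXT, link_text)
    link.click()
    return link


# Usage, after the menu has been opened:
# click_link_by_text(driver)
```

By.LINK_TEXT needs the exact text of the link; By.PARTIAL_LINK_TEXT is the looser alternative when only part of the text is stable.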
Scrape the list of Popular Shows
We'll use the class name "titleColumn" to scrape the title and year, since both pieces of text are located under this class. However, we'll need to manipulate the string a bit with a regex later on to separate the movie title from the year.
Meanwhile, for the rating, we'll use the class name "ratingColumn.imdbRating".
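The scraping step could be sketched as follows. The function names and the regex are assumptions of mine: the pattern expects cell text of the form "Title (2019)", optionally preceded by a rank such as "1. ", and may need adjusting to whatever the page really returns. Because "ratingColumn.imdbRating" names two classes on the same element, the sketch writes it as a CSS selector. It also assumes every row has both a title cell and a rating cell, so the two lists line up.

```python
import re

# Assumed cell text: optional rank ("1. "), the title, then "(year)".
TITLE_YEAR_PATTERN = re.compile(
    r"^(?:\d+\.\s*)?(?P<title>.+?)\s*\((?P<year>\d{4})\)"
)


def split_title_year(cell_text):
    """Split text such as 'Some Show (2019)' into ('Some Show', '2019')."""
    text = cell_text.strip()
    match = TITLE_YEAR_PATTERN.match(text)
    if match is None:
        # No year found: keep the raw text rather than guessing.
        return text, None
    return match.group("title"), match.group("year")


def scrape_chart(driver):
    """Collect title, year and rating for every row of the chart page."""
    # Imported lazily so the regex helper works without Selenium installed.
    from selenium.webdriver.common.by import By

    title_cells = driver.find_elements(By.CLASS_NAME, "titleColumn")
    # Two classes on one element, expressed as a CSS selector.
    rating_cells = driver.find_elements(
        By.CSS_SELECTOR, ".ratingColumn.imdbRating"
    )

    rows = []
    for title_cell, rating_cell in zip(title_cells, rating_cells):
        title, year = split_title_year(title_cell.text)
        rows.append(
            {"title": title, "year": year, "rating": rating_cell.text.strip()}
        )
    return rows
```

Note the plural find_elements: it returns a list of all matches (an empty list if there are none), whereas find_element returns only the first match and raises an exception when nothing is found.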
Output
Finally, I transfer the result from the dataframe to a CSV file. Boom!
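A small sketch of this last step, assuming pandas is installed and the scraped data is held as a list of dictionaries with the keys title, year and rating. The function name, column names and file name are illustrative choices, not the article's.

```python
import pandas as pd


def save_to_csv(rows, path="popular_tv_shows.csv"):
    """Build a DataFrame from a list of row dicts and write it to a CSV file."""
    df = pd.DataFrame(rows, columns=["title", "year", "rating"])
    df.to_csv(path, index=False)  # index=False leaves out the row-number column
    return df


# Usage, where rows is the list of dictionaries produced by your scraping step:
# df = save_to_csv(rows)
# print(df.head())
```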
Now you know the basics of scraping a website with Selenium. If you are interested in exploring more, here is what I recommend next:
• Try different ways of locating elements
• Use explicit waits instead of a fixed sleep period (time.sleep())
• Try different actions (send keys, scroll, etc.)
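As a starting point for the second suggestion, here is a hedged sketch of an explicit wait. Instead of sleeping for a fixed time, WebDriverWait polls until the condition holds or the timeout expires. The function name and the ten-second timeout are assumptions for illustration.

```python
def wait_and_click(driver, link_text, timeout=10):
    """Wait until the link is clickable, then click it."""
    # Imported lazily so this sketch can be loaded without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # until() returns the element as soon as the condition is satisfied and
    # raises TimeoutException if that does not happen within the timeout.
    link = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, link_text)))
    link.click()
    return link


# Usage:
# wait_and_click(driver, "Most Popular TV Shows")
```

The script then moves on as soon as the element is ready, rather than always waiting for the full sleep period.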
Happy learning and thanks for reading ^_^