Web Scraping Wikipedia Tables using BeautifulSoup and Python
'Data is the new oil'
As an aspiring data engineer, I work on a lot of projects that involve scraping data from various websites. Some companies, like Twitter, provide APIs that return their data in an organized form; for other websites, we have to scrape the pages ourselves to get the data into a structured format.
The general idea behind web scraping is to retrieve data that exists on a website and convert it into a format that is usable for analysis. In this tutorial, I will give a detailed but simple explanation of how to scrape data in Python using BeautifulSoup. I will be scraping Wikipedia to find all the countries in Asia.
First, we import the requests library. Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor.
[Image: Tables with the names of Asian countries on Wikipedia]
Next, we store the URL of the page we want to scrape in a variable named website_url.
requests.get(url).text sends an HTTP GET request to the page and returns its HTML as a string.
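Here is a minimal sketch of these two steps. The exact URL is my assumption, since any Wikipedia page with a table of Asian countries would do. The helper name fetch_html, the timeout, and the User-Agent string are also illustrative choices rather than anything required by Requests.

```python
import requests

# Assumed URL: a Wikipedia page that lists Asian countries in a table.
website_url = "https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area"


def fetch_html(url, timeout=10):
    """Send a GET request to `url` and return the response body as text."""
    # Placeholder User-Agent; many sites prefer a descriptive one.
    headers = {"User-Agent": "wiki-table-tutorial/0.1 (example script)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(website_url) then returns the page source as one long string, ready to be parsed.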
We then pass the page's HTML to the BeautifulSoup constructor to create a BeautifulSoup object (soup). BeautifulSoup is a Python package for parsing HTML and XML documents. It builds a parse tree from the page that can be used to extract data, which makes it useful for web scraping. The prettify() method returns the document as an indented string, which lets us see how the tags are nested.
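To see what this looks like without hitting the network, the sketch below parses a tiny hand-written HTML string instead of the real page. The markup in sample_html is made up for illustration, and I use Python's built-in "html.parser" here as an assumption; other parsers such as lxml work too if installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML returned by requests.get(url).text.
sample_html = (
    "<html><body><table class='wikitable sortable'>"
    "<tr><td><a href='/wiki/Japan' title='Japan'>Japan</a></td></tr>"
    "</table></body></html>"
)

# Build the parse tree using the parser that ships with Python.
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the tree as a string with one tag per line, indented by depth.
pretty = soup.prettify()
print(pretty)
```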
If you inspect the HTML carefully, you will see that all the table contents, i.e. the names of the countries we want to extract, are inside a table with the class 'wikitable sortable'.
So our first task is to find the table with the class 'wikitable sortable' in the HTML.
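One way to do this is sketched below, again on a made-up HTML snippet so it runs offline. The variable name my_table is my own. Passing the full string "wikitable sortable" to class_ matches the class attribute exactly as written, while the CSS selector version also matches if the table carries extra classes.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
sample_html = """
<table class="infobox">
  <tr><td><a href="/wiki/Asia" title="Asia">Asia</a></td></tr>
</table>
<table class="wikitable sortable">
  <tr><td><a href="/wiki/China" title="China">China</a></td></tr>
  <tr><td><a href="/wiki/India" title="India">India</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Exact match on the class attribute string.
my_table = soup.find("table", class_="wikitable sortable")

# CSS selector: requires both classes, tolerates additional ones.
same_table = soup.select_one("table.wikitable.sortable")

print(my_table["class"])
```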
Inside the 'wikitable sortable' table, each country appears as a link whose title attribute is the country name.
To extract all the links (the <a> tags) in the table, we use find_all().
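As a rough sketch on sample markup (the HTML and the names my_table and links are assumptions for illustration), calling find_all("a") on the table rather than on the whole soup limits the search to links inside that table:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
sample_html = """
<p><a href="/wiki/Main_Page" title="Main Page">Home</a></p>
<table class="wikitable sortable">
  <tr><td><a href="/wiki/China" title="China">China</a></td></tr>
  <tr><td><a href="/wiki/India" title="India">India</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
my_table = soup.find("table", class_="wikitable sortable")

# Search only inside the table, so the "Home" link above is not included.
links = my_table.find_all("a")

for link in links:
    print(link)
```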
From the links, we have to extract the title, which is the name of the country.
To do that, we create an empty list named countries, then take the title from each link and append it to the list.
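The loop could look roughly like this. The sample HTML is invented, and the check that skips links without a title is my addition: on a real page, some links inside the table (footnote markers, for example) may have no title attribute, and link.get("title") returns None for those instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source.
sample_html = """
<table class="wikitable sortable">
  <tr><td><a href="/wiki/China" title="China">China</a>
          <a href="#cite_note-1">[1]</a></td></tr>
  <tr><td><a href="/wiki/India" title="India">India</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
my_table = soup.find("table", class_="wikitable sortable")
links = my_table.find_all("a")

countries = []
for link in links:
    # .get() returns None when the attribute is missing.
    title = link.get("title")
    if title:
        countries.append(title)

print(countries)  # ['China', 'India']
```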
Finally, we convert the list countries into a Pandas DataFrame so that we can work with it in Python.
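A minimal sketch of the conversion, using a short hard-coded list in place of the scraped one. The column name "Country" is my own choice.

```python
import pandas as pd

# Stand-in for the list built from the link titles.
countries = ["China", "India", "Japan"]

# One row per country, in a single column named "Country" (assumed name).
df = pd.DataFrame({"Country": countries})

print(df)
```

From here the usual DataFrame tools apply, for example df.to_csv("countries.csv", index=False) to save the result.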
Thank you for reading this article. See you in the next one!