dungtothelo

Posts

Showing posts from August, 2022

Data Engineering Project - Retail Store Part 4 - Analyzing the Data

- August 31, 2022

Web Scraping Wikipedia Tables using BeautifulSoup and Python

- August 30, 2022

'Data is the new oil.' As an aspiring data engineer, I do a lot of projects that involve scraping data from various websites. Some companies, like Twitter, provide APIs that return their information in an organized way, while other websites have to be scraped to get data in a structured format. The general idea behind web scraping is to retrieve data that exists on a website and convert it into a format that is usable for analysis. In this tutorial, I will go through a detailed but simple explanation of how to scrape data in Python using BeautifulSoup. I will be scraping Wikipedia to find all the countries in Asia. (Figure: tables with the names of Asian countries on Wikipedia.) First, we import the requests library. Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor. Next, we assign the link of the website we are going to scrape to a variable named website_url. requests.get(url).text will ping a ...
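The steps in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the code from the full post: the function names, the `wikitable` class, the use of link `title` attributes, and the `User-Agent` string are all assumptions made for the example.

```python
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML as text."""
    import requests  # imported here so the parser below can be tried offline

    # The User-Agent value is a placeholder chosen for this sketch.
    response = requests.get(
        url, headers={"User-Agent": "example-scraper/0.1"}, timeout=10
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_country_names(html: str) -> list[str]:
    """Return the link titles found in the first table with class 'wikitable'."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the countries are listed in a table carrying the 'wikitable' class.
    table = soup.find("table", class_="wikitable")
    if table is None:
        return []
    names = []
    for link in table.find_all("a"):
        title = link.get("title")  # links without a title attribute are skipped
        if title:
            names.append(title)
    return names
```

With `website_url` set to the Wikipedia page being scraped, `extract_country_names(fetch_html(website_url))` would return the link titles from that page's first matching table. Which table and which attribute hold the country names depends on the page's actual markup.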

How to automate web scraping using Selenium in Python: IMDB case

- August 23, 2022

There are several ways to scrape websites in Python. Well-known libraries such as BeautifulSoup and Scrapy are widely used for crawling the web. In this article, however, I will share my experience using the Selenium library for web scraping. Selenium automates the manual activity of a user interacting with a website. Imagine you wanted to get some information from a website: what would you do? You would start by opening the browser, enter the address of the website, go to the site, look at the information, and finally collect it. Selenium follows the same steps as the manual process, but does them automatically. Let's get started and see how it works. ...

Data Engineering Project - Retail Store Part 3 - Data Warehousing

- August 16, 2022

Introduction. This is part 3 of the "Data Engineering Project - Retail Store" series. In the two previous posts, I extracted whiskey data through web scraping, designed a MySQL database to serve as the organization's main data source, and loaded the data into it. The current database looks like this: This part focuses on an important component of the organization's data architecture: the data warehouse. Put simply, a data warehouse is another relational database used alongside the organization's central database. The warehouse copies data from the organization's database and stores it in a specific way that makes it highly performant for read operations. It gives the organization a single source of historical data covering its main business processes. A data warehouse can be on-premises, meaning the organization has to deploy and maintain it itself, or it can be a paid service from a cloud provider. It all depends on the use case. ...
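To make the "copy from the operational database into a read-optimized store" idea concrete, here is a small sketch. It uses Python's built-in sqlite3 as a stand-in for MySQL so it runs without a server, and the table and column names (`orders`, `products`, `fact_sales`) are hypothetical; the series' actual schema is not shown in this excerpt.

```python
import sqlite3


def load_warehouse(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Copy joined rows from the operational database into one wide warehouse table."""
    # Join once at load time so that later reads need no joins.
    rows = source.execute(
        """
        SELECT o.order_id, o.order_date, p.product_name,
               o.quantity, o.quantity * p.unit_price AS revenue
        FROM orders AS o
        JOIN products AS p ON p.product_id = o.product_id
        ORDER BY o.order_id
        """
    ).fetchall()
    warehouse.execute(
        """
        CREATE TABLE IF NOT EXISTS fact_sales (
            order_id INTEGER PRIMARY KEY,
            order_date TEXT,
            product_name TEXT,
            quantity INTEGER,
            revenue REAL
        )
        """
    )
    # INSERT OR REPLACE keeps the load idempotent if it is run again.
    warehouse.executemany(
        "INSERT OR REPLACE INTO fact_sales VALUES (?, ?, ?, ?, ?)", rows
    )
    warehouse.commit()
    return len(rows)
```

The design choice illustrated here is denormalization: the join is paid for once during the load, and analytical queries then read a single table. A real warehouse would typically add dimension tables and incremental loads, which this sketch leaves out.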