Posts

Showing posts from February, 2023

ETL Pipelines With Airflow

Introduction
In this blog post, I want to go over the core data engineering operations of Extract, Transform, Load (ETL) and show how they can be automated and scheduled using Apache Airflow. Extracting data can be done in many ways, but one of the most common is to query a web API. If the query succeeds, the API's server returns data, and often that data comes back as JSON. JSON can be thought of as semi-structured data: essentially nested dictionaries whose keys are strings and whose values may be strings, numbers, booleans, lists, or further dictionaries. Because the data arrives in this raw, semi-structured form, we must transform it before storing or loading it into a database. Airflow is a platform for scheduling and monitoring workflows, and in this post I will show you how to use it to extract the daily weather in Ha Noi from the OpenWeatherMap API, convert the temperature to Celsius, and load the data into a simple PostgreSQL database. Let's first get started with how to query an ...
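To make the flow concrete, here is a minimal Airflow DAG sketch of the extract, transform, and load steps described above. It is an illustration rather than the exact code from the post: the API key, the PostgreSQL connection details, and the daily_weather table are placeholder assumptions, and it uses plain requests and psycopg2 for brevity.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

API_URL = "https://api.openweathermap.org/data/2.5/weather"
CITY = "Hanoi"
API_KEY = "YOUR_API_KEY"  # placeholder: substitute your own OpenWeatherMap key


def extract(ti):
    # Query the web API; the response body comes back as JSON.
    resp = requests.get(API_URL, params={"q": CITY, "appid": API_KEY})
    resp.raise_for_status()
    ti.xcom_push(key="raw_weather", value=resp.json())


def transform(ti):
    raw = ti.xcom_pull(key="raw_weather", task_ids="extract")
    # OpenWeatherMap reports temperatures in Kelvin by default,
    # so convert to Celsius before loading.
    temp_c = raw["main"]["temp"] - 273.15
    ti.xcom_push(key="weather_row", value={"city": CITY, "temp_c": round(temp_c, 2)})


def load(ti):
    import psycopg2  # assumes a local PostgreSQL instance with a daily_weather table

    row = ti.xcom_pull(key="weather_row", task_ids="transform")
    conn = psycopg2.connect(host="localhost", dbname="weather",
                            user="airflow", password="airflow")
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO daily_weather (city, temp_c, recorded_at) VALUES (%s, %s, %s)",
            (row["city"], row["temp_c"], datetime.utcnow()),
        )
    conn.close()


with DAG(
    dag_id="hanoi_weather_etl",
    start_date=datetime(2023, 2, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```

Each task hands its result to the next via XCom, and the @daily schedule is what lets Airflow run the whole pipeline once per day without manual intervention.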

Built a working Hadoop-Spark-Hive-Superset cluster on Docker

Let's get this thing started
Now you can download my repo hadoop-spark-hive-superset, and from the directory where you placed it, all it takes is this command to get the multi-container environment running: And you can break it all down again by going to that same directory and running this: All the containers will then be stopped and removed. But note: the images and volumes stay! So don't be surprised that the csv file you uploaded to HDFS is still there.

Quick starts

Quick start HDFS
1. Find the Container ID of the namenode.
2. Copy supermart_grocery_sales.csv to the namenode.
3. Go to the bash shell on the namenode with that same Container ID.
4. Create an HDFS directory /data//sales/supermart_grocery_sales.
5. Copy supermart_grocery_sales.csv to HDFS:

Quick start Spark
Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop). Here you find the spark:// master address: Go to the command line o...
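Once the csv is sitting in HDFS, a small PySpark sketch like the one below can read it through the cluster. The spark-master and namenode hostnames, the 9000 port, and the HDFS path are assumptions based on a typical docker-compose Hadoop/Spark setup; adjust them to match the spark:// address shown on the :8080 UI and your own container names.

```python
from pyspark.sql import SparkSession

# Assumed hostnames/ports: "spark-master:7077" for the Spark master and
# "namenode:9000" for HDFS; substitute the values from your own cluster.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("supermart-sales-quickstart")
    .getOrCreate()
)

# Read the uploaded csv from HDFS with a header row and inferred column types.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs://namenode:9000/data//sales/supermart_grocery_sales/supermart_grocery_sales.csv")
)

df.printSchema()
print(df.count())
```

If the schema and row count print as expected, the Spark workers can reach both the master and HDFS, which is the main thing this quick start is meant to verify.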