Built a working Hadoop-Spark-Hive-Superset cluster on Docker

Let’s get this thing started

Now you can download my repo hadoop-spark-hive-superset, and from the directory where you placed it, all it takes is one command to get the multi-container environment running:
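Assuming the docker-compose.yml from the repo is in place, that is the usual Compose command:

docker-compose up -d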

And you can break it all down again by going to that same directory and running this:
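Again from the repo root, that should be:

docker-compose down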

All the containers will then be stopped and removed. But the images and volumes stay! So don't be surprised that the CSV file you uploaded to HDFS is still there.

Quick starts

Quick start HDFS

Find the Container ID of the namenode.
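Assuming the container name contains "namenode", something like this will do:

# list the running containers and pick out the namenode
docker ps | grep namenode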


Copy supermart_grocery_sales.csv to the namenode.
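For example with docker cp (the /tmp target directory is just a convenient choice):

docker cp supermart_grocery_sales.csv <namenode_container_id>:/tmp/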


Go to the bash shell on the namenode, using that same Container ID.
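That is a plain docker exec:

docker exec -it <namenode_container_id> bash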


Create an HDFS directory /data/sales/supermart_grocery_sales.
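Inside the namenode, hdfs dfs does the work (-p also creates the parent directories):

hdfs dfs -mkdir -p /data/sales/supermart_grocery_sales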


Copy supermart_grocery_sales.csv to HDFS:
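Assuming you copied the file to /tmp in the earlier step:

hdfs dfs -put /tmp/supermart_grocery_sales.csv /data/sales/supermart_grocery_sales/
# check that it landed
hdfs dfs -ls /data/sales/supermart_grocery_sales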


Quick start Spark

Go to http://<dockerhadoop_IP_address>:8080 or http://localhost:8080/ on your Docker host (laptop). Here you find the spark:// master address:


Go to the command line of the Spark master and start spark-shell.
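Something along these lines should do it; the container name spark-master, the location of spark-shell and the master URL are assumptions here, so use the spark:// address you just read from the web UI:

docker exec -it spark-master bash
spark/bin/spark-shell --master spark://spark-master:7077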


Load supermart_grocery_sales.csv from HDFS.
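In the spark-shell a read of the CSV could look like this; the hdfs://namenode:9000 address is an assumption and depends on the fs.defaultFS of your cluster:

val salesDF = spark.read.option("header", "true").csv("hdfs://namenode:9000/data/sales/supermart_grocery_sales/supermart_grocery_sales.csv")
salesDF.show(5)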


Quick start Hive

Find the Container ID of the Hive Server.
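Same trick as with the namenode (assuming the container name contains "hive-server"):

docker ps | grep hive-server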


Go to the command line of the Hive server and start hiveserver2.
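With the Container ID from the previous step (depending on the image, hiveserver2 may already be running; the & keeps the shell free for the next steps):

docker exec -it <hiveserver_container_id> bash
hiveserver2 &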

Maybe do a little check that something is listening on port 10000 now.
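If netstat is available in the container:

netstat -anp | grep 10000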

Okay. Beeline is the command-line interface for Hive. Let's connect to hiveserver2 now.
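Assuming the default port 10000 and no special authentication, the connection looks like this (beeline prompts for a username and password, which you can leave empty):

beeline -u jdbc:hive2://localhost:10000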

Show the databases.
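That is simply:

SHOW DATABASES;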

Create a new database.
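The name supermart is just an example, pick whatever you like:

CREATE DATABASE supermart;
USE supermart;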

And let's create a table.
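The column list below is an assumption about the layout of the CSV, so adjust it to the actual header. An external table over the HDFS directory you filled earlier means no extra load step is needed:

CREATE EXTERNAL TABLE supermart.sales (
  order_id      STRING,
  customer_name STRING,
  category      STRING,
  sub_category  STRING,
  city          STRING,
  order_date    STRING,
  region        STRING,
  sales         DOUBLE,
  discount      DOUBLE,
  profit        DOUBLE,
  state         STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/sales/supermart_grocery_sales'
TBLPROPERTIES ('skip.header.line.count'='1');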

And have a little select statement going.
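For example, assuming the table above:

SELECT * FROM supermart.sales LIMIT 10;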






