Real-time analytics: Airflow + Kafka + Druid + Superset

Real-time analytics has become a necessity for large companies around the world. Analyzing data in a streaming fashion lets you continuously follow customer behavior and act on it. I also wanted to test Druid's real-time capabilities while looking for a real-time analytics solution. This blog gives an introduction to setting up streaming analytics using open-source technologies.

Airflow

In this blog we use Airflow, a task scheduling platform that allows you to create, orchestrate, and monitor data workflows. In this setup it acts as a producer, sending data to a Kafka topic.
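To make that concrete, here is a minimal sketch of what such a producer DAG could look like; the DAG id, task id, and the send_prices callable are illustrative placeholders, not taken from the repo:

    # demo_dag.py - minimal Airflow DAG sketch that triggers a producer once a minute.
    # DAG id, task id, and send_prices() are placeholders for illustration.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def send_prices():
        # In the real setup this would push a message to the Kafka "demo" topic;
        # see the producer sketch further below.
        pass

    with DAG(
        dag_id="demo",
        start_date=datetime(2022, 1, 1),
        schedule_interval="* * * * *",  # run once a minute
        catchup=False,
    ) as dag:
        PythonOperator(task_id="produce_prices", python_callable=send_prices)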

Kafka

Kafka is a distributed messaging platform that allows you to sequentially log streaming data into topic-specific feeds, which other applications can tap into. In this setup, Kafka collects and buffers the events, which are then ingested by Druid.
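As an example of how another application can tap into a topic feed, the sketch below tails a topic with the kafka-python client; the client library, broker address, and topic name are assumptions rather than details from the repo:

    # Tail a Kafka topic from the host to inspect the buffered events.
    # Assumes the kafka-python package and a broker reachable at localhost:9092.
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "demo",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)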

Druid

Apache Druid provides low-latency real-time data ingestion from Kafka, flexible data exploration, and rapid data aggregation. Druid is not so much a data lake as a data river: data generated by users, sensors, or other sources flows straight through to the foreground application. With a Hive/Presto setup, data is typically available hourly or daily, but with Druid the data can be queried as soon as it lands in the database. Druid claims a 90%-98% speed improvement over Apache Hive (untested here).

Superset

Apache Superset is an open-source data visualization tool that can represent data graphically. Superset was initially created by Airbnb and later released to the Apache community. It is developed in Python and uses the Flask framework for all web interactions. Superset supports the majority of RDBMSs through SQLAlchemy.

How does it work?

Let's set up an example real-time coin-price analysis system based on Airflow, Kafka, Druid, and Superset. With Docker it's easy to spin up a local instance and explore the ideas yourself.

    1. To set up the system, start by cloning the git repo.

    2. Next, we need to build the local images.

         ∎    Info on the services:

           Note that the Airflow user is admin and the password is generated automatically at /app/standalone_admin_password.txt in the app_airflow directory after the server runs. As for Superset, you need to exec into the running container and run the Superset CLI (typically superset fab create-admin with the appropriate flags) to create an admin user.


        ∎    The Airflow scheduler runs app_airflow/app/dags/demo.py once a minute, sending a message to the Kafka demo topic with coin and price data for ['BTC', 'ETH', 'BTT', 'DOT']. The structure of the message data is sketched in the example after this list; I just randomize the price for simplicity. To start streaming, sign in to Airflow and enable the demo DAG.



        ∎    Note: you can also run producer.py directly with Python as an alternative to the Airflow demo DAG.
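Since the original message structure isn't reproduced here, below is a sketch of what the producer logic might look like; the field names and the kafka-python client are assumptions, not necessarily what demo.py or producer.py actually use:

    # Producer sketch: build a random price message per coin and push it to the "demo" topic.
    # Field names ("timestamp", "coin", "price") and the kafka-python client are assumptions.
    import json
    import random
    from datetime import datetime, timezone
    from kafka import KafkaProducer

    COINS = ["BTC", "ETH", "BTT", "DOT"]

    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",  # use localhost:9092 when running from the host
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def send_prices():
        for coin in COINS:
            message = {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "coin": coin,
                "price": round(random.uniform(1, 50000), 2),  # randomized for simplicity
            }
            producer.send("demo", message)
        producer.flush()

    if __name__ == "__main__":
        send_prices()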


    3. Configure Druid to receive the stream

From the Druid console at http://localhost:8888/ select Load data > Kafka, enter the Kafka server kafka:9092 and the topic demo, and configure the output datasource.
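Under the hood the wizard submits a Kafka supervisor spec to Druid. As a rough sketch, an equivalent spec could also be posted directly to the Overlord API through the router on port 8888; the column names below follow the assumed message structure from the producer sketch, so adjust them to your actual payload:

    # Post a Kafka ingestion (supervisor) spec to Druid instead of clicking through the wizard.
    # Column names follow the assumed message fields above; tune granularities as needed.
    import requests

    spec = {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": "demo",
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {
                    "dimensions": ["coin", {"name": "price", "type": "double"}]
                },
                "granularitySpec": {
                    "segmentGranularity": "hour",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "kafka",
                "topic": "demo",
                "consumerProperties": {"bootstrap.servers": "kafka:9092"},
                "inputFormat": {"type": "json"},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }

    resp = requests.post("http://localhost:8888/druid/indexer/v1/supervisor", json=spec)
    resp.raise_for_status()
    print(resp.json())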

    4. Set up Superset to connect to Druid as a data source

Log in to Superset at http://localhost:8088/ and create a new database connection via Data > Databases > + Database, connecting to Druid with the SQLAlchemy URI druid://broker:8082/druid/v2/sql/. For details, see Superset-Database-Connect.
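You can sanity-check the same kind of SQLAlchemy URI outside Superset with pydruid, which provides the druid:// dialect. This is a minimal sketch assuming the broker port is mapped to the host as localhost:8082:

    # Quick check that the Druid SQL endpoint answers over SQLAlchemy.
    # Requires pydruid; the host/port mapping is an assumption about the Docker setup.
    from sqlalchemy import create_engine, text

    engine = create_engine("druid://localhost:8082/druid/v2/sql/")
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT coin, COUNT(*) AS events FROM demo GROUP BY coin"))
        for row in rows:
            print(row)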

    5. Create a dashboard

To create dashboards with Superset, go to SQL Lab > SQL Editor, select druid as the database, druid as the schema, and demo as the table, then execute the query you need (an example is sketched below). Once done, click Explore, select your chart type, and publish it to a dashboard.
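If you need a starting point, the query below buckets prices per minute and per coin, which plots nicely as a time-series chart; it is only a sketch, and the column names follow the assumed message schema from earlier:

    # Example SQL for SQL Lab: average price per coin per minute, kept here as a
    # Python string so it can also be run through the SQLAlchemy engine shown above.
    EXAMPLE_QUERY = """
    SELECT
      FLOOR(__time TO MINUTE) AS "minute",
      coin,
      AVG(price) AS avg_price
    FROM demo
    GROUP BY 1, 2
    ORDER BY 1
    """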


Finally, enjoy it! 💥💥


