Apache Spark on Windows: A Docker approach
How to set up an Apache Spark development environment with minimal effort using Docker for Windows
Recently I was allocated to a project where the entire customer database lives in Apache Spark / Hadoop. As a standard in all my projects, I first set out to prepare the development environment on the corporate laptop, which comes with Windows as the standard OS. As many already know, preparing a development environment on a Windows laptop can be painful, and on a corporate laptop it can be even worse (due to restrictions imposed by the system administrator, the corporate VPN, etc.).
Creating a development environment for Apache Spark / Hadoop is no different. Installing Spark on Windows is notoriously complicated: several dependencies need to be installed (a Java JDK, Python, winutils.exe, Log4j), services need to be configured, and environment variables need to be set properly. Given that, I decided to use Docker as the first option for all my development environments.
Why Docker?
1. There is no need to install any library or application on Windows, only Docker. No need to ask Technical Support for permission to install software and libraries every week.
2. Windows will always run at its full potential (without countless services starting at login).
3. You can have different environments per project, including different software versions. For example, one project can use Apache Spark 2 with Scala while another uses Apache Spark 3 with PySpark, without any conflict.
4. There are several ready-made images built by the community (Postgres, Spark, Jupyter, etc.), making the development setup much faster.
These are just some of the advantages of Docker; there are others, which you can read more about on the official Docker page.
With all that said, let's get down to business and set up our Apache Spark environment.
Install Docker for Windows
You can follow the getting started guide to download Docker for Windows and follow the instructions to install Docker on your machine. If your Windows is the Home Edition, you can follow the Install Docker Desktop on Windows Home instructions.
Jupyter and Apache Spark
As I said earlier, one of the coolest features of Docker is its community images. There are plenty of pre-made images for almost every need, available to download and use with minimal or no configuration. Take some time to explore Docker Hub and see for yourself.
The Jupyter developers have been doing an amazing job actively maintaining a set of images for data scientists and researchers; the project page can be found here. Some of the images are:
1. jupyter/r-notebook includes popular packages from the R ecosystem.
2. jupyter/scipy-notebook includes popular packages from the scientific Python ecosystem.
3. jupyter/tensorflow-notebook includes popular Python deep learning libraries.
4. jupyter/pyspark-notebook includes Python support for Apache Spark.
5. jupyter/all-spark-notebook includes Python, R and Scala support for Apache Spark.
and many others.
For our Apache Spark environment, we will use jupyter/pyspark-notebook, since we don't need R or Scala support.
To create a new container you can go to a terminal and type the following:
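A minimal form of the command, using the container name pyspark and the default notebook port 8888 described below, would be:

docker run --name pyspark -p 8888:8888 jupyter/pyspark-notebook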
This command pulls the jupyter/pyspark-notebook image from Docker Hub if it is not already present on the localhost.
It then starts a container with name=pyspark running a Jupyter Notebook server and exposes the server on host port 8888.
You may instruct the start script to customize the container environment before launching the notebook server. You do so by passing environment variables (the -e flag) to the docker run command. The list of all available variables can be found in the docker-stacks docs.
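For example, one variable documented in docker-stacks is JUPYTER_ENABLE_LAB, which tells the start script to launch JupyterLab instead of the classic notebook interface:

docker run --name pyspark -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes jupyter/pyspark-notebook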
The server logs appear in the terminal and include a URL to the notebook server. You can navigate to that URL, create a new Python notebook and paste the following code:
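For example, a short snippet that creates a SparkSession and a small test DataFrame is enough to verify that Spark is working (the values here are just an illustration):

from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession
spark = SparkSession.builder.appName("pyspark-test").getOrCreate()

# Build a tiny DataFrame and show it to confirm the environment works
df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])
df.show()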
Now we have our Apache Spark environment with minimal effort. You can open a terminal inside the container and install packages using conda or pip, managing your packages and dependencies as you wish (see the example below). Once you have finished, you can press Ctrl+C to stop the container.
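For example, installing an extra library from a terminal inside the running container is as simple as (plotly here is just an illustrative choice):

pip install plotly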
Data Persistence
If you want to start your container again and keep your data persisted, you cannot simply run the "docker run" command again, as this would create a brand-new container. So what do we need to do?
You can type in a terminal:
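docker ps -a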
This will list all available containers. To start the same container that you created previously, type:
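docker start -a pyspark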
where -a is a flag that tells Docker to attach the container's output to the terminal, and pyspark is the name of the container. To learn more about docker start options, you can visit the Docker docs.
Conclusions
In this article, we saw how Docker can speed up the development lifecycle and help us mitigate some of the drawbacks of using Windows as the main OS for development. Microsoft is doing a great job with WSL, Docker, and other tools for developers and engineers, even supporting GPU processing through Docker and WSL. The future is promising.