
Read how you can optimise your development workflow for Spark applications that use Azure storage as the main data store.

Deploying Spark applications on an Azure PaaS service such as Databricks is the gold standard, and you may do this via Azure DevOps or a similar product. Once deployed, chances are your Spark applications will read and write data in Azure storage, and most of the time you will access that data via mount points. This can force a dependency on the cloud compute service for your development workflow.
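
To make that dependency concrete, here is a minimal PySpark sketch of the usual Databricks idiom (the storage account, container, secret scope, and paths are made up for illustration). The mount is created once with dbutils.fs.mount, and application code then reads plain /mnt/... paths; since dbutils only exists on a Databricks cluster, code written this way normally needs cloud compute to run:

```python
# Runs on a Databricks cluster, where `dbutils` and `spark` are provided
# by the runtime. Account, container, and secret names are hypothetical.

# One-off: mount an Azure storage container onto the Databricks file system.
dbutils.fs.mount(
    source="wasbs://data@mystorageaccount.blob.core.windows.net",
    mount_point="/mnt/data",
    extra_configs={
        "fs.azure.account.key.mystorageaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="storage", key="account-key")
    },
)

# Application code: reads via the mount point, no storage credentials needed.
df = spark.read.parquet("/mnt/data/sales/2021/")
df.groupBy("region").count().show()
```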

Mount points are very convenient, so I wanted to find a way to mimic cloud development on my local machine: develop Spark applications locally that access data in ADLS via mount points, so that I can package my code and deploy it to Databricks for production with minimal configuration change.

Enter Blobfuse, a Microsoft-supported FUSE driver for Azure storage. Combined with the optimised Spark image by Data Mechanics (which alone gives you plenty of benefits), it makes the perfect duo to achieve our goal.
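
If you want to see what Blobfuse is doing under the hood, a mount with Blobfuse v1 looks roughly like this; the account name, key, container, and paths below are placeholders, so check the Blobfuse README for the authoritative options:

```
# fuse_connection.cfg -- storage credentials (placeholders)
accountName mystorageaccount
accountKey <storage-account-key>
containerName data

# Mount the container onto the local Linux file system
mkdir -p /mnt/data /mnt/blobfusetmp
blobfuse /mnt/data --tmp-path=/mnt/blobfusetmp \
    --config-file=fuse_connection.cfg \
    -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120
```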

Enter the Docker-Spark-ADLS image. This Docker image gives you the configuration to mount your Azure storage on the Linux file system in your container, and enables you to build Spark applications that access data in the cloud via mount points from your local machine. Employed efficiently, this could accelerate your development workflow and cut your development cost.
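
The payoff is that the Spark code you run inside the container is the same code you ship to Databricks. Assuming the container has mounted your storage at /mnt/data (the run flags, such as granting FUSE access with --cap-add SYS_ADMIN --device /dev/fuse, and the credential environment variables are defined in the image's README), a local development session looks like this:

```python
# Runs inside the container (or any local Spark install), reading the
# Blobfuse mount. /mnt/data is an assumed mount location for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("local-adls-dev").getOrCreate()

# Identical to the Databricks version: plain file paths, no cloud SDK calls.
df = spark.read.parquet("/mnt/data/sales/2021/")
df.groupBy("region").count().show()

spark.stop()
```

The only thing that changes between local and Databricks is how the mount is created; the application code stays the same.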

Why don’t you take it out for a spin and let me know your thoughts? I’d also welcome pull requests for enhancements.

Happy programming!