We’re delighted to release the Azure Toolkit for IntelliJ support for SQL Server Big Data Cluster Spark job development and submission. For first-time Spark developers, it can often be hard to get started and build their first application, with long and tedious development cycles in the integrated development environment (IDE). This toolkit empowers new users to get started with Spark in just a few minutes. Experienced Spark developers also find it faster and easier to iterate their development cycle.
The toolkit extends IntelliJ support for the Spark job life cycle starting from creation, authoring, and debugging, through submission of jobs to SQL Server Big Data Clusters. It enables you to enjoy a native Scala and Java Spark application development experience and quickly start a project using built-in templates and sample code. The integration with SQL Server Big Data Cluster empowers you to quickly submit a job to the big data cluster as well as monitor its progress. The Spark console allows you to check schemas, preview data, and validate your code logic in a shell-like environment while you can develop Spark batch jobs within the same toolkit.
It looks pretty good from my vantage point.
This post post will focus on creating a big data cluster so that you can get up and running as fast as possible, as such the storage type used will be ephemeral, this perfectly acceptable for “Kicking the tyres”. For production grade installations integration with a production grade storage platform is required via a storage plugin. Before we create our cluster, with the assumption we are doing this with an on premises infrastructure, the following pre-requisites need to be met:
Read the whole thing, but wait until part 4 before putting anything valuable in it.
Minikube is a good learning tool and Microsoft provides instructions for deploying a big data cluster to this ‘Platform’. However, its single node nature and the fact that application pods run on the master node means that this does not reflect a cluster that anyone would run in production. Kubernetes-as-a-service is probably by far the easiest option for spinning a cluster up, however it relies on an Aws, Azure or Google Cloud Platform account, hence there is a $ cost associated with this. This leaves a vanilla deployment of Kubernetes on premises. Based on the assumption that most people will have access to Windows server version 2008 or above, a relatively cheap and way of deploying a Kubernetes cluster is via Linux virtual machines running on Hyper-V. This blog post will provide step by step instructions for creating the virtual machines to act as the master and worker nodes in the cluster.
This is going on my “try this out when I have time” list. So expect a full report sometime in the year 2023.
Some of these technologies and concepts are not owned or created by Microsoft – the concepts are universal, and a few of the technologies are open-source. I’ve marked those in italics.
I’ve also included a few links to a training resource I’ve found to be useful. I normally use LinkedIn Learning for larger courses, along with EdX, DataCamp, and many other platforms for in-depth training. The links I have indicated here are by no means exhaustive, but they are free, and provide a good starting point.
Click through for a list of some of the technologies in play.