Big data clusters
– Scale out by supporting deployment configurations with an increased number of SQL Server instances in the compute pool. You can now specify up to 4 instances in the compute pool for optimal performance of your queries against data pool, storage pool, or other external data sources.
– The mssqlctl utility includes updates to ease the big data cluster management experience with enhancements to the login experience. There is also a new command to list the cluster endpoints.
– Persistent volumes abstract the details of how the storage is provided and how it’s consumed. In this release, we’re enhancing the supported storage configurations by enabling you to customize storage classes independently for logs and data. With these changes, we also consolidated the storage configurations for big data components, so that the number of persistent volume claims for a big data cluster has been reduced for a default minimum configuration cluster.
There are a few other changes announced in this CTP. Now that we’re at 3.0, the light is at the end of the tunnel.
In a previous post I shared current SQL Server 2019 learning resources, which you can view in detail here.
However, SQL Server 2019 Big Data Clusters are very involved. So, I thought I better dedicate a whole post to further learning resources for it.
Because some people have different learning methods I have included references to both documents and videos in this post. In addition, I have created the below links in case somebody wants to go directly to a specific section.
Kevin’s put together quite a few useful links here.
We’re delighted to release the Azure Toolkit for IntelliJ support for SQL Server Big Data Cluster Spark job development and submission. For first-time Spark developers, it can often be hard to get started and build their first application, with long and tedious development cycles in the integrated development environment (IDE). This toolkit empowers new users to get started with Spark in just a few minutes. Experienced Spark developers also find it faster and easier to iterate their development cycle.
The toolkit extends IntelliJ support for the Spark job life cycle starting from creation, authoring, and debugging, through submission of jobs to SQL Server Big Data Clusters. It enables you to enjoy a native Scala and Java Spark application development experience and quickly start a project using built-in templates and sample code. The integration with SQL Server Big Data Cluster empowers you to quickly submit a job to the big data cluster as well as monitor its progress. The Spark console allows you to check schemas, preview data, and validate your code logic in a shell-like environment while you can develop Spark batch jobs within the same toolkit.
It looks pretty good from my vantage point.
This post post will focus on creating a big data cluster so that you can get up and running as fast as possible, as such the storage type used will be ephemeral, this perfectly acceptable for “Kicking the tyres”. For production grade installations integration with a production grade storage platform is required via a storage plugin. Before we create our cluster, with the assumption we are doing this with an on premises infrastructure, the following pre-requisites need to be met:
Read the whole thing, but wait until part 4 before putting anything valuable in it.
Minikube is a good learning tool and Microsoft provides instructions for deploying a big data cluster to this ‘Platform’. However, its single node nature and the fact that application pods run on the master node means that this does not reflect a cluster that anyone would run in production. Kubernetes-as-a-service is probably by far the easiest option for spinning a cluster up, however it relies on an Aws, Azure or Google Cloud Platform account, hence there is a $ cost associated with this. This leaves a vanilla deployment of Kubernetes on premises. Based on the assumption that most people will have access to Windows server version 2008 or above, a relatively cheap and way of deploying a Kubernetes cluster is via Linux virtual machines running on Hyper-V. This blog post will provide step by step instructions for creating the virtual machines to act as the master and worker nodes in the cluster.
This is going on my “try this out when I have time” list. So expect a full report sometime in the year 2023.
Some of these technologies and concepts are not owned or created by Microsoft – the concepts are universal, and a few of the technologies are open-source. I’ve marked those in italics.
I’ve also included a few links to a training resource I’ve found to be useful. I normally use LinkedIn Learning for larger courses, along with EdX, DataCamp, and many other platforms for in-depth training. The links I have indicated here are by no means exhaustive, but they are free, and provide a good starting point.
Click through for a list of some of the technologies in play.