I’ve been “playing around” with Big Data Clusters for some time now and CTP 3.2 is way ahead when it comes to streamlining the BDC deployment process. You can check out my 4-part series on deploying BDC on AKS to see how cumbersome the process used to be. New in CTP 3.2, you can deploy a BDC on AKS (an existing cluster OR a new cluster) using an Azure Data Studio notebook. Let’s see how.
Click through for instructions. It was rather smart of Microsoft to release the instructions as a notebook.
In September 2018, Microsoft announced a new partnership with Azul Systems, a leading Java open source contributor and distributor. This partnership allows for all Azure customers to use Azul’s Zulu for Azure – Enterprise distribution of Java for free with support jointly provided by Microsoft and Azul. That’s right – supported for free.
Today, we are announcing that we have extended that partnership to cover SQL Server. Starting in the SQL Server 2019 community technology preview (CTP) 3.2 that was released today, we are including Azul System’s Zulu Embedded right out of the box for all scenarios where Java is used in SQL Server – in PolyBase, Apache Spark, Java extensibility, and more. There is no additional cost beyond what you pay for SQL Server.
This is interesting. We’ll have to see if the CTP 3.2 installation doesn’t ask for JDK 1.8 anymore and just installs the Azul Systems version.
There are many ways to view the health of your Big Data Cluster. As of CTP 3.0, there are kubectl commands, mssqlctl commands as well as dashboards. For the sake of this series, I will focus on the dashboards. I will blog about some of the useful kubectl and mssqlctl commands in later posts.
The first dashboard is the Microsoft Cluster Administration portal (see below snapshot). This is a view into the Big Data Cluster Controller. As you can see from the image below, the Overview pane shows the Controller, Master Instance and all the pools. On the left hand side you can see more details. If you click on the “Service Endpoint” option, you will see a list of endpoints that you can bookmark.
Something I appreciate is that Microsoft thought ahead on what the monitoring story should look like rather than waiting until the end and slapping something together.
The big data clusters feature continues to add key capabilities for its initial release in SQL Server 2019. This month, the release extends the Apache Spark™ functionality for the feature by supporting the ability to read and write to data pool external tables directly as well as a mechanism to scale compute separately from storage for compute-intensive workloads. Both enhancements should make it easier to integrate Apache Spark™ workloads into your SQL Server environment and leverage each of their strengths. Beyond Apache Spark™, this month’s release also includes machine learning extensions with MLeap where you can train a model in Apache Spark™ and then deploy it for use in SQL Server through the recently released Java extensibility functionality in SQL Server CTP 3.0. This should make it easier for data scientists to write models in Apache Spark™ and then deploy them into production SQL Server environments for both periodic training and full production against the trained model in a single environment.
Click through to learn more about what has changed.
To kick off the Big Data Cluster “Default configuration” creation, we will execute the following Powershell command:
mssqlctl cluster create
That will first prompt us to accept the license terms. Type y and Enter.
Mohammad takes us through the default installation, which requires only a few parameters before it can go on its merry way.
Next, we will create a resource group by executing the following command:
az group create –name nameOfMyresourceGroup –location eastus2
Once you execute the above command, you can go into the Azure portal and refresh your resource group pane and see the newly created resource group.
Once that is setup, it’s time to create the actual Kubernetes cluster.
Click through for the full set of instructions.
The *absolute* unique feature that BDCs offer that *no* other company or product offers is: Data Virtualization
The new and enhanced SQL Server 2019 Polybase feature comes with connectors to many different data sources: Oracle, Teradata, Apache Spark, MongoDB, Azure Cosmos DB, and even ODBC connectivity to IBM’s DB2, SAP HANA and excel (see image below)
So far, Microsoft does not have a simple way to create a Big Data Cluster. It’s a bit cumbersome of a process and the learning curve is a bit steep. However, Microsoft is currently working on making it easier to deploy a Big Data Cluster via Notebook in Azure Data Studio and eventually some type of “deployment wizard.” But for now, the only option is to do it the long way.
The series will continue, but check out the setup work.
Big data clusters
– Scale out by supporting deployment configurations with an increased number of SQL Server instances in the compute pool. You can now specify up to 4 instances in the compute pool for optimal performance of your queries against data pool, storage pool, or other external data sources.
– The mssqlctl utility includes updates to ease the big data cluster management experience with enhancements to the login experience. There is also a new command to list the cluster endpoints.
– Persistent volumes abstract the details of how the storage is provided and how it’s consumed. In this release, we’re enhancing the supported storage configurations by enabling you to customize storage classes independently for logs and data. With these changes, we also consolidated the storage configurations for big data components, so that the number of persistent volume claims for a big data cluster has been reduced for a default minimum configuration cluster.
There are a few other changes announced in this CTP. Now that we’re at 3.0, the light is at the end of the tunnel.
In a previous post I shared current SQL Server 2019 learning resources, which you can view in detail here.
However, SQL Server 2019 Big Data Clusters are very involved. So, I thought I better dedicate a whole post to further learning resources for it.
Because some people have different learning methods I have included references to both documents and videos in this post. In addition, I have created the below links in case somebody wants to go directly to a specific section.
Kevin’s put together quite a few useful links here.
We’re delighted to release the Azure Toolkit for IntelliJ support for SQL Server Big Data Cluster Spark job development and submission. For first-time Spark developers, it can often be hard to get started and build their first application, with long and tedious development cycles in the integrated development environment (IDE). This toolkit empowers new users to get started with Spark in just a few minutes. Experienced Spark developers also find it faster and easier to iterate their development cycle.
The toolkit extends IntelliJ support for the Spark job life cycle starting from creation, authoring, and debugging, through submission of jobs to SQL Server Big Data Clusters. It enables you to enjoy a native Scala and Java Spark application development experience and quickly start a project using built-in templates and sample code. The integration with SQL Server Big Data Cluster empowers you to quickly submit a job to the big data cluster as well as monitor its progress. The Spark console allows you to check schemas, preview data, and validate your code logic in a shell-like environment while you can develop Spark batch jobs within the same toolkit.
It looks pretty good from my vantage point.