Category: Big Data Clusters

Using Azure Kubernetes Services for Big Data Clusters

Mohammad Darab explains why it’s a good idea to use Azure Kubernetes Service when building out a Big Data Cluster:

According to the Microsoft documentation, there are three ways to deploy a Big Data Cluster:

1. Minikube
2. Kubeadm
3. AKS

I’ll go into each and list the pros and cons.

Of course, if you have a great Kubernetes admin, on-prem is certainly a viable option, but AKS is definitely easier to get started with.

Comments closed

Creating Big Data Clusters with Azure Data Studio

Niels Berglund takes us through the creation of a Big Data Cluster by using Azure Data Studio to generate a notebook:

I wrote a blog post back in November 2018, about how to install and deploy SQL Server 2019 Big Data Cluster on Azure Kubernetes Service. Back then SQL Server 2019 Big Data Cluster was in private preview, (CTP 2.1 I believe), and you had to sign up, to get access to the “bits”. Well, you did not really get any “bits”; what you did get was access to Python deployment scripts.

Now, September 2019, the BDC is in public preview (you do not have to sign up), and it has reached Release Candidate (RC) status, RC 1. The install method has changed, or rather, in addition to installing via deployment scripts, you can now also install using Azure Data Studio deployment notebooks, and that is what this blog post is about.

Having gone through this myself, there’s quite a bit of reading involved in the setup, but the deployment notebooks make the process pretty smooth. This also shows off one of the key benefits of notebooks: documentation and code together.

Develop BDC PySpark Jobs in Visual Studio Code

Jenny Jiang announces a new capability in Visual Studio Code:

With the Visual Studio Code extension, you can enjoy native Python programming experiences such as linting, debugging support, language service, and so on. You can run current line, run selected lines of code, or run all for your PY file. You can import and export a .ipynb notebook and perform a notebook-like query including Run Cell, Run Above, or Run Below. You can also enjoy a notebook-like interactive experience that includes your source code and markdown comments along with the running results and output. You can remove the unneeded sections, enter comments, or type additional code in the interactive results window. Moreover, you can visualize your results in a graphic format through a matplotlib like Jupyter Notebook. The integration with SQL Server 2019 Big Data Clusters empowers you to quickly submit a PySpark batch job to the big data cluster and monitor job progress.

This is rather useful for developers, though I greatly prefer the Azure Data Studio notebook interface.

SQL Server 2019 RC 1.1

Amit Banerjee announces a minor numeric change and a big update to SQL Server 2019 RC1:

In continuation with our announcement of SQL Server 2019 release candidate last week, we’re announcing that the release candidate refresh for SQL Server 2019 is now available to download. The release candidate now includes bits for Big Data Clusters in SQL Server 2019 in this refresh.

Back in July, we announced the preview of Big Data Clusters in SQL Server 2019 and since then we’ve seen our customers actively bringing their big data analytical workloads to SQL Server 2019 to operationalize their AI and machine learning projects.

Read on for more.

“Big” Data

Buck Woody explains that “Big Data” is just data:

A few years ago it was all the rage to talk about “Big Data”. Lots of descriptions of “Big Data” popped up, including the “V’s” (Variety, Velocity, Volume, etc.) that proved very helpful. I even have my own definition:

Big Data is any data you can’t process
in the time you want
with the systems you have

This post is quite reasonable in its depiction of the problem. I extend the definition a bit further and also consider how difficult the data is to process. Nonetheless, read Buck’s full thoughts and check out the Big Data Clusters workshop.
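
Buck’s definition can be read as a simple test: data is “big” when the systems you have cannot get through it in the time you want. Here is a minimal Python sketch of that reading. The Workload class, its field names, and the sample figures are all hypothetical, and treating throughput as one steady number is a simplifying assumption rather than anything from Buck’s post.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of one processing job."""
    data_gb: float                 # how much data there is to process
    throughput_gb_per_hour: float  # what the systems you have can sustain
    deadline_hours: float          # the time you want it done in


def is_big_data(w: Workload) -> bool:
    """Return True when the current systems cannot finish within the deadline."""
    if w.throughput_gb_per_hour <= 0:
        return True  # no usable capacity, so nothing finishes in time
    hours_needed = w.data_gb / w.throughput_gb_per_hour
    return hours_needed > w.deadline_hours


# Illustrative numbers only: the same hardware, two very different data volumes.
nightly_load = Workload(data_gb=500, throughput_gb_per_hour=100, deadline_hours=8)
clickstream = Workload(data_gb=50_000, throughput_gb_per_hour=100, deadline_hours=8)

print(is_big_data(nightly_load))
print(is_big_data(clickstream))
```

Under this reading, the same data set stops being “big” as soon as the available systems get fast enough or the deadline loosens, so the label describes the fit between data and platform and not the data alone.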

Deploying a Big Data Cluster

Mohammad Darab takes us through the Big Data Cluster deployment process using Azure Data Studio:

I’ve been “playing around” with Big Data Clusters for some time now and CTP 3.2 is way ahead when it comes to streamlining the BDC deployment process. You can check out my 4-part series on deploying BDC on AKS to see how cumbersome the process used to be. New in CTP 3.2, you can deploy a BDC on AKS (an existing cluster OR a new cluster) using an Azure Data Studio notebook. Let’s see how.

Click through for instructions. It was rather smart of Microsoft to release the instructions as a notebook.

Azul Java in SQL Server 2019

Travis Wright announces support for Azul Systems’ Java distribution in SQL Server 2019:

In September 2018, Microsoft announced a new partnership with Azul Systems, a leading Java open source contributor and distributor. This partnership allows for all Azure customers to use Azul’s Zulu for Azure – Enterprise distribution of Java for free with support jointly provided by Microsoft and Azul. That’s right – supported for free.

Today, we are announcing that we have extended that partnership to cover SQL Server. Starting in the SQL Server 2019 community technology preview (CTP) 3.2 that was released today, we are including Azul Systems’ Zulu Embedded right out of the box for all scenarios where Java is used in SQL Server – in PolyBase, Apache Spark, Java extensibility, and more. There is no additional cost beyond what you pay for SQL Server.

This is interesting. We’ll have to see whether the CTP 3.2 installer stops asking for JDK 1.8 and simply installs the Azul Systems version instead.

Monitoring Big Data Clusters

Mohammad Darab continues a series on Big Data Clusters:

There are many ways to view the health of your Big Data Cluster. As of CTP 3.0, there are kubectl commands, mssqlctl commands as well as dashboards. For the sake of this series, I will focus on the dashboards. I will blog about some of the useful kubectl and mssqlctl commands in later posts.

The first dashboard is the Microsoft Cluster Administration portal (see below snapshot). This is a view into the Big Data Cluster Controller. As you can see from the image below, the Overview pane shows the Controller, Master Instance and all the pools. On the left hand side you can see more details. If you click on the “Service Endpoint” option, you will see a list of endpoints that you can bookmark.

Something I appreciate is that Microsoft thought ahead on what the monitoring story should look like rather than waiting until the end and slapping something together.
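
Mohammad saves the kubectl commands for later posts, but the most basic health check is listing the pods in the cluster’s namespace. Here is a small Python sketch that wraps the standard kubectl get pods command. The namespace name my-bdc is a placeholder, since the real name is whatever was chosen at deployment time, and the helper function names are my own and not part of any Microsoft tooling.

```python
import shutil
import subprocess
from typing import List, Optional


def pod_status_command(namespace: str) -> List[str]:
    """Build the kubectl invocation that lists the pods in one namespace."""
    return ["kubectl", "get", "pods", "--namespace", namespace]


def list_pods(namespace: str) -> Optional[str]:
    """Run the command and return its output, or None if it cannot be run."""
    if shutil.which("kubectl") is None:
        return None  # kubectl is not installed or not on PATH
    try:
        result = subprocess.run(
            pod_status_command(namespace),
            capture_output=True,
            text=True,
            check=False,
            timeout=30,
        )
    except subprocess.TimeoutExpired:
        return None  # cluster did not answer in time
    return result.stdout if result.returncode == 0 else None


if __name__ == "__main__":
    # "my-bdc" is a placeholder namespace; substitute the name of your own cluster.
    output = list_pods("my-bdc")
    print(output if output is not None else "kubectl unavailable or cluster unreachable")
```

Pods that are not in a Running state in that listing are a reasonable first place to look before opening the dashboards.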

SQL Server 2019 CTP 3.1 Released

Anshul Rampal announces CTP 3.1 of SQL Server 2019:

The big data clusters feature continues to add key capabilities for its initial release in SQL Server 2019. This month, the release extends the Apache Spark™ functionality for the feature by supporting the ability to read and write to data pool external tables directly as well as a mechanism to scale compute separately from storage for compute-intensive workloads. Both enhancements should make it easier to integrate Apache Spark™ workloads into your SQL Server environment and leverage each of their strengths. Beyond Apache Spark™, this month’s release also includes machine learning extensions with MLeap where you can train a model in Apache Spark™ and then deploy it for use in SQL Server through the recently released Java extensibility functionality in SQL Server CTP 3.0. This should make it easier for data scientists to write models in Apache Spark™ and then deploy them into production SQL Server environments for both periodic training and full production against the trained model in a single environment.

Click through to learn more about what has changed.

Building a Big Data Cluster

Mohammad Darab continues a series on SQL Server Big Data Clusters in Azure Kubernetes Service:

To kick off the Big Data Cluster “Default configuration” creation, we will execute the following PowerShell command:

mssqlctl cluster create

That will first prompt us to accept the license terms. Type y and Enter. 

Mohammad takes us through the default installation, which requires only a few parameters before it can go on its merry way.
