Press "Enter" to skip to content

Category: Big Data Clusters

Building an AKS Cluster

Mohammad Darab continues a series on Big Data Clusters by creating a Kubernetes pod in Azure Kubernetes Service:

Next, we will create a resource group by executing the following command:
az group create –name nameOfMyresourceGroup –location eastus2

Once you execute the above command, you can go into the Azure portal and refresh your resource group pane and see the newly created resource group.

Once that is setup, it’s time to create the actual Kubernetes cluster.

Click through for the full set of instructions.

Comments closed

An Intro to SQL Server Big Data Clusters

Mohammad Darab has a series on Big Data Clusters. Part zero explains what they are:

The *absolute* unique feature that BDCs offer that *no* other company or product offers is: Data Virtualization

The new and enhanced SQL Server 2019 Polybase feature comes with connectors to many different data sources: Oracle, Teradata, Apache Spark, MongoDB, Azure Cosmos DB, and even ODBC connectivity to IBM’s DB2, SAP HANA and excel (see image below)

Part one shows how to set one up:

So far, Microsoft does not have a simple way to create a Big Data Cluster. It’s a bit cumbersome of a process and the learning curve is a bit steep. However, Microsoft is currently working on making it easier to deploy a Big Data Cluster via Notebook in Azure Data Studio and eventually some type of “deployment wizard.” But for now, the only option is to do it the long way.

The series will continue, but check out the setup work.

Comments closed

SQL Server 2019 CTP 3.0

The SQL Server team has announced the latest CTP for SQL Server 2019:

Big data clusters
– Scale out by supporting deployment configurations with an increased number of SQL Server instances in the compute pool. You can now specify up to 4 instances in the compute pool for optimal performance of your queries against data pool, storage pool, or other external data sources.
The mssqlctl utility includes updates to ease the big data cluster management experience with enhancements to the login experience. There is also a new command to list the cluster endpoints.
Persistent volumes abstract the details of how the storage is provided and how it’s consumed. In this release, we’re enhancing the supported storage configurations by enabling you to customize storage classes independently for logs and data. With these changes, we also consolidated the storage configurations for big data components, so that the number of persistent volume claims for a big data cluster has been reduced for a default minimum configuration cluster.

There are a few other changes announced in this CTP. Now that we’re at 3.0, the light is at the end of the tunnel.

Comments closed

Learning About Big Data Clusters

Kevin Chant shares resources for getting started with SQL Server Big Data Clusters:

In a previous post I shared current SQL Server 2019 learning resources, which you can view in detail here.

However, SQL Server 2019 Big Data Clusters are very involved. So, I thought I better dedicate a whole post to further learning resources for it.

Because some people have different learning methods I have included references to both documents and videos in this post. In addition, I have created the below links in case somebody wants to go directly to a specific section.

Kevin’s put together quite a few useful links here.

Comments closed

Developing Big Data Cluster Spark Jobs with IntelliJ

Jenny Jiang shows how we can use IntelliJ IDEA to develop Spark jobs against SQL Server Big Data Clusters:

We’re delighted to release the Azure Toolkit for IntelliJ support for SQL Server Big Data Cluster Spark job development and submission. For first-time Spark developers, it can often be hard to get started and build their first application, with long and tedious development cycles in the integrated development environment (IDE). This toolkit empowers new users to get started with Spark in just a few minutes. Experienced Spark developers also find it faster and easier to iterate their development cycle.

The toolkit extends IntelliJ support for the Spark job life cycle starting from creation, authoring, and debugging, through submission of jobs to SQL Server Big Data Clusters. It enables you to enjoy a native Scala and Java Spark application development experience and quickly start a project using built-in templates and sample code. The integration with SQL Server Big Data Cluster empowers you to quickly submit a job to the big data cluster as well as monitor its progress. The Spark console allows you to check schemas, preview data, and validate your code logic in a shell-like environment while you can develop Spark batch jobs within the same toolkit.

It looks pretty good from my vantage point.

Comments closed

Creating A Big Data Cluster

Chris Adkin continues a series on big data clusters in SQL Server 2019:

This post post will focus on creating a big data cluster so that you can get up and running as fast as possible, as such the storage type used will be ephemeral, this perfectly acceptable for “Kicking the tyres”. For production grade installations integration with a production grade storage platform is required via a storage plugin. Before we create our cluster, with the assumption we are doing this with an on premises infrastructure, the following pre-requisites need to be met:

Read the whole thing, but wait until part 4 before putting anything valuable in it.

Comments closed

Deploying SQL Server 2019 Big Data Clusters With Kubernetes

Chris Adkin has the start of a new series:

Minikube is a good learning tool and Microsoft provides instructions for deploying a big data cluster to this ‘Platform’. However, its single node nature and the fact that application pods run on the master node means that this does not reflect a cluster that anyone would run in production. Kubernetes-as-a-service is probably by far the easiest option for spinning a cluster up, however it relies on an Aws, Azure or Google Cloud Platform account, hence there is a $ cost associated with this. This leaves a vanilla deployment of Kubernetes on premises. Based on the assumption that most people will have access to Windows server version 2008 or above, a relatively cheap and way of deploying a Kubernetes cluster is via Linux virtual machines running on Hyper-V. This blog post will provide step by step instructions for creating the virtual machines to act as the master and worker nodes in the cluster. 

This is going on my “try this out when I have time” list.  So expect a full report sometime in the year 2023.

Comments closed

Technologies Surrounding Big Data Clusters In SQL Server 2019

Buck Woody has some long-term homework for people:

Some of these technologies and concepts are not owned or created by Microsoft – the concepts are universal, and a few of the technologies are open-source. I’ve marked those in italics.

I’ve also included a few links to a training resource I’ve found to be useful. I normally use LinkedIn Learning for larger courses, along with EdX, DataCamp, and many other platforms for in-depth training. The links I have indicated here are by no means exhaustive, but they are free, and provide a good starting point.

Click through for a list of some of the technologies in play.

Comments closed