Press "Enter" to skip to content

Category: Big Data Clusters

On-Premises Scale-Out Post-Big Data Clusters

Chris Adkin looks at alternatives to SQL Server 2019 Big Data Clusters:

This post assumes that, for reasons relating to data sovereignty or to fiduciary and regulatory concerns in general, the:

– analytics platform will be underpinned by something which is agnostic of both cloud and on-premises infrastructure; Kubernetes, in other words

– focal points of the Data Lake processing element will be Python and open source tools

– SQL Server 2022 S3 object virtualisation is the preferred technology for querying the Data Lake via a T-SQL surface area

– S3 is the preferred technology for storing the data in our Data Lake.

Read on for the high-level solution and stay tuned for more detailed answers.
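To make the third and fourth assumptions concrete, here is a minimal sketch of what querying parquet files in an S3 bucket through the SQL Server 2022 data virtualisation surface area might look like, driven from Python via pyodbc. The server, bucket, credentials, and paths are all placeholders, and the exact T-SQL options (wildcards, ports, and so on) should be verified against the SQL Server 2022 documentation.

    # Minimal sketch: query parquet files stored in S3 through SQL Server 2022
    # data virtualisation. Server name, bucket, credentials and paths are
    # placeholders; assumes a database master key already exists.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=sql2022.example.com;"
        "DATABASE=datalake;UID=lake_reader;PWD=<password>;TrustServerCertificate=yes",
        autocommit=True,
    )
    cursor = conn.cursor()

    # One-time setup: a credential for the bucket and an external data source.
    cursor.execute("""
        CREATE DATABASE SCOPED CREDENTIAL s3_cred
        WITH IDENTITY = 'S3 Access Key',
             SECRET = '<access_key_id>:<secret_access_key>';
    """)
    cursor.execute("""
        CREATE EXTERNAL DATA SOURCE s3_lake
        WITH (LOCATION = 's3://s3.example.com:9000/lake-bucket',
              CREDENTIAL = s3_cred);
    """)

    # Ad hoc query over the Data Lake through a T-SQL surface area.
    for row in cursor.execute("""
        SELECT TOP 10 *
        FROM OPENROWSET(BULK '/sales/2022/*.parquet',
                        DATA_SOURCE = 's3_lake',
                        FORMAT = 'PARQUET') AS sales;
    """):
        print(row)

The same external data source can also back CREATE EXTERNAL TABLE definitions (with a parquet external file format) if you would rather expose the Data Lake as regular-looking tables.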

SQL Server Analytics Updates

The SQL Server team drops bad news on a Friday:

Today, we are announcing the retirement of PolyBase scale-out groups in Microsoft SQL Server. Scale-out group functionality will be removed from the product in SQL Server 2022. In-market SQL Server 2019, 2017, and 2016 will continue to support the functionality to the end of support for those products.

In addition to killing Big Data Clusters, they’re also killing the Java connector in PolyBase and scale-out groups. I have a blog post coming up today on the topic with my full set of thoughts. The short version is, “Mostly not bad, though losing scale-out groups sucks.”

Goodbye, Big Data Clusters

Mohammad Darab has a few thoughts:

I wanted to publish a quick note about something a little near and dear to me. As of today, Feb 25th, 2022, Microsoft made the official announcement that they are retiring SQL Server Big Data Clusters. You can read the full statement here.

It is a bit disappointing to see the lack of take-up here. Personally, I fought to get this set up in my environment but wasn’t able to convince management, and I’m guessing a fair amount of that same story elsewhere is what kept adoption low.

New in SQL Server Big Data Clusters

Daniel Coelho has an update on what’s available in SQL Server Big Data Clusters:

SQL Server Big Data Clusters (BDC) is a capability brought to market as part of the SQL Server 2019 release. Big Data Clusters extends SQL Server’s analytical capabilities beyond in-database processing of transactional and analytical workloads by uniting the SQL engine with Apache Spark and Apache Hadoop to create a single, secure, and unified data platform. It runs exclusively on Linux containers, orchestrated by Kubernetes, and can be deployed on multiple cloud providers or on-premises.

Today, we’re proud to announce the release of the latest cumulative update, CU13, for SQL Server Big Data Clusters, which includes important changes and capabilities:

Updating to the most recent production-ready version of Spark (as of today) is a nice upgrade.
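As a rough illustration of the architecture described above (the SQL engine united with Spark and HDFS on Kubernetes), here is a minimal PySpark sketch of the kind of job that runs against a Big Data Cluster’s storage pool. The paths and column names are placeholders.

    # Sketch of a PySpark job inside a Big Data Cluster: read raw files from the
    # storage pool's HDFS, aggregate them, and write the result back as parquet
    # for downstream querying. Paths and column names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("bdc-clickstream-rollup").getOrCreate()

    raw = spark.read.option("header", "true").csv("hdfs:///clickstream/raw/2022/*.csv")

    rollup = (
        raw.groupBy("product_id")
           .agg(F.count("*").alias("events"),
                F.countDistinct("session_id").alias("sessions"))
    )

    # Write the aggregate back to HDFS, where it can be exposed to the SQL engine
    # (for example through an external table over the storage pool).
    rollup.write.mode("overwrite").parquet("hdfs:///clickstream/curated/product_rollup")

    spark.stop()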

Updates to SQL Server Big Data Clusters

Rahul Ajmera fills us in on what they’ve been doing with SQL Server Big Data Clusters:

Today, we’re announcing the release of the latest cumulative update (CU9) for SQL Server Big Data Clusters, which includes important capabilities:

– Support to configure BDC post deployment.
– Improved experience for encryption at rest.
– Ability to install Python packages at Spark job submission time.
– Upgraded software versions for most of our OSS components (Grafana, Kibana, FluentBit, etc.) to ensure Big Data Clusters images are up to date with the latest enhancements and fixes.
– Miscellaneous improvements and bug fixes.

This announcement highlights some of the major improvements, provides additional context to better understand the design behind these capabilities, and points you to relevant resources to learn more and get started.

Click through for more detail on a few of the items.
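For the third bullet above (installing Python packages at Spark job submission time), a submission against the cluster’s Livy endpoint looks roughly like the following. The gateway URL, credentials, and in particular the configuration key used to request packages are placeholders; the CU9 documentation has the exact settings.

    # Sketch: submit a PySpark batch to a Big Data Cluster's Livy endpoint with
    # extra Spark configuration. The URL, credentials, and the configuration key
    # for package installation are placeholders; check the CU9 documentation.
    import json
    import requests

    livy_url = "https://<gateway-endpoint>:30443/gateway/default/livy/v1/batches"

    payload = {
        "file": "hdfs:///jobs/train_model.py",
        "name": "train-model-with-packages",
        "conf": {
            # Placeholder for whatever setting the CU9 feature expects, e.g. a
            # list of PyPI packages to make available to the job at runtime.
            "<python-package-config-key>": "pandas==1.1.5,scikit-learn==0.24.2",
        },
    }

    resp = requests.post(
        livy_url,
        data=json.dumps(payload),
        headers={"Content-Type": "application/json"},
        auth=("<username>", "<password>"),
        verify=False,  # gateways in test rigs frequently use self-signed certificates
    )
    resp.raise_for_status()
    print("Submitted batch id:", resp.json()["id"])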

Looking at BDC in Kubernetes with Lens

Mohammad Darab shows off a tool to monitor the Kubernetes cluster driving a Big Data Cluster:

I don’t recall how I came across this Kubernetes IDE called Lens, but all I know is it’s cool as heck! It connects to a Kubernetes cluster (using the kube config file) and gives you an in-depth view of all the different Kubernetes objects, their associated yaml files, health/metrics, etc. In this blog post I will show you how we can look into a Big Data Cluster’s Kubernetes infrastructure using Lens.

Click through for instructions on installation, as well as how to use the product.
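If you want a scripted view alongside a GUI like Lens, the same kube config file can drive the Kubernetes Python client. A minimal sketch, assuming the default BDC namespace name of mssql-cluster:

    # Minimal sketch: list the pods backing a Big Data Cluster using the same
    # kube config file that Lens reads. "mssql-cluster" is the default BDC
    # namespace; adjust it if your cluster was deployed under another name.
    from kubernetes import client, config

    config.load_kube_config()  # reads ~/.kube/config by default
    v1 = client.CoreV1Api()

    for pod in v1.list_namespaced_pod(namespace="mssql-cluster").items:
        ready = sum(1 for c in (pod.status.container_statuses or []) if c.ready)
        total = len(pod.spec.containers)
        print(f"{pod.metadata.name:40} {pod.status.phase:10} {ready}/{total} ready")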

Stopping and Starting an Azure Kubernetes Service Cluster

Mohammad Darab wants to save some cash (or at least Azure credits):

I remember when I first started deploying Big Data Clusters, they were on Azure Kubernetes Service, utilizing the $200 credit for first-time sign-ups. By the time I got around to figuring out how to deploy the BDC, not only was my $200 credit gone, but I had started to incur costs out of pocket.

If only there were a feature that would allow me to stop the VMs in AKS whenever I wasn’t using them. Well, I’m excited to share that Microsoft AKS (Azure Kubernetes Service) came out with a neat feature (in preview at the time of publishing this post) that allows you to stop and start your AKS cluster by running a simple command. Of course I had to try it out on BDCs and, to my surprise, it worked. Well, sort of. Let me explain…

Read on for more information, as well as current limitations.
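The feature in question comes down to two Azure CLI commands, az aks stop and az aks start. A small sketch of wrapping them from Python, with placeholder resource group and cluster names:

    # Sketch: stop an AKS cluster when it is not in use and start it again later,
    # via the Azure CLI's "az aks stop" / "az aks start" commands. Resource group
    # and cluster names are placeholders; assumes a prior "az login".
    import subprocess

    RESOURCE_GROUP = "bdc-rg"
    CLUSTER_NAME = "bdc-aks"

    def set_aks_power_state(action: str) -> None:
        """action is either 'stop' or 'start'."""
        subprocess.run(
            ["az", "aks", action,
             "--resource-group", RESOURCE_GROUP,
             "--name", CLUSTER_NAME],
            check=True,
        )

    if __name__ == "__main__":
        set_aks_power_state("stop")  # swap to "start" to bring the cluster back up

Given the “well, sort of” caveat in the post, this is best treated as a dev/test cost saver rather than something to lean on in production.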

OpenShift and SQL Server Big Data Clusters

Chris Adkin explains why support for OpenShift is important for SQL Server Big Data Clusters:

One thing that should become immediately apparent when installing and administering an OpenShift cluster is that it is a lot more prescriptive and opinionated than vanilla Kubernetes. The simple reason for this is that OpenShift is intended to be deployed to environments that require enterprise-grade levels of hardening and security. For example, Red Hat mandates the operating system distributions you must use, to the extent that, when deploying a cluster on VMware, Red Hat’s documentation recommends the use of OVAs: compressed files containing installable virtual machines.

Read on for the full story.

Planning for a Big Data Cluster

Chris Adkin has started a series on SQL Server Big Data Clusters:

Proposing the idea of using virtual machines as Kubernetes cluster nodes to a Kubernetes purist is likely to be met with consternation. However, the different nodes in your cluster have different resource requirements: a master node can get away with as little as 2 GB of memory and 2 logical processors, while worker nodes require much more. A best practice is never to run applications on master nodes in production. The view of the world from a Kubernetes purist is that Kubernetes is designed to obviate the need for virtualization. But consider that if you do go down the bare-metal route, it’s unlikely that you are going to purchase blades or servers with 2 GB of memory and 2 CPU cores, so at the very least consider the use of virtual machines to host master nodes on. For organizations that have standardized on a software-defined virtualized infrastructure, Kubernetes will run perfectly happily on this. Virtualization also provides the fastest means of rapidly provisioning environments: simply create a virtual machine template and base your cluster node hosts on it.

Click through for more guidance around what you need to know before you deploy a cluster.
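As a quick way to sanity-check the sizing and “no applications on master nodes” guidance above, the Kubernetes Python client can report each node’s capacity and any control-plane taints. A minimal sketch (taint keys vary by Kubernetes version and distribution):

    # Sketch: report CPU/memory capacity and control-plane taints for each node,
    # to verify the "small, tainted masters; larger workers" layout. Taint keys
    # vary by Kubernetes version and distribution.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        cap = node.status.capacity  # e.g. {'cpu': '2', 'memory': '2037460Ki', ...}
        taints = [t.key for t in (node.spec.taints or [])]
        control_plane = any("master" in k or "control-plane" in k for k in taints)
        role = "control plane" if control_plane else "worker"
        print(f"{node.metadata.name:30} {role:13} cpu={cap['cpu']:>3} memory={cap['memory']}")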
