
Day: March 1, 2022

Thoughts on the Long Run: PolyBase

I have some thoughts on a recent announcement:

We could see the writing on the wall here ever since Cloudera and Hortonworks merged. Cloudera Distribution of Hadoop (CDH) and Hortonworks Data Platform (HDP) were both on-premises offerings that you could also get in the cloud. Post-merger, Cloudera Data Platform (CDP) was cloud-only and, to my knowledge, they have never released an on-premises version. Cloud versus on-premises isn’t itself the issue, but it does tie into it: in order for PolyBase to work, certain ports need to be exposed on your Hadoop cluster. Cloud offerings tend not to want to expose a bunch of ports to internal services, and so PolyBase to CDP was a non-starter.

It’s about 30% bad news, 50% good news, and 20% meh news. Click through for the longer-form version of that.
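
To make the port requirement a bit more concrete, here is a minimal sketch of checking whether a Hadoop host accepts TCP connections on a given set of ports. The port names and numbers are assumptions for illustration only (the ports PolyBase actually needs depend on the Hadoop distribution and its configuration), and `port_is_open` and `check_ports` are hypothetical helper names.

```python
import socket

# Assumed ports for illustration; real values depend on the Hadoop
# distribution and how the cluster is configured.
ASSUMED_HADOOP_PORTS = {"NameNode RPC": 8020, "DataNode transfer": 50010}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports: dict[str, int]) -> dict[str, bool]:
    """Map each named port to whether it is reachable from this machine."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Running `check_ports("my-hadoop-host", ASSUMED_HADOOP_PORTS)` from the SQL Server machine (the host name is a placeholder) would show at a glance which ports a firewall or cloud service is keeping closed.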

Comments closed

AutoML with pycaret

Brendan Tierney looks at the pycaret library:

In this post we will have a look at using the AutoML feature in the Pycaret Python library. AutoML is a popular topic and allows Data Scientists and Machine Learning people to develop potentially optimized models based on their data, all while requiring minimal input from the Data Scientist. As with all AutoML solutions, care is needed on the eventual use of these models. With various ML and AI legal requirements around the world, it might not be possible to use the output from AutoML in production. Instead, it gives Data Scientists guidance on creating an optimized model, which can then be deployed in production. This facilitates requirements around model explainability, transparency, human oversight, fairness, risk mitigation and human in the loop.

Read on for a tutorial as well as additional resources.


Killing Blocking SPIDs with XESmartTarget

Gianluca Sartori continues a series on XESmartTarget:

For this post, the problem to solve is this: a session has an open transaction, is blocking other sessions, and has been sleeping for a long time, so it’s probably a good idea to kill it. This usually happens when there’s a problem in the application, which doesn’t handle transactions properly and leaves them open for a long time, maybe because it displays an error dialog and waits for user input. There is very little that you can do in these cases: the blocked processes continue to pile up and the only thing left to do is kill the offending session.

Let’s see how to do that with XESmartTarget.

Let’s, shall we?
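
The selection criteria in that excerpt (sleeping, open transaction, blocking others, idle for a long time) can be sketched in a few lines. This is not XESmartTarget configuration and not Gianluca's implementation; the `Session` fields, the `sessions_to_kill` helper, and the 300-second default are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: int
    status: str                 # e.g. "sleeping" or "running"
    open_transactions: int      # number of open transactions
    idle_seconds: int           # time since the last request finished
    blocked_session_count: int  # how many sessions are waiting on this one


def sessions_to_kill(sessions, min_idle_seconds=300):
    """Return ids of sleeping sessions that hold an open transaction,
    block at least one other session, and have been idle too long."""
    return [
        s.session_id
        for s in sessions
        if s.status == "sleeping"
        and s.open_transactions > 0
        and s.blocked_session_count > 0
        and s.idle_seconds >= min_idle_seconds
    ]


sessions = [
    Session(52, "sleeping", 1, 900, 3),   # idle, open transaction, blocking
    Session(60, "running", 1, 5, 0),      # active and not blocking anyone
    Session(71, "sleeping", 0, 1200, 0),  # idle but harmless
]
print(sessions_to_kill(sessions))  # prints [52]
```

Requiring all four conditions at once is the cautious choice here: a sleeping session with no open transaction, or a busy session that happens to block briefly, is left alone.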


SQL Tools Updates

Timi Oshin has updates on SSMS and Azure Data Studio:

Azure Data Studio 1.35 now supports easier keyboard navigation in notebooks without mouse clicking. This is done by hitting the Esc key and navigating between cell rows using the Up and Down arrow keys. To enter edit mode, hit the Enter key on the keyboard. The new Table Designer preview feature supports creating new tables and editing existing tables on a connected SQL Server instance. This is a highly requested product enhancement and enables more productive schema management with a modern, streamlined UX.

Haha! It only took several years but my hectoring finally pays off. Now for the full set of Jupyter keyboard shortcuts…


On-Premises Scale-Out Post-Big Data Clusters

Chris Adkin looks at alternatives to SQL Server 2019 Big Data Clusters:

This post assumes that, for data sovereignty, fiduciary, or regulatory reasons in general, the:

– analytics platform will be underpinned by something which is cloud and on premises infrastructure agnostic, Kubernetes in other words.

– focal points of the Data Lake processing element will be Python and open source tools

– SQL Server 2022 S3 object virtualisation is the preferred technology for querying the Data Lake via a T-SQL surface area

– S3 is the preferred technology for storing the data in our Data Lake.

Read on for the high-level solution and stay tuned for more detailed answers.


Zero Records but Lots of Space Used

Jeff Iannucci solves a riddle:

Anyhow, it’s worthwhile to occasionally review the tables in a database to see which ones are growing every day, using the most space.

But what if during a review you see the largest table looks like this?

That’s around 24 GB of sweet drive space allocated for 0 records. But…how?

Let me show you how.

Click through to see how. My initial thought was LOB craziness but Jeff’s example doesn’t even need that.


Deleting an RDS Instance

Chad Callihan takes out the trash:

We’ve created an AWS RDS instance and logged into it successfully. One thing to remember when creating test instances is to delete them when you’re finished. While a lot of test instances I’ve created have been free tier, it’s still good to clean up rather than leave instances lingering. Today, let’s clean up a test instance.

Click through for the step-by-step on how to do this.
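
If you would rather script the cleanup, here is a rough sketch using a boto3-style RDS client. It is not the method from Chad's post. `delete_test_instance` is a hypothetical helper, the `-final` snapshot naming is an assumption, and the only real API call relied on is the RDS client's `delete_db_instance` with its `DBInstanceIdentifier`, `SkipFinalSnapshot`, and `FinalDBSnapshotIdentifier` parameters.

```python
def delete_test_instance(rds_client, instance_id, keep_final_snapshot=False):
    """Delete an RDS instance through a boto3-style RDS client.

    rds_client is expected to behave like boto3.client("rds").
    """
    params = {"DBInstanceIdentifier": instance_id}
    if keep_final_snapshot:
        # Assumed naming convention for the snapshot; pick your own.
        params["SkipFinalSnapshot"] = False
        params["FinalDBSnapshotIdentifier"] = f"{instance_id}-final"
    else:
        # Test instances usually do not need a final snapshot.
        params["SkipFinalSnapshot"] = True
    return rds_client.delete_db_instance(**params)
```

With boto3 installed and credentials configured, a call might look like `delete_test_instance(boto3.client("rds"), "my-test-instance")`, where the instance name is a placeholder. Passing the client in as an argument keeps the function easy to try against a stub before pointing it at a real account.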


Choosing a Cost Threshold for Parallelism

Andrea Allred doesn’t want to do things by the numbers:

Cost Threshold for Parallelism (CTfP) is one of my favorite server-level settings in SQL Server. I remember the first time I heard this setting mentioned by Grant Fritchey. I quickly hopped on my servers and found them all set at the default (5) and adjusted them to 50 for the non-SSRS servers and 30 for the SSRS ones. That was many years ago, but I had kept those numbers in my head because I didn’t know a better way.

Read on for a better way.
