It’s been a little while since I’ve pulled one of these but Curated SQL is taking the day off. We’ll be back tomorrow, however: same top-hat time, same top-hat channel.
Comments closed
Curated SQL Posts
I have some thoughts on a recent announcement:
We could see the writing on the wall here ever since Cloudera and Hortonworks merged. Cloudera Distribution of Hadoop (CDH) and Hortonworks Data Platform (HDP) were both on-premises offerings that you could also get in the cloud. Post-merger, Cloudera Data Platform (CDP) was cloud-only and, to my knowledge, they have never released an on-premises version. Cloud versus on-premises isn’t itself the issue but it does tie in with the issue: in order for PolyBase to work, certain ports need to be exposed on your Hadoop cluster. Cloud offerings tend not to want to expose a bunch of ports to internal services and so PolyBase to CDP was a non-starter.
It’s about 30% bad news, 50% good news, and 20% meh news. Click through for the longer-form version of that.
Brendan Tierney looks at the pycaret library:
In this post we will have a look at using the AutoML feature in the Pycaret Python library. AutoML is a popular topic and allows Data Scientists and Machine Learning people to develop potentially optimized models based on their data, all while requiring minimal input from the Data Scientist. As with all AutoML solutions, care is needed on the eventual use of these models. With various ML and AI legal requirements around the world, it might not be possible to use the output from AutoML in production. Instead, it gives the Data Scientist guidance on creating an optimized model, which can then be deployed in production. This facilitates requirements around model explainability, transparency, human oversight, fairness, risk mitigation and human in the loop.
Read on for a tutorial as well as additional resources.
The Hadoop in Real World team moves data around in Elasticsearch:
In this post we will describe how to copy an index and its contents to a new index in Elasticsearch.
We currently have an index named account. We are going to copy the account index and its content to another index named account_v2 using the reindex API.
Click through to see how.
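As a rough illustration of the reindex call described above, here is a minimal Python sketch that builds the `_reindex` request for copying `account` to `account_v2`. The cluster URL (`http://localhost:9200`) and the helper name `build_reindex_request` are assumptions for illustration, not something from the linked post, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_reindex_request(base_url, source_index, dest_index):
    """Build (but do not send) a POST request to the _reindex endpoint."""
    body = {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local, unsecured cluster; adjust the URL and add auth for a real one.
request = build_reindex_request("http://localhost:9200", "account", "account_v2")

# Sending is left commented out so the sketch has no side effects:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Reindexing copies documents only; the destination index's mappings and settings are worth creating up front if they need to match the source.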
Gianluca Sartori continues a series on XESmartTarget:
For this post, the problem to solve is this: a session has an open transaction, is blocking other sessions, has been sleeping for a long time, and it’s probably a good idea to kill it. This usually happens when there’s a problem in the application: it doesn’t handle transactions properly and leaves them open for a long time, maybe because it displays an error dialog and waits for user input. There is very little that you can do in these cases: the blocked processes continue to pile up and the only thing left to do is kill the offending session.
Let’s see how to do that with XESmartTarget.
Let’s, shall we?
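To make the selection rule in that description concrete, here is a small Python sketch of the filter: sleeping, holding an open transaction, blocking someone, and idle past a threshold. This is not how XESmartTarget does it. The `Session` class and `sessions_to_kill` function are hypothetical, with field names chosen to mirror columns in `sys.dm_exec_sessions`, and the blocker IDs assumed to come from `blocking_session_id` in `sys.dm_exec_requests`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Session:
    # Field names mirror sys.dm_exec_sessions columns (illustrative only).
    session_id: int
    status: str
    open_transaction_count: int
    last_request_end_time: datetime


def sessions_to_kill(sessions, blocking_ids, now, max_idle):
    """Return IDs of sleeping sessions that hold an open transaction,
    block at least one other session, and have been idle >= max_idle."""
    return [
        s.session_id
        for s in sessions
        if s.status == "sleeping"
        and s.open_transaction_count > 0
        and s.session_id in blocking_ids
        and now - s.last_request_end_time >= max_idle
    ]


now = datetime(2022, 3, 1, 12, 0, 0)
sessions = [
    Session(51, "sleeping", 1, now - timedelta(minutes=45)),  # idle blocker
    Session(52, "running", 1, now - timedelta(minutes=45)),   # still working
    Session(53, "sleeping", 0, now - timedelta(hours=2)),     # no open transaction
    Session(54, "sleeping", 1, now - timedelta(minutes=1)),   # not idle long enough
]
candidates = sessions_to_kill(
    sessions, blocking_ids={51, 52, 54}, now=now, max_idle=timedelta(minutes=30)
)
```

Each returned ID would then be a candidate for a `KILL` statement; the 30-minute threshold is an arbitrary example value.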
Timi Oshin has updates on SSMS and Azure Data Studio:
Azure Data Studio 1.35 now supports easier keyboard navigation in notebooks without mouse clicking. This is done by hitting the Esc key and navigating between cell rows using the Up and Down arrow keys. To enter edit mode, hit the Enter key on the keyboard. The new Table Designer preview feature supports creating new tables and editing existing tables on a connected SQL Server instance. This is a highly requested product enhancement and enables more productive schema management with a modern, streamlined UX.
Haha! It only took several years but my hectoring finally pays off. Now for the full set of Jupyter keyboard shortcuts…
Chris Adkin looks at alternatives to SQL Server 2019 Big Data Clusters:
This post assumes that, for data sovereignty, fiduciary, or regulatory reasons in general, the:
– analytics platform will be underpinned by something which is cloud and on premises infrastructure agnostic, Kubernetes in other words.
– focal points of the Data Lake processing element will be Python and open source tools
– SQL Server 2022 S3 object virtualisation is the preferred technology for querying the Data Lake via a T-SQL surface area
– S3 is the preferred technology for storing the data in our Data Lake.
Read on for the high-level solution and stay tuned for more detailed answers.
Jeff Iannucci solves a riddle:
Anyhow, it’s worthwhile to occasionally review the tables in a database to see which ones are growing every day, using the most space.
But what if during a review you see the largest table looks like this?
That’s around 24 GB of sweet drive space allocated for 0 records. But…how?
Let me show you how.
Click through to see how. My initial thought was LOB craziness but Jeff’s example doesn’t even need that.
Chad Callihan takes out the trash:
We’ve created an AWS RDS instance and logged into it successfully. One thing to remember when creating test instances is to delete them when you’re finished. While a lot of test instances I’ve created have been free tier, it’s still good to clean up rather than leave instances lingering. Today, let’s clean up a test instance.
Click through for the step-by-step on how to do this.
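For those who prefer scripting the cleanup, a minimal Python sketch follows. It assumes a boto3 RDS client is passed in; the helper name `delete_test_instance` and the identifier `my-test-instance` are hypothetical, and the linked post may well use the AWS console instead. Skipping the final snapshot is an assumption that only makes sense for throwaway test instances.

```python
def delete_test_instance(rds_client, instance_id):
    """Delete an RDS instance without keeping a final snapshot.

    rds_client is expected to behave like boto3.client("rds").
    """
    return rds_client.delete_db_instance(
        DBInstanceIdentifier=instance_id,
        SkipFinalSnapshot=True,       # assumption: test data is disposable
        DeleteAutomatedBackups=True,  # also remove automated backups
    )


# Usage (requires boto3 and configured AWS credentials):
# import boto3
# delete_test_instance(boto3.client("rds"), "my-test-instance")
```

Taking the client as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.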
Andrea Allred doesn’t want to do things by the numbers:
Cost Threshold for Parallelism (CTfP) is one of my favorite server-level settings in SQL Server. I remember the first time I heard this setting mentioned by Grant Fritchey. I quickly hopped on my servers and found them all set at the default (5) and adjusted them to 50 for the non-SSRS servers and 30 for the SSRS ones. That was many years ago, but I had kept those numbers in my head because I didn’t know a better way.
Read on for a better way.