Curated SQL Posts

Breaking Up Queries with UNION ALL

Bert Wagner takes us through a scenario where it can be faster to combine queries with UNION ALL rather than using IN:

Even though this query reads the whole clustered index to get the Benefactor rows, the total number of logical reads is still smaller than the seek/key lookup pattern seen in the combined query with IN(). This UNION ALL version gives SQL Server the ability to build a hybrid execution plan, combining two different techniques to generate a plan with fewer overall reads.
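To make the pattern concrete, here's a minimal sketch of the rewrite in Python, using SQLite as a stand-in for SQL Server. The Users table, columns, and values here are all made up, and SQLite won't reproduce the plan shapes Bert describes, but the query transformation itself is the same:

import sqlite3

# Hypothetical table standing in for the example's data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Id INTEGER PRIMARY KEY, DisplayName TEXT, UserType TEXT)")
conn.executemany(
    "INSERT INTO Users (DisplayName, UserType) VALUES (?, ?)",
    [("Alice", "Member"), ("Bob", "Benefactor"), ("Carol", "Member")],
)

# Original pattern: one query handles both predicate values with one plan.
combined = conn.execute(
    "SELECT DisplayName FROM Users WHERE UserType IN ('Member', 'Benefactor')"
).fetchall()

# Rewrite: one branch per value, joined with UNION ALL, so the optimizer
# is free to pick a different access path (seek vs. scan) for each branch.
union_all = conn.execute(
    """
    SELECT DisplayName FROM Users WHERE UserType = 'Member'
    UNION ALL
    SELECT DisplayName FROM Users WHERE UserType = 'Benefactor'
    """
).fetchall()

assert sorted(combined) == sorted(union_all)  # same rows either way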

Click through for the example.

Power BI: NFL Draft History

Dustin Ryan gives us a Power BI dashboard covering NFL draft history:

With the 2019 NFL Draft upon us, of course I wanted to visualize some NFL draft data in Power BI. So I put together this interesting set of visualizations based on some data I scraped from Pro Football Reference. The dataset includes drafts from 1936 to 2019 including picks through round 6 where applicable. I’ll update the dataset as the remaining rounds of the 2019 draft are completed. So feel free to take a look, interact with the dashboards, and let me know if you have any questions.

Click through for the dashboard.

Extracting the First Element from an Array in ADF

Rayis Imayev shows how you can find the first element in an array using Azure Data Factory:

A user recently asked me a question on my previous blog post (Setting Variables in Azure Data Factory Pipelines) about the possibility of extracting the first element of a variable if this variable is a set of elements (an array).

So as a spoiler alert, before writing a blog post and adding a bit more clarity to the existing Microsoft ADF documentation, here is a quick answer to this question.

You’ll have to click through even for the quick answer.

DAX Error: Multiple Columns and Scalar Values

Eugene Meidinger walks us through an error message in DAX:

Sometimes, when working with DAX, you might get the following error:

The expression refers to multiple columns. Multiple columns cannot be converted to a scalar value.

This error occurs whenever the DAX engine was expecting a single value, or scalar, and instead received a table of values. This is an easy error to make because many DAX functions, such as FILTER, SUMMARIZE, and ALL, return table values.
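For example, a measure defined directly as FILTER(Sales, Sales[Quantity] > 1) would trigger this error, since a measure must return a scalar; wrapping it in an aggregator such as COUNTROWS returns the single value the engine expects. (The table and column names here are purely illustrative.)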

Eugene lays out when each scenario occurs, so check it out.

Techniques for Determining Cluster Counts

Stephanie Glen has an image which describes different techniques for figuring out how many clusters you have:

If you want to determine the optimal number of clusters in your analysis, you’re faced with an overwhelming number of (mostly subjective) choices. Note that there’s no “best” method, no “correct” k, and there isn’t even a consensus as to the definition of what a “cluster” is. With that said, this picture focuses on three popular methods that should fit almost every need: Silhouette, Elbow, and Gap Statistic.

Click through for the picture and references.
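As a quick way to experiment with two of those methods, here's a minimal sketch using scikit-learn on synthetic data. The dataset and the range of candidate k values are arbitrary, and the gap statistic is omitted because scikit-learn doesn't ship an implementation:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known number of clusters, just for illustration.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Elbow: watch for the k where inertia (within-cluster SSE) stops
    # dropping sharply. Silhouette: higher is better, with a max of 1.0.
    print(f"k={k}  inertia={km.inertia_:10.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")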

Scalable Anomaly Detection with Kafka and Cassandra

Paul Brebner wraps up a series on anomaly detection at scale:

The complete machine for the biggest result (48 Cassandra nodes) has 574 cores in total. This is a lot of cores! Managing the provisioning and monitoring of this sized system by hand would be an enormous effort. With the combination of the Instaclustr managed Cassandra and Kafka clusters (automated provisioning and monitoring), and the Kubernetes (AWS EKS) managed cluster for the application deployment, it was straightforward to spin up clusters on demand, run the application for a few hours, and delete the resources when finished for significant cost savings. Monitoring over 100 Pods running the application using the Prometheus Kubernetes operator worked smoothly and gave enhanced visibility into the application and the necessary access to the benchmark metrics for tuning and reporting of results.

The system (irrespective of size) was delivering an approximately constant 400 anomaly checks per second per core.
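For scale, 574 cores at 400 checks per second per core works out to roughly 230,000 anomaly checks per second across the whole system.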

This is a good summary of what was an interesting series.

Thoughts on SQL Notebooks

Emanuele Meazzo takes a look at the current state of SQL Notebooks in Azure Data Studio:

I've personally used SQL Notebooks in my day-to-day work for data analysis, as the ability to tweak the code and run it in the notebook greatly enhances the presentation of the data compared to a commented SQL script, where you cannot see all the query results on the same page. Moreover, a notebook (with or without results) can be exported in a read-only format like HTML or PDF to share the info with third parties, i.e., you can automate an analysis process that includes code to be shared. Cool stuff.

I think there are still a few (dozen) things to iron out before it’s a great experience, but they’re on the right path with it. If you haven’t checked out Azure Data Studio and its SQL Notebooks, give it a try sometime.

Predicting Database Growth

James Livingston uses linear regression to plot database growth over time:

Utilizing the equation for a line, instead of solving for y we will solve for x, where:
– x corresponds to the day we will hit capacity based on current growth rate
– y corresponds to drive capacity in GB
– m is the slope of our regression line, provided by the model via lm.coef_
– b is the intercept of the regression line, also provided by the model via lm.intercept_

Click through for an example. This is one of the areas where DBAs can gain a lot by learning a bit of data science.
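As a rough illustration of that solve-for-x step, here's a sketch with made-up growth numbers; lm.coef_ and lm.intercept_ are the scikit-learn attributes the excerpt references:

import numpy as np
from sklearn.linear_model import LinearRegression

# Fabricated example: database size sampled daily for a month.
days = np.arange(1, 31).reshape(-1, 1)   # x: day number
rng = np.random.default_rng(0)
gb_used = 100 + 1.5 * days.ravel() + rng.normal(0, 2, 30)  # y: GB used

lm = LinearRegression().fit(days, gb_used)
m, b = lm.coef_[0], lm.intercept_

# Invert y = m*x + b to find the day the drive fills up.
capacity_gb = 500
day_full = (capacity_gb - b) / m
print(f"At ~{m:.2f} GB/day, capacity is reached around day {day_full:.0f}")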

Considerations Before Failing Over an AG

David Fowler points out a few things to look at before failing over an Availability Group:

What about your SQL Agent jobs? Have you got jobs that perform actions on your data? If you have, do those jobs exist on the new primary? If they don't, then I'm happy to bet that whatever function they were performing probably isn't happening anymore.

One thing I always want to make sure of before I fail over is: do I have all the relevant jobs ready to roll on the secondary server?

But what about the jobs that you’ve got on the old primary? There’s a fair chance that, if they’re doing any sort of data manipulation, they’re going to be failing.

Click through for additional considerations.
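If you want to script the job check David describes, here's a hedged sketch in Python with pyodbc. The server names are placeholders, and this only compares job names, not definitions, schedules, or enabled state:

import pyodbc

def job_names(server):
    # msdb.dbo.sysjobs holds one row per SQL Agent job.
    conn = pyodbc.connect(
        f"DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server};"
        "DATABASE=msdb;Trusted_Connection=yes;"
    )
    names = {row.name for row in conn.execute("SELECT name FROM dbo.sysjobs")}
    conn.close()
    return names

# Placeholders: the current primary and the replica you plan to fail over to.
missing = job_names("OldPrimary") - job_names("NewPrimary")
for name in sorted(missing):
    print(f"Job not present on the intended new primary: {name}")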
