Press "Enter" to skip to content

Curated SQL Posts

Hyperparameter Tuning as Technical Debt

John Mount has an interesting take on hyperparameter tuning:

The hyper dance is the venial trick of pushing user facing technical debt and flaws as user controllable features. These controls are usually named “hyper parameters” and they are parameters or arguments that control the behavior of an algorithm. Users think “hyper parameters” must be even better than “regular parameters”, just like “hyper drive” is better than “sub-light drive.” However the etymology of the name isn’t from science fiction, it is just the need in statistical contexts to have a name for controls other than parameter, as parameter is often used to name the fit coefficients of a model (i.e. to name an output, not an input!).

In addition to this, I’d be concerned that heavy hyperparameter tuning could lead to a garden of forking paths problem where we end up accidentally doing the equivalent of p-hacking: modifying hyperparameters until we come up with the “right” answer.
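
To make the distinction concrete, here is a minimal scikit-learn sketch (my own illustration, not from Mount's post): alpha is a hyperparameter, an input the user must choose up front, while coef_ holds the parameters, the outputs the fitting procedure produces.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# alpha is a hyperparameter: an input the user must choose before fitting
model = Ridge(alpha=1.0).fit(X, y)

# coef_ holds the parameters: outputs produced by the fitting procedure
print(model.coef_)
```

Sweeping alpha over a grid until the output looks "right" is exactly the forking-paths risk described above.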


Improving a Graph

Elizabeth Ricks has started a series on improving a particular visual:

I empathize with the plight of this anonymous creator. In previous roles, I frequently created visuals that looked like this, and was left frustrated when requests came back for “more data.” I slowly came to realize that I was assigning my audience the tedious task of figuring out for themselves what the takeaways were. My visuals should have been highlighting the interesting things to those seeing them for the first time. The five questions we’ll be discussing in this series will help us to do just that.

The first question in the series is, “What elements can I eliminate?” I think that’s a really good idea—with data visualization, less is more.
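
As a hypothetical illustration of that eliminate-first mindset (mine, not from the post), here is a matplotlib sketch that removes default elements before adding anything new:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([2018, 2019, 2020, 2021], [42, 51, 48, 60], color="gray")

# Eliminate elements that carry no information
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.grid(False)

# State the takeaway for the audience instead of making them hunt for it
ax.set_title("Sales recovered in 2021 after a dip in 2020")
plt.show()
```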


Archival on Delete in SQL Server

Erik Darling shows off a pattern:

Well, friends, I have good news for you. This is an easy one to implement.

Let’s say that in Stack Overflow land, when a user deletes their account we also delete all their votes. That’s not how it works, but it’s how I’m going to show you how to condense what can normally be a difficult process to isolate into a single operation.

The one gripe I have with this post is that my annoyingly loud keyboard is buckling spring, not Cherry MX Blue, thank-you-very-much.
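
One common way to condense delete-plus-archive into a single statement is T-SQL's OUTPUT clause, which routes the deleted rows into another table as part of the DELETE itself. Here is a minimal sketch via pyodbc; the connection string, archive table, and column list are placeholders, and you should click through for Erik's actual pattern:

```python
import pyodbc

# Placeholder connection string; adjust for your environment
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=StackOverflow;Trusted_Connection=yes;"
)

# One statement: OUTPUT sends the deleted rows into the archive table
# as part of the same DELETE, so archival and deletion can't drift apart.
sql = """
DELETE v
OUTPUT deleted.Id, deleted.PostId, deleted.UserId, deleted.VoteTypeId
INTO dbo.VotesArchive (Id, PostId, UserId, VoteTypeId)
FROM dbo.Votes AS v
WHERE v.UserId = ?;
"""

cur = conn.cursor()
cur.execute(sql, 26837)  # hypothetical user id
conn.commit()
```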


Azure Data Factory and Source Control

Ahmad Yaseen shows how you can save Azure Data Factory pipelines in source control:

To overcome these limitations, Azure Data Factory provides the ability to integrate with a Git repository, such as an Azure DevOps or GitHub repository, which helps in tracking and versioning pipeline changes and in saving changes incrementally during development, without the need to validate an incomplete pipeline, and prevents those changes from being lost in case of any crash or failure. In this case, you will be able to test the pipeline, revert any change that is detected as a bug, and publish the pipeline to the Data Factory when everything is developed and validated successfully.

Click through for the setup instructions.


Query Store and Cross-Database Queries

Matthew McGiffen does some research:

When I was writing the script shared in my last post, Identify the (Top 20) most expensive queries across your SQL Server using Query Store, a question crossed my mind:

Query Store is a configuration that is enabled per database, and the plans and stats for queries executed in that database are stored in the database itself. So what does Query Store do when a query spans more than one database?

Read on for the answer.
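
No spoilers here, but it is easy to test for yourself: run a query that spans two databases, then check each database's Query Store catalog views to see where the text was captured. A rough pyodbc sketch; the database and table names are hypothetical:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=DbOne;Trusted_Connection=yes;",
    autocommit=True,
)
cur = conn.cursor()

# A query that spans two databases (both with Query Store enabled)
cur.execute("SELECT COUNT(*) FROM DbOne.dbo.Orders AS o "
            "JOIN DbTwo.dbo.Customers AS c ON c.Id = o.CustomerId;")
cur.fetchall()

# See which database's Query Store captured the text
for db in ("DbOne", "DbTwo"):
    cur.execute(f"SELECT query_sql_text FROM {db}.sys.query_store_query_text "
                "WHERE query_sql_text LIKE '%DbTwo.dbo.Customers%' "
                "AND query_sql_text NOT LIKE '%query_store%';")
    print(db, cur.fetchall())
```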


DataFrame Cleaning in Spark

Craig Covey has an update to the Spark Starter Guide:

Real-world datasets are hardly ever clean and pristine. They commonly include blanks, nulls, duplicates, errors, malformed text, mismatched data types, and a host of other problems that degrade data quality. No matter how much data one might have, a small amount of high quality data is more beneficial than a large amount of garbage data. All decisions derived from data will be better with higher quality data. 

In this section we will introduce some of the methods and techniques that Spark offers for dealing with “dirty data”. The term dirty data means data that needs to be improved so the decisions made from the data will be more accurate. The topic of dirty data and how to deal with it is a very broad topic with a lot of things to consider. This chapter intends to introduce the problem, show Spark techniques, and educate the user on the effects of “fixing” dirty data. 

It’s interesting to see what’s available in Spark and how you can take advantage of it.
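
As a taste of what those methods look like, here is a quick PySpark sketch covering duplicates and nulls; the data and column names are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dirty-data").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("alice", 34), ("bob", None), (None, 29)],
    ["name", "age"],
)

deduped = df.dropDuplicates()              # remove exact duplicate rows
named = deduped.na.drop(subset=["name"])   # drop rows with a null name
filled = named.na.fill({"age": 0})         # fill missing ages with a default
filled.show()
```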


Smoothing and its Inherent Risks

John Mount would like you to take care when using smoothers:

Here is a quick data-scientist / data-analyst question: what is the overall trend or shape in the following noisy data? For our specific example: How do we relate value as a noisy function (or relation) of m? This example arose in producing our tutorial “The Nature of Overfitting”.

One would think this would be safe and easy to assess in R using ggplot2::geom_smooth(), but now we are not so sure.

Here’s a quick summary of my general philosophy: the data are more interesting than a smoothed line. I’m okay putting in a smoothed line to help a reader make sense of a trend, but I wouldn’t want to have a plot with just the smoothed line. Read the whole thing from John to get well beyond my rule of thumb.
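
In that spirit, here is a small matplotlib sketch (mine, not from John's post): plot the raw points first, then layer the smoother on top rather than replacing the data with it. The rolling mean is just a stand-in for whatever smoother you prefer.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
m = np.linspace(0, 10, 200)
value = np.sin(m) + rng.normal(scale=0.4, size=m.size)

# A simple rolling mean; note that the window size changes the story told
window = 15
smooth = np.convolve(value, np.ones(window) / window, mode="same")

plt.scatter(m, value, s=8, alpha=0.5, label="data")     # the data come first
plt.plot(m, smooth, color="red", label="rolling mean")  # the aid, not the story
plt.legend()
plt.show()
```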


Handling “Duplicate” Query String Values with Power Query

Chris Webb troubleshoots an issue:

Some time ago I wrote a pair of popular posts about using the Query and RelativePath options of the Web.Contents function in Power Query and why they are important for dataset refresh. I have recently learned something extra about this subject which merits a new post, though: how to handle multiple URL query parameters with the same name.

It’s interesting to see how Power Query handles this, as there’s no defined standard behavior. Some renderers give you just the first item, some just the last, and some (like IIS + .NET) give you back a list of all items when you have a query string like ?param1=x&param1=y&param1=z.
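
To see that ambiguity concretely, here is how Python's standard library handles the same query string; parse_qs keeps every value, while collapsing to a plain dict silently drops all but one:

```python
from urllib.parse import parse_qs

qs = "param1=x&param1=y&param1=z"

# parse_qs preserves all values for a repeated key
print(parse_qs(qs))  # {'param1': ['x', 'y', 'z']}

# A naive flattening keeps only the last value
print({k: v[-1] for k, v in parse_qs(qs).items()})  # {'param1': 'z'}
```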


Finding Power BI Premium Per User Users

Benni de Jagere does some digging:

The other day, I was chatting with one of my clients about Premium Per User, and I gave them the practical guidance to not build any production-level dependencies based on PPU features or workspaces until some of the unknowns have been cleared up. If there are end users relying on this for their actual daily job, then I'm calling it a production-level dependency. Right now, these are preview features, and this client is not actively monitoring changes in the Power BI landscape.

Shortly after, I got a message that some of their business users did build actual production reports and dataflows in PPU workspaces. And they were not sure who in the company actually has access to PPU. And that's where the chase down the rabbit hole began.

I imagine that this will get easier over time, but right now it doesn't seem that simple.


Generating Alerts from Power Automate

Ed Hansberry shows how to create a Power Automate alert off of SQL Server data:

I’m going to show you how to do this in Power Automate with just a few steps. Let’s get started. In my example, I am going to return a table when a customer has placed an order where the order quantity will not divide evenly into the case pack. So if they order 100 units and the cases contain 24 each, I want to alert the order entry person to tell them the customer has effectively ordered 4.1667 cases, which isn’t allowed. They will need to order either 96 units or 120 units to get 4 or 5 cases.
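
The underlying check is simple divisibility. Here is a small Python sketch of the rule Ed describes; the function name and message format are mine, not from the post:

```python
import math

def check_case_pack(units: int, case_pack: int) -> str:
    """Flag orders that don't divide evenly into full cases."""
    if units % case_pack == 0:
        return f"OK: {units // case_pack} full cases"
    lower = (units // case_pack) * case_pack
    upper = math.ceil(units / case_pack) * case_pack
    return (f"Invalid: {units / case_pack:.4f} cases; "
            f"order {lower} or {upper} units instead")

print(check_case_pack(100, 24))
# Invalid: 4.1667 cases; order 96 or 120 units instead
```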

Read on to see how.
