Press "Enter" to skip to content

Curated SQL Posts

Tidying the Confusion Matrix in R

Gary Hutson has a new package for us:

The package aim is to make it easier to convert the outputs of the lists from caret and collapse these down into row-by-row entries, specifically designed for storing the outputs in a database or row by row data frame.

This is something that the CARET library does not have as a default and I have designed this to allow the confusion matrix outputs to be stored in a data frame or database, as many a time we want to track the ML outputs and fits over time to monitor feature slippage and changes in the underlying patterns of the data.

I like the way caret shows the confusion matrix when I’m reviewing result on my own, but I definitely appreciate efforts to make it easier to handle within code—similar to how broom reads linear regression outputs. H/T R-bloggers

Comments closed

Fixing Those Pesky Wait Stats

David Alcock keeps us from having to think:

CXPACKET

Attempt to pronounce parallellellellellism correctly then set MAXDOP to 1.

PAGELATCH

Mention TempDB and contention in the same sentence. Delete TempDB

BACKUP_

Delete any long running backup jobs. If wait persists then delete all backup jobs.

Click through for plenty of excellent nuggets of advice which definitely won’t land you on the unemployment line.

Comments closed

The Joy of Parameter Sniffing

Erik Darling points out that parameter sniffing is generally a good thing:

To some degree, I get it. You’re afraid of incurring some new performance problem.

You’ve had the same mediocre performance for years, and you don’t wanna make something worse.

The thing is, you could be making things a lot better most of the time.

We should specifically talk about parameter sniffing problems rather than parameter sniffing as a problem. These sorts of problems are closer to the exception than the rule.

Comments closed

Disorderly Queries

Chad Callihan wants you to think about that ORDER BY clause:

I recently came across a scenario where an application process was not performing correctly on one database but was working fine on others. The process should have been completing in seconds but was taking minutes with no indication of activity. After some investigation, I found that the process was stuck waiting on a SELECT statement to complete. Even worse, it was holding an exclusive lock on a table which was then blocking new information from processing.

One part of the SELECT query that stood out was that it was ordering by a date field. Considering what the process was doing, there was no need to have the data ordered. Generally, it’s better to have the data sorted on the application side instead of SQL Server but in this case not even that was necessary.

There are definitely good cases where you need to use ORDER BY in a database—such as paging scenarios. But if you don’t need things in a particular order, Chad shows that you can potentially save a good deal on performance without an explicit ordering.

Comments closed

Fun with Multi-Column Unique Constraints

Aaron Bertrand has an interesting use case:

A problem that comes up occasionally with constraints in SQL Server is when a unique constraint applies to multiple columns, but the values in those columns can be populated in any order. For example, if a table holds interactions between two users, and has columns User1 and User2, there should only be one row allowed for users 100 and 200, regardless of whether the data is entered as 100, 200 or 200, 100.

Click through for one solution. Another solution would be to normalize this down further with a dbo.ConversationParticipants table.

Comments closed

Choosing an Image File Type

The folks at Jumping Rivers continue a series on image optimization:

As the JPEG compression algorithm significantly reduces file size, JPEG files are ubiquitous across the web. If you take a photo on your camera, it’s almost certainly using a JPEG storage format. Historically the file extension was .jpg as Microsoft Windows only handled three character file extensions (also .htm vs .html). But today both extensions are used (personally I prefer .jpeg, but I’m not very consistent if I’m totally honest).

If you did a little Googling on which file format to use for images, then the answer you would come across is that JPEG’s are the default choice. But remember, figures are different from standard images!

Click through for a review of three viable image formats.

Comments closed

Research with R and Production with Python

Matt Dancho and Jarrell Chalmers lay out an argument:

The decision can be challenging because they both Python and R have clear strengths.

R is exceptional for Research – Making visualizations, telling the story, producing reports, and making MVP apps with Shiny. From concept (idea) to execution (code), R users tend to be able to accomplish these tasks 3X to 5X faster than Python users, making them very productive for research.

Python is exceptional for Production ML – Integrating machine learning models into production systems where your IT infrastructure relies on automation tools like Airflow or Luigi.

They make a pretty solid argument. I’ve launched success R-based projects using SQL Server Machine Learning Services, but outside of ML Services, my team’s much more likely to deploy APIs in Python, and we’re split between Dash and Shiny for visualization. H/T R-Bloggers

Comments closed

Refreshing a Single Table in Power BI

Marc Lelijveld doesn’t want to wait for everything to reload:

If you want to refresh a Power BI dataset, we all know where to find the refresh button in Power BI Desktop as well as in the Power BI Service. By clicking it, you will trigger the entire dataset to refresh. But sometimes it is more convenient to trigger a single table to refresh. If you want to do this, you can do a simple right-click on a table in Power BI Desktop, but how does this work in the Power BI Service? In this blogpost I will describe how you can trigger a single table refresh in the Power BI Service over XMLA endpoints. Please know, this does require Power BI Premium (either Premium per User or Premium Capacity is fine).

Click through to see how.

Comments closed

Protecting Excel with Powershell

Mikey Bronowski shows us a few techniques for protecting data in Excel files using Powershell:

Last month we have been hiding things in Excel, so this week we are going to make sure they are protected as well. Excel offers multiple levels of password protection and its options:

– locking file with a password, i.e. without key phrase opening file is not possible
– protecting workbook’s structure
– lastly, protecting individual worksheets from a handful of operations

Read on to see each of those in action.

Comments closed