Press "Enter" to skip to content

Curated SQL Posts

Tidy Data Is Normalized Data

I emphasize the link between a tidy dataframe and a normalized data structure:

The kicker, as Wickham describes on pages 4-5, is that normalization is a critical part of tidying data.  Specifically, Wickham argues that tidy data should achieve third normal form.

Now, in practice, Wickham argues, we tend to need to denormalize data because analytics tools prefer having everything connected together, but the way we denormalize still retains a fairly normal structure:  we still treat observations and variables like we would in a normalized data structure, so we don’t try to pack multiple observations in the same row or multiple variables in the same column, reuse a column for multiple purposes, etc.

I had an inkling of this early on and figured I was onto something clever until I picked up Wickham’s vignette and read that yeah, that’s exactly the intent.

Comments closed

SQL Operations Studio February Release

Alan Yu announces the February release of SQL Operations Studio:

The February release includes several major repo updates and feature releases, including:

  • Added Auto-Update Installation feature
  • Added Connection Dialog ‘Database’ Drop-down
  • Added functionality for new query tabs keeping active connection
  • Fixed bugs in SQL Editor and auto-completion

For complete updates, refer to the Release Notes.

Auto-update, something that Management Studio and Power BI don’t do.

Comments closed

Navigating A SQL Server Instance With Powershell

Andy Mallon shows off the drive-based navigation available in the SQL Server Powershell module:

I recently noticed that the system objects were missing from my results when I do a Get-ChildItem. I noticed it with views, but then realized that none of the system objects showed up. What gives? I floundered through a quick Google search, where I knew I wasn’t searching for the right thing, and was not surprised when I didn’t see the answer.

I said to myself, “Andy, hold on a second & think. If something doesn’t want to open up, sometimes you just have to -force it open.”

I don’t tend to use this much, as I have recollections of it being slow.  Nonetheless, it is good to know about.

Comments closed

Find And Replace Stored Procedure Code

Jana Sattainathan has to find and replace a lot of code:

The database has 100’s of stored procedures (700+ to be somewhat precise)

Qualifying stored procedures have these characteristics

  • Procedure name ends with “_del”
  • Procedure has the string “exec” in the code
  • Procedure has the string “sp_execute” in the code

This is what needs to be done:

  • Replace CREATE PROC with ALTER PROC

  • Replace SYSTEM_USER with “ORIGINAL_LOGIN()”

  • Replace AS at the beginning of CREATE PROC with “WITH EXECUTE AS OWNER AS”

  • Comment out some SET statements

  • …in fact, there could be any number of other changes

Read on to see how Jana did it.

Comments closed

The Downside Of Nested Views

Randolph West doesn’t mince words:

Nested views are bad. Let’s get that out of the way.

What is a nested view anyway? Imagine that you have a SELECT statement you tend to use all over the place (a very common practice when checking user permissions). There are five base tables in the join, but it’s fast enough.

Instead of copying and pasting the code wherever you need it, or using a stored procedure (because for some reason you’re allergic to stored procedures), you decide to simplify your code and re-use that SELECT in the form of a database view. Now whenever you need to run that complicated query, you can instead query the view directly.

What harm could this do?

Spoilers:  a lot.

Comments closed

Diagramming Databases With Power BI

Philip Seamark shows how to visualize the relationships between tables using Power BI:

The network navigator was another good visual, and if you have an R instance installed on your local machine, you can play with some of the custom R visuals.

The catalog views could be used in a similar way to generate power bi visuals showing other object dependencies inside an MS SQL Database.  Additional columns could be added to the base query to be used in tool-tips etc.

If your database tables do not have foreign keys, here is another query that can be used to help guess relationships between tables based on column name and datatype.

Most of these techniques look like they wouldn’t work well on databases with a very large number of tables, but for an average-sized database, it can serve as an avant-garde ERD.

Comments closed

Batch Mode Memory Fractions

Joe Obbish explains what memory fractions are and how incorrect calculations can lead to tempdb spills:

There’s very little information out there about memory fractions. I would define them as information in the query plan that can give you clues about each operator’s share of the total query memory grant. This is naturally more complicated for query plans that insert into tables with columnstore indexes but that won’t be covered here. Most references will tell you not to worry about memory fractions or that they aren’t useful most of the time. Out of thousands of queries that I’ve tuned I can only think of a few for which memory fractions were relevant. Sometimes queries spill to tempdb even though SQL Server reports that a lot of query memory was unused. In these situations I generally hope for a poor cardinality estimate which leads to a memory fraction which is too low for the spilling operator. If fixing the cardinality estimate doesn’t prevent the spill then things can get a lot more complicated, assuming that you don’t just give up.

Extremely interesting post.

Comments closed

Optimal Image Colorization With Python

Sandipan Dey walks through a paper on colorization and shows some examples:

Colorization is a computer-assisted process of adding color to a monochrome image or movie. In the paper the authors presented an optimization-based colorization method that is based on a simple premise: neighboring pixels in space-time that have similar intensities should have similar colors.

This premise is formulated using a quadratic cost function  and as an optimization problem. In this approach an artist only needs to annotate the image with a few color scribbles, and the indicated colors are automatically propagated in both space and time to produce a fully colorized image or sequence.

In this article the optimization problem formulation and the way to solve it to obtain the automatically colored image will be described for the images only.

It’s an interesting approach.

Comments closed

Radar Charts With ggplot2

I have wrapped up my ggplot2 series, with the last post being on radar charts:

First, we need to install ggradar and load our relevant libraries. Then, I create a quick standardization function which divides our variable by the max value of that variable in the vector. It doesn’t handle niceties like divide by 0, but we won’t have any zero values in our data frames.

The radar_data data frame starts out simple: build up some stats by continent. Then I call the mutate_each_ function to call standardize for each variable in the vars set. mutate_each_is deprecated and I should use something different like mutate_at, but this does work in the current version of ggplot2 at least.

Finally, I call the ggradar() function. This function has a large number of parameters, but the only one you absolutely need is plot.data. I decided to change the sizes because by default, it doesn’t display well at all on Windows.

It was a lot of fun putting this series together. I think the most important part of the series was learning just how easy ggplot2 is once you sit down and think about it in a systemic manner.

Comments closed