Curated SQL – Page 714 – A Fine Slice Of SQL Server

Data Paging using Common Table Expressions

Published 2021-03-12 by Kevin Feasel

Steve Stedman takes us through one method of generating pages of data:

I can remember the first time that I worked on data paging code. I had to page through web site search results 20 at a time on a given web page. My task was to understand how it had been written and to do some bug fixing. After reviewing about 2,000 lines of code, and reviewing the seven different variables that were being used to know the current page, the next page, the previous page, the page size, the row at the top of the page, the row at the bottom of the page, and the number of pages, I finally understood what the code was intended to do. What a mess that was, but beyond the mess, the performance was horrible. The way the page worked was that based on the page you were on, all of the rows would be queried, then a loop would read through all of the rows before the current page, then loop through the rows on the current page displaying them on the page, and finally it would ignore the results after the current page. So page 1 was slow, page 2 was slower than page 1, page 3 was slower than page 2 and on and on.
Since that point I have implemented several different data paging algorithms myself, all better than the original implementation but none as elegant as the CTE way of doing data paging. I used to look at data paging as a painful task, but thanks to the SQL Server implementation of CTEs there is no more pain.

Steve also covers OFFSET and FETCH. This technique won’t be great with enormous data sets, but for moderate-sized data sets which query quickly, it works. This is one area which is quite painful, and the best (and wackiest) solution I’ve come up with in SQL Server when the initial query is quite expensive is to create tables with random names to store results and row numbers, populate a table the first time a query is run, and query that table on subsequent runs, using the RETURN value in a stored procedure to pass along the name of the table to access. Granted, that solution works best with static data and you’d want to have a method to clean up those tables after they’re no longer in use (like storing a list of those tables and their last access dates and times). So it’s a mess.

Comments closed

Fun with Disabling Joins

Published 2021-03-12 by Kevin Feasel

David Alcock gives us a “wouldn’t it be neat to see this?” scenario:

If you’ve ever seen my session on Guillotines and the Query Optimiser I demonstrated how we can use query hints to switch off certain transformation rules in a query. Transformation rules use a pattern substitution which essentially swaps one expression for another. In SQL Server terms the best example of this is a join as although we use the logical expressions such as left join or inner join the physical join type used in the execution plan operator is substituted to be something like a Hash Match or Nested Loop Join.
This means we can force the behaviour of an execution plan by disabling certain rules. Now I can’t think of any practical reason for this and remember we can also use join hints in our queries but this is different as it affects the entire optimisation process for a query and not just one join so please follow in a sandbox environment.

In case you haven’t seen David’s session, it’s available on YouTube.

Comments closed

Disliking User-Defined Data Types

Published 2021-03-12 by Kevin Feasel

Andy Levy has a bone to pick:

Here’s the thing – these types are really just aliases for native types in SQL Server, but more constrained. Constrain yourself to UDTs and you’ll have trouble right-sizing your fields. Let’s say you’ve got three data types for text data:
– myShortString (varchar(10))
– myString (varchar(256))
– myBigString (varchar(8000))
These lengths are not helping anyone. You can’t store email addresses or names in myShortString. But myString is probably way too much for that data. You’re going to waste memory because of how SQL Server estimates memory grants and your indexes will be bloated. Maybe you just need to create more UDTs to cover these situations. But that just compounds the other problems, doesn’t it?

Pushes glasses up the bridge of his nose. Teeeeechnically, an e-mail address may be up to 256 characters long, including a username of up to 64 characters (and maybe two angle brackets, depending on the host). So myString would actually be perfect. Steve Jones has a comment about 300, but that was probably the original standard of 320. Regardless, I realize how far beside the point that is, and Andy’s point is a good one, as well as the several others he makes in the post.

One quick note on defined types: they really do make a lot of sense in a domain-driven design, especially when working with functional programming languages. Defining a CustomerID as an int is fine, but if I know my customer IDs are natural numbers (1, 2, 3, …), 9 digits long, and do not contain the sequence 2345 (because my company considers this an unlucky number sequence), creating a CustomerID type which provides this sort of type checking is great because you keep the rules as close to the data as possible and ensure consistency. It’s also more restrictive than int, so you can cast back down to an int when you’re ready to interact with some remote system. So short answer, do this all day in F#, but not in the database.

Comments closed

Removing Characters Before or After a Delimiter in Power Query

Published 2021-03-12 by Kevin Feasel

Gilbert Quevauvilliers shows how to remove all characters either before or after a delimiter:

This is an example of what my column name looked like below.
Date[Yr-Mth]
What I wanted this to look like is Yr-Mth]
Below is the Syntax I used to get this working.

Click through to see how Gilbert was able to do this and subsequently how to get rid of the trailing bracket.

Comments closed

In Defense of Float

Published 2021-03-12 by Kevin Feasel

Hugo Kornelis levies a defense of floating point data types:

Let’s return to the database. Let’s figure out a way to store these numbers appropriately.
Could we use decimal (or its synonym numeric)? Well, yes. We can. We need 25 digits after the decimal place for 0.0000000000000000000004052, and 9 digits before the decimal place for 299900000, so that would fit in a decimal(34,25). But if you try to compute c² so you can then multiply that to the m, you’ll run into an overflow error.

Hugo does a good job of defending the float data type.

Comments closed

Namenode Formatting in Hadoop

Published 2021-03-11 by Kevin Feasel

The Hadoop in Real World team has a warning for us:

What does hadoop namenode -format do and is it safe to run?
Don’t run the command until you fully understand what you are trying to do.

Click through to see what the command does.

Comments closed

Batch Execution Mode in Flink’s DataStream API

Published 2021-03-11 by Kevin Feasel

Dawid Wysakowicz takes us through batch execution mode in a streaming solution:

Flink has been following the mantra that Batch is a Special Case of Streaming since the very early days. As the project evolved to address specific uses cases, different core APIs ended up being implemented for batch (DataSet API) and streaming execution (DataStream API), but the higher-level Table API/SQL was subsequently designed following this mantra of unification. With Flink 1.12, the community worked on bringing a similarly unified behaviour to the DataStream API, and took the first steps towards enabling efficient batch execution in the DataStream API.

Read on to see the progress they’ve achieved so far.

Comments closed

Two Disliked Data Types

Published 2021-03-11 by Kevin Feasel

Aaron Bertrand has two bones to pick:

I am not often one to do the bare minimum, unless it comes to looking for things around the house. After about 22 seconds I throw my hands in the air and exclaim, “I can’t find it!” As my wife so kindly added: It usually turns out to be in a spot I already looked.
But when it revolves around SQL Server and opinions, I’m all over it. So I’m not going to talk about a data type today; I’m going to talk about two of them. One of them Brent already mentioned in his invite:

I kinda-sorta disagree with Aaron’s second choice. By that I mean that I fully agree with his premise: use UTC everywhere. But if you don’t use UTC everywhere, then use DATETIMEOFFSET everywhere and apply the time zones.

Comments closed

Multiple Database Projects in Azure Data Studio

Published 2021-03-11 by Kevin Feasel

Elizabeth Noble has a new video for us:

In many companies, you’re often working with more than one database on the same server. When I use Visual Studio, I have all the databases that exist on the same server in the same solution. You can do this same thing when you’re working with Azure Data Studio.

Click through to see how.

Comments closed

Curated SQL Posts