Category: Syntax

Thoughts on the New STRING_SPLIT

Ronen Ariely has mixed feelings on updates to the STRING_SPLIT function:

The main issue with this function is that it returns a SET of rows with no specific order.

As you must know by now, a TABLE is a SET of rows (a rowstore table, which is the more common kind in SQL Server) or columns (a columnstore table). The rows in the table are not stored in a specific order: even with a clustered index, the rows can physically be stored in different locations on disk, not necessarily contiguously one after the other. In addition, the server might read the rows in parallel and not necessarily in the order of the index. As a result, the order in which rows are returned in a result set is not guaranteed unless an ORDER BY clause is specified.

And this was the main issue with STRING_SPLIT… until today.

Read on to see how this update makes STRING_SPLIT() much better, and also how it could be even better still.
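
As a rough illustration of the ordering problem described above, the sketch below assumes the update in question is the optional third argument to STRING_SPLIT, enable_ordinal, which is available in Azure SQL Database and SQL Server 2022. The sample string is made up for the example.

```sql
-- Assumes a platform where STRING_SPLIT accepts the third argument (enable_ordinal).
DECLARE @list varchar(100) = 'delta,alpha,charlie,bravo';

-- Without the ordinal, nothing guarantees the rows come back in input order.
SELECT value
FROM STRING_SPLIT(@list, ',');

-- With enable_ordinal = 1, each row carries its 1-based position in the input,
-- so an explicit ORDER BY can restore the original order.
SELECT value, ordinal
FROM STRING_SPLIT(@list, ',', 1)
ORDER BY ordinal;
```

The ORDER BY is still required in the second query. The ordinal column only gives the query something reliable to sort on.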

GROUP BY and Functional Dependencies

Lukas Eder enlightens us:

The SQL standard knows an interesting feature where you can project any functional dependencies of a primary (or unique) key that is listed in the GROUP BY clause without having to add that functional dependency to the GROUP BY clause explicitly.

I was unaware that this functionality existed (in some database platforms), and I’m not positive that I like it.
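
To make the idea concrete, here is a minimal sketch in PostgreSQL syntax, one of the platforms that supports this behavior for primary keys. The author and book tables are hypothetical and exist only for this illustration. SQL Server rejects the final query because first_name and last_name are neither aggregated nor listed in the GROUP BY clause.

```sql
-- Hypothetical tables, PostgreSQL syntax.
CREATE TABLE author (
    author_id  int PRIMARY KEY,
    first_name text NOT NULL,
    last_name  text NOT NULL
);

CREATE TABLE book (
    book_id   int PRIMARY KEY,
    author_id int NOT NULL REFERENCES author (author_id),
    title     text NOT NULL
);

-- first_name and last_name are functionally dependent on the primary key
-- author_id, so they may be projected without appearing in GROUP BY.
SELECT a.author_id, a.first_name, a.last_name, count(b.book_id) AS book_count
FROM author AS a
LEFT JOIN book AS b ON b.author_id = a.author_id
GROUP BY a.author_id;
```

On platforms without this feature, the usual workaround is to add a.first_name and a.last_name to the GROUP BY clause.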

Fundamentals of Inline TVFs

Itzik Ben-Gan explains Inline Table-Valued Functions:

Compared to the previously covered named table expressions, iTVFs resemble mostly views. Like views, iTVFs are created as a permanent object in the database, and therefore are reusable by users who have permissions to interact with them. The main advantage iTVFs have compared to views is the fact that they support input parameters. So, the easiest way to describe an iTVF is as a parameterized view, although technically you create it with a CREATE FUNCTION statement and not with a CREATE VIEW statement.

It’s important not to confuse iTVFs with multi-statement table-valued functions (MSTVFs). The former is an inlinable named table expression based on a single query similar to a view and is the focus of this article. The latter is a programmatic module that returns a table variable as its output, with multi-statement flow in its body whose purpose is to fill the returned table variable with data.

Now that we have that sorted, click through to see examples and dive into performance ramifications.
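
For a sense of what the "parameterized view" description looks like in practice, here is a minimal T-SQL sketch. The dbo.Orders table, the dbo.GetCustomerOrders function, and the customer ID are all hypothetical names chosen for illustration, not taken from the article.

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE dbo.Orders
(
    OrderID    int   NOT NULL PRIMARY KEY,
    CustomerID int   NOT NULL,
    OrderDate  date  NOT NULL,
    TotalDue   money NOT NULL
);
GO  -- batch separator for SSMS/sqlcmd; CREATE FUNCTION must start its own batch

-- An inline TVF: RETURNS TABLE followed by a single RETURN of one query,
-- with no BEGIN...END body and no declared table variable.
CREATE FUNCTION dbo.GetCustomerOrders (@CustomerID AS int)
RETURNS TABLE
AS
RETURN
    SELECT o.OrderID, o.OrderDate, o.TotalDue
    FROM dbo.Orders AS o
    WHERE o.CustomerID = @CustomerID;
GO

-- Queried like a view, but with an argument.
SELECT OrderID, OrderDate, TotalDue
FROM dbo.GetCustomerOrders(42)
ORDER BY OrderDate;
```

A multi-statement TVF would instead declare a table variable in its RETURNS clause and fill it inside a BEGIN...END block.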

Top with Percent

Kevin Wilkie is on the top shelf:

In the last blog post, we went over the extreme basics of using the TOP operator in SQL. We showed how to grab things like the TOP 10 of a certain item.

That ability will get you through a number of criteria that you will be asked to perform. But what if you’re asked to grab the top five percent of performers in your company? Or in a region? It’s kinda hard to do that if you only have what we know so far, right?

Read on for the answer.
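
As a quick preview, and assuming the answer is the PERCENT option that the post title hints at, a sketch against a hypothetical dbo.EmployeeSales table might look like this:

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE dbo.EmployeeSales
(
    EmployeeID  int         NOT NULL PRIMARY KEY,
    Region      varchar(20) NOT NULL,
    SalesAmount money       NOT NULL
);

-- Top five percent of performers across the company.
-- ORDER BY decides which rows count as "top".
SELECT TOP (5) PERCENT EmployeeID, Region, SalesAmount
FROM dbo.EmployeeSales
ORDER BY SalesAmount DESC;

-- Top five percent within a single region ('West' is a made-up value).
SELECT TOP (5) PERCENT EmployeeID, SalesAmount
FROM dbo.EmployeeSales
WHERE Region = 'West'
ORDER BY SalesAmount DESC;
```

When the percentage works out to a fractional number of rows, SQL Server rounds up to the next whole row.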

UDFs and STRING_AGG

Erik Darling has a bone to pick with STRING_AGG():

If you’re like me and you got excited by the induction of STRING_AGG into the T-SQL Lexicon because of all the code odd-balling it would replace, you were likely also promptly disappointed for a few reasons.

Read on for one post which covers all of those reasons. Even with that disappointment, I’m still happy with STRING_AGG() on the whole, myself. There are some extra steps it’d be nice to eliminate in certain circumstances, but 60% of the time, it works every time.

Unique Constraints vs Unique Indexes

Erik Darling calls out unique key constraints:

I do love appropriately applied uniqueness. It can be helpful not just for keeping bad data out, but also for helping the optimizer reason about how many rows might qualify when you join or filter on that data.

The thing is, I disagree a little bit with how most people set them up, which is by creating a unique constraint.

Data modeling Kevin wants to use unique key constraints because that’s the correct thing to do. Implementation Kevin uses unique nonclustered indexes for the reasons Erik describes. Not mentioned in Erik’s post but potentially relevant is that operations on unique nonclustered indexes can be done online, whereas unique key constraint operations (creation and alteration via drop+create) are offline.
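
The two approaches can be sketched side by side in T-SQL. The dbo.Users table and the object names below are hypothetical. The INCLUDE column is one example of something a unique index allows and a unique constraint does not, and is not necessarily the reason Erik gives.

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE dbo.Users
(
    UserID      int           NOT NULL PRIMARY KEY,
    Email       nvarchar(320) NOT NULL,
    DisplayName nvarchar(100) NOT NULL
);

-- Option A: a unique key constraint. SQL Server backs it with a unique index,
-- but the constraint syntax has no room for INCLUDE columns or a filter.
ALTER TABLE dbo.Users
    ADD CONSTRAINT UQ_Users_Email UNIQUE (Email);

-- Remove it so both options are not enforced at once.
ALTER TABLE dbo.Users
    DROP CONSTRAINT UQ_Users_Email;

-- Option B: a unique nonclustered index enforcing the same rule,
-- here also covering DisplayName.
-- On editions that support online index operations,
-- WITH (ONLINE = ON) can be appended to this statement.
CREATE UNIQUE NONCLUSTERED INDEX IX_Users_Email
    ON dbo.Users (Email)
    INCLUDE (DisplayName);
```

Either option rejects a second row with the same Email value. The difference lies in how the object is declared and which index options are available.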

Partitioning vs Bucketing in Hive

The Hadoop in Real World team explains the difference between partitioning and bucketing in Apache Hive tables:

Now let’s say you also filter the sales record by sku (stock-keeping unit, a.k.a. barcode) in addition to sale_date and country. Creating a partition on sku will result in many partitions, which is not ideal as it might result in uneven and smaller partitions.

Hadoop is not efficient in processing small volumes of data. There is a better way.

Read on to understand when each technique makes sense.
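
As a rough sketch of how the two techniques can combine, the HiveQL below partitions on the low-cardinality columns from the excerpt and buckets on sku. It assumes that bucketing is the "better way" the excerpt alludes to. The table layout, the bucket count of 32, and the ORC storage format are illustrative choices, not taken from the post.

```sql
-- HiveQL sketch; column list and bucket count are assumptions for illustration.
CREATE TABLE sales (
    sale_id BIGINT,
    sku     STRING,
    amount  DECIMAL(10,2)
)
-- One directory per (sale_date, country) combination.
PARTITIONED BY (sale_date STRING, country STRING)
-- Within each partition, rows are hashed on sku into a fixed number of files,
-- which avoids creating one tiny partition per sku.
CLUSTERED BY (sku) INTO 32 BUCKETS
STORED AS ORC;
```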
