Press "Enter" to skip to content

Category: Syntax

Grouping And Aggregating In SQL, R, And Python

Dejan Sarka has a few examples of aggregation in different languages, including SQL, R, and Python:

The query calculates the coefficient of variation (defined as the standard deviation divided the mean) for the following groups, in the order as they are listed in the GROUPING SETS clause:

  • Country and education – expression (g.EnglishCountryRegionName, c.EnglishEducation)
  • Country only – expression (g.EnglishCountryRegionName)
  • Education only – expression (c.EnglishEducation)
  • Over all dataset- expression ()

Note also the usage of the GROUPING() function in the query. This function tells you whether the NULL in a cell comes because there were NULLs in the source data and this means a group NULL, or there is a NULL in the cell because this is a hyper aggregate. For example, NULL in the Education column where the value of the GROUPING(Education) equals to 1 indicates that this is aggregated in such a way that education makes no sense in the context, for example aggregated over countries only, or over the whole dataset. I used ordering by NEWID() just to shuffle the results. I executed query multiple times before I got the desired order where all possibilities for the GROUPING() function output were included in the first few rows of the result set. Here is the result.

GROUPING SETS is an underappreciated bit of SQL syntax.

Comments closed

OFFSET – FETCH Versus ROWNUM In Oracle

Lukas Eder compares the OFFSET FETCH logic versus using ROWNUM for grabbing an ordered sub-selection of rows in Oracle:

Now, while the SQL transformation from FETCH FIRST to ROW_NUMBER() filtering is certainly correct, the execution plan doesn’t really make me happy. Consider the ROWNUM based query:

---------------------------------------------------------
| Id  | Operation                     | Name    | Rows  |
---------------------------------------------------------
|   0 | SELECT STATEMENT              |         |       |
|*  1 |  COUNT STOPKEY                |         |       |
|   2 |   VIEW                        |         |     1 |
|   3 |    TABLE ACCESS BY INDEX ROWID| FILM    |  1000 |
|   4 |     INDEX FULL SCAN           | PK_FILM |     1 |
---------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter(ROWNUM=1)

And compare that to the FETCH FIRST query:

-------------------------------------------------
| Id  | Operation                | Name | Rows  |
-------------------------------------------------
|   0 | SELECT STATEMENT         |      |       |
|*  1 |  VIEW                    |      |     1 |
|*  2 |   WINDOW SORT PUSHED RANK|      |  1000 |
|   3 |    TABLE ACCESS FULL     | FILM |  1000 |
-------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=1)
   2 - filter(ROW_NUMBER() OVER ( ORDER BY "FILM"."FILM_ID")<=1)

Lukas digs into this and is not the biggest fan of OFFSET-FETCH.  On the SQL Server side, my anecdotal experience has been that it doesn’t perform nearly as well as you’d like either.

Comments closed

Strings And Identifiers

Kenneth Fisher explains the difference between a string and an identifier:

A common mistake, and one I make frequently myself is to use a string in place of an identifier, or vise-versa. So to start, let’s have some definitions, shall we?

String

a linear sequence of characters, words, or other data.

Identifier

a sequence of characters used to identify or refer to a program or an element, such as a variable or a set of data, within it.

And because I always find examples fairly useful.

Click through for the example as well as additional explanation.

Comments closed

Identity Columns And Linked Servers

Kenneth Fisher points out an oddity when inserting data across a linked server into a table with an identity column:

So far so good. Now let’s throw in a twist. Let’s call it through a linked server.

INSERT INTO [(local)\sql2014cs].Test.dbo.IdentTest 
	VALUES ('Col1','Col2');

Msg 213, Level 16, State 1, Line 4
Column name or number of supplied values does not match table definition.

Well that’s a bit odd, right? I mean I used that exact command in the previous test. Turns out that when you do an insert across a linked server that identity column is not ignored. Which means we just need to include the identity value right? Nope.

INSERT INTO [(local)\sql2014cs].Test.dbo.IdentTest 
	VALUES (1,'Col1','Col2');

Msg 7344, Level 16, State 1, Line 4
The OLE DB provider “SQLNCLI11” for linked server “(local)\sql2014cs” could not INSERT INTO table “[(local)\sql2014cs].[Test].[dbo].[IdentTest]” because of column “Id”. The user did not have permission to write to the column.

Click through to see how to do this.

Comments closed

There’s Only One Way To Order

Matthew McGiffen notes that there is only one way to order, and that is to use the ORDER BY clause:

Everyone, at the beginning of their SQL career, get’s told that it is important to include an ORDER BY if they want the results ordered. Otherwise the order in which they are returned is not guaranteed.

But then you run queries a lot of times that don’t need a specific order – and you see that they (at least seem to) come out in the same order every time. You could (almost) be forgiven for thinking you can rely on that.

There was even a question on a Microsoft SQL certification exam a few years ago that asked what the default order was for records returned by a simple SELECT – the answer it was looking for was that it would be according to the order of the clustered index. So you can rely on that – right?

Wrong. The question was a bad question, and the answer was incorrect. Let’s look at this in action.

Order is never guaranteed to be stable unless you specify a unique ordering using ORDER BY.

Comments closed

Deferred Name Resolution In SQL Server

Kendra Little explains the concept of deferred name resolution in SQL Server:

In this case, I’m creating a temporary stored procedure (out of laziness, it means I don’t have to clean up a quick demo) –

CREATE OR ALTER PROCEDURE #test
AS
IF 1=0
    EXECUTE dbdoesnotexist.dbo.someproc;
GO

The database dbdoesnotexist does NOT exist, but I’m still allowed to create the procedure.

When I do so, I get an informational message:

The module ‘#test’ depends on the missing object ‘dbdoesnotexist.dbo.someproc’. The module will still be created; however, it cannot run successfully until the object exists.

This can be useful in some cases where you’ll be querying a table or procedure that may not exist all the time, but which will exist when a certain code block is run.

But, as Kendra points out, deferred name resolution doesn’t work everywhere, so it’s important to know the rules around when it will or will not work.

Comments closed

T-SQL Join Delete

Steve Stedman walks us through a bit of T-SQL proprietary syntax:

1
2
3
DELETE t2
FROM [dbo].[Table1] t1
INNER JOIN [dbo].[Table2] t2 on t1.favColor = t2.id;

Names have been changed to protect the innocent.

In the above delete statement which table will have rows deleted from it?

A: Table1

B: Table2

C: Both Table1 and Table2

D: Neither Table1 and Table2

Got it in one.  I like having this syntax available to me when I need it, even though it’s not ANSI standard.

Comments closed

Date And Time Functions To Avoid

Randolph West shares his thoughts on three functions he’d rather you avoid:

CURRENT_TIMESTAMP is the ANSI-equivalent of GETDATE(). ANSI is an acronym for the American National Standards Institute, and sometimes vendors will include ANSI functions in their products so they can say that they’re ANSI-compliant (which is not a bad thing, in most cases).

There are three main problems with CURRENT_TIMESTAMP:

  • No brackets. It goes against the rules about functions. So much for standards!
  • It’s functionally equivalent to GETDATE(), which uses DATETIME, which we previously identified is old and bad.
  • It’s too similar to the poorly-named TIMESTAMP data type, which has nothing to do with dates and times and should be called ROWVERSION.

Bottom line: don’t use CURRENT_TIMESTAMP.

At one point I used CURRENT_TIMESTAMP over GETDATE() with the thought of portability in mind.  Since then, my thoughts on code portability have changed and regardless, as Randolph mentions, it’s better to use DATETIME2 functions to avoid precision issues with DATETIME.

Comments closed