Curated SQL Posts

Why Care About The Data?

Rob Collie explains some of the benefits of Power BI with respect to its ability to dig into data problems:

Many problems/opportunities can ONLY be addressed by making thousands of small behavior changes, on an ongoing basis, out in the “trenches.”  In other words, there are some which CANNOT be addressed by a small number of smart decisions made at the top of the org.  Improved pricing behavior by the sales force for instance.  Improved quality of service by field technicians, for another.

How do you make that happen?  How do you influence a large number of people to behave differently on an ongoing basis, especially when they operate at a distance (both organizationally and physically) from you?

Fancy management consulting firms charge six-seven figures to do this, but with Power BI and/or Power Pivot, it’s no longer necessary to make that outlay.  We’ve seen organizations crack this nut with in-house resources, modest budgets, and short timelines – even though the impact is absolutely enterprise-wide.  It takes some thought and iteration, for sure, but primarily, you “just” need to build a good scorecard.

This is a non-technical post, but I like it a lot because it helps motivate decisions and gives you a good reason to investigate whether your tools are doing the work you need them to do.

Batches And Stored Procedure Creation

Steve Jones has a warning for when you create a stored procedure:

Why is my select code in there? That was designed to be a piece of test code. Shouldn’t the BEGIN..END after the AS define my procedure?

Actually, it doesn’t. The procedure doesn’t end until the CREATE PROCEDURE statement is terminated, and that termination comes from ending the batch. The CREATE PROCEDURE documentation notes this limitation:

The CREATE PROCEDURE statement cannot be combined with other Transact-SQL statements in a single batch.

This means that anything else you have in that batch will be considered as part of the procedure, regardless of BEGIN..END.

Judicious use of the GO batch separator can help keep you out of trouble.
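To make the trap concrete, here’s a minimal sketch (the table and procedure names are invented for illustration).  In the first script, the test query silently becomes part of the procedure; adding GO immediately after END closes the batch and keeps it separate.

-- The SELECT after END still becomes part of dbo.GetOrders,
-- because nothing terminates the CREATE PROCEDURE batch before it.
CREATE PROCEDURE dbo.GetOrders
AS
BEGIN
    SELECT OrderID, OrderDate
    FROM dbo.Orders;
END

-- Intended as a piece of test code, but it compiles into the procedure body:
SELECT COUNT(*) FROM dbo.Orders;
GO

-- The fix: end the batch right after the procedure definition.
-- (CREATE OR ALTER requires SQL Server 2016 SP1 or later.)
CREATE OR ALTER PROCEDURE dbo.GetOrders
AS
BEGIN
    SELECT OrderID, OrderDate
    FROM dbo.Orders;
END
GO

SELECT COUNT(*) FROM dbo.Orders;
GO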

Partitioned Views With Polybase

I look at using SQL 2000-style partitioning with a set of external tables:

Using a view, we were able to create a “partitioned” Polybase experience, similar to what we had in SQL Server 2000.  This form of poor man’s partitioning allows us to segment out data sets and query them independently, something which can be helpful when storing very large amounts of data off-site and only occasionally needing to query it.  The thing to remember, though, is that if you store this in Azure Blob Storage, you will need to pull down the entire table’s worth of data to do any processing.

This leads to a concept I first heard from Ginger Grant:  pseudo-StretchDB.  Instead of paying for what Stretch offers, you get an important subset of the functionality at a much, much lower price.  If you do store the data in Azure Blob Storage, you’re paying pennies per gigabyte per month.  For cold storage, like a scenario in which you need to keep data around to keep the auditors happy but your main application doesn’t use that information, it can work fine.  But if you need to query this data frequently, performance might be a killer.

For Polybase tables which cannot perform external pushdown, coming up with a good partitioning strategy is probably one of the two best ways to improve performance; the other is creating a Polybase scale-out cluster.
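If you want to try this out, here’s a rough sketch of what the view looks like; the external table names are hypothetical and assumed to already point at separate folders in blob storage.

-- Each branch covers one year of data from a separate external table.
CREATE VIEW dbo.SalesCombined AS
    SELECT SaleID, SaleDate, Amount
    FROM dbo.SalesExternal2015
    WHERE SaleDate >= '2015-01-01' AND SaleDate < '2016-01-01'
    UNION ALL
    SELECT SaleID, SaleDate, Amount
    FROM dbo.SalesExternal2016
    WHERE SaleDate >= '2016-01-01' AND SaleDate < '2017-01-01';
GO

-- With a predicate on the partitioning column, the optimizer can skip
-- branches whose filters contradict it, so only one external table gets read.
SELECT SUM(Amount) AS TotalSales
FROM dbo.SalesCombined
WHERE SaleDate >= '2016-06-01' AND SaleDate < '2016-07-01';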

Nested Sets

Nate Johnson explains the nested sets model:

Put another way, the #3 rule is that you should always operate on the tree (CrUD ops) using stored-procedures and/or triggers that encapsulate all the nitty-gritty details of maintaining the correct position values during said insert/update/delete operations.  Of course, somebody is responsible for writing those stored-procs.  Any volunteers?  Easy now, don’t raise your hands all at once!  Generally, this responsibility falls to the DBA(s) or DBDev(s).

The problem at-hand, in my current situation, was that of “moving a sub-tree”, i.e. taking a node and all its descendants, and moving it to place it under another “parent” node.  In some models, and/or in some languages, this is a simple recursive operation.  However, SQL is not spectacular at recursion — after all, we’re working in a relational engine — so let’s try to play to its strengths:

This is a straightforward look at one of the major hierarchical models in relational design.  Well worth a look.
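For a feel of the layout, here’s a tiny invented example (not Nate’s schema): each node stores left and right bounds, and finding a subtree is just a range comparison, no recursion required.

CREATE TABLE dbo.OrgNode
(
    NodeID   int         NOT NULL PRIMARY KEY,
    NodeName varchar(50) NOT NULL,
    Lft      int         NOT NULL,   -- left bound
    Rgt      int         NOT NULL    -- right bound
);

INSERT INTO dbo.OrgNode (NodeID, NodeName, Lft, Rgt) VALUES
    (1, 'CEO',         1, 10),
    (2, 'CTO',         2, 7),
    (3, 'Dev Manager', 3, 4),
    (4, 'Ops Manager', 5, 6),
    (5, 'CFO',         8, 9);

-- All descendants of the CTO: rows whose bounds fall inside the CTO's bounds.
SELECT child.NodeName
FROM dbo.OrgNode AS parent
    INNER JOIN dbo.OrgNode AS child
        ON child.Lft > parent.Lft
       AND child.Rgt < parent.Rgt
WHERE parent.NodeName = 'CTO';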

TempDB And Parallelism

Kendra Little looks at cases when a query uses multiple tempdb data files:

As you might guess, things may not always get evenly accessed, even if you have evenly sized tempdb files. One of my queries did a select into a temp table. Although it used all four tempdb files whether or not it went parallel, there were more file_read events against the first tempdb file than against the other three.

It’s an interesting look at this specific question, but also a nice example of pedagogical technique.
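Kendra traces the per-file activity with Extended Events; a cruder way to get a similar picture (a sketch, not her method) is to sample cumulative per-file I/O stats from the DMVs before and after the query you care about.

-- Reads and writes per tempdb data file since the instance started;
-- sample it before and after your query and diff the numbers.
SELECT
    df.name                  AS tempdb_file,
    vfs.num_of_reads,
    vfs.num_of_bytes_read,
    vfs.num_of_writes,
    vfs.num_of_bytes_written
FROM sys.dm_io_virtual_file_stats(DB_ID('tempdb'), NULL) AS vfs
    INNER JOIN tempdb.sys.database_files AS df
        ON vfs.file_id = df.file_id
WHERE df.type_desc = 'ROWS';   -- data files only, not the log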

BatchMode Execution

Sunil Agarwal describes BatchMode execution with columnstore indexes:

You may be wondering: what is this magic number of 900 rows within a batch? Well, when executing a query in BatchMode, SQL Server allocates a 64 KB structure to group the rows. The number of rows in this structure can vary between 64 and 900, depending upon the number of columns selected. For the example above, there are two columns that are referenced, and X marks the rows that qualified in the BatchMode structure shown in the picture below. If SCAN is part of a bigger query execution tree, the pointer to this structure is passed to the next operator for further processing. Not all operators can be executed in BatchMode. Please refer to Industry leading analytics query performance for details on BatchMode operators.

Under the right circumstances, BatchMode execution can be a major performance benefit.
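If you want to see it in your own plans, here’s a small sketch (the table is invented): a clustered columnstore index plus an aggregate query is a typical batch mode candidate, and the scan and hash aggregate operators in the actual execution plan should report an Actual Execution Mode of Batch when it kicks in.

-- Clustered columnstore index defined inline (SQL Server 2016+ syntax).
CREATE TABLE dbo.SalesFact
(
    SaleDate date  NOT NULL,
    StoreID  int   NOT NULL,
    Amount   money NOT NULL,
    INDEX CCI_SalesFact CLUSTERED COLUMNSTORE
);
GO

-- Aggregation over the columnstore is where BatchMode typically shows up.
SELECT StoreID, SUM(Amount) AS TotalAmount
FROM dbo.SalesFact
GROUP BY StoreID;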

Virtual Function Calls

Ewald Cress is thinking about virtual function calls:

A virtual function call, on the other hand, is only resolved at runtime. The compiler literally does not know what address is going to get called, and neither does the runtime except in the heat of the moment, because that is going to depend on the type of the object instance that the function is called on. Bear with me, I’ll try and simplify.

A C++ object is just a little chunk of memory: a bunch of related instance variables if you like. All objects of the same class have the same structure in this regard. If you’re wondering about functions (a.k.a. methods), these belong to the class, or put differently, to ALL objects of that class. Once compiled, each method is a chunk of memory with a known address, containing the compiled instructions.

From there, it’s a harrowing journey through bigger layers of indirection.

Spark 2.1

Reynold Xin announces Apache Spark 2.1:

  • Structured Streaming

    Introduced in Spark 2.0, Structured Streaming is a high-level API for building continuous applications. The main goal is to make it easier to build end-to-end streaming applications, which integrate with storage, serving systems, and batch jobs in a consistent and fault-tolerant way.

    • Event-time watermarks: This change lets applications hint to the system when events are considered “too late” and allows the system to bound internal state tracking late events.

    • Support for all file-based formats and all file-based features: With these improvements, Structured Streaming can read and write all file-based formats, e.g. JSON, text, Avro, CSV. In addition, all file-based features—e.g. partitioned files and bucketing—are supported on all formats.

    • Apache Kafka 0.10: This adds native support for Kafka 0.10, including manual assignment of starting offsets and rate limiting.

This is a pretty hefty release.  Click through to read the whole thing.
