2016-11-11 – Curated SQL

External File Formats

Published 2016-11-11 by Kevin Feasel

Delimited text is exactly as it sounds: you can use a comma, tab, pipe, tilde, or any other delimiter (including multi-character delimiters). So let’s go through the options here. First, FORMAT_TYPE must be DELIMITEDTEXT. From there, we have a few FORMAT_OPTIONS. I mentioned FIELD_TERMINATOR, which is how we separate the values in a record. We can also use STRING_DELIMITER if there are quotes or other markers around our string values.

DATE_FORMAT makes it easier for Polybase to understand how dates are formatted in your file. The MSDN document gives you hints on how to use specific date formats, but you can’t define a custom format today, or even use multiple date formats.

It feels like there’s a new Hadoop file format every day.

Comments closed

Temporal Tables For Error Recovery

Published 2016-11-11 by Kevin Feasel

Dimitri Furman looks at using temporal tables to recover from user errors:

Have you ever got that sinking feeling after hitting the Execute button in SSMS, thinking “I should not have done that”? DML statements with missing WHERE clause, DROP statements accidentally targeting slightly mistyped (but existing) tables or databases, RESTORE statements overwriting databases with new data that haven’t been backed up, are all examples of actions prompting an “Oops…” (or worse) shortly thereafter. “Oops recovery” is the term that became popular to describe the process of fixing the consequences.

For most of these scenarios, the usual, and often the only, recovery mechanism is to restore the database from backup to a point in time just before the “oops”, known as point-in-time recovery (PITR). Even though PITR remains the most general and the most effective recovery mechanism, it does have some drawbacks and limitations: the recovery process requires a full database restore, taking the time proportional to the size of the database; a sequence of restores may be needed if multiple “oops” transactions have occurred; in the general case, there will be difficulties reconciling recovered data with data modified after the “oops” point in time, etc. Nevertheless, PITR remains the most widely applicable recovery method for SQL Server databases, both on-premises and in the cloud.

The solution entails a non-trivial amount of effort, but it is interesting.

Comments closed

CHECKDB Snapshots

Published 2016-11-11 by Kevin Feasel

Arun Sirpal shows that the CHECKDB snapshot is no longer hidden:

I am in the middle of some research regarding CHECKDB and learnt something new. From SQL Server 2014 onwards the database snapshot that is used to get to a transactionally-consistent point-in-time is no longer hidden.

Arun includes a sample to prove it.

Comments closed

Sparklyr On HDInsight

Published 2016-11-11 by Kevin Feasel

Ali Zaidi has a walkthrough on using sparklyr on HDInsight:

The majority of Spark is written in Scala (~80% of Spark core), which is a functional programming language. Functional programming languages emphasize functional purity (the output only depends on the inputs) and strive to avoid side-effects. One important component of most functional programming languages is their lazy evaluation. While it might seem odd that we would appreciate laziness from our computing tools, lazy evaluation is an effective way of ensuring computations are evaluated in the most efficient manner possible.

Lazy evaluation allows Spark SQL to highly optimize the queries. When a user submits a query to Spark SQL, Spark composes the components of the SQL query into a logical plan. The logical plan is basically a recipe Spark SQL creates in order to evaluate the desired query. Spark SQL then submits the logical plan to its highly optimized engine called Catalyst, which optimizes this plan into a physical plan of action that is executed inside Spark computation engine (a series of coordinating JVMs).

Read on for more description and code.

Comments closed

Performance Of IN

Published 2016-11-11 by Kevin Feasel

Daniel Janik looks at how the IN clause behaves differently based on the number of items in the list:

As you can see the second query is much slower and the extra value in the IN caused late filtering. This is a limitation on some types of operators such as this clustered index scan.

There isn’t just a limitation of 15 input values. There’s also one at 64. On the 65th input value the list will be converted to a constant scan which is then sorted and joined. Most interestingly enough is that the list in my demo query is already sorted ascending.

Read the whole thing.

Comments closed

Bulk Administration

Published 2016-11-11 by Kevin Feasel

Kenneth Fisher discusses the bulk administration right:

So as with all permissions we only grant them if there is an actual need right? And the best practice of least privilege says that if someone has to be able to do a bulk load on a table then we should grant the bulk load to that one table right? There’s the rub. Bulk admin permissions are at the instance level and are not granular in any way. Ie you can’t grant it specifically to a single database or table. It’s all or nothing.

Read on for Kenneth’s thoughts.

Comments closed

SSRS + Power BI Desktop

Published 2016-11-11 by Kevin Feasel

Andrew Peterson walks through the steps to check out the SSRS 2016 preview which supports Power BI Desktop:

SSRS 2016 supporting Power BI Desktop reports is now in preview on Azure. But for many of us, we’d rather be able to review this in our own virtual environment, and more specifically – VirtualBox. We’ll now you can.

Our starting point was a blog posting my Microsoft employee Christopher Finlan outlining the steps needed to setup this preview in a Hyper-V environment. A great start, but what we wanted was the ability to run it Virtual Box. Fortunately for us, running the downloaded VHD in VirtualBox is much easier than Hyper-V.

Click through for the instructions.

Comments closed

Brackets Don’t Improve Performance

Published 2016-11-11 by Kevin Feasel

Jay Robinson shows that wrapping identifiers with brackets does nothing for performance:

Anyway, this obsession had me thinking – does wrapping identifiers in square brackets save SQL Server any time? Does it say to the optimizer, “Hey, I PROMISE this whole thing inside these square brackets is an identifier. Cross my heart.” And the optimizer takes your code at its word and doesn’t look through its list of reserved keywords for one that matches AccountCreateDate or address_line_2?

The answer is… no. Throwing every identifier into square brackets doesn’t speed it up at all. Here’s the test:

Read on for the test.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30

Day: November 11, 2016