
Curated SQL Posts

Measure-Object Differences

Klaas Vandenberghe notes a discrepancy in how Measure-Object works, based on syntax:

So, is the -InputObject parameter broken? Not really, but we need to be aware of the logic behind it.
Maybe we are used to working with -ComputerName for a lot of cmdlets and functions, and we rely blindly upon the ability of the command to handle whatever collection we provide. We know this executes the action we chose, like Get-DbaSqlService or Get-DbaOperatingSystem or whatever, separately against every computer in the collection. The collection is ‘folded out’, ‘unpacked’, ‘split’.

-InputObject doesn’t do that! Is that wrong? Not necessarily; it may be a deliberate design choice, so that you can enquire into the properties of the collection itself as well as those of its ‘members’ or ‘children’. It’s just a surprise that the behaviour differs between pipeline input and parameter input.

Something to keep in mind when writing PowerShell scripts.


Inference Attacks

Phil Factor explains that your technique for pseudonymizing data doesn’t necessarily anonymize the data:

It is possible to mine data for hidden gems of information by looking at significant patterns of data. Unfortunately, this sometimes means that published datasets can reveal sensitive data when the publisher didn’t intend it, or even when they tried to prevent it by suppressing any part of the data that could enable individuals to be identified.

Using creative querying, linking tables in ways that weren’t originally envisaged, as well as using well-known and documented analytical techniques, it’s often possible to infer the values of ‘suppressed’ data from the values provided in other, non-suppressed data. One man’s data mining is another man’s data inference attack.

Read the whole thing.  One big problem with trying to anonymize data is that you don’t know how much the attacker already knows.  Especially with outliers or small samples, you may be able to glean interesting information from a series of queries.  Even if the application only returns aggregates over groups of at least N rows, you can often put together a set of queries that slice the population in different ways until you tease out details on an individual.  Phil covers these types of inference attacks.
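As a toy illustration of the general idea (the table, columns, and numbers below are invented, not taken from Phil's article), two innocuous-looking aggregate queries can pin down a single person's value:

```sql
-- Suppose a published reporting endpoint only ever returns aggregates.
-- Hypothetical table and values, purely for illustration.
SELECT COUNT(*) AS Headcount, AVG(Salary) AS AvgSalary
FROM dbo.StaffSalaries
WHERE Department = 'Finance';
-- Say this returns: 10 people, average 50,000

SELECT COUNT(*) AS Headcount, AVG(Salary) AS AvgSalary
FROM dbo.StaffSalaries
WHERE Department = 'Finance'
  AND StartDate < '20170101';
-- Say this returns: 9 people, average 48,000

-- If you know one person joined Finance in 2017, their salary must be
-- 10 * 50,000 - 9 * 48,000 = 68,000, even though no query ever exposed an individual row.
```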


Azure SQL Database Deployment Account Errors

Steve Jones troubleshoots an issue with Azure SQL Database:

 I’ve had most builds work really well. I tried a number of things, but kept getting a few items in the build. There were login errors or network errors, both of which bothered me since I could manually log in with SSMS from the same machine as my build agent.

I suspected a few things here, one of which was the use of named pipes for the Shadow database and TCP for Azure SQL Database.

Eventually, I decided to fall back to msbuild, ignoring VSTS, and make sure all my parameters were correct.

Read on for the rest of the story.


Combining Densities

Paul White explains how the SQL Server cardinality estimator will build an estimate involving multiple single-column statistics:

The task of estimating the number of rows produced by a GROUP BY clause is trivial when only a single column is involved (assuming no other predicates). For example, it is easy to see that GROUP BY Shelf will produce 21 rows; GROUP BY Bin will produce 62.

However, it is not immediately clear how SQL Server can estimate the number of distinct (Shelf, Bin) combinations for our GROUP BY Shelf, Bin query. To put the question in a slightly different way: Given 21 shelves and 62 bins, how many unique shelf and bin combinations will there be? Leaving aside physical aspects and other human knowledge of the problem domain, the answer could be anywhere from max(21, 62) = 62 to (21 * 62) = 1,302. Without more information, there is no obvious way to know where to pitch an estimate in that range.

Yet, for our example query, SQL Server estimates 744.312 rows (rounded to 744 in the Plan Explorer view) but on what basis?

Read on for debugger usage, Shannon entropy calculations, and all kinds of other fun stuff.
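If you want to poke at the raw ingredients yourself, the single-column density vectors show where the 21 and 62 come from. This is a sketch, assuming the AdventureWorks Production.ProductInventory table with its Shelf and Bin columns; the statistics are created explicitly here so their names are known, rather than relying on whatever auto-created statistics Paul inspects:

```sql
-- Single-column statistics on the two grouping columns.
CREATE STATISTICS stat_Shelf ON Production.ProductInventory (Shelf);
CREATE STATISTICS stat_Bin   ON Production.ProductInventory (Bin);

-- "All density" in the density vector is 1 / (number of distinct values),
-- i.e. 1/21 for Shelf and 1/62 for Bin per the counts quoted above. These are
-- the inputs the optimizer combines to estimate the two-column GROUP BY.
DBCC SHOW_STATISTICS ('Production.ProductInventory', stat_Shelf) WITH DENSITY_VECTOR;
DBCC SHOW_STATISTICS ('Production.ProductInventory', stat_Bin)   WITH DENSITY_VECTOR;

-- The two-column grouping whose 744.312-row estimate Paul dissects:
SELECT Shelf, Bin
FROM Production.ProductInventory
GROUP BY Shelf, Bin;
```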


Thinking About Implicit Conversions

Bert Wagner shows how implicit conversions in a predicate can ruin query performance:

Why? Because SQL is performing that implicit conversion to the numeric datatype for every single row in my table. Hence, it can’t seek using the index because it ends up having to scan the whole table to convert every record to a number first.

And this doesn’t only happen with numbers and string conversion. Microsoft has posted an entire chart detailing what types of data type comparisons will force an implicit conversion:

This is one of those things that can easily elude you because the query will often return results in line with what you expect, so until you have a performance problem, you might not even think to check.
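A minimal way to see the effect for yourself; the table and column here are hypothetical, and the important part is a string column compared against a numeric literal:

```sql
-- PhoneNumber is stored as VARCHAR. A numeric literal has higher data type
-- precedence, so SQL Server must CONVERT_IMPLICIT every row's PhoneNumber to a
-- number before comparing, which rules out an index seek (and errors if any
-- value isn't numeric).
SELECT CustomerID
FROM dbo.Customers
WHERE PhoneNumber = 5551234567;      -- implicit conversion: scan

-- Matching the column's type lets the optimizer seek an index on PhoneNumber.
SELECT CustomerID
FROM dbo.Customers
WHERE PhoneNumber = '5551234567';
```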


Missing Index DMV Limitations

Brent Ozar goes into detail on why you should not blindly trust missing index recommendations in SQL Server:

SQL Server’s telling us that it needs an index to do an equality search on LastAccessDate – because our query says LastAccessDate = ‘2016/11/10’.

But in reality, that’s not how you access datetime fields because it won’t give you everyone who accessed the system on 2016/11/10 – it only gives you 2016/11/10 00:00:00. Instead, you need to see everyone on that day, like this:
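Brent's actual query isn't reproduced above; below is a sketch of the half-open date-range pattern he's describing. LastAccessDate comes from the excerpt, while the table and other column names are assumed for illustration.

```sql
-- Equality on a datetime only matches midnight exactly; a half-open range
-- catches every access during the day and can still use an index seek.
SELECT Id, DisplayName
FROM dbo.Users
WHERE LastAccessDate >= '20161110'
  AND LastAccessDate <  '20161111';
```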

Read the whole thing.  The crux of this is that the missing index recommendation process only gets to see what you’re running at the time you run it, so it can’t generalize all that well; that’s your job.


Selecting Into A Specific Filegroup

Andrew Pruski shows off a new feature in SQL Server 2017:

Now I can run the SELECT…INTO statement using the new ON FILEGROUP option. I’m going to run an example SELECT statement to capture Sales in the UK: –
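Andrew's script isn't reproduced above, but a minimal sketch of the SQL Server 2017 syntax looks like this; the database, filegroup, file, and table names are illustrative:

```sql
-- Add a filegroup and a file to hold the new table (names made up for the sketch).
ALTER DATABASE SalesDB ADD FILEGROUP FG_UKSales;
ALTER DATABASE SalesDB ADD FILE
    (NAME = N'UKSales_Data', FILENAME = N'D:\SQLData\UKSales_Data.ndf')
    TO FILEGROUP FG_UKSales;

-- New in SQL Server 2017: SELECT ... INTO can target a specific filegroup.
SELECT *
INTO dbo.SalesUK ON FG_UKSales
FROM dbo.Sales
WHERE Country = N'UK';
```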

We are one step closer to CTAS on-prem…  Being able to select into a specific filegroup is nice when you want to segregate tables by filegroup to make recovery of the most critical tables faster: you might have a primary filegroup, then a filegroup for the tables your application needs first, followed by the history tables and other large tables that the app doesn’t need immediately.


Isolation Level Basics

Randolph West describes the primary isolation levels in SQL Server:

There are four isolation levels in SQL Server (as quoted from SQL Server Books Online):

  • Read uncommitted (the lowest level where transactions are isolated only enough to ensure that physically corrupt data is not read)
  • Read committed (Database Engine default level)
  • Repeatable read
  • Serializable (the highest level, where transactions are completely isolated from one another)

Read on for a discussion of what these mean, as well as how optimistic versus pessimistic concurrency (in this case, Read Committed Snapshot Isolation versus Read Committed) comes into play.
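As a point of reference, here is roughly how you set one of the pessimistic levels for a session, alongside the database option that swaps READ COMMITTED for its optimistic (RCSI) cousin; the database and table names below are made up:

```sql
-- Pessimistic: shared locks are held to the end of the transaction, so rereads
-- of the same rows are guaranteed to return the same values.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN TRANSACTION;
    SELECT Quantity FROM dbo.Inventory WHERE ProductID = 42;  -- hypothetical table
    -- ... other work; reading ProductID 42 again here returns the same value ...
COMMIT;

-- Optimistic: READ COMMITTED readers use row versions instead of shared locks.
-- (Requires no other active connections, or add WITH ROLLBACK IMMEDIATE.)
ALTER DATABASE SalesDB SET READ_COMMITTED_SNAPSHOT ON;
```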


The Data Lake From 10,000 Feet

Pradeep Menon has a high-level explanation of what a data lake is and how it differs from traditional data warehouses:

With the changes in the data paradigm, a new architectural pattern has emerged. It’s called the Data Lake Architecture. Like the water in the lake, data in a data lake is in the purest possible form. Just as a lake caters to the needs of different people, whether they want to fish, take a boat ride, or get drinking water from it, a data lake architecture caters to multiple personas. It provides data scientists an avenue to explore data and create a hypothesis. It provides an avenue for business users to explore data. It provides an avenue for data analysts to analyze data and find patterns. It provides an avenue for reporting analysts to create reports and present to stakeholders.

The way I compare a data lake to a data warehouse or a mart is like this:

A Data Lake stores data in its purest form, caters to multiple stakeholders, and can also be used to package data in a form that can be consumed by end users. A Data Warehouse, on the other hand, is already distilled and packaged for defined purposes.

One way of thinking about this is that data warehouses are great for solving known business questions:  generating 10-K reports or other regulatory compliance reporting, building the end-of-month data, and viewing standard KPIs.  By contrast, the data lake is (among other things) for spelunking: trying to answer those one-off questions people seem to have, but for which the warehouse never seems to have quite the right set of information.


A New ODBC Package For R

David Smith looks at the odbc package in R:

The odbc package is a from-the-ground-up implementation of an ODBC interface for R that provides native support for additional data types (including dates, timestamps, raw binary, and 64-bit integers) and parameterized queries. The odbc package provides connections with any ODBC-compliant database, and has been comprehensively tested on SQL Server, PostgreSQL and MySQL. Benchmarks show that it’s also somewhat faster than RODBC: 3.2 times faster for reads, and 1.9 times faster for writes.

Sounds like odbc lets you run ad hoc queries and also lets you use dplyr as an ORM, similar to using LINQ in C#.
