Press "Enter" to skip to content

Author: Kevin Feasel

Sampling Data Lake Data

Alex Whittles shows how to use U-SQL to sample data to read in Power BI:

The answer is sampling, we don’t bring in 100% of the data, but maybe 10%, or 1%, or even 0.01%, it depends how much you need to reduce your dataset. It is however critical to know how to sample data correctly in order to maintain a level of accuracy of data in your reports.

Option 1: Take the top x rows of data
Don’t do it. Ever. Just no.
What if the source data you’ve been given is pre-sorted by product or region, you’d end up with only data from products starting with ‘a’, which would give you some wildly unpredictable results.

Option 2: Take a random % sample
Now we’re talking. This option will take, for example 1 in every 100 rows of data, so it’s picking up an even distribution of data throughout the dataset. This seems a much better option, so how do we do it?

Read on for a couple of sampling methods.

Comments closed

TempDB And TDE

Bob Ward troubleshoots an oddity around sys.databases marking tempdb as encrypted even when no user databases are encrypted:

In my test I did not hit the breakpoint. And furthermore, you will notice that when you query sys.dm_database_encryption_keys, there is no row for tempdb at all.  So our debugger breakpoint proves that tempdb is not permanently encrypted. Instead, if ALL user databases have TDE disabled and you restart SQL Server, tempdb is no longer encrypted. So instead of using sys.databases, use sys.dm_database_encryption_keys to tell which databases are truly enabled for encryption. I then verified my findings in the source code. Basically, we only enable encryption for tempdb if 1) ALTER DATABASE enables any user db for TDE 2) When we startup a user database and have encryption enabled. I also verified the behavior with my colleagues in the Tiger Team (thank you Ravinder Vuppula). We will look at fixing the issue with sys.databases in the future (ironically as I said earlier it was never enabled for tempdb before SQL Server 2016).

Read on for Bob Ward’s patented Debugger Fun.  My takeaway from this is that sys.dm_database_encryption_keys is valid, whereas sys.databases’s is_encrypted column might not be.

Comments closed

Comparing Column Names

Jen McCown has a script to compare column names between tables to find case inconsistencies:

I’m reviewing the code for the upcoming Minion CheckDB, and one of the things we’re checking for is case consistency in column names. For example, if Table1 has a column named Col1, and Table2 has COL1, well that’s no good.

But, how do we easily find those mismatches on a system that’s not case sensitive? Easy: collations.

Click through for the script.

Comments closed

Bad Habits: A Full Listing

Aaron Bertrand has provided an index to his bad habits series:

Here is an ongoing list of articles that I consider to be along these lines – either promoting best practices or eradicating bad habits; not all are explicitly framed as a “bad habit,” but they do all represent in some way things I wish I observed less often. Some of my opinions are controversial, and many have evoked very passionate comment threads – so I recommend scrolling down for those, too.

It’s a pretty long list.

Comments closed

UDFs And Recompilations

Erik Darling shows how to diagnose a high recompilation problem:

So yeah, that function seems to get up to something once for every ID you pass in. Remember that in our STUFF… query, we grabbed the TOP 10 each time. In the XE session, each time we call the proc, the string splitting function compiles and executes code 10 times. Bummerino. That’s the life of a loop.

On SQL Server 2016 (and really, with any non-looping code), we can get around the constant compilations with a simple rewrite. In this case, I’m calling 2016’s STRING_SPLIT function instead of the MSTVF function.

Read on for more.

Comments closed

U-SQL Deprecation Notices

Michael Rys has a couple pieces of U-SQL syntax which will be deprecated.  First is partition by bucket:

In the upcoming refresh, we are removing the deprecated syntax PARTITION BY BUCKET and will raise an error.

Thus, if you have not yet updated your table definitions with the previously announced new syntax, please do so now or your scripts will fail starting some day in February!

The second post involves credentials:

Back in October, we announced that we simplified the U-SQL Credentials by merging the password secrets that are being created in Powershell and the other parts of the credential object into credentials that are being completely created with a Powershell command. This reduces one statement from the creation process.

During the initial phase, we did provide support for both kinds of credential objects, and still supported the old syntax.

In the upcoming February refresh, we are now automatically migrating the existing old credentials into the new format and remove the CREATE CREDENTIAL, ALTER CREDENTIAL and DROP CREDENTIAL statements.

If you’re writing U-SQL code, you’ll want to read up on the ramifications and alternatives here.

Comments closed

Principal Component Analysis Using R

Francisco Lima explains what principal component analysis is and shows how to do it in R:

Three lines of code and we see a clear separation among grape vine cultivars. In addition, the data points are evenly scattered over relatively narrow ranges in both PCs. We could next investigate which parameters contribute the most to this separation and how much variance is explained by each PC, but I will leave it for pcaMethods. We will now repeat the procedure after introducing an outlier in place of the 10th observation.

PCA is extremely useful when you have dozens of contributing factors, as it lets you narrow in on the big contributors quickly.

Comments closed

Project Normalization In UDFs

Dmitry Pilugin looks into how the optimizer (using the 2014-and-on cardinality estimator) processes user-defined functions:

If we remember, for the CE 120 it was a one row estimate, and in this case server decided, that it is cheaper to use a non-clustered index and then make a lookup into clustered. Not very effective if we remember that our predicate returns all rows.

In CE 130 there was a 365 rows estimate, which is too expensive for key lookup and server decided to make a clustered index scan.

But, wait, what we see is that in the second plan the estimate is also 1 row!

That fact seemed to me very curious and that’s why I’m writing this post. To find the answer, let’s look in more deep details at how the optimization process goes.

This was an interesting look at how the optimizer looks at scalar user-defined functions.

Comments closed

Querying A Named Instance From Powershell

Steve Jones explains how to query a named instance in Powershell:

I was looking at some sample code the other day and it looked like this.

cd sqlserver:\sql\localhost\default\databases

This allows you to browse the list of databases on your local instance. However, this is for a default instance, which I don’t have on this host. How can I get to a named instance? Usually I connect as .\SQL2016, so where does that fit in PowerShell?

Read on for the answer.

Comments closed

Why Logins Are Failing

Kenneth Fisher looks at various error log messages to show how to fix different login failures:

Error Displayed
Login failed for user ‘Kenneth_Test’. (Microsoft SQL Server, Error: 18456)

How do I fix it
Simple enough. Change the password. However, there are a few warnings here. If you change the password and it’s being used by someone (for example this is an application id) you may be breaking the application. If this is production, that could be a no-no. Just in case I like to back up the password hash before changing it.

I’ve also gone back to the well several times over the years with this blog post showing what the various error states mean when you get error 18456.

Comments closed