Press "Enter" to skip to content

Month: May 2018

Azure SQL Data Warehouse Generation 2

James Serra announces changes to Azure SQL Data Warehouse:

The changes in Azure SQL DW Compute Optimized Gen2 tier are:

  • 5x query performance via an adaptive caching technology, which takes a blended approach of using remote storage in combination with a fast SSD cache layer (using NVMe) that places data next to compute based on user access patterns and frequency

  • Significant improvement in serving concurrent queries (32 to 128 queries/cluster)

  • Removes limits on columnar data volume to enable unlimited columnar data volume

  • 5 times higher computing power compared to the current generation by leveraging the latest hardware innovations that Azure offers via additional Service Level Objectives (DW7500c, DW10000c, DW15000c and DW30000c)

  • Added Transparent Data Encryption with customer-managed keys

Those are some good improvements.  The second item in particular, the higher concurrency limit, makes it possible for Azure SQL DW to be useful in a much larger number of environments.

Inside SQL Server 6.5

Brent Ozar reviews a blast from the past:

I picked up half a dozen used books about SQL Server 6.5, then spent a delightful weekend reading them. Seriously delightful – lemme tell you just how into it I was. Erika and I eat all weekend meals out at restaurants, but she saw me so happily curled up in my chair reading that she insisted on going out and getting tacos for us just so I wouldn’t have to get up. I was having that good of a time READING BOOKS ABOUT SQL SERVER 6.5. (Also, Erika is amazing. Moving on.)

To bring you that same fun, I wanna share with you a few pages from Inside SQL Server 6.5 by Ron Soukup, one of the fathers of SQL Server.

It’s a great read.  My contribution to the Old But Good oeuvre is the Handbook of Relational Database Design by Candace Fleming and Barbara von Halle.  For my money, it has what I still consider the best primer on database normalization out there.  It also has a bunch of stuff that we should be glad we don’t do anymore, like figuring out specific file layouts for non-clustered indexes to minimize the number of disk rotations needed to retrieve a record of data.

Toward Interpretable Machine Learning

Christoph Molnar shows off a couple of R packages which help interpret ML models:

Machine learning models repeatedly outperform interpretable, parametric models like the linear regression model. The gains in performance have a price: The models operate as black boxes which are not interpretable.

Fortunately, there are many methods that can make machine learning models interpretable. The R package iml provides tools for analysing any black box machine learning model:

  • Feature importance: Which were the most important features?
  • Feature effects: How does a feature influence the prediction? (Partial dependence plots and individual conditional expectation curves)
  • Explanations for single predictions: How did the feature values of a single data point affect its prediction? (LIME and Shapley value)
  • Surrogate trees: Can we approximate the underlying black box model with a short decision tree?

The iml package works for any classification and regression machine learning model: random forests, linear models, neural networks, xgboost, etc.
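
To give a flavor of the API, here is a minimal sketch along the lines of the linked post, using a random forest on the Boston housing data; check the post for the exact, current function names:

    library(iml)
    library(randomForest)

    # Fit any black box model -- here, a random forest predicting median home value
    data("Boston", package = "MASS")
    rf <- randomForest(medv ~ ., data = Boston, ntree = 50)

    # Wrap the model and data in a Predictor object that iml's tools understand
    X <- Boston[, names(Boston) != "medv"]
    predictor <- Predictor$new(rf, data = X, y = Boston$medv)

    # Feature importance, Shapley values for one observation, and a surrogate tree
    imp <- FeatureImp$new(predictor, loss = "mae")
    shap <- Shapley$new(predictor, x.interest = X[1, ])
    surrogate <- TreeSurrogate$new(predictor, maxdepth = 2)

    plot(imp)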

This is a must-read if you’re getting into model-building. H/T R-Bloggers

Using map And flatMap In Scala

Shubham Verma explains the map and flatMap functions in Scala:

Consider two sets, A = {-2, -1, 0, 1, 2} and B = {0.5, 1, 1.5, 2.5, 4, 4.5, 5, 5.5}, and a function f: A => B

y = x ^ 2 + 0.5, where x is an element from set A and y corresponds to an element from set B. We see that the function f is applied to every element of set A, but the result could also be just a subset of set B.

From the above, we can draw the analogy that sets A and B can be seen as any collection in a programming paradigm. Now, what is “f”? “f” can be seen as a function that takes an element from A and returns an element that exists in B. The point to note here is that, since Scala promotes immutability, whenever we apply map (or any other transformer) to a collection with elements of type A, it returns a new collection of the same kind with elements of type B. It would be helpful to understand this from the snippet below.

val result: List[B] = listOfA.map(f)   // where listOfA: List[A] and f: A => B

So when a map operation is applied to a collection (here a List) of type A, passing f as its argument, it applies that function to every element of the List of type A and returns a new collection (again a List) of type B.
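
To make this concrete, here is a small Scala sketch (the values are illustrative) showing map alongside flatMap, where the function returns a collection for each element and the results are flattened into a single list:

    val a: List[Int] = List(-2, -1, 0, 1, 2)

    // map: applies the function to every element and returns a new List of the result type
    val b: List[Double] = a.map(x => x * x + 0.5)
    // List(4.5, 1.5, 0.5, 1.5, 4.5)

    // flatMap: the function returns a collection per element, and the results are flattened;
    // returning Nil drops an element, so the output can be smaller than the input
    val c: List[Double] = a.flatMap(x => if (x >= 0) List(x * x + 0.5) else Nil)
    // List(0.5, 1.5, 4.5)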

Read the whole thing.

Smarter Distribution Database Cleanup

Sourabh Agarwal writes about an improvement to the Distribution Cleanup job:

In the case of a highly active environment, the cleanup job would need to delete a substantially large set of expired transactions and commands. The cleanup job uses a while loop to delete entries from the MSRepl_Commands and MSRepl_Transactions tables in batches of 2000 and 5000 rows, respectively. Depending on how large the MSRepl_Commands and MSRepl_Transactions tables are, the cleanup activity can be cumbersome and time-consuming, leading to several performance issues like lock blocking and sometimes replication agent failures. These hard-coded batch sizes work well when the size of the replication tables is small, but they do not perform well when these replication tables become large, say 200-300 million rows.

The new implementation of the procedure allows users to parameterize the batch sizes inside the cleanup stored procedure and introduces a new adaptive approach to determining the batch sizes for each iteration. The procedure will, by default, honor any batch sizes that are mentioned explicitly as parameters. If no parameters are supplied, then the procedure will start with the default batch sizes of 2000 and 5000 rows and increase or decrease the batch sizes based on the performance of the previous iteration of the delete operation. If the time taken by the delete query improves by 50% compared to the previous execution, the batch size is increased by 20%, up to a maximum of 50000 rows per batch. If the performance of the delete operation decreases by 20% compared to the previous iteration, the batch size is decreased by 50%, down to the default values of 2000 and 5000 rows per batch for MSRepl_Commands and MSRepl_Transactions, respectively.
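
To illustrate the adaptive piece, here is a rough T-SQL sketch of the batch-size adjustment for the commands table only. This is not the actual Microsoft procedure, and the watermark variable is a stand-in for the real expired-transaction cut-off logic:

    -- Assumed to be populated elsewhere with the watermark below which commands have expired
    DECLARE @max_cleanup_xact_seqno varbinary(16);
    DECLARE @batch int = 2000, @prev_ms int, @cur_ms int, @start datetime2;

    WHILE EXISTS (SELECT 1 FROM dbo.MSrepl_commands
                  WHERE xact_seqno <= @max_cleanup_xact_seqno)
    BEGIN
        SET @start = SYSDATETIME();

        DELETE TOP (@batch)
        FROM dbo.MSrepl_commands
        WHERE xact_seqno <= @max_cleanup_xact_seqno;

        SET @cur_ms = DATEDIFF(MILLISECOND, @start, SYSDATETIME());

        IF @prev_ms IS NOT NULL
        BEGIN
            IF @cur_ms <= @prev_ms / 2            -- 50% faster than last time: grow the batch by 20%
                SET @batch = IIF(@batch * 1.2 > 50000, 50000, @batch * 1.2);
            ELSE IF @cur_ms >= @prev_ms * 1.2     -- 20% slower than last time: halve the batch
                SET @batch = IIF(@batch / 2 < 2000, 2000, @batch / 2);
        END;

        SET @prev_ms = @cur_ms;
    END;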

It’s good to see Microsoft making incremental improvements to replication.

Unit Testing ETL Processes

Ed Elliott has a post showing how to perform unit testing on ETL processes:

In tSQLt, we can call tSQLt.FakeTable and then do an insert. If we don’t use tSQLt, what do we do? Well, we need to set up the data we want; this could be by using a tool or by writing a load of insert statements. I have seen this done in various ways, such as:

  1. Writing manual insert scripts
  2. Using a tool to set up the data
  3. Making use of APIs in the application to set up the data we need
  4. Some weird and wonderful things that we shouldn’t recommend

Ultimately, for each test that you do, you need to know what data you need for it. There isn’t really any way around this, and the sooner you get yourself into the position where you can set up the data you need for a test, the better.
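
For reference, the tSQLt pattern Ed mentions looks roughly like this; the staging table, target table, and ETL procedure names are hypothetical:

    EXEC tSQLt.NewTestClass 'ETLTests';
    GO

    CREATE PROCEDURE ETLTests.[test load copies staged customers to target]
    AS
    BEGIN
        -- Replace the real tables with empty fakes so the test controls all of the data
        EXEC tSQLt.FakeTable 'dbo.StagingCustomer';
        EXEC tSQLt.FakeTable 'dbo.Customer';

        -- Arrange: insert only the rows this test needs
        INSERT INTO dbo.StagingCustomer (CustomerID, CustomerName)
        VALUES (1, N'Test Customer');

        -- Act: run the ETL step under test (hypothetical procedure)
        EXEC dbo.LoadCustomerFromStaging;

        -- Assert: exactly one row should have landed in the target table
        DECLARE @actual int = (SELECT COUNT(*) FROM dbo.Customer);
        EXEC tSQLt.AssertEquals @Expected = 1, @Actual = @actual;
    END;
    GO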

Read the whole thing.

Test Your Restore Process

Clive Strong tells a tale about a mental flub while restoring a backup:

Our automated restore process works really nicely. We take full backups on Saturday and differential backups through the week. We also take log backups through the day, but we were not going to be restoring those for this task. We have a number of internal platforms we restore to in full (or in part, following a cut-down process), which gives us good validation of our backup files on a regular basis. We also have regular test restores from tape, just for good measure.

However, a while ago I was asked to build a new server and restore the databases up to a specific date. We didn’t need a point-in-time restore, just a restore to a specific day, so I pulled the full and differentials and wrote the script to do the restore for me. The script restored the full backup and the differential backups for Sunday, Monday, Tuesday, Wednesday, and Thursday. I gave it the once-over and executed the script. A while later I came back and, unexpectedly, it was still running. I eventually left the office and noted that it finished in the early hours, having run for many hours longer than I had anticipated.
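
For context, here is a minimal sketch of restoring to a specific day (the database name and file paths are hypothetical). Because differential backups are cumulative, only the full backup and the most recent differential are needed:

    -- Restore the weekly full, leaving the database ready to accept further restores
    RESTORE DATABASE SalesDB
        FROM DISK = N'D:\Backups\SalesDB_Full_Saturday.bak'
        WITH NORECOVERY;

    -- Differentials are cumulative, so only the latest one needs to be applied
    RESTORE DATABASE SalesDB
        FROM DISK = N'D:\Backups\SalesDB_Diff_Thursday.bak'
        WITH RECOVERY;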

Read on for Clive’s more detailed explanation of the whoopsie moment.

Rotating Encryption Keys When Using Always Encrypted

Ed Leighton-Dick explains how to rotate your encryption keys when using Always Encrypted in SQL Server:

In my last post, I talked about the process of rotating your encryption keys. It’s just one of those routine maintenance tasks that need to be done from time to time in order to keep your encryption strong for the long haul. One type of rotation I didn’t address in that post was rotation for Always Encrypted, SQL Server’s newest form of encryption.

If you recall, Always Encrypted has two associated keys: a Column Master Key and a Column Encryption Key. The Column Encryption Key (CEK) is a symmetric key, stored in SQL Server. Like other symmetric keys, the CEK is not changed during a rotation. The Column Master Key (CMK), on the other hand, is a certificate, similar to the certificates we’ve used for transparent data encryption and for in-column encryption, and it therefore needs to be rotated regularly. The biggest difference is that the CMK is stored outside of SQL Server, in the Windows certificate store by default, so DBAs may need assistance from their Windows administrators or security administrators.

Read on to see what you need to do.

Architecting A Power BI Environment

Reza Rad explains different architectural patterns for a Power BI implementation:

Implementing a Power BI solution is not just about developing reports, creating a data model, or using visuals. Power BI, like any other technology, can be used in a correct or an incorrect way. Any technology can be used more effectively when it is paired with the right architecture, and the right architecture comes from gathering requirements and then designing the aspects and components of the technology to fit those requirements. In this post, you will learn about some of the most common architectures for using Power BI. You will learn about using Power BI under different architecture guidelines:

  • Sharing architecture

  • Self-service architecture

  • Enterprise architecture

Read on to learn more about these three patterns.

Operator Precedence In Complex Check Constraints

Louis Davidson walks us through the case of the disappearing parentheses:

You will see the code is not the same as when I coded it:

Looking at this, one thing stands out to the eye: the parentheses have been removed from the comparisons but added to the literal value of 2. (Also, everything is surrounded by square brackets, but as awful as that looks, it is a common thing to see in code that is generated or parsed, such as this or a query plan.) When the person who wrote the similar code saw this, they were concerned it was wrong; then the next person who did a code review was also concerned it was wrong; and when I saw it, I was sure it was right, but only because I had a bit more faith that if this were a problem with SQL Server’s toolset, I would probably have read about it!
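
For a sense of why the scripted version raises eyebrows, consider a hypothetical constraint (not the one from Louis's post). In T-SQL, AND has higher precedence than OR, so parentheses around an OR nested inside an AND are not redundant; the ones SQL Server strips when it scripts a definition are only those that do not change the logic:

    -- With parentheses: the status must be Active or Pending, AND the quantity at least 2
    ALTER TABLE dbo.Widget ADD CONSTRAINT CHK_Widget_StatusQty
        CHECK ((Status = 'Active' OR Status = 'Pending') AND Quantity >= 2);

    -- Without them, AND binds first, so the same text would mean
    -- Status = 'Active' OR (Status = 'Pending' AND Quantity >= 2), which is a different rule:
    -- CHECK (Status = 'Active' OR Status = 'Pending' AND Quantity >= 2)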

Read on as Louis explains the logic.  My preference would be to retain the parentheses to make it easier for humans to follow.
