Visualizing Model Input Effects

Ilknur Kaynar Kabul shows us how to use partial dependence plots and individual conditional expectation plots to view the specific effect of an input variable on a model:

A partial dependence (PD) plot depicts the functional relationship between a small number of input variables and predictions. They show how the predictions partially depend on values of the input variables of interest.  For example, a PD plot can show whether the probability of flu increases linearly with fever. It can show whether high energy level will decrease the probability of having flu. PD can also show the type of relationship, such as a step function, curvilinear, linear and so on.

The simplest PD plots are 1-way plots, which show how a model’s predictions depend on a single input. The plot below shows the relationship (according the model that we trained) between price (target) and number of bathrooms. Here, we see that house prices increase as we increase the number of bathroom up to 4. After that it does not change the house price.

These types of plots are helpful for understanding the mechanics behind a model.

Methods For Detecting Anomalies In Business Metrics

Kevin Feasel



Sergey Bryl’ gives us four methods for detecting anomalies in business data:

In this article, by  business metrics, we mean numerical indicators we regularly measure and use to track and assess the performance of a specific business process. There is a huge variety of business metrics in the industry: from conventional to unique ones. The latter are specifically developed for and used in one company or even just by one of its teams. I want to note that usually, a business metrics have dimensions, which imply the possibility of drilling down the structure of the metric. For instance, the number of sessions on the website can have dimensions: types of browsers, channels, countries, advertising campaigns, etc. where the sessions took place. The presence of a large number of dimensions per metric, on the one hand, provides a comprehensive detailed analysis, and, on the other, makes its conduct more complex.

Anomalies are abnormal values of business indicators. We cannot claim anomalies are something bad or good for business. Rather, we should see them as a signal that there have been some events that significantly influenced a business process and our goal is to determine the causes and potential consequences of such events and react immediately. Of course, from the business point of view, it is better to find such events than ignore them.

It was interesting comparing the results of the four methods.  H/T R-bloggers

Understanding Hash Match Aggregates

Itzik Ben-Gan continues his series on grouping and aggregating data by looking at the hash match aggregation process:

The estimated CPU cost for the Hash Aggregate in the plan for Query 8 is 0.166344, and in Query 9 is 0.16903.

It could be an interesting exercise to try and figure out exactly in what way the cardinality of the grouping set, the data types, and aggregate function used affect the cost; I just didn’t pursue this aspect of the costing. So, after making a choice of the grouping set and aggregate function for your query, you can reverse engineer the costing formula. For example, let’s reverse engineer the CPU costing formula for the Hash Aggregate operator when grouping by a single integer column and returning the MAX(orderdate) aggregate. The formula should be:

Operator CPU cost = <startup cost> + @numrows * <cost per row> + @numgroups * <cost per group>

Using the techniques that I demonstrated in the previous articles in the series, I got the following reverse engineered formula:

Operator CPU cost = 0.017749 + @numrows * 0.00000667857 + @numgroups * 0.0000177087

Definitely worth reading in detail.

What’s Coming With Always Encrypted?

Monica Rathbun explains a new feature coming to SQL Server:

As I discussed in part 3 there are many roads blocks the can stop the implementation of Always Encrypted (AE). In the current available versions of SQL Server 2016 and 2017, along with Azure SQL Database, the cost of using AE was way too high for many companies. There are so many code changes needed to implement AE that moving to it is not cost effective for them. Microsoft recognizes this and has found a better way to handle things like aggregations, range comparison, LIKE predicates, ORDER BYs, and other search criteria with the introduction of Secure Enclaves.  For the client discussed in part 1-3 this will make all the difference.

Per MSDN “An enclave is a protected region of memory that acts as a trusted execution environment. An enclave appears as a black box to the containing process and to other processes running on the machine. There is no way to view the data or the code inside the enclave from the outside, even with a debugger.”

If that’s a bit confusing, check out Monica’s explanation as well.

Using DATEPART In SQL Server

Randolph West shares some thoughts on the DATEPART function:

As we learned some time ago, an INT (integer) in T-SQL uses four bytes, and has a maximum value greater than zero of over 2 billion (there are more than 4 billion values in an integer if we take the negative values into account).

Why then are date and time parts expressed as an INT, which have a lot of overhead for values like 24, 31, and 60?

There are two reasons:

  1. Integers make things simpler. When writing arithmetic in T-SQL and other programming languages, the default data type is usually an integer. Having to memorise which data types are returned from built-in functions becomes unnecessary when we know it will be an INT. Yes, it uses extra memory, but four bytes is a reasonable trade-off against complexity.

  2. One of the return values is nanoseconds. Although DATETIME2(7) only has a granularity down to 100 nanoseconds, DATEPART allows us to return this value, which requires a data type large enough to contain an integer up to 1 billion (nanoseconds can range from 0 to 999,999,900 in increments of 100).

Randolph also explains what happened a few years back to cause iPhone alarms not to fire on January 1st and 2nd.

Detecting Data Breaches

K. Brian Kelley shares some thoughts on methods to detect data breaches:

We’re going to make some assumptions. First, we’ll assume no one is above suspicion. After all, a trusted employee who gets hit and falls to a phishing attack is still trustworthy, but his or her user account isn’t. Even the best can fall to such an attack. Therefore, we’ll assume every account is capable of being used for a data breach.

Second, we’ll accept as a given that there are insecure protocols and insecure procedures/behavior. From a technology side we can do some things about those areas, but there are still too many situations we can’t deal with by using technology alone. For instance, I know of a case where a client had an employee find several hundred printed pages of sensitive information just left sitting in a publicly accessible smoking area. Using Extended Events in SQL Server can’t fix that type of lapse in security. However, rather than throwing our hands up because we can’t prevent this type of situation by some configuration within SQL Server, we’ll endeavor to do what we can with what SQL Server can do for us.

As a corollary, we’ll assume that discovering data breaches is hard. If all we have is what’s in SQL Server, the odds are stacked against us. Even with sophisticated tools, detecting a breach is still a difficult endeavor. However, we’ll do our best to set up methods of detection using what we can within SQL Server.

Finally, we’ll acknowledge there’s a lot of data we can collect. That data is useless if we can’t sift through it, so we’ll take an approach that will reduce the amount of data we do collect and try to examine. Let’s look at what that approach consists of.

Worth reading in its entirety.

Executing Powershell From SSIS

Jeanne Combrinck shows us how to execute Powershell code from within SQL Server Integration Services:

It can be confusing to know which tool in the SSIS toolbox to use when trying to execute a PowerShell script from within SSIS.

The best task to use is an Execute Process Task.

Read on for an example.

Is That Share SMB1?

Andy Mallon shows us how to tell if a particular Windows share is using SMB1:

In case you’ve missed it (though, if you’re a data professional or other IT pro, you must live under a rock if you’ve missed it), SMBv1 should never ever be used. Alas, some shares out there still exist & use it.

Recent Windows updates have tried to turn off SMBv1 , alas Microsoft’s Ned Pyle (twitter|blog) still maintains a long list of products that still need SMBv1.

As a DBA, you probably don’t spend too much time managing file shares, or worrying about SMB versions, but you should be asking yourself:

Does my backup target use SMBv1?

If it does, you should fix that.


June 2018
« May Jul »