Press "Enter" to skip to content

May 2, 2018

Toward Interpretable Machine Learning

Christoph Molnar shows off a couple of R packages which help interpret ML models:

Machine learning models repeatedly outperform interpretable, parametric models like the linear regression model. The gains in performance have a price: The models operate as black boxes which are not interpretable.

Fortunately, there are many methods that can make machine learning models interpretable. The R package iml provides tools for analysing any black box machine learning model:

  • Feature importance: Which were the most important features?
  • Feature effects: How does a feature influence the prediction? (Partial dependence plots and individual conditional expectation curves)
  • Explanations for single predictions: How did the feature values of a single data point affect its prediction? (LIME and Shapley value)
  • Surrogate trees: Can we approximate the underlying black box model with a short decision tree?

The iml package works for any classification and regression machine learning model: random forests, linear models, neural networks, xgboost, etc.

This is a must-read if you’re getting into model-building. H/T R-Bloggers


Using map And flatMap In Scala

Shubham Verma explains the map and flatMap functions in Scala:

Consider two sets, A = {-2, -1, 0, 1, 2} and B = {0.5, 1, 1.5, 2.5, 4, 4.5, 5, 5.5}, and a function f: A => B given by

y = x ^ 2 + 0.5

where x is an element of set A and y is the corresponding element of set B. The function f is applied to every element of set A, but the result may cover only a subset of set B: here the image is {0.5, 1.5, 4.5}.

From the text above, we can draw an analogy: sets A and B can be seen as collections in a programming language, and f as a function that takes an element of A and returns an element of B. The point to note is that, since Scala promotes immutability, whenever we apply map (or any other transformer) to a collection with elements of type A, it returns a new collection of the same kind with elements of type B. The snippet below should help illustrate this.

val result: List[B] = listOfA.map(f)   // where listOfA: List[A] and f: A => B

So when map is applied to a collection (here a List) of type A, passing f as its argument, it applies that function to every element of the List and returns a new collection (again a List) with elements of type B.
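
For a concrete, runnable picture (this sketch is mine, not from the original post), here is map alongside flatMap, which applies a collection-returning function to each element and then flattens the results:

  object MapFlatMapDemo extends App {
    val a: List[Int] = List(-2, -1, 0, 1, 2)

    // map: applies f to each element, producing exactly one output per input
    val f: Int => Double = x => x * x + 0.5
    println(a.map(f))       // List(4.5, 1.5, 0.5, 1.5, 4.5)

    // flatMap: applies a function that returns a collection, then flattens
    val g: Int => List[Int] = x => List(x, -x)
    println(a.flatMap(g))   // List(-2, 2, -1, 1, 0, 0, 1, -1, 2, -2)
  }

Had we used a.map(g), the result would have been a List[List[Int]]; flatMap is what collapses that extra layer of nesting.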

Read the whole thing.


Smarter Distribution Database Cleanup

Sourabh Agarwal writes about an improvement to the Distribution Cleanup job:

In a highly active environment, the cleanup job may need to delete a substantially large set of expired transactions and commands. The cleanup job uses a while loop to delete entries from MSRepl_Commands and MSRepl_Transactions in batches of 2000 and 5000 rows, respectively. Depending on how large the MSRepl_Commands and MSRepl_Transactions tables are, the cleanup activity can be cumbersome and time-consuming, leading to performance issues like blocking and sometimes replication agent failures. These hard-coded batch sizes work well when the replication tables are small, but they do not perform well once those tables grow large, say to 200-300 million rows.

The new implementation of the procedure allows users to parameterize the batch sizes in the cleanup stored procedure and introduces an adaptive approach to determine the batch size for each iteration. The procedure will, by default, honor any batch sizes passed explicitly as parameters. If no parameters are supplied, the procedure starts with the default batch sizes of 2000 and 5000 rows and increases or decreases the batch sizes based on the performance of the previous iteration of the delete operation. If the time taken by the delete query improves by 50% compared to the previous execution, the batch size is increased by 20%, up to a maximum of 50000 rows per batch. If the performance of the delete operation degrades by 20% compared to the previous iteration, the batch size is decreased by 50%, down to the default values of 2000 and 5000 rows per batch for MSRepl_Commands and MSRepl_Transactions, respectively.
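
The adaptive rule is simple enough to sketch. The following is an illustrative model of that feedback loop, written in Scala with hypothetical names (the real logic lives inside the updated T-SQL cleanup procedure):

  // Sketch of the adaptive batch-size rule described above; names are mine.
  def nextBatchSize(currentSize: Int, lastDurationMs: Long, durationMs: Long,
                    defaultSize: Int, maxSize: Int = 50000): Int =
    if (durationMs <= lastDurationMs * 0.5)
      // Delete ran 50% faster than last iteration: grow the batch 20%, capped at 50000
      math.min((currentSize * 1.2).toInt, maxSize)
    else if (durationMs >= lastDurationMs * 1.2)
      // Delete ran 20% slower than last iteration: halve the batch, floored at the default
      math.max(currentSize / 2, defaultSize)
    else
      currentSize  // roughly stable: keep the current batch size

Each delete iteration would feed in the previous and current durations, using a defaultSize of 2000 for MSRepl_Commands and 5000 for MSRepl_Transactions.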

It’s good to see Microsoft making incremental improvements to replication.


Unit Testing ETL Processes

Ed Elliott has a post showing how to perform unit testing on ETL processes:

In tSQLt, we can call tSQLt.FakeTable and then do an insert. If we don’t use tSQLt, what do we do? Well, we need to set up the data we want, whether by using a tool or by writing a load of insert statements. I have seen this done in various ways, such as:

  1. Writing manual insert scripts
  2. Using a tool to setup the data
  3. Making use of APIs in the application to set up the data we need
  4. Some weird and wonderful things that we shouldn’t recommend

Ultimately, for each test you write, you need to know what data it requires. There isn’t really any way around this, and the sooner you get yourself into a position where you can set up the data you need for a test, the better.
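
The principle carries across languages. As a toy illustration (mine, not Ed’s; his post works through the SQL side), here is the arrange-act-assert shape of an ETL unit test, sketched in Scala:

  // A toy ETL step: trim names and drop rows whose name is blank.
  case class RawRow(id: Int, name: String)
  case class CleanRow(id: Int, name: String)

  def transform(rows: List[RawRow]): List[CleanRow] =
    rows.map(r => CleanRow(r.id, r.name.trim)).filter(_.name.nonEmpty)

  // Arrange: set up exactly the data this test needs, nothing more.
  val input = List(RawRow(1, "  Ada "), RawRow(2, "   "))

  // Act: run the transformation under test.
  val output = transform(input)

  // Assert: the blank-name row is dropped and the other is trimmed.
  assert(output == List(CleanRow(1, "Ada")))

Knowing the exact input rows makes the expected output unambiguous, which is the point being driven at here.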

Read the whole thing.


Test Your Restore Process

Clive Strong tells a tale about a mental flub while restoring a backup:

Our automated restore process works really nicely. We take full backups on Saturday and differential backups through the week. We also take log backups through the day, but we were not going to be restoring those for this task. We have a number of internal platforms we restore to in full (or in part, following a cut-down process), which gives us good validation of our backup files on a regular basis. We also have regular test restores from tape, just for good measure.

However, a while ago I was asked to build a new server and restore the databases up to a specific date. We didn’t need a point-in-time restore, just a specific day, so I pulled the full and differentials and wrote the script to do the restore for me. The script restored the full backup and the differential backups for Sunday, Monday, Tuesday, Wednesday and Thursday. I gave it the once-over and executed the script. A while later, I came back and it was, unexpectedly, still running. I eventually left the office and noted that it finished in the early hours, having run for many hours longer than I had anticipated.

Read on for Clive’s more detailed explanation of the whoopsie moment.


Rotating Encryption Keys When Using Always Encrypted

Ed Leighton-Dick explains how to rotate your encryption keys when using Always Encrypted in SQL Server:

In my last post, I talked about the process of rotating your encryption keys. It’s just one of those routine maintenance tasks that need to be done from time to time in order to keep your encryption strong for the long haul. One type of rotation I didn’t address in that post was rotation for Always Encrypted, SQL Server’s newest form of encryption.

If you recall, Always Encrypted has two associated keys: a Column Master Key and a Column Encryption Key. The Column Encryption Key (CEK) is a symmetric key, stored in SQL Server. Like other symmetric keys, the CEK is not changed during a rotation. The Column Master Key (CMK), on the other hand, is a certificate, similar to the certificates we’ve used for transparent data encryption and for in-column encryption, and it therefore needs to be rotated regularly. The biggest difference is that the CMK is stored outside of SQL Server, in the Windows certificate store by default, so DBAs may need assistance from their Windows administrators or security administrators.

Read on to see what you need to do.


Architecting A Power BI Environment

Reza Rad explains different architectural patterns for a Power BI implementation:

Implementing a Power BI solution is not just about developing reports, creating a data model, or using visuals. Power BI, like any other technology, can be used in a correct or an incorrect way. Any technology is more effective when it harnesses the right architecture, and the right architecture comes from gathering requirements and then designing the aspects and components of the technology to fit them. In this post, you will learn about some of the most common architectures for using Power BI, under these architecture guidelines:

  • Sharing architecture
  • Self-service architecture
  • Enterprise architecture

Read on to learn more about these three patterns.
