Press "Enter" to skip to content

Day: December 21, 2017

The Importance Of Model Interpretability

Ilknur Kaynar Kabul explains why it’s important that your data science models be interpretable:

Some machine learning models are simple and easy to understand. We know how changing the inputs will affect the predicted outcome and can make justification for each prediction. However, with the recent advances in machine learning and artificial intelligence, models have become very complex, including complex deep neural networks and ensembles of different models. We refer to these complex models as black box models.

Unfortunately, the complexity that gives extraordinary predictive abilities to black box models also makes them very difficult to understand and trust. The algorithms inside the black box models do not expose their secrets. They don’t, in general, provide a clear explanation of why they made a certain prediction. They just give us a probability, and they are opaque and hard to interpret. Sometimes there are thousands (even millions) of model parameters, there’s no one-to-one relationship between input features and parameters, and often combinations of multiple models using many parameters affect the prediction. Some of them are also data hungry. They need enormous amounts of data to achieve high accuracy. It’s hard to figure out what they learned from those data sets and which of those data points have more influence on the outcome than the others.

This post reminds me of a story I’d heard about a financial organization using neural networks to build accurate models, but then needing to decompose the models into complex decision trees to explain to auditors that they weren’t violating any laws in the process.

Comments closed

Data Skip Techniques In Impala

Mostafa Mokhtar, et al, explain a few methods for skipping unneeded data in Impala queries:

Each Apache Parquet file contains a footer where metadata can be stored including information like the minimum and maximum value for each column. Starting in v2.9, Impala populates the min_value and max_value fields for each column when writing Parquet files for all data types and leverages data skipping when those files are read. This approach significantly speeds up selective queries by further eliminating data beyond what static partitioning alone can do. For files written by Hive / Spark, Impala only reads the deprecated min and max fields.

The effectiveness of the Parquet min_value/max_value column statistics for data skipping can be increased by ordering (or clustering1) data when it is written by reducing the range of values that fall between the minimum and maximum value for any given file. It was for this reason that Impala 2.9 added the SORT BY clause to table DDL which directs Impala to sort data locally during an INSERT before writing the data to files.

Even if your answer is “throw more hardware at it,” there eventually comes a point where you run out of hardware (or budget).

Comments closed

Animated Dot Plots In R

John MacKintosh shows how to create dot plots in ggplot2, and then he uses gganimate to turn it into an animated plot:

If you want to follow along (go on) then you should head over to Neil’s site, download the excel file and take a look at the “how to” guide on the same page. Existing R users are already likely to be shuddering at all the manual manipulation required.

For the first attempt, I followed Neil’s approach pretty closely, resulting in a lot of code to sort and group, although ggplot2 made the actual plotting much simpler. I shared my very first attempt ( produce with barely any ggplot2 code) which was quite good, but there were a few issues – the ins/ outs being coloured blue instead of grey, and overplotting of several points.

Click through for code and explanation.  H/T R-Bloggers

Comments closed

Power BI Dashboard Sharing

Reza Rad covers one method of sharing Power BI content with users:

What dashboard sharing as the name of it explains is based on a dashboard. You can only share a dashboard with this method, not a report. Consider that you have a dashboard like below screenshot, and you want to share it. There is a share link at the top right corner of the dashboard.

Dashboard sharing have very few options to set and is very simple to configure. You just need to add the email address of people whom you want to share this report. You can also write a message for them to know that this report is shared with them.

Click through for more information.  Note that this is a paid feature.

Comments closed

Splitting Large Files Out In SQL Server

Tracy Boggiano has a script which splits out large files in a filegroup into a smaller set of files:

The solution I offer allows you to break your files into any size you want by rebuilding your indexes. You will need some disk space for it to create the new files while it runs the process then it will drop the large file.  This will also take up some space in your transaction log so if you not running your transaction log backups frequently enough you could have a lot of disk space taken up by that so watch out for that. All the code can be downloaded from my github repository here.

Read on for an explanation of the entire process.

Comments closed

Using DATEADD Instead Of DATEDIFF

Michael J. Swart points out a bit of trickery with DATEDIFF:

I assumed that the DATEDIFF function I wrote worked this way: Subtract the two dates to get a timespan value and then return the number of seconds (rounded somehow) in that timespan.

But that’s not how it works. The docs for DATEDIFF say:

“Returns the count (signed integer) of the specified datepart boundaries crossed between the specified startdate and enddate.”

There’s no rounding involved. It just counts the ticks on the clock that are heard during a given timespan.

Read the whole thing.

Comments closed

Logging Perfmon Data

Raul Gonzalez has started a new series around getting perfmon data into SQL Server.  First up is logging perfmon counters:

Here you have the template I have used to create my Data Collector, you just need to write it down to a XML file and change some of the counters which are related to SQL Server.

When I say MSSQL$MSSQL2016, that is because this counter refer to a named instance called MSSQL2016. If that was the default instance, it’d be just “SQL Server”.

Example: <Counter>\SQL Server:Buffer Manager\Page life expectancy</Counter>

Once you adjust it, you’re good to go.

Click through for a sample data collector set and some instructions on logging counter values.

Comments closed