Day: June 20, 2018

I’ve heard about a number of consulting companies, that decided to use simple linear model instead of a black box model with higher performance, because ,,client wants to understand factors that drive the prediction’’.
And usually the discussion goes as following: ,,We have tried LIME for our black-box model, it is great, but it is not working in our case’’, ,,Have you tried other explainers?’’, ,,What other explainers’’?

So here you have a map of different visual explanations for black-box models.

Check out DALEX, which includes a Jupyter notebook example. H/T R-Bloggers

Comments closed

Comparing Keras In Python Versus R

Published 2018-06-20 by Kevin Feasel

Dmitry Kisler performs image classification using Keras in both Python and R:

From the plots above, one can see that:

the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only train accuracy, but the model doesn’t have high variance and the bias accuracy is around 99% as well).
even though 10 measurements may be not convincing, but Python would reduce (by up to 15%) the time required to train your CNN model. This is somewhat expected because R uses Python under the hood when executes Keras functions.

This is just one example, but the results are about what I’d expect.

Comments closed

Auto-Encoders And KernelML

Published 2018-06-20 by Kevin Feasel

Rohan Kotwani gives us an example where KernelML might be better than TensorFlow or PyTorch:

So what’s the point of using KernelML?

1. The parameters in each layer can be non-linear
2. Each parameter can be sampled from a different random distribution
3. The parameters can be transformed to meet certain constraints
4. Network combinations are defined in terms of numpy operations
5. Parameters are probabilistically updated
6. Each parameter update samples the loss function around a local or global minima

KerneML Specs

KernelMLis brute force optimizer that can be used to train machine learning algorithms. The package uses a combination of a machine learning and monte carlo simulations to optimize a parameter vector with a user defined loss function. Using kernelml creates a high computational cost for large complex networks because it samples the loss function using a subspace for each parameter in the parameter vector which requires many random simulations. The computational cost was reduced by enabling parallel computations with the ipyparallel. The decision to use this package was made because it effectively utilizes the cores on a machine.

It’s an interesting use case, though I would have liked to have seen a direct comparison to other frameworks.

Comments closed

Probabilistic Debugging

Published 2018-06-20 by Kevin Feasel

Adrian Colyer summarizes a fascinating academic paper:

This program has a bug. When given an already encoded input, it encodes it again (replacing % with ‘%25’). For example, the input https://example.com/t%20c?x=1 results in the output https://example.com/t%2520c?x=1, whereas in fact the output should be the same as the input in this case.

Let’s put our probabilistic thinking caps on and try and to debug the program. We ‘know’ that the url printed on line 19 is wrong, so we can assign low probability (0.05) to this value being correct. Likewise we ‘know’ that the input url on line 1 is correct, so we can assign high probability (0.95). (In probabilistic inference, it is standard not to use 0.0 or 1.0, but values close to them instead). Initially we’ll set the probability of every other program variable being set to 0.5, since we don’t know any better yet. If we can find a point in the program where the inputs are correct with relatively high probability, and the outputs are incorrect with relatively high probability, then that’s an interesting place!

Since url on line 19 has a low probability of being correct, this suggests that url on line 18, and purl_str at line 12 are also likely to be faulty. PI Debugger actually assigns these probabilities of being correct 0.0441 and 0.0832 respectively. Line 18 is a simple assignment statement, so if the chances of a bug here are fairly low. Now we trace the data flow. If purl_str at line 12 is likely to be faulty then s at line 16 is also likely to be faulty (probability 0.1176).

I’m interested to see someone create a practical implementation someday.

Comments closed

Flattening JSON In Power BI

Published 2018-06-20 by Kevin Feasel

Imke Feldmann shows how to flatten JSON data imported into Power BI:

If you work with JSON documents that are new to you, it can be very helpful to fully expand JSON to see at a glance what’s in there. The following function has you covered for this task. It returns a table with all values in column “Value” and additional columns describing where that value came from in a hierarchical form, no matter how deep the nesting goes:

Click through for a script which shows that the process is a lot more complicated than I had expected.

Comments closed

Understanding NESTING_TRANSACTION_FULL

Published 2018-06-20 by Kevin Feasel

Joe Obbish digs into the NESTING_TRANSACTION_FULL latch, explains what it does, and shows how it might be a performance bottleneck:

We know that only one worker can get the exclusive latch for the transaction at a time. Let’s use a greatly simplified model for what each parallel worker does for this query. It reads a row, does processing for a row, and goes on to the next one. Once it has enough rows to write out a log record it tries to acquire the latch. If no one else has the latch in exclusive mode it can get the latch, update some structure in the parent transaction, release the latch, and continue reading rows. If another worker has the latch in exclusive mode then it adds itself to the FIFO wait queue for the latch subresource and suspends itself. When the resource is available the worker status changes from SUSPENDED to RUNNABLE. When it changes again from RUNNABLE to RUNNINGit acquires the latch, updates some structure in the parent transaction, releases the latch, and continues working until it either needs to suspend again or hits the end of its 4 ms quantum. When it hits the end of its 4 ms quantum it will immediately select itself to run again because there are no other runnable workers on the scheduler.

So what determines the level of contention? One important factor is the number of workers that are contending over the same subresource. For this latch and type of query (rows are pretty evenly distributed between worker threads), this is simply MAXDOP. There’s a tipping point for this query where adding more workers is simply counterproductive.

For years I’ve seen people in the community state that running queries at MAXDOPthat’s too high can be harmful. I’ve always been after simple demos that show why that can happen. The NESTING_TRANSACTION_FULL latch is an excellent example of why some queries run longer if MAXDOP is increased too far. There’s simply too much contention over a shared resource.

Read the whole thing.

Comments closed

Join Elimination

Published 2018-06-20 by Kevin Feasel

Bert Wagner shows off the concept of join elimination in SQL Server:

SQL Server avoids joining to the Sales.Invoices table because it trusts the referential integrity maintained by the foreign key constraint defined on InvoiceID between Sales.InvoiceLines and Sales.Invoices; if a row exists in Sales.InvoiceLines, a row with the matching value for InvoiceID must exist in Sales.Invoices. And since we are only returning data from the Sales.InvoiceLines table, SQL Server doesn’t need to read any pages from Sales.Invoices at all.

We can verify that SQL Server is using the foreign key constraint to eliminate the join by dropping the constraint and running our query again:

ALTER TABLE [Sales].[InvoiceLines]  
DROP CONSTRAINT [FK_Sales_InvoiceLines_InvoiceID_Sales_Invoices];

Erik Darling shows that the optimizer isn’t perfect at this:

Rob Farley has my favorite material on it. There’s an incredible amount of ~~laziness~~ ingenuity built into the optimizer to keep your servers from doing unnecessary work.

That’s why I’d expect a query like this to throw away the join:

1

2

3

4

SELECT COUNT(u.Id)

FROM dbo.Users AS u

JOIN dbo.Users AS u2

ON u.Id = u2.Id;

After all, we’re joining the Users table to itself on the PK/CX. This doesn’t stand a chance at eliminating rows, producing duplicate rows, or producing NULL values. We’re only getting a count of the PK/CX, which isn’t NULLable anyway and…

So don’t do that.

Comments closed

Auto-Indentation In Power BI’s DAX Formula Bar

Published 2018-06-20 by Kevin Feasel

Chris Webb shows an easy-to-miss feature in Power BI:

The other day I discovered something new (at least to me) while writing the DAX for a measure in Power BI Desktop: when you insert a new line in your DAX expression using SHIFT-ENTER it also auto-indents the code. I asked a few people if this was new because I was sure I hadn’t seen it before, even though I always put line breaks in my code; of course Marco had and said he thought it had been around for a while. Anyway, Marco then commented that most people didn’t know you could even put line breaks in DAX and I thought to myself I should probably write a short blog post about all this, because of course line breaks and indentation make your code much more readable.

Click through for a demo as well as a couple of tips around this.

Comments closed

Multiple SYSDATETIME In The Same SELECT May Give Unexpected Results

Published 2018-06-20 by Kevin Feasel

Louis Davidson walks through a scenario he experienced:

The data is exactly as expected, even though the other two calls would have returned .902 and .903 if simply rounded off. On the other hand, looking for differences between the time1_3 and time2_3 columns:

SELECT *

FROM #hold

WHERE DATEDIFF(MILLISECOND,time1_3,time2_3) <> –1;

Returns 133 rows. With the sysdatetime values being exactly the same:

sysdatetime1                sysdatetime2

————————— —————————

2018–06–18 20:53:21.0443399 2018–06–18 20:53:21.0443399

2018–06–18 20:53:21.0493653 2018–06–18 20:53:21.0493653

But the other columns, are incorrect for our needs, as the values are the same:

time1_3                     time2_3

————————— —————————

2018–06–18 20:53:21.045     2018–06–18 20:53:21.045

2018–06–18 20:53:21.050     2018–06–18 20:53:21.050

This was the bug we experienced! Instead of being 1 millisecond different, the values were equal.

Louis’s moral to the story is to test your assumptions. The more prosaic moral is that calls to get the current time take a non-zero amount of time.

Comments closed

Checking A Drive’s Allocation Unit Size

Published 2018-06-20 by Kevin Feasel

Ryan Adams shows how to find the allocation unit size for a disk volume:

To identify the allocation unit size for a volume, we can use the fsutil.exe utility. In the output you are looking for “Bytes Per Cluster” which is your allocation unit size. Here is an example to retrieve the information for the G:\ volume.

fsutil fsinfo ntfsInfo G:

Ryan also shows how to change the allocation size, should you need to do so.

Comments closed