June 15, 2020

The Basics of A/B Testing with R

Holger von Jouanne-Diedrich walks us through a simple example of A/B testing and analysis using R:

The bad news is that you have to understand a little bit about statistical hypothesis testing; the good news is that if you read the following post, you will have everything you need (plus, as an added bonus, R already has all the tools you need at hand!): From Coin Tosses to p-Hacking: Make Statistics Significant Again! (OK, reading it will push you over the one-minute mark…).

Check out that article and the example in the blog post as well. R makes it really easy to perform this sort of analysis.
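If you want to see the arithmetic outside of R, here is a rough equivalent in Python (a sketch only; the conversion counts are made up, and the linked post itself works in R):

from scipy import stats
import numpy as np

# Hypothetical A/B results: conversions out of total visitors per variant.
conversions = np.array([200, 236])
visitors = np.array([4000, 4000])

# 2x2 contingency table: one row per variant, columns [converted, not converted].
table = np.column_stack([conversions, visitors - conversions])

# Chi-squared test of independence, essentially what R's prop.test runs.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"p-value: {p_value:.4f}")

If the p-value lands under your chosen significance level, the difference between variants is unlikely to be chance alone, with all of the p-hacking caveats the linked post covers.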


Smoothing Out Write Behavior in Apache Flink

Dmitry Tolpeko solves an interesting problem:

It would be nice to smooth S3 write operations between two checkpoints. How to do that?

You may have already noticed there are three single PUT operations above, made at 37:02, 37:06, and 37:09, before the checkpoint. The write size gives you a clue: each is a single part of a multi-part upload to S3.

Some data sets were quite large, so their data spilled before the checkpoint. Note that this is an internal spill in S3; the data will not be visible until it is committed upon a successful Flink checkpoint.

So how can we force more writes to happen before the checkpoint so we can smooth IOPS and probably reduce the overall checkpoint latency? 

Read on for the answer.


Simplified Slope Graphs

Andy Kirk spots a few interesting uses of slope graphs:

As somebody who tries to consume as much visualisation work as possible, I always get a little extra joy from seeing clusters of the same techniques emerging. One such recent trend has been the use of simplified slope graphs.

By ‘simplified’ I mean they are stripped right back to a simple function of just showing the direction of change between two points in time: there are no axes and no other chart apparatus, just the trends.

I’m kind of iffy on it. I do like the map showing behavior of states over time, but the first visual had too much going on and the third visual had too much whitespace for my taste.
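If you want a feel for the technique, here is a minimal sketch in Python with matplotlib (the data is invented): draw one line per category between the two time points, then strip away everything else.

import matplotlib.pyplot as plt

# Invented data: one metric measured at two points in time, per category.
data = {"A": (3.2, 4.1), "B": (5.0, 4.4), "C": (2.1, 2.9)}

fig, ax = plt.subplots(figsize=(3, 4))
for label, (before, after) in data.items():
    # Color by direction of change, the one thing these charts communicate.
    color = "tab:green" if after >= before else "tab:red"
    ax.plot([0, 1], [before, after], color=color, marker="o")
    ax.text(1.05, after, label, va="center")

# "Simplified" means no axes and no other chart apparatus, just the trends.
ax.axis("off")
plt.show()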


Choosing an Algorithm for Table.Join in Power Query

Chris Webb continues a series on optimizing merge performance in Power Query:

The first thing to say is that if you don’t specify a join algorithm in the sixth parameter of Table.Join (it’s an optional parameter), Power Query will try to decide which algorithm to use based on some undocumented heuristics. The same thing also happens if you use JoinAlgorithm.Dynamic in the sixth parameter of Table.Join, or if you use the Table.NestedJoin function instead, which doesn’t allow you to explicitly specify an algorithm.

There are going to be some cases where you can get better performance by explicitly specifying a join algorithm instead of relying on JoinAlgorithm.Dynamic but you’ll have to do some thorough testing to prove it. From what I’ve seen there are lots of cases where explicitly setting the algorithm will result in worse performance, although there are enough cases where doing so results in better performance to make all that testing worthwhile.

That behavior is analogous to writing INNER LOOP JOIN or INNER HASH JOIN hints in your queries. In the right spot, you may have knowledge the optimizer doesn't and can improve performance, but much of the time the approach is too heavy-handed and ends up as a net degradation in performance.
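For reference, explicitly choosing an algorithm just means filling in that sixth parameter. Here is a sketch in M, where Orders and Customers are hypothetical queries (note that Table.Join requires the two tables to share no column names, hence the different key names):

let
    // JoinAlgorithm.LeftHash overrides the default JoinAlgorithm.Dynamic;
    // other options include RightHash, PairwiseHash, SortMerge,
    // LeftIndex, and RightIndex.
    Merged = Table.Join(
        Orders, "CustomerID",
        Customers, "ID",
        JoinKind.Inner,
        JoinAlgorithm.LeftHash
    )
in
    Merged

As Chris's testing suggests, treat any explicit choice as a hypothesis to verify, not a default.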


Obfuscating Data in SQL Server

Dave Mason has a data obfuscator:

In a previous post, I explored an option for generating fake data in SQL Server using Machine Learning Services and the R language. I've expanded on that by creating some stored procedures that can be used both for generating sets of fake data and for obfuscating existing SQL Server data with fake data.

The code is available in a GitHub repository. For now, it consists of ten stored procedures.

Unlike Dynamic Data Masking, this is a permanent update to the table. That makes it quite helpful for getting production distributions and use cases into non-production environments.
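Dave's procedures are T-SQL with R doing the generation; as a purely conceptual sketch of what permanent obfuscation means (Python here, with the Faker package and a hypothetical customers table in a non-production copy), the point is that the real values get overwritten:

import sqlite3
from faker import Faker  # third-party package: pip install faker

fake = Faker()

# Hypothetical non-production copy of the database.
conn = sqlite3.connect("scrubbed_copy.db")

# Overwrite PII columns row by row with fake values. Unlike Dynamic Data
# Masking, the original values are gone once this commits.
for (row_id,) in conn.execute("SELECT id FROM customers").fetchall():
    conn.execute(
        "UPDATE customers SET first_name = ?, last_name = ?, email = ? WHERE id = ?",
        (fake.first_name(), fake.last_name(), fake.email(), row_id),
    )
conn.commit()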
