# Day: June 15, 2020

The bad news is, that you have to understand a little bit about statistical hypothesis testing, the good news is that if you read the following post, you have everything you need (plus, as an added bonus R has all the tools you need already at hand!): From Coin Tosses to p-Hacking: Make Statistics Significant Again! (ok, reading it would make it over one minute…).

Check out that article and the example in the blog post as well. R makes it really easy to perform this sort of analysis.

It would be nice to smooth S3 write operations between two checkpoints. How to do that?

You may have already noticed there are 3 single PUT operations above made at 37:02, 37:06 and 37:09 before the checkpoint. The write size can give you a clue, it is a single part of multi-part upload to S3.

So some data sets were quite large so their data spilled before the checkpoint. Note that this is the internal spill in S3, data will not be visible until committed upon the successful Flink checkpoint.

So how can we force more writes to happen before the checkpoint so we can smooth IOPS and probably reduce the overall checkpoint latency?

As somebody who tries to consume as much visualisation work as possible, I always get a little extra joy from seeing clusters of the same techniques emerging. One such recent trend has been the use of simplified slope graphs.

By ‘simplified’ I mean they are stripped right back to a simple function of just showing the direction of change between two points in time, there are no axes and no other chart apparatus, just the trends.

I’m kind of iffy on it. I do like the map showing behavior of states over time, but the first visual had too much going on and the third visual had too much whitespace for my taste.

The first thing to say is that if you don’t specify a join algorithm in the sixth parameter of Table.Join (it’s an optional parameter), Power Query will try to decide which algorithm to use based on some undocumented heuristics. The same thing also happens if you use JoinAlgorithm.Dynamic in the sixth parameter of Table.Join, or if you use the Table.NestedJoin function instead, which doesn’t allow you to explicitly specify an algorithm.

There are going to be some cases where you can get better performance by explicitly specifying a join algorithm instead of relying on JoinAlgorithm.Dynamic but you’ll have to do some thorough testing to prove it. From what I’ve seen there are lots of cases where explicitly setting the algorithm will result in worse performance, although there are enough cases where doing so results in better performance to make all that testing worthwhile.

That behavior is the same as if you decided to start writing `INNER LOOP JOIN` or `INNER HASH JOIN` for your queries. In the right spot, you may have knowledge the optimizer doesn’t have and can make performance faster, but a lot of the time the approach will be a bit too heavy-handed and end up a net degradation of performance.

I am convinced they got tired of me saying, “Well, you’ll need SSIS if you want to read Excel…” Probably not, though. Either way… score! :{>

Read on to see how easy it is to select Excel as a file format for your Blob Storage data.