
Author: Kevin Feasel

Smoothing Out Write Behavior in Apache Flink

Dmitry Tolpeko solves an interesting problem:

It would be nice to smooth S3 write operations between two checkpoints. How to do that?

You may have already noticed there are 3 single PUT operations above, made at 37:02, 37:06, and 37:09 before the checkpoint. The write size can give you a clue: each is a single part of a multi-part upload to S3.

Some data sets were quite large, so their data spilled before the checkpoint. Note that this is an internal spill in S3; the data will not be visible until committed upon a successful Flink checkpoint.

So how can we force more writes to happen before the checkpoint, smoothing IOPS and probably reducing the overall checkpoint latency?

Read on for the answer.
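If you want to experiment in the meantime, here is a minimal PyFlink sketch of two generic levers for spreading writes out over time. This is my illustration rather than Dmitry's solution (which is in the linked post), and the interval values are arbitrary:

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60 * 1000)  # checkpoint every 60 seconds (arbitrary)
# Leave some breathing room so part uploads aren't all bunched at checkpoint time
env.get_checkpoint_config().set_min_pause_between_checkpoints(30 * 1000)

# With the Hadoop S3A filesystem, a smaller multipart size (fs.s3a.multipart.size)
# also makes parts flush to S3 earlier instead of piling up right before the
# checkpoint; the exact key depends on which S3 filesystem plugin you use.
```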


The Basics of A/B Testing with R

Holger von Jouanne-Diedrich walks us through a simple example of A/B testing and analysis using R:

The bad news is that you have to understand a little bit about statistical hypothesis testing; the good news is that if you read the following post, you have everything you need (plus, as an added bonus, R already has all the tools you need at hand!): From Coin Tosses to p-Hacking: Make Statistics Significant Again! (ok, reading it will take more than one minute…).

Check out that article and the example in the blog post as well. R makes it really easy to perform this sort of analysis.
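The post's examples are in R; as a rough translation of the core test into Python with scipy, here is a sketch using made-up conversion counts:

```python
from scipy.stats import chi2_contingency

# made-up counts: [converted, did not convert] for each variant
observed = [[200, 9800],   # variant A
            [250, 9750]]   # variant B
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p-value: {p_value:.4f}")  # under 0.05 -> significant at the 5% level
```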


Choosing an Algorithm for Table.Join in Power Query

Chris Webb continues a series on optimizing merge performance in Power Query:

The first thing to say is that if you don’t specify a join algorithm in the sixth parameter of Table.Join (it’s an optional parameter), Power Query will try to decide which algorithm to use based on some undocumented heuristics. The same thing also happens if you use JoinAlgorithm.Dynamic in the sixth parameter of Table.Join, or if you use the Table.NestedJoin function instead, which doesn’t allow you to explicitly specify an algorithm.

There are going to be some cases where you can get better performance by explicitly specifying a join algorithm instead of relying on JoinAlgorithm.Dynamic, but you'll have to do some thorough testing to prove it. From what I've seen, there are lots of cases where explicitly setting the algorithm will result in worse performance, although there are enough cases where doing so results in better performance to make all that testing worthwhile.

That behavior is the same as if you decided to start writing INNER LOOP JOIN or INNER HASH JOIN for your queries. In the right spot, you may have knowledge the optimizer doesn't have and can improve performance, but a lot of the time the approach will be a bit too heavy-handed and end up as a net degradation of performance.
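To make the distinction concrete, here is a toy Python sketch of the two strategies. Real engines (Power Query included) are far more sophisticated, but the shape of the work is the same:

```python
def nested_loop_join(left, right, key):
    # compare every pair of rows: O(n * m)
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

def hash_join(left, right, key):
    # build a lookup on one side first: O(m) build + O(n) probe
    lookup = {}
    for row in right:
        lookup.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in lookup.get(l[key], [])]
```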


Simplified Slope Graphs

Andy Kirk spots a few interesting uses of slope graphs:

As somebody who tries to consume as much visualisation work as possible, I always get a little extra joy from seeing clusters of the same techniques emerging. One such recent trend has been the use of simplified slope graphs.

By ‘simplified’ I mean they are stripped right back to a simple function of just showing the direction of change between two points in time: there are no axes and no other chart apparatus, just the trends.

I’m kind of iffy on it. I do like the map showing behavior of states over time, but the first visual had too much going on and the third visual had too much whitespace for my taste.
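If you want to try the technique yourself, here is a minimal matplotlib sketch with made-up values: two points per series, a line between them, and everything else stripped away:

```python
import matplotlib.pyplot as plt

data = {"A": (10, 15), "B": (12, 9), "C": (7, 11)}  # made-up before/after values

fig, ax = plt.subplots()
for label, (before, after) in data.items():
    color = "tab:green" if after >= before else "tab:red"
    ax.plot([0, 1], [before, after], color=color)
    ax.text(1.02, after, label, va="center")
ax.axis("off")  # no axes, no other chart apparatus -- just the trends
plt.show()
```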


Obfuscating Data in SQL Server

Dave Mason has a data obfuscator:

In a previous post, I explored an option for generating fake data in SQL Server using Machine Learning Services and the R language. I've expanded on that by creating some stored procedures that can be used both for generating data sets of fake data and for obfuscating existing SQL Server data with fake data.

The code is available in a GitHub repository. For now, it consists of ten stored procedures.

Unlike Dynamic Data Masking, this is a permanent update to the table. That makes it quite helpful for getting production distributions and use cases into non-production environments.
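As a rough sketch of the general idea (this is not Dave's code, and the table, column, and connection details are all made up), the pattern boils down to overwriting real values with fake ones in place:

```python
import random
import pyodbc

FAKE_NAMES = ["Alex", "Sam", "Jordan", "Casey", "Riley"]

conn = pyodbc.connect("DSN=Staging")  # hypothetical DSN
cur = conn.cursor()
rows = cur.execute("SELECT PersonID FROM dbo.Person").fetchall()  # hypothetical table
for (person_id,) in rows:
    cur.execute("UPDATE dbo.Person SET FirstName = ? WHERE PersonID = ?",
                random.choice(FAKE_NAMES), person_id)
conn.commit()  # unlike Dynamic Data Masking, the rewrite is permanent
```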


Open-sourcing Kube2Hadoop

Cong Gu, et al., announce the open-sourcing of a project:

By default, there is a gap between the security model of Kubernetes and Hadoop. Specifically, Hadoop uses Kerberos, a three-party protocol built on symmetric key cryptography to ensure any clients accessing the cluster are who they claim to be. In order to avoid frequent authentication checks against a Kerberos server, Delegation Tokens, a lightweight two-party authentication method, were introduced to complement Kerberos authentication. The Hadoop delegation token by default has a lifespan of one day and can be renewed for up to seven days. Kubernetes, on the other hand, uses a certificate-based approach for authentication, and does not expose the owner of a job in any of its public-facing APIs. Therefore, it is not possible to securely determine the authorized user from within the pod using the native Kubernetes API and then use that username to fetch the Hadoop delegation token for HDFS access.

To allow for Kubernetes workloads to securely access HDFS, we built Kube2Hadoop, a scalable and secure integration with HDFS Kerberos. This enables AI modelers at LinkedIn to use HDFS data in Kubernetes pods with access control through a user account or a headless account. Headless accounts are oftentimes used to denote a virtual team that is working on projects that would share the same data within the team. The data acquired can then be used in their model exploration and training with KubeFlow components such as the tf-operator and mpi-operator. In this blog, we will describe the design and authentication model of Kube2Hadoop. 

Read on to see how it works and a link to the GitHub repo.
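For a sense of what the Hadoop side of that handshake looks like (this is plain WebHDFS rather than Kube2Hadoop's own API, and the host and renewer are made up), an already-authenticated client can request a delegation token like so:

```python
import requests
from requests_kerberos import HTTPKerberosAuth  # assumes a valid Kerberos ticket

resp = requests.get(
    "http://namenode.example.com:9870/webhdfs/v1/?op=GETDELEGATIONTOKEN&renewer=myuser",
    auth=HTTPKerberosAuth(),
)
token = resp.json()["Token"]["urlString"]  # good for a day, renewable up to seven by default
```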


Incremental Refresh of Any Power BI Data Source

Gilbert Quevauvilliers wants to incrementally refresh all the sources:

The pattern that I am talking about is the following, which will be used as my example below.

– Connect to a data source which can query fold.
> In my example I have installed and configured an Azure SQL Serverless DB
> In this database I have a date table.
– Configure date table to use Incremental refreshing as per the blog post
> Incremental refresh in Power BI
– Create a function which will then use the Date value as part of the parameter
> In my example I have got CSV files which have Exchange rate information from Azure Blob Storage.
> The file names of the CSV files are the dates.
– Invoke the Function within the date table to extract the required information.

I know what you might be thinking: that as soon as I add in the column with the function, it breaks the Query Folding. That is what I thought too.

The great news is that Incremental refreshing DOES STILL WORK!

Read on for the demonstration.
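As a rough Python analogue of Gilbert's pattern (the storage account and container here are made up), each date in the refresh window drives its own lookup, with the file name built from the date:

```python
from datetime import date, timedelta

import pandas as pd

def exchange_rates_for(day: date) -> pd.DataFrame:
    # the blob's file name is the date, so the date parameter builds the URL
    url = f"https://myaccount.blob.core.windows.net/rates/{day:%Y-%m-%d}.csv"
    return pd.read_csv(url)

start = date(2020, 6, 1)  # arbitrary start of the refresh window
frames = [exchange_rates_for(start + timedelta(days=i)) for i in range(7)]
rates = pd.concat(frames, ignore_index=True)
```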


Analysis versus Reporting and Power BI

Rob Collie thinks about industry movement between analysis and reporting. Part one gives us some backstory:

Excel was about to make a large investment in BI-related capabilities, and the powers that be had selected me to lead our part in it. I was excited, but now I needed a crash course in “what the hell is BI?” I was given multiple tutors, and they all were quick to introduce the concept of Analysis versus Reporting. The “versus” seemed to be pretty important. It wasn’t an “and” – no, the “versus” was chosen deliberately in these sermons. You see, these were Two Very Different Things.

I struggled mightily to grasp this difference. I was told that interactive things like PivotTables were Analysis tools – NOT Reporting tools! Reports were something completely different. “But,” I pointed out, “they’re called ‘Insert PivotTable Report’ on the Excel menu today!” (This was Excel 2003). “Yeah,” said the mentors, “…we might want to fix that.”

Part two explains why analysis and reporting are both important:

Another “meta characteristic” of paginated reports is that they TEND to display details rather than aggregations. EX: specific transactions rather than emergent trends. In paginated reports, you’re MORE likely (but not guaranteed!) to be looking at “raw” rows of data from the original database, whereas in a Power BI report, you’re more likely (but again, not guaranteed!) to NOT be seeing raw individual rows, but rather intelligent aggregations of MANY rows. But either way, more detail means you’re more likely to need multiple pages.

Rob’s right on the money. And I’m looking forward to part three of the series.


Altering the Database without Rolling Back Users

Kenneth Fisher wants to change a database:

If this strikes a bit too close to home for you then you need to look at the ROLLBACK clause. It’s great for killing and rolling back all of the current connections before making my change.

But this is a pretty sensitive app and if there’s something running I have to let it finish. No ROLLBACK allowed. But I’m also not going to wait forever to see if my alter is going to happen. Turns out there is a nice easy option for this too.

Click through to see the option, as well as the message you get if it can’t work immediately.
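For reference, the ROLLBACK side that Kenneth rules out looks like this (the database name and option are made up for illustration; the gentler alternative is in his post):

```python
import pyodbc

# ALTER DATABASE cannot run inside a transaction, hence autocommit
conn = pyodbc.connect("DSN=MyServer", autocommit=True)
conn.execute("ALTER DATABASE [SampleDB] SET READ_ONLY WITH ROLLBACK AFTER 30 SECONDS")
# gives sessions 30 seconds, then kills and rolls back whatever is still running
```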
