An Apache Flume Overview

Daniel Berman takes us through an overview of Apache Flume:

Apache Flume was developed by Cloudera to provide a way to quickly and reliably stream large volumes of log files generated by web servers into Hadoop. There, applications can perform further analysis on data in a distributed environment. Initially, Apache Flume was developed to handle only log data. Later, it was equipped to handle event data as well.

Click through to get a code-free, high-level understanding of Flume and where it can work for you.

ggforce Updates

Thomas Lin Pedersen has some ggforce updates for us:

Now, the above plot has some obvious shortcomings. The diagonal is pretty useless for starters, and it is often that these panels are used to plot the distributions of the individual variables. Using e.g. geom_density() won’t work as it always start at 0, thus messing with the y-scale of each row. ggforce provides two new geoms tailored for the diagonal: geom_autodensity() and geom_autohistogram() which automatically positions itself inside the panel without affecting the y-scale. We’d still need to have this geom only in the diagonal, but facet_matrix() provides exactly this sort of control

There are some interesting improvements in here.

Dealing with a Massive Transaction Log File

Kevin Hill takes us through troubleshooting an oversized transaction log file:

I had a customer call in this week after struggling to resolve an issue for a couple of days,  Their Transaction Log file had grown to 400+GB and was filling the drive to the point that had to keep adding space to the drive to keep operations online.  This was for a core internal system.  There are no staff DBAs here, but plenty of very sharp sysadmins, developers, network folks, etc.

Note: I have written about the most common cause of this issue here, but this was not the resolution in this case. The embedded video is far and away number one in hits of all my SQL videos, showing just how common log file growth issues are.

I like the way Kevin does this. He frames the story, takes you through his actions, and gives you a chance to understand his troubleshooting process. Most problems become an order of magnitude easier to solve if you have a reasoned-through process (and enough practice to follow when three levels of management are staring at your screen during a crisis).

Testing TPC-H with Batch Mode

Niko Neugebauer looks at TPC-H query testing in both row mode and batch mode:

I executed every single query enough times so that the execution would be run totally In-Memory (64GB of RAM is enough because we have our data compressed, as I mentioned earlier in the SETUP part). This would allow me to mimic a busy system that has enough resources to process the reading workload. Since the Batch Execution Mode is focusing on the CPU improvements, I decided to sample not only the total elapsed time, but the CPU times so that we can do some judgements of the CPU bandwidth variation. Lowering the CPU consumption fo the high-demanding CPU queries is a key in order to improve the overall system parallelism (well, watch-out for the memory, of-course).

Each successful execution was sampled at least 3 (and in some query cases over 5 times) and then the result would be averaged, so that we can have a higher confidence.

Niko has some interesting findings, some good for SQL Server and some not so good.

Finding Parameter Aliases in Powershell

Mike Robbins has a quick script to find aliases for parameters in Powershell cmdlets:

While sitting through Jeff Hicks‘ Advanced PowerShell Scripting Workshop at PowerShell on the River in Chattanooga today, he mentioned there being a “Cn” alias for the ComputerName parameter of commands in PowerShell.

I’ve previously written a one-liner to find parameter aliases and at one time Microsoft had starting adding parameter aliases to the help for commands as referenced in that same blog article, but it appears that they’ve discontinued adding them to the help and removed the ones they previously added to it.

To be honest, I don’t like aliases that much. Between tab completion and nearest match (where I can type “C” and get “ComputerName” because it’s the only parameter which starts with a C), that satisfies one-off scripts. And for reusable scripts, I’d want to use full names so that the next person doesn’t hunt me down because all of my parameters look like “C” or “Dep”.

The If Statement and Friends in Powershell

Kevin Marquette explains what is possible with if in Powershell:

The -not operator flips an expression from $false to $true or from $true to $false. Here is an example where we want to perform an action when Test-Path is $false.
if ( -not ( Test-Path -Path $path ) )

There’s plenty of good stuff here, so check it out.

Memory-Optimized Table Types

Rob Farley hates spelling “optimized” the best way:

Let me start by saying that if you really want to get the most out of this feature, you will dive deep into questions like durability and natively-compiled stored procedures, which can really make your database fly if the conditions are right. Arguably, any process you’re doing (such as ETL) where the data doesn’t have to survive a system restart should be considered for Memory-Optimized Tables with durability set to SCHEMA_ONLY (I say ‘considered’ because the answer isn’t always obvious – at the moment inserting into memory-optimised tables won’t run in parallel, and this could be a show-stopper for you).

But today I’m going to mention one of the quick-wins available: Table Variables that use User-defined Table Types

This can absolutely help you out, especially in versions of SQL Server prior to 2019 where temporary object metadata contention is a real issue on busy servers.

Proving ETL Correctness

Ed Elliott shares a few techniques for testing ETL processes:

Reconciliation is the process of going to your source system, getting a number and validating that number on the target. This ranges from being easy to impossible, so you need to decide what to reconcile on a case by case basis.

In its simplest form, we can go to a source system and find out things like how many records are to be copied, sum up totals and run other aggregations that we can then validate as correct (or not!) on the target system.

Ed has put together a thoughtful approach to validating data loads regardless of the source.


August 2019
« Jul