Choose Your Hadoop File Format

Kevin Feasel



Alex Woodie explains three of the most common Hadoop file formats:

You have many choices when it comes to storing and processing data on Hadoop, which can be both a blessing and a curse. The data may arrive in your Hadoop cluster in a human readable format like JSON or XML, or as a CSV file, but that doesn’t mean that’s the best way to actually store data.

In fact, storing data in Hadoop using those raw formats is terribly inefficient. Plus, those file formats cannot be stored in a parallel manner. Since you’re using Hadoop in the first place, it’s likely that storage efficiency and parallelism are high on the list of priorities, which means you need something else.

Luckily for you, the big data community has basically settled on three optimized file formats for use in Hadoop clusters: Optimized Row Columnar (ORC), Avro, and Parquet. While these file formats share some similarities, each of them are unique and bring their own relative advantages and disadvantages.

Read the whole thing.  I’m partial to ORC and Avro but won’t blink if someone recommends Parquet.

Using The Map Function In R

Kevin Feasel



Nicolas Attalides on using purrr:

The best place to start when exploring the purrr package is the map function. The reader will notice that these functions are utilised in a very similar way to the apply family of functions. The subtle difference is that the purrr functions are consistent and the user can be assured of the output – as opposed to some cases when using for example sapply as I demonstrate later on.

My considered belief is, Always Be Purrring.  H/T R-bloggers

Storing An Encrypted Password In The Solr Configuration File

Jon Morisi shows us how to store an encrypted password in Solr’s configuration file, rather than storing the password in plaintext:

The config file has a lot of options, in short this is where you configure a database connection string and reference your jdbc jar file. Full details are here.  By default any of the examples that come with the Solr distribution use a plain text username and password.  This can be potentially viewed from the front end:
http://hostname:8983/solr/ > Select Collection from the drop-down > Click data Import > expand configuration
Obviously we do not want to store our username and password in plain text.  The config file includes an option to encrypt the password and then store the key in a separate file.

Storing passwords in plaintext is a classic mistake that I see far too often.  And then when someone checks in that config file to a public GitHub repo…

T-SQL Join Delete

Steve Stedman walks us through a bit of T-SQL proprietary syntax:

FROM [dbo].[Table1] t1
INNER JOIN [dbo].[Table2] t2 on t1.favColor =;

Names have been changed to protect the innocent.

In the above delete statement which table will have rows deleted from it?

A: Table1

B: Table2

C: Both Table1 and Table2

D: Neither Table1 and Table2

Got it in one.  I like having this syntax available to me when I need it, even though it’s not ANSI standard.

Installing OpenSSH Server: Windows 10 Edition

Anthony Nocentino shows us how to install OpenSSH server on Windows 10 update 1803:

So in yesterday’s post we learned that the OpenSSH client is included with the Windows 10, Update 1803!  Guess, what else is included in this server, an OpenSSH Server! Yes, that’s right…you can now run an OpenSSH server on your Windows 10 system and get a remote terminal! So in this post, let’s check out what we need to do to get OpenSSH Server up and running.

First, we’ll need to ensure we update the system to Windows 10, Update 1803. Do that using your normal update mechanisms.

With that installed, let’s check out the new Windows Capabilities (Features) available in this Update, we can use PowerShell to search through them.

Anthony goes through the steps for configuration, so check that out.

Updating A Table Using Change Data Capture Without Downtime

Robert Blackburn takes us through the steps of updating a table which uses Change Data Capture without taking a downtime window:


  1. Stop jobs that process CDC (SSIS).

  2. Inside a transaction with isolation level serializable: Alter Table schema and create temporary CDC table

  3. Copy old CDC rows to new table excluding dup rows (based on [__$seqval])

  4. Disable old (original) CDC table (schema is outdated). Will drop table

Click through for the rest of the steps and an example script.

Issues With TDE And Backup Compression

Ned Otter describes the troubled history of the union of Transparent Data Encryption and backup compression:

The history of TDE and backup compression is that until SQL 2016, they were great features that didn’t play well together – if TDE was in play, backup compression didn’t work well, or at all.

However, with the release of SQL 2016, Microsoft aimed to have these two awesome features get along better (the blog post announcing this feature interoperability is here). Then there was this “you need to patch” post, due to edge cases that might cause your backup to not be restored. So if you haven’t patched in a while, now would be a good time to do so, because Microsoft says those issues have been resolved (although that seems to be disputed here).

My sympathies definitely lie toward backup compression over TDE if forced to choose between the two.

Tips For Managing Business Logic

Kevin Feasel



Tim Mitchell has a few tips for managing business logic, particularly when building ETL processes:

First things first: let’s get on the same page with what is meant by business logic. When I refer to business logic (also commonly referred to as business rules), I’m talking about the processing rules that are used to transform an organization’s data so that it is accurate, understandable, and usable. In almost all cases, these business rules are not designed to change the meaning of the data, but to clarify and make it easier to comprehend. Business logic may be applied when data arrives from other sources, or to existing data to reflect changes that have taken place.

Business logic is usually highly customized for and by each organization. The amount of processing required is heavily dependent on factors such as source data quality, reporting granularity, technical skill level of the intended audience, and even company culture.

It’s worth reading the whole thing.


May 2018
« Apr Jun »