Keep That Data Raw

Kevin Feasel

2017-09-18

Data, ETL

Archana Madhavan argues that you should retain your raw data:

When your pipeline already has to read every line of your data, it’s tempting to make it perform some fancy transformations. But you should steer clear of these add-ons so that you:

  • Avoid flawed calculations. If you have thousands of machines running your pipeline in real-time, sure, it’s easy to collect your data — but not so easy to tell if those machines are performing the right calculations.

  • Won’t limit yourself to the aggregates you decided on in the past. If you’re performing actions on your data as it streams by, you only get one shot. If you change your mind about what you want to calculate, you can only get those new stats going forward — your old data is already set in stone.

  • Won’t break the pipeline. If you start doing fancy stuff on the pipeline, you’re eventually going to break it. So you may have a great idea for a new calculation, but if you implement it, you’re putting the hundreds of other calculations used by your coworkers in jeopardy. When a pipeline breaks down, you may never get that data.
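The second point is the crux: if you only keep aggregates, a question you think of later can never be answered over old data. A minimal sketch of the pattern in Python (all file paths and event field names here are hypothetical, not from Madhavan's post): land events verbatim in an append-only store, and treat every aggregate as a re-runnable job over that store.

```python
# Sketch: keep ingest dumb, keep aggregation replayable.
# RAW_DIR, "user_id", and "page" are illustrative assumptions.
import json
from collections import Counter
from pathlib import Path

RAW_DIR = Path("raw_events")  # append-only landing zone, one JSON object per line

def land_event(event: dict, day: str) -> None:
    """Append the event verbatim; no transformation at ingest time."""
    RAW_DIR.mkdir(exist_ok=True)
    with (RAW_DIR / f"{day}.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

def aggregate(metric) -> Counter:
    """Re-runnable aggregation: a new metric can be computed over all history."""
    counts: Counter = Counter()
    for day_file in sorted(RAW_DIR.glob("*.jsonl")):
        with day_file.open(encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                counts[metric(event)] += 1
    return counts

# Original aggregate: events per user.
by_user = aggregate(lambda e: e["user_id"])
# Aggregate decided on later: events per page -- computable over old data
# only because the raw events were retained.
by_page = aggregate(lambda e: e.get("page", "unknown"))
```

Because the raw files are never mutated, a broken or changed aggregation job costs you a re-run, not your history.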

The catch is that even though storage is much cheaper than it used to be, there is a fairly long tail between retaining raw data and turning it into revenue. I like the idea, but it is a hard sell when you generate a huge amount of data.

