Keep That Data Raw

Kevin Feasel

2017-09-18

Data, ETL

Archana Madhavan argues that you should retain your raw data:

When your pipeline already has to read every line of your data, it’s tempting to make it perform some fancy transformations. But you should steer clear of these add-ons so that you:

  • Avoid flawed calculations. If you have thousands of machines running your pipeline in real-time, sure, it’s easy to collect your data — but not so easy to tell if those machines are performing the right calculations.

  • Won’t limit yourself to the aggregates you decided on in the past. If you’re performing actions on your data as it streams by, you only get one shot. If you change your mind about what you want to calculate, you can only get those new stats going forward — your old data is already set in stone.

  • Won’t break the pipeline. If you start doing fancy stuff on the pipeline, you’re eventually going to break it. So you may have a great idea for a new calculation, but if you implement it, you’re putting the hundreds of other calculations used by your coworkers in jeopardy. When a pipeline breaks down, you may never get that data.

The problem is that even if the cost of storage is much cheaper than before, there’s a fairly long tail before you get into potential revenue generation.  I like the idea, but selling it is hard when you generate a huge amount of data.

Related Posts

Copying Cassandra Data to HDFS

Landon Robinson shows how you can use Spark to extract data from Cassandra and move it into HDFS: Cassandra is a great open-source solution for accessing data at web scale, thanks in no small part to its low-latency performance. And if you’re a power user of Cassandra, there’s a high probability you’ll want to analyze the data […]

Read More

Containers and Data

Randolph West argues that you should keep data and containers separated: Where it gets interesting is that the SQL Server container is also where the database files are stored by default. I raised a point (which Grant and others have already noted in the past) that persisted storage volumes allow us to throw away a SQL Server […]

Read More

Categories

September 2017
MTWTFSS
« Aug Oct »
 123
45678910
11121314151617
18192021222324
252627282930