On Whether Relational Data Belongs In A Data Lake

Melissa Coates debates whether relational data really belongs in a data lake:

For certain types of data, writing it to the data lake really is frequently the best choice. This is often true for low latency IoT data, semi-structured data like logs, and varying structures such as social media data. However, the handling of structured data which originates from a relational database is much less clear.

Most data lake technologies store data as files (like csv, json, or parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database such as data types, constraints, foreign keys, etc. I tend to say that we “de-relationalize” data when we write it to a file in the data lake. If we’re going to turn right around and load that data to a relational database destination, is it the right call to write it out to a file in the data lake as an intermediary step?

Click through for considerations on both sides of the fence.

Related Posts

Azure Data Lake Storage Generation 2

James Baker announces updates to Azure Data Lake Storage Gen2: As we’ve discussed many times, the performance of the storage layer has an outsized impact on the total cost of ownership (TCO) for your complete analytics pipeline. This is due to the fact that every percentage point improvement in storage performance results in that same […]

Read More

When Structured Data Makes Sense In A Data Lake

James Serra ponders when it makes sense to add structured data to a data lake: I would not say it’s common place to load structured data into the data lake, but I do see it frequently. In most cases it is not necessary to first copy relational source data into the data lake and then […]

Read More

Categories

November 2018
MTWTFSS
« Oct Dec »
 1234
567891011
12131415161718
19202122232425
2627282930