For certain types of data, writing to the data lake is frequently the best choice. This is often true for low-latency IoT data, semi-structured data like logs, and data with varying structures, such as social media data. However, how best to handle structured data that originates from a relational database is much less clear.
Most data lake technologies store data as files (such as CSV, JSON, or Parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database, such as data types, constraints, and foreign keys. I tend to say that we “de-relationalize” data when we write it to a file in the data lake. If we’re going to turn right around and load that data into a relational database destination, is it the right call to write it out to a file in the data lake as an intermediate step?
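To make the "de-relationalize" point concrete, here is a minimal sketch using only Python's standard library. The `customer` and `orders` tables and the `round_trip_through_csv` helper are hypothetical names chosen for illustration, and an in-memory CSV buffer stands in for a file in the lake. It shows one narrow case: typed, constrained rows come back from CSV as plain strings.

```python
import csv
import io
import sqlite3


def round_trip_through_csv():
    """Extract rows from a relational table to CSV text and read them back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    # Hypothetical schema: types, NOT NULL constraints, and a foreign key.
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "customer_id INTEGER NOT NULL REFERENCES customer(id), "
        "amount REAL NOT NULL)"
    )
    conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 99.5)")

    # Rows read from the database carry Python ints and floats.
    cursor = conn.execute("SELECT id, customer_id, amount FROM orders")
    source_rows = cursor.fetchall()
    header = [col[0] for col in cursor.description]
    conn.close()

    # Write to CSV; the buffer stands in for a file stored in the data lake.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(source_rows)

    # Read the CSV back: every value is now a string, and nothing in the
    # file records the primary key, NOT NULL, or foreign key definitions.
    buffer.seek(0)
    reader = csv.reader(buffer)
    next(reader)  # skip the header row
    file_rows = [tuple(row) for row in reader]
    return source_rows, file_rows


source_rows, file_rows = round_trip_through_csv()
print(source_rows)  # typed values from the database
print(file_rows)    # the same values as untyped strings
```

A typed format such as Parquet would preserve column data types, but keys and constraints would still need to be re-declared at the relational destination.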
Click through for considerations on both sides of the fence.