I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution? The short answer is you should use both. The rest of this post will dig into the reasons why.
I touched on this ultimate question in a blog post that is now a few years old at Hadoop and Data Warehouses, so this is a good time to provide an update. I also touched on this topic in my blogs Use cases of various products for a big data cloud solution, Data lake details, Why use a data lake? and What is a data lake? and my presentation
Read on for James’s argument, which is good. My argument is summed up as follows: the purpose of a data warehouse is to solve known business problems—that is, to help build reports that people on the business side need based on established requirements. The purpose of a data lake is to hold all kinds of data and curate it for when people come looking for something they didn’t know they needed.