Data Lake Zoning

Parth Patel, et al, explain that there ought to be several zones of data within a data lake:

Within a Data Lake, zones allow the logical and/or physical separation of data that keeps the environment secure, organized, and Agile. Typically, the use of 3 or 4 zones is encouraged, but fewer or more may be leveraged. A generic 4-zone system might include the following:

  1. Transient Zone — Used to hold ephemeral data, such as temporary copies, streaming spools, or other short-lived data before being ingested.
  2. Raw Zone – The zone in which raw data will be maintained. This is also the zone where sensitive data must be encrypted, tokenized, or otherwise secured.
  3. Trusted Zone – After Data Quality, Validation, or other processing is performed on data in the Raw Zone, it becomes the “source of truth” in this zone for downstream systems.
  4. Refined Zone – Manipulated and enriched data is kept in this zone. This is used to store the output from tools like Hive or external tools that will write into to the Data Lake.

Your particular situation may differ but I’d consider this to be good advice no matter where or how you’re storing data, such as a classical data warehouse or an ODS.

Related Posts

Spark Changes In HDP 2.6

Vinay Shukla and Syed Mahmood talk about what’s new with Spark and Zeppelin in the Hortonworks Data Platform 2.6 update: SPARKR & PYSPARK Most data scientists use R & Python and with SparkR & PySpark respectively they can continue to leverage their familiarity with the R & Python languages. However, they need to use the Spark […]

Read More

Presto On HDInsight

Ashish Thapliyal shows how to install Presto on an HDInsight cluster: What is Presto? Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed. It supports standard ANSI SQL, including complex queries, aggregations, joins, and window functions. Presto is becoming popular SQL interactive query engine that has grabbed the attention and mind-share […]

Read More

Categories

April 2017
MTWTFSS
« Mar May »
 12
3456789
10111213141516
17181920212223
24252627282930