With data lakes becoming popular, and Azure Data Lake Store (ADLS) Gen2 being used for many of them, a common question I am asked about is “How can I access data in ADLS Gen2 instead of a copy of the data in another product (i.e. Azure SQL Data Warehouse)?”. The benefits of accessing ADLS Gen2 directly is less ETL, less cost, to see if the data in the data lake has value before making it part of ETL, for a one-time report, for a data scientist who wants to use the data to train a model, or for using a compute solution that points to ADLS Gen2 to clean your data. While these are all valid reasons, you still want to have a relational database (see Is the traditional data warehouse dead?). The trade-off in accessing data directly in ADLS Gen2 is slower performance, limited concurrency, limited data security (no row-level, column-level, dynamic data masking, etc) and the difficulty in accessing it compared to accessing a relational database.
Since ADLS Gen2 is just storage, you need other technologies to copy data to it or to read data in it.
Read on for the solution.
As a result, companies tend to have a lot of raw, unstructured data that they’ve collected from various sources sitting stagnant in data lakes. Without a way to reliably combine historical data with real-time streaming data, and add structure to the data so that it can be fed into machine learning models, these data lakes can quickly become convoluted, unorganized messes that have given rise to the term “data swamps.”
Before a single data point has been transformed or analyzed, data engineers have already run into their first dilemma: how to bring together processing of historical (“batch”) data, and real-time streaming data. Traditionally, one might use a lambda architecture to bridge this gap, but that presents problems of its own stemming from lambda’s complexity, as well as its tendency to cause data loss or corruption.
Read the whole thing.
Clean the data and optionally aggregate it as it sits in source system. The tool used for this would depend on the source system that stores the data (i.e. if SQL Server, you would use stored procedures). The only benefit with this option is if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don’t need the row-level details. The disadvantages are: the raw source data is not available in the data lake, so you would always need to go back to source system if you needed to get it again, and it may not even still exist in the source system; you would put extra stress on the source system when doing the cleaning which could affect end users using the system; it could take a long time to clean the data as the source system may not have fast performance; and you would not be able to use other tools (i.e. Hadoop, Databricks) to clean it. Strongly advise against this option
Read on for additional options and James’s recommendations.
– You may need to consider separate storage accounts if you need to segregate access control (RBAC), virtual networks, access keys, and the like. (Note that RBAC can also be set at the container level too, but ACL type permissions only apply to ADLS Gen2 and not to blob storage.)
– If you don’t need the hierarchical namespace whatsoever (for non-analytical use cases), this could mean a separate storage account. The storage cost is the same but transaction costs are higher when the HNS is enabled (discussed in item #8 of this post).
Click through for more details, including several more tips about Azure Storage Accounts, Azure Blob Storage Containers, and the Azure Storage Blobs themselves.
A couple of people have asked me recently about how to ‘bone up’ on the new data lake service in Azure. The way I see it, there are two aspects: A, the technology itself and B, data lake principles and architectural best practices. Below are some links to resources that you should find helpful.
There’s a lot of good stuff there.
When to use Blob vs ADLS Gen2
New analytics projects should use ADLS Gen2, and current Blob storage should be converted to ADLS Gen2, unless these are non-analytical use cases that only need object storage rather than hierarchical storage (i.e. video, images, backup files), in which case you can use Blob Storage and save a bit of money on transaction costs (storage costs will be the same between Blob and ADLS Gen2 but transaction costs will be a bit higher for ADLS Gen2 due to the overhead of namespaces).
Looks like there are still some things missing from Gen2, so don’t automatically jump on an upgrade. Read the documentation first to make sure you aren’t relying on something which isn’t there yet.
Q: Partitioning by date is common. Where should the dates go in the folder hierarchy?
Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder levels (such as by subject area), but we rarely set up security based on time elements.
Optimal for folder security: \SubjectArea\DataSource\YYYY\MM\DD\FileData_YYYY_MM_DD.csv
Tedious for folder security: \YYYY\MM\DD\SubjectArea\DataSource\FileData_YYYY_MM_DD.csv
Click through for all of Melissa’s advice in FAQ form.
Databricks is a recent addition to Azure that is greatly influencing the technology choices that people are making when determining how to process data. Prior to the introduction of Databricks to Azure in March of 2018, if you had a lot of unstructured data which was stored in HDFS clusters, and wanted to analyze it in a scalable fashion, the choice was Data Lake and using USQL with Data Lake Analytics. With the introduction of Databricks, there is now a choice for analysis between Data Lake Analytics and Databricks for analyzing data.
Click through for the comparison.
As we’ve discussed many times, the performance of the storage layer has an outsized impact on the total cost of ownership (TCO) for your complete analytics pipeline. This is due to the fact that every percentage point improvement in storage performance results in that same percentage reduction in the requirement for the very expensive compute layer. Given that the disaggregated storage model allows us to scale compute and storage independently, that percentage reduction in compute requirement results in almost the same (compute typically equates to 90 percent of the TCO) reduction in TCO.
So, when I say that ADLS Gen2 provides performance improvements ranging from 10-50 percent, depending on the nature of the workload over existing storage solutions, this equates to VERY significant reductions in the monthly analytics spend. It also has the added benefit of providing your insights sooner!
Check out all of the changes.
I would not say it’s common place to load structured data into the data lake, but I do see it frequently.
In most cases it is not necessary to first copy relational source data into the data lake and then into the data warehouse, especially when keeping in mind the effort to migrate existing ETL jobs that are already copying source data into the data warehouse, but there are some good uses cases to do just that:
There are some good reasons in here, so check them out.