Clean the data and optionally aggregate it as it sits in the source system. The tool you use for this will depend on the source system that stores the data (e.g., if it's SQL Server, you would use stored procedures). The only benefit of this option is that if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don't need the row-level details. The disadvantages are: the raw source data is not available in the data lake, so you would always need to go back to the source system if you needed it again, and it may not even still exist there; you put extra stress on the source system during the cleaning, which could affect end users of the system; the cleaning could take a long time because the source system may not have fast performance; and you would not be able to use other tools (e.g., Hadoop, Databricks) to clean it. I strongly advise against this option.
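To make that option concrete, here is a minimal sketch (mine, not from the article) of aggregating at the source before anything moves to Azure, assuming a pyodbc connection to a SQL Server source; the server, database, and table names are placeholders:

```python
import pyodbc

# Hypothetical source connection; server, database, and table are placeholders.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=source-sql;DATABASE=Sales;Trusted_Connection=yes;"
)

# Aggregate at the source so only the rolled-up rows move to Azure;
# the row-level detail never lands in the data lake (the main drawback noted above).
query = """
    SELECT CustomerID,
           CAST(OrderDate AS date) AS OrderDate,
           SUM(LineTotal)          AS DailyTotal
    FROM dbo.OrderLines
    GROUP BY CustomerID, CAST(OrderDate AS date);
"""
daily_totals = conn.cursor().execute(query).fetchall()
```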
Read on for additional options and James’s recommendations.
– You may need to consider separate storage accounts if you need to segregate access control (RBAC), virtual networks, access keys, and the like. (Note that RBAC can also be set at the container level, but ACL-type permissions only apply to ADLS Gen2 and not to blob storage.)
– If you don’t need the hierarchical namespace whatsoever (for non-analytical use cases), this could mean a separate storage account. The storage cost is the same but transaction costs are higher when the HNS is enabled (discussed in item #8 of this post).
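If a separate account with the hierarchical namespace does turn out to be warranted, a minimal provisioning sketch looks something like the following, assuming the azure-mgmt-storage and azure-identity packages; the subscription ID, resource group, and account name are placeholders, and the relevant knob is is_hns_enabled:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

# Placeholders: subscription ID, resource group, and account name are assumptions.
client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.storage_accounts.begin_create(
    "rg-analytics",
    "analyticslake01",
    StorageAccountCreateParameters(
        location="eastus",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,  # hierarchical namespace on, i.e. ADLS Gen2
    ),
)
account = poller.result()
print(account.name, account.is_hns_enabled)
```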
Click through for more details, including several more tips about Azure Storage Accounts, Azure Blob Storage Containers, and the Azure Storage Blobs themselves.
A couple of people have asked me recently about how to ‘bone up’ on the new data lake service in Azure. The way I see it, there are two aspects: A, the technology itself and B, data lake principles and architectural best practices. Below are some links to resources that you should find helpful.
There’s a lot of good stuff there.
When to use Blob vs ADLS Gen2
New analytics projects should use ADLS Gen2, and current Blob storage should be converted to ADLS Gen2, unless these are non-analytical use cases that only need object storage rather than hierarchical storage (e.g., video, images, backup files). In that case you can use Blob Storage and save a bit of money on transaction costs: storage costs are the same between Blob and ADLS Gen2, but transaction costs are a bit higher for ADLS Gen2 due to the overhead of namespaces.
Looks like there are still some things missing from Gen2, so don’t automatically jump on an upgrade. Read the documentation first to make sure you aren’t relying on something which isn’t there yet.
Q: Partitioning by date is common. Where should the dates go in the folder hierarchy?
Almost always, you will want the dates to be at the end of the folder path. This is because we often need to set security at specific folder levels (such as by subject area), but we rarely set up security based on time elements.
Optimal for folder security: \SubjectArea\DataSource\YYYY\MM\DD\FileData_YYYY_MM_DD.csv
Tedious for folder security: \YYYY\MM\DD\SubjectArea\DataSource\FileData_YYYY_MM_DD.csv
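A small Python sketch of building the security-friendly layout, with the date elements at the end of the path (the subject area and data source values are just examples):

```python
from datetime import date

def lake_path(subject_area: str, data_source: str, d: date) -> str:
    """Build a date-partitioned path with the dates at the end of the hierarchy,
    so folder-level security can be granted at SubjectArea/DataSource."""
    return (
        f"{subject_area}/{data_source}/"
        f"{d:%Y}/{d:%m}/{d:%d}/FileData_{d:%Y_%m_%d}.csv"
    )

print(lake_path("Sales", "CRM", date(2024, 3, 15)))
# Sales/CRM/2024/03/15/FileData_2024_03_15.csv
```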
Click through for all of Melissa’s advice in FAQ form.
Databricks is a recent addition to Azure that is greatly influencing the technology choices people make when determining how to process data. Prior to the introduction of Databricks to Azure in March of 2018, if you had a lot of unstructured data stored in HDFS clusters and wanted to analyze it in a scalable fashion, the choice was Azure Data Lake, using U-SQL with Data Lake Analytics. With the introduction of Databricks, you now have a choice between Data Lake Analytics and Databricks for analyzing that data.
Click through for the comparison.
As we've discussed many times, the performance of the storage layer has an outsized impact on the total cost of ownership (TCO) of your complete analytics pipeline. This is because every percentage point improvement in storage performance results in the same percentage reduction in the requirement for the very expensive compute layer. Given that the disaggregated storage model allows us to scale compute and storage independently, that percentage reduction in compute requirement yields nearly the same reduction in TCO (compute typically accounts for 90 percent of the TCO).
So, when I say that ADLS Gen2 provides performance improvements of 10 to 50 percent over existing storage solutions, depending on the nature of the workload, this equates to a VERY significant reduction in the monthly analytics spend. It also has the added benefit of delivering your insights sooner!
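A rough back-of-the-envelope check on that arithmetic (the 30 percent storage gain below is just an assumed example from the quoted range; the 90 percent compute share comes from the paragraph above):

```python
# A storage-performance gain cuts the compute requirement by roughly the same
# fraction, and compute is taken to be ~90% of the pipeline's TCO.
storage_gain = 0.30          # assumed: a 30% faster storage layer
compute_share_of_tco = 0.90  # from the article: compute is ~90% of TCO

tco_reduction = storage_gain * compute_share_of_tco
print(f"Approximate TCO reduction: {tco_reduction:.0%}")  # roughly 27%
```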
Check out all of the changes.
I would not say it's commonplace to load structured data into the data lake, but I do see it frequently.
In most cases it is not necessary to first copy relational source data into the data lake and then into the data warehouse, especially when you keep in mind the effort to migrate existing ETL jobs that already copy source data into the data warehouse, but there are some good use cases for doing just that:
There are some good reasons in here, so check them out.
For certain types of data, writing it to the data lake frequently is the best choice. This is often true for low-latency IoT data, semi-structured data like logs, and varying structures such as social media data. However, the handling of structured data that originates from a relational database is much less clear.
Most data lake technologies store data as files (like csv, json, or parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database such as data types, constraints, foreign keys, etc. I tend to say that we “de-relationalize” data when we write it to a file in the data lake. If we’re going to turn right around and load that data to a relational database destination, is it the right call to write it out to a file in the data lake as an intermediary step?
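To illustrate that de-relationalization, here is a hedged sketch of a typical extract in Python with pandas; the connection string and table are placeholders, and writing to ADLS Gen2 rather than a local path would additionally need a filesystem package such as adlfs:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical source connection; the connection string and table are placeholders.
engine = create_engine(
    "mssql+pyodbc://source-sql/Sales?driver=ODBC+Driver+17+for+SQL+Server"
)

# Extract the relational table into a DataFrame...
orders = pd.read_sql("SELECT * FROM dbo.Orders", engine)

# ...and land it as a Parquet file (needs pyarrow or fastparquet installed).
# Column names and basic data types survive, but primary keys, foreign keys,
# constraints, and defaults do not: the data has been "de-relationalized"
# on its way into the lake.
orders.to_parquet("raw/Sales/Orders/Orders.parquet", index=False)
```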
Click through for considerations on both sides of the fence.
An online service provided by Microsoft as part of Power BI (software as a service, or SaaS).
In effect, dataflows are an online data collection and storage tool.
- Collection: It uses Power Query to connect to the data at the source and transform that data as needed.
- You will need to be able to access the data either through a cloud service (such as Dynamics 365) or on your PC/network via a gateway.
- You can also use Power Query to write queries from scratch, such as my Power BI calendar table.
- Storage: Dataflows then store that data in a table in the cloud so it can be used directly inside PowerBI.com, but more importantly (from my view) directly from Power BI Desktop.
Dataflows leverage the Power Query skills you have learnt (or are learning) with other tools (like Power BI Desktop and Power Query for Excel), allowing you to reuse those same skills in this online tool.
Tables that are created as a result of the dataflow are stored in an Azure Data Lake.
- If you don’t know what that is, don’t worry – I don’t understand it either. The point is it doesn’t matter because it is all done automatically for you by the tool.
Dataflows include the concept of the common data service (CDS) or common data model directly in the tool, and you don't have to know what it is, or even care.
If you don’t know what that is, don’t worry – it doesn’t matter now/yet.
This will become very important as it will make the process of getting data out of complex databases (such as MS Dynamics 365) much easier in the future.
Click through for more detail as well as some good uses for Dataflows.