Databricks is a recent addition to Azure that is greatly influencing the technology choices that people are making when determining how to process data. Prior to the introduction of Databricks to Azure in March of 2018, if you had a lot of unstructured data which was stored in HDFS clusters, and wanted to analyze it in a scalable fashion, the choice was Data Lake and using USQL with Data Lake Analytics. With the introduction of Databricks, there is now a choice for analysis between Data Lake Analytics and Databricks for analyzing data.
Click through for the comparison.
As we’ve discussed many times, the performance of the storage layer has an outsized impact on the total cost of ownership (TCO) for your complete analytics pipeline. This is due to the fact that every percentage point improvement in storage performance results in that same percentage reduction in the requirement for the very expensive compute layer. Given that the disaggregated storage model allows us to scale compute and storage independently, that percentage reduction in compute requirement results in almost the same (compute typically equates to 90 percent of the TCO) reduction in TCO.
So, when I say that ADLS Gen2 provides performance improvements ranging from 10-50 percent, depending on the nature of the workload over existing storage solutions, this equates to VERY significant reductions in the monthly analytics spend. It also has the added benefit of providing your insights sooner!
Check out all of the changes.
I would not say it’s common place to load structured data into the data lake, but I do see it frequently.
In most cases it is not necessary to first copy relational source data into the data lake and then into the data warehouse, especially when keeping in mind the effort to migrate existing ETL jobs that are already copying source data into the data warehouse, but there are some good uses cases to do just that:
There are some good reasons in here, so check them out.
For certain types of data, writing it to the data lake really is frequently the best choice. This is often true for low latency IoT data, semi-structured data like logs, and varying structures such as social media data. However, the handling of structured data which originates from a relational database is much less clear.
Most data lake technologies store data as files (like csv, json, or parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database such as data types, constraints, foreign keys, etc. I tend to say that we “de-relationalize” data when we write it to a file in the data lake. If we’re going to turn right around and load that data to a relational database destination, is it the right call to write it out to a file in the data lake as an intermediary step?
Click through for considerations on both sides of the fence.
An online service provided by Microsoft as part of Power BI (software as a service, or SaaS).
In effect dataflows are an online data collection and storage tool.
- Collection: It uses Power Query to connect to the data at the source and transform that data as needed.
- You will need to be able to access the data either through a cloud service (such as Dynamics 365) or to your PC/Network via a gateway.
- You can also use Power Query to write queries from scratch, such as my Power BI calendar table.
- Storage: Dataflows then stores that data in a table in the cloud so it can be used directly inside PowerBI.com, but more importantly (from my view) directly from Power BI Desktop.
Dataflows leverage the Power Query skills you have learnt (or are learning) using other tools (like Power BI Desktop, Power Query for Excel) allowing you to reuse those same skills in this online tool.
Tables that are created as a result of the dataflow are stored in an Azure Data Lake.
- If you don’t know what that is, don’t worry – I don’t understand it either. The point is it doesn’t matter because it is all done automatically for you by the tool.
Dataflows include the concept of the common data service (CDS) or common data model directly in the tool and you don’t have to know what it is, nor care.
If you don’t know what that is, don’t worry – it doesn’t matter now/yet.
This will become very important in the future as it will make the process of getting data out of complex databases (such as MS Dynamics 365) much easier in the future.
Click through for more detail as well as some good uses for Dataflows.
To understand how data is consumed, we need to figure out answers to some basic questions like:
- Which datasets (tables/views/DBs) are accessed frequently?
- When are the queries run most frequently?
- Which users or applications are heavily utilizing the resources?
- What type of queries are running frequently?
The most accessed object can easily benefit from optimization like compression, columnar file format, or data decomposition. A separate queue can be assigned to heavy-resource-utilizing apps or users to balance the load on a cluster. Cluster resources can be scaled up during the timeframe when a large number of queries are mostly run to meet SLAs and scaled down during low usage tide to save cost.
Hive Hooks are convenient ways to answer some of the above questions and more!
Read on to learn how.
SQL Server 2019 big data clusters provide a complete AI platform. Data can be easily ingested via Spark Streaming or traditional SQL inserts and stored in HDFS, relational tables, graph, or JSON/XML. Data can be prepared by using either Spark jobs or Transact-SQL (T-SQL) queries and fed into machine learning model training routines in either Spark or the SQL Server master instance using a variety of programming languages, including Java, Python, R, and Scala. The resulting models can then be operationalized in batch scoring jobs in Spark, in T-SQL stored procedures for real-time scoring, or encapsulated in REST API containers hosted in the big data cluster.
SQL Server big data clusters provide all the tools and systems to ingest, store, and prepare data for analysis as well as to train the machine learning models, store the models, and operationalize them.
Data can be ingested using Spark Streaming, by inserting data directly to HDFS through the HDFS API, or by inserting data into SQL Server through standard T-SQL insert queries. The data can be stored in files in HDFS, or partitioned and stored in data pools, or stored in the SQL Server master instance in tables, graph, or JSON/XML. Either T-SQL or Spark can be used to prepare data by running batch jobs to transform the data, aggregate it, or perform other data wrangling tasks.
Data scientists can choose either to use SQL Server Machine Learning Services in the master instance to run R, Python, or Java model training scripts or to use Spark. In either case, the full library of open-source machine learning libraries, such as TensorFlow or Caffe, can be used to train models.
Lastly, once the models are trained, they can be operationalized in the SQL Server master instance using real-time, native scoring via the PREDICT function in a stored procedure in the SQL Server master instance; or you can use batch scoring over the data in HDFS with Spark. Alternatively, using tools provided with the big data cluster, data engineers can easily wrap the model in a REST API and provision the API + model as a container on the big data cluster as a scoring microservice for easy integration into any application.
I’ve wanted Spark integration ever since 2016 and we’re going to get it.
The power of Azure Data Lake is that you can use a variety of different file types to process data (from Azure Data Lake Analytics). But in order to use JSON, you need to register some assemblies first.
The assemblies are available on Github for download. Unfortunately you need to download the solution, and compile it on your machine. So I’ve also made the 2 DLL’s you need available via direct download:
Click through for links to the assemblies and instructions on how to register them. And to continue my long-running joke that every .NET project has as a core requirement Newtonsoft.Json.
Migrating data from one Data Lake to the other
We started out with a test version of a Data Lake, and this week I needed to migrate data to the production version of our Data Lake. After a lot of trial and error I couldn’t find a good way to migrate data. In the end I found a tool called AdlCopy. This is a command-line tool that copies files for you. Let me show you how easy it is.
Download & Install
AdlCopy needs to be installed on your machine. You can find the download here. By default the tool will install the files in “C:\Users\\Documents\AdlCopy\”, but this can be changed in the setup wizard.
Click through to see how to use this tool.
In short, Dataflows integrates data lake and ETL technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of Power BI service and not just Power BI Desktop and is called Power Query online) can create, customize and manage data within their Power BI experience (think of it as self-service data prep). Dataflows include a standard schema, called the Common Data Model (CDM), that contains the most common business entities across the major functions such as marketing, sales, service, finance, along with connectors that ingest data from the most common sources into these schemas. This greatly simplifies modeling and integration challenges (it prevents multiple metadata/definition on the same data). You can also extend the CDM by creating custom entities. Lastly – Microsoft and their partners will be shipping out-of-the-box applications that run on Power BI that populate data in the Common Data Model and deliver insights through Power BI.
A dataflow is not just the data itself, but also logic on how the data is manipulated. Dataflows belong to the Data Warehouse/Mart/Lake family. Its main job is to aggregate, cleanse, transform, integrate and harmonize data from a large and growing set of supported on-premises and cloud-based data sources including Dynamics 365, Salesforce, Azure SQL Database, Excel, SharePoint. Dataflows hold a collection of data-lake stored entities (i.e. tables) which are stored in internal Power BI Common Data Model compliant folders in Azure Data Lake Storage Gen2.
Also check out the comments for some clarification on why you’d want to use Dataflows rather than doing the work directly in the data lake.