Most common patterns using Azure Data Lake Store (ADLS) involve customers ingesting and storing raw data in ADLS. This data is then cooked and prepared by analytic workloads like Azure Data Lake Analytics and HDInsight. Once cooked, this data is then explored using engines like Azure SQL Data Warehouse. One key pain point for customers has been having to wait a substantial time after the data was cooked before being able to explore it and gather insights, because the data stored in ADLS had to be loaded into SQL Data Warehouse using tools that perform row-by-row insertion. But now, you don’t have to wait that long anymore. With the new SQL Data Warehouse PolyBase support for ADLS, you can load and access the cooked data rapidly, reducing the time it takes to start performing interactive analytics. PolyBase support allows you to access unstructured/semi-structured files in ADLS faster because of a highly scalable loading design. You can load the files stored in ADLS into SQL Data Warehouse to perform analytics with fast response times, or you can use the files in ADLS as external tables. So get ready to unlock the value in your petabytes of data stored in ADLS.
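To give a feel for what this looks like in practice, here is a minimal T-SQL sketch of the PolyBase pattern the announcement describes; the credential values, data source name, file layout, and table names are all illustrative placeholders, not part of the original post.

```sql
-- All names, paths, and credential values below are illustrative placeholders.
CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH IDENTITY = '<client_id>@<OAuth_2.0_token_endpoint>',
     SECRET = '<key>';

CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://<datalakestore>.azuredatalakestore.net',
    CREDENTIAL = ADLSCredential
);

CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- Query the cooked files in place as an external table...
CREATE EXTERNAL TABLE dbo.CookedDataExternal (
    CustomerId INT,
    SaleAmount DECIMAL(18, 2),
    SaleDate   DATE
)
WITH (
    LOCATION = '/cooked/sales/',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = TextFileFormat
);

-- ...or load them into SQL Data Warehouse via CTAS for fast local queries.
CREATE TABLE dbo.CookedData
WITH (DISTRIBUTION = HASH(CustomerId))
AS SELECT * FROM dbo.CookedDataExternal;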
I’ve been waiting for this support, and I’m happy that they were able to integrate the two products.
NOTE: The most important habit you can start with in Azure is putting everything into discrete, planned, Resource Groups. These make management so much easier.
Once the account is set up, the first thing you need to do is create a Runbook. There is a collection of them available for your use within Azure, but none of them is immediately applicable to what I need, so I’m just writing a really simple PowerShell script to do what I want.
Runbooks are an important part of Azure maintenance, and this is a gentle introduction to them.
Logical Data Warehouse
A logical data warehouse (LDW) builds upon the traditional DW by providing unified data access to multiple platforms. Conceptually, the logical data warehouse is a view layer that abstractly accesses distributed systems such as relational DBs, NoSQL DBs, data lakes, in-memory data structures, and so forth, consolidating and relating the data in a virtual layer. This availability of data on various platforms adds flexibility to a traditional DW and speeds up data availability. The tradeoff for this flexibility can be slower performance for user queries, though the full-fledged LDW vendors employ an array of optimization techniques to mitigate performance issues. A logical data warehouse is broader than just data virtualization and distributed processing, which can be thought of as enabling technologies. According to Gartner, a full-fledged LDW system also involves metadata management, repository management, taxonomy/ontology resolution, auditing & performance services, as well as service level agreement management.
If you’re just getting started with the topic, check this out, as it will probably clear up several concepts.
You can use this one-month free trial to do POCs and try out SQL DW with up to 200 DWU and 2 TB of data. You must sign up by December 31st, 2016. Please note that once the one-month free trial is over, you will start getting billed at general availability pricing rates. For more information on the free trial, and to sign up, go here.
This is great because you can quickly run out of credits otherwise.
For an introduction to this interesting Hadoop project, check out this article. Apache Kylin, originally from eBay, is a distributed analytics engine that provides SQL and OLAP access to Hadoop datasets, utilizing Hive and HBase. It can also be called through SparkSQL, making for a very useful project. This project lets you work with Power BI, Tableau, and Excel, with more tool support coming soon. You can build MOLAP cubes and support many users with fast queries over billions of rows. Apache Kylin provides JDBC and ODBC drivers.
There are a few interesting options here.
For this analysis, we used TPC-DS on a 3 TB dataset and selected 70 of the 99 queries that run without any modifications or use variants on both Redshift and Impala. We wanted to use a larger dataset (similar to what we’ve used in previous benchmarks), but due to Redshift’s data load times, we had to reduce the data size. (Note: This benchmark is derived from the TPC-DS benchmark and, as such, is not directly comparable to published TPC-DS results.)
This is coming from one of the two vendors, so take it with however many grains of salt you’d like.
Case: we’ve integrated two sources of customers. We want to add a third source.
Q: How do we at the same time know that our current integration and solutions will continue to work while at the same time integrating the new sources?
A: Test it.
Q: How do we get faster deployments and more stability?
A: Automate the tests, so they can run continuously.
This is an interesting concept; do read the whole thing.
The information shown here is the DSQL (Distributed SQL) plan. When you send a SQL query to SQL Data Warehouse, the Control node processes the query and converts the code to DSQL, then sends the commands to run on each of the compute nodes.
The returned query plan depicts sequential SQL statements; when the query runs, it may involve parallelized operations, so some of the sequential statements shown may run at the same time. More information can be found at https://msdn.microsoft.com/en-us/library/mt631615.aspx.
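As a quick illustration of how you get at this plan yourself: in SQL Data Warehouse you can prepend EXPLAIN to a statement to return its DSQL plan (as XML) instead of executing it. The table and columns below are illustrative placeholders.

```sql
-- Prepend EXPLAIN to return the distributed SQL (DSQL) plan as XML
-- rather than executing the query. Table name is illustrative.
EXPLAIN
SELECT CustomerId, SUM(SaleAmount) AS TotalSales
FROM dbo.Sales
GROUP BY CustomerId;
```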
Arun also looks at running a simple Power BI report off of Azure SQL Data Warehouse; click through for that.
Unlike the sys.dm_exec_requests view in SQL Server, the sys.dm_pdw_exec_requests view keeps up to 10,000 records with the information for a request even after it has finished executing. This capability is very useful, as you can track specific query executions as long as their records are still among the 10,000 kept by the view. As time passes, the oldest records are phased out in favor of more recent ones.
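A simple query against the view shows the retained history, completed requests included; this is a sketch, and the column selection is just one reasonable choice.

```sql
-- Most recent requests first, including ones that have already completed.
SELECT TOP 50
    request_id,
    [status],
    submit_time,
    end_time,
    total_elapsed_time,
    command
FROM sys.dm_pdw_exec_requests
ORDER BY submit_time DESC;
```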
This is an interesting look at some of the differences between Azure SQL Data Warehouse and a “normal” SQL Server installation. Good reading.
The fiscal calendar assumes calendar months, with the fiscal calendar shifted by some number of months. It also assumes that the fiscal year is ahead of the calendar year; that is, fiscal year 2017 starts on the first day of some month in 2016. In my example script, the fiscal year starts in October. If you have a 4-4-5, 4-5-4, or other such calendar, the fiscal calendar calculations in this script won’t help you.
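The shift logic can be sketched in a few lines of T-SQL: slide each date forward by enough months that the fiscal year, quarter, and month fall out of the standard date functions. This is not the author's script, just a minimal illustration of the October-start assumption; the column name and sample dates are made up.

```sql
-- With a fiscal year beginning in October (month 10), adding
-- 13 - 10 = 3 months makes standard date functions return fiscal values:
-- 2016-10-01 lands in fiscal year 2017, fiscal quarter 1, fiscal month 1.
DECLARE @FiscalStartMonth INT = 10;

SELECT
    d.DateValue,
    YEAR(DATEADD(MONTH, 13 - @FiscalStartMonth, d.DateValue))              AS FiscalYear,
    DATEPART(QUARTER, DATEADD(MONTH, 13 - @FiscalStartMonth, d.DateValue)) AS FiscalQuarter,
    MONTH(DATEADD(MONTH, 13 - @FiscalStartMonth, d.DateValue))             AS FiscalMonth
FROM (VALUES (CAST('2016-10-01' AS DATE)),
             (CAST('2017-01-15' AS DATE))) AS d(DateValue);
```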
To add this date dimension to your SSAS Tabular project:
Right click on the Model.bim file in the Solution Explorer and choose View Code.
If you have at least one table in the model already, locate the end of the definition of the previous table, inside the table array. Add a comma after the end curly brace for the previous table.
Paste the JSON/TMSL from my Gist.
Save and close the file.
Right click on Model.bim and choose View Designer.
Click through for the script and additional explanation.