For the purposes of this post the TSQL shown is elementary (don’t be surprised by that), the point is really about SHUFFLE. So, I select the estimated plan for the following code.SELECT SOD.[SalesOrderID],SOD.[ProductID], SOH.[TotalDue] FROM [SalesLT].[SalesOrderDetail] SOD JOIN [SalesLT].[SalesOrderHeader] SOH ON SOH.[SalesOrderID] = SOD.[SalesOrderID] WHERE SOH.[TotalDue] > 1000
Shuffle me once, why not shuffle me twice. If you REALLY want to see the EXPLAIN command output, then it looks like this snippet below.
The DSQL operation clearly states SHUFFLE_MOVE. Why am I getting this? What does it mean?
Shuffling data isn’t the worst thing in the world, but it is a fairly expensive operation all things considered. Ideally, your warehouse architecture limits the number of shuffle operations, but considering that you can only hash on one key, sometimes it’s inevitable.
Recently we had a requirement to perform SQL Spatial functions on data that was stored in Azure SQL DW. Seems simple enough as spatial has been in SQL for many years, but unfortunately, SQL Spatial functions are not natively supported in Azure SQL DW (yet)!
If interested – this is the link to the Azure Feedback feature request to make this available in Azure SQL DW – https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/10508991-support-for-spatial-data-type
AND SO — to use spatial data in Azure SQL DW we need to look at alternative methods. Luckily a recent new feature in Azure SQL DB in the form of Elastic Query to Azure SQL DW now gives us the ability to perform these SQL Spatial functions on data within Azure SQL DW via a very simple method!
Check out that Azure Feedback item if you’d like to see native spatial support rather than using elastic query. In the meantime, click through to see Rolf’s workaround.
One difference between regular tables and temporary tables is how they are typically used. Temporary tables are session scoped which means that adding them into a process or report will probably cause them to be created multiple times. Temporary tables might be very similar to regular tables but most regular tables are not re-written into, every time they are queried.
The disk writes involved in populating the temporary table might be more expensive than the reads would be if you were to modify your query to include the logic into one, larger, query. The frequency of the report or process will be a factor into how much of a performance hit you get by using the temporary tables. If you are using temporary tables to make debugging a procedure easier or to enhance readability, make sure you understand the IO cost of performing writes and then reading that data back into a subsequent query.
Read on for more.
Data virtualization integrates data from disparate sources, locations and formats, without replicating or moving the data, to create a single “virtual” data layer that delivers unified data services to support multiple applications and users.
Data movement is the process of extracting data from source systems and bringing it into the data warehouse and is commonly called ETL, which stands for extraction, transformation, and loading.
If you are building a data warehouse, should you move all the source data into the data warehouse, or should you create a virtualization layer on top of the source data and keep it where it is?
Read on for James’s thoughts.
Given the data processing needs of a data warehouse, they tend to be implemented on massively parallel processing (MPP) systems. The MPP architecture replies upon a shared nothing concept for distributing data across various slices. Compute nodes are layered on top of the storage and processes queries for data residing in its local slice. The control node is responsible for taking a query and dividing it up into smaller queries to be run in parallel on the compute nodes.
Read the whole thing.
Azure SQL Data Warehouse is a massively parallel processing (MPP) architecture designed for large-scale data warehouses. An MPP system creates logical / physical slices of the data. In SQL Data Warehouse’s case, the data has 60 logical slices, at all performance tiers. This means that a single table can have up to 60 different object_ids. This is why, in SQL Data Warehouse, there is the concept of physical and logical object_ids along with physical names.
Below is a query for finding row counts of tables in SQL Data Warehouse which accounts for the differences in architecture between my earlier script, written for SQL Server, and SQL Data Warehouse.
Click through for the script.
I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution? The short answer is you should use both. The rest of this post will dig into the reasons why.
I touched on this ultimate question in a blog that is now over a few years old at Hadoop and Data Warehouses so this is a good time to provide an update. I also touched on this topic in my blogs Use cases of various products for a big data cloud solution, Data lake details, Why use a data lake?and What is a data lake? and my presentation
Read on for James’s argument, which is good. My argument is summed up as follows: the purpose of a data warehouse is to solve known business problems—that is, to help build reports that people on the business side need based on established requirements. The purpose of a data lake is to hold all kinds of data and curate it for when people come looking for something they didn’t know they needed.
From a certain viewpoint, one could state that Data Virtualization is focused on the way the world should work: when integrating data, one shouldn’t have to store it everywhere. Why not let the system decide when to store? For some, to adopt this view might mean a paradigm shift: suddenly, the Data Warehouse the go-to integration point any more!
From this viewpoint, DWA a tool “from the trenches”: after years of struggle and hard work to build our warehouses, we’ve developed some smart ways to automate our warehouse-building based on abstract models.
Worth reading the whole thing.
Q4: How large is your database?
It is difficult to pinpoint an exact number for the absolute minimum size recommended for Azure SQL DW. Many data professionals in the industry see the minimum “practical” data size for Azure SQL DW in the 1-4TB range. Microsoft documentation has recently stated as low as 250GB for a minimum size. Since Azure SQL DW is an MPP (massively parallel processing) system, you experience a significant performance penalty with small data sizes because of the overhead incurred to distribute and consolidate across the nodes (which are distributions in a “shared-nothing” architecture). We recommend Azure SQL DW for a data warehouse which is starting to approach 1TB and expected to continue growing.
Great advice here. I’ve heard too often of people looking at the name “Azure SQL Data Warehouse” and figuring that because they have data warehouses on-prem, this is the appropriate analog. Azure SQL DW is not a typical data warehousing environment; it’s more of a specialized tool than that, so click through to see if it fits your needs.
Turns out there are three basic zones or areas to a data lake. Raw, Managed, and Presentation.
The raw zone should be optimized for fast storage. The goal is to get the data in as quickly as possible. Don’t make any changes to this data. You want it stored as close to the original format as possible. It sounds just like staged data to me. Data you’d build an extract package to get from source to your staging environment, right?
Maybe you’re thinking this is just a coincidence…let’s move on.
Spoilers: it’s not a coincidence.