The concept of workload management is a key factor for Azure SQL DW as there is only limited concurrency slots available and depending on the resource class, these slots can fill up pretty quickly. Once the concurrency slots are full, queries are queued until a sufficiently sized slot is opened up. Let’s recap what Resource Classes are and how they affect workload management.
A Resource Class is a pre-configured database role that determines how much resource is allocated to queries coming from users that belong to that role. For example, an ETL service account may use a “large” resource class and be allocated a generous amount of the server, however an analyst may use a “small” resource class and therefore only use up a small amount of the server with their queries. There are actually 2 types of resource class, Dynamic and Static. The Dynamic resource classes will grant a set percentage of memory to a query and actual value of this percentage will vary as the Warehouse scales up and down. The key factor is that an xLargeRc (extra-large resource class) will always take up 70% of the Server and will not allow any other queries to be run concurrently. No matter how much you scale up the Warehouse, queries run with an xLargeRc will run one at a time. Conversely, queries run with a smallrc will only be allocated 4% of the Server and therefore as a Warehouse scales up, this 4% becomes a larger amount of resource and can therefore process data quicker.
This looks like a useful addition. Click through for a few examples of how it will work.
Let’s start with the positive.
Snowflake is a really scalable database. Storage is virtually limitless, since the data is stored on blob storage (S3 on AWS and Blob Storage on Azure). The compute layer (called warehouses) is completely separated from the storage layer and you can scale it independently from storage.
It is really easy to use. This is one of Snowflake’s core goals: make it easy to use for everyone. Most of the technical aspects (clustering, storage etc) are hidden from the user. If you thought SQL Server is easy with it’s “next-next-finish” installation, you’ll be blown away by Snowflake. I really like this aspect, since you have really powerful data warehousing at your finger tips, and the only thing you have to worry about is how to get your data into it. With Azure SQL DW for example, you have to about distribution of the data, how you are going to set things up etc. Not here.
It’s not all positive, but Koen seems quite happy to work with the product.
There are numerous improvements that went into HDP 3.0 and the performance improvements shown are an aggregate result of all of them. Here are some of the more noteworthy improvements related to Druid-Hive integration :
Druid Expressions Support – HIVE-18893/ CALCITE-2170 added support for Druid expressions in Hive. In HDP 3.0, Hive can push the computation of SQL expressions as part of a Druid query and they can be evaluated by Druid.
Use of Scan Query instead of Select Query – In HDP 3.0 we use Druid Scan query instead of Select Query. Scan Query is a streaming version of Select Query which returns the results in a compact streaming format. Scan query also does not need all the results to be retained in memory before they can be returned to Hive. This improves the memory usage of the historical nodes too.
GroupBy Query Improvements – Many optimizations are done in order to address the performance of GroupBy queries on Druid side. Main ones are –
Better column pruning – In some cases when hive cannot push any operator to druid, hive ended up pulling all the columns from druid. This led to lots of unnecessary data transfer between druid and hive. HIVE-15619 improved the column pruner logic to only fetch the columns from druid which are required to answer a query.
Druid Version upgrade from 0.10.0 to 0.12.2 – HDP 3.0 comes with latest version of Druid i.e 0.12.2 which has many new features, performance enhancements and bug-fixes over the previous version.
Druid is still a specialty technology which doesn’t fit every use case, but if it does fit your use, you’ll get a lot of performance benefit out of it.
The Need to Vacuum
As Redshift does not reclaim free space automatically, updates and delete operations can frequently lead to table growth. Equally, it’s important as new entries are added, that the data is maintained in a sorted sequence.
The VACUUM command is used to re-sequence data, and reclaim disk space as a result of DELETE and UPDATE operations. Although it won’t block other processes, it can be a resource-intensive operation, especially for data stored using interleaved sort keys.
It should be run periodically to ensure consistent performance and to reduce disk usage.
Some of this is good Postgres advice; some of it is good MPP advice (and serves well, for example, when dealing with Azure SQL Data Warehouse); the rest is Redshift-specific.
Using a query label in Azure SQL DW (Data Warehouse) can be a really handy technique to track queries via DMVs. You might want to do this to see what problematic queries are doing under the covers.
Let’s check out an example. First I will show you how things would look without using a query label. I connect to SQL DW and issue the following basic example query.
It’s an interesting approach and solves a problem I saw in Polybase around figuring out which session details were yours after the fact.
The question is how are backups done with Azure SQL DW?
It is very different from Azure SQL DB (which you would expect). Azure SQL DW has a totally different architecture to its classic database counter-part. Restore points are the key here. Automatic ones are taken throughout the day and are kept for seven days only. Worst case scenario is the time between the restore points will be eight hours hence giving an eight hour RPO (Recovery Point Objective).
You can also create manual restore points, as Arun shows.
I wrote the updated content from a practical point of view, totally hype-free. The table of contents:
- Modern Data Architecture
- Business Needs Driving Data Architectures to Evolve and Adapt
- Principles of a Modern Data Architecture
- Data Lake + Data Warehouse: Complementary Solutions
- Tips for Designing a Data Lake
- Azure Technologies for Implementing a Data Lake
- Considerations for a Successful Data Lake in the Cloud
- Getting Started with a Data Lake
To download the ebook, BlueGranite will ask for you to register your information. That’s common for premium content like this. We take a low-key approach to sales, so I can assure you that registration only means you’ll receive notifications of new content that you may find interesting.
It’s the length of a good-sized paper, so you won’t have to invest dozens of hours of time to get the story.
Access to the table columns can be controlled based on the user’s execution context or their group membership with the standard GRANT T-SQL statement. To secure your data, you simply define a security policy via the GRANT statement to your table columns. For example, if you would like to limit access to PII data in your customers table, you can simply GRANT SELECT permissions on specific columns to the ContractEmp role:GRANT SELECT ON dbo.Customers (CustomerId, FirstName, LastName) TO ContractEmp;
This capability is available now in all Azure regions with no additional charge.
This has been in regular SQL Server for a long time, so it’s good to see it make its way into Azure SQL Data Warehouse, and in a manner which doesn’t involve creating user-defined functions for predicates like Row-Level Security.
Auditing is disabled by default and the UI experience depends on the region to which the logical server is deployed. For instance, in UK South, the portal offers no options to manage auditing:
In North Europe, the portal allows Table Auditing (table-storage based) to be enabled on the SQL Data Warehouse scope, but it isn’t possible to enable Blob Auditing:
On top of that, Blob Auditing behaves differently when enabled on a logical server level in different regions. In locations that support Table Auditing, turning on Blob Auditing automatically enables it in all databases, including SQL Data Warehouses—and that’s expected. In other regions, Blob Auditing is not automatically enabled and has to be turned on programmatically by calling ARM REST API.
I imagine the plan is to support this across the board but it’s rolling out region by region.
Previously, SQL DW supported only automated snapshots guaranteeing an eight-hour recovery point objective (RPO). While this snapshot policy provided high levels of protection, customers asked for more control over restore points to enable more efficient data warehouse management capabilities leading to quicker times of recovery in the event of any workload interruptions or user errors.
Now, with user-defined restore points, in addition to the automated snapshots, you can initiate snapshots before and after significant operations on your data warehouse. With more granular restore points, you ensure that each restore point is logically consistent and limit the impact and reduce recovery time of restoring the data warehouse should this be needed. User-defined restore points can also be labeled so they are easy to identify afterwards.
Creating a user-defined restore point is a one-liner in Powershell, and it’s something you could do after each warehouse load, for example.