Access to table columns can be controlled based on the user's execution context or their group membership with the standard GRANT T-SQL statement. To secure your data, you simply grant permissions on the appropriate table columns. For example, if you would like to limit access to PII data in your Customers table, you can GRANT SELECT permissions on specific columns to the ContractEmp role:

GRANT SELECT ON dbo.Customers (CustomerId, FirstName, LastName) TO ContractEmp;
This capability is available now in all Azure regions with no additional charge.
This has been in regular SQL Server for a long time, so it’s good to see it make its way into Azure SQL Data Warehouse, and in a manner which doesn’t involve creating user-defined functions for predicates like Row-Level Security.
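To see the effect of a grant like that, here is a minimal sketch (the test user is hypothetical, and the EXECUTE AS impersonation step is the SQL Server pattern; confirm it is supported in your SQL DW environment before relying on it):

-- Hypothetical contractor user added to the ContractEmp role
CREATE USER ContractorUser WITHOUT LOGIN;
ALTER ROLE ContractEmp ADD MEMBER ContractorUser;

-- Impersonate the contractor to test the column-level grant
EXECUTE AS USER = 'ContractorUser';

-- Succeeds: only granted columns are referenced
SELECT CustomerId, FirstName, LastName FROM dbo.Customers;

-- Fails with a permission error, assuming EmailAddress was never granted
SELECT CustomerId, EmailAddress FROM dbo.Customers;

REVERT;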
Auditing is disabled by default, and the UI experience depends on the region in which the logical server is deployed. For instance, in UK South, the portal offers no options to manage auditing.
In North Europe, the portal allows Table Auditing (table-storage based) to be enabled at the SQL Data Warehouse scope, but it isn't possible to enable Blob Auditing.
On top of that, Blob Auditing behaves differently when enabled at the logical server level in different regions. In locations that support Table Auditing, turning on Blob Auditing automatically enables it in all databases, including SQL Data Warehouses, which is the expected behavior. In other regions, Blob Auditing is not automatically enabled and has to be turned on programmatically by calling the ARM REST API.
I imagine the plan is to support this across the board, but it's rolling out region by region.
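For reference, enabling Blob Auditing on a single database through ARM looks roughly like the sketch below (all resource names are placeholders; confirm the current api-version against the ARM documentation):

PUT https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroup}/providers/Microsoft.Sql/servers/{serverName}/databases/{databaseName}/auditingSettings/default?api-version=2017-03-08-preview

{
  "properties": {
    "state": "Enabled",
    "storageEndpoint": "https://{storageAccount}.blob.core.windows.net",
    "storageAccountAccessKey": "{storageAccountKey}"
  }
}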
Previously, SQL DW supported only automated snapshots guaranteeing an eight-hour recovery point objective (RPO). While this snapshot policy provided high levels of protection, customers asked for more control over restore points to enable more efficient data warehouse management and quicker recovery in the event of workload interruptions or user errors.
Now, with user-defined restore points in addition to the automated snapshots, you can initiate snapshots before and after significant operations on your data warehouse. With more granular restore points, you can ensure that each restore point is logically consistent, limiting the impact and reducing the recovery time of restoring the data warehouse should that be needed. User-defined restore points can also be labeled so they are easy to identify afterwards.
Creating a user-defined restore point is a one-liner in PowerShell, and it's something you could do after each warehouse load, for example.
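As a sketch, assuming the AzureRM PowerShell module and hypothetical resource names, that one-liner looks something like:

# Take a labeled restore point right after a load finishes
New-AzureRmSqlDatabaseRestorePoint -ResourceGroupName "rg-dw" -ServerName "dwserver" -DatabaseName "mydw" -RestorePointLabel "PostNightlyLoad"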
For the purposes of this post, the T-SQL shown is elementary (don't be surprised by that); the point is really about SHUFFLE. So, I select the estimated plan for the following code.

SELECT SOD.[SalesOrderID], SOD.[ProductID], SOH.[TotalDue]
FROM [SalesLT].[SalesOrderDetail] SOD
JOIN [SalesLT].[SalesOrderHeader] SOH
    ON SOH.[SalesOrderID] = SOD.[SalesOrderID]
WHERE SOH.[TotalDue] > 1000;
Shuffle me once, why not shuffle me twice. If you REALLY want to see the EXPLAIN command output, then it looks like this snippet below.
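Prefixing the same SELECT with EXPLAIN produces XML along these lines (abridged and illustrative; the costs and temp-table names are placeholders, not actual output):

<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="1" number_distributions="60">
  <sql>SELECT SOD.[SalesOrderID], SOD.[ProductID], SOH.[TotalDue] ...</sql>
  <dsql_operations total_cost="..." total_number_operations="...">
    <dsql_operation operation_type="SHUFFLE_MOVE">
      <source_statement>SELECT ... FROM [SalesLT].[SalesOrderDetail] ...</source_statement>
      <destination_table>[TEMP_ID_1]</destination_table>
      <shuffle_columns>SalesOrderID;</shuffle_columns>
    </dsql_operation>
    <!-- a second SHUFFLE_MOVE and a RETURN operation follow -->
  </dsql_operations>
</dsql_query>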
The DSQL operation clearly states SHUFFLE_MOVE. Why am I getting this? What does it mean?
Shuffling data isn't the worst thing in the world, but it is a fairly expensive operation, all things considered. Ideally, your warehouse architecture limits the number of shuffle operations, but because you can hash-distribute a table on only one key, sometimes a shuffle is inevitable.
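One mitigation, sketched here with the tables from the query above (the _Hash table names are hypothetical): hash-distribute both tables on the join key, so matching rows land on the same distribution and the join requires no data movement.

-- Co-locate both tables on the join key so the join is distribution-local
CREATE TABLE [SalesLT].[SalesOrderHeader_Hash]
WITH (DISTRIBUTION = HASH([SalesOrderID]), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM [SalesLT].[SalesOrderHeader];

CREATE TABLE [SalesLT].[SalesOrderDetail_Hash]
WITH (DISTRIBUTION = HASH([SalesOrderID]), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM [SalesLT].[SalesOrderDetail];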
Recently we had a requirement to perform SQL Spatial functions on data that was stored in Azure SQL DW. Seems simple enough as spatial has been in SQL for many years, but unfortunately, SQL Spatial functions are not natively supported in Azure SQL DW (yet)!
If interested, this is the link to the Azure Feedback feature request to make this available in Azure SQL DW: https://feedback.azure.com/forums/307516-sql-data-warehouse/suggestions/10508991-support-for-spatial-data-type
AND SO, to use spatial data in Azure SQL DW, we need to look at alternative methods. Luckily, a recent feature in Azure SQL DB, Elastic Query to Azure SQL DW, now gives us the ability to perform these SQL Spatial functions on data within Azure SQL DW via a very simple method!
Check out that Azure Feedback item if you’d like to see native spatial support rather than using elastic query. In the meantime, click through to see Rolf’s workaround.
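The general pattern is worth sketching (this is not Rolf's exact workaround; the server, table, and credential names are all hypothetical): create an external table in an Azure SQL Database that points at the warehouse, then run the spatial functions on the SQL DB side.

-- Run in the Azure SQL Database, not the data warehouse
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL DwCredential
WITH IDENTITY = '<sql login>', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE DwSource
WITH (TYPE = RDBMS,
      LOCATION = 'myserver.database.windows.net',
      DATABASE_NAME = 'mydw',
      CREDENTIAL = DwCredential);

-- Mirrors the DW table's schema; no data is copied
CREATE EXTERNAL TABLE dbo.Sites
([SiteId] INT, [Latitude] FLOAT, [Longitude] FLOAT)
WITH (DATA_SOURCE = DwSource);

-- Spatial functions now run in SQL DB against the remote DW data
SELECT [SiteId],
       geography::Point([Latitude], [Longitude], 4326).STAsText() AS [Location]
FROM dbo.Sites;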
One difference between regular tables and temporary tables is how they are typically used. Temporary tables are session-scoped, which means that adding them to a process or report will probably cause them to be created multiple times. Temporary tables might be very similar to regular tables, but most regular tables are not rewritten every time they are queried.
The disk writes involved in populating the temporary table might be more expensive than the reads would be if you modified your query to combine the logic into one larger query. The frequency of the report or process will factor into how much of a performance hit you take by using temporary tables. If you are using temporary tables to make debugging a procedure easier or to enhance readability, make sure you understand the I/O cost of performing the writes and then reading that data back in a subsequent query.
Read on for more.
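To make the trade-off concrete, here is a minimal sketch with hypothetical tables: the staged version writes intermediate rows to tempdb and reads them back, while the single-statement version expresses the same logic without the round trip.

-- Staged approach: materializes intermediate rows in tempdb (extra writes, then reads)
SELECT CustomerId, SUM(Amount) AS TotalAmount
INTO #CustomerTotals
FROM dbo.Orders
GROUP BY CustomerId;

SELECT c.CustomerName, t.TotalAmount
FROM #CustomerTotals t
JOIN dbo.Customers c ON c.CustomerId = t.CustomerId;

DROP TABLE #CustomerTotals;

-- Single-statement alternative: same logic, no tempdb round trip
WITH CustomerTotals AS
(
    SELECT CustomerId, SUM(Amount) AS TotalAmount
    FROM dbo.Orders
    GROUP BY CustomerId
)
SELECT c.CustomerName, t.TotalAmount
FROM CustomerTotals t
JOIN dbo.Customers c ON c.CustomerId = t.CustomerId;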
Data virtualization integrates data from disparate sources, locations and formats, without replicating or moving the data, to create a single “virtual” data layer that delivers unified data services to support multiple applications and users.
Data movement is the process of extracting data from source systems and bringing it into the data warehouse and is commonly called ETL, which stands for extraction, transformation, and loading.
If you are building a data warehouse, should you move all the source data into the data warehouse, or should you create a virtualization layer on top of the source data and keep it where it is?
Read on for James’s thoughts.
Given the data processing needs of a data warehouse, they tend to be implemented on massively parallel processing (MPP) systems. The MPP architecture relies upon a shared-nothing concept for distributing data across various slices. Compute nodes are layered on top of the storage and process queries against the data residing in their local slices. The control node is responsible for taking a query and dividing it up into smaller queries to be run in parallel on the compute nodes.
Read the whole thing.
Azure SQL Data Warehouse is a massively parallel processing (MPP) architecture designed for large-scale data warehouses. An MPP system creates logical / physical slices of the data. In SQL Data Warehouse’s case, the data has 60 logical slices, at all performance tiers. This means that a single table can have up to 60 different object_ids. This is why, in SQL Data Warehouse, there is the concept of physical and logical object_ids along with physical names.
Below is a query for finding row counts of tables in SQL Data Warehouse which accounts for the differences in architecture between my earlier script, written for SQL Server, and SQL Data Warehouse.
Click through for the script.
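The script itself is behind the link, but the general shape of such a query, sketched from the documented DMVs (not necessarily the author's exact script), is to sum the per-distribution counts back up to one logical count per table:

-- Roll per-distribution row counts up to one logical row count per table
SELECT t.[name] AS [table_name],
       SUM(nps.[row_count]) AS [row_count]
FROM sys.tables t
JOIN sys.pdw_table_mappings tm
    ON t.[object_id] = tm.[object_id]
JOIN sys.pdw_nodes_tables nt
    ON tm.[physical_name] = nt.[name]
JOIN sys.dm_pdw_nodes_db_partition_stats nps
    ON nt.[object_id] = nps.[object_id]
   AND nt.[pdw_node_id] = nps.[pdw_node_id]
   AND nt.[distribution_id] = nps.[distribution_id]
WHERE nps.[index_id] IN (0, 1)  -- heap or clustered (columnstore) index only
GROUP BY t.[name]
ORDER BY [row_count] DESC;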
I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution? The short answer is you should use both. The rest of this post will dig into the reasons why.
I touched on this ultimate question in a blog that is now a few years old, Hadoop and Data Warehouses, so this is a good time to provide an update. I also touched on this topic in my blogs Use cases of various products for a big data cloud solution, Data lake details, Why use a data lake?, and What is a data lake?, and my presentation.
Read on for James’s argument, which is good. My argument is summed up as follows: the purpose of a data warehouse is to solve known business problems—that is, to help build reports that people on the business side need based on established requirements. The purpose of a data lake is to hold all kinds of data and curate it for when people come looking for something they didn’t know they needed.