The performance benchmarks on CDH 6.0 show that enabling Parquet vectorization significantly improves performance for a typical ETL workload. In the test workload (TPC-DS), enabling parquet vectorization gave 26.5% performance improvement on average (geomean value of runtime for all the queries). Vectorization achieves these performance improvements by reducing the number of virtual function calls and leveraging the SIMD instructions on modern processors. A query is vectorized in Hive when certain conditions like supported column data-types and expressions are satisfied. However, if the query cannot be vectorized its execution falls back to a non-vectorized execution. Overall, for workloads which use the Parquet file format on most modern processors, enabling Parquet vectorization can lead to better query performance in CDH 6.0 and beyond.
This is worth looking into, especially if you are on the Cloudera stack.
Hortonworks announced the general availability of HDP 3.0 this year. You may read more about it here. Bundled with HDP 3.0, Apache Hive 3 with LLAP took a significant leap as a Enterprise Ready Real time Database Warehouse with transactional capabilities that continues to serve BI workloads with lower latencies. HDP 3.0 comes with exciting new capabilities – ACID support, materialized views, SQL constraints and Query result cache to name a few. Additionally, we continued to build and improve on the performance enhancements introduced in earlier releases.
In this blog, we will provide an update on our performance benchmark blog, comparing performance of HDP 3.0 to HDP 2.6.5. The noteworthy difference in benchmark is that all tables are by default transactional and written in ACID format, which means there are additional metadata (ROW_ID) columns to uniquely identify each row and support transactional semantics. Another key database capability used and tested here is SQL constraints. The hive-testbench schema has been enhanced to declare Primary-Foreign key, not null and unique constraints.
Their headline is that Hive 3 is up to 2x faster than Hive 2, with huge gains in a few of the queries.
The concept of workload management is a key factor for Azure SQL DW as there is only limited concurrency slots available and depending on the resource class, these slots can fill up pretty quickly. Once the concurrency slots are full, queries are queued until a sufficiently sized slot is opened up. Let’s recap what Resource Classes are and how they affect workload management.
A Resource Class is a pre-configured database role that determines how much resource is allocated to queries coming from users that belong to that role. For example, an ETL service account may use a “large” resource class and be allocated a generous amount of the server, however an analyst may use a “small” resource class and therefore only use up a small amount of the server with their queries. There are actually 2 types of resource class, Dynamic and Static. The Dynamic resource classes will grant a set percentage of memory to a query and actual value of this percentage will vary as the Warehouse scales up and down. The key factor is that an xLargeRc (extra-large resource class) will always take up 70% of the Server and will not allow any other queries to be run concurrently. No matter how much you scale up the Warehouse, queries run with an xLargeRc will run one at a time. Conversely, queries run with a smallrc will only be allocated 4% of the Server and therefore as a Warehouse scales up, this 4% becomes a larger amount of resource and can therefore process data quicker.
This looks like a useful addition. Click through for a few examples of how it will work.
Let’s start with the positive.
Snowflake is a really scalable database. Storage is virtually limitless, since the data is stored on blob storage (S3 on AWS and Blob Storage on Azure). The compute layer (called warehouses) is completely separated from the storage layer and you can scale it independently from storage.
It is really easy to use. This is one of Snowflake’s core goals: make it easy to use for everyone. Most of the technical aspects (clustering, storage etc) are hidden from the user. If you thought SQL Server is easy with it’s “next-next-finish” installation, you’ll be blown away by Snowflake. I really like this aspect, since you have really powerful data warehousing at your finger tips, and the only thing you have to worry about is how to get your data into it. With Azure SQL DW for example, you have to about distribution of the data, how you are going to set things up etc. Not here.
It’s not all positive, but Koen seems quite happy to work with the product.
There are numerous improvements that went into HDP 3.0 and the performance improvements shown are an aggregate result of all of them. Here are some of the more noteworthy improvements related to Druid-Hive integration :
Druid Expressions Support – HIVE-18893/ CALCITE-2170 added support for Druid expressions in Hive. In HDP 3.0, Hive can push the computation of SQL expressions as part of a Druid query and they can be evaluated by Druid.
Use of Scan Query instead of Select Query – In HDP 3.0 we use Druid Scan query instead of Select Query. Scan Query is a streaming version of Select Query which returns the results in a compact streaming format. Scan query also does not need all the results to be retained in memory before they can be returned to Hive. This improves the memory usage of the historical nodes too.
GroupBy Query Improvements – Many optimizations are done in order to address the performance of GroupBy queries on Druid side. Main ones are –
Better column pruning – In some cases when hive cannot push any operator to druid, hive ended up pulling all the columns from druid. This led to lots of unnecessary data transfer between druid and hive. HIVE-15619 improved the column pruner logic to only fetch the columns from druid which are required to answer a query.
Druid Version upgrade from 0.10.0 to 0.12.2 – HDP 3.0 comes with latest version of Druid i.e 0.12.2 which has many new features, performance enhancements and bug-fixes over the previous version.
Druid is still a specialty technology which doesn’t fit every use case, but if it does fit your use, you’ll get a lot of performance benefit out of it.
The Need to Vacuum
As Redshift does not reclaim free space automatically, updates and delete operations can frequently lead to table growth. Equally, it’s important as new entries are added, that the data is maintained in a sorted sequence.
The VACUUM command is used to re-sequence data, and reclaim disk space as a result of DELETE and UPDATE operations. Although it won’t block other processes, it can be a resource-intensive operation, especially for data stored using interleaved sort keys.
It should be run periodically to ensure consistent performance and to reduce disk usage.
Some of this is good Postgres advice; some of it is good MPP advice (and serves well, for example, when dealing with Azure SQL Data Warehouse); the rest is Redshift-specific.
Using a query label in Azure SQL DW (Data Warehouse) can be a really handy technique to track queries via DMVs. You might want to do this to see what problematic queries are doing under the covers.
Let’s check out an example. First I will show you how things would look without using a query label. I connect to SQL DW and issue the following basic example query.
It’s an interesting approach and solves a problem I saw in Polybase around figuring out which session details were yours after the fact.
The question is how are backups done with Azure SQL DW?
It is very different from Azure SQL DB (which you would expect). Azure SQL DW has a totally different architecture to its classic database counter-part. Restore points are the key here. Automatic ones are taken throughout the day and are kept for seven days only. Worst case scenario is the time between the restore points will be eight hours hence giving an eight hour RPO (Recovery Point Objective).
You can also create manual restore points, as Arun shows.
I wrote the updated content from a practical point of view, totally hype-free. The table of contents:
- Modern Data Architecture
- Business Needs Driving Data Architectures to Evolve and Adapt
- Principles of a Modern Data Architecture
- Data Lake + Data Warehouse: Complementary Solutions
- Tips for Designing a Data Lake
- Azure Technologies for Implementing a Data Lake
- Considerations for a Successful Data Lake in the Cloud
- Getting Started with a Data Lake
To download the ebook, BlueGranite will ask for you to register your information. That’s common for premium content like this. We take a low-key approach to sales, so I can assure you that registration only means you’ll receive notifications of new content that you may find interesting.
It’s the length of a good-sized paper, so you won’t have to invest dozens of hours of time to get the story.
Access to the table columns can be controlled based on the user’s execution context or their group membership with the standard GRANT T-SQL statement. To secure your data, you simply define a security policy via the GRANT statement to your table columns. For example, if you would like to limit access to PII data in your customers table, you can simply GRANT SELECT permissions on specific columns to the ContractEmp role:GRANT SELECT ON dbo.Customers (CustomerId, FirstName, LastName) TO ContractEmp;
This capability is available now in all Azure regions with no additional charge.
This has been in regular SQL Server for a long time, so it’s good to see it make its way into Azure SQL Data Warehouse, and in a manner which doesn’t involve creating user-defined functions for predicates like Row-Level Security.