Before being loaded into a Snowflake table, the data can be optionally staged, which is essentially just a pointer to a location where the files are stored. There are different types of stages including:
– User stages, which each user will have by default
– Table stages, which each table will have by default
– Internal named stages, meaning staged within Snowflake
Internal named stages are the best option for regular data loads, if you are thinking along the lines of your standard daily ETL process. One benefit of these is the flexibility in that they are database objects, so you can grant privileges to roles to access these objects as you would expect. Alternatively, there are external stages, such as Azure Blob storage.
Read on to see what comes next.
Time travel in Snowflake is similar to temporal tables in SQL Server: it allows you to query the history rows of a table. If you delete or update some rows, you can retrieve the status of the table at the point in time before you executed that statement. The biggest difference is that time travel is applied by default on all tables in Snowflake, while in SQL Server you have to enable it for each table specifically. Another difference is Snowflake only keeps history for 1 day, configurable up to 90 days. In SQL Server, history is kept forever unless you specify a retention policy.
How does time travel work? Snowflake is built for the cloud and its storage is designed for working with immutable blobs. You can imagine that for every statement you execute on a table, a copy of the file is made. This means you have multiple copies of your table, for different points in time. Retrieving time travel data is then quite easy: the system has only to search for the specific file that was valid for that point in time. Let’s take a look at how it works.
It looks interesting, though the “Snowflake doesn’t have backups like you know them in SQL Server” gives pause.
First things first: what is a semi-additive calculation? Any calculation can be either additive, non-additive or semi-additive. An additive measure uses SUM to aggregate over any attribute. The sales amount is a perfect example of an additive measure. Indeed, the sales amount for all customers is the sum of the individual sales for each customer; at the same time, the amount over a year is the sum of the amounts for each month.
A non-additive measure does not use SUM over any dimension. Distinct count is the simplest example: the distinct count of products sold over a month is not the sum of the distinct counts of individual days. The same happens with any other dimension: a distinct count of products sold in a country is not the sum of the distinct counts of the products sold in each city in the country.
Semi-additive calculations are the hardest ones: a semi-additive measure uses SUM to aggregate over some dimensions and a different aggregation over other dimensions – a typical example being time.
Semi-additive measures are probably the trickiest of the three, as you can easily work with additive measures and you know you won’t be able to do much with non-additive measures.
I would not mind quoting the Druid documentation for this purpose: “Druid is a data store designed for high-performance slice-and-dice analytics (“OLAP“-style) on large data sets. Druid is most often used as a data store for powering GUI analytical applications, or as a backend for highly-concurrent APIs that need fast aggregations.”
You might be wondering where is “SQL” in that? Actually, the fact is Druid is designed for special kind of SQL workloads which we can relate with powering the GUI analytical applications which require low latency query response. But in this post, we will only look in the “how part” of it using Druid to quickly run queries.
Click through to see how.
Apache Druid is a distributed, high-performance columnar store for real-time analytics on a large dataset. Druid core design combines the OLAP analytics, time series database and search system to create a single operational analysis. Druid is most suitable for data with high cardinality column or queries having higher aggregation or group by.
Druid has very specific use cases. If you don’t fit one of the use cases, it’s not a good solution at all; but if you do fit one of the use cases, it’s excellent.
In particular, we’ll look at an example scenario that addresses Data Drift – where new information is added mid-stream and when that occurs the new table structure and new column values are created in Snowflake automatically.
To illustrate, let’s take HTTP web server logs generated by Apache web server (for example) as our main source of data. Here’s what a typical log line looks like:
22.214.171.124 - - [14/Jun/2014:10:30:19 -0400] "GET /department/outdoors/category/kids'%20golf%20clubs/product/Polar%20Loop%20Activity%20Tracker HTTP/1.1" 200 1026 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
Click through for the demonstration.
Since our preview announcement, hundreds of customers have been enabling Query Store to provide insight on query performance. We’re excited to share the general availability of Query Store worldwide for Azure SQL Data Warehouse.
Query Store automatically captures a history of queries, plans, and runtime statistics and retains them for your review when monitoring your data warehouse. Query Store separates data by time windows so you can see database usage patterns and understand when plan changes happen.
Given its power in the on-prem product, I’m glad that Azure SQL Data Warehouse is getting Query Store as well.
The performance benchmarks on CDH 6.0 show that enabling Parquet vectorization significantly improves performance for a typical ETL workload. In the test workload (TPC-DS), enabling parquet vectorization gave 26.5% performance improvement on average (geomean value of runtime for all the queries). Vectorization achieves these performance improvements by reducing the number of virtual function calls and leveraging the SIMD instructions on modern processors. A query is vectorized in Hive when certain conditions like supported column data-types and expressions are satisfied. However, if the query cannot be vectorized its execution falls back to a non-vectorized execution. Overall, for workloads which use the Parquet file format on most modern processors, enabling Parquet vectorization can lead to better query performance in CDH 6.0 and beyond.
This is worth looking into, especially if you are on the Cloudera stack.
Hortonworks announced the general availability of HDP 3.0 this year. You may read more about it here. Bundled with HDP 3.0, Apache Hive 3 with LLAP took a significant leap as a Enterprise Ready Real time Database Warehouse with transactional capabilities that continues to serve BI workloads with lower latencies. HDP 3.0 comes with exciting new capabilities – ACID support, materialized views, SQL constraints and Query result cache to name a few. Additionally, we continued to build and improve on the performance enhancements introduced in earlier releases.
In this blog, we will provide an update on our performance benchmark blog, comparing performance of HDP 3.0 to HDP 2.6.5. The noteworthy difference in benchmark is that all tables are by default transactional and written in ACID format, which means there are additional metadata (ROW_ID) columns to uniquely identify each row and support transactional semantics. Another key database capability used and tested here is SQL constraints. The hive-testbench schema has been enhanced to declare Primary-Foreign key, not null and unique constraints.
Their headline is that Hive 3 is up to 2x faster than Hive 2, with huge gains in a few of the queries.
The concept of workload management is a key factor for Azure SQL DW as there is only limited concurrency slots available and depending on the resource class, these slots can fill up pretty quickly. Once the concurrency slots are full, queries are queued until a sufficiently sized slot is opened up. Let’s recap what Resource Classes are and how they affect workload management.
A Resource Class is a pre-configured database role that determines how much resource is allocated to queries coming from users that belong to that role. For example, an ETL service account may use a “large” resource class and be allocated a generous amount of the server, however an analyst may use a “small” resource class and therefore only use up a small amount of the server with their queries. There are actually 2 types of resource class, Dynamic and Static. The Dynamic resource classes will grant a set percentage of memory to a query and actual value of this percentage will vary as the Warehouse scales up and down. The key factor is that an xLargeRc (extra-large resource class) will always take up 70% of the Server and will not allow any other queries to be run concurrently. No matter how much you scale up the Warehouse, queries run with an xLargeRc will run one at a time. Conversely, queries run with a smallrc will only be allocated 4% of the Server and therefore as a Warehouse scales up, this 4% becomes a larger amount of resource and can therefore process data quicker.
This looks like a useful addition. Click through for a few examples of how it will work.