As mentioned earlier, the most commonly encountered approach is Option 2, the snapshot fact table. The main drawback of this approach is that the fact table’s size will grow extremely fast. For example, if you want to calculate the headcount in a company with 10,000 employees on average, and you want 5 years of historical data, you will add 10,000 rows per day to your fact table – that gives you (10,000 * 365 * 5 =) 18,250,000 rows after 5 years.
If you used the first approach, Option 1, the fact table would be (10,000 * 5 =) 50,000 rows after 5 years, assuming your employees change position or quit the company once a year, on average.
The snapshot fact table (Option 2) is (18,250,000 / 50,000 =) 365 times bigger. On the bright side, as the data is very repetitive, you might get a very good compression ratio on these tables.
Check it out. Semi-additive measures are not as common as additive measures, but you’re liable to have a couple of them in your data model.
Every once in a while, I come across a data warehouse where the data load uses a full truncate and reload pattern to populate a fact or dimension. While it may not be the end of the world for a small table, it does concern me and I usually recommend to redesign the load. My thoughts below on why this is an anti-pattern are true for using the actual TRUNCATE TABLE statement as well as executing a DELETE statement with no WHERE clause.
Read on for some great advice, including an exception to the rule.
There is no dark magic involved in improving the efficiency of your queries. Based on whom you ask this can be considered as a standout feature or a major hindrance but I am not a fan of tuning queries according to my workload as I feel the way data evolves so quickly in organizations. It becomes more tricky to play catch up to turn the all necessary knobs to make the query faster.
Snowflake claims they tune all the queries “automagically” via a dynamic query optimization engine. No need for any indexes, updating statistics, partition keys or pre-shard data for even distribution when you scale up. All of this will be done by their patent-pending dynamic optimization.
But I still feel snowflake can work on providing the necessary knobs for people who would like to tune their queries.
Read the whole thing. In short, it’s a technology worth looking at, but it’s not going to work perfectly in all cases.
The OAC environment that Opal gave me access possessed an example schema/data based on an Audio-Video store revenue for multiple years.
I’d never worked with the OAC before, but I was quickly able to find five methods to connect Power BI to it, either to import or to direct connect to the data:
1. Data Direct makes a Rest API to connect to the Oracle Cloud.
2. Use the web interface with Power BI and “scrape the website”, allowing PBI to build the table from example.
3. Connect with an ODBC driver
4. Via the OAC interface, export to Excel and then import into Power BI
5. Via the OAC interface, export to a table and then import into Power BI as a CSV file.
So, uh, yeah, you can do it. At least five ways.
Data classifications in Azure SQL DW entered public preview in March 2019. They allow you to label columns in your data warehouse with their information type and sensitivity level. There are built-in classifications, but you can also add custom classifications. This could be an important feature for auditing your storage and use of sensitive data as well as compliance with data regulations such as GDPR. You can export a report of all labeled columns, and you can see who is querying sensitive columns in your audit logs. The Azure Portal will even recommend classifications based upon your column names and data types. You can add the recommended classifications with a simple click of a button.
But read the whole thing, as Meagan sees a problem with it when you use a popular loading technique.
To make a full transition from the existing DW model to an alternative Data Vault I removed all Surrogate Keys and other attributes that are only necessary to support Kimball data warehouse methodology. Also, I needed to add necessary Hash keys to all my Hub, Link and Satellite tables. The target environment for my Data Vault would be SQL Azure database and I decided to use a built-in crc32 function of the Mapping Data Flow to calculate hash keys (HK) of my business data sourcing keys and composite hash keys of satellite tables attributes (HDIFF).
Data Vault is somewhere on my list of things to learn. It’s not at the top of the list, but that’s not a slight against it.
Before being loaded into a Snowflake table, the data can be optionally staged, which is essentially just a pointer to a location where the files are stored. There are different types of stages including:
– User stages, which each user will have by default
– Table stages, which each table will have by default
– Internal named stages, meaning staged within Snowflake
Internal named stages are the best option for regular data loads, if you are thinking along the lines of your standard daily ETL process. One benefit of these is the flexibility in that they are database objects, so you can grant privileges to roles to access these objects as you would expect. Alternatively, there are external stages, such as Azure Blob storage.
Read on to see what comes next.
Time travel in Snowflake is similar to temporal tables in SQL Server: it allows you to query the history rows of a table. If you delete or update some rows, you can retrieve the status of the table at the point in time before you executed that statement. The biggest difference is that time travel is applied by default on all tables in Snowflake, while in SQL Server you have to enable it for each table specifically. Another difference is Snowflake only keeps history for 1 day, configurable up to 90 days. In SQL Server, history is kept forever unless you specify a retention policy.
How does time travel work? Snowflake is built for the cloud and its storage is designed for working with immutable blobs. You can imagine that for every statement you execute on a table, a copy of the file is made. This means you have multiple copies of your table, for different points in time. Retrieving time travel data is then quite easy: the system has only to search for the specific file that was valid for that point in time. Let’s take a look at how it works.
It looks interesting, though the “Snowflake doesn’t have backups like you know them in SQL Server” gives pause.
First things first: what is a semi-additive calculation? Any calculation can be either additive, non-additive or semi-additive. An additive measure uses SUM to aggregate over any attribute. The sales amount is a perfect example of an additive measure. Indeed, the sales amount for all customers is the sum of the individual sales for each customer; at the same time, the amount over a year is the sum of the amounts for each month.
A non-additive measure does not use SUM over any dimension. Distinct count is the simplest example: the distinct count of products sold over a month is not the sum of the distinct counts of individual days. The same happens with any other dimension: a distinct count of products sold in a country is not the sum of the distinct counts of the products sold in each city in the country.
Semi-additive calculations are the hardest ones: a semi-additive measure uses SUM to aggregate over some dimensions and a different aggregation over other dimensions – a typical example being time.
Semi-additive measures are probably the trickiest of the three, as you can easily work with additive measures and you know you won’t be able to do much with non-additive measures.
I would not mind quoting the Druid documentation for this purpose: “Druid is a data store designed for high-performance slice-and-dice analytics (“OLAP“-style) on large data sets. Druid is most often used as a data store for powering GUI analytical applications, or as a backend for highly-concurrent APIs that need fast aggregations.”
You might be wondering where is “SQL” in that? Actually, the fact is Druid is designed for special kind of SQL workloads which we can relate with powering the GUI analytical applications which require low latency query response. But in this post, we will only look in the “how part” of it using Druid to quickly run queries.
Click through to see how.