Press "Enter" to skip to content

Category: Warehousing

The Premise Of Cloud Data Warehousing

Derik Hammer explains how cloud data warehouses differ from their on-prem cousins:

Given the data processing needs of a data warehouse, they tend to be implemented on massively parallel processing (MPP) systems. The MPP architecture replies upon a shared nothing concept for distributing data across various slices. Compute nodes are layered on top of the storage and processes queries for data residing in its local slice. The control node is responsible for taking a query and dividing it up into smaller queries to be run in parallel on the compute nodes.

Read the whole thing.

Comments closed

Row Counts From Statistics In Azure DW

Derik Hammer has a script to estimate row counts in an Azure SQL Data Warehouse table:

Azure SQL Data Warehouse is a massively parallel processing (MPP) architecture designed for large-scale data warehouses. An MPP system creates logical / physical slices of the data. In SQL Data Warehouse’s case, the data has 60 logical slices, at all performance tiers. This means that a single table can have up to 60 different object_ids. This is why, in SQL Data Warehouse, there is the concept of physical and logical object_ids along with physical names.

Below is a query for finding row counts of tables in SQL Data Warehouse which accounts for the differences in architecture between my earlier script, written for SQL Server, and SQL Data Warehouse.

Click through for the script.

Comments closed

The Need For Multiple Warehouse Architectures

James Serra argues in favor of a data lake approach and a traditional data warehouse:

I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution?  The short answer is you should use both.  The rest of this post will dig into the reasons why.

I touched on this ultimate question in a blog that is now over a few years old at Hadoop and Data Warehouses so this is a good time to provide an update.  I also touched on this topic in my blogs Use cases of various products for a big data cloud solutionData lake detailsWhy use a data lake?and What is a data lake? and my presentation Big data architectures and the data lake.  

Read on for James’s argument, which is good.  My argument is summed up as follows:  the purpose of a data warehouse is to solve known business problems—that is, to help build reports that people on the business side need based on established requirements.  The purpose of a data lake is to hold all kinds of data and curate it for when people come looking for something they didn’t know they needed.

Comments closed

Data Warehousing Versus Data Virtualization

Koos van Strien contrasts data virtualization with data warehouse automation:

From a certain viewpoint, one could state that Data Virtualization is focused on the way the world should work: when integrating data, one shouldn’t have to store it everywhere. Why not let the system decide when to store? For some, to adopt this view might mean a paradigm shift: suddenly, the Data Warehouse the go-to integration point any more!

From this viewpoint, DWA a tool “from the trenches”: after years of struggle and hard work to build our warehouses, we’ve developed some smart ways to automate our warehouse-building based on abstract models.

Worth reading the whole thing.

Comments closed

Is Azure SQL DW A Good Fit For You?

Melissa Coates has a nice choose-your-own-adventure story around Azure SQL Data Warehouse:

Q4: How large is your database?

It is difficult to pinpoint an exact number for the absolute minimum size recommended for Azure SQL DW. Many data professionals in the industry see the minimum “practical” data size for Azure SQL DW in the 1-4TB range. Microsoft documentation has recently stated as low as 250GB for a minimum size. Since Azure SQL DW is an MPP (massively parallel processing) system, you experience a significant performance penalty with small data sizes because of the overhead incurred to distribute and consolidate across the nodes (which are distributions in a “shared-nothing” architecture). We recommend Azure SQL DW for a data warehouse which is starting to approach 1TB and expected to continue growing.

Great advice here.  I’ve heard too often of people looking at the name “Azure SQL Data Warehouse” and figuring that because they have data warehouses on-prem, this is the appropriate analog.  Azure SQL DW is not a typical data warehousing environment; it’s more of a specialized tool than that, so click through to see if it fits your needs.

Comments closed

Data Lakes Aren’t New

Shannon Lowder reveals one of the deep, dark data lake secrets:

Turns out there are three basic zones or areas to a data lake. Raw, Managed, and Presentation.

The raw zone should be optimized for fast storage.  The goal is to get the data in as quickly as possible.  Don’t make any changes to this data.  You want it stored as close to the original format as possible.  It sounds just like staged data to me.  Data you’d build an extract package to get from source to your staging environment, right?

Maybe you’re thinking this is just a coincidence…let’s move on.

Spoilers:  it’s not a coincidence.

Comments closed

Adding Public Holidays To A Date Dimension

Reza Rad continues his series on Power BI date dimensions:

To get public holidays live, you first need an API that is giving you up-to-date information. There are some web pages that has the list of public holidays. I have already explained in another blog post how to use a web page and query public holidays from there. That method uses custom functions as well, here you can read about that.

The method of reading data from a web page has an issue already; Web.Page function from Power Query is used to pull data from that page, and this function needs a gateway configuration to work. There is another function Xml.Document that can work even without the gateway. So because of this reason, we’ll use Xml.Document and get data from an API that provides the result set as XML.

WebCal.fi is a great free website with calendars for 36 countries which I do recommend for this example. This website, provides the calendars through XML format. There are other websites that give you the calendar details through a paid subscription. However, this website is a great free one which can be used for this example. WebCal.fi is created by User Point Inc.

This was an interesting approach to the problem, one I did not expect when first reading the article.  I figured it’d be some sort of date calculation script.

Comments closed

Fiscal Year Columns In A Power BI Date Dimension

Reza Rad takes a date dimension in Power BI and adds fiscal year details:

As you can see in the image above; June 2017 considered as fiscal year 2017. However, July 2017 is part of fiscal year 2018. So the simple logic can be like this:

if (calendar month >= fiscal year start)

then fiscal year = calendar year

else fiscal year = calendar year + 1

This code is pseudo code. don’t write that exactly in M! Let’s now implement it in M;

If you have to deal with multiple fiscal years (e.g., state and federal government fiscal years), the process is the same, only repeated.

Comments closed

Azure SQL Data Warehouse Patterns

Murshed Zaman shows us a couple of patterns and anti-patterns for Azure SQL Data Warehouse:

Azure SQL DW is a Massively Parallel Processing (MPP) data warehousing service. It is a service because Microsoft maintains the infrastructure and software patching to make sure it’s always on up to date hardware and software on Azure. The service makes it easy for a customer to start loading their tables on day one and start running queries quickly and allows scaling of compute nodes when needed.

In an MPP database, table data is distributed among many servers (known as compute or slave nodes), and in many MPP systems shared-nothing storage subsystems are attached to those servers. Queries come through a head (or master) node where the location metadata for all the tables/data blocks resides. This head node knows how to deconstruct the query into smaller queries, introduce various data movement operations as needed, and pass smaller queries on to the compute nodes for parallel execution. Data movement is needed to align the data by the join keys from the original query. The topic of data movement in an MPP system is a whole another blog topic by itself, that we will tackle in a different blog. Besides Azure SQL DW, some other examples of a MPP data warehouses are Hadoop (Hive and Spark), Teradata, Amazon RedShift, Vertica, etc.

The opposite of MPP is SMP (Symmetric Multiprocessing) which basically means the traditional one server systems. Until the invention of MPP we had SMP systems. In database world the examples are traditional SQL Server, Oracle, MySQL etc. These SMP databases can also be used for both OLTP and OLAP purposes.

Murshed spends the majority of this blog post covering things you should not do, which is probably for the best.

Comments closed

Basics Of Azure SQL Data Warehouse

Minette Steynberg has an article introducing Azure SQL Data Warehouse:

Azure SQL DW is best used for analytical workloads that makes use of large volumes of data and needs to consolidate disparate data into a single location.

Azure SQL DW has been specifically designed to deal with very large volumes of data. In fact, if there is too little data it may perform poorly because the data is distributed. You can imagine that if you had only 10 rows per distribution, the cost of consolidating the data will be way more than the benefit gained by distributing it.

SQL DW is a good place to consolidate disparate data, transform, shape and aggregate it, and then perform analysis on it. It is ideal for running burst workloads, such as month end financial reporting etc.

Azure SQL DW should not be used when small row by row updates are expected as in OLTP workloads. It should only be used for large scale batch operations.

Azure SQL Data Warehouse is fantastic when you’ve got a setup like above and are willing to pay a premium to make things faster.  And with appropriately distributed data, it certainly does get faster.

Comments closed