One difference between regular tables and temporary tables is how they are typically used. Temporary tables are session-scoped, which means that adding them to a process or report will probably cause them to be created multiple times. Temporary tables might be very similar to regular tables, but most regular tables are not rewritten every time they are queried.
The disk writes involved in populating the temporary table might be more expensive than the reads would be if you modified your query to fold that logic into one larger query. How frequently the report or process runs will also factor into how much of a performance hit you take by using temporary tables. If you are using temporary tables to make a procedure easier to debug or to enhance readability, make sure you understand the I/O cost of performing those writes and then reading the data back in a subsequent query.
Read on for more.
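To make that trade-off concrete, here is a minimal sketch contrasting the two patterns; the Orders and Customers tables, their columns, and the filter value are all made up for illustration.

-- Two-step version: write the intermediate result set to a temp table, then read it back
SELECT c.CustomerID, SUM(o.Amount) AS TotalAmount
INTO #CustomerTotals
FROM dbo.Orders AS o
    INNER JOIN dbo.Customers AS c
        ON c.CustomerID = o.CustomerID
GROUP BY c.CustomerID;

SELECT ct.CustomerID, ct.TotalAmount
FROM #CustomerTotals AS ct
WHERE ct.TotalAmount > 1000;

-- Single-query version: the same logic folded into one statement via a CTE,
-- avoiding the explicit write to tempdb and the subsequent read back
WITH CustomerTotals AS
(
    SELECT c.CustomerID, SUM(o.Amount) AS TotalAmount
    FROM dbo.Orders AS o
        INNER JOIN dbo.Customers AS c
            ON c.CustomerID = o.CustomerID
    GROUP BY c.CustomerID
)
SELECT CustomerID, TotalAmount
FROM CustomerTotals
WHERE TotalAmount > 1000;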
Data virtualization integrates data from disparate sources, locations and formats, without replicating or moving the data, to create a single “virtual” data layer that delivers unified data services to support multiple applications and users.
Data movement is the process of extracting data from source systems and bringing it into the data warehouse and is commonly called ETL, which stands for extraction, transformation, and loading.
If you are building a data warehouse, should you move all the source data into the data warehouse, or should you create a virtualization layer on top of the source data and keep it where it is?
Read on for James’s thoughts.
Given their data processing needs, data warehouses tend to be implemented on massively parallel processing (MPP) systems. The MPP architecture relies upon a shared-nothing concept for distributing data across various slices. Compute nodes are layered on top of the storage, and each node processes queries for data residing in its local slice. The control node is responsible for taking a query and dividing it up into smaller queries to be run in parallel on the compute nodes.
Read the whole thing.
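As a quick illustration of that shared-nothing distribution model, this is roughly how a table declares its distribution when you create it in Azure SQL Data Warehouse; the table and columns here are hypothetical.

CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT         NOT NULL,
    CustomerKey INT            NOT NULL,
    SaleAmount  DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),  -- each row lands in one slice, chosen by hashing this column
    CLUSTERED COLUMNSTORE INDEX
);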
Azure SQL Data Warehouse is a massively parallel processing (MPP) architecture designed for large-scale data warehouses. An MPP system creates logical / physical slices of the data. In SQL Data Warehouse’s case, the data has 60 logical slices, at all performance tiers. This means that a single table can have up to 60 different object_ids. This is why, in SQL Data Warehouse, there is the concept of physical and logical object_ids along with physical names.
Below is a query for finding row counts of tables in SQL Data Warehouse; it accounts for the architectural differences between SQL Server, for which my earlier script was written, and SQL Data Warehouse.
Click through for the script.
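If you just want the general shape of such a query before clicking through, here is a rough sketch, not the linked script, built on the pdw catalog views that map a table's logical object_id to its physical, per-distribution objects; treat the details as an approximation.

SELECT
    s.name AS [schema_name],
    t.name AS [table_name],
    SUM(ps.row_count) AS [row_count]
FROM sys.schemas AS s
    INNER JOIN sys.tables AS t
        ON s.schema_id = t.schema_id
    INNER JOIN sys.pdw_table_mappings AS m           -- logical object_id -> physical table name
        ON t.object_id = m.object_id
    INNER JOIN sys.pdw_nodes_tables AS nt            -- one row per table per distribution
        ON m.physical_name = nt.name
    INNER JOIN sys.dm_pdw_nodes_db_partition_stats AS ps
        ON nt.object_id = ps.object_id
       AND nt.pdw_node_id = ps.pdw_node_id
       AND nt.distribution_id = ps.distribution_id
WHERE ps.index_id < 2                                -- heap or clustered index only, to avoid double-counting
GROUP BY s.name, t.name
ORDER BY [row_count] DESC;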
I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution? The short answer is you should use both. The rest of this post will dig into the reasons why.
I touched on this ultimate question in a blog post that is now a few years old, Hadoop and Data Warehouses, so this is a good time to provide an update. I also touched on this topic in my blog posts Use cases of various products for a big data cloud solution, Data lake details, Why use a data lake?, and What is a data lake?, as well as in a presentation.
Read on for James’s argument, which is good. My argument is summed up as follows: the purpose of a data warehouse is to solve known business problems—that is, to help build reports that people on the business side need based on established requirements. The purpose of a data lake is to hold all kinds of data and curate it for when people come looking for something they didn’t know they needed.
From a certain viewpoint, one could state that Data Virtualization is focused on the way the world should work: when integrating data, one shouldn’t have to store it everywhere. Why not let the system decide when to store it? For some, adopting this view might mean a paradigm shift: suddenly, the Data Warehouse isn’t the go-to integration point any more!
From this viewpoint, DWA (Data Warehouse Automation) is a tool “from the trenches”: after years of struggle and hard work to build our warehouses, we’ve developed some smart ways to automate our warehouse-building based on abstract models.
Worth reading the whole thing.
Q4: How large is your database?
It is difficult to pinpoint an exact number for the absolute minimum size recommended for Azure SQL DW. Many data professionals in the industry see the minimum “practical” data size for Azure SQL DW in the 1-4TB range. Microsoft documentation has recently cited a minimum size as low as 250GB. Since Azure SQL DW is an MPP (massively parallel processing) system, you experience a significant performance penalty with small data sizes because of the overhead incurred to distribute and consolidate across the nodes (which are distributions in a “shared-nothing” architecture). We recommend Azure SQL DW for a data warehouse which is starting to approach 1TB and expected to continue growing.
Great advice here. I’ve heard too often of people looking at the name “Azure SQL Data Warehouse” and figuring that because they have data warehouses on-prem, this is the appropriate analog. Azure SQL DW is not a typical data warehousing environment; it’s more of a specialized tool than that, so click through to see if it fits your needs.
Turns out there are three basic zones or areas to a data lake: Raw, Managed, and Presentation.
The raw zone should be optimized for fast storage. The goal is to get the data in as quickly as possible. Don’t make any changes to this data. You want it stored as close to the original format as possible. It sounds just like staged data to me. Data you’d build an extract package to get from source to your staging environment, right?
Maybe you’re thinking this is just a coincidence…let’s move on.
Spoilers: it’s not a coincidence.
To get public holidays live, you first need an API that gives you up-to-date information. There are some web pages that have the list of public holidays. I have already explained in another blog post how to use a web page and query public holidays from there; that method uses custom functions as well, and you can read about it there.
The method of reading data from a web page has one issue: Power Query’s Web.Page function is used to pull data from the page, and this function needs a gateway configured in order to work. There is another function, Xml.Document, which can work even without a gateway. For this reason, we’ll use Xml.Document and get data from an API that provides the result set as XML.
WebCal.fi is a great free website with calendars for 36 countries, and it provides the calendars in XML format, which makes it a good fit for this example. There are other websites that give you calendar details through a paid subscription, but this one is free. WebCal.fi is created by User Point Inc.
This was an interesting approach to the problem, one I did not expect when first reading the article. I figured it’d be some sort of date calculation script.
As you can see in the image above, June 2017 is considered fiscal year 2017; however, July 2017 is part of fiscal year 2018. So the simple logic is:
if (calendar month >= fiscal year start month)
then fiscal year = calendar year + 1
else fiscal year = calendar year
This is pseudocode; don’t write it exactly like that in M! Let’s now implement it in M.
If you have to deal with multiple fiscal years (e.g., state and federal government fiscal years), the process is the same, only repeated.