Warehousing – Page 14

Have One Data Model per Business Area

Published 2022-07-15 by Kevin Feasel

James McGillivray offers us an important piece of advice:

I cannot stress this enough. If people are consuming your data in multiple places, the data needs to come from the same data model. That can be an Enterprise Data Warehouse, a Data Mart, a Power BI Model, or any other data source, but at some point you need to be able to track the data back to a single place. If you don’t do this, you will spend THE REST OF YOUR DAYS explaining the differences between the data models to business and customers, and reconciling the differences over and over again.

Read on to learn why this is so important.

Comments closed

Choosing among Snowflake Editions

Published 2022-07-13 by Kevin Feasel

Arun Sirpal helps us make a choice:

You have decided that snowflake is the technology you want to use to build your next gen data platform, you have decided your cloud provider (Azure, AWS, GCP) then next you need to think about what edition of snowflake suits your business needs?

Read on to see what the major differences are between editions.

Comments closed

Power BI as an Enterprise Data Warehouse

Published 2022-07-01 by Kevin Feasel

James Serra follows Betteridge’s Law of Headlines:

With Power BI continuing to get many great new features, including the latest in Datamarts (see my blog Power BI Datamarts), I’m starting to hear customers ask “Can I just build my entire enterprise data warehouse solution in Power BI”? In other words, can I just use Power BI and all its built-in features instead of using Azure Data Lake Gen2, Azure Data Factory (ADF), Azure Synapse, Databricks, etc? The short answer is “No”.

Read on to understand why Power BI shouldn’t be your data warehouse.

Comments closed

Semantics Layers for Data Lakehouses

Published 2022-06-27 by Kevin Feasel

Jans Aasman explains why semantic modeling is so important for a data lakehouse:

Data lakehouses would not exist — especially not at enterprise scale — without semantic consistency. The provisioning of a universal semantic layer is not only one of the key attributes of this emergent data architecture, but also one of its cardinal enablers.
In fact, the critical distinction between a data lake and a data lakehouse is that the latter supplies a vital semantic understanding of data so users can view and comprehend these enterprise assets. It paves the way for data governance, metadata management, role-based access, and data quality.

For a deeper dive into the topic, Kyle Hale has a post covering this with Databricks and Power BI as examples.

Comments closed

Power BI Field Parameters and Type 2 SCDs with Bonus Fields

Published 2022-06-20 by Kevin Feasel

Koen Verbeeck extends the type 2 slowly changing dimension:

Power BI field parameters are a new feature in Power BI Desktop, and it’s one of the best of the past months. In short, Power BI field parameters allow you to easily switch between dimensions attributes or measures in a filter. Previously, you had to do all sorts of DAX wizardry to make this happen, but now it’s just a couple of clicks.
The goal of this blog post is not to tell you exactly how they work, but rather showcase an interesting use case. You can find more info about Power BI field parameters in the official blog post, but also here, here and here. The use case I’m talking about is slowly changing dimensions of Type 2, you know, the one where we insert a record for every change. Often, I also include an extra column for each column of which we’re tracking history: the “current value column”. For example, if we keep history of the department for an employee, I have a column “CurrentDepartment”. If a type 2 change occurs, the values of this columns are updated to the last known value for this dimension member. This allows to answer different types of questions, because sometimes users are interested in the historical values, but sometimes they just want to know the current value.

Read on for the use case as well as how you might combine field parameters with the idea of current values on type-2 slowly changing dimensions.

Comments closed

Finding Duplicates in Type 2 SCDs

Published 2022-05-31 by Kevin Feasel

Dinesh Asanka wants to verify some Type 2 slowly changing dimension results:

As we discussed in a previous article, Implementing Slowly Changing Dimensions (SCDs) in Data Warehouses, there are three main types of slowly changing dimensions, such as Type 1, Type 2, and Type 3. Out of these Type 1 is the simple dimension where you will simply maintain only the latest version of the attribute. For example, if the employee got promoted to Senior Software Engineer from Software Engineer, you will simply overwrite the existing value to the new value so that the historical aspect is lost.
Type 2 Slowly Changing Dimensions are used to track historical data in a data warehouse. This is the most common approach in dimension. This article uses a sample database of AdventureworksDW which is the sample database for the data warehouse.

Click through for one way to compare, one which you could build using dynamic SQL.

Comments closed

Recursive Subdirectory Reads with Hive

Published 2022-05-18 by Kevin Feasel

The Hadoop in Real World team wants to dig deeper:

Let’s say you have a Hive table and the Hive table is pointing at a location or directory which has several sub directories and each subdirectories has files underneath it.
When you query the table however, Hive is only reading the files at the top level folder and ignoring all the files under the subdirectories.

Click through for the solution.

Comments closed

Optimizing Hive Performance with Tez

Published 2022-05-11 by Kevin Feasel

Jay Desai has some recommendations around tuning Tez queries:

Tuning Hive on Tez queries can never be done in a one-size-fits-all approach. The performance on queries depends on the size of the data, file types, query design, and query patterns. During performance testing, evaluate and validate configuration parameters and any SQL modifications. It is advisable to make one change at a time during performance testing of the workload, and would be best to assess the impact of tuning changes in your development and QA environments before using them in production environments. Cloudera WXM can assist in evaluating the benefits of query changes during performance testing.

Click through for several configuration and query considerations.

Comments closed

Things Not to Include in Data Warehouses

Published 2022-05-04 by Kevin Feasel

Erik Darling compiles a list:

This is a list of things I see in data warehouses that make me physically ill:
– Unique constraints of any kind: Primary Keys, Indexes, etc. Make things unique during your staging process. Don’t make your indexes do that work.

Read on for the full list. I agree with everything except clustered row-store indexes. Those make a lot of sense on dimension tables, tied to the Kimball-style surrogate keys you create in the warehouse itself.

The other part I disagree with is non-clustered columnstore indexes, which I’ve rarely found good use for. Clustrered columnstore indexes are outstanding but the non-clustered variety…meh at best. This answer comes primarily because the pattern I tend to use for warehouse queries is to drive from the fact table, aggregate as much as I can there, and connect to the dimensions for further information at the end. If your warehouse access patterns differ radically from this, you might get more out of non-clustered columnstore indexes. Maybe.

Comments closed

S3 and Redshift Data Movement with Role Chaining

Published 2022-04-29 by Kevin Feasel

Sudipta Mitra, et al, talk AWS security:

This post presents an approach that you can apply at scale to achieve fine-grained access controls to resources in S3 buckets and Amazon Redshift schemas for tenants, including groups of users belonging to the same business unit down to the individual user level. This solution provides tenant isolation and data security. In this approach, we use the bridge model to store data and control access for each tenant at the individual schema level in the same Amazon Redshift database. We utilize ASSUMEROLE and role chaining to provide fine-grained access control when data is being copied and unloaded between Amazon Redshift and Amazon S3, so the data flows within each tenant’s namespace. Role chaining also streamlines the new tenant onboarding process.

Read on for an overview and tutorial.

Comments closed

M	T	W	T	F	S	S
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30

Category: Warehousing