Press "Enter" to skip to content

Category: Warehousing

Semantic Layers for Data Lakehouses

Jans Aasman explains why semantic modeling is so important for a data lakehouse:

Data lakehouses would not exist — especially not at enterprise scale — without semantic consistency. The provisioning of a universal semantic layer is not only one of the key attributes of this emergent data architecture, but also one of its cardinal enablers.

In fact, the critical distinction between a data lake and a data lakehouse is that the latter supplies a vital semantic understanding of data so users can view and comprehend these enterprise assets. It paves the way for data governance, metadata management, role-based access, and data quality.

For a deeper dive into the topic, Kyle Hale has a post covering this with Databricks and Power BI as examples.


Power BI Field Parameters and Type 2 SCDs with Bonus Fields

Koen Verbeeck extends the type 2 slowly changing dimension:

Power BI field parameters are a new feature in Power BI Desktop, and one of the best features of recent months. In short, Power BI field parameters allow you to easily switch between dimension attributes or measures in a filter. Previously, you had to do all sorts of DAX wizardry to make this happen, but now it’s just a couple of clicks.

The goal of this blog post is not to tell you exactly how they work, but rather to showcase an interesting use case. You can find more info about Power BI field parameters in the official blog post, but also here, here, and here. The use case I’m talking about is slowly changing dimensions of Type 2, you know, the one where we insert a record for every change. Often, I also include an extra column for each column whose history we’re tracking: the “current value column”. For example, if we keep history of the department for an employee, I have a column “CurrentDepartment”. If a type 2 change occurs, the values of this column are updated to the last known value for this dimension member. This allows us to answer different types of questions, because sometimes users are interested in the historical values, but sometimes they just want to know the current value.

Read on for the use case as well as how you might combine field parameters with the idea of current values on type-2 slowly changing dimensions.
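To make the idea concrete, here’s a minimal T-SQL sketch of a type 2 dimension carrying a “current value column.” The names (DimEmployee, Department, CurrentDepartment, IsCurrent) are illustrative assumptions, not Koen’s actual schema:

```sql
-- Hypothetical type 2 dimension: one row per employee per department change,
-- plus a denormalized "current value" column maintained on every change.
CREATE TABLE dbo.DimEmployee
(
    EmployeeKey       INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    EmployeeID        INT          NOT NULL,          -- business key
    Department        VARCHAR(50)  NOT NULL,          -- historical value
    CurrentDepartment VARCHAR(50)  NOT NULL,          -- latest known value
    ValidFrom         DATE         NOT NULL,
    ValidTo           DATE         NULL,              -- NULL = open row
    IsCurrent         BIT          NOT NULL
);

-- After closing the old row and inserting the new one, sync
-- CurrentDepartment across all rows for each dimension member.
UPDATE d
SET    d.CurrentDepartment = c.Department
FROM   dbo.DimEmployee AS d
JOIN   dbo.DimEmployee AS c
       ON  c.EmployeeID = d.EmployeeID
       AND c.IsCurrent  = 1;
```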


Finding Duplicates in Type 2 SCDs

Dinesh Asanka wants to verify some Type 2 slowly changing dimension results:

As we discussed in a previous article, Implementing Slowly Changing Dimensions (SCDs) in Data Warehouses, there are three main types of slowly changing dimensions: Type 1, Type 2, and Type 3. Of these, Type 1 is the simplest, where you maintain only the latest version of the attribute. For example, if an employee is promoted from Software Engineer to Senior Software Engineer, you simply overwrite the existing value with the new value, so the historical aspect is lost.

Type 2 Slowly Changing Dimensions are used to track historical data in a data warehouse. This is the most common approach for dimensions. This article uses AdventureWorksDW, the sample data warehouse database.

Click through for one way to compare, one which you could build using dynamic SQL.
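As a rough sketch of what such a verification could look like (assuming hypothetical column names like EmployeeID, IsCurrent, ValidFrom, and ValidTo, not Dinesh’s exact query), you might check for members with multiple current rows or overlapping validity ranges:

```sql
-- Sketch: flag dimension members that have more than one "current" row,
-- which usually means the SCD load failed to close old rows correctly.
SELECT EmployeeID,
       COUNT(*) AS CurrentRowCount
FROM   dbo.DimEmployee
WHERE  IsCurrent = 1
GROUP  BY EmployeeID
HAVING COUNT(*) > 1;

-- Sketch: find overlapping validity ranges for the same business key.
SELECT a.EmployeeID,
       a.EmployeeKey,
       b.EmployeeKey AS OverlappingKey
FROM   dbo.DimEmployee AS a
JOIN   dbo.DimEmployee AS b
       ON  a.EmployeeID  = b.EmployeeID
       AND a.EmployeeKey < b.EmployeeKey
       AND a.ValidFrom   < ISNULL(b.ValidTo, '9999-12-31')
       AND b.ValidFrom   < ISNULL(a.ValidTo, '9999-12-31');
```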


Optimizing Hive Performance with Tez

Jay Desai has some recommendations around tuning Tez queries:

Tuning Hive on Tez queries can never be done in a one-size-fits-all approach. Query performance depends on the size of the data, file types, query design, and query patterns. During performance testing, evaluate and validate configuration parameters and any SQL modifications. It is advisable to make one change at a time during performance testing of the workload, and it is best to assess the impact of tuning changes in your development and QA environments before using them in production. Cloudera WXM can assist in evaluating the benefits of query changes during performance testing.

Click through for several configuration and query considerations.
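For a flavor of the knobs involved, here is a HiveQL sketch of session-level settings that commonly come up in Hive-on-Tez tuning. The values are placeholders to evaluate one at a time, not recommendations:

```sql
-- Common Hive-on-Tez tuning parameters (placeholder values; test each
-- change individually against your own workload).
SET hive.execution.engine=tez;

-- Container sizing: hive.tez.container.size is in MB and should fit
-- within yarn.scheduler.maximum-allocation-mb.
SET hive.tez.container.size=4096;
SET hive.tez.java.opts=-Xmx3276m;          -- roughly 80% of container size

-- Split grouping controls mapper parallelism (values in bytes).
SET tez.grouping.min-size=16777216;        -- 16 MB
SET tez.grouping.max-size=1073741824;      -- 1 GB

-- Join strategy: allow small tables to be broadcast as map joins.
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask.size=1000000000;

-- Reducer sizing.
SET hive.exec.reducers.bytes.per.reducer=67108864;  -- 64 MB
```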


Things Not to Include in Data Warehouses

Erik Darling compiles a list:

This is a list of things I see in data warehouses that make me physically ill:

– Unique constraints of any kind: Primary Keys, Indexes, etc. Make things unique during your staging process. Don’t make your indexes do that work.

Read on for the full list. I agree with everything except clustered row-store indexes. Those make a lot of sense on dimension tables, tied to the Kimball-style surrogate keys you create in the warehouse itself.

The other part I disagree with is non-clustered columnstore indexes, which I’ve rarely found good use for. Clustered columnstore indexes are outstanding, but the non-clustered variety…meh at best. This answer comes primarily because the pattern I tend to use for warehouse queries is to drive from the fact table, aggregate as much as I can there, and connect to the dimensions for further information at the end. If your warehouse access patterns differ radically from this, you might get more out of non-clustered columnstore indexes. Maybe.
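Here’s a brief T-SQL sketch of the pattern I mean, with made-up table names: a clustered rowstore index on the dimension’s surrogate key, a clustered columnstore index on the fact, and a query that aggregates on the fact first:

```sql
-- Dimension: clustered rowstore index on the Kimball-style surrogate key.
CREATE TABLE dbo.DimDate
(
    DateKey INT  NOT NULL,
    [Date]  DATE NOT NULL,
    CONSTRAINT PK_DimDate PRIMARY KEY CLUSTERED (DateKey)
);

-- Fact: clustered columnstore index drives the aggregation work.
CREATE TABLE dbo.FactSales
(
    DateKey     INT           NOT NULL,
    ProductKey  INT           NOT NULL,
    SalesAmount DECIMAL(19,4) NOT NULL,
    INDEX CCI_FactSales CLUSTERED COLUMNSTORE
);

-- Aggregate on the fact first, then join to dimensions for labels.
SELECT d.[Date], s.TotalSales
FROM  (SELECT DateKey, SUM(SalesAmount) AS TotalSales
       FROM   dbo.FactSales
       GROUP  BY DateKey) AS s
JOIN   dbo.DimDate AS d
       ON d.DateKey = s.DateKey;
```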


S3 and Redshift Data Movement with Role Chaining

Sudipta Mitra, et al, talk AWS security:

This post presents an approach that you can apply at scale to achieve fine-grained access controls to resources in S3 buckets and Amazon Redshift schemas for tenants, from groups of users belonging to the same business unit down to the individual user level. This solution provides tenant isolation and data security. In this approach, we use the bridge model to store data and control access for each tenant at the individual schema level in the same Amazon Redshift database. We utilize ASSUMEROLE and role chaining to provide fine-grained access control when data is being copied and unloaded between Amazon Redshift and Amazon S3, so the data flows within each tenant’s namespace. Role chaining also streamlines the new tenant onboarding process.

Read on for an overview and tutorial.
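For a sense of the mechanics, here is a sketch of role chaining with ASSUMEROLE in Amazon Redshift. The account IDs, role names, user, and table names are all placeholders:

```sql
-- Grant a user permission to assume a chained pair of roles, scoped to
-- COPY only (placeholder ARNs; the two roles form the chain).
GRANT ASSUMEROLE
ON 'arn:aws:iam::111111111111:role/RedshiftRole,arn:aws:iam::222222222222:role/TenantARole'
TO tenant_a_user
FOR COPY;

-- COPY using the same role chain, keeping the data flow inside the
-- tenant's namespace.
COPY tenant_a.sales
FROM 's3://tenant-a-bucket/sales/'
IAM_ROLE 'arn:aws:iam::111111111111:role/RedshiftRole,arn:aws:iam::222222222222:role/TenantARole'
FORMAT AS PARQUET;
```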


Understanding the Data Lakehouse

Tom Jordan explains what data lakehouses are:

When we are thinking about data platforms, there are many different services and architectures that can be used – sometimes this can be a bit overwhelming! Data warehouses, data models, data lakes and reports are all typical components of an enterprise data platform, which have different uses and skills required. However, in the past few years a new architecture has been rising; the data lakehouse. This is an architecture that borrows ideas and concepts from several different areas, which we will be exploring in greater detail in this blog.

Click through to learn more about the origin of the term and how the lakehouse draws from, and differs from, both a data lake and a data warehouse.


Role-Based Access Controls in Redshift

Milind Oke, et al, describe RBAC in Amazon Redshift:

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze all your data to derive holistic insights about your business and your customers. One of the challenges with security is that enterprises don’t want to have a concentration of superuser privileges amongst a handful of users. Instead, enterprises want to design their overarching security posture based on the specific duties performed via roles and assign these elevated privilege roles to different users. By assigning different privileges to different roles and assigning these roles to different users, enterprises can have more granular control of elevated user access.

In this post, we explore the role-based access control (RBAC) features of Amazon Redshift and how you can use roles to simplify managing the privileges required by your end users. We also cover new system views and functions introduced alongside RBAC.

Read on to learn about system-defined roles as well as creating user-customizable roles.
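As a quick sketch of the Redshift RBAC syntax (the schema, role, and user names here are made up):

```sql
-- Create a role and grant it read-only access to one schema.
CREATE ROLE sales_read_only;

GRANT USAGE ON SCHEMA sales TO ROLE sales_read_only;
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO ROLE sales_read_only;

-- Assign the role to a user; system-defined roles (e.g. sys:monitor)
-- can be granted the same way.
GRANT ROLE sales_read_only TO alice;

-- System views such as SVV_ROLES and SVV_ROLE_GRANTS show what exists.
SELECT * FROM svv_roles;
```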


Lakehouse, Mesh, and Fabric

James Serra is back in blue:

(NOTE: I have returned to Microsoft and am working as a Solution Architect in Microsoft Industry Solutions, formerly known as Microsoft Consulting Services (MCS), where I help customers build solutions on Azure. Contact your Microsoft account executive for more info. That being said: the views and opinions in this blog are mine and not those of Microsoft).

There certainly has been a lot of discussion lately on the topic of Data Lakehouse, Data Mesh, and Data Fabric, and how they compare to the Modern Data Warehouse. There is no clear definition of all these data architectures, and I have created a presentation using my own take that I have been presenting frequently internally at Microsoft and externally to customers and at conferences. Hopefully these presentations, blog posts, and videos can help clarify all these data architectures for you:

Click through for several useful resources to help differentiate these topics.
