Press "Enter" to skip to content

Author: Kevin Feasel

The Dunder Mifflin Data Set

Tim Mitchell has a new data set for us:

I’ve been a fan of Dunder Mifflin ever since I first learned about this small midwestern paper company. Over the years I’ve gotten to know their people and processes, following from a distance their successes, failures, and various adventures. Who would have known the paper business would be so interesting?

Based on what I learned about this company, I built this Dunder Mifflin data set based on the old Northwind structure, adapting it to meet the needs of this small paper company. It includes most of the employees, regional locations (both current and now-closed), and has a modestly-sized set of sales data for demos and testing.

Check out Tim’s GitHub repo and give it a try.
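
If the schema follows the Northwind conventions it was adapted from, a quick smoke test might look something like the query below. The table and column names are assumptions on my part, not something pulled from the repo, so verify them against the actual schema.

-- Top salespeople by total sales; object names assumed from Northwind conventions.
SELECT TOP (10)
    e.FirstName,
    e.LastName,
    SUM(od.Quantity * od.UnitPrice) AS TotalSales
FROM dbo.Orders AS o
    INNER JOIN dbo.OrderDetails AS od
        ON od.OrderID = o.OrderID
    INNER JOIN dbo.Employees AS e
        ON e.EmployeeID = o.EmployeeID
GROUP BY e.FirstName, e.LastName
ORDER BY TotalSales DESC;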

Power BI Desktop: Analyze in Excel

Marco Russo has gone and done it now:

Less than a month ago, Microsoft introduced the External Tools feature in the Power BI Desktop July 2020 release. By using DAX Studio, you were already able to create a PivotTable in Excel connected to the model hosted by Power BI Desktop. However, this would require three clicks (DAX Studio / Advanced / Excel). This is why I thought the External Tools feature was something many users would like to use without having to open – or even install – a larger tool like DAX Studio.

It’s interesting to see what the community has made so far from the External Tools feature.

The Performance Hit of Disabling the Identity Cache

Tibor Karaszi explains why you probably want to keep identity caching on:

Should you care about the gap? In most cases: no. The identity value should be meaningless. In many cases I think that it is just an aesthetic issue to not have these gaps. (I’ve seen cases where you do run into problems because of the gap, I should add – but not frequently.)

For the SEQUENCE object, we have the CACHE option to specify how many values to cache. I.e., max values we can jump if we have a hard shutdown.

For identity, we have the IDENTITY CACHE database scoped configuration, introduced in SQL Server 2017. Caching on or off. On is default. We also have trace flag 272, at the instance level.

However, disabling the caching isn’t free. 

In an ideal world, there are zero cases where you care about the gap. Identity integers and sequences are surrogate keys, and “surrogate” here means that it has no inherent business value—otherwise it’d be a natural key. Subsequently ascribing value to it is folly, and if you are in a scenario in which you need guaranteed sequences which always increase by exactly 1 and never have gaps (think something like check numbers or invoice numbers, things which accountants really want to see in a fixed order), identity integers and sequences aren’t the right tools for you.

But read on to see how much faster caching of identity values can make insert performance.
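
For reference, here is a minimal sketch of the knobs Tibor mentions. The configuration and SEQUENCE syntax are standard; the sequence name is just an example.

-- Database scoped configuration (SQL Server 2017 and later); ON is the default.
ALTER DATABASE SCOPED CONFIGURATION SET IDENTITY_CACHE = OFF;

-- On older versions, trace flag 272 disables identity caching at the instance level.
-- DBCC TRACEON (272, -1);

-- For a SEQUENCE, CACHE controls how many values can be lost (how far the
-- sequence can jump) after a hard shutdown.
CREATE SEQUENCE dbo.DemoSequence
    AS int
    START WITH 1
    INCREMENT BY 1
    CACHE 50;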

Modifying Graph Edges with T-SQL

Louis Davidson shows how you can update edges in SQL Server’s graph functionality:

As I have been writing a section on SQL Server graph tables in my Database Design book, (and prepping for a hopeful book on the subject next year), I find that there are a few really annoying things about dealing with graph tables. This blog serves to clear up the first, most annoying of them. Inserting, updating, and deleting edges.

Because the key values in the graph database structures are hidden, you can’t just insert a new edge without translating your table’s key values to the graph database internal values. Edges aren’t even available for an update of the from or to references. As I wrote stored procedures to do this, I realized “why not use a view and trigger to make this happen”. So I did. The result is that I can insert, delete, and even update graph tables using normal SQL syntax. What makes this better than the stored procedure is that I can insert multiple rows simultaneously.

Read on to see what this entails.
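
To make the friction concrete, here is a minimal sketch of a plain edge insert using hypothetical Person and Follows tables, not anything from Louis’s post. The point is that you must translate your own key values into the hidden $node_id values before you can write the $from_id / $to_id pair.

-- Hypothetical node and edge tables, purely for illustration.
CREATE TABLE dbo.Person (PersonId int PRIMARY KEY, PersonName nvarchar(100)) AS NODE;
CREATE TABLE dbo.Follows AS EDGE;

-- Inserting an edge: look up the internal $node_id for each endpoint,
-- then write the $from_id / $to_id pair.
INSERT INTO dbo.Follows ($from_id, $to_id)
SELECT p1.$node_id, p2.$node_id
FROM dbo.Person AS p1
    CROSS JOIN dbo.Person AS p2
WHERE p1.PersonId = 1
  AND p2.PersonId = 2;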

The Downside of EAV-Style Measures in Power BI

Chris Webb explains why you should try to stick to the fact-dimensional model in Power BI:

In this fact table the dimension keys remain the same, but the Value column stores all the data from the Sales, Tax and Volume Sold measures in the original table and the Measure Name column tells you what type of measure value is stored on any given row. Let’s call this approach the Measures Dimension approach.

There are some advantages to building fact tables using the Measures Dimension approach, for example:

– You can now use a slicer in a report to select the measures that appear in a visual
– You can now easily add new measures without having to add new columns in your fact table
– You can use row-level security to control which measures a user has access to

Generally speaking, though, any time you deviate from a conventional dimensional model you risk running into problems later on and this is no exception. Let’s go through the disadvantages of modelling data using a Measures Dimension.

Read on for several good reasons (and yes, “things are formatted wrong” is a good reason!).
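
Just to illustrate the shape Chris is describing, here is the same reshaping in T-SQL, with hypothetical table and column names; Chris’s own examples are in Power BI.

-- A conventional fact table: one column per measure (names are hypothetical).
CREATE TABLE dbo.FactSales
(
    DateKey int NOT NULL,
    ProductKey int NOT NULL,
    Sales decimal(18, 2) NOT NULL,
    Tax decimal(18, 2) NOT NULL,
    VolumeSold decimal(18, 2) NOT NULL
);

-- The Measures Dimension layout: one row per measure, with the measure name
-- in its own column and every value collapsed into a single value column.
SELECT
    DateKey,
    ProductKey,
    MeasureName,
    MeasureValue
FROM dbo.FactSales
UNPIVOT
(
    MeasureValue FOR MeasureName IN (Sales, Tax, VolumeSold)
) AS unpvt;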

Multi-Armed Bandit Problems

Brian Amadio takes us through one of my favorite classes of problem:

Multi-armed bandits have become a popular alternative to traditional A/B testing for online experimentation at Stitch Fix. We’ve recently decided to extend our experimentation platform to include multi-armed bandits as a first-class feature. This post gives an overview of our experimentation platform architecture, explains some of the theory behind multi-armed bandits, and finally shows how we incorporate them into our platform.

The post gives a good explanation of the concept, as well as the implementation strategy.

Domain Models: Purity vs Completeness

Vladimir Khorikov reflects on domain modeling:

This is an example of a rich domain model: all business rules (also known as domain logic) are located in the domain classes. There’s one such rule currently — that we can only assign to the user an email that belongs to the corporate domain of that user’s company. There’s no way for the client code to bypass this invariant — a hallmark of a highly encapsulated domain model.

We can also say that our domain model is complete. A complete domain model is a model that contains all the application’s domain logic. In other words, there’s no domain logic fragmentation.

Domain logic fragmentation is when the domain logic resides in layers other than the domain layer. In our example, the UserController (which belongs to the application services layer) doesn’t contain any such logic; it serves solely as a coordinator between the domain layer and the database.

Domain modeling doesn’t land on too many database administrators’ doorsteps, but I enjoyed the article.

Fixing Availability Group Issues with Alerts

Wayne Sheffield automates a few problems away:

The first of the Availability Group issues to discuss is that, for whatever reason, data is no longer moving between the primary replica and a secondary replica. This puts the Data Movement in a Suspended state.

If the data movement remains suspended for too long, you might have to take some undesired actions to get things back in sync. Things like removing the database from the AG, restoring log files, then reattaching it to the AG. When the data movement becomes suspended, we want to get it flowing again as soon as possible. Let’s have SQL Server try to get the data flowing again.

Read on for more, including a second issue that Wayne helps solve.
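
Wayne’s post has the full implementation; as a rough sketch of the pieces involved (the database name is made up, and the error number is worth double-checking against sys.messages):

-- Error 35264 fires when AG data movement is suspended; verify the number
-- against sys.messages on your instance.
EXEC msdb.dbo.sp_add_alert
    @name = N'AG Data Movement Suspended',
    @message_id = 35264,
    @severity = 0;

-- The remediation itself, run on the replica where movement is suspended
-- (database name is just an example).
ALTER DATABASE [SalesDB] SET HADR RESUME;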

Historical Dimensions in a Kimball-Style Model

Vince Iacoboni takes a stab at improving the Kimball model:

We owe a lot to Ralph Kimball and friends. His practical warehouse design and conformed-dimension bus architecture are the industry standard. Business users can understand and query these warehouses directly and gain valuable insights into the business. Kimball’s practical approach focuses squarely on clarity and ease of use for the business users of the warehouse. Kudos to you and yours, Mr. Kimball.

That said, can the mainstay Type 2 slowly changing dimension be improved? I here present the concept of historical dimensions as a way to solve some issues with the basic Type 2 slowly changing dimension promoted by Kimball. As we will see, clearly distinguishing between current and past dimension values pays off in clarity of design, flexibility of presentation, and ease of ETL maintenance.

As I was reading this, I was thinking “This sounds like a type 4 SCD,” and Vince walks us through the differences between the two ideas. I’m not absolutely sold on the idea, but it is certainly interesting.
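
For anyone who has not run into type 4 before, the basic shape is a current-values dimension plus a separate history table. A minimal sketch, with table and column names of my own rather than Vince’s:

-- Current values only: one row per customer, always the latest attributes.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey int NOT NULL PRIMARY KEY,
    CustomerName nvarchar(100) NOT NULL,
    Region nvarchar(50) NOT NULL
);

-- Full history: one row per customer per version, bracketed by effective dates.
CREATE TABLE dbo.DimCustomerHistory
(
    CustomerHistoryKey int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    CustomerKey int NOT NULL,
    CustomerName nvarchar(100) NOT NULL,
    Region nvarchar(50) NOT NULL,
    EffectiveFrom date NOT NULL,
    EffectiveTo date NULL
);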

Raw Data in the Data Lake

Steve Cardella uses wrestling as a metaphor where I would have used sewage:

Raw. Unfiltered. Data. The raw zone – it’s the dark underbelly of your data lake, where anything can happen. The CRM data just body-slammed the accounting data, while the HR data is taking a chair to the marketing data. It’s all a rumble for the championship belt, right? Oh, wait – we’re talking data lakes. Sorry. If the raw zone isn’t where data goes to duke it out, then what is the raw zone of a data lake? How should it be set up?

First, let’s take a time-out to give some context. A data lake is a central storage pool for enterprise data; we pour information into it from all kinds of sources. Those sources might include anything from databases to raw audio and video footage, in unstructured, semi-structured, and structured formats. A data warehouse, conversely, only houses structured data. The data lake is divided into one or more zones of data, with varying degrees of transformation and cleanliness (see this video for more: Data Lake Zones, Topology, and Security). The raw zone is the foundation upon which all other data lake zones are built.

Read on to understand the importance of raw data in a data lake, and the equal importance of making sure end users don’t see that stuff very often. Also, Steve gets bonus points for using my favorite term for the Aristotelian opposite of a data lake: the data swamp.
