Press "Enter" to skip to content


Notes on Linear Markov Chains

John Mount has some thoughts for us:

I want to collect some “great things to know about linear Markov chains.”

For this note we are working with a Markov chain on states that are the integers 0 through k (k > 0). A Markov chain is an iterative random process with time tracked as an increasing integer t, and the next state of the chain depending only on the current (soon to be previous) state. For our linear Markov chain the only possible next states from state i are: i (called a “self loop” when present), i+1 (called up or right), and i-1 (called down or left). In no case does the chain progress below 0 or above k.
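
This is not Mount’s code, but a minimal sketch of the setup in Python (the transition probabilities are invented for illustration; one common boundary convention, used below, is that a blocked move at 0 or k becomes a self loop):

```python
import random

def step(i, k, p_up=0.4, p_down=0.4):
    """One transition of the linear chain from state i in [0, k]."""
    r = random.random()
    if r < p_up:
        return min(i + 1, k)  # up/right; blocked at the top wall
    if r < p_up + p_down:
        return max(i - 1, 0)  # down/left; blocked at the bottom wall
    return i                  # self loop

def walk(k=10, start=0, steps=1000):
    """Run the chain for a fixed number of steps and return the final state."""
    i = start
    for _ in range(steps):
        i = step(i, k)
    return i

print(walk())
```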

Click through for notes on two variants of this sort of linear Markov chain, as well as a pair of appendices containing derivation notes and Python code.


Reversion to the Mean

Holger von Jouanne-Diedrich explains an important statistical concept we all too often forget:

In the realm of business and leadership, one statistical phenomenon often goes unrecognized yet significantly influences our understanding of performance and success. This is the concept of reversion to the mean (also called regression to the mean). This seemingly simple statistical occurrence can profoundly impact how we perceive management strategies, leadership effectiveness, and even the fate of those gracing the covers of prominent magazines. To understand what is going on, read on!

Read on for a video in German and an article in English, with some bonus R code to sell the story.
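
As a rough illustration of the phenomenon (a simulation sketch with invented numbers, not the code from the article), model performance as stable skill plus independent luck and watch the top performers fall back:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

skill = rng.normal(100, 10, n)           # stable ability
year1 = skill + rng.normal(0, 10, n)     # ability plus this year's luck
year2 = skill + rng.normal(0, 10, n)     # same ability, fresh luck

# The "magazine cover" performers: the top 5% in year one.
top = year1 >= np.quantile(year1, 0.95)

print(f"Top performers, year 1: {year1[top].mean():.1f}")  # far above 100
print(f"Same people, year 2:    {year2[top].mean():.1f}")  # partway back toward 100
```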


Fabric and Databricks

A rare two-part compare and contrast!

First, Chen Hirsh directly contrasts Microsoft Fabric and Databricks:

Microsoft recently announced the general availability of Microsoft Fabric, which contains all (or most) cloud data analytics services from Microsoft. This is a good opportunity to compare it with another popular data platform, which is also available in Azure (and other cloud services) – Databricks.

Before we start, I should note that Fabric is quite new, and it’s still hard to evaluate its performance and stability. Also, both products have many features, and I only try to discuss the main differences.

Then, Eugene Meidinger clues us in on the similarities:

One of the things that helps in understanding Fabric is that it’s heavily influenced by Databricks. It’s built on Delta Lake, which was created and open-sourced by Databricks in 2019. You are encouraged to use a medallion architecture, which, as far as I can tell, comes from Databricks.

You will be a lot less frustrated if you realize that much of what’s going on with Fabric is a blend of open-source formats and protocols, but also a combination of the idiosyncrasies of Databricks and then those of Microsoft. David Gomes has a good post about data lake file formats, and it’s interesting to imagine the parallel universe where Fabric is built on Iceberg (which is also based on Parquet files) instead of Delta Lake. (Note: I found this post via this week’s issue of Brent Ozar’s newsletter.)


Joining Tables without Relationships in DAX

Marco Russo and Alberto Ferrari have ended a relationship:

In a previous article, we saw several examples of using the NATURALLEFTOUTERJOIN and NATURALINNERJOIN functions in DAX. In that article, we saw how to join tables in the data model using existing relationships. However, if we want to join tables in the model without using the relationships in the model, we must eliminate the data lineage of the columns to use in the join condition. This article describes how to achieve this.
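
The article has the full treatment; as a minimal sketch of the idea (table and column names invented), wrapping the join columns in a no-op expression such as + 0 strips their lineage, so NATURALINNERJOIN matches them by name rather than by relationship:

```dax
EVALUATE
NATURALINNERJOIN (
    SELECTCOLUMNS (
        Sales,
        "ProductKey", Sales[ProductKey] + 0,  -- + 0 eliminates the data lineage
        "Quantity", Sales[Quantity]
    ),
    SELECTCOLUMNS (
        'Product',
        "ProductKey", 'Product'[ProductKey] + 0,
        "Product Name", 'Product'[Product Name]
    )
)
```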

I suppose my question here is, when would you want to do this? I’m sure there’s a good reason to, but I’m just as sure that I don’t know it.


Common Warehouse Load Patterns

Ben Johnston continues a series on warehouse load patterns:

This continues and finishes my two-part series on warehouse load patterns. There are many methods to transfer rows between systems from a basic design perspective. This isn’t specific to any ETL tool but rather covers the basic patterns for moving data. The most difficult part in designing a pattern is efficiency. It has to be accurate and not adversely impact the source system, but this is all intertwined with and dependent on efficiency. You only want to move the rows that have changed or been added since the previous ETL execution – the deltas. This reduces the network load, the source system load (I/O, CPU, locking, etc.), and the destination system load. Being efficient also improves speed and, as a direct result, increases the potential frequency of each ETL run, which has a direct impact on business value.

The pattern you select depends on many things. The previous part of the series covers generic design patterns and considerations for warehouse loads that can be applied to most of the ETL designs presented below. This section covers patterns I have used in various projects. I’m sure there are some patterns I have missed, but these cover the most used types that I have seen. These are not specific to any data engine or ETL tool, but the examples use SQL Server as a base for functionality considerations. Design considerations, columns available, administrative support, DevOps practices, reliability of systems, and cleanliness of data all come into consideration when choosing your actual ETL pattern.

Click through for a compendium of common patterns you can use to indicate that a row should go into a warehouse.
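
As one example of the genre, here is a minimal sketch (tables and columns invented) of a common delta pattern: a high-water mark against a rowversion column, so each run moves only the rows changed since the last one:

```sql
-- Read the watermark recorded by the previous successful run.
DECLARE @LastVersion binary(8) =
    (SELECT MaxRowVersion FROM etl.LoadWatermark WHERE TableName = 'dbo.Orders');

-- Pull only rows inserted or updated since then.
SELECT o.OrderID, o.CustomerID, o.OrderDate, o.OrderTotal
FROM dbo.Orders AS o
WHERE o.RowVer > @LastVersion;

-- After a successful load, advance the watermark.
UPDATE etl.LoadWatermark
SET MaxRowVersion = (SELECT MAX(RowVer) FROM dbo.Orders)
WHERE TableName = 'dbo.Orders';
```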


A Focus on TRY_PARSE()

Andy Brownsword takes a closer look at TRY_PARSE():

In the previous post we looked at the functions TRY_CAST, TRY_CONVERT, and TRY_PARSE and how they compared. I wrapped up and said that my preference for new developments would be to use TRY_PARSE due to the tighter control which it provides us.

As with everything in SQL Server, however, there is no ‘best’ approach – it depends. I therefore wanted a separate post to look into the specifics of TRY_PARSE and areas where it may work more or less effectively.

Andy hits both the good and the bad of TRY_PARSE(), and I recommend checking out this post. It’s great for parsing one row or a small number of rows (call it 5-10K or so), and really bad at parsing large numbers of rows.
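
If you haven’t worked with it, the culture parameter is where that tighter control comes from. A quick illustration (values invented):

```sql
SELECT
    TRY_PARSE('13/01/2024' AS date USING 'en-GB') AS uk_date,   -- 2024-01-13
    TRY_PARSE('13/01/2024' AS date USING 'en-US') AS us_date,   -- NULL: there is no month 13
    TRY_PARSE('1.234,50' AS money USING 'de-DE') AS de_amount;  -- 1234.50
```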


Fabric Data Pipeline for Blob Storage CSV into Azure SQL DB

Andy Leonard loads some data:

In November 2023, I shared how to start learning Microsoft Fabric in a post titled Start a Fabric Free Trial. In December 2023, I shared how to Create a Workspace in Fabric. In this post, I document one way to create a pipeline to load data from a CSV file stored in Azure Blob Storage to Azure SQL Database in your new Fabric workspace.

Click through for some key assumptions, as well as the process.


Thoughts on Temp Tables in Stored Procedures

Erik Darling has preferences:

I probably overuse temp tables. I’m the first one to admit it. But often when I choose to use one, it’s a choice between:

  • Spending a long time trying to understand business logic for a company I don’t work for
  • Tinkering with query syntax and using really weird tricks and hints to get different plan shapes
  • Waiting a long time to create or tweak indexes on huge tables (usually on “dev” hardware)

Erik provides a good example of where a temp table fits well, contrasting it to a common table expression that fares poorly. My bias tends to come in the opposite direction from Erik’s: I’m susceptible to playing query golf and getting everything I need back in one optimized call. But part of craftsmanship is knowing how and when to use each tool.
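
A generic sketch of the pattern (schema and threshold invented): materialize the intermediate result once, index it, and let later statements work from real statistics rather than the optimizer’s guess at a CTE’s cardinality.

```sql
-- Materialize the expensive intermediate result a single time.
SELECT o.CustomerID, SUM(o.OrderTotal) AS TotalSpend
INTO #BigSpenders
FROM dbo.Orders AS o
GROUP BY o.CustomerID
HAVING SUM(o.OrderTotal) > 10000;

-- Index it to support the joins that follow.
CREATE CLUSTERED INDEX cx_bigspenders ON #BigSpenders (CustomerID);

-- Subsequent queries join to a small, indexed table with its own statistics.
SELECT c.CustomerName, b.TotalSpend
FROM #BigSpenders AS b
JOIN dbo.Customers AS c ON c.CustomerID = b.CustomerID;
```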
