
Category: Data

The Data Mesh

James Serra explains what a data mesh is:

Its goal is to treat data as a product, with each source having its own data product manager/owner (who are part of a cross-functional team of data engineers) and being its own clearly-focused domain that has an autonomous offering, becoming the fundamental building blocks of a mesh, leading to a domain-driven distributed architecture. Note that for performance reasons, you could have a domain that aggregates data from multiple sources. Each domain should be discoverable, addressable, self-describing, secure (governed by global access control), trustworthy, and interoperable (governed by an open standard). Each domain will store its data in a data lake and in many cases will also have a copy of some of the data in a relational database (see Data Lakehouse defined for why you still want a relational database in most cases).

I’ll have to think more about this before I’m convinced. I’ll also need to think about the Aristotelian opposite of the data mesh.

Comments closed

Data Quality Monitoring with SQL

Ryan Kearns and Barr Moses walk us through key principles for monitoring data quality in a relational database:

Next, we want to assess the field-level, distributional health of our data. Distribution tells us all of the expected values of our data, as well as how frequently each value occurs. One of the simplest questions is, “how often is my data NULL”? In many cases, some level of incomplete data is acceptable — but if a 10% null rate turns into 90%, we’ll want to know.

This covers a couple of examples around data freshness and completeness, and I appreciate the level of detail in here. Nothing is earth-shattering, but at the same time, it’s important to have a catalog of the sorts of issues which can pop up. H/T Mark Hutchinson.
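To make the null-rate idea concrete, here is a minimal sketch in Python using the standard library's sqlite3 module. It is not the authors' code: the table name `readings`, the columns `day` and `value`, and the 50% alert threshold are all hypothetical choices for illustration.

```python
import sqlite3


def null_rate_by_day(conn, table, column, date_column):
    """Return [(day, null_rate), ...] for `column`, grouped by `date_column`."""
    # Identifiers cannot be bound as query parameters, so they are
    # interpolated here. Only pass trusted table and column names.
    query = f"""
        SELECT {date_column} AS day,
               AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
        FROM {table}
        GROUP BY {date_column}
        ORDER BY {date_column}
    """
    return conn.execute(query).fetchall()


def flag_null_spikes(rates, threshold=0.5):
    """Return the days whose null rate exceeds `threshold` (an assumed cutoff)."""
    return [day for day, rate in rates if rate > threshold]


def build_demo():
    """Create a hypothetical in-memory table: 10% nulls on day one, 90% on day two."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (day TEXT, value REAL)")
    rows = (
        [("2020-01-01", 1.0)] * 9 + [("2020-01-01", None)]
        + [("2020-01-02", None)] * 9 + [("2020-01-02", 2.0)]
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    rates = null_rate_by_day(build_demo(), "readings", "value", "day")
    print(rates)
    print(flag_null_spikes(rates))
```

A fixed threshold is the simplest possible rule. In practice you would probably compare each day against a recent baseline instead.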


Automating Data Comparison using Biml

Ben Weissman gives us an example of running data comparison using Biml:

When you build an ETL Solution, at some point you will most likely feel the need to compare the data between your source and your staging (or datawarehouse) database. There may be various reasons for them to be out of sync, such as delta loads, aggregations or added business logic, but one day your phone will ring and whoever is on the other end will tell you that the numbers are wrong. While this doesn’t necessarily have to be an issue within your ETL process, it might just as well be the report itself. In many cases, this is a good starting point to look at.

This article (formerly published on PASS Community Blog) focusses on SSIS as your orchestrator, but the same principles could, obviously, also be applied to Azure Data Factory for example. Also, as we want to solve this task as lightweight as possible, we will, demonstrably, use Biml to implement it!

Click through to see how.
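Ben's implementation uses Biml and SSIS, so click through for that. Purely to illustrate the underlying idea of comparing source and staging, here is a rough Python sketch with sqlite3 that compares a row count and a column total between two tables. The table names `source_sales` and `staging_sales` and the `amount` column are hypothetical.

```python
import sqlite3


def table_profile(conn, table, amount_column):
    """Return a row count and a column total for one table."""
    # Identifiers are interpolated, so only pass trusted names.
    row = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM({amount_column}), 0) FROM {table}"
    ).fetchone()
    return {"row_count": row[0], "amount_total": row[1]}


def compare_tables(conn, source, staging, amount_column):
    """Return {metric: (source_value, staging_value)} for metrics that differ."""
    src = table_profile(conn, source, amount_column)
    stg = table_profile(conn, staging, amount_column)
    return {
        metric: (src[metric], stg[metric])
        for metric in src
        if src[metric] != stg[metric]
    }


def build_demo():
    """Create hypothetical source and staging tables where staging misses a row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source_sales (id INTEGER, amount INTEGER)")
    conn.execute("CREATE TABLE staging_sales (id INTEGER, amount INTEGER)")
    conn.executemany(
        "INSERT INTO source_sales VALUES (?, ?)", [(1, 100), (2, 250), (3, 75)]
    )
    conn.executemany(
        "INSERT INTO staging_sales VALUES (?, ?)", [(1, 100), (2, 250)]
    )
    return conn


if __name__ == "__main__":
    print(compare_tables(build_demo(), "source_sales", "staging_sales", "amount"))
```

An empty result means the two tables agree on these two metrics only. It does not prove the rows are identical.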


Change Tracking Runthrough

Erik Darling provides a runthrough (which is a walkthrough but at a faster pace) of change tracking in SQL Server:

I’ve been working with CDC and CT way too much, and even I’m annoyed with how much it’s coming out in blog posts.

I’m going to cover a lot of ground quickly here. If you get lost, or there’s something you don’t understand, your best bet is to reference the documentation to get caught up.

Check it out.


Setting up Azure Purview for Power BI

Soheil Bakhshi has a great step-by-step walkthrough for setting up Azure Purview:

Microsoft recently announced the very exciting news that Azure Purview now supports Power BI. This is massive news from a data governance point of view. Azure Purview is the next generation of Azure Data Catalog with more metadata discovery power and the ability to use sensitivity labels. After reading the news, I immediately decided to set up my test environment and give it a go. I followed the steps mentioned in this article on the Microsoft documentation website but I had some difficulty getting it to work. And here we are, another blog post to help you to set up Azure Purview for Power BI.

Click through for a detailed walkthrough.


Azure Purview: External Connections and the Starter Kit

Wolfgang Strasser continues a series on Azure Purview. First up is a look at external connections:

During my tests with Azure Purview I found that in my demo accounts (I created several to test this), the Management menu was missing some items.

Read on to learn why and how you can rectify this. Then, check out Wolfgang’s take on the Starter Kit:

Very often, when I want to test some new services I miss some infrastructure and environments I can start and play with. I am not talking about creating a new Azure Purview account (see my previous blog post – Creating an Azure Purview account) – I am talking about the data infrastructure to analyze, catalog and gain knowledge out of it.

We could start to create such an infrastructure, BUT: the Purview team created a Starter Kit to quickly create a data estate and configure everything so that you can start with Purview within a few minutes.

Read on for one issue (of the self-inflicted variety) Wolfgang ran into during deployment. But it does look like a great way to get started with Purview and build up a relevant demo environment.


Ignoring Bad Dates when Moving to Spark 3

Robert Blackburn shows us one way to handle bad dates when moving to Spark 3:

Moving from a Spark 2 to a Spark 3 runtime has a lot of benefits including big performance improvements through adaptive query execution, dynamic partition pruning, and other optimizations. Some updates may require you to refactor your code. One of them is Delta tables now use the Proleptic Gregorian Calendar. Isn’t a calendar a calendar? Unfortunately, no. The Julian calendar has discrepancies with old dates. Specifically dates before 1582 and timestamps before 1900. Here we will dynamically update these dates for incoming source files.

If you would like to follow along in detail, I have a sample notebook that uses the community edition of Databricks. The DBC Archive file is here and the source file is here.

Fortunately, this change is unlikely to affect most of us, with perhaps the most common issue being that you used 0001-01-01 as a default date.
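Robert's notebook does this in Spark. As a plain-Python illustration of the same idea (not his code), here is a small sketch that swaps out any date earlier than a floor for a replacement value. Using the Gregorian cutover of 1582-10-15 as the floor and `None` as the default replacement are assumptions made for the example, as are the function names.

```python
from datetime import date

# Assumed floor for this sketch: the first day of the Gregorian calendar.
GREGORIAN_CUTOVER = date(1582, 10, 15)


def clean_date(value, floor=GREGORIAN_CUTOVER, replacement=None):
    """Return `value` unchanged unless it is earlier than `floor`."""
    if value is None or value >= floor:
        return value
    return replacement


def clean_rows(rows, date_key, replacement=None):
    """Return copies of `rows` (dicts) with too-old dates in `date_key` replaced."""
    return [
        {**row, date_key: clean_date(row[date_key], replacement=replacement)}
        for row in rows
    ]


if __name__ == "__main__":
    sample = [
        {"id": 1, "start_date": date(1, 1, 1)},      # a typical default date
        {"id": 2, "start_date": date(2020, 12, 6)},
    ]
    print(clean_rows(sample, "start_date"))
```

Whether a too-old date should become null or some sentinel depends on what downstream consumers expect, so the replacement is left as a parameter.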


Creating an Azure Purview Catalog Instance

Wolfgang Strasser wants to try out Azure Purview:

Basics – Resource group, purview account name (this cannot be changed afterwards) and the location.

As of today (2020-12-06), there are only 5 Azure regions you can choose from to store the Purview metadata. But – in-region scanning from 16 other Azure regions is available in the preview (source)

This is part one of a multi-part series, so stay tuned for more.


Introducing Azure Purview

Wolfgang Strasser gives us a once-over on a new service:

Today, at the Azure Data and Analytics event, a new Azure data governance service called Azure Purview (https://aka.ms/AzurePurview) was presented and made available in a public preview.

I have not had a chance to try the actual service, but I found a very interesting video (Microsoft Mechanics video) from which I took the following screenshots.

Read on for Wolfgang’s thoughts. It’s definitely a step up from Azure Data Catalog.


A New Financial Sample Dataset with Power BI

David Eldersveld takes a look at a new sample data set in Power BI Desktop:

With the October 2020 release of Power BI Desktop, Microsoft has incorporated a sample dataset directly into the product. While there are a few commonly used training and demo datasets already, there are benefits to having a basic starter Financials dataset included.

This data can accelerate the learning experience for new users, particularly those who may only need to focus on authoring reports on existing datasets at their organization and don’t need to immediately know much about data modeling.

Click through for David’s mixed review.
