Press "Enter" to skip to content

Category: Data

Data Hubs, Warehouses, and Lakes

Trevor Legg compares and contrasts data hubs, data warehouses, and data lakes:

Data hubs, data warehouses, and data lakes are significant investment areas for data and analytics leaders and are vital to support increasingly complex, distributed, and varied data workloads.

Gartner finds that 57% of data and analytics leaders are investing in data warehouses, 46% are using data hubs, and 39% are using data lakes. However, they also found that these same data and analytics leaders don’t necessarily understand the difference between the three…

To best support specific business requirements, it’s vital to understand the difference and purpose of each type of structure, and the role it can play in modern data management infrastructure.

Click through for the definitions and comparisons.

Power Query Data Profiling

Ed Hansberry takes us through the data profiler in Power Query:

A solid green bar is usually best. It means there are no issues in that column, as shown in the Discounts field. On the far right in the COGS field, there is a tiny bit of gray. That means there is no data, or a null, in one or more records. The bigger the gray bar, the higher the percentage of nulls. In the middle, though, we have a problem. Anything in red is an error and must be fixed. These may cause more errors in further transformations downstream, and will usually cause refresh errors.

Before we address this, let’s get a bit more information about our data. Go to the View ribbon, and there are 3 settings we want to look at.

I really like what the data profiler provides us. If you’re a regular Power BI user, I highly recommend checking it out if you haven’t already.
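If you want a rough sense of what those column quality bars measure outside of Power Query, here is a minimal analogue sketched in Python with pandas; the file name and columns are assumptions for illustration, and Power Query's error detection has no direct pandas equivalent.

import pandas as pd

# Hypothetical source file standing in for the query being profiled.
df = pd.read_csv("financials.csv")

# Per-column summary, roughly mirroring the profiler's quality and distribution views.
profile = pd.DataFrame({
    "valid_pct": df.notna().mean() * 100,   # the green portion of the bar
    "empty_pct": df.isna().mean() * 100,    # the gray (null) portion of the bar
    "distinct_values": df.nunique(),        # akin to the column distribution pane
})

print(profile)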

Bulk Copying Lots of Rows into SQL Server

Esat Erkec shows us how to use the Bulk Copy Program (BCP) to bulk load data into SQL Server:

If the installed version is older than the latest version, we can download and install the latest version from the Microsoft website. Working with SQL Server BCP is not very complex, because it runs with only a handful of arguments. The syntax of BCP is as follows:

bcp {table|view|”query”} {out|queryout|in|format} {data_file|nul} {[optional_argument]…}

For example, if we want to export data from a table to a text file, we have to specify the table name, the out option, and the data file. The following command will export the Production.Product table into the specified text file.

bcp AdventureWorks2017.Production.Product out C:\ExportedData\Product.txt -S localhost -T -w

I don’t know if I’m the only person for whom this is true, but the data file format has always been a royal pain for me to get right, to the point where I’d happily build an SSIS package to perform bulk loading over having to use BCP myself.
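For what it's worth, here is a hedged sketch of the import ("in") side driven from Python, including the format file that makes the column layout explicit; the server, table, and file paths are assumptions, and the format file itself would need to be generated beforehand (for example with bcp's format option).

import subprocess

# Assumed paths and server; the format file pins down the column layout
# so bcp does not have to guess it during the import.
bcp_args = [
    "bcp", "AdventureWorks2017.Production.Product", "in",
    r"C:\ExportedData\Product.txt",
    "-S", "localhost",                      # target server
    "-T",                                   # trusted (Windows) authentication
    "-f", r"C:\ExportedData\Product.fmt",   # format file describing the columns
]

result = subprocess.run(bcp_args, capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)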

Drift Monitoring with Azure Machine Learning

I take a look at dataset drift monitoring in Azure Machine Learning:

One of the things I like to say about machine learning models is, “shift happens.” By that, I mean that models lose effectiveness over time due to changes in underlying circumstances. Relationships between variables that used to hold no longer do, and so our model quality degrades. This means that we sometimes need to retrain models.

But there’s a cost to retraining models—that work can be computationally expensive and time-consuming. This concern is particularly salient if you’re in a cloud, as you pay directly for everything there. This means that we don’t want to retrain models unless we need to. But how do we know when we should retrain the model? We can watch for model degradation, but there’s another method: drift detection in your datasets.

Read on for a demonstration of how the product works and a couple of things to keep in mind.
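For a rough sense of what the setup looks like in code, here is a minimal sketch using the azureml-datadrift package. The workspace, dataset names, compute target, feature list, and threshold are all assumptions, and the parameter names reflect my reading of the SDK rather than anything in the linked post.

from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector

# Assumes an existing workspace config plus registered baseline/target datasets.
ws = Workspace.from_config()
baseline = Dataset.get_by_name(ws, "customer-data-baseline")
target = Dataset.get_by_name(ws, "customer-data-current")

# Create a drift monitor that compares the target dataset against the baseline.
monitor = DataDriftDetector.create_from_datasets(
    ws,
    name="customer-data-drift",
    baseline_data_set=baseline,
    target_data_set=target,
    compute_target="cpu-cluster",              # existing compute cluster, assumed
    frequency="Week",                          # how often to evaluate drift
    feature_list=["age", "tenure", "spend"],   # hypothetical columns to watch
    drift_threshold=0.3,                       # alert once drift exceeds this magnitude
)

monitor.enable_schedule()  # start the recurring drift runs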

Getting the Stack Overflow Database for Demos

Brent Ozar has an update on getting the latest version of the Stack Overflow data set:

If you only have a limited amount of bandwidth, you don’t have to keep seeding the database after you get it – I’ve got it hosted on a handful of seedboxes around the world.

As with the original data dump, these are provided under cc-by-sa 4.0 license. That means you are free to share it and adapt it for any purpose, even commercially, but you must attribute it to the original authors (not me):

Click through for those attribution details as well as links to get the demo database in various sizes.

The Data Mesh

James Serra explains what a data mesh is:

Its goal is to treat data as a product, with each source having its own data product manager/owner (who are part of a cross-functional team of data engineers) and being its own clearly-focused domain that has an autonomous offering, becoming the fundamental building blocks of a mesh, leading to a domain-driven distributed architecture. Note that for performance reasons, you could have a domain that aggregates data from multiple sources. Each domain should be discoverable, addressable, self-describing, secure (governed by global access control), trustworthy, and interoperable (governed by an open standard). Each domain will store its data in a data lake and in many cases will also have a copy of some of the data in a relational database (see Data Lakehouse defined for why you still want a relational database in most cases).

I’ll have to think more about this before I’m convinced. I’ll also need to think about the Aristotelian opposite of the data mesh.

Data Quality Monitoring with SQL

Ryan Kearns and Barr Moses walk us through key principles for monitoring data quality in a relational database:

Next, we want to assess the field-level, distributional health of our data. Distribution tells us all of the expected values of our data, as well as how frequently each value occurs. One of the simplest questions is, “how often is my data NULL”? In many cases, some level of incomplete data is acceptable — but if a 10% null rate turns into 90%, we’ll want to know.

This covers a couple examples around data freshness and completeness, and I appreciate the level of detail in here. Nothing is earth-shattering, but at the same time, it’s important to have a catalog of the sorts of issues which can pop up. H/T Mark Hutchinson.
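As a minimal illustration of the null-rate check described above, here is a sketch that runs a plain T-SQL query through pyodbc; the connection string, table, and column are assumptions, and the 10% threshold is arbitrary.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;"
    "DATABASE=Analytics;Trusted_Connection=yes;"
)

# Null rate for a single field; in practice you would run one of these per column.
null_rate_sql = """
SELECT
    COUNT(*) AS total_rows,
    SUM(CASE WHEN customer_email IS NULL THEN 1 ELSE 0 END) AS null_rows,
    100.0 * SUM(CASE WHEN customer_email IS NULL THEN 1 ELSE 0 END)
        / NULLIF(COUNT(*), 0) AS null_pct
FROM dbo.Customers;
"""

total_rows, null_rows, null_pct = conn.cursor().execute(null_rate_sql).fetchone()
print(f"{null_rows} of {total_rows} rows have a NULL customer_email ({null_pct:.1f}%)")

# Some incomplete data may be acceptable; alert only when the rate crosses a threshold.
if null_pct > 10:
    print("Null rate above 10% -- investigate the upstream loads.")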

Automating Data Comparison using Biml

Ben Weissman gives us an example of running data comparison using Biml:

When you build an ETL solution, at some point you will most likely feel the need to compare the data between your source and your staging (or data warehouse) database. There may be various reasons for them to be out of sync, such as delta loads, aggregations, or added business logic, but one day your phone will ring and whoever is on the other end will tell you that the numbers are wrong. While this doesn’t necessarily have to be an issue within your ETL process, it might just as well be the report itself. In many cases, this is a good starting point to look at.

This article (formerly published on the PASS Community Blog) focuses on SSIS as your orchestrator, but the same principles could, obviously, also be applied to Azure Data Factory, for example. Also, as we want to solve this task in as lightweight a way as possible, we will, demonstrably, use Biml to implement it!

Click through to see how.
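The post generates its comparison packages with Biml; purely as an illustration of the underlying idea rather than Ben's implementation, here is a small Python sketch that compares a row count and a summed measure between a source and a staging table (connection strings, table names, and the Amount column are assumptions).

import pyodbc

def summarize(conn_str: str, table: str):
    """Return (row_count, total_amount) for the given table."""
    with pyodbc.connect(conn_str) as conn:
        row = conn.cursor().execute(
            f"SELECT COUNT(*), SUM(Amount) FROM {table};"
        ).fetchone()
        return row[0], row[1]

source = summarize("DSN=SourceDb;Trusted_Connection=yes;", "dbo.Sales")
staging = summarize("DSN=StagingDb;Trusted_Connection=yes;", "stg.Sales")

if source == staging:
    print("Source and staging agree:", source)
else:
    print(f"Mismatch: source={source}, staging={staging}")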

Change Tracking Runthrough

Erik Darling provides a runthrough (which is a walkthrough but at a faster pace) of change tracking in SQL Server:

I’ve been working with CDC and CT way too much, and even I’m annoyed with how much it’s coming out in blog posts.

I’m going to cover a lot of ground quickly here. If you get lost, or there’s something you don’t understand, your best bet is to reference the documentation to get caught up.

Check it out.

Setting up Azure Purview for Power BI

Soheil Bakhshi has a great step-by-step walkthrough for setting up Azure Purview:

Microsoft recently announced some very exciting news: Azure Purview now supports Power BI. This is massive news from a data governance point of view. Azure Purview is the next generation of Azure Data Catalog, with more metadata discovery power and the ability to use sensitivity labels. After reading the news, I immediately decided to set up my test environment and give it a go. I followed the steps mentioned in this article on the Microsoft documentation website, but I faced some difficulties getting it to work. And here we are, another blog post to help you set up Azure Purview for Power BI.

Click through for a detailed walkthrough.
