Press "Enter" to skip to content

Creating an Azure Purview Catalog Instance

Wolfgang Strasser wants to try out Azure Purview:

Basics – Resource group, Purview account name (this cannot be changed afterwards), and the location.

As of today (2020-12-06), there are only 5 Azure regions you can choose from to store the Purview metadata. But – in-region scanning from 16 other Azure regions is available in the preview (source)

This is part one of a multi-part series, so stay tuned for more.
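
If you would rather script the deployment than click through the portal, here is a minimal sketch using the azure-mgmt-purview Python SDK. The subscription, resource group, account name, and region are placeholders, and the exact model and parameter names may vary by SDK version, so treat this as an outline rather than a tested deployment script.

```python
# Sketch only: create a Purview account via the Python management SDK.
# Requires azure-identity and azure-mgmt-purview; all names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.purview import PurviewManagementClient
from azure.mgmt.purview.models import Account

credential = DefaultAzureCredential()
client = PurviewManagementClient(credential, "<subscription-id>")

# The account name cannot be changed afterwards, and only a handful of regions
# currently host the Purview metadata, so pick both deliberately.
poller = client.accounts.begin_create_or_update(
    resource_group_name="rg-purview-demo",
    account_name="purviewdemoaccount",
    account=Account(location="westeurope"),
)
print(poller.result().id)
```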


Introducing Azure Purview

Wolfgang Strasser gives us a once-over on a new service:

Today, at the Azure Data and Analytics event, a new Azure data governance service called Azure Purview (https://aka.ms/AzurePurview) was presented and made available in a public preview.

I have not had a chance to try the actual service, but I found a very interesting video (Microsoft Mechanics video) from which I took the following screenshots.

Read on for Wolfgang’s thoughts. It’s definitely a step up from Azure Data Catalog.


A New Financial Sample Dataset with Power BI

David Eldersveld takes a look at a new sample data set in Power BI Desktop:

With the October 2020 release of Power BI Desktop, Microsoft has incorporated a sample dataset directly into the product. While there are a few commonly used training and demo datasets already, there are benefits to having a basic starter Financials dataset included.

This data can accelerate the learning experience for new users, particularly those who may only need to focus on authoring reports on existing datasets at their organization and don’t need to immediately know much about data modeling.

Click through for David’s mixed review.


Spark Infer Schema vs ADF Get Metadata

Paul Andrew compares two techniques for retrieving metadata:

For file types that don’t contain their own metadata (CSV, text, etc.) we typically have to go and figure out their structure, including attributes and data types, before doing any actual transformation work. Often I’ve used the Data Factory Get Metadata activity to do this with its structure option. However, while playing around with Azure Synapse Analytics, specifically creating Notebooks in C# to run against the Apache Spark compute pools, I’ve discovered that in most cases the Data Frame infer schema option basically does a better job here.

Now, I’m sure some Spark people will probably read the above and think, well der, obviously Paul! Spark is better than Data Factory. And sure, I accept that for this specific situation it certainly is. I’m simply calling that out as it might not be obvious to everyone.

Read on for a comparison of the two techniques.
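
To make the Spark side of that comparison concrete, here is a minimal PySpark sketch (the file path is a placeholder) that asks the DataFrame reader to infer the structure of a CSV file, much like the structure option of the Get Metadata activity would:

```python
# Minimal sketch: let Spark infer the schema of a file type without embedded metadata (CSV).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("infer-schema-demo").getOrCreate()

df = (
    spark.read
    .option("header", True)          # first row contains column names
    .option("inferSchema", True)     # sample the data to infer data types
    .csv("/data/landing/sales.csv")  # placeholder path
)

# The inferred structure can then drive downstream transformation logic,
# similar to what the Get Metadata activity's structure output provides.
df.printSchema()
```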


Data Masking Improvements in dbatools

Sander Stad walks us through some changes to the data masking algorithm in dbatools:

If you’ve used the data masking command in dbatools you’ve probably noticed that the PowerShell session becomes memory intensive when it has to handle larger tables with one or more unique indexes.

The reason that happens is that during the data masking process the command looks for any unique indexes in the table. If it finds a unique index it will create a unique row for all the columns in the unique index.

Read on to see how Sander handled this.
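
For a sense of why unique indexes make this expensive, here is a conceptual Python sketch, not the dbatools implementation: every masked combination covering a unique index has to be remembered so later rows don’t collide with it, which is where the memory goes.

```python
# Conceptual sketch (not dbatools itself): masking columns in a unique index means
# every generated combination must be checked against those already handed out,
# so memory usage grows with the size of the table.
import random
import string

def random_value(length=8):
    return "".join(random.choices(string.ascii_letters, k=length))

def mask_unique_rows(row_count, unique_columns):
    seen = set()          # every combination issued so far lives in memory
    masked_rows = []
    for _ in range(row_count):
        while True:
            candidate = tuple(random_value() for _ in unique_columns)
            if candidate not in seen:    # retry on collision to preserve uniqueness
                seen.add(candidate)
                masked_rows.append(dict(zip(unique_columns, candidate)))
                break
    return masked_rows

rows = mask_unique_rows(100_000, ["Email", "AccountNumber"])
print(len(rows))
```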


Validating Data Model Results in Power BI

Paul Turley continues a series on doing Power BI the right way:

When designing a new data model, this is typically the first thing I do… For every fact table and for each large dimension table, I create a measure that returns the record count for that table. Users normally think about their data in business terms (like sums, ratios and averages) and not about how many records there are in a table. Record counts are a convenient sanity check for record completeness and distribution across groups; and may also be a good indicator for model size and performance.

Paul takes several passes at the problem, getting a bit deeper into it each time.


Processing Larger Messages with Apache Kafka

Kai Wähner walks us through the tradeoffs of sending large messages in Apache Kafka:

After exploring use cases for large message payloads, let’s clarify what Kafka is not:

Kafka is usually not the right technology to store and process large files (images, videos, proprietary files, etc.) as a whole. Products were built specifically for these use cases.

For instance, a Content Delivery Network (CDN) such as Akamai, Limelight Networks, or Amazon CloudFront distributes video streams and other software downloads across the globe. Or take “big file editing and processing”: video editing tools from Adobe, Autodesk, Camtasia, and many other vendors are used to structure and present all video information, including films and television shows, video advertisements, and video essays.

There’s a lot of good advice in here. I think the best advice is essentially “don’t do this unless you need it” but I appreciate that Kai goes a lot further than that.
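
One widely used way to stay within those limits is the reference-based (claim check) pattern: keep the large file in object storage or behind a CDN and send only a small pointer through Kafka. Here is a rough sketch with the kafka-python client; the broker address, topic name, and storage URI are placeholders, not anything from Kai’s post.

```python
# Sketch of the reference-based approach for large payloads: store the file
# elsewhere and publish only a small pointer message through Kafka.
# Assumes kafka-python is installed; broker, topic, and URI are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Upload the large file to object storage (S3, Blob Storage, a CDN origin, ...)
# using whatever SDK fits your platform, then publish just the metadata.
reference_message = {
    "file_uri": "https://storage.example.com/videos/demo.mp4",  # placeholder
    "content_type": "video/mp4",
    "size_bytes": 734_003_200,
}

producer.send("video-ingest-events", value=reference_message)
producer.flush()
```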


The Dunder Mifflin Data Set

Tim Mitchell has a new data set for us:

I’ve been a fan of Dunder Mifflin ever since I first learned about this small midwestern paper company. Over the years I’ve gotten to know their people and processes, following from a distance their successes, failures, and various adventures. Who would have known the paper business would be so interesting?

Based on what I learned about this company, I built this Dunder Mifflin data set based on the old Northwind structure, adapting it to meet the needs of this small paper company. It includes most of the employees, regional locations (both current and now-closed), and has a modestly-sized set of sales data for demos and testing.

Check out Tim’s GitHub repo and give it a try.


Trust and Warehouse Data

Rob Farley explains one way that people might lose trust in your warehouse data:

The scenario is that there’s a source system, and there’s a table in a warehouse that is being used to report on it. Maybe it’s being populated by Integration Services or Data Factory. Maybe it’s being populated by T-SQL. I don’t really care. What I care about is whether the data in the warehouse is a true representation of what’s in the source system.

If it’s not a true representation, then we have all kinds of problems.

Mostly, that our warehouse is rubbish.

Read on for an example of how this might occur and what you can do to prevent it.
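
A cheap first line of defence is a reconciliation step that checks the warehouse against the source after each load. Here is a minimal Python sketch using pyodbc; the connection strings and table names are made up, and a row-count comparison is only the crudest of the checks you might run.

```python
# Minimal reconciliation sketch: compare row counts between a source table and
# the warehouse table that is supposed to mirror it. Connection strings and
# table names are placeholders.
import pyodbc

def row_count(connection_string, table):
    with pyodbc.connect(connection_string) as conn:
        cursor = conn.cursor()
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        return cursor.fetchone()[0]

source_count = row_count("DSN=SourceSystem", "dbo.Orders")
warehouse_count = row_count("DSN=Warehouse", "dw.FactOrders")

if source_count != warehouse_count:
    # In practice you would alert, log, or fail the load here; a count check
    # won't catch every discrepancy, but it catches the embarrassing ones.
    print(f"Row count mismatch: source={source_count}, warehouse={warehouse_count}")
else:
    print(f"Counts match ({source_count} rows)")
```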


Free Data Sets

Chris Taylor puts on the Santa Claus getup:

This is just a short overview page of various dataset sources I’ve used in the past in my SQL Server, Azure, AWS, and Power BI demos. Most if not all of these are free, as I don’t like paying for stuff, and there are some where you can generate your own data if you need something more specific. Just be careful: although it’s “randomly” generated, I have had issues where I pushed the files to GitHub and, as it happens, one of the randomly generated email addresses was that of an employee of a company I’d never heard of. Pure chance (I believe), but something to be mindful of if that is your intended use.

Chris also has sub-links to Google and Kaggle data sets.
