Press "Enter" to skip to content

Category: Data

Azure Purview: External Connections and the Starter Kit

Wolfgang Strasser continues a series on Azure Purview. First up is a look at external connections:

During my tests with Azure Purview I found out, that in my demo accounts (I created multiple to test it) the Management menu was missing some items.

Read on to learn why and how you can rectify this. Then, check out Wolfgang’s take on the Starter Kit:

Very often, when I want to test some new services I miss some infrastructure and environments I can start and play with. I am not talking about creating a new Azure Purview account (see my previous blog post – Creating an Azure Purview account) – I am talking about the data infrastructure to analyze, catalog and gain knowledge out of it.

We could start to create such an infrastructure, BUT: the Purview team create a Starter Kit to quickly create a data estate and configure everything that you can start with Purview within a view minutes.

Read on for one issue (of the self-inflicted variety) Wolfgang ran into during deployment. But it does look like a great way to get started with Purview and build up a relevant demo environment.

Comments closed

Ignoring Bad Dates when Moving to Spark 3

Robert Blackburn shows us one way to handle bad dates when moving to Spark 3:

Moving from a Spark 2 to a Spark 3 runtime has a lot of benefits including big performance improvements through adaptive query executiondynamic partition pruning, and other optimizations. Some updates may require you to refactor your code. One of them is Delta tables now use the Proleptic Gregorian Calendar. Isn’t a calendar a calendar? Unfortunately, no. The Julian calendar has discrepancies with old dates. Specifically dates before 1582 and timestamps before 1900. Here we will dynamically update these dates for incoming source files.

If you would like to follow along in detail, I have a sample notebook that uses the community edition of Databricks. The DBC Archive file is here and the source file is here.

Fortunately, this change is unlikely to affect most of us, with perhaps the most common issue being that you used 0001-01-01 as a default date.

Comments closed

Creating an Azure Purview Catalog Instance

Wolfgang Strasser wants to try out Azure Purview:

Basics – Resource group, purview account name (this cannot be changed afterwards) and the location.

As of today (2020-12-06), there are only 5 Azure regions you can choose from to store the Purview metadata. But – in-region scanning from 16 other Azure regions is available in the preview (source)

This is part one of a multi-part series, so stay tuned for more.

Comments closed

Introducing Azure Purview

Wolfgang Strasser gives us a once-over on a new service:

Today, at the Azure Data and Analytics event, a new Azure data governance service called Azure Purview (https://aka.ms/AzurePurview) was presented and made available in a public preview.

I have not had a chance to try the actual service, but I found a very interesting video (Microsoft mechanics video) where I took the following screenshots from.

Read on for Wolfgang’s thoughts. It’s definitely a step up from Azure Data Catalog.

Comments closed

A New Financial Sample Dataset with Power BI

David Eldersveld takes a look at a new sample data set in Power BI Desktop:

With the October 2020 release of Power BI Desktop, Microsoft has incorporated a sample dataset directly into the product. While there are a few commonly used training and demo datasets already, there are benefits to having a basic starter Financials dataset included.

This data can accelerate the learning experience for new users, particularly those who may only need to focus on authoring reports on existing datasets at their organization and don’t need to immediately know much about data modeling.

Click through for David’s mixed review.

Comments closed

Spark Infer Schema vs ADF Get Metadata

Paul Andrew compares two techniques for retrieving metadata:

For file types that don’t contain there own metadata (CSV, Text etc) we typically have to go and figure out there structure including; attributes and data types before doing any actual transformation work. Often I’ve used the Data Factory Metadata Activity to do this with its structure option. However, while playing around with Azure Synapse Analytics, specifically creating Notebooks in C# to run against the Apache Spark compute pools I’ve discovered in most case the Data Frame infer schema option basically does a better job here.

Now, I’m sure some Spark people will probably read the above and think, well der, obviously Paul! Spark is better than Data Factory. And sure, I accept for this specific situation it certainly is. I’m simply calling that out as it might not be obvious to everyone

Read on for a comparison of the two techniques.

Comments closed

Data Masking Improvements in dbatools

Sander Stad walks us through some changes to the data masking algorithm in dbatools:

If you’ve used the data masking command in dbatools you’ve probably noticed that the PowerShell session becomes memory intensive when it has to handle larger tables with one or more unique indexes.

The reason that happens is that during the data masking process the command looks for any unique indexes in the table. If it finds a unique index it will create a unique row for all the columns in the unique index.

Read on to see how Sander handled this.

Comments closed

Validating Data Model Results in Power BI

Paul Turley continues a series on doing Power BI the right way:

When designing a new data model, this is typically the first thing I do… For every fact table and for each large dimension table, I create a measure that returns the record count for that table. Users normally think about their data in business terms (like sums, ratios and averages) and not about how many records there are in a table. Record counts are a convenient sanity check for record completeness and distribution across groups; and may also be a good indicator for model size and performance.

Paul takes several passes at the problem, getting a bit deeper into it each time.

Comments closed

Processing Larger Messages with Apache Kafka

Kai Wähner walks us through the tradeoffs of sending large messages in Apache Kafka:

After exploring use cases for large message payloads, let’s clarify what Kafka is not:

Kafka is usually not the right technology to store and process large files (images, videos, proprietary files, etc.) as a whole. Products were built specifically for these use cases.

For instance, a Content Delivery Network (CDN) such as Akamai, Limelight Networks, or Amazon CloudFront distribute video streams and other software downloads across the globe. Or “big file editing and processing” (like a video processing tool). Or video editing tools from Adobe, Autodesk, Camtasia, and many other vendors are used to structure and present all video information, including films and television shows, video advertisements, and video essays.

There’s a lot of good advice in here. I think the best advice is essentially “don’t do this unless you need it” but I appreciate that Kai goes a lot further than that.

Comments closed

The Dunder Mifflin Data Set

Tim Mitchell has a new data set for us:

I’ve been a fan of Dunder Mifflin ever since I first learned about this small midwestern paper company. Over the years I’ve gotten to know their people and processes, following from a distance their successesfailures, and various adventures. Who would have known the paper business would be so interesting?

Based on what I learned about this company, I built this Dunder Mifflin data set based on the old Northwind structure, adapting it to meet the needs of this small paper company. It includes most of the employees, regional locations (both current and now-closed), and has a modestly-sized set of sales data for demos and testing.

Check out Tim’s GitHub repo and give it a try.

Comments closed