Press "Enter" to skip to content


Data Governance and Microsoft Fabric

Matthew Roche digs deeper into data governance in Microsoft Fabric:

One of the most underappreciated benefits of Power BI as a managed SaaS data platform has been the “managed” part. When you create a report, dataset, dataflow, or other item in Power BI, the Power BI service knows everything about it. Power BI is the authoritative system for all items it contains, which means that Power BI can answer questions related to lineage (where does the data used by this report come from?) and impact analysis (where is the data in this dataset used?) and compliance (who has permissions to access this report?) and more.

If you’ve ever tried to authoritatively answer questions like these for a system of any non-trivial scope, you know how hard it is. Power BI has made this information increasingly available to administrators, through logs and APIs, and the community has built a wide range of free and paid solutions to help admins turn this information into insights and action. Even more excitingly, Power BI keeps getting better and better even as the newer parts of Fabric seem to be getting all of the attention.


Paper Review: Moving Fast with Broken Data

Adnan Masood reviews a paper:

I recently came across an insightful research paper titled “Moving Fast With Broken Data” by Shreya Shankar, Labib Fawaz, Karl Gyllstrom, and Aditya G. Parameswaran from UC Berkeley and Meta. The paper addresses the significant issue of data corruption in machine learning (ML) pipelines, which often leads to decreased model accuracy. The authors present an automatic data validation system implemented at Meta that aims to solve this problem.

Sounds like I have some beach reading.

Ed. Note: He’s kidding, right?

Ed. 2 Note: About going to the beach maybe.

Ed. & Ed. 2 Note: HAHAHAHAHAH.

Yeah, I hired Statler and Waldorf as my editors. ~~Worst~~ Best decision of my life.


Postgres GIS Calculations by SRID

Ryan Lambert talks accuracy:

A common use case with PostGIS data is to calculate things, such as distances between points, lengths of lines, and the area of polygons. The topic of accuracy, or inaccuracy, with GEOMETRY data comes up often. The most frequent offenders are generic SRIDs such as 3857 and 4326. In some projects accuracy is paramount. Non-negotiable. On the other hand, plenty of projects do not need accurate calculations. Those projects often rely on relationships between calculations, not the actual values of the calculations themselves. If Coffee shop Y is 4 times further away than Coffee shop Z, I’ll often go to Coffee shop Z just based on that.

In most cases, users should still understand how significant the errors are. This post explores one approach to determine how accurate (or not!) the calculations of a given SRID are in a particular region, based on latitude (North/South). The queries used in this post can be adjusted for your specific area.

Click through to see how much the choice of SRID can impact your results.
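
For a flavor of what that error looks like, here is a small, self-contained comparison (my illustration, not one of Ryan's queries): the same pair of points measured as planar geometry in EPSG:3857 versus as geography. At roughly 40°N, Web Mercator inflates distances by about 1/cos(latitude), or around 30%.

```sql
-- Two arbitrary points near 40 degrees north (longitude, latitude order).
WITH pts AS (
    SELECT ST_SetSRID(ST_MakePoint(-104.99, 39.74), 4326) AS a,
           ST_SetSRID(ST_MakePoint(-104.80, 39.74), 4326) AS b
)
SELECT ST_Distance(ST_Transform(a, 3857),
                   ST_Transform(b, 3857)) AS meters_web_mercator, -- planar, inflated at this latitude
       ST_Distance(a::geography,
                   b::geography)          AS meters_geography     -- geodesic, close to ground truth
FROM pts;
```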


Migration Tool for Cosmos DB

Hasan Savran reviews a tool:

Migrating databases to a new database server is a big job. You need to have the right tools to make this process easy for everyone. Cosmos DB had a Data Migration Tool to move data from a bunch of sources. It was a great, free tool with an easy user interface. You didn’t have to be a developer to use this tool. Here is a screenshot of what it used to look like.

Unfortunately, the tool got old and did not get updated with newer SDK changes. It also worked only in Windows environments.

Click through for Hasan’s thoughts. I had a behind-the-scenes look at everything getting put together, in that I was supposed to help, got busy, and slinked away as really sharp people like Carey Payette and John Bowen drove the project to completion.


Building Sets with GENERATE_SERIES

Aaron Bertrand is a fan of this new function:

I have come across a lot of use cases for manufacturing rows on the fly, aside from the common goal of populating a large data set such as a numbers or calendar table. A few favorites include building sample data, pivoting an unknown number of columns, data extrapolation, and filling gaps in date or time ranges.

If you are on SQL Server 2022 or Azure SQL Database, or have been reading up on new features, you’ve likely heard about one of the better T-SQL enhancements: a new built-in function called GENERATE_SERIES. The syntax is straightforward – it accepts arguments for start and stop, and an optional argument to indicate step (in case you want to iterate by more than 1, or backwards):

Click through to see how performance for this compares to two methods we’ve used in the past to generate similar results.
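
If you want a quick taste of the syntax before clicking through, here are a few minimal examples (mine, not Aaron's). Note that GENERATE_SERIES requires SQL Server 2022 or Azure SQL Database and database compatibility level 160.

```sql
-- Integers 1 through 10; step defaults to 1
SELECT value
FROM GENERATE_SERIES(1, 10);

-- Walk backwards from 10 to 2 in steps of 2 via the optional third argument
SELECT value
FROM GENERATE_SERIES(10, 2, -2);

-- Manufacture a gap-free date range by adding the series to an anchor date
SELECT DATEADD(DAY, s.value, CONVERT(date, '20230101')) AS calendar_date
FROM GENERATE_SERIES(0, 30) AS s;
```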


Analyzing SQL Server Table Metadata

Barney Lawrence looks at details about a table:

For a while I’ve been building up a script that uses SQL Server’s INFORMATION_SCHEMA.COLUMNS view to create a query that applies aggregates to each column in a table and then formats the results to allow an at-a-glance summary of the shape of the data inside it.

I showed some of the techniques used to build this in a session regarding metadata and data warehousing, which I co-presented with Emma Dolling and Ruth Pearson at SQLBits. The one comment that I heard the most after this session was “I want that script”. I did promise to make it available more than a few times, and so I’ve put together a GitHub repository to hold it and other resources we built for the session.

Barney anticipated the most likely change I’d make while reviewing the script: APPROX_COUNT_DISTINCT() if you’re on SQL Server 2019 or later and dealing with a large table.
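
To give a sense of the general pattern, here is a minimal sketch rather than Barney's script — the dbo.SomeTable target and the specific aggregates are placeholders:

```sql
-- Build one profiling SELECT per column from INFORMATION_SCHEMA.COLUMNS,
-- stitch them together with UNION ALL, and run the result as dynamic SQL.
DECLARE @sql nvarchar(max);

SELECT @sql = STRING_AGG(CONVERT(nvarchar(max),
           N'SELECT ' + QUOTENAME(c.COLUMN_NAME, '''') + N' AS column_name,
                    COUNT(*) AS row_count,
                    COUNT(DISTINCT ' + QUOTENAME(c.COLUMN_NAME) + N') AS distinct_count,
                    SUM(CASE WHEN ' + QUOTENAME(c.COLUMN_NAME) + N' IS NULL THEN 1 ELSE 0 END) AS null_count
             FROM dbo.SomeTable'), N' UNION ALL ')
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.TABLE_SCHEMA = N'dbo'
  AND c.TABLE_NAME   = N'SomeTable'   -- placeholder table
  AND c.DATA_TYPE NOT IN (N'xml', N'geometry', N'geography', N'text', N'ntext', N'image');  -- COUNT(DISTINCT) rejects these

EXEC sys.sp_executesql @sql;
```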


A Critique of XML

Andy Leonard isn’t XML’s biggest fan:

If you are sending me (or some other hapless victim data engineer) lots of data that resides in a stable schema – one in which the number, order, data type, etc. of the columns never change – using XML, I have a question:

Why?

Why are you using XML to transmit this data?

Read the whole thing. My approximate thoughts (because it is fairly early when I’m writing this, so I might have missed something) are:

  1. XML is most useful with an XSD, a document describing the shape and rules of the XML data. This is a big advantage over CSV, as it helps you retain information on data types, data lengths, and other details which get lost in the comma.
  2. Speaking of which, CSVs run a high risk of the separator showing up as a native character in the data. The problem is that there is no single right way to indicate that “That comma is a separator, but this comma is just a comma.” Different parsers work differently, and one of my lengthy rants about PolyBase is that it helpfully recognizes that you have a quoted delimiter and then helpfully removes it before barfing on the commas inside the quotes. There is actually a standard ASCII separator character (the unit separator, 0x1F) which is not supposed to occur in the wild…but how many people actually use it? Especially considering that most tools don’t interpret it correctly, so you lose some of the readability of CSVs in the process. (There’s a short example of it after this list.)
  3. That said, for stable schemas with a known separator (or at least a known mechanism for differentiating separators from naturally occurring characters), separated values work well.
  4. And that said, Parquet works better, assuming you don’t have a lot of long string columns. If you’re dealing mostly with numeric data, Parquet will compress much better, will retain data types and lengths, and won’t be a repetitious blob of angle brackets. But a lot of tools still don’t support Parquet, which is a downside.
  5. Basically, this is why we can’t have nice things.
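
As promised in point 2, here is a tiny illustration of that separator character — CHAR(31), the ASCII unit separator — doing its job. This is purely for demonstration; the values are made up.

```sql
-- Build a "delimited" row using the ASCII unit separator (CHAR(31)).
-- The embedded commas are just commas; no quoting rules required.
DECLARE @row nvarchar(200) =
    N'Smith, John' + CHAR(31) + N'Denver, CO' + CHAR(31) + N'2023-01-15';

SELECT value
FROM STRING_SPLIT(@row, CHAR(31));
```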

Data Updates in Azure Data Explorer

Hiram Fleitas updates the data:

I recently ran into a Kustomer that migrated from TSI to ADX (Azure Data Explorer). They were really excited about using Kusto Trender but one item they couldn’t wrap their head around was how to update their hierarchy table(s) in ADX. i.e.  

- Contoso WindFarm Hierarchy (Levels: Plant > Unit > System > Name)
-- Plant
--- Unit
---- System
----- Name 

As a big data platform, ADX is an append-only data store, so we don’t have the option to do updates, right? Well, that’s not completely true. We absolutely don’t support updates, but we do have a couple of options to simulate updates.

Read on to see what options are available to you.


2023 Data Professional Survey Results

Brent Ozar busts out the briefcase full of Benjamins:

Are your peers being paid more this year? Are they switching job roles? Are they planning on leaving their companies? To find out, I run a salary survey every year for folks in the database industry. Download the raw data here and slice & dice ’em to see what’s important to you.

As a quick note, however, remember that inflation in the US went up considerably. Inflation wasn’t something we had to factor in from 2017 through 2020, as it sat around 1.5-2%. In 2021, it increased to more than 4%, and in 2022 it was closer to 8-9%, so converting these numbers from nominal (pre-inflation) to real (post-inflation) will help tell the full story.
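
A quick back-of-the-envelope version of that adjustment — the salary and the 8% figure below are made-up round numbers, not survey results or official CPI:

```sql
-- Deflate a 2022 nominal salary back to approximate 2021 dollars.
DECLARE @nominal_2022   money        = 110000,
        @inflation_2022 decimal(4,3) = 0.080;  -- assumed ~8%, illustrative only

SELECT @nominal_2022 / (1 + @inflation_2022) AS approx_real_in_2021_dollars;
```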


A New Sample Database

Daniel Hutmacher isn’t satisfied with AdventureWorks:

The database collation is Latin1_General_100_CI_AS_SC.

I’ve divided information into schemas based on their respective sources. The “ReferenceData” schema will have mixed sources, all of them publicly available.

Because the data is so geographically bound, many of the tables will have geo data as well, though I technically put it in a geometry type and not a geography type – just because it was easier. This can make for some cool map visuals in SSMS if you want.

Most columns and tables are annotated using the extended property MS_description, so if you view the extended properties in SSMS, or if you use my sp_ctrl3 utility, a brief description will show up for each object.

Read on for an overview of the database’s schema, as well as the link to download the DB. I’ll have to check it out.
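
If you want to browse those MS_description annotations with a query instead of clicking through SSMS, something along these lines works — a generic query against the catalog views, not anything specific to Daniel's database:

```sql
-- Pull table- and column-level MS_description extended properties.
SELECT s.name   AS schema_name,
       t.name   AS table_name,
       c.name   AS column_name,      -- NULL for table-level descriptions
       ep.value AS description
FROM sys.extended_properties AS ep
JOIN sys.tables  AS t ON t.object_id = ep.major_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
LEFT JOIN sys.columns AS c
       ON c.object_id = ep.major_id
      AND c.column_id = ep.minor_id
WHERE ep.class = 1                   -- object or column
  AND ep.name  = N'MS_description'
ORDER BY schema_name, table_name, column_name;
```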
