Category: Data

A Critique of XML

Published 2023-03-13 by Kevin Feasel

If you are sending me (or some other hapless victim data engineer) lots of data that resides in a stable schema – one in which the number, order, data type, etc. of the columns never change – using XML, I have a question:

Why?

Why are you using XML to transmit this data?

Read the whole thing. My approximate thoughts (because it is fairly early when I’m writing this, so I might have missed something) are:

XML is most useful with an XSLT, a document describing the shape and rules of the XML data. This is a big advantage over CSV, as it helps you retain information on data types, data lengths, and other details which get lost in the comma.
Speaking of which, CSVs run a high risk of needing to use the separator as a native character. The problem is that there is no single right way to indicate that “That comma is a separator, but this comma is just a comma.” Different parsers work differently, and one of my lengthy rants about PolyBase is that it helpfully indicates that you have a quoted delimiter here and helpfully removes it before barfing on the commas inside quotations. There is actually an ANSI standard character for separator which is not supposed to occur in the wild…but how many people actually use it? Especially considering that most tools don’t interpret it correctly, so you lose some of the readability of CSVs in the process.
That said, for stable schemas with a known separator (or at least a known mechanism for differentiating separators from naturally occurring characters), separated values works well.
And that said, Parquet works better, assuming you don’t have a lot of long string columns. If you’re dealing mostly with numeric data, Parquet will compress much better, will retain data types and lengths, and won’t be a repetitious blob of angle brackets. But a lot of tools still don’t support Parquet, which is a downside.
Basically, this is why we can’t have nice things.

1 Comment

Data Updates in Azure Data Explorer

Published 2023-02-15 by Kevin Feasel

Hiram Fleitas updates the data:

I recently ran into a Kustomer that migrated from TSI to ADX (Azure Data Explorer). They were really excited about using Kusto Trender but one item they couldn’t wrap their head around was how to update their hierarchy table(s) in ADX. i.e.
- Contoso WindFarm Hierarchy (Levels: Plant > Unit > System > Name)
-- Plant
--- Unit
---- System
----- Name 
As a big data platform ADX is an append-only data store, so we don’t have the options to do updates, right? Well, that’s not completely true. We absolutely don’t support updates, but we do have a couple options to simulate updates.

Read on to see what options are available to you.

Comments closed

2023 Data Professional Survey Results

Published 2023-01-04 by Kevin Feasel

Brent Ozar busts out the briefcase full of Benjamins:

Are your peers being paid more this year? Are they switching job roles? Are they planning on leaving their companies? To find out, I run a salary survey every year for folks in the database industry. Download the raw data here and slice & dice ’em to see what’s important to you.

As a quick note, however, remember that inflation in the US went up considerably. Inflation wasn’t something we had to factor in from 2017-2021, as it was 1.5-2%. In 2021, it increased to more than 4% and in 2022 was closer to 8-9%, so converting these from nominal (pre-inflation) to real (post-inflation) will help tell the full story.

Comments closed

A New Sample Database

Published 2022-12-06 by Kevin Feasel

Daniel Hutmacher isn’t satisfied with AdventureWorks:

The database collation is Latin1_General_100_CI_AS_SC.

I’ve divided information into schemas based on their respective sources. The “ReferenceData” schema will have mixed sources, all of them publicly available.

Because the data is so geographically bound, many of the tables with have geo data as well, though I technically put it in a geometry type and not a geography type – just because it was easier. This can make for some cool map visuals in SSMS if you want.

Most columns and tables are annotated using the extended property MS_description, so if you view the extended properties in SSMS, or if you use my sp_ctrl3 utility, a brief description will show up for each object.

Read on for an overview of the database’s schema, as well as the link to download the DB. I’ll have to check it out.

Comments closed

E-Mailing Power BI Query Outputs as CSV Files

Published 2022-09-21 by Kevin Feasel

Gilbert Quevauvilliers finds a way:

I recently had a requirement from a customer where they wanted a list of all Customers and the Primary Contact to be emailed to them every day.
The reason for this to be emailed daily is to ensure that when calling the customer, they know whom to speak to.
This got me thinking and I could use Power Automate to achieve this task which I detail in this blog post below.

I appreciate the ingenuity involved in getting this to work, though this also presents a good case for having this data in a warehouse, where data export to CSV would be easier.

Comments closed

The Importance of Data Shaping

Published 2022-09-01 by Kevin Feasel

Paul Turley shapes the youth of data:

Power BI is a new tool and dimensional modeling is an old idea. One of the challenges is that, like other modern self-service analytics products on the market, Power BI doesn’t force self-service data jockeys to transform their data before reporting with it. If you want to import a big, wide spreadsheet full of numbers and create charts in a Power BI report, knock yourself out. But, the solution won’t scale and you will inevitably run into walls when you try to make future enhancements. Similar problems arise from importing many tables from different sources and transactional systems. Several tables all chained together with creative mashups and relationships present their own set of problems. The first iteration of such an effort is usually a valuable discovery method and learning experience. Great… treat it as such; take notes, make note of the good parts and then throw it away and start over! In Fredrick Brooks’ “The Mythical Man Month“, he cites that for most engineering projects, the first six attempts should be abandoned before the team will be prepared to start over and complete the work successfully. He was a chemical engineer before working for IBM; and hopefully, our methods in the data engineering business are more effective then his 6-to-1 rule. But, this makes the case the prototypes and proof-of-concept projects are a critical part of the learning path.

The tools don’t make the rules.

Unless you’re talking about the lambda architecture, in which case that’s kind of accurate. But we’re not talking about that here.

Comments closed

Data Retention: Definition and Policy

Published 2022-08-11 by Kevin Feasel

Joey Jablonski thinks about data retention:

Data retention policies should be defined in a way that they are easy to understand, easy to be implemented programmatically, and should enable engineering teams to operate independently most of the time when working with datasets that are known and already leveraged by the organization. In addition to policy definitions, data governance leaders should ensure changes are part of data literacy plans for training and rollout to ensure awareness across the organization.

This is something that most DBAs provide input into but don’t directly control. Still, it’s good to know some of the challenges around data retention and figure out how to apply it to your organization.

Comments closed

Security Practices for Delta Sharing

Published 2022-08-05 by Kevin Feasel

Andrew Weaver, et al, share some advice:

When you enable Delta Sharing, you configure the token lifetime for recipient credentials. If you set the token lifetime to 0, recipient tokens never expire.
Setting the appropriate token lifetime is critically important for regulatory, compliance and reputational standpoint. Having a token that never expires is a huge risk; therefore, it is recommended using short-lived tokens as best practice. It is far easier to grant a new token to a recipient whose token has expired than it is to investigate the use of a token whose lifetime has been improperly set.

Click through for eight such tips.

Comments closed

Detecting Data Changes in Power BI Incremental Refresh

Published 2022-08-01 by Kevin Feasel

Chris Webb writes some M:

One feature of Power BI incremental refresh I’ve always been meaning to test out is the ability to create your own M queries to work with the “detect data changes” feature, and last week I finally had the chance to do it. The documentation is reasonably detailed but I thought it would be a good idea to show a worked example of how to use it to get direct control over what data is refreshed during an incremental refresh.

Read on to see how it works, including a couple gotchas around things like the shape of query results.

Comments closed

Finding Sample Data Online

Published 2022-07-13 by Kevin Feasel

Mara Pereira goes searching for data:

Have you ever struggled to find sample data to play with in Power BI?
Did you spend hours (sounds crazy, but it happened to me too!) just looking for a dataset with insurance data? Healthcare data? Housing prices data?
Did you ever wonder “where are people finding the data to create those Netflix and Amazon reports that seem to be everywhere these days? Seriously, tell me your secret!”?

Click through for three good sites. Another one I’ve taken to is the US Bureau of Labor Statistics, which is the federal department responsible for tracking things like employment data, consumer prices, and compensation data. The plus side to these datasets is that you get the whole range of data cleanup, warehousing, querying, and analytics over data which is both real and fairly interesting. For a broader take, data.gov has open source data hosted by the US federal government, though I haven’t spent much time working with it.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31