Press "Enter" to skip to content

Category: Data

Building Sets with GENERATE_SERIES

Aaron Bertrand is a fan of this new function:

I have come across a lot of use cases for manufacturing rows on the fly, aside from the common goal of populating a large data set such as a numbers or calendar table. A few favorites include building sample data, pivoting an unknown number of columns, data extrapolation, and filling gaps in date or time ranges.

If you are on SQL Server 2022 or Azure SQL Database, or have been reading up on new features, you’ve likely heard about one of the better T-SQL enhancements: a new built-in function called GENERATE_SERIES. The syntax is straightforward – it accepts arguments for start and stop, and an optional argument to indicate step (in case you want to iterate by more than 1, or backwards):

Click through to see how performance for this compares to two methods we’ve used in the past to generate similar results.

Comments closed

Analyzing SQL Server Table Metadata

Barney Lawrence looks at details about a table:

For a while I’ve been building up a script that uses SQL Servers INFORMACTION_SCHEMA.COLUMNS table to create a query that applies aggregates to each column in a table and then formats the results to allow an at a glance summary of the shape of the data inside of it.

I showed some of the techniques used to build this in a session regarding metadata and data warehousing which I co-presented with Emma Dolling and Ruth Pearson at SQLBits. The one comment that I heard the most after this session was “I want that script”. I did promise to make it available more than a few times and so I’ve put together a github repository to hold it and other resources we built for the session.

Barney anticipated the most likely change I’d make while reviewing the script: APPROX_COUNT_DISTINCT() if you’re on SQL Server 2022 and dealing with a large table.

Comments closed

A Critique of XML

Andy Leonard isn’t XML’s biggest fan:

If you are sending me (or some other hapless victim data engineer) lots of data that resides in a stable schema – one in which the number, order, data type, etc. of the columns never change – using XML, I have a question:

Why?

Why are you using XML to transmit this data?

Read the whole thing. My approximate thoughts (because it is fairly early when I’m writing this, so I might have missed something) are:

  1. XML is most useful with an XSLT, a document describing the shape and rules of the XML data. This is a big advantage over CSV, as it helps you retain information on data types, data lengths, and other details which get lost in the comma.
  2. Speaking of which, CSVs run a high risk of needing to use the separator as a native character. The problem is that there is no single right way to indicate that “That comma is a separator, but this comma is just a comma.” Different parsers work differently, and one of my lengthy rants about PolyBase is that it helpfully indicates that you have a quoted delimiter here and helpfully removes it before barfing on the commas inside quotations. There is actually an ANSI standard character for separator which is not supposed to occur in the wild…but how many people actually use it? Especially considering that most tools don’t interpret it correctly, so you lose some of the readability of CSVs in the process.
  3. That said, for stable schemas with a known separator (or at least a known mechanism for differentiating separators from naturally occurring characters), separated values works well.
  4. And that said, Parquet works better, assuming you don’t have a lot of long string columns. If you’re dealing mostly with numeric data, Parquet will compress much better, will retain data types and lengths, and won’t be a repetitious blob of angle brackets. But a lot of tools still don’t support Parquet, which is a downside.
  5. Basically, this is why we can’t have nice things.
1 Comment

Data Updates in Azure Data Explorer

Hiram Fleitas updates the data:

I recently ran into a Kustomer that migrated from TSI to ADX (Azure Data Explorer). They were really excited about using Kusto Trender but one item they couldn’t wrap their head around was how to update their hierarchy table(s) in ADX. i.e.  

- Contoso WindFarm Hierarchy (Levels: Plant > Unit > System > Name)
-- Plant
--- Unit
---- System
----- Name 

As a big data platform ADX is an append-only data store, so we don’t have the options to do updates, right? Well, that’s not completely true. We absolutely don’t support updates, but we do have a couple options to simulate updates.

Read on to see what options are available to you.

Comments closed

2023 Data Professional Survey Results

Brent Ozar busts out the briefcase full of Benjamins:

Are your peers being paid more this year? Are they switching job roles? Are they planning on leaving their companies? To find out, I run a salary survey every year for folks in the database industry. Download the raw data here and slice & dice ’em to see what’s important to you.

As a quick note, however, remember that inflation in the US went up considerably. Inflation wasn’t something we had to factor in from 2017-2021, as it was 1.5-2%. In 2021, it increased to more than 4% and in 2022 was closer to 8-9%, so converting these from nominal (pre-inflation) to real (post-inflation) will help tell the full story.

Comments closed

A New Sample Database

Daniel Hutmacher isn’t satisfied with AdventureWorks:

The database collation is Latin1_General_100_CI_AS_SC.

I’ve divided information into schemas based on their respective sources. The “ReferenceData” schema will have mixed sources, all of them publicly available.

Because the data is so geographically bound, many of the tables with have geo data as well, though I technically put it in a geometry type and not a geography type – just because it was easier. This can make for some cool map visuals in SSMS if you want.

Most columns and tables are annotated using the extended property MS_description, so if you view the extended properties in SSMS, or if you use my sp_ctrl3 utility, a brief description will show up for each object.

Read on for an overview of the database’s schema, as well as the link to download the DB. I’ll have to check it out.

Comments closed

E-Mailing Power BI Query Outputs as CSV Files

Gilbert Quevauvilliers finds a way:

I recently had a requirement from a customer where they wanted a list of all Customers and the Primary Contact to be emailed to them every day.

The reason for this to be emailed daily is to ensure that when calling the customer, they know whom to speak to.

This got me thinking and I could use Power Automate to achieve this task which I detail in this blog post below.

I appreciate the ingenuity involved in getting this to work, though this also presents a good case for having this data in a warehouse, where data export to CSV would be easier.

Comments closed

The Importance of Data Shaping

Paul Turley shapes the youth of data:

Power BI is a new tool and dimensional modeling is an old idea. One of the challenges is that, like other modern self-service analytics products on the market, Power BI doesn’t force self-service data jockeys to transform their data before reporting with it. If you want to import a big, wide spreadsheet full of numbers and create charts in a Power BI report, knock yourself out. But, the solution won’t scale and you will inevitably run into walls when you try to make future enhancements. Similar problems arise from importing many tables from different sources and transactional systems. Several tables all chained together with creative mashups and relationships present their own set of problems. The first iteration of such an effort is usually a valuable discovery method and learning experience. Great… treat it as such; take notes, make note of the good parts and then throw it away and start over! In Fredrick Brooks’ “The Mythical Man Month“, he cites that for most engineering projects, the first six attempts should be abandoned before the team will be prepared to start over and complete the work successfully. He was a chemical engineer before working for IBM; and hopefully, our methods in the data engineering business are more effective then his 6-to-1 rule. But, this makes the case the prototypes and proof-of-concept projects are a critical part of the learning path.

The tools don’t make the rules.

Unless you’re talking about the lambda architecture, in which case that’s kind of accurate. But we’re not talking about that here.

Comments closed

Data Retention: Definition and Policy

Joey Jablonski thinks about data retention:

Data retention policies should be defined in a way that they are easy to understand, easy to be implemented programmatically, and should enable engineering teams to operate independently most of the time when working with datasets that are known and already leveraged by the organization. In addition to policy definitions, data governance leaders should ensure changes are part of data literacy plans for training and rollout to ensure awareness across the organization.

This is something that most DBAs provide input into but don’t directly control. Still, it’s good to know some of the challenges around data retention and figure out how to apply it to your organization.

Comments closed

Security Practices for Delta Sharing

Andrew Weaver, et al, share some advice:

When you enable Delta Sharing, you configure the token lifetime for recipient credentials. If you set the token lifetime to 0, recipient tokens never expire.

Setting the appropriate token lifetime is critically important for regulatory, compliance and reputational standpoint. Having a token that never expires is a huge risk; therefore, it is recommended using short-lived tokens as best practice. It is far easier to grant a new token to a recipient whose token has expired than it is to investigate the use of a token whose lifetime has been improperly set.

Click through for eight such tips.

Comments closed