Press "Enter" to skip to content

Curated SQL Posts

Portfolio Management for Creating a Technology Strategy

Kevin Sookocheff busts out the 2×2 matrix:

Application Portfolio Management (APM) draws inspiration from financial portfolio management, which has been around since at least the 1970s. By looking at all applications and services in the organization and analyzing their costs and benefits, you can determine the most effective way to manage them as part of a larger overall strategy. This allows the architect or engineering leader to take a more strategic approach to managing their application portfolio backed by data. Portfolio management is crucial for creating a holistic view of your team’s technology landscape and making sure that it aligns with business goals.

This is for C-levels and VPs rather than individual contributors, but acts as a good way of thinking about a portfolio of applications and what to do with each.

Comments closed

Converting JSON to a Relational Schema with KQL

Devang Shah does some flattening and moving:

In the world of IoT devices, industrial historians, infrastructure and application logs, and metrics, machine-generated or software-generated telemetry, there are often scenarios where the upstream data producer produces data in non-standard schemas, formats, and structures that often make it difficult to analyze the data contained in these at scale. Azure Data Explorer provides some useful features to run meaningful, fast, and interactive analytics on such heterogenous data structures and formats. 

In this blog, we’re taking an example of a complex JSON file as shown in the screenshot below. You can access the JSON file from this GitHub page to try the steps below.

Click through for the example, which is definitely non-trivial.

Comments closed

Common Challenges Implementing PySpark Code

Amlan Patnaik looks at some common implementation problems:

Pyspark has become one of the most popular tools for data processing and data engineering applications. It is a fast and efficient tool that can handle large volumes of data and provide scalable data processing capabilities. However, Pyspark applications also come with their own set of challenges that data engineers face on a day-to-day basis. In this article, we will discuss some of the common challenges faced by data engineers in Pyspark applications and the possible solutions to overcome these challenges.

Read on for five such challenges.

Comments closed

Sampling and Inconsistent Result Counts

Kevin Wilkie does the math:

One of the things you may have noticed after reading our last post on Top (found here) is that sometimes SAMPLE doesn’t give the answer you want.

For example, we can run the same query to get 20% of the table. Remember that this table has 290 rows in total.

After seeing two runs return 69 and then 50 rows, respectively, Kevin digs in and finds out why. This got me thinking about whether a one-pass scan, assigning values based on a uniform distribution (which sounds like what is happening here) would be faster than random sampling without replacement over an array of 8-byte pointers, but then I realized that it’s way too early in the morning for me to be thinking architecture.

Comments closed

Adding Microseconds to ADF Timestamps

Rayis Imayev can’t wait for the next second:

The current addToTime function (https://learn.microsoft.com/en-us/azure/data-factory/control-flow-expression-language-functions#addToTime) in Azure Data Factory (ADF) only supports a specific set of time units ranging from Year to Seconds. Since I needed to increment a timestamp by microseconds, I had to find an alternative solution in ADF. Here are my findings on how to do this using an alternate approach.

Click through for Rayis’s solution to the problem.

Comments closed

A Post-Migration Go-Live Plan for SQL Server 2022

Brent Ozar takes it slow:

You’re planning to migrate to SQL Server 2022, and you want your databases to be faster after the migration.

This is NOT a blog post about how to migrate – that’s the same as it’s been for a long time. Go build the new 2022 servers, and use log shipping or database mirroring to sync the old and new servers. (I’m not a fan of using Distributed Availability Groups to do version upgrades. You can, but it is a heck of a lot of moving parts to set up for a one-time migration.)

This is a blog post about what to do after you migrate:

Click through for five steps and what they entail.

Comments closed

Version Control for Power BI Datasets

Richard Swinbank improves on a prior version control system:

In the previous post, I outlined a possible workflow for Power BI development, and implemented an Azure DevOps pipeline to show how steps in such a workflow could be automated. To build the pipeline I stored an entire .pbix report file – data and all – in version control, which is a problem for at least two reasons:

  • storing large report files in a version control system won’t scale well
  • datasets may contain confidential or sensitive data which must be kept out of version control.

In this post I’ll look at separating a report’s dataset from its visuals, version controlling the standalone dataset (without data), and deploying the dataset automatically to Power BI.

Read on for the process.

Comments closed

CETAS to Parquet Files in Azure SQL Managed Instance

Michael Bourgon gives CETAS a chance:

TL;DR – the below lines will allow you to query a table on your MIcreating Parquet files in Azure blob storageAnd you can query it! Next up is partitioning over time, etc, etc. But this is freaking fantastic. I have a python script I wrote that does it, but it’s nowhere as nice/easy as this.

Why do you care? Because it’s a fantastically easy way to archive older data to blob storage, and I suspect (need to test) that if you do it right, you can then have it go to cool/archive storage via a lifecycle setup, so that if you need it much later, you can.

Yep, this is historically one of the best use cases for PolyBase. Unfortunately, we can’t do this in SQL Server 2022, though you can in pre-2022 versions using the Hadoop process. Given that it’s now available in SQL MI, I wouldn’t be too shocked to see it on-premises at some point, with the big question being in SQL Server 2022 or vNext.

2 Comments

Updating an Always Encrypted Column

Chad Callihan makes an update:

When recently troubleshooting an issue, I needed to update a database record to test application functionality. Because the table had an Always Encrypted column, some extra steps were needed to make the UPDATE successfully. Let’s look at the error encountered and how it was resolved.

Click through for the error and see how Chad got around the problem. This is definitely one of those head-scratcher solutions, where you can kind of understand why it’s necessary but still think the required process is dumb.

Comments closed

Optimizing Kafka Infrastructure Costs

Addison Huddy saves some money:

In this first blog, we’re going to run through the infrastructure costs of running Kafka—i.e., compute, storage, networking, and the additional tooling you need to keep Kafka up and running smoothly. We won’t bury the lede—if you’re running Kafka in the cloud across multiple AZs (as most do for high availability), networking likely represents over 50% of your Kafka infrastructure costs. Let’s see how this ends up being the case.

Click through for some thoughts on how to reduce network costs, using AWS as an example.

Comments closed