Press "Enter" to skip to content

Curated SQL Posts

Analyzing SQL Server Table Metadata

Barney Lawrence looks at details about a table:

For a while I’ve been building up a script that uses SQL Server’s INFORMATION_SCHEMA.COLUMNS view to create a query that applies aggregates to each column in a table and then formats the results to allow an at-a-glance summary of the shape of the data inside it.

I showed some of the techniques used to build this in a session regarding metadata and data warehousing, which I co-presented with Emma Dolling and Ruth Pearson at SQLBits. The one comment that I heard the most after this session was “I want that script”. I promised to make it available more than a few times, and so I’ve put together a GitHub repository to hold it and other resources we built for the session.

Barney anticipated the most likely change I’d make while reviewing the script: use APPROX_COUNT_DISTINCT() if you’re on SQL Server 2019 or later and dealing with a large table.
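
To make that concrete, here’s a minimal sketch of the metadata-driven idea (this is not Barney’s actual script; the schema and table names are placeholders). It builds one UNION ALL branch per column from INFORMATION_SCHEMA.COLUMNS, with APPROX_COUNT_DISTINCT() swapped in per my suggestion:

```sql
-- Hypothetical sketch, not Barney's script: build one UNION ALL branch per
-- column of a target table and run it. APPROX_COUNT_DISTINCT() needs SQL
-- Server 2019+. A real script would also filter out column types that
-- COUNT / APPROX_COUNT_DISTINCT can't handle (text, image, xml, etc.).
DECLARE @schema sysname = N'dbo',      -- assumed schema
        @table  sysname = N'MyTable',  -- assumed table
        @sql    nvarchar(max);

SELECT @sql = STRING_AGG(CONVERT(nvarchar(max),
           N'SELECT ''' + c.COLUMN_NAME + N''' AS ColumnName,
                    COUNT(*) AS TotalRows,
                    COUNT(' + QUOTENAME(c.COLUMN_NAME) + N') AS NonNullRows,
                    APPROX_COUNT_DISTINCT(' + QUOTENAME(c.COLUMN_NAME) + N') AS ApproxDistinctValues
             FROM ' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table)),
           N' UNION ALL ')
FROM INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.TABLE_SCHEMA = @schema
  AND c.TABLE_NAME   = @table;

EXEC sys.sp_executesql @sql;
```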

Using the OpenLibrary ISBN API with PowerShell

Robert Cain has been working on a neat project:

In this post we’ll begin with an overview of what an ISBN is. We’ll then talk about the website that will be the source for our data. It has two different web APIs (Application Programming Interfaces) that we can use. We’ll discuss one here, then cover the advanced version in the next blog post.

First though, if you haven’t read the introductory post in this series, The ArcaneBooks Project – An Introduction, I’d highly recommend doing so, as it lays the foundation for why and how this project to gather ISBN data originated.

Robert is building this up over a series of posts, so stay tuned.
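
To give a flavor of where the series is headed, here’s a minimal sketch (not Robert’s ArcaneBooks code; the function name is made up) that pulls a book record from OpenLibrary’s simpler ISBN endpoint with Invoke-RestMethod:

```powershell
# Hypothetical sketch, not the ArcaneBooks implementation: query the
# OpenLibrary ISBN endpoint and return a few fields of interest.
function Get-IsbnData {
    param (
        [Parameter(Mandatory)]
        [string] $ISBN
    )

    # Strip any dashes or spaces the caller included in the ISBN
    $cleanIsbn = $ISBN -replace '[\s-]', ''

    $uri = "https://openlibrary.org/isbn/$cleanIsbn.json"

    # Invoke-RestMethod deserializes the JSON response into an object
    $book = Invoke-RestMethod -Uri $uri -Method Get

    [PSCustomObject]@{
        ISBN        = $cleanIsbn
        Title       = $book.title
        PublishDate = $book.publish_date
        Publishers  = $book.publishers -join '; '
    }
}

# Example usage:
Get-IsbnData -ISBN '978-0-9802004-4-7'
```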

Data Validation with Great Expectations and Azure Functions

Eduard van Valkenburg does a bit of data validation:

Great Expectations is a popular Python-based OSS tool to validate data that comes into your data estate. And for me, validating incoming data is best done file by file, as the files arrive! On Azure there is no better platform for that than Azure Functions. I love the serverless nature of Functions, and with the triggers available for arriving blobs, as well as HTTP, Event Grid events, queues, and others, there are some great patterns that allow you to build event-driven data architectures. We also now have the Python v2 programming model for Azure Functions available, which makes the developer experience even better. So, let’s go through how to get it running.

This looks really interesting, and tying it into Azure Functions is a good idea, assuming the checks don’t run for too long.
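
As a rough sketch of the shape this takes (Eduard’s post has the real details; I’m assuming the legacy pandas-style Great Expectations API and a container named incoming), a Python v2 blob-triggered function might look like:

```python
# Hypothetical sketch: validate each arriving CSV with Great Expectations.
# Assumptions: legacy pandas-style GE API (pre-1.0) and a blob container
# named "incoming" -- not Eduard's actual function.
import io

import azure.functions as func
import great_expectations as ge
import pandas as pd

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob", path="incoming/{name}",
                  connection="AzureWebJobsStorage")
def validate_incoming_file(blob: func.InputStream):
    # Read the arriving CSV into a pandas DataFrame
    df = pd.read_csv(io.BytesIO(blob.read()))

    # Wrap it so expectations can run directly against it
    gdf = ge.from_pandas(df)

    # A couple of illustrative expectations; a real suite would live in config
    gdf.expect_column_values_to_not_be_null("order_id")
    gdf.expect_column_values_to_be_between("quantity", min_value=0, max_value=10000)

    results = gdf.validate()
    if not results.success:
        # Surface the failure; a real function might dead-letter the file instead
        raise ValueError(f"Validation failed for {blob.name}: {results}")
```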

Pro Encryption in SQL Server Errata

Matthew McGiffen made the first mistake—admitting fault for anything, ever:

My biggest fear when my book went into production was that factual errors had slipped through my checks and the various reviews. I had a lot of reviewer support from Apress, but at the end of the day any issues are my responsibility.

So far I’m not aware of any factual errors, but one kind reader (Ekrem Önsoy) has shared with me a few typos they have found. I’m going to document them here and will keep this post up to date as I’m made aware of any others.

Mistakes in 300 pages of text will happen, no matter how many times you go through your magnum opus. For example, I hate the fact that I went through every chapter of PolyBase Revealed 8 or 9 times to weed out any little typo. Then, as soon as I got my copies of the print edition in, I flipped open to a random page and immediately spotted a typo.

An Overview of SQL Server Security Options

Ben DeBow gives us a once-over of things you can do to harden a SQL Server instance:

Microsoft SQL Server is one of the most secure platforms available, but companies need to deploy, configure, and implement it correctly – along with implementing its built-in security features – in order to ensure their systems are fully protected. Here, we’ll explore six of the most important security features and how to implement them to enhance your SQL Server security.

This isn’t a how-to guide so much as it is a what-you-can-do guide.
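
To give a sense of what implementing one of those built-in features looks like, here’s a hedged sketch of enabling Transparent Data Encryption, a classic hardening step (I can’t say whether it’s among Ben’s six; the database and certificate names are placeholders):

```sql
-- Hypothetical sketch: enable Transparent Data Encryption (TDE).
-- Names are placeholders; back up the certificate and keys immediately,
-- or you won't be able to restore the encrypted database elsewhere.
USE master;

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<use a strong password here>';

CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate';

USE MyDatabase;  -- assumed database name

CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_256
    ENCRYPTION BY SERVER CERTIFICATE TdeCert;

ALTER DATABASE MyDatabase SET ENCRYPTION ON;
```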

Database Backups with dbatools

Chad Callihan backs that database up:

Keeping with the recent PowerShell trend, let’s use PowerShell to accomplish a primary task of any database administrator: backups. With PowerShell and dbatools, you can do a simple backup or add a range of options to fit your needs.

I’d also like to call out that it’s really easy to set configuration options with dbatools, such as buffer count and max transfer size.
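
For example, here’s a sketch of a backup with those knobs set (the instance, database, and share names are made up):

```powershell
# Hypothetical sketch: a full, compressed backup via dbatools, with
# BufferCount and MaxTransferSize tuned. Names are placeholders.
Backup-DbaDatabase -SqlInstance 'SQL01' `
    -Database 'AdventureWorks' `
    -Path '\\backupshare\sql' `
    -Type Full `
    -CompressBackup `
    -BufferCount 64 `
    -MaxTransferSize 4MB
```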

Mnemonics for Remembering SQL Clause Order

Bob Pusateri keeps it all straight in his head:

Ooh! A mnemonic! And a pretty good one at that. The idea being that the first letter of each word of the sentence helps you remember something else, like the order of the major parts of a SELECT statement:

SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY

Click through for Bob’s best attempts. Oddly enough, now I want some Rally’s fries. Or maybe Checkers—I can’t decide.
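
If you want the order in action, here’s a throwaway example (hypothetical Orders table) using all six clauses in the written order the mnemonic encodes:

```sql
-- Made-up example: all six major clauses in written order.
SELECT   CustomerID, COUNT(*) AS OrderCount
FROM     dbo.Orders
WHERE    OrderDate >= '20230101'
GROUP BY CustomerID
HAVING   COUNT(*) > 5
ORDER BY OrderCount DESC;
```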

An Overview of Kafka Streams

The Instaclustr team explains how stream processing works in Kafka Streams:

Kafka Streams is a client library providing organizations with a particularly efficient framework for processing streaming data. It offers a streamlined method for creating applications and microservices that must process data in real-time to be effective. Using the Streams API within Apache Kafka, the solution fundamentally transforms input Kafka topics into output Kafka topics. The benefits are important: Kafka Streams pairs the ease of utilizing standard Java and Scala application code on the client end with the strength of Kafka’s robust server-side cluster architecture.

Read on for an overview of how it works. And if you haven’t already, check out the prior post on Kafka so that you can experience the same slight mental perturbations I did when reading about “real-time” responses.
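
If you haven’t seen the Streams API before, here’s a minimal sketch of the canonical word-count topology (the topic names and broker address are assumptions) showing that input-topic-to-output-topic transformation:

```java
// Minimal sketch of a Kafka Streams topology: read lines from an input
// topic, count words, and write the counts to an output topic.
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Input topic -> split lines into words -> group and count -> output topic
        KStream<String, String> lines = builder.stream("input-lines");
        KTable<String, Long> counts = lines
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            .groupBy((key, word) -> word)
            .count();
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Close cleanly on shutdown
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```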

Real-Time Data Streaming and Apache Kafka

Kai Waehner explains how Apache Kafka is not real-time:

Real-time data beats slow data. It is that easy! But what is real-time? The term always needs to be defined when discussing a use case. Apache Kafka is the de facto standard for real-time data streaming. Kafka is good enough for almost all real-time scenarios. But dedicated proprietary software is required for niche use cases. Kafka is NOT the right choice if you need microsecond latency! This article explores the architecture of NASDAQ that combines critical stock exchange trading with low-latency streaming analytics.

Kai uses the much more appropriate term “near real-time,” which I agree with. My mental example of “real-time” is software that you’d put on a fighter jet (an actual example of a real-time operating system from my undergrad days). If people potentially die because your software takes 4 milliseconds to do a job it needs to do in 100 microseconds, that’s real-time. For most of us, near real-time is certainly enough.

Actually, I’d go one step further: for most of us, not-really-real-time is fine. So many cases of “The users need this data in real time!” boil down to “The users really only look at this once a day, couldn’t act on faster information, and some of our data sources only update once a day.” Swap ‘once a day’ for ‘once an hour’ or something like that and you have the large majority of projects which started out with “near real-time” requirements.
