Press "Enter" to skip to content

Category: Data

XML Support in MySQL and Postgres

Aisha Bukar lays out how XML works in a pair of relational platforms:

XML (Extensible Markup Language) may no longer dominate modern web APIs the way it once did, but it still plays a critical role in many enterprise systems. Financial institutions, publishing platforms, healthcare systems, government agencies, and large legacy applications continue to rely heavily on XML for structured data exchange and long-term interoperability.

XML also remains deeply embedded in technologies such as SOAP-based APIs, enterprise messaging systems, configuration files, and document-centric workflows where strict structure and validation are essential. This is largely because, unlike lightweight formats such as JSON, XML was designed to handle complex hierarchical documents, namespaces, schemas, and mixed content.

Read on to see how the two open-source relational database platforms handle XML data.

Leave a Comment

Anonymizing PII in Language Model Pipelines

Dejan Lukic tries to minimize personally identifiable information:

Large language models (LLMs) and the agents built on top of them ingest everything they are given, including personally-identifiable information (PII). In workflows where PII is inevitable, proper measures should exist for data sanitization.

Data can leak through model outputs, embeddings or even logs. Given that you have to use LLMs in your pipeline, in this article I will cover the anonymization techniques you can utilize in an LLM flow to minimize PII exposure vectors.

Some of the points are specific to language models (or at least storing data in vector databases), but other tips are more generic and can apply to classic data handling.

Leave a Comment

Explaining PostgreSQL Large Record Storage with TOAST

Radim Marek looks off-page:

In earlier posts in this series we established that every heap tuple lives inside a strict 8KB page. Everything else is built on top of that hard limit: MVCCHOT updates, and indexes that point at (page, line_pointer). And yet this still works:

CREATE TABLE docs (id int PRIMARY KEY, body jsonb);
INSERT INTO docs VALUES (1, (SELECT jsonb_agg(g) FROM generate_series(1, 100000) g));

That body value is somewhere north of half a megabyte. The heap page is still 8KB. Both statements are true at the same time, and the mechanism that makes them coexist is TOASTThe Oversized-Attribute Storage Technique.

Read on to see how TOAST works, when it kicks in, and some of the consequences of this solution.

Comments closed

The Pain of NULL

Louis Davidson explains the unknown:

There is no simpler topic in relational comparisons than three valued logic. I am being mostly facetious about this, but in reality, it seems so simple that people don’t think about how a NULL works, and make mistakes all of the time. I was reading a post about this the other day on LinkedIn (which by no means could one ever find again!) where one of the comments chastised the author of the post for not understanding “the fundamentals” of relational theory. The original poster wasn’t completely right (and my post may not be completely perfect either, though I will back most of what I write with code.)

In this post I want to point out a few of the key basics that one really should understand.

Click through for a primer on what NULL means and doesn’t mean. And by the time you’re done, I’d like to interest you in the power of 6th normal form, where you can effectively banish NULL into the abyss (at least until you join the bits back together).

Comments closed

Generating Sample Data in Fabric Dataflows

Chris Webb builds some data:

Back in December the FabricAI.Prompt() M function was released in Fabric Dataflows Gen2. Most of the people writing about it at that time, as in this great post by my colleague Sandeep Pawar, focused on calling this function for each row in a table – something that the UI in the editor makes easy. However the FabricAI.Prompt() function itself is a lot more flexible. You can use it to summarise whole tables of data as I showed here; you can also use it to generate sample data. This is similar to what I blogged about here where I got Copilot to generate M code that returned sample data but using FabricAI.Prompt() is maybe a bit simpler.

Click through to see how.

Comments closed

Data Extraction from Unstructured Data with Fabric AI Functions

Sandeep Pawar demonstrates functionality:

Most enterprise data lives in free text – tickets, contracts, feedback, clinical notes, and more. It holds critical information but doesn’t fit into the structured tables that pipelines expect. Traditionally, extracting structure meant rule-based parsers that break with every format to change, or custom NLP models that take weeks to build. LLMs opened new possibilities, but on their own they bring inconsistent outputs, no type of enforcement, and results that vary between runs. What production workflows need is LLM intelligence with structured-output guarantees, delivered inside the data platform teams already use.

Microsoft Fabric AI Functions deliver exactly that. Functions like ai.summarize, ai.classify, ai.translate, and ai.extract let you transform and enrich unstructured data at scale with a single line of code – no model deployment or ML infrastructure needed. For the full list, see Transform and enrich data with AI functions.

Click through for an example. The tricky part of this is, because answers won’t be deterministic, you have to do a lot of testing and verification to ensure things are working reasonably well.

Comments closed

Bluebox: An Evolving Sample Database for PostgreSQL

Ryan Booz has a sample database:

Sure, there are datasets everywhere. Kaggle currently lists over 600,000 public datasets, but most of them are static CSV files that you load once and never touch again. Great for a one-time analysis, not so great for learning how a real database behaves over time. The Postgres Wiki lists a few dozen sample databases, too. And shoot, your shiny new AI coding buddy can help you create one if you want to put the time in.

The problem with most of these datasets is that they’re primarily static. If you’re lucky, some of the datasets might produce new data dumps once a month to keep things “current”. But the problem is that you can’t really practice query tuning if your data never changes. You can’t explore vacuum behavior when there are no updates. You can’t test monitoring tools when nothing is happening.

Click through for more information on Bluebox, as well as a Docker container containing several helpful tools and processes to make this data evolve over time.

Comments closed

NYC Open Data R Package

Antoine Soetewey announces a package:

I am pleased to announce the release of nycOpenData, an R package providing convenient, tidy access to dozens of datasets from the New York City Open Data platform.

The package is designed as part of an open-science and reproducible-research effort, with the goal of lowering the friction between public data and statistical analysis—especially for teaching, exploratory research, and applied civic work.

It is available on CRAN, so it should be easy to grab. H/T R-Bloggers.

Comments closed

2026 Data Professional Salary Survey Results

Brent Ozar has another year of data:

The results are in! You can download the raw data in Excel for all 10 years and do some slicing and dicing to find out whether you’re underpaid, overpaid, or what it looks like for folks who are out there looking for work.

This year, I added a couple of new items to the survey asking about folks who are unemployed and currently looking for work. In hindsight, I wish I would have done this long ago so that we could have a baseline to know whether things have gotten better or worse. Ah, well – the best time to plant a tree was 20 years ago, and the second-best time is now. Let’s dig into the data and see what we find.

Click through for Brent’s initial findings and have fun analyzing the data.

Comments closed

Gaps in Identity Columns

Brent Ozar explains why there can be gaps in identity columns:

And you use that identity number for invoice numbers, or customer numbers, or whatever, and you expect that every single number will be taken up. For example, your accounting team might say, “We see order ID 1, 3, and 4 – what happened to order ID #2? Did someone take cash money from a customer, print up an invoice for them, and then delete the invoice and pocket the cash?”

Well, that might have happened. But there are all kinds of reasons why we’d have a gap in identities. One of the most common is failed or rolled-back transactions. To illustrate it, let’s start a transaction in one window:

I have a talk on applying forensic accounting techniques using SQL and Python (as well as an older version using R) and this is one of the things I bring up. In cases where you absolutely need contiguous numbers, the best I can do for you is no identity column and a stored procedure that runs in a SERIALIZED transaction isolation level, using an app lock to prevent anybody else from calling the stored procedure concurrently, taking a table lock out on the relevant table prior to doing any real work, and hard blocking everybody else until your transaction either succeeds or fails. And I’m not even 100% sure on that if you have enough concurrency to matter.

Comments closed