Press "Enter" to skip to content

Day: January 5, 2024

LOWESS Smoothing in R

Steven Sanderson had me thinking of LOESS but then, bam!, snuck this in on me:

Locally Weighted Scatterplot Smoothing, or Lowess, is a powerful technique for capturing trends in noisy data. It’s particularly useful when dealing with datasets that exhibit complex patterns that might be missed by other methods. So, let’s get our hands dirty and start coding!

Read on for an example of LOWESS smoothing, which actually is a little different from LOESS. If you’re interested in learning more about the differences between LOESS and LOWESS, this Stack Exchange question and answer page is really good.

Comments closed

Using Schema Registry for Data Quality in Apache Kafka

Kai Waehner talks data quality:

Good data quality is one of the most critical requirements in decoupled architectures, like microservices or data mesh. Apache Kafka became the de facto standard for these architectures. But Kafka is a dumb broker that only stores byte arrays. The Schema Registry enforces message structures. This blog post looks at enhancements to leverage data contracts for policies and rules to enforce good data quality on field-level and advanced use cases like routing malicious messages to a dead letter queue.

Click through to learn more about the topic. This focuses a lot on the “why” and “what” but does have an example of “how” in there as well.

Comments closed

Incremental Backup in Postgres

Robert Haas talks about a new feature:

Five days before Christmas I committed my patch to add incremental backup to PostgreSQL. Actually, I’ve been committing preparatory patches for some months now, but December 20 saw the two main patches land. Since then, there’s been a bunch of bug-fix commits, and there are still a few pending items that need to be addressed, but the core of the feature is now committed. If you want a quick overview of the feature, Lukas Fittl has a great video about that. Here, I’d like to talk about the architecture of the feature itself in a little more detail, and specifically with how we decide which data to copy.

Most people who are likely to read this blog are probably already familiar with the core idea of an incremental backup: instead of copying the whole database instance, just copy the data that has changed. That’s faster, and takes up less space on disk. But, to work properly, you have to be able to quickly and reliably identify which data has, in fact, changed. There’s more than one way to do that.

Read on for some of the complexity around this. It’s interesting to see what goes on behind the scenes in a relational database.

Comments closed

Pagination in Stored Procedures

Erik Darling hits on a bugbear of mine:

A common-enough practice to limit search results is to write paginated queries. You may give users the ability to specify how many rows per page they want to see, or you may have a set number of rows per page.

But the end goal is to allow users to very quickly get a smaller number of rows returned to them. Almost no one needs to get many thousands of rows back, unless they’re planning on exporting the data.

Pagination is such a common activity that I wish there were a way to say, “Here is my data. Hang onto it in memory and quickly retrieve a subset of rows upon request” without doing all kinds of shenanigans on my end. Something like a data snapshot that remains in memory as long as the session is active, until the service restarts, until there is memory pressure, or until the caller manually evicts the data. That would make OFFSET and FETCH really useful instead of barely usable for most pagination scenarios because you wouldn’t need to re-run the entire query for every offset/fetch page.

There are ways to make pagination efficient, but the efficient ones aren’t easy or intuitive.

Comments closed

New Challenge: 2D Interval Packing

Itzik Ben-Gan has a new challenge for the new year:

Packing intervals is a classic SQL task that involves packing groups of intersecting intervals to their respective continuous intervals. In mathematics, an interval is the subset of all values of a given type, e.g., integer numbers, between some low value and some high value. In databases, intervals can manifest as date and time intervals representing things like sessions, prescription periods, hospitalization periods, schedules, or numeric intervals representing things like ranges of mile posts on a road, temperature ranges, and so on.

An example for an interval packing task is packing intervals representing sessions for billing purposes. If you conduct a web search of the term packing intervals SQL, you’ll find scores of articles on the subject, including my own.

Read on for more information about this challenge and one solution to it.

Comments closed