Day: June 26, 2026

Clustering Text via Embeddings and HDBSCAN

Iván Palomares Carrascosa groups things together:

In this article, you will learn how to build a text clustering pipeline by combining large language model embeddings with HDBSCAN, a density-based clustering algorithm, to automatically discover topics in unlabeled text data.

Topics we will cover include:

  • How to generate text embeddings for raw documents using a pre-trained sentence-transformers model.
  • How to reduce the dimensionality of those embeddings with UMAP to prepare them for clustering.
  • How to apply HDBSCAN to automatically discover topic clusters and visualize the results.

This is a pretty neat trick that takes advantage of the embedding model’s ability to convert raw text into hundreds (or thousands) of floating point numbers while maintaining enough of the context to differentiate ideas. Much of it builds on the original word2vec concepts, just scaled up.
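
For a rough idea of how the three steps fit together, here is a minimal sketch, not the code from the article. It assumes the `sentence-transformers`, `umap-learn`, and `scikit-learn` (1.3 or later, for `sklearn.cluster.HDBSCAN`) packages are installed. The model name `all-MiniLM-L6-v2` and the parameter values are illustrative guesses rather than the author's choices, and `cluster_texts` and `summarize_clusters` are hypothetical helper names.

```python
from collections import Counter


def cluster_texts(texts, model_name="all-MiniLM-L6-v2",
                  n_components=5, min_cluster_size=5, random_state=42):
    """Embed, reduce, and cluster a list of strings; returns one label per text."""
    # Third-party imports are deferred so the module loads without them.
    from sentence_transformers import SentenceTransformer
    import umap
    from sklearn.cluster import HDBSCAN

    # Step 1: each document becomes a dense vector of floats.
    embeddings = SentenceTransformer(model_name).encode(texts)

    # Step 2: shrink the vectors to a few dimensions before clustering.
    reduced = umap.UMAP(
        n_components=n_components,
        metric="cosine",
        random_state=random_state,
    ).fit_transform(embeddings)

    # Step 3: density-based clustering; the label -1 marks noise points.
    labels = HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    return [int(label) for label in labels]


def summarize_clusters(labels):
    """Count documents per cluster, reporting noise (-1) separately."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)
    return {"clusters": dict(sorted(counts.items())), "noise": noise}


# Typical use, given a reasonably large list of documents:
#   labels = cluster_texts(documents)
#   print(summarize_clusters(labels))
```

HDBSCAN does not need the number of topics up front, which is why it suits topic discovery, but it will leave some documents unassigned as noise. UMAP also needs more than a handful of documents to produce a useful projection.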

Alerting on Checkpoint Time in Postgres

Jeremy Schneider shares some advice:

Checkpoint is the heart of your database. It’s buried deep inside. It’s not something everyone talks about, like well-tuned autovacuum or fast queries. But if checkpointer stops beating, then you’re dead.

In addition to its well-understood job of getting dirty pages written from cache to disk in the background, it also has many smaller jobs that are less widely known. Management of a few shared-memory config settings like sync_standby_names and full_page_writes. Fsync Batching. Deferred file unlinks. Enforcement of archive_timeout.

Click through to see what happens when checkpoint time starts increasing and one important thing you should not do.
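
As one possible way to watch checkpoint time (not necessarily the approach Jeremy takes), the sketch below parses the "checkpoint complete" lines that Postgres writes when `log_checkpoints` is on, which is the default from PostgreSQL 15. The 300-second threshold, the function names, and the sample line are all assumptions for illustration; the sample is hand-written in the server's log format with made-up values, not output from a real system.

```python
import re
from typing import Optional

# Matches the timing portion of a "checkpoint complete" log line.
CHECKPOINT_RE = re.compile(
    r"checkpoint complete:.*?"
    r"write=(?P<write>\d+\.\d+) s, "
    r"sync=(?P<sync>\d+\.\d+) s, "
    r"total=(?P<total>\d+\.\d+) s"
)


def checkpoint_seconds(line: str) -> Optional[dict]:
    """Return write/sync/total seconds, or None if the line is not a match."""
    match = CHECKPOINT_RE.search(line)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


def should_alert(line: str, max_total_seconds: float = 300.0) -> bool:
    """True when a checkpoint took longer than the (assumed) threshold."""
    timing = checkpoint_seconds(line)
    return timing is not None and timing["total"] > max_total_seconds


# Hand-written sample in the server's log format; every value is made up.
SAMPLE = (
    "2026-06-26 10:15:02 UTC [812] LOG:  checkpoint complete: "
    "wrote 5120 buffers (31.3%); 0 WAL file(s) added, 0 removed, 3 recycled; "
    "write=269.612 s, sync=0.084 s, total=269.731 s; "
    "sync files=58, longest=0.021 s, average=0.002 s; "
    "distance=49152 kB, estimate=49152 kB"
)

print(checkpoint_seconds(SAMPLE))
print(should_alert(SAMPLE))
```

A sensible threshold depends on `checkpoint_timeout` and `checkpoint_completion_target`. With the defaults, a healthy timed checkpoint already spends most of the five-minute interval writing, so the trend over time matters more than any single value.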

Handling Growth in sysjobhistory

Aaron Bertrand has a lot of jobs:

In the first few days of my new role at Infios, we came across an interesting case of memory exhaustion. A whole slew of memory-related error messages would populate the errorlog, then some stack and memory dumps would appear, and then the SQL Server service would just shut itself down without warning. Some of the errors we observed (apologies, it’s a long list, but I want to make sure that any subset might land you here):

Click through to see if you have any of those issues and one possibility of what the cause might be, as well as how to deal with it.

XML Support in MySQL and Postgres

Aisha Bukar lays out how XML works in a pair of relational platforms:

XML (Extensible Markup Language) may no longer dominate modern web APIs the way it once did, but it still plays a critical role in many enterprise systems. Financial institutions, publishing platforms, healthcare systems, government agencies, and large legacy applications continue to rely heavily on XML for structured data exchange and long-term interoperability.

XML also remains deeply embedded in technologies such as SOAP-based APIs, enterprise messaging systems, configuration files, and document-centric workflows where strict structure and validation are essential. This is largely because, unlike lightweight formats such as JSON, XML was designed to handle complex hierarchical documents, namespaces, schemas, and mixed content.

Read on to see how the two open-source relational database platforms handle XML data.
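
To make the namespace and mixed-content points in the excerpt concrete, here is a small sketch using Python's standard `xml.etree.ElementTree` module rather than either database. The document, the `fin` prefix, and the `urn:example:finance` URI are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the namespace URI and element names are invented.
DOC = """<note xmlns:fin="urn:example:finance">
  <body>Pay <fin:amount currency="USD">125.50</fin:amount> by <b>Friday</b>.</body>
</note>"""

NS = {"fin": "urn:example:finance"}

root = ET.fromstring(DOC)
body = root.find("body")

# Namespaces: the prefix resolves to a URI, so the lookup is unambiguous
# even if another vocabulary also defines an <amount> element.
amount = body.find("fin:amount", NS)
amount_value = float(amount.text)
currency = amount.get("currency")

# Mixed content: plain text and child elements interleave inside <body>.
full_text = "".join(body.itertext())

print(amount.tag)
print(amount_value, currency)
print(full_text)
```

The sentence inside `<body>` is the kind of structure that is awkward in JSON: the text runs around the child elements, and the order of both matters.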

Optimizing Power BI Data Agents

Paul Turley shares some advice:

Amid the AI frenzy, there is a lot of conversation about how business users will use agentic chat to answer business questions rather than interactive, dashboard-style reports. Is there truly a shift in the industry, and is agentic analytics going to change the way most business users consume data?

Just how viable is the whole “chat with your data” option, and is it really a replacement for conventional reporting? I recently heard a VP-level leader at a large consulting firm say something to the effect of “we need to stop investing in dashboard-building skills and focus on creating AI-driven data analysis solutions for our consulting customers.” I’m paraphrasing from memory, but that was the sentiment. Are all business leaders across the industry giving up their dashboards, interactive visual reports and scorecards in exchange for AI chat? No. Of course they aren’t — but conversational analysis is a new way to consume business data.

Much of the advice is very similar to what you’d get for standard dashboard creation, and that makes sense: the clearer your data model and the tighter your semantic model, the easier it is for an agent to use that semantic model. Paul also covers some things specific to Data Agents.
