Press "Enter" to skip to content

Day: July 15, 2025

Data Cleansing Tips in Pandas

Jayita Gulati shares some tips:

Data preparation is one of the most time-consuming parts of any data science or analytics project, but it doesn’t have to be. With the proper techniques, Pandas can help you quickly transform messy and complex datasets into clean, ready-to-analyze formats. From handling missing data to reshaping and optimizing your DataFrames, a few tricks can save you hours of work.

In this article, you will discover seven practical Pandas tips that can speed up your data prep process and help you focus more on analysis and less on cleanup.

Two of the tips are basically “use functional programming techniques,” and I’m okay with that.

Leave a Comment

Ingesting Logs into Microsoft Fabric Real-Time Intelligence via Logstash

Surya Teja Josyula and Ramachandran G. use one part of the ELK stack:

Logstash is an open-source data processing tool that enables the collection, transformation, and forwarding of data from a wide variety of sources. It acts as a data pipeline engine, helping organizations manage and streamline the flow of structured and unstructured data across systems.

Whether you’re managing infrastructure logs, application events, or telemetry data, this guide will walk you through setting up a seamless pipeline that bridges raw log data with real-time analytics in Fabric.

Click through for the process.

Leave a Comment

Regular Expressions in SQL Server 2025

Ed Pollack digs into some new functionality:

String-searching in SQL Server has always been a mighty hassle. Balancing performance and horribly-complex queries is a compromise that no one enjoys. 

Generally speaking, a relational database is not an ideal place to search large amounts of text. Even when leveraging features such as Full-Text Indexing, the ability for an application to leverage speedy text-searching decreases as data becomes larger. If a service optimized for text-search can be used, such as Elasticsearch or Azure AI Search, then it will be far easier to deliver accurate results quickly. 

Ed focuses on the mechanisms available rather than performance, and that’s the current sticking point. Whether regular expression queries will get faster in subsequent CTPs or SQL Server 2025 RTM, we’ll see.

Leave a Comment

Copying Azure SQL Databases between Subscriptions

Kenneth Fisher is back in the fight:

I recently had to copy an Azure SQL database (SQL db) from one subscription to an Azure SQL Server instance in another subscription. All of the help I found suggested going to the database and hitting the COPY option. Unfortunately, when I did, I ran into a problem.

Read on for the issue, as well as one way to fix it. The route Kenneth landed on was the same one I ended up going with when I had a similar problem and very limited access to SQL DB on both subscriptions.

Leave a Comment

Fixing Slow Row-Level Security Policies in PostgreSQL

Dian Fay troubleshoots some row-level security slowness:

At my day job, we use row-level security extensively. Several different roles interact with Postgres through the same GraphQL API; each role has its own grants and policies on tables; whether a role can see record X in table Y can depend on its access to record A in table B, so these policies aren’t merely a function of the contents of the candidate row itself. There’s more complexity than that, even, but no need to get into it.

Read on for a dive into row-level security and several tips to make the operation faster.

Leave a Comment

Adaptive Joins and Equivalent Plans

Forrest McDaniel has a public service announcement:

In case you haven’t heard, moral decline is everywhere. The latest generation of query operators is a prime example of this rot. “But this is programming” you say, “what do morals have to do with anything?” No. This is SQL. And for us, we have the term morally equivalent plans which allows me to finally have an ethical excuse for sprinkling a post with degenerate puns.

I bit my tongue specifically to avoid doing exactly what Forrest does. So click through to read all of the terrible puns.

Leave a Comment

Efficiency of Sparse Hash Tables in PostgreSQL

Ashutosh Bapat runs some tests:

The hash_create() API in PostgreSQL takes initial size as an argument. It allocates memory for those many hash entries upfront. If more entries are added, it will expand that memory later. The point of argument was what should be the initial size of the hash table, introduced by that patch, containing the derived clauses. During the discussion, David hypothesised that the size of the hash table affects the efficiency of the hash table operations depending upon whether the hash table fits cache line. While I thought it’s reasonable to assume so, the practical impact wouldn’t be noticeable. I thought that beyond saving a few bytes choosing the right hash table size wasn’t going to have any noticeable effects. If an derived clause lookup or insert became a bit slower, nobody would even notice it. It was practically easy to address David’s concern by using the number of derived clauses at the time of creating the hash table to decide initial size of the hash table. The patch was committed.

Read on to see how things didn’t quite turn out this way, and what the results of testing look like.

Leave a Comment