Press "Enter" to skip to content

Curated SQL Posts

Data Visualization in Matplotlib

Rajendra Gupta generates some graphics:

Data analysis requires analysts to handle structured, semi-structured, or unstructured data. Small datasets with few rows and columns are easy to understand. However, as data complexity increases with many interlinked variables, getting insights from data in a tabular format becomes challenging. According to a recent study from MIT, the human brain processes an entire image in just 13 milliseconds. Therefore, it is helpful to learn Python and visualization together.

How do we use Python to generate plots from the data to analyze patterns, correlations, and trends? What plots are available, and how do we use them with customizations? Let’s explore them in this tip.

There are a few visualization libraries in Python I prefer over matplotlib, and for static graphics, ggplot2 in R has pretty much everything else beat. But matplotlib is essentially the standard, so it’s important to know.
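For anyone brand new to the library, here is a minimal sketch of the kind of chart the tip builds, using synthetic data and a few of the most common customizations (labels, legend, grid). It is only an illustration, not code from the article:

# Minimal matplotlib sketch: a line plot with a few common customizations.
# The data here is synthetic, purely to illustrate the API.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, y, color="steelblue", linewidth=2, label="sin(x)")
ax.set_title("A basic matplotlib line plot")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()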

Comments closed

GROUP BY Clause Order and Performance in PostgreSQL

Andrei Lepikhov runs some tests:

PostgreSQL users often employ analytical queries that sort and group data by different rules. Optimising these operators can significantly reduce the time and cost of query execution. In this post, I will discuss one such optimisation: choosing the order of columns in the GROUP BY expression.

Postgres can already reshuffle the list of grouped expressions according to the ORDER BY condition to eliminate additional sorting and save computing resources. We went further and implemented an additional strategy of group-by-clause list permutation in a series of patches (the first attempt and the second one) for discussion with the Postgres community, expecting it to be included in the next version of PostgreSQL core. You can also try it in action in the commercial Postgres Pro Enterprise fork.

From a “this is a 4th generation language” perspective, there should never be a performance difference in the ordering of grouped columns in a GROUP BY clause. The database optimizer should always determine the optimal ordering. Read Andrei’s article to see how much of a challenge that can be in practice.
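If you want to see how much the ordering matters on your own data, one rough approach is to run EXPLAIN ANALYZE for both orderings and compare the plans. The sketch below does that from Python with psycopg2; the table, column names, and connection string are hypothetical placeholders:

# Compare the plans for two GROUP BY column orderings on the same query.
# Table, columns, and the connection string below are placeholders.
import psycopg2

queries = {
    "low-cardinality column first":
        "EXPLAIN ANALYZE SELECT region, customer_id, count(*) "
        "FROM sales GROUP BY region, customer_id",
    "high-cardinality column first":
        "EXPLAIN ANALYZE SELECT region, customer_id, count(*) "
        "FROM sales GROUP BY customer_id, region",
}

with psycopg2.connect("dbname=demo") as conn:
    with conn.cursor() as cur:
        for label, sql in queries.items():
            cur.execute(sql)
            plan = "\n".join(row[0] for row in cur.fetchall())
            print(f"--- {label} ---\n{plan}\n")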

Comments closed

Building a Gantt Chart

Amy Esselman wants to know if you’re on time or have any slack:

A Gantt chart is used to visually present tasks or events over time. It consists of a horizontal axis of time and horizontal bars representing the duration of each item arranged along the vertical axis.

The visual was popularized in the early 1900s by American engineer and management consultant Henry Gantt, for whom the chart is named. Most commonly, this view is used to see the start date and duration of each task in a larger project to better understand timelines, milestones, and dependencies.

Click through for an overview of the chart, where it can be useful, and some of the limitations of the chart.
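If you want to try building one yourself, a Gantt-style view is essentially horizontal bars with a left offset. Here is a minimal matplotlib sketch with made-up tasks, start days, and durations:

# A minimal Gantt-style chart using matplotlib's horizontal bars.
# Task names, start days, and durations are made up for illustration.
import matplotlib.pyplot as plt

tasks = ["Requirements", "Design", "Build", "Test", "Deploy"]
starts = [0, 5, 12, 25, 33]      # start day of each task
durations = [5, 8, 15, 10, 2]    # length of each task in days

fig, ax = plt.subplots(figsize=(8, 3))
ax.barh(tasks, durations, left=starts, color="steelblue")
ax.invert_yaxis()                # first task on top, as in a typical Gantt chart
ax.set_xlabel("Project day")
ax.set_title("Simple Gantt chart")
plt.tight_layout()
plt.show()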

Comments closed

An Overview of the Naive Bayes Class of Algorithms

Harris Amjad takes us through a rather useful class of algorithms for classification:

As AI and Machine Learning have increased in popularity, especially Large Language Models, more professionals have explored how these systems work. Unfortunately, some put the cart before the horse, taking on more complex algorithms before laying the foundation, which often results in fading interest in the topic. This tip will introduce a simple probabilistic, yet powerful classifier, the Naïve Bayes Model, and implement it in Python.

I like using the Naive Bayes variants, despite the fact that it is not Bayesian and arguably isn’t very naive. The reason I like to use this class of algorithm is that it’s fast, easy, and gives you a useful baseline for quality. If you need to meet some specific quality threshold (say, accuracy > 85% or F1-score above 0.8), you can get an answer quickly with Naive Bayes. If that answer is anywhere near your threshold, the problem is likely solvable. If your answer is way below the threshold, it’s probably not worth spending the time or compute effort trying out a variety of other algorithms.
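As a sketch of that baseline workflow, here is a Gaussian Naive Bayes model from scikit-learn, fit against one of the library's built-in datasets (standing in for your own data) and compared against a target accuracy threshold. This is an illustration of the idea, not code from the tip:

# Fit a quick Naive Bayes baseline, score it, and compare against a target.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = GaussianNB()
model.fit(X_train, y_train)
preds = model.predict(X_test)

accuracy = accuracy_score(y_test, preds)
f1 = f1_score(y_test, preds)
print(f"Accuracy: {accuracy:.3f}, F1: {f1:.3f}")

# If the baseline lands near your target, the problem is probably tractable
# and worth trying richer models on; if it is far below, think twice.
TARGET_ACCURACY = 0.85
print("Worth pursuing further" if accuracy >= TARGET_ACCURACY else "Likely a hard problem")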

Comments closed

Analyzing Semantic Model Logs via Microsoft Fabric

Sandeep Pawar parses the logs:

Workspace Monitoring was one of my favorite announcements at MS Ignite ‘24 this week. It logs events from Fabric items such as Semantic Models, Eventhouse, and GraphQL to a KQL database that’s automatically provisioned and managed in that workspace. Currently it’s limited to these three items, but hopefully others (especially Spark and pipelines) will be added soon. Read the announcement by Varun Jain (PM, Microsoft) on this for details.

Click through for some thoughts from Sandeep, as well as a variety of useful queries.

Comments closed

Generating Effect Plots in Python and R

Michael Mayer builds some effect plots:

The plots show different types of feature effects relevant in modeling:

  • Average observed: Descriptive effect (also interesting without model).
  • Average predicted: Combined effect of all features. Also called “M Plot” (Apley 2020).
  • Partial dependence: Effect of one feature, keeping other feature values constant (Friedman 2001).
  • Number of observations or sum of case weights: Feature value distribution.
  • R only: Accumulated local effects, an alternative to partial dependence (Apley 2020).

Click through to see how they both work.
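To make one of those effects concrete, here is a hand-rolled partial dependence calculation for a single feature, using scikit-learn on synthetic data. This is not Michael's package, just an illustration of the underlying idea:

# Partial dependence for one feature, computed by hand on synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

feature = 0
grid = np.quantile(X[:, feature], np.linspace(0.05, 0.95, 10))

# Fix the feature at each grid value for every row, keep the other features
# as observed, then average the predictions.
pd_values = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, feature] = value
    pd_values.append(model.predict(X_mod).mean())

for value, effect in zip(grid, pd_values):
    print(f"feature_0 = {value:7.2f} -> average prediction {effect:8.2f}")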

Comments closed

New Syntax in SQL Server 2022

Steve Jones talks syntax:

At the recent 2024 PASS Data Community Summit, I saw a lightning talk from Mikey Bronowski on the New T-SQL Functions in SQL Server 2022. Before the talk, I made a joke with him that none of these were new because it’s 2024. They’d been out for 2 years.

Mikey did a nice job, given that he only had 10 minutes, but it was enough to give attendees an idea of some new things they might investigate to use in their own code. If you missed the talk, or you don’t have access to the recordings, we have a series on SQL Server Central that covers these (Part 1, Part 2, Part 3) and helps you understand the new options. The list of language changes is also in the MS Docs under What’s New in the Language.

Steve asks about how much utilization there is of these. I think the problem is that so few organizations have adopted SQL Server 2022 that knowledge that these things even exist is quite low.

By the way, I do have a script covering many of these new bits of syntax as part of a talk on getting up to speed with core T-SQL changes since SQL Server 2012.
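As a small taste, here are a few of the 2022 additions (GREATEST, LEAST, DATETRUNC, GENERATE_SERIES) exercised from Python via pyodbc. The connection string is a placeholder, and this only works against a SQL Server 2022 or later instance:

# A few SQL Server 2022 language additions, run via pyodbc.
# The connection string is a placeholder for your own environment.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=tempdb;Trusted_Connection=yes;TrustServerCertificate=yes"
)
cursor = conn.cursor()

# GREATEST / LEAST: row-wise max and min without CASE expressions.
print(cursor.execute("SELECT GREATEST(3, 7, 5), LEAST(3, 7, 5)").fetchone())

# DATETRUNC: truncate a date/time value to a given part.
print(cursor.execute("SELECT DATETRUNC(month, SYSDATETIME())").fetchone())

# GENERATE_SERIES: a relational sequence of numbers, no tally table needed.
for row in cursor.execute("SELECT value FROM GENERATE_SERIES(1, 5)"):
    print(row.value)

conn.close()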

Comments closed

The Difficulty of Deletion in PostgreSQL

Radim Marek takes us through the process of deleting rows:

Your database is ticking along nicely – until a simple DELETE brings it to its knees. What went wrong? While we tend to focus on optimizing SELECT and INSERT operations, we often overlook the hidden complexities of DELETE. Yet, removing unnecessary data is just as critical. Outdated or irrelevant data can bloat your database, degrade performance, and make maintenance a nightmare. Worse, retaining some types of data without valid justification might even lead to compliance issues.

Read on to learn about the process, some challenges, and a common pattern for resolving these challenges. The solution is pretty similar in SQL Server as well: batching delete operations, ideally with a supporting index.
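Here is a sketch of that batching pattern against PostgreSQL using psycopg2. The table, column, retention window, and batch size are all placeholders, and a supporting index on the filter column is assumed so each batch can find its rows cheaply:

# Delete old rows in small batches rather than one giant statement.
# Table/column names, the 90-day window, and the batch size are placeholders.
import psycopg2

BATCH_SIZE = 5000

conn = psycopg2.connect("dbname=demo")
conn.autocommit = True  # commit each batch so locks and WAL stay small
cur = conn.cursor()

while True:
    cur.execute(
        """
        DELETE FROM events
        WHERE id IN (
            SELECT id FROM events
            WHERE created_at < now() - interval '90 days'
            LIMIT %s
        )
        """,
        (BATCH_SIZE,),
    )
    if cur.rowcount < BATCH_SIZE:
        break  # last, partial batch -- nothing left to delete

cur.close()
conn.close()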

Comments closed

AWS DMS and a LOB Bug

Richard O’Riordan fixes an issue:

The table over in our Postgres cluster is similar, except for the data type “text” being used instead of “varchar”. All kind of boring so far, but what we noticed was that on some very rare occasions the “largevalue” column was empty over in the PostgreSQL database even though for that row it was populated in SQL Server.

This seemed odd to me: you would expect that, if there were some error inserting the row on the PostgreSQL side, then since it is all done within a transaction, it would either all succeed or all fail. How would the row be partially inserted, i.e., missing this text value?

Read on for the story and several tests of how things work.

Comments closed