Press "Enter" to skip to content

Curated SQL Posts

Renaming Multiple Columns in a PySpark Notebook

Gilbert Quevauvilliers wants one rename to rule them all:

Following on from my previous blog post this blog post I’m going to demonstrate how to bulk rename column names in a single step instead of having to rename them individually.

The reason this came about is because I had a set of data where the column names had the square brackets which I wanted to remove.

As shown below I have highlighted 2 column names with the square brackets.

Read on to see how you can perform somewhat-generic rename operations in Spark notebooks.

Comments closed

Feature Request: Right-Click to Open Plan in New Tab

Erik Darling wants you to vote:

One problem I run into regularly is when I’m tuning a query, and I want to keep one of the resulting execution plans available, so I can:

  1. Compare it after I make some other changes and run the query again
  2. Do some additional analysis without worrying about accidentally losing the plan

The only way to do that is to save the plan, stick the XML in another tool, or keep opening new query tabs to run things in, where I won’t lose the plan.

Vote for Erik’s idea.

Comments closed

Azure Data Studio 1.48 Now Available

David Levy gives us the latest:

The February release of Azure Data Studio (version 1.48) includes several new features and enhancements, such as:

  • Support for restores from URLs, including S3-compatible storage, via the restore dialog
  • Query Editor now includes SPIDs in tab titles
  • You can now specify a connection when launching ADS from the command line
  • You can now enable Ledger when creating a database in the Create Database dialog

Read on for more about each of these topics.

Comments closed

Melting Datasets in R

Steven Sanderson performs a melt():

The melt() function in the data.table package is an extremely useful tool for reshaping datasets in R. However, for beginners, understanding how to use melt() can be tricky. In this post, I’ll walk through several examples to demonstrate how to use melt() to move from wide to long data formats.

“Melting,” by the way, is the R term for unpivoting data.

Comments closed

Generators in Python

Jack Wallen generates some values:

Within the realm of programming, a generator is a routine that is used to control interaction within a loop. Generators are useful when you want to produce a large sequence of values without storing them in memory at once. The ability to create a large sequence without using up memory is important, especially when dealing with Python programs that generate a large amount of information, such as a long sequence of numbers.

Read on to see how generators work and how to build a custom generator.

Comments closed

Apache Kafka 3.7 Released

Stanislav Kozlovski makes an announcement:

We are proud to announce the release of Apache Kafka® 3.7.0. This release contains many new features and improvements. This blog post will highlight some of the more prominent features. For a full list of changes, be sure to check the release notes.

See the Upgrading to 3.7.0 from any version 0.8.x through 3.6.x section in the documentation for the list of notable changes and detailed upgrade steps.

Read on to see what’s new. Looks like they’ve taken care of a couple dozen items in this release, so plenty to read there.

Comments closed

Classical Methods for Outlier Detection

Vinod Chugani is speaking my language:

Outliers are unique in that they often don’t play by the rules. These data points, which significantly differ from the rest, can skew your analyses and make your predictive models less accurate. Although detecting outliers is critical, there is no universally agreed-upon method for doing so. While some advanced techniques like machine learning offer solutions, in this post, we will focus on the foundational Data Science methods that have been in use for decades.

Vinod looks at a few techniques, including inter-quartile range and comparing results to an expected distribution. If you’re really excited about this topic, I know a guy who’s written a bit about it.

Comments closed

Row Constructors in Postgres

Deepak Mahto builds some rows:

A ROW expression allows you to construct ROW values, which can represent anonymous records, specific table row types, or custom composite types. Its uses include processing records within multiple expressions (using operators like =, <>, <, <=, >, or >=), evaluation with subqueries, and combining ROW values to build composite types. You can even export these to JSON format (using functions like rowtojson). Some key operations you can perform with a ROW constructor in Conversion or migration are :

Click through for that list and several examples.

Comments closed

A Cheat Sheet for Joins

Erik Darling does the academically correct thing and makes all of his joins CROSS JOINS and then performs all filters in the WHERE clause:

Think of your standard CRUD operations. Most don’t even require a join; they’re very straightforward. And hey, if you completely denormalize all your data to one huge table, you’ll never have to think about a lot of this stuff anyway.

It’s only when developers are forced to think about things that things start to go wrong. I don’t mean to pick on developers specifically. It’s the human condition. Thinking often leads to poor choices.

In this post, I’m going to give you some basic guidance on when to use various T-SQL facilities, based on years of finding, fixing, and writing queries.

You get a free second quotation because I love this line:

I’m not going to talk about right outer joins, because that’s the foolish domain of characterless buffoons who use Venn diagrams to explain join results.

If I were emotionally capable of crying, that would bring a tear to my eye because it’s savagery against two things I hate.

Comments closed

“License-Free” Managed Instance Requirements

Arun Sirpal reads the fine print:

This is the managed instance link feature; I really like this, if you know about Distributed AGs then you may know they are tricky to setup (well I found this) but Microsoft takes care of this out of the box.

The point of this quick blog is not how to set this up but the benefit of enabling the Managed Instance as “ license free “ via the hybrid failover rights option – do not forget about this.

Read on for the list of requirements.

Comments closed