March 2024 – Page 7 – Curated SQL

Using IN and NOT IN in SQL Server

Published 2024-03-18 by Kevin Feasel

I’ll be brief here, and let you know exactly when I’ll use IN and NOT IN rather than anything else:

When I have a list of literal values

That’s it. That’s all. If I have to go looking in another table for anything, I use either EXISTS or NOT EXISTS. The syntax just feels better to me, and I don’t have to worry about getting stupid errors about subqueries returning more than one value.

I’m typically a lot more flexible about using IN, though I do agree about NOT IN: that clause is usually more trouble than it’s worth.
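One reason NOT IN tends to be more trouble than it’s worth is how it interacts with NULL. The sketch below is only an illustration: it uses SQLite through Python’s built-in sqlite3 module as a stand-in for SQL Server, and the customers and orders tables are hypothetical. The three-valued logic it shows is standard SQL behavior, so the same result would be expected on SQL Server.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    -- One order has no customer recorded (NULL).
    INSERT INTO orders VALUES (10, 1), (11, NULL);
""")

# IN with a list of literal values: the simple, safe case.
literal_in = conn.execute(
    "SELECT name FROM customers WHERE id IN (1, 3) ORDER BY id"
).fetchall()

# NOT IN against a subquery that returns a NULL: comparing any id to NULL
# yields UNKNOWN, so no customer passes the filter.
not_in = conn.execute(
    "SELECT name FROM customers "
    "WHERE id NOT IN (SELECT customer_id FROM orders) ORDER BY id"
).fetchall()

# NOT EXISTS with a correlated subquery: the NULL row simply never matches,
# so customers without orders come back as intended.
not_exists = conn.execute(
    "SELECT name FROM customers AS c "
    "WHERE NOT EXISTS (SELECT 1 FROM orders AS o WHERE o.customer_id = c.id) "
    "ORDER BY c.id"
).fetchall()

print(literal_in)   # [('Ann',), ('Cy',)]
print(not_in)       # []
print(not_exists)   # [('Bob',), ('Cy',)]
```

The NOT IN query silently returns nothing because of the single NULL customer_id, while NOT EXISTS returns the two customers with no orders.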

Comments closed

Postgres Internals: Database Clusters, Databases, and Tables

Published 2024-03-18 by Kevin Feasel

Semab Tariq begins a new series:

A database cluster is a collection of multiple databases managed by a single PostgreSQL server. It can be referred to as a data/base directory.

A database is a collection of database objects. A database object is a data structure such as a table, view, index, extension, sequence, or function. In simple words, anything that we can create or store within a database is a database object.

Read on to learn more about how Postgres lays out database files and tablespaces.


tidyAML 0.0.5 Now Available

Published 2024-03-14 by Kevin Feasel

Steven Sanderson has an announcement:

I’m thrilled to announce the latest release of tidyAML, version 0.0.5, now available for download on CRAN or GitHub!

In this release, we’ve introduced some fantastic new features and made minor fixes and improvements to enhance your experience with tidyAML.

Click through to see what’s new in this version.


Retrieving Spark Session Config Variables from Microsoft Fabric

Published 2024-03-14 by Kevin Feasel

Koen Verbeeck gets some settings:

I was trying some stuff out in a notebook on top of a Microsoft Fabric Lakehouse. I was wondering what some of the default values are of the configuration variables, and if there’s an easy way to retrieve them all. Luckily there is. In the code, I’m using Scala because it has a nice GetAll() function.

Click through for an example of how to use this. And bonus points for using Scala instead of Python here.


Postgres Data Extraction with LATERAL joins and More

Published 2024-03-14 by Kevin Feasel

Ryan Booz extracts some data:

In our data-hungry world, knowing how to effectively load and transform data from various sources is a highly valued skill. Over the last couple of years, I’ve learned how many of the data manipulation functions in PostgreSQL can supercharge your data transformation and analysis process, using just PostgreSQL and SQL.

For the last couple of decades, “Extract Transform Load” (ETL) has been the primary method for manipulating and analyzing the results. In most cases, ETL relies on an external toolset to help acquire different forms of data, slicing and dicing it into a form suitable for relational databases, and then inserting the results into your database of choice. Once it’s in the destination table with a relational schema, querying and analyzing it is much easier.

I call out CROSS JOIN LATERAL (or any kind of lateral join) here because it’s the ANSI equivalent of T-SQL’s APPLY operator, and I’ve already pointed out once today that I’m a huge fan of APPLY.


Overloading Power BI in Microsoft Fabric

Published 2024-03-14 by Kevin Feasel

Reitse Eskens pushes the envelope:

In my previous blog on Fabric and load testing, I ended with not really knowing how Power BI would respond to all these rows. After creating and presenting a session on this subject, it’s time to dig into this part of Fabric as well. There were questions and I made promises. So here goes! This blog will only show the F2 experience, as that’s where things went off the road. And, as I’ve shown in the previous blog, the CU count doesn’t change between SKUs; only the amount of SKUs available changes.

This blog isn’t meant to scold Fabric or make it look silly; I’m the one who’s silly. The goal is to show some limitations and a way you can do some load testing, and to help you find your way in the available metrics.

Read on to see what Reitse has gotten into.


Using the APPLY Operator

Published 2024-03-14 by Kevin Feasel

Erik Darling gets an auto-link for talking about my favorite operator:

I end up converting a lot of derived joins, particularly those that use windowing functions, to use the apply syntax. Sometimes good indexes are in place to support that, other times they need to be created to avoid an Eager Index Spool.

One of the most common questions I get is when developers should consider using apply over other join syntax.

The short answer is that I start mentally picturing the apply syntax being useful when:

To learn when, you’re going to have to read the whole thing. And, if you want to learn even more about it, I have a talk on the topic that might be of interest.


Postgres and NUMA

Published 2024-03-14 by Kevin Feasel

Annie Ghazali follows up on a Chris Travers webinar:

Q1. At what point do we need to focus on ensuring huge_pages in PostgreSQL?

There are a couple of factors here. The first is that if you’re able to show that you have multiple NUMA domains, it will almost always be a win performance-wise. But it becomes critical at the point where you start seeing that the checkpointer is running at 100 percent CPU load, and none of your queries are running at 100 percent CPU load, especially if you don’t have a lot of IO wait. That’s a really good indication that you’ve hit a point where it’s now a heavy bottleneck, and that’s the point where you’re going to see a very large win out of it.

Read on to see this full answer, as well as answers to questions around why you might not want to disable NUMA support and what NUMA does to swap space recommendations.


Pulling Samples in R with sample()

Published 2024-03-13 by Kevin Feasel

Steven Sanderson takes a sample:

The sample() function in R is a powerful tool that allows you to generate random samples from a given dataset or vector. It’s an essential function for tasks such as data analysis, Monte Carlo simulations, and randomized experiments. In this blog post, we’ll explore the sample() function in detail and provide examples to help you understand how to use it effectively.

Read on to see what options are available with sample() and the different ways in which you can use the function.
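For readers who think in Python rather than R, the kinds of sampling described above can be roughly sketched as follows. This is an analogue built on Python’s standard random module, not R’s sample() itself, and the vector, seed, and weights are made up for illustration.

```python
import random

# A seeded generator so repeated runs draw the same samples.
rng = random.Random(42)

# Hypothetical data: the integers 1 through 10.
values = list(range(1, 11))

# Sampling without replacement: each element appears at most once.
without_replacement = rng.sample(values, k=5)

# Sampling with replacement: the same element may be drawn repeatedly.
with_replacement = rng.choices(values, k=5)

# Weighted sampling with replacement, e.g. a biased coin.
weighted = rng.choices(["heads", "tails"], weights=[0.7, 0.3], k=10)

# Drawing every element without replacement yields a random permutation.
shuffled = rng.sample(values, k=len(values))

print(without_replacement)
print(with_replacement)
print(weighted)
print(shuffled)
```

Seeding the generator plays the same role as calling set.seed() before sample() in R: it makes the draws reproducible.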


Issues and Projects in GitHub

Published 2024-03-13 by Kevin Feasel

I have a new video:

In this video, we take a look at what GitHub has for project management, reviewing GitHub Projects and Issues.

The upshot is that GitHub has a fair amount of capability for project management. Its notion of Issues definitely feels fairly well fleshed out, which makes sense considering GitHub’s original purpose as a storehouse for open-source code repositories. By contrast, Projects are a relatively new feature and there’s still some room to grow there, especially if you’re used to project management tools like Jira or Trello.

