Press "Enter" to skip to content

Month: February 2024

Using Databricks System Tables

Dustin Vannoy has a primer on system tables in Databricks:

Monitoring is important, so I’ve covered the topic a few times in the past. I’ve talked about collecting your Spark application logs and Spark metrics. These are a good way to track what is happening and what is going wrong as your code runs. In the video related to this post I focus on a different side of monitoring. The evolving capabilities offered by Databricks System Tables. I have some sample queries and links to help you get started and begin to get value from system tables. This will need to be updated (I’ll try) as new tables go into public preview status. So let’s discuss the questions I had when I first started researching this feature:
1) What do the Databricks system tables offer me for monitoring?
2) How much does this overlap with the application logs and metrics?

Click through for a video and a walkthrough.

Leave a Comment

Looping in Python

Jack Wallen has us spinning in circles:

What is a loop? Other than a fruity breakfast cereal, a loop is a portion of code that repeats a set number of times until a desired process is complete.

Here’s an example of a loop that many can relate to. You go to the ATM, insert your card, and start with your transaction. First, you check your balance, then you remove money from your account. The loop keeps running with each transaction until you tell the ATM you are finished.

Read on for loop examples, though for is definitely much more common and popular than while. I’d also recommend learning a bit about list comprehensions, which are a way of turning a for loop into a one-line operation: [f(x) for x in my_list] is a simple example: for each row in my_list, perform function f() with that row’s value as the input. Then, take all of the results of f(x) and stick them together in a new list.

Leave a Comment

Supporting Replication in dbatools

Jess Pomfret makes an announcement:

Replication, everyone loves to hate it, but it’s been our most requested feature as far as adding commands to dbatools. For ages we’ve said ‘sounds great’ and ‘we would love that’, but when we started looking into it the energy soon fizzled away, due to it’s dependency on RMO – Replication Management Objects, as opposed to SMO – SQL Server Management Objects, things weren’t as easy as we had hoped.

That said, it’s now there as of version 2.1.1. It supports snapshot, transactional, and merge replication. No peer-to-peer, but is anyone actually surprised at that?

Leave a Comment

Search Patterns in T-SQL

Erik Darling puts on the fedora and grabs the bullwhip:

First, what you should not do: A universal search string:

The problem here is somewhat obvious if you’ve been hanging around SQL Server long enough. Double wildcard searches, searching with a string type against numbers and dates, strung-together OR predicates that the optimizer will hate you for.

These aren’t problems that other things will solve either. For example, using CHARINDEX or PATINDEX isn’t a better pattern for double wildcard LIKE searching, and different takes on how you handle parameters being NULL don’t buy you much.

Read on for an example of a terrible search query, a mediocre search query, a good search query, and a possible unicorn: an actually valid reason to use a non-clustered columnstore index.

Leave a Comment

Clustered Index Rebuilds and Transaction Log Space

Chad Callihan talks about a potential problem:

Of all the problems that can cause a database transaction log to fill up, perhaps one lesser-known cause is a large clustered index rebuild. Let’s look at a scenario where this issue could cause a headache.

Click through for the scenario. I will say that I disagree with reorganizing the index under pretty much any circumstances. Jeff Moden, in a long and deep but extremely interesting presentation he gave to us at TriPASS three years ago, laid out why that is. I do agree that rebuilding a clustered index should be uncommon and may be a sign of a data modeling problem.

The other thing to note is that this problem is bigger when you rebuild online and your data drive is the same as your log drive. When you rebuild an index online, SQL Server needs enough space to make a copy of the data, so the rule of thumb is that you’ll have 2x the normal disk space requirements during online rebuild. It’s not necessarily exactly 2x because of fragmentation, compression, and other factors, but that’s why it’s a rule of thumb and not a “guarantee or your money back” type of thing.

Leave a Comment

Postgres Backup and Recovery with Barman

Muhammad Ali shows off another way to back up and restore data in Postgres databases:

Barman is a production grade tool for managing the backup and recovery process of PostgreSQL databases. It not only handles the physical backups but also provides automatic management of retention policies, compression, near zero Recovery point objective(RPO) and enables recovery at any desired point(PITR) in time within the recovery window.

Barman’s Stream Archiving feature stands out as a key component in achieving zero Recovery Point Objective (RPO). This is achieved by using pg_recievewal utility which continuously backs up Write-Ahead Logging (WAL) files in real-time to a designated Barman server. This capability is particularly important for applications where even minimal data loss is unacceptable.

Read on for a couple questions about it and a demonstration of how Barman works.

Leave a Comment

Common Mistakes in T-SQL Code

Rich Benner builds a list:

One potential issue here is the lack of time to delve really deeply into each programming language at one’s disposal. The thing about SQL Server is that it is a language that performs very differently than, say, C#. Those languages are row based, which means it’s common to iterate through a data set and process each row individually. This works quite well in those languages, but quite terribly in SQL Server. Being a set-based language, SQL Server prefers to deal with the whole data set at once.

Because of these nuances, it’s easy for developers to make mistakes when creating SQL server code. Let’s go through some common errors that we see.

Read on for several common mistakes that we often find in our code.

Leave a Comment

Adding Superscripts and Subscripts to Axis Labels in R

Steven Sanderson changes the script:

Before we dive into the code, let’s quickly review what superscripts and subscripts are.

  • Superscripts: These are smaller-sized characters or numbers that appear above the baseline of the text. They are often used to denote exponents or indices.
  • Subscripts: On the other hand, subscripts are smaller-sized characters or numbers that appear below the baseline of the text. They are commonly used in mathematical expressions or chemical formulas.

Read on to see how you can generate these in R visuals.

Leave a Comment

GUID Conversion and the Serverless SQL Pool

Reitse Eskens hits a weird error:

One of the transformations is to change one primary key column from integer to GUID. This is something you can do with some trickery you’ll see in the code. But what I found was that, even though the primary key is unique, the GUID’s weren’t. And then the fun starts digging into the why…

Read on for the research Reitse performed. I don’t even have a good guess for this, it’s so weird. It feels like a bug but it’s weird regardless.

Leave a Comment

Full and Incremental Loads in Microsoft Fabric

John Miner continues a series on data engineering in Microsoft Fabric:

In a data lake, we have a bronze quality zone that supposed to represent the raw data in a delta file format. This might include versions of the files for auditing. In the silver quality zone, we have a single version of truth. The data is de-duplicated and cleaned up. How can we achieve these goals using the Apache Spark engine in Microsoft Fabric?

Read on for John’s take on the answer. I’ve found that I have a fairly good answer for smaller datasets, though as the size of the data gets larger, the less I like answers for the raw layer.

Leave a Comment