Press "Enter" to skip to content

Month: April 2024

Transactional Replication Slow: Check Missing Indexes

Lori Brown provides a good tip:

Have you ever had transactional replication mysteriously start showing significant latency at a subscriber server?  If so, check to see if the primary keys from the publisher database are missing on the subscriber database!

Replication was showing long latency while the publisher and subscriber servers were not heavily utilized.  Microsoft generated stored procedures that are used to send INSERT, UPDATE and DELETE operations from the publisher database to the subscriber database had no indexes to help query performance to the subscriber tables.  Since there were no indexes that matched the keys of the statements pushing through data changes, tables were being fully scanned and replication statements were piling up.  In other words, the subscriber was missing indexes. 

Read on to see how you can check this and also how you can quickly script those missing indexes.

Leave a Comment

Mirroring Azure SQL DB into Microsoft Fabric

Dennes Torres holds up a mirror:

You need to read data from production to build a single source of truth. If you create pipelines reading directly from production, you will create additional load over the production environment. The mirror allows you to do much of the production reporting from the mirror, leaving the production environment to serve other users. Keep in mind, production report, but not analytics report.

Mirroring a production database to Fabric is one method to ensure the load over production will be as low as possible and the data will be transferred fabric to complete the transformations from this point.

Only this? What about avoiding pipeline creation? Not really, you still need to create pipelines, as I will explain ahead.

Click through for the demo and explanation. This is an important thing for people to note: mirroring doesn’t eliminate ELT. You still have the data lake process to work through, as your transactional system does not and should not look like your reporting system.

Leave a Comment

Converting Cursors to PL/pgSQL

Deepak Mahto explains a difference in cursors from Oracle:

In the blog, we will cover scenarios with cursors that differ from how Oracle handles them. During conversion, our initial approach is to match all codebases as closely as possible with the target compatibility. However, in some cases, although the code appears identical, the functionality might vary. Let’s explore one such case here.

Click through for the scenario.

Leave a Comment

SHA_256 Hashes and Data Type

Reitse Eskens hit an interesting issue:

The issue is quite simple. A text needs to be converted into a SHA2_256 hash for some authentication reasons. The example shown here is simplified. The thing is, the outcome of the hash isn’t accepted by the authorising party and when the input is checked via an online MD5 hashing site, there’s a difference between that output and that from the SQL Script.

Read on to see what the problem is and how it can affect you.

Leave a Comment

Removing Multiple Rows from a DataFrame via Base R

Steven Sanderson gets rid of rows:

As data analysts and scientists, we often find ourselves working with large datasets where data cleaning becomes a crucial step in our analysis pipeline. One common task is removing unwanted rows from our data. In this guide, we’ll explore how to efficiently remove multiple rows in R using the base R package.

Read on for a couple of ways to do this, including removing by some filter and removing by some index.

Leave a Comment

The Performance of Various Tidy Wrappers

Art Steinmetz runs a comparison:

As we start working with larger and larger datasets, the basic tools of the tidyverse start to look a little slow. In the last few years several packages more suited to large datasets have emerged. Some of these are column, rather than row, oriented. Some use parallel processing. Some are vector optimized. Speedy databases that have made their way into the R ecosystem are data.tablearrowpolars and duckdb. All of these are available for Python as well. Each of these carries with it its own interface and learning curve. duckdb, for example is a dialect of SQL, an entirely different language so our dplyr code above has to look like this in SQL:

Read on for a detailed comparison. Your mileage may vary, etc., but I’m pleasantly surprised with the results, given that I like the Tidyverse for its ease of use compared to base R and other alternatives like raw data.table. H/T R-Bloggers.

Leave a Comment

Working with the Schema Registry in Confluent

Italo Nesi shows off the schema registry:

If you are new to Schema Registry or don’t know the difference between schema, schema type, subject, compatibility type, schema ID, and subject version, I would recommend starting with this free course: Schema Registry 101 by Danica Fine.

This article will show the bits and bytes of what happens behind the scenes in Apache Kafka® producer and consumer clients when communicating with the Schema Registry and serializing/deserializing messages.

Read on to learn more about data quality rules and how the schema registry works.

Leave a Comment

Querying a Database from Rust

Ukeje Goodness writes a query:

You’ll need to download and install Rust and your preferred SQL database management system to interact with SQL databases through Rust. Diesel. In this tutorial, we will use SQLite as the DBMS, but you can use your preferred relational database

After you’ve installed Rust and a preferred SQL DBMS that Diesel supports, you can proceed to create a new Rust project with Cargo’s init command:

Doing a separate search, it does look like you can execute stored procedures as well, using either the sql_query() function (when there is a result set you expect back) or execute() (when there isn’t).

Leave a Comment

Speeding up Databricks Lakehouse Queries with Redis

Drew Furgiuele has the need for speed:

Since compute and storage are now separated, this means that any time you want to work with your data, you need some form of compute engine that is capable of connecting to and reading your data from your storage locations. Compute engines vary, but one of the best is Apache Spark, which gives you a great distributed compute layer suitable for all sorts of workloads, whether they be analytical and ad-hoc queries, dashboard or BI workloads, data engineering related, or even data science or AI/ML use cases. It really can do it all, and it does it very well.

But what about use operational use cases? For instance: let’s say your Lakehouse is hosting some data that is critical to customer-facing systems that demand low-latency response times, such as real-time users lookups, API interfaces, or event-driven systems, sometimes the overhead required to take a query, schedule it, and run it can be in the hundreds of milliseconds. For some workloads, that’s a lifetime.

Read on to see how you can build a caching layer on top of certain lakehouse operations when some operation needs to be as fast as possible.

Leave a Comment