Press "Enter" to skip to content

Month: April 2024

Mirroring Azure SQL DB into Microsoft Fabric

Dennes Torres holds up a mirror:

You need to read data from production to build a single source of truth. If you create pipelines reading directly from production, you will create additional load over the production environment. The mirror allows you to do much of the production reporting from the mirror, leaving the production environment to serve other users. Keep in mind: this covers production reporting, but not analytics reporting.

Mirroring a production database to Fabric is one method to ensure the load over production is as low as possible, with the data transferred to Fabric so the transformations can be completed from that point.

Is that all? What about avoiding pipeline creation? Not really: you still need to create pipelines, as I will explain ahead.

Click through for the demo and explanation. This is an important thing for people to note: mirroring doesn’t eliminate ELT. You still have the data lake process to work through, as your transactional system does not and should not look like your reporting system.


Converting Cursors to PL/pgSQL

Deepak Mahto explains a difference in cursor behavior when converting from Oracle to PostgreSQL:

In the blog, we will cover scenarios with cursors that differ from how Oracle handles them. During conversion, our initial approach is to match all codebases as closely as possible with the target compatibility. However, in some cases, although the code appears identical, the functionality might vary. Let’s explore one such case here.

Click through for the scenario.


SHA_256 Hashes and Data Type

Reitse Eskens hit an interesting issue:

The issue is quite simple. A text needs to be converted into a SHA2_256 hash for some authentication reasons. The example shown here is simplified. The thing is, the outcome of the hash isn’t accepted by the authorising party and when the input is checked via an online MD5 hashing site, there’s a difference between that output and that from the SQL Script.

Read on to see what the problem is and how it can affect you.
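The usual culprit in situations like this is that the hash is computed over two different byte representations of the "same" text. As a general illustration (a Python sketch, not Reitse's T-SQL repro), hashing identical text encoded as UTF-8 versus UTF-16LE (roughly the varchar versus nvarchar distinction in SQL Server) produces completely different SHA-256 digests:

```python
import hashlib

text = "SomeSecretValue"  # hypothetical input; any string shows the effect

# SQL Server's varchar is a single-byte encoding for these characters,
# while nvarchar is UTF-16LE, so the bytes being hashed are not the same.
utf8_digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
utf16_digest = hashlib.sha256(text.encode("utf-16-le")).hexdigest()

print(utf8_digest)   # hash of the UTF-8 bytes
print(utf16_digest)  # hash of the UTF-16LE bytes; does not match the one above
```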


Removing Multiple Rows from a DataFrame via Base R

Steven Sanderson gets rid of rows:

As data analysts and scientists, we often find ourselves working with large datasets where data cleaning becomes a crucial step in our analysis pipeline. One common task is removing unwanted rows from our data. In this guide, we’ll explore how to efficiently remove multiple rows in R using the base R package.

Read on for a couple of ways to do this, including removing by some filter and removing by some index.
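The post covers base R, but the same two approaches carry over to most dataframe libraries. Here is a rough pandas analog in Python (my sketch, not Steven's code), removing rows first by a filter condition and then by index position:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [10, 25, 3, 42]})

# Remove by filter: keep only the rows that do NOT match the unwanted condition.
filtered = df[df["score"] >= 10]

# Remove by index: drop specific row positions (here, the first and third rows).
dropped = df.drop(df.index[[0, 2]])

print(filtered)
print(dropped)
```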


The Performance of Various Tidy Wrappers

Art Steinmetz runs a comparison:

As we start working with larger and larger datasets, the basic tools of the tidyverse start to look a little slow. In the last few years several packages more suited to large datasets have emerged. Some of these are column, rather than row, oriented. Some use parallel processing. Some are vector optimized. Speedy databases that have made their way into the R ecosystem are data.table, arrow, polars, and duckdb. All of these are available for Python as well. Each of these carries with it its own interface and learning curve. duckdb, for example, is a dialect of SQL, an entirely different language, so our dplyr code above has to look like this in SQL:

Read on for a detailed comparison. Your mileage may vary, etc., but I’m pleasantly surprised with the results, given that I like the Tidyverse for its ease of use compared to base R and other alternatives like raw data.table. H/T R-Bloggers.
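To illustrate Art's point about duckdb being an entirely different language: its Python bindings let you run SQL directly over an in-memory dataframe, which reads quite differently from chained dplyr or pandas verbs. A minimal sketch, assuming the duckdb and pandas packages are installed:

```python
import duckdb
import pandas as pd

# A toy dataframe standing in for a much larger dataset.
df = pd.DataFrame({"grp": ["a", "a", "b", "b"], "value": [1, 2, 3, 4]})

# duckdb can scan the pandas dataframe in local scope by name,
# so the aggregation is expressed as SQL rather than as dataframe verbs.
result = duckdb.sql("""
    SELECT grp, AVG(value) AS avg_value
    FROM df
    GROUP BY grp
    ORDER BY grp
""").df()

print(result)
```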


Working with the Schema Registry in Confluent

Italo Nesi shows off the schema registry:

If you are new to Schema Registry or don’t know the difference between schema, schema type, subject, compatibility type, schema ID, and subject version, I would recommend starting with this free course: Schema Registry 101 by Danica Fine.

This article will show the bits and bytes of what happens behind the scenes in Apache Kafka® producer and consumer clients when communicating with the Schema Registry and serializing/deserializing messages.

Read on to learn more about data quality rules and how the schema registry works.
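For a sense of what "behind the scenes" means here, it helps to know the Confluent wire format: a serialized message value starts with a one-byte magic byte (0) and a four-byte, big-endian schema ID, followed by the actual payload, which is how a consumer knows which registered schema to fetch. A rough Python sketch of peeling off that header (illustrative only, not code from the article):

```python
import struct

def parse_confluent_header(message_value: bytes) -> tuple[int, bytes]:
    """Split a Confluent-framed message into its schema ID and raw payload."""
    if len(message_value) < 5 or message_value[0] != 0:
        raise ValueError("Not a Confluent wire-format message (missing magic byte 0)")
    # A big-endian, unsigned 4-byte schema ID follows the magic byte.
    (schema_id,) = struct.unpack(">I", message_value[1:5])
    return schema_id, message_value[5:]

# Hypothetical example: magic byte 0, schema ID 42, then the serialized payload bytes.
value = b"\x00" + struct.pack(">I", 42) + b"\x02payload"
schema_id, payload = parse_confluent_header(value)
print(schema_id, payload)
```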


Querying a Database from Rust

Ukeje Goodness writes a query:

You’ll need to download and install Rust and your preferred SQL database management system to interact with SQL databases through Rust and Diesel. In this tutorial, we will use SQLite as the DBMS, but you can use your preferred relational database.

After you’ve installed Rust and a preferred SQL DBMS that Diesel supports, you can proceed to create a new Rust project with Cargo’s init command:

Doing a separate search, it does look like you can execute stored procedures as well, using either the sql_query() function (when there is a result set you expect back) or execute() (when there isn’t).


Speeding up Databricks Lakehouse Queries with Redis

Drew Furgiuele has the need for speed:

Since compute and storage are now separated, this means that any time you want to work with your data, you need some form of compute engine that is capable of connecting to and reading your data from your storage locations. Compute engines vary, but one of the best is Apache Spark, which gives you a great distributed compute layer suitable for all sorts of workloads, whether they be analytical and ad-hoc queries, dashboard or BI workloads, data engineering related, or even data science or AI/ML use cases. It really can do it all, and it does it very well.

But what about operational use cases? For instance, let’s say your Lakehouse is hosting some data that is critical to customer-facing systems that demand low-latency response times, such as real-time user lookups, API interfaces, or event-driven systems. Sometimes the overhead required to take a query, schedule it, and run it can be in the hundreds of milliseconds. For some workloads, that’s a lifetime.

Read on to see how you can build a caching layer on top of certain lakehouse operations when some operation needs to be as fast as possible.
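Drew gets into the Databricks-specific plumbing, but the core idea is a classic cache-aside pattern: check Redis first and only fall back to the expensive Lakehouse query on a miss. A bare-bones sketch of that pattern in Python (hypothetical key scheme and query function; assumes the redis-py package and a reachable Redis instance):

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def run_lakehouse_query(customer_id: str) -> dict:
    # Placeholder for the slow path: a Spark / SQL warehouse query against the Lakehouse.
    return {"customer_id": customer_id, "status": "active"}

def get_customer(customer_id: str, ttl_seconds: int = 300) -> dict:
    key = f"customer:{customer_id}"   # hypothetical key scheme
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)     # cache hit: fast path
    result = run_lakehouse_query(customer_id)
    r.setex(key, ttl_seconds, json.dumps(result))  # populate the cache with a TTL
    return result

print(get_customer("12345"))
```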


Returning a Row when there’s No Row to Return

Erik Darling has an existential dilemma:

Rather selfishly, I do this for my stored procedures, for all the reasons in the first sentence. Especially when debugging stored procedures, you’ll want to know where things potentially went wrong.

In this post, I’m going to walk through a couple of different ways that I use to do this: one when you’re storing intermediate results in a temporary object, and one when you’re just using a single query.

Read on for an example of how to do this.
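Erik's post has the actual T-SQL patterns; as a generic illustration of the single-query idea, one common trick (not necessarily the one he uses) is to UNION ALL a fallback row that only materializes when the real query returns nothing. A quick sketch using SQLite from Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
# Note: no rows inserted, so a plain SELECT would return an empty result set.

query = """
SELECT id, customer FROM orders WHERE customer = :customer
UNION ALL
SELECT NULL, 'no rows found'        -- fallback row
WHERE NOT EXISTS (
    SELECT 1 FROM orders WHERE customer = :customer
)
"""

rows = conn.execute(query, {"customer": "Contoso"}).fetchall()
print(rows)  # [(None, 'no rows found')] instead of an empty list
```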


The Value of Multiple Graphs

Alex Velez shares some advice:

A tip I regularly share when providing data visualization feedback is to use multiple graphs instead of packing several series into a single chart. Although it is important to be concise, people are often surprised to hear that when it comes to the number of graphs we share, fewer isn’t always better.

Read on for the advice. It makes a lot of sense for several reasons, as Alex shares. One additional reason is that it goes toward Edward Tufte’s argument about information density: multiple smaller visuals allow you to put more relevant information into the same amount of physical space. This makes your visuals more impactful as a result.
