Curated SQL – Page 300 – A Fine Slice Of SQL Server

SHA_256 Hashes and Data Type

Published 2024-04-12 by Kevin Feasel

The issue is quite simple. A text needs to be converted into a SHA2_256 hash for some authentication reasons. The example shown here is simplified. The thing is, the outcome of the hash isn’t accepted by the authorising party and when the input is checked via an online MD5 hashing site, there’s a difference between that output and that from the SQL Script.

Read on to see what the problem is and how it can affect you.

Comments closed

Removing Multiple Rows from a DataFrame via Base R

Published 2024-04-11 by Kevin Feasel

Steven Sanderson gets rid of rows:

As data analysts and scientists, we often find ourselves working with large datasets where data cleaning becomes a crucial step in our analysis pipeline. One common task is removing unwanted rows from our data. In this guide, we’ll explore how to efficiently remove multiple rows in R using the base R package.

Read on for a couple of ways to do this, including removing by some filter and removing by some index.

Comments closed

The Performance of Various Tidy Wrappers

Published 2024-04-11 by Kevin Feasel

Art Steinmetz runs a comparison:

As we start working with larger and larger datasets, the basic tools of the tidyverse start to look a little slow. In the last few years several packages more suited to large datasets have emerged. Some of these are column, rather than row, oriented. Some use parallel processing. Some are vector optimized. Speedy databases that have made their way into the R ecosystem are data.table, arrow, polars and duckdb. All of these are available for Python as well. Each of these carries with it its own interface and learning curve. duckdb, for example is a dialect of SQL, an entirely different language so our dplyr code above has to look like this in SQL:

Read on for a detailed comparison. Your mileage may vary, etc., but I’m pleasantly surprised with the results, given that I like the Tidyverse for its ease of use compared to base R and other alternatives like raw data.table. H/T R-Bloggers.

Comments closed

Working with the Schema Registry in Confluent

Published 2024-04-11 by Kevin Feasel

Italo Nesi shows off the schema registry:

If you are new to Schema Registry or don’t know the difference between schema, schema type, subject, compatibility type, schema ID, and subject version, I would recommend starting with this free course: Schema Registry 101 by Danica Fine.

This article will show the bits and bytes of what happens behind the scenes in Apache Kafka® producer and consumer clients when communicating with the Schema Registry and serializing/deserializing messages.

Read on to learn more about data quality rules and how the schema registry works.

Comments closed

Querying a Database from Rust

Published 2024-04-11 by Kevin Feasel

Ukeje Goodness writes a query:

You’ll need to download and install Rust and your preferred SQL database management system to interact with SQL databases through Rust. Diesel. In this tutorial, we will use SQLite as the DBMS, but you can use your preferred relational database

After you’ve installed Rust and a preferred SQL DBMS that Diesel supports, you can proceed to create a new Rust project with Cargo’s init command:

Doing a separate search, it does look like you can execute stored procedures as well, using either the sql_query() function (when there is a result set you expect back) or execute() (when there isn’t).

Comments closed

Speeding up Databricks Lakehouse Queries with Redis

Published 2024-04-11 by Kevin Feasel

Drew Furgiuele has the need for speed:

Since compute and storage are now separated, this means that any time you want to work with your data, you need some form of compute engine that is capable of connecting to and reading your data from your storage locations. Compute engines vary, but one of the best is Apache Spark, which gives you a great distributed compute layer suitable for all sorts of workloads, whether they be analytical and ad-hoc queries, dashboard or BI workloads, data engineering related, or even data science or AI/ML use cases. It really can do it all, and it does it very well.

But what about use operational use cases? For instance: let’s say your Lakehouse is hosting some data that is critical to customer-facing systems that demand low-latency response times, such as real-time users lookups, API interfaces, or event-driven systems, sometimes the overhead required to take a query, schedule it, and run it can be in the hundreds of milliseconds. For some workloads, that’s a lifetime.

Read on to see how you can build a caching layer on top of certain lakehouse operations when some operation needs to be as fast as possible.

Comments closed

Returning a Row when there’s No Row to Return

Published 2024-04-11 by Kevin Feasel

Erik Darling has an existential dillema:

Rather selfishly, I do this for my stored procedures, for all the reasons in the first sentence. Especially when debugging stored procedures, you’ll want to know where things potentially went wrong.

In this post, I’m going to walk through a couple different ways that I use to do this. One when you’re storing intermediate results in a temporary object, and one when you’re just using a single query.

Read on for an example of how to do this.

Comments closed

The Value of Multiple Graphs

Published 2024-04-10 by Kevin Feasel

Alex Velez shares some advice:

A tip I regularly share when providing data visualization feedback is to use multiple graphs instead of packing several series into a single chart. Although it is important to be concise, people are often surprised to hear that when it comes to the number of graphs we share, fewer isn’t always better.

Read on for the advice. It makes a lot of sense for several reasons, as Alex shares. One additional reason is that it goes toward Edward Tufte’s argument about information density: multiple smaller visuals allow you to put more relevant information into the same amount of physical space. This makes your visuals more impactful as a result.

Comments closed

Removing Rows with Missing Data in R

Published 2024-04-10 by Kevin Feasel

Steven Sanderson shows us three ways:

Handling missing values is a crucial aspect of data preprocessing in R. Often, datasets contain missing values, which can adversely affect the analysis or modeling process. One common task is to remove rows containing missing values entirely. In this tutorial, we’ll explore different methods to accomplish this task in R, catering to scenarios where we want to remove rows with either some or all missing values.

Click through for three ways to do this.

Comments closed

CI/CD in GitHub

Published 2024-04-10 by Kevin Feasel

I have a new video:

In this video, I explain what continuous integration (CI) is, disambiguate continuous delivery from continuous deployment (CD), and see how you can perform CI/CD operations using GitHub Actions.

Read on to see what these terms mean and an example of how it all works with .NET projects.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30

Curated SQL Posts