Press "Enter" to skip to content

Author: Kevin Feasel

Removing Multiple Rows from a DataFrame via Base R

Steven Sanderson gets rid of rows:

As data analysts and scientists, we often find ourselves working with large datasets where data cleaning becomes a crucial step in our analysis pipeline. One common task is removing unwanted rows from our data. In this guide, we’ll explore how to efficiently remove multiple rows in R using the base R package.

Read on for a couple of ways to do this, including removing rows that match a filter and removing rows by index.
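
Steven's post is all base R, but the same two patterns translate to other dataframe libraries. As a rough analogue only (Python with pandas and made-up data, not code from the article):

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c", "d"],
                   "score": [10, 55, 72, 31]})

# Remove rows via a filter: keep only the rows that satisfy a condition
filtered = df[df["score"] >= 50]

# Remove rows via an index: drop the first and fourth rows by position
by_index = df.drop(df.index[[0, 3]])
```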


Working with the Schema Registry in Confluent

Italo Nesi shows off the schema registry:

If you are new to Schema Registry or don’t know the difference between schema, schema type, subject, compatibility type, schema ID, and subject version, I would recommend starting with this free course: Schema Registry 101 by Danica Fine.

This article will show the bits and bytes of what happens behind the scenes in Apache Kafka® producer and consumer clients when communicating with the Schema Registry and serializing/deserializing messages.

Read on to learn more about data quality rules and how the schema registry works.
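
To make the "bits and bytes" a little more concrete, here is a minimal producer-side sketch using the Python confluent-kafka client. The broker address, Schema Registry URL, topic, and schema are all placeholders; the article itself digs into the wire format in far more depth.

```python
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

schema_str = """
{"type": "record", "name": "Order",
 "fields": [{"name": "id", "type": "long"},
            {"name": "amount", "type": "double"}]}
"""

# Placeholder endpoints -- point these at your own cluster and registry
registry = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = AvroSerializer(registry, schema_str)
producer = Producer({"bootstrap.servers": "localhost:9092"})

order = {"id": 42, "amount": 19.99}

# The serializer registers (or looks up) the schema in Schema Registry and
# embeds the returned schema ID in the message payload ahead of the data.
payload = serializer(order, SerializationContext("orders", MessageField.VALUE))
producer.produce("orders", value=payload)
producer.flush()
```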


Querying a Database from Rust

Ukeje Goodness writes a query:

You’ll need to download and install Rust and your preferred SQL database management system to interact with SQL databases through Rust with Diesel. In this tutorial, we will use SQLite as the DBMS, but you can use your preferred relational database.

After you’ve installed Rust and a preferred SQL DBMS that Diesel supports, you can proceed to create a new Rust project with Cargo’s init command.

Doing a separate search, it does look like you can execute stored procedures as well, using either the sql_query() function (when there is a result set you expect back) or execute() (when there isn’t).


Returning a Row when there’s No Row to Return

Erik Darling has an existential dilemma:

Rather selfishly, I do this for my stored procedures, for all the reasons in the first sentence. Especially when debugging stored procedures, you’ll want to know where things potentially went wrong.

In this post, I’m going to walk through a couple different ways that I use to do this. One when you’re storing intermediate results in a temporary object, and one when you’re just using a single query.

Read on for an example of how to do this.


Speeding up Databricks Lakehouse Queries with Redis

Drew Furgiuele has the need for speed:

Since compute and storage are now separated, this means that any time you want to work with your data, you need some form of compute engine that is capable of connecting to and reading your data from your storage locations. Compute engines vary, but one of the best is Apache Spark, which gives you a great distributed compute layer suitable for all sorts of workloads, whether they be analytical and ad-hoc queries, dashboard or BI workloads, data engineering related, or even data science or AI/ML use cases. It really can do it all, and it does it very well.

But what about operational use cases? For instance: let’s say your Lakehouse is hosting some data that is critical to customer-facing systems that demand low-latency response times, such as real-time user lookups, API interfaces, or event-driven systems. Sometimes the overhead required to take a query, schedule it, and run it can be in the hundreds of milliseconds. For some workloads, that’s a lifetime.

Read on to see how you can build a caching layer on top of certain lakehouse operations when some operation needs to be as fast as possible.
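
Drew's post covers the Databricks-specific wiring; as a generic sketch of the cache-aside idea (the table name, Redis host, key scheme, and TTL below are assumptions for illustration, not details from the post):

```python
import json
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
cache = redis.Redis(host="my-redis-host", port=6379, decode_responses=True)

def lookup_customer(customer_id: int, ttl_seconds: int = 300) -> dict:
    """Serve from Redis when possible; fall back to the lakehouse table."""
    key = f"customer:{customer_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit: no Spark involved

    # Cache miss: pay the Spark scheduling and query cost once...
    rows = (spark.table("lakehouse.customers")
                 .where(f"customer_id = {customer_id}")
                 .limit(1)
                 .collect())
    result = rows[0].asDict() if rows else {}

    # ...then store the result so subsequent lookups skip Spark entirely.
    cache.setex(key, ttl_seconds, json.dumps(result, default=str))
    return result
```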


The Value of Multiple Graphs

Alex Velez shares some advice:

A tip I regularly share when providing data visualization feedback is to use multiple graphs instead of packing several series into a single chart. Although it is important to be concise, people are often surprised to hear that when it comes to the number of graphs we share, fewer isn’t always better.

Read on for the advice. It makes a lot of sense for several reasons, as Alex shares. One additional reason is that it goes toward Edward Tufte’s argument about information density: multiple smaller visuals allow you to put more relevant information into the same amount of physical space. This makes your visuals more impactful as a result.
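
For a quick sense of what this looks like in practice (a generic matplotlib sketch with fabricated data, not one of Alex's examples), small multiples keep each series readable while shared axes keep the panels comparable:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=1)
months = np.arange(1, 13)
regions = ["North", "South", "East", "West"]

# One small panel per series, sharing axes so the panels stay comparable
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 5))
for ax, region in zip(axes.flat, regions):
    ax.plot(months, rng.integers(50, 150, size=12))
    ax.set_title(region)

fig.suptitle("Small multiples instead of one crowded chart")
plt.tight_layout()
plt.show()
```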


Removing Rows with Missing Data in R

Steven Sanderson shows us three ways:

Handling missing values is a crucial aspect of data preprocessing in R. Often, datasets contain missing values, which can adversely affect the analysis or modeling process. One common task is to remove rows containing missing values entirely. In this tutorial, we’ll explore different methods to accomplish this task in R, catering to scenarios where we want to remove rows with either some or all missing values.

Click through for three ways to do this.
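
Steven works through this in R; as a rough analogue only, the same some-versus-all-versus-specific-columns distinctions look like this in Python with pandas (illustrative, not the article's code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, np.nan, 4],
                   "b": [5, np.nan, 7, 8]})

drop_any = df.dropna(how="any")      # drop rows with at least one missing value
drop_all = df.dropna(how="all")      # drop only rows where every value is missing
drop_col = df.dropna(subset=["b"])   # drop rows missing a value in specific columns
```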


CI/CD in GitHub

I have a new video:

In this video, I explain what continuous integration (CI) is, disambiguate continuous delivery from continuous deployment (CD), and see how you can perform CI/CD operations using GitHub Actions.

Read on to see what these terms mean and an example of how it all works with .NET projects.


Searching for Files in a Blob Storage Container

Andy Brownsword hits one of my bugbears:

Shifting from handling data on premises to Azure has been a real change of mindset. Whilst what I want to build may be similar, the how part is completely different. There’s a learning curve not just to the tooling, but to how you use it too.

This is one of those instances.

I had a storage container with files which had a date in their name. I wanted to perform a wildcard search to select some of them. That sounds straightforward, right?

This is unnecessarily painful, especially if you’re trying to find the right full backup in a container filled with full and transaction log backup files. Andy’s solution does work but also requires a full scan of keys. And I don’t think there’s a better way to do it.
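
To make the pain point concrete: the Blob service can only filter a listing by name prefix, so any real wildcard match happens client-side after enumerating the keys. A minimal sketch in Python with azure-storage-blob (the connection string, container name, and file-name pattern are placeholders, and Andy's post takes its own route to the same result):

```python
from fnmatch import fnmatch
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    conn_str="<storage-connection-string>",   # placeholder
    container_name="backups",                 # placeholder
)

# The service only filters by prefix, so narrow the listing as far as
# possible, then apply the actual wildcard pattern on the client.
pattern = "AdventureWorks_FULL_202403*.bak"   # hypothetical naming scheme
matches = [
    blob.name
    for blob in container.list_blobs(name_starts_with="AdventureWorks_FULL_")
    if fnmatch(blob.name, pattern)
]
print(matches)
```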


The Importance of Orchestration in E(L)TL Processes

Martin Schoombee begins a new series:

In the context of what we’re talking about throughout this series – facilitating the execution of an ETL process in a platform like Azure Data Factory – orchestration means that we’re using the ETL tool primarily for the “E” (Extract) part of the process. In addition to that, most people I know would also use the ETL tool to facilitate the workflow, in other words, the order of execution and any constraints that go along with that.

In what I’d like to call the “traditional” approach for lack of a better term, all parts of the ETL process are performed natively by the tool (image below), using whatever built-in tasks are available and of course accounting for any nuances. With this approach, transformations are typically performed in transit and in memory.

Read on to see how the Orchestration approach differs from the traditional ETL approach.
