
Day: March 20, 2024

Duplicating Rows in R

Steven Sanderson repeats the punch line a few times:

Are you working with a dataset where you need to duplicate certain rows multiple times? Perhaps you want to create synthetic data by replicating existing observations, or you need to handle imbalanced data by oversampling minority classes. Whatever the reason, replicating rows in a data frame is a handy skill to have in your R programming toolkit.

In this post, we’ll explore how to replicate rows in a data frame using base R functions. We’ll cover replicating each row the same number of times, as well as replicating rows a different number of times based on a specified pattern.

Click through to replicate data without copy-paste.
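As a rough, language-neutral illustration of the two patterns the excerpt describes (every row repeated the same number of times, and each row repeated its own number of times), here is a minimal Python sketch. It is not Steven's code: the linked post works in base R, where `rep()` over row indices is the usual idiom. The `replicate_rows` helper and the sample rows are hypothetical names made up for this example.

```python
def replicate_rows(rows, times):
    """Return a new list in which each row is repeated.

    `times` is either a single int (same count for every row) or a
    sequence of ints giving one repeat count per row.
    """
    if isinstance(times, int):
        counts = [times] * len(rows)
    else:
        counts = list(times)
    if len(counts) != len(rows):
        raise ValueError("need exactly one repeat count per row")
    # Copy each row so the duplicates can be edited independently.
    return [dict(row) for row, n in zip(rows, counts) for _ in range(n)]


# Hypothetical sample data standing in for a small data frame.
rows = [
    {"team": "A", "points": 10},
    {"team": "B", "points": 20},
    {"team": "C", "points": 30},
]

same = replicate_rows(rows, 2)             # every row twice
varied = replicate_rows(rows, [1, 0, 3])   # A once, B dropped, C three times
```

A count of zero drops the row entirely, which is one way to combine filtering with oversampling in a single pass.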


Learning about GitHub Actions

I have a new video:

In this video, we dig into GitHub’s process for executing code: GitHub Actions workflows. We’ll learn what Actions and workflows are, how we can create them from scratch, and how to incorporate Actions from the GitHub Marketplace into our own workflows.

Along the way, we build a simple workflow from scratch. I’ll have more videos coming up that expand on GitHub Actions and show more of what you can do with them.


Exposing Kafka Data in Iceberg using Tableflow

Marc Selwan announces a new product:

We’re excited to talk about our vision for Tableflow, which makes it push-button simple to take Apache Kafka® data and feed it directly into your data lake, warehouse, or analytics engine as Apache Iceberg® tables. Making operational data accessible to the analytical world is traditionally a complex, expensive, and brittle process and we believe we can do better to unify the operational and analytical estates.

Tableflow removes all this erroneous, duplicative work and helps convert Kafka topics and associated schemas to Iceberg tables in one click. This is central to Confluent’s vision to build the world’s leading data streaming platform that fuels any operational and analytical workload with real-time data products.

It looks like this is currently in early access, but you can see where Confluent intends to take the product.


Changing the Timeout of a Spark Session in Microsoft Fabric

Koen Verbeeck doesn’t have time to wait:

You might know the feeling: you’re writing code in a Notebook in Microsoft Fabric and suddenly you have to leave your workstation for a while. Someone rang the doorbell (you’re working from home and you get some parcels delivered), or you took a coffee break with some colleagues. When you return to your notebook, the Spark session has timed out and when you run a cell, you have to wait for the damn thing to restart again. The agony, waiting for 2-3 minutes for the session to start, and only after that the actual code can start running.

Read on to see how you can set the timeout to a custom value, assuming you’re okay with paying for the Spark cluster to sit around until it times out.


Dropping Objects in SQL Server and Snowflake

Kevin Wilkie gets the drop on us:

When you’re working between SQL Server and Snowflake, there can be a lot of crossover that may make you forget what system you’re working in. Sometimes it’s close, but not close enough.

Today, let’s go over something that should be rather simple – removing old objects that we shouldn’t need any longer.

Read on to see how the two data platform technologies differ in this regard.


The Proper Use of Views and Inline UDFs

Erik Darling plays tic-tac-toe:

The problem is really the stuff that people stick into views. They’re sort of like a junk drawer for data. Someone builds a view that returns a correct set of results, which becomes a source of truth. Then someone else comes along and uses that view in another view, because they know it returns the correct results, and so on and so on. Worse, views tend to do a bunch of data massaging, left joining and coalescing and substringing and replacing and case expressioning and converting things to other things. The bottom line is that views are as bad as you make them.

The end result is a trash monster with a query plan that can only be viewed in full from deep space.

Read on to learn the use cases for views and inline UDFs, as well as a few important notes regarding performance of each. Views are like mogwai: they’re fine as long as you never get them wet and never let them eat after midnight. The problem is, far too many companies are apparently the business equivalent of all-you-can-eat buffets at water parks.

Inline user-defined functions are like patenting a device that lets you shoot yourself in both feet with one pull of the trigger. Which, if I understand things correctly, means you’ll need a Form 4 for each inline UDF.


What’s New in SSMS 20

Erin Stellato gives us the skinny:

We expect that the first two posts, combined with the release notes and the new Connect with SQL Server Management Studio page, provide the details you need about the changes in SSMS 20 GA. As such, the focus of this post is the roadmap for SSMS. Our roadmap is heavily influenced by the evolving capabilities of SQL Server and Azure SQL, and feedback from SSMS users. We’re currently collecting general feedback at https://aka.ms/sqlfeedback, and feedback on Copilot in SSMS at https://aka.ms/ssms-copilot-feedback. Please comment and upvote on items that you would like to see in SSMS!

With SSMS 20 now being generally available, you can download it and try it out in your own environment. Erin quells any fears that Microsoft is abandoning SSMS and covers some of the big-ticket items on the roadmap.


Migrating from Power BI to Microsoft Fabric

Paul Turley gives us an overview:

Fabric is here but what does that mean if you are using Power BI? What do you need to know and what, if anything, will you need to change if you are a Power BI report designer, developer or BI solution architect? What parts of Fabric should you use now and how do you plan for the near-term future? As I write this in March of 2024, I’m at the Microsoft MVP Summit at the Microsoft campus in Redmond, Washington this week learning about what the product teams will be working on over the next year or so. Fabric is center stage in every conversation and session. To say that Fabric has moved my cheese would be a gross understatement. I’ve been working with data and reporting solutions for about 30 years and have seen many products come and go. Everything I knew about working with databases, data warehouses, transforming and reporting on data has changed recently BUT it doesn’t mean that everyone using Power BI must stop what they are doing and adapt to these changes. The core product is unchanged. Power BI still works as it always has.

Read on to learn more about Paul’s thesis and how the world changes with Microsoft Fabric.
