Press "Enter" to skip to content

Author: Kevin Feasel

Aggregating by Month and Year in R

Steven Sanderson groups by month and year:

Taming the beast of daily data can be daunting. While it captures every detail, sometimes you need a bird’s-eye view. Enter aggregation, your secret weapon for transforming daily data into monthly and yearly insights. In this post, we’ll dive into the world of R, where you’ll wield powerful tools like dplyr and lubridate to master this data wrangling art.

Click through for examples of summarizing daily data into monthly and annual data. One thing to keep in mind, however, is that the monthly aggregation in these examples groups by month alone rather than by month and year, so if you have July 2023 and July 2024 data, you'll get back a single row for July. It's all about understanding what the grain of your data is, as well as your desired grain.
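As a rough illustration of that grain distinction (the linked post works in R with dplyr and lubridate; this sketch uses T-SQL against a hypothetical dbo.DailySales table), compare grouping by month alone with grouping by year and month:

```sql
-- Hypothetical table: dbo.DailySales(SaleDate date, Amount decimal(18,2))

-- Month-only grain: July 2023 and July 2024 collapse into a single "July" row.
SELECT
    MONTH(SaleDate) AS SaleMonth,
    SUM(Amount)     AS TotalAmount
FROM dbo.DailySales
GROUP BY MONTH(SaleDate);

-- Year + month grain: each July stays separate, which is usually what you want.
SELECT
    YEAR(SaleDate)  AS SaleYear,
    MONTH(SaleDate) AS SaleMonth,
    SUM(Amount)     AS TotalAmount
FROM dbo.DailySales
GROUP BY YEAR(SaleDate), MONTH(SaleDate)
ORDER BY SaleYear, SaleMonth;
```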


Generating Test Data with ChatGPT

Daniel Janik builds fake data:

Have you ever been tasked with creating test data for an application and then run into performance problems once the application moves to production?

Many of us manage databases or applications that contain regulated data that can’t leave a production environment. This means that we need to “clean” the data if it’s going to be used in QA or development work and one common way to de-identify the data is to simply update columns like firstname and lastname with a simple format “firstname” + counter; however, this results in all the data being unique and sequential. Firstname1, firstname2, firstname3, …
This isn’t good for getting like for like results with a production database and can lead to questions we’ve heard before in the workplace like “Why didn’t we catch this in QA?”

This works reasonably well, though you’d want to be sure to seed in edge cases and the like. But if you just need to generate some realistic-ish data pretty quickly, this is one option that can work.
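If you want the cleaned data to look less sequential than firstname1, firstname2, and so on, one sketch of the idea (the table and column names here are made up) is to draw replacement values at random from a small lookup list, so the result keeps duplicates and a more production-like distribution:

```sql
-- Hypothetical de-identification sketch: pick replacement first names at random
-- from a short list so the cleaned data is neither unique nor sequential.
DECLARE @FirstNames TABLE (Name nvarchar(50));
INSERT INTO @FirstNames VALUES (N'Ana'), (N'James'), (N'Mei'), (N'Omar'), (N'Priya');

UPDATE c
SET c.FirstName = fn.Name
FROM dbo.Customers AS c            -- assumed table and column names
CROSS APPLY (
    SELECT TOP (1) Name
    FROM @FirstNames
    ORDER BY CHECKSUM(NEWID(), c.CustomerID)   -- correlate so the pick varies per row
) AS fn;
```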


Goodbye Aurora Serverless v1

Alex Woodie breaks the news:

AWS has notified customers of its Amazon Aurora Serverless v1 service that it will cease supporting the offering at the end of 2024. Replacing v1 in the Aurora Serverless range, which supports Postgres and MySQL databases, will be v2, which offers some advantages but also one big disadvantage: It doesn’t scale all the way down to zero.

Click through for more information.


Visualizing Power BI Import Dependencies as a Graph

Chris Webb builds graphs, but not those types of graphs–the other type of graphs:

A few years ago a new pair of Profiler events was added for Power BI Import mode datasets (and indeed AAS models): the Job Graph events. I blogged about them here but they never got used by anyone because it was extremely difficult to extract useful data from them – you had to run a Profiler trace, save the trace file, run a Python script to generate a .dgml file, then open that file in Visual Studio – which was a shame because they contain a lot of really interesting, useful information. The good news is that with the release of Semantic Link in Fabric and the ability to run Profiler traces from a Fabric notebook it’s now much easier to access Job Graph data and in this blog post I’ll show you how.

Read on to see an example of it in action.


Optimized Locking in Azure SQL DB

Aaron Bertrand tries out a new feature:

In a sentence: Instead of locking individual rows and pages for the life of the transaction, a single lock is held at the transaction level, and row and page locks are taken and released as needed.

This is made possible by previous investments in Accelerated Database Recovery and its persistent version store. A modification can evaluate the predicate against the latest committed version, bypassing the need for a lock until it is ready to update (this is called lock after qualification, or LAQ). There’s a lot more to it than that, and I’m not going to dive deep today, but the result is simple: long-running transactions will lead to fewer lock escalations and will do a lot less standing in the way of the rest of your workload. Locks held for shorter periods of time will naturally help reduce blocking, update conflicts, and deadlocks. And with fewer locks being held at any given time, this will help improve concurrency and reduce overall lock memory.

Read on to learn more about how it works and Aaron’s initial thoughts on the feature.
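If you want to check whether a database is already using the feature, a minimal sketch (assuming the DATABASEPROPERTYEX property name and sys.databases columns shown here) looks like this:

```sql
-- Check optimized locking plus the features it builds on: Accelerated Database
-- Recovery is required, and read committed snapshot isolation is what lets
-- lock-after-qualification (LAQ) evaluate predicates against committed versions.
SELECT
    DB_NAME()                                             AS database_name,
    DATABASEPROPERTYEX(DB_NAME(), 'IsOptimizedLockingOn') AS is_optimized_locking_on,
    d.is_accelerated_database_recovery_on,
    d.is_read_committed_snapshot_on
FROM sys.databases AS d
WHERE d.database_id = DB_ID();
```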


Explaining Models with Classic Methods and SHAP

Michael Mayer has some ‘splainin to do:

Let’s explain a {tidymodels} random forest by classic explainability methods (permutation importance, partial dependence plots (PDP), Friedman’s H statistics), and also fancy SHAP.

Disclaimer: {hstats}, {kernelshap} and {shapviz} are three of my own packages.

What I really appreciate here is that Michael includes the classic methods. It can be easy to say "Oh, this is old and therefore no longer relevant." But that would be quite wrong.


Against Publishing Power BI Model Changes from PBI Desktop

Soheil Bakhshi has some thoughts:

In a previous post, I shared a comprehensive guide on implementing Incremental Data Refresh in Power BI Desktop. We covered essential concepts such as truncation and load versus incremental load, understanding historical and incremental ranges, and the significant benefits of adopting incremental refresh for large tables. If you missed that post, I highly recommend giving it a read to get a solid foundation on the topic.

Now, let’s dive into Part 2 of this series where we will explore tips and tricks for implementing Incremental Data Refresh in more complex scenarios. This blog follows up on the insights provided in the first part, offering a deeper understanding of how Incremental Data Refresh works in Power BI. Whether you’re a seasoned Power BI user or just getting started, this post will provide valuable information on optimising your data refresh strategies. So, let’s begin.

Read on for plenty of detail, including your available options and how to use them.


Digging into Execution Plans for LAG() and LEAD()

Hugo Kornelis looks at a pair of useful window functions:

LAG and LEAD were introduced in SQL Server 2012. They require an OVER clause, but it can only specify PARTITION BY and ORDER BY. No ROWS / RANGE specification for a window frame. Which makes them stand out as unusual in this series.

By default, they return a value from the last row before the current row, or from the first row after the current row, based on the specified sort order and while observing the specified partition boundaries. But there are two optional parameters, an offset to specify that you want, for instance, the third-last row or the second-next row. And the default parameter specifies a value to be used instead of NULL when the indicated row falls outside of the partition.

Click through to see what the plans look like, as well as how some very welcome (though potentially performance-impacting) changes in SQL Server 2022 have affected this.
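For reference, a minimal sketch of the optional offset and default arguments (against a hypothetical dbo.DailySales table) looks like this:

```sql
-- LAG/LEAD with the optional offset and default arguments.
SELECT
    Region,
    SaleDate,
    Amount,
    LAG(Amount)       OVER (PARTITION BY Region ORDER BY SaleDate) AS PrevAmount,   -- previous row; NULL at the partition start
    LAG(Amount, 3, 0) OVER (PARTITION BY Region ORDER BY SaleDate) AS Amount3Back,  -- three rows back; 0 instead of NULL
    LEAD(Amount, 2)   OVER (PARTITION BY Region ORDER BY SaleDate) AS Amount2Ahead  -- two rows ahead
FROM dbo.DailySales;
```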


Stored Procedure Wrapup

Erik Darling wraps up a series on stored procedures. First, cursors and loops:

You will, for better or worse, run into occasions in your database career that necessitate the use of loops and cursors.

While I do spend a goodly amount of time reworking this sort of code to not use loops and cursors, there are plenty of reasonable uses for them.

I do think we push the "don't use cursors or loops" thing a little too hard in the SQL Server world, but I also think that, in a majority of cases in which you're doing something in a loop, you should be doing it in code outside of SQL Server.
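For anyone who hasn't written one in a while, a minimal cursor pattern of the sort Erik is talking about (purely illustrative) looks something like this:

```sql
-- Loop over online databases and run a trivial command against each one.
DECLARE @DatabaseName sysname,
        @Sql          nvarchar(max);

DECLARE db_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name
    FROM sys.databases
    WHERE state_desc = N'ONLINE';

OPEN db_cursor;
FETCH NEXT FROM db_cursor INTO @DatabaseName;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @Sql = N'SELECT ' + QUOTENAME(@DatabaseName, '''') + N' AS database_name;';
    EXEC sys.sp_executesql @Sql;

    FETCH NEXT FROM db_cursor INTO @DatabaseName;
END;

CLOSE db_cursor;
DEALLOCATE db_cursor;
```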

Erik then wraps things up for real:

The general idea of the series was to teach developers about the types of things I always seem to be fixing and adjusting, so that I can hopefully fix really interesting problems in the future.

Of course, that all depends on folks finding these and reading them. If that were the general sway of the world, I'd probably never have been in business in the first place.

Click through for a listing of all of the posts in the series.
