Month: May 2023

Reusable Power BI Deployment Pipelines

Richard Swinbank re-uses a pipeline:

Implementation of one pipeline per report makes additional demands of a developer when creating a new report. To make this easier to manage, in this post I look at how to make pipeline creation as simple as possible, by building each pipeline from a set of reusable components.

Click through to see how this works in Azure DevOps. I’d expect the process to be reasonably similar for GitHub Actions as well.

The Current Status of the Lakehouse Architecture

Paul Turley is happy:

When I first started attending conference and user group sessions about Lakehouse architecture, I didn’t get it at first, but I do now; and it checks all the boxes. As a Consulting Services Director in a practice with over 200 BI developers and data warehouse engineers, I see first-hand how our customers – large and small – are adopting the Lakehouse for BI, Data science and operational reporting.

Read on for Paul’s thoughts. My main concern with the strategy has always been performance, with the expectation that it’d take a few years for lakehouse systems to be ready for prime time. We’re nearing the end of that window (back in 2020, I believe I estimated 2024-2025).

Notes on Postgres Backups

Muhammad Ali hits us with it:

Backing up your PostgreSQL database is a critical task for ensuring the safety and availability of your data. In the event of a hardware failure, software error, or other disaster, having a recent backup of your database can mean the difference between a brief outage and a catastrophic data loss. In this blog post, we’ll cover best practices for backing up a PostgreSQL database.

Click through for some notes on various backup utilities (pg_dump, pg_dumpall, pg_basebackup), when you might want to use each, and a few more topics.

Capturing Stored Procedure Call Parameters

Greg Dodd has an Extended Events session for that:

Do you ever have a stored procedure that you know is performing badly and needs tuning, but you’re struggling to capture when it is run or what parameters were passed in?
I had this problem recently: I knew that a particular stored procedure was running slowly for some parameters, but figuring out what the bad combination was proved to be very difficult. What if, instead of trying to guess what the parameters were, I just captured all of the times that the stored proc ran, along with the run time?

This isn’t something you’d want to run for everything at all times, I’m sure, but it can be quite important when you’re fighting one poorly-performing stored procedure.
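
Greg builds his own session definition in the post. As a rough sketch of what such a session looks like (the session name, file target, and procedure name below are placeholders, not Greg’s exact setup), you could capture every completed call to a single procedure, including parameter values and duration:

    -- Capture completed calls to one stored procedure; the rpc_completed
    -- event includes the full EXEC statement (with parameter values)
    -- and the duration of each call
    CREATE EVENT SESSION [CaptureProcCalls] ON SERVER
    ADD EVENT sqlserver.rpc_completed
    (
        ACTION (sqlserver.database_name, sqlserver.client_app_name)
        WHERE (object_name = N'usp_MySlowProc')  -- placeholder procedure name
    )
    ADD TARGET package0.event_file (SET filename = N'CaptureProcCalls.xel')
    WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);

    ALTER EVENT SESSION [CaptureProcCalls] ON SERVER STATE = START;

Read the resulting .xel file and you have every parameter combination alongside its run time.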

Date Handling in Excel and R

Amieroh Abrahams continues a series comparing Excel and R:

Here we will explore the various ways to handle dates in Excel and R. Dates are a crucial part of data analysis and are used in various fields such as biology, healthcare, and social sciences. However, working with dates can be challenging, especially when dealing with large datasets or multiple formats.

In Excel, there are several functions available to handle dates, such as DATE, YEAR, MONTH, and DAY. Excel also provides various formatting options to customise the display of dates. However, Excel has some limitations when it comes to complex date calculations, and it can be time-consuming to work with dates in large datasets.

In contrast, R has a robust set of tools for handling dates, including the {lubridate} package, which simplifies the manipulation of dates and times. Additionally, R allows for efficient handling of dates in large datasets, making it a powerful tool for time-series analysis. Whether you are working with dates in Excel or R, this blog will provide you with the basic tools and techniques to handle dates efficiently and accurately. So let’s get started!

Read on for the comparison.

Building Your First Spark SQL Application

Dustin Vannoy has a new video for us:

Get hands on with Spark SQL (no Python or Scala) to build your first data pipeline. In this video I walk you through how to read, transform, and write the NYC Taxi dataset with Spark SQL. This dataset can be found on Databricks, Azure Synapse, or downloaded from the web to wherever you run Apache Spark. Once you have watched and followed along with this tutorial, go find a free dataset and try to write your own application with Apache Spark.

Click through for the video and sample code.
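
If you want a feel for the pattern before watching, here is a minimal read-transform-write sketch in pure Spark SQL; the storage path, table names, and aggregation are placeholders rather than Dustin’s actual code:

    -- Register the raw taxi data at some path as an external table
    CREATE TABLE IF NOT EXISTS taxi_raw
    USING parquet
    LOCATION '/data/nyctaxi/yellow/';

    -- Transform: daily trip counts and average fares
    CREATE OR REPLACE TEMPORARY VIEW taxi_daily AS
    SELECT CAST(tpep_pickup_datetime AS DATE) AS trip_date,
           COUNT(*) AS trip_count,
           AVG(fare_amount) AS avg_fare
    FROM taxi_raw
    GROUP BY CAST(tpep_pickup_datetime AS DATE);

    -- Write the result back out as its own table
    CREATE TABLE taxi_daily_summary
    USING parquet
    AS SELECT * FROM taxi_daily;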

Backup Encryption Performance

Matthew McGiffen runs some tests:

Unlike with TDE, there is some extra CPU overhead when you take an encrypted backup, as the data has to be encrypted before being written to disk – whereas with TDE the data is already encrypted. Backup times, however, are unlikely to be affected significantly, as the bottleneck is usually going to be the time it takes to physically write the data to disk. The CPU processing should take a fraction of that time.

Matthew’s tests are on a ~9GB database, which is large enough to estimate differences without having us wait all day to compare.
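
As a refresher on what is under test here: native backup encryption requires a certificate (or asymmetric key) in master, and the encryption options then go on the BACKUP statement itself. The names, password, and paths below are placeholders:

    -- One-time setup in master
    USE master;
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password here>';
    CREATE CERTIFICATE BackupEncryptCert
        WITH SUBJECT = 'Certificate for backup encryption';

    -- The encryption happens as data is written to the backup file,
    -- which is where the extra CPU cost Matthew measures comes in
    BACKUP DATABASE [TestDB]
    TO DISK = N'D:\Backups\TestDB_Encrypted.bak'
    WITH COMPRESSION,
         ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = BackupEncryptCert),
         STATS = 10;

And do back up that certificate somewhere safe: without it, you cannot restore the encrypted backup.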

SQL Agent and Memory Consumption

Sean Gallardy performs some troubleshooting:

I was asked if I knew any reason why SQL Agent would be using “a bunch” of memory and more CPU than normal. You and I, reader, now have the same information to go on. What do you do? Think about it for a minute or two if you want before reading on. I’m not saying the way I did it was the way to do it or the only way, but I gave an action plan for data capture and, once data was in hand, it was solved in a few minutes. There are all sorts of things that can cause this in a program, but Agent typically doesn’t use a whole lot, unless it’s executing many concurrent T-SQL jobs (which it really shouldn’t be; get enterprise-level scheduling) and even then, the memory should deallocate.

My first inclination turned out to be right: I figured it had to do with a job running. The specifics were something I wasn’t sure about, but Sean takes us through the troubleshooting process.
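
Sean walks through the real diagnostic process in the post. If you want a quick first-pass check of the “is a job running?” theory on your own server, something along these lines (a starting point, not Sean’s method) asks msdb for currently executing jobs:

    -- Jobs currently executing under the current Agent session
    SELECT j.name AS job_name,
           ja.start_execution_date,
           DATEDIFF(MINUTE, ja.start_execution_date, SYSDATETIME()) AS minutes_running
    FROM msdb.dbo.sysjobactivity AS ja
        INNER JOIN msdb.dbo.sysjobs AS j
            ON j.job_id = ja.job_id
    WHERE ja.session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions)
      AND ja.start_execution_date IS NOT NULL
      AND ja.stop_execution_date IS NULL;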

Could Not Update the Metadata that Indicates Database Enabled for CDC

Jose Manuel Jurado Diaz troubleshoots an issue in Azure SQL DB:

Today, we got an error message while trying to enable CDC for a database using the sqladmin user. Our customer got the error message: Msg 22830, Level 16, State 1, Procedure sys.sp_cdc_enable_db_internal, Line 283 [Batch Start Line 0]
Could not update the metadata that indicates database XYZ is enabled for Change Data Capture. The failure occurred when executing the command ‘SetCDCTracked(Value = 1)’. The error returned was 33171: ‘Only active directory users can impersonate other active directory users.’. Use the action and error to determine the cause of the failure and resubmit the request.

Read on to understand what the problem is and how you can resolve it.
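
For context, enabling CDC at the database level is a one-liner, and because error 33171 is about impersonation, the database owner is worth a look; a mismatch between SQL-authenticated and Azure AD principals is a plausible culprit, though Jose has the definitive answer:

    -- Enable Change Data Capture for the current database
    EXEC sys.sp_cdc_enable_db;

    -- Check who owns the database; an Azure AD vs. SQL authentication
    -- mismatch between the caller and the owner is suspect here
    SELECT name, SUSER_SNAME(owner_sid) AS database_owner
    FROM sys.databases
    WHERE name = DB_NAME();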
