Author: Kevin Feasel

Using the pg_repack Extension

Muhammad Ali tries out an extension:

Regular updates and deletions within PostgreSQL tables can lead to various issues such as bloat, fragmentation, and a decline in performance over time. These challenges can significantly impact the efficiency and reliability of the database, potentially affecting critical operations.

To address these concerns, PostgreSQL introduced the pg_repack extension, which provides a robust solution for managing table maintenance without disrupting the production environment. By allowing tables to be rebuilt online, pg_repack tackles bloat and fragmentation issues, ensuring that database storage remains optimized and performance is consistently maintained.

Read on to see why vacuuming might not be enough and what pg_repack does.
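
To give a sense of what running it looks like, here is a minimal Python sketch that assembles a pg_repack command line for a single table. It assumes the pg_repack client is installed and that the extension has been created in the target database. The database and table names are placeholders, not names from Muhammad's post. The `--dbname`, `--table`, and `--dry-run` options belong to the pg_repack command-line tool; the helper function itself is hypothetical.

```python
import shlex


def build_pg_repack_command(dbname, table, dry_run=True):
    """Assemble a pg_repack invocation for one table (hypothetical helper)."""
    command = ["pg_repack", f"--dbname={dbname}", f"--table={table}"]
    if dry_run:
        # --dry-run reports what would be repacked without rebuilding anything.
        command.append("--dry-run")
    return command


if __name__ == "__main__":
    # "appdb" and "public.orders" are placeholder names for illustration only.
    print(shlex.join(build_pg_repack_command("appdb", "public.orders")))
```

Defaulting to a dry run is a deliberate choice in this sketch: it lets you check the target before pg_repack rebuilds anything.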

Using F-SKU Power BI Capacity and Microsoft Fabric

Chris Webb has a public service announcement:

Since the announcement in March that Power BI Premium P-SKUs are being retired and that customers will need to migrate to F-SKU capacities instead, I have been asked the same question several times:

Why are you forcing me to migrate to Fabric???

This thread on Reddit is a great example. What I want to make clear in this post is the following:

Moving from P-SKU capacities to F-SKU capacities is not the same thing as enabling Fabric in your tenant

Click through for Chris’s explanation. Also check out the comments section for this one, as there are plenty of questions and responses in there.

Gradient Boosting for Classification

I have a new video:

In this video, I take a look at an alternative to bootstrap aggregation & random forest: boosting. We cover a brief history of boosting and see how it works in action with XGBoost and LightGBM.

This is probably the video with the single largest number of links in my show notes. It’s also one of the shortest in the series; it’s funny how things work out sometimes.
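
For readers who want the core idea in code, the following is a rough sketch of gradient boosting for binary classification using depth-one decision stumps on a single feature. It is not the code from the video and is far simpler than XGBoost or LightGBM, which add many refinements omitted here. All function names and the toy data are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Pick the threshold whose two-sided means best fit the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def fit_boosted_stumps(xs, ys, rounds=20, learning_rate=0.5):
    """Each round fits a stump to the negative gradient of log loss, y - p."""
    scores = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        # Store the contributions already scaled by the learning rate.
        stump = (threshold, learning_rate * left_mean, learning_rate * right_mean)
        stumps.append(stump)
        scores = [s + (stump[1] if x <= threshold else stump[2])
                  for x, s in zip(xs, scores)]
    return stumps


def predict_proba(stumps, x):
    """Sum every stump's contribution and squash the score to a probability."""
    score = sum(left if x <= threshold else right
                for threshold, left, right in stumps)
    return sigmoid(score)


if __name__ == "__main__":
    # Toy data: small values are class 0, large values are class 1.
    xs = [1, 2, 3, 4, 6, 7, 8, 9]
    ys = [0, 0, 0, 0, 1, 1, 1, 1]
    model = fit_boosted_stumps(xs, ys)
    print(predict_proba(model, 2), predict_proba(model, 8))
```

The contrast with bootstrap aggregation is that each stump here is fitted to the errors left by the stumps before it, so the learners are built in sequence instead of independently.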

Authenticating to Fabric APIs via Sempy and Service Principals

Gilbert Quevauvilliers links everything together:

I have been doing a fair amount of work lately with Fabric Notebooks.

I am always conscious to ensure that when I am authenticating using a Service Principal, I can make sure it is as secure as possible. To do this I have found that I can use the Azure Key Vault and Azure identity to successfully authenticate.

Read on for some of the advantages of using Azure Key Vault for this sort of credential management, as well as how to get it all working.
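
As a rough illustration of the pattern, and not Gilbert's actual code, the sketch below reads the service principal's details through a secret lookup and then builds a credential from them. It assumes the `azure-identity` package is installed, and the `key_vault_lookup` helper additionally assumes `azure-keyvault-secrets`. The helper functions, the secret names, and the dictionary keys are all hypothetical.

```python
def load_service_principal_settings(get_secret, secret_names):
    """Resolve each setting through a caller-supplied secret lookup function."""
    return {key: get_secret(name) for key, name in secret_names.items()}


def key_vault_lookup(vault_url, credential):
    """Return a lookup backed by Azure Key Vault (assumes azure-keyvault-secrets)."""
    # Imported lazily so the rest of the sketch works without the Azure SDK.
    from azure.keyvault.secrets import SecretClient

    client = SecretClient(vault_url=vault_url, credential=credential)
    return lambda name: client.get_secret(name).value


def build_credential(settings):
    """Create a service principal credential (assumes azure-identity)."""
    from azure.identity import ClientSecretCredential

    return ClientSecretCredential(
        tenant_id=settings["tenant_id"],
        client_id=settings["client_id"],
        client_secret=settings["client_secret"],
    )


if __name__ == "__main__":
    # Hypothetical secret names; an in-memory dict stands in for the vault here.
    fake_vault = {"sp-tenant": "t-123", "sp-client": "c-456", "sp-secret": "s-789"}
    names = {"tenant_id": "sp-tenant", "client_id": "sp-client",
             "client_secret": "sp-secret"}
    settings = load_service_principal_settings(fake_vault.__getitem__, names)
    print(sorted(settings))
```

Keeping the lookup separate from the credential means the client secret never has to appear in the notebook itself, which is the main appeal of storing it in Key Vault.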

Creating Orchestrators in Azure Data Factory

Martin Schoombee continues a series on building an orchestration framework in Azure Data Factory:

The orchestration layer of the framework is where all the magic happens. It facilitates the execution of processes and/or tasks as defined in the metadata, and needs to do it both seamlessly and efficiently. Ideally you would want to deploy this layer only once, and never have to touch it again. And it is really with that in mind that I designed this layer…to function independently and with minimal dependencies in both directions.

I would have loved for this layer to consist of only one pipeline but there are some nuances in Data Factory that make it impossible, the primary nuance being that you cannot nest ForEach activities. As a result, this layer contains three pipelines that will be covered by the sections below in more detail.

Read on to see what those three pipelines are.
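
The nesting restriction is easier to picture with a small analogy in Python. This sketch does not reproduce Martin's pipelines; the three functions and the metadata layout are assumptions chosen to show how an inner loop can be moved into a separate unit that the outer loop calls, much as a ForEach activity can invoke a child pipeline that holds its own ForEach.

```python
def run_task(task):
    """Innermost unit of work; stands in for a worker pipeline."""
    return f"ran {task}"


def run_process(process):
    """Child level: holds the inner loop over one process's tasks."""
    return [run_task(task) for task in process["tasks"]]


def run_orchestrator(metadata):
    """Parent level: loops over processes and hands each one to the child."""
    results = {}
    for process in metadata:
        results[process["name"]] = run_process(process)
    return results


if __name__ == "__main__":
    # Hypothetical metadata describing two processes and their tasks.
    metadata = [
        {"name": "load_sales", "tasks": ["extract", "transform"]},
        {"name": "load_customers", "tasks": ["extract"]},
    ]
    print(run_orchestrator(metadata))
```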

Refreshable Excel Files in the Power BI Service

Kristyna Ferris shows how to refresh a Power BI data source from Excel files in SharePoint:

Ever since Excel made its debut in the 1980s, it has been used as a quick way for end users to input and manipulate data on their own without going through the extensive data engineering and data ingestion processes. With Power BI coming onto the scene in 2015, it quickly became the go-to visualization tool for various data sources. These two powerful tools can be used together to drive customized insights for your organization. By uploading your Excel file into SharePoint/OneDrive, you can easily connect and set up a refresh to a Power BI report in the Power BI Service without an on-premises gateway.

Read on to see how it all works.

A Primer on SSIS Package Deployment

Andy Brownsword gives us a blast from the past:

Configurations for Integration Services packages allow us to tailor their execution without needing to redeploy. There are two main ways to manage these configurations – Package Configuration and Project Configuration. In this post we’ll look at the Package Configuration approach.

Package deployment was the original approach, though as Andy points out, it’s no longer the default.

Checking if a Column Exists in an R Data Frame

Steven Sanderson takes a peek:

When working with data frames in R, it’s common to need to check whether a specific column exists. This is particularly useful in data cleaning and preprocessing, to ensure your scripts don’t throw errors if a column is missing. Today, we’ll explore several methods to perform this check efficiently in R, and I encourage you to try these methods out with your own data sets.

Read on for four ways to do this.

Data Quality Issues in Python-Based Time Series Analysis

Hadi Fadlallah checks out the data:

Time-series data analysis is one of the most important analysis methods because it provides great insights into how situations change with time, which helps in understanding trends and making the right choices. However, there is a high dependence on its quality.

Data quality mistakes in time series data sets have implications that extend over a large area, such as the accuracy and trustworthiness of analyses, as well as their interpretation. For instance, mistakes can be caused by modes of data collection, storage, and processing. Specialists working on these data sets must acknowledge these data quality obstacles.

Read on for several examples of data quality issues you might run into in a time series dataset, as well as their fixes.
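
The excerpt above does not name specific issues, so as an assumed example, here is a small Python sketch that checks a list of timestamps for two common problems: duplicate readings and gaps larger than the expected interval. The function name and the sample readings are hypothetical and are not taken from Hadi's article.

```python
from datetime import datetime, timedelta


def find_timestamp_issues(timestamps, expected_step):
    """Return duplicated timestamps and gaps wider than the expected step."""
    ordered = sorted(timestamps)
    duplicates = []
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        delta = current - previous
        if delta == timedelta(0):
            # Record each duplicated timestamp only once.
            if current not in duplicates:
                duplicates.append(current)
        elif delta > expected_step:
            gaps.append((previous, current))
    return duplicates, gaps


if __name__ == "__main__":
    # Hypothetical hourly readings with one duplicate and one missing stretch.
    readings = [
        datetime(2024, 1, 1, 0, 0),
        datetime(2024, 1, 1, 1, 0),
        datetime(2024, 1, 1, 1, 0),
        datetime(2024, 1, 1, 4, 0),
    ]
    duplicates, gaps = find_timestamp_issues(readings, timedelta(hours=1))
    print("duplicates:", duplicates)
    print("gaps:", gaps)
```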