
Author: Kevin Feasel

Quantile Normalization with TidyDensity

Steven Sanderson achieves normality:

In data analysis, especially when dealing with multiple samples or distributions, ensuring comparability and removing biases is crucial. One powerful technique for achieving this is quantile normalization. This method aligns the distributions of values across different samples, making them more similar in terms of their statistical properties.

Read on to see how you can use the TidyDensity package to pull this off.
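To illustrate the mechanics of the technique (a minimal sketch in Python with numpy rather than the TidyDensity R package; the function and the data are my own, not from the post):

import numpy as np

def quantile_normalize(X):
    # Quantile-normalize the columns of X (one sample per column):
    # rank each value within its column, then replace it with the mean
    # of the values holding that rank across all columns. Afterward,
    # every column has an identical distribution. Ties are handled
    # naively here; library implementations are more careful.
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    rank_means = np.sort(X, axis=0).mean(axis=1)
    return rank_means[ranks]

rng = np.random.default_rng(42)
X = rng.normal(loc=[0.0, 5.0, -3.0], scale=[1.0, 2.0, 0.5], size=(1000, 3))
Xn = quantile_normalize(X)
print(Xn.mean(axis=0), Xn.std(axis=0))  # columns now share one distribution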


Filtering a Visual by a Measure via a Slicer in Power BI

Meagan Longoria solves a problem:

Have you ever wanted to filter a visual by selecting a range of values for a measure? You may have found that you cannot populate a slicer with a measure. But you can do this another way.

I have a report that shows project expenses and budgets. I want users to be able to filter the list of projects to only those which have expenses within my selected range. I also have 2 other slicers for project budget and percent of budget used, but let’s just focus on the expense amount slicer.

Read on to see how.


The Challenge of Developing PostgreSQL Features

Robert Haas talks about a development challenge:

Hacking on PostgreSQL is really hard. I think a lot of people would agree with this statement, not all for the same reasons. Some might point to the character of discourse on the mailing list, others to the shortage of patch reviewers, and others still to the difficulty of getting the attention of a committer, or of feeling like a hostage to some committer’s whimsy. All of these are problems, but today I want to focus on the purely technical aspect of the problem: the extreme difficulty of writing reasonably correct patches.

Read on for Robert’s experience developing incremental backups in Postgres. In fairness, I think this is true of any complex system which becomes mission-critical. It’s really easy to develop in low-risk, limited-code, greenfield environments. As you change each of those properties, development gets considerably more challenging, even if people are doing the right things the right way and checking ego at the door.


Classification Concepts and CART in Action

I have a new video series:

In this video, I explain some core concepts behind classification and introduce the first classification algorithm we will look at: CART.

CART, by the way, stands for Classification and Regression Trees, and is one of the easiest classification algorithms to understand as a concept: it’s a decision tree (aka, a series of if-else statements) where each terminal node is an outcome: either a class for classification or a value for regression.
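For a concrete feel, here is a short sketch in Python using scikit-learn's DecisionTreeClassifier (an optimized CART implementation); the iris dataset is just a stand-in example, not from the video:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A shallow tree keeps the if-else structure readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(tree.score(X_test, y_test))  # accuracy on held-out data
print(export_text(tree))           # the tree as a series of if-else rules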


Visualizing a Spark Execution Plan

Gerhard Brueckl builds a very helpful tool:

I recently found myself in a situation where I had to optimize a Spark query. Coming from a SQL world originally, I knew how valuable a visual representation of an execution plan can be when it comes to performance tuning. Soon I realized that there is no easy-to-use tool or snippet which would allow me to do that. There are tools like DataFlint, the ubiquitous Spark monitoring UI, or the Spark explain() function, but they are either hard to use or hard to get up and running, especially as I was looking for something that works in both of my favorite Spark engines: Databricks and Microsoft Fabric.

Read on for Gerhard’s answer, including an example of it in action.
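Gerhard's tool renders the plan graphically; for comparison, the built-in explain() he mentions produces the text form. A minimal PySpark sketch (the DataFrame here is made up for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
agg = df.groupBy("bucket").agg(F.count("*").alias("rows"))

# Prints the parsed, analyzed, optimized, and physical plans as text:
# complete, but much harder to scan than a visual representation.
agg.explain(mode="formatted")

spark.stop()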


Finding Object Dependencies in SQL Server

Andy Brownsword looks for references:

When looking to migrate, consolidate, or deprovision parts of a SQL solution, it’s key to understand the dependencies on the objects inside.

Identifying dependencies can be challenging, and I wanted to demonstrate one way to approach this. We’ll start with some objects across a couple of databases:

Read on for a pair of queries that get you on the way. Reference detection is surprisingly difficult in SQL Server, especially if you have cross-server queries. Even cross-database queries may not work the way you expect.

Another option is to use sys.dm_sql_referencing_entities and sys.dm_sql_referenced_entities. I wrote a blog post on the topic a long while back and included some of the caveats around these two DMFs.
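As a quick illustration of those two DMFs, here is a sketch in Python with pyodbc; the connection string and the object names are placeholders:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=MyDatabase;Trusted_Connection=yes;TrustServerCertificate=yes;"
)
cursor = conn.cursor()

# Everything dbo.SomeProcedure references, including cross-database
# (and, where detectable, cross-server) objects.
cursor.execute("""
    SELECT referenced_server_name, referenced_database_name,
           referenced_schema_name, referenced_entity_name
    FROM sys.dm_sql_referenced_entities('dbo.SomeProcedure', 'OBJECT');
""")
for row in cursor.fetchall():
    print(row)

# Everything in the current database that references dbo.SomeTable.
cursor.execute("""
    SELECT referencing_schema_name, referencing_entity_name
    FROM sys.dm_sql_referencing_entities('dbo.SomeTable', 'OBJECT');
""")
for row in cursor.fetchall():
    print(row)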


Logging and Auditing in PostgreSQL

Muhammad Ali checks the logs:

In PostgreSQL, managing logs serves as a vital tool for identifying and resolving issues within your application and database. However, navigating through logs can be overwhelming due to the volume of information they contain. To address this, it’s essential to implement a well-defined logs management strategy.

Customizing PostgreSQL logs involves adjusting various parameters to suit your specific needs. Each organization may have unique requirements for logging, depending on factors such as the type of data stored and compliance standards.

In this article, we will explain the parameters used to customize logs in PostgreSQL. Furthermore, we will describe how to record queries in PostgreSQL and finally recommend a tool for managing PostgreSQL logs at a granular level.

Read on to learn how to enable logs in Postgres, some notes on log management, and even a bit on auditing via pgaudit.
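To see where an instance currently stands, a small Python sketch with psycopg2 (the DSN is a placeholder, and the parameter list is just a sample of the settings the article covers):

import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres host=localhost")
with conn.cursor() as cur:
    # A few of the parameters that shape what ends up in the log.
    cur.execute("""
        SELECT name, setting
        FROM pg_settings
        WHERE name IN ('logging_collector', 'log_statement',
                       'log_min_duration_statement', 'log_line_prefix')
        ORDER BY name;
    """)
    for name, setting in cur.fetchall():
        print(f"{name} = {setting}")
conn.close()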
