2024-08-26 – Curated SQL

Random Forest Missing Data Imputation using missRanger

Published 2024-08-26 by Kevin Feasel

{missRanger} is a multivariate imputation algorithm based on random forests, and a fast version of the original missForest algorithm of Stekhoven and Buehlmann (2012). Surprise, surprise: it uses {ranger} to fit random forests. Especially combined with predictive mean matching (PMM), the imputations are often quite realistic.

This looks like an interesting package. At first, I thought it was a way of generating predictions outside the boundaries of training data and had concerns—a classic point (limitation?) of random forest as an algorithm is that it will not even try to predict values outside the range of what it sees in training data, so if the largest label is 10 and the smallest is 0, you won’t see a prediction of 11 or 50, no matter how you scale the inputs.

Instead of doing that, missRanger looks like it’s filling in missing data using a clever approach. That’s quite useful for dealing with incomplete data, a really common problem whose good solutions tend to be complex enough that people typically ignore them in favor of simple but less useful solutions like dropping rows altogether.

Comments closed

mssparkutils now notebookutils and Validating DAGs in Fabric

Published 2024-08-26 by Kevin Feasel

Sandeep Pawar gives us two quick hits:

First, if you haven’t noticed mssparkutils has been officially renamed to notebookutils. Check out the official documentation for details. Be sure to use/update your notebooks to notebookutils.

Read on for a pair of notes around this name change, as well as some capabilities to validate DAGs when using runMultiple to orchestrate multiple notebook executions.

Comments closed

Setting Row Label and Key Columns in Power BI

Published 2024-08-26 by Kevin Feasel

Chris Webb makes use of metadata:

A few weeks ago I wrote a post on how to improve the results you get from Power BI Copilot by editing the Linguistic Schema. As I mentioned, though, there are in fact lots of different ways that you as a Power BI semantic model developer can improve the results you get from Copilot and in this post I’ll show you another one: setting the Row Labels and Key Columns properties on a table.

Read on to see how these can affect results.

Comments closed

Databricks Notebook Package Installation and Variables

Published 2024-08-26 by Kevin Feasel

Chen Hirsh diagnoses a problem:

A friend called to ask for my help with a weird issue. In a Databricks notebook using Python, he declares and assigns a variable in the first cell. Something like that:
my_var = 1
He then runs the rest of the notebook, and somewhere along the way, tries to use this variable, and gets this message:
NameError: name 'my_var' is not defined
Going back to cell 1, and checking the value of my_var, he gets the same error.

Read on for the root cause of the issue, as well as a pair of helpful tips from Chen.

Comments closed

pg_dump and the Backup Tool Debate

Published 2024-08-26 by Kevin Feasel

Gulcin Yildirim Jelinek explains the debate around whether pg_dump is a backup tool or not:

Recently, while writing about the vulnerability affecting pg_dump, the topic of decommissioning pg_dump came up on Twitter. Unlike the nostalgic feelings many had for Pluto, there was less reluctance to see pg_dump reclassified. In fact, some people were eager to retire it as a backup utility, and I even got a bit of pushback for still referring to pg_dump as one

I was talking to my colleague Simona the other day, and she mentioned that everybody in Postgres circles says, “pg_dump is not a backup tool,” but perhaps it’s not always explained well why it is not.

Read on for that explanation.

Comments closed

System.Data.SqlClient Deprecated

Published 2024-08-26 by Kevin Feasel

David Engel has an announcement:

We announced Microsoft.Data.SqlClient in the first half of 2019 and shipped the first stable package later that year. That release coincided with .NET Core 3.0. We’ve been developing Microsoft.Data.SqlClient in the dotnet/SqlClient repo since that time, over the last five years. It’s now time to start the deprecation process for the System.Data.SqlClient package. We plan to take a staged process to deprecation, with the intent to align support changes and associated code transition activities between these libraries to natural migration points between .NET major versions.

I haven’t used System.Data.SqlClient in a while, so I’m not sure how much is in that library that isn’t in Microsoft.Data.SqlClient.

Comments closed

M	T	W	T	F	S	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Day: August 26, 2024

Random Forest Missing Data Imputation using missRanger

mssparkutils now notebookutils and Validating DAGs in Fabric

Setting Row Label and Key Columns in Power BI

Databricks Notebook Package Installation and Variables

pg_dump and the Backup Tool Debate

System.Data.SqlClient Deprecated