Press "Enter" to skip to content

Month: August 2024

Comparing Power BI Stacked Column Charts

Meagan Longoria compares and contrasts:

One of the new features in the August Power BI Desktop release is the updated legends that are styled to more accurately reflect the per-series formatting on the visual. This made me curious how close I could get to the clean look of a Deneb (vega-lite) stacked bar chart.

I used open source data from the Vega github repo and applied a few filters in Power BI. I used exactly the same source data in each chart, filtering down to five sites and 9 varieties. I chose a visual with a legend, data labels, and total labels to display the differences. We can argue about whether you should show a chart with this many labels another day.

Click through to see how the in-built visual compares to a custom visual of the same form.

Comments closed

Compressing Indexes and Shrinking Azure SQL MI Databases

Kendra Little has a good reason for an often-bad act:

Shrinking databases in SQL Server isn’t fun – it’s slow, it causes blocking if you forget to use the WAIT_AT_LOW_PRIORITY option, and sometimes it persistently fails and refuses to budge until you restart the instance. You only want to shrink a SQL Server database when you’ve got a good reason and a lot of patience.

If you’re using Azure SQL Managed Instance and you haven’t already used data compression on your indexes and shrunk your databases, you probably have two good reasons to do both of those things: performance and cost reduction.

Compressing indexes is very often (almost always?) a good thing. Shrinking databases is very often (again, almost always?) a bad thing. This is like a buddy cop movie for your database.

Kendra gives some good advice but also lays out a warning if you’re on General Purpose V1, so read the whole thing.

Comments closed

Random Forest Missing Data Imputation using missRanger

Michael Mayer handles missing data:

{missRanger} is a multivariate imputation algorithm based on random forests, and a fast version of the original missForest algorithm of Stekhoven and Buehlmann (2012). Surprise, surprise: it uses {ranger} to fit random forests. Especially combined with predictive mean matching (PMM), the imputations are often quite realistic.

This looks like an interesting package. At first, I thought it was a way of generating predictions outside the boundaries of training data and had concerns—a classic point (limitation?) of random forest as an algorithm is that it will not even try to predict values outside the range of what it sees in training data, so if the largest label is 10 and the smallest is 0, you won’t see a prediction of 11 or 50, no matter how you scale the inputs.

Instead of doing that, missRanger looks like it’s filling in missing data using a clever approach. That’s quite useful for dealing with incomplete data, a really common problem whose good solutions tend to be complex enough that people typically ignore them in favor of simple but less useful solutions like dropping rows altogether.

Comments closed

mssparkutils now notebookutils and Validating DAGs in Fabric

Sandeep Pawar gives us two quick hits:

First, if you haven’t noticed mssparkutils has been officially renamed to notebookutils. Check out the official documentation for details. Be sure to use/update your notebooks to notebookutils.

Read on for a pair of notes around this name change, as well as some capabilities to validate DAGs when using runMultiple to orchestrate multiple notebook executions.

Comments closed

Setting Row Label and Key Columns in Power BI

Chris Webb makes use of metadata:

A few weeks ago I wrote a post on how to improve the results you get from Power BI Copilot by editing the Linguistic Schema. As I mentioned, though, there are in fact lots of different ways that you as a Power BI semantic model developer can improve the results you get from Copilot and in this post I’ll show you another one: setting the Row Labels and Key Columns properties on a table.

Read on to see how these can affect results.

Comments closed

Databricks Notebook Package Installation and Variables

Chen Hirsh diagnoses a problem:

A friend called to ask for my help with a weird issue. In a Databricks notebook using Python, he declares and assigns a variable in the first cell. Something like that:

my_var = 1

He then runs the rest of the notebook, and somewhere along the way, tries to use this variable, and gets this message:

NameError: name 'my_var' is not defined

Going back to cell 1, and checking the value of my_var, he gets the same error.

Read on for the root cause of the issue, as well as a pair of helpful tips from Chen.

Comments closed

pg_dump and the Backup Tool Debate

Gulcin Yildirim Jelinek explains the debate around whether pg_dump is a backup tool or not:

Recently, while writing about the vulnerability affecting pg_dump, the topic of decommissioning pg_dump came up on Twitter. Unlike the nostalgic feelings many had for Pluto, there was less reluctance to see pg_dump reclassified. In fact, some people were eager to retire it as a backup utility, and I even got a bit of pushback for still referring to pg_dump as one

I was talking to my colleague Simona the other day, and she mentioned that everybody in Postgres circles says, “pg_dump is not a backup tool,” but perhaps it’s not always explained well why it is not.

Read on for that explanation.

Comments closed

System.Data.SqlClient Deprecated

David Engel has an announcement:

We announced Microsoft.Data.SqlClient in the first half of 2019 and shipped the first stable package later that year. That release coincided with .NET Core 3.0. We’ve been developing Microsoft.Data.SqlClient in the dotnet/SqlClient repo since that time, over the last five years. It’s now time to start the deprecation process for the System.Data.SqlClient package. We plan to take a staged process to deprecation, with the intent to align support changes and associated code transition activities between these libraries to natural migration points between .NET major versions.

I haven’t used System.Data.SqlClient in a while, so I’m not sure how much is in that library that isn’t in Microsoft.Data.SqlClient.

Comments closed

Interpreting Linear Regression Model Coefficients

Vinod Chugani looks at a linear regression:

Linear regression models are foundational in machine learning. Merely fitting a straight line and reading the coefficient tells a lot. But how do we extract and interpret the coefficients from these models to understand their impact on predicted outcomes? This post will demonstrate how one can interpret coefficients by exploring various scenarios. We’ll delve into the analysis of a single numerical feature, investigate the role of categorical variables, and unpack the complexities introduced when these features are combined. Through this exploration, we aim to equip you with the skills needed to leverage linear regression models effectively, enhancing your analytical capabilities across different data-driven domains.

Click through for details, with examples in Python.

Comments closed

Time Series Anomaly Detection in Microsoft Fabric

Adi Eldar talks anomaly detection:

Anomaly Detector, one of Azure AI services, enables you to monitor and detect anomalies in your time series data. This service is based on advanced algorithms, SR-CNN for univariate analysis and MTAD-GAT for multivariate analysis. This service is being retired by October 2026, and as part of the migration process

  • The algorithms were open sourced and published by the new time-series-anomaly-detector · PyPI package.
  • We offer a time series anomaly detection workflow in Microsoft Fabric data platform.

Read on to see what replacements exist and how you can use the time-series-anomaly-detector package in Microsoft Fabric.

Comments closed