Author: Kevin Feasel

Feature Importance in XGBoost

Ivan Palomares Carrascosa takes a look at one of my favorite plots in XGBoost:

One of the most widespread machine learning techniques is XGBoost (Extreme Gradient Boosting). An XGBoost model — or an ensemble that combines multiple models into a single predictive task, to be more precise — builds several decision trees and sequentially combines them, so that the overall prediction is progressively improved by correcting the errors made by previous trees in the pipeline.

Just like standalone decision trees, XGBoost can accommodate both regression and classification tasks. While the combination of many trees into a single composite model may obscure its interpretability at first, there are still mechanisms to help you interpret an XGBoost model. In other words, you can understand why predictions are made and how input features contributed to them.

This article takes a practical dive into XGBoost model interpretability, with a particular focus on feature importance.

Read on to learn more about how feature importance works, as well as the three different views of the data you can get.
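
If you want to poke at those three views yourself, here is a minimal Python sketch (my own synthetic example, not code from the article) that pulls the weight, gain, and cover scores from a trained booster, which I presume are the three views in question, and draws the importance plot. It assumes scikit-learn, xgboost, and matplotlib are installed.

```python
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic classification data stands in for whatever dataset the article uses.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)

booster = model.get_booster()
for importance_type in ("weight", "gain", "cover"):
    # weight: how many splits use the feature
    # gain:   average loss reduction from those splits
    # cover:  average number of rows affected by those splits
    print(importance_type, booster.get_score(importance_type=importance_type))

# The plot itself; the same importance_type argument applies here.
xgb.plot_importance(booster, importance_type="gain")
plt.show()
```

The three rankings often disagree with one another, which is exactly why it pays to look at more than one of them.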

Registering Applications to Read Fabric Resources

Andy Leonard works in Microsoft Entra:

My older son, Stephen, and I have been vibe coding information-dense solutions for Fabric lately. The latest application is Fabric Navigator, which simplifies navigation between Fabric Data Factory pipelines and notebooks. While Fabric Navigator includes links to instructions about configuring Azure and Fabric security to allow read access to Fabric Data Factory pipelines and notebooks, I feel a walk-through of a minimally-viable security configuration is in order. Hence, this post.

Click through to see what you need to configure in Entra, as well as the settings you need to change in Microsoft Fabric, for this to work.

Equalizing Proxy vs Redirect Rates for OneLake Access

Elizabeth Oldag announces a pricing change:

We’re thrilled to share a major update and simplification to OneLake’s capacity utilization model that will make it even easier to manage Fabric capacity and scale your data workloads. We are reducing the consumption rate of OneLake transactions via proxy to match the rate for transactions via redirect. This means you no longer have to worry about where you are accessing your OneLake data from (via proxy or redirect); they will consume your capacity at the same low rate.

Read on to see what this means in practice.

Custom Fonts in Power BI

Ben Richardson looks at fonts:

Want your Power BI reports to look more polished and on-brand?

Fonts play a big role in how your reports are perceived – impacting clarity, trust, and style.

But Power BI doesn’t let you upload custom fonts directly. So, what can you do?

Click through for several options.

Installing SQL Server Instances via dbatools

David Seis digs into another dbatools cmdlet:

As DBAs we install SQL Server for various reasons regularly. If you could save time for each installation for more critical tasks, would you?

In this blog post, we will audit the dbatools command Install-DbaInstance. I will test, review, and evaluate the script based on a series of identical steps. Our goal is to provide insights, warnings, and recommendations to help you use this script effectively and safely. Install-DbaInstance is a powerful tool to automate the install and configuration of a new SQL Server instance. It works well in scenarios that require frequent deployments of SQL Server instances.

You can definitely automate installation of SQL Server without the cmdlet, but the dbatools team does a good job of laying out what’s possible that you might not necessarily get just from the config script that the SQL Server installer spits out (and uses when you next-next-next your way to success).

Recommendations around SUMMARIZECOLUMNS

Marco Russo and Alberto Ferrari share some thoughts:

SUMMARIZECOLUMNS is the most widely used function in Power BI queries. An important and unique feature of SUMMARIZECOLUMNS is that it determines automatically how to scan the model to produce its result. Indeed, when using SUMMARIZE, GROUPBY, ADDCOLUMNS, or any of the more basic querying functions, developers must declare the source table to perform the grouping, as well as the group-by columns and the measures to add to the result. On the other hand, SUMMARIZECOLUMNS requires only the group-by columns; there is no need to provide the source table, which is the primary ingredient of any query. SUMMARIZECOLUMNS figures out the structure of the result by itself, using a sophisticated algorithm that requires some understanding.

The pair do have a whitepaper available on their premium (paid) service, but even the free post contains a lot of detail you’ll want to check out if you use DAX.

Using JSON Arrays instead of JSON Objects for Serialization

Lukas Eder makes a recommendation:

Why, yes of course! jOOQ is in full control of your SQL statement and knows exactly what column (and data type) is at which position, because you helped jOOQ construct not only the query object model, but also the result structure. So, a much faster index access is possible, compared to the much slower column name access.

The same is true for ordinary result sets, by the way, where jOOQ always calls JDBC’s ResultSet.getString(int), for example, over ResultSet.getString(String). Not only is it faster, but also more reliable. Think about duplicate column names, e.g. when joining two tables that both contain an ID column. While JSON is not opinionated about duplicate object keys, not all JSON parsers support this, let alone Java Map types.

Read on for some insight into when you might want to choose either of the two approaches, and why Lukas went with JSON arrays instead of JSON objects for object serialization in jOOQ.
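
To see the difference in miniature, here is a generic Python illustration (not jOOQ’s actual serializer; the tables and column names are made up) of serializing the same joined rows as JSON objects versus JSON arrays:

```python
import json

# Hypothetical rows from joining BOOK to AUTHOR, where both tables have
# an ID column, so the flattened column list contains "id" twice.
columns = ["id", "title", "id", "name"]
rows = [
    (1, "1984", 7, "George Orwell"),
    (2, "Animal Farm", 7, "George Orwell"),
]

# Object form: every key is repeated per row, and dict() keeps only the
# last value for a duplicate key, so the book's id is silently lost.
as_objects = [dict(zip(columns, row)) for row in rows]

# Array form: all four positions survive, and the column-to-position
# mapping stays with the code that built the query.
as_arrays = [list(row) for row in rows]

print(json.dumps(as_objects))  # [{"id": 7, "title": "1984", "name": "George Orwell"}, ...]
print(json.dumps(as_arrays))   # [[1, "1984", 7, "George Orwell"], ...]
```

The array form is also a bit smaller on the wire, since the keys are not repeated for every row.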

Using Python Code in SSIS

Tim Mitchell shoe-horns a language in:

SQL Server Integration Services (SSIS) is a mature, proven tool for ETL orchestration and data movement. In recent years, Python has exploded in popularity as a data movement and analysis tool. Surprisingly, though, there are no native hooks for Python in SSIS. In my experience using each of these tools independently, I’d love to see an extension of SSIS to naturally host Python integrations.

Fortunately, with a bit of creativity, it is possible to invoke Python logic in SSIS packages. In this post, I’ll walk you through the tasks to merge Python and SSIS together. If you want to follow along on your own, you can clone the repo I created for this project.

Honestly, it’s not that surprising. The last time there was significant development on Integration Services was roughly 2012 (unless you include the well-intentioned but barely-functional Hadoop support they added around 2016). At that point, in the Windows world, Python was not at all a dominant programming language.
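
For what it’s worth, the pattern I would expect here (an assumption on my part, not a summary of Tim’s repo) is an Execute Process Task that shells out to python.exe, passing file paths as arguments and using the exit code to fail the package on error. A hypothetical script of that shape:

```python
# transform_stage.py -- a hypothetical script an SSIS Execute Process Task
# could call, e.g.:  python.exe transform_stage.py C:\stage\in.csv C:\stage\out.csv
import csv
import sys


def main(in_path: str, out_path: str) -> int:
    try:
        with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                # Placeholder transformation: trim whitespace from every field.
                writer.writerow({key: value.strip() for key, value in row.items()})
    except OSError as err:
        # A non-zero exit code is what lets the Execute Process Task fail the package.
        print(f"transform failed: {err}", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```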

A Primer on TMDL Security Risks in Power BI

John Kerski gives us the low-down:

The Tabular Model Definition Language (TMDL) provides a simpler way of defining Power BI Semantic Models. Unlike the JSON-based Tabular Model Scripting Language (TMSL), TMDL uses a more accessible tab-based format for specifying DAX measures, relationships, and Power Query code.

Click through for the various ways things could go wrong, as well as how to mitigate those risks.

Mind you, “security risks” is a very broad concept and is not an indictment of the product, but rather something to keep in mind as you attempt to write secure code. For example, did you know that bad guys could potentially access all of your data in your database by using a series of SELECT statements?

Thoughts on Index Rebuilds in PostgreSQL

Laurenz Albe shares some advice:

People often ask "How can I automatically rebuild my indexes regularly?" or "When should I rebuild my indexes in PostgreSQL?". That always gives me the feeling that they want to solve a problem that isn't there. But the REINDEX statement is certainly there for a reason, and sometimes it is perfectly reasonable to rebuild an index. In this article, I'll explain when it makes sense to rebuild an index and how you can get the relevant data to make that decision.

Read on to learn more.
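
If you want a concrete way to gather that relevant data, one common source (my assumption about tooling here, not necessarily the method Laurenz describes) is the pgstattuple extension. A small psycopg2 sketch, with a made-up connection string and index names:

```python
import psycopg2

DSN = "dbname=mydb user=postgres"                      # hypothetical connection string
INDEXES = ["orders_pkey", "orders_customer_id_idx"]    # hypothetical index names

# Requires: CREATE EXTENSION pgstattuple;
with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    for index_name in INDEXES:
        cur.execute(
            "SELECT avg_leaf_density, leaf_fragmentation FROM pgstatindex(%s::regclass)",
            (index_name,),
        )
        density, fragmentation = cur.fetchone()
        # Low leaf density suggests bloat that REINDEX would reclaim; what
        # counts as "low" is the judgment call the article helps you make.
        print(f"{index_name}: leaf density {density}%, fragmentation {fragmentation}%")
```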
