Press "Enter" to skip to content

Month: October 2024

An Overview of LightGBM

Vinod Chugani continues a series on tree-based classification techniques:

LightGBM is a highly efficient gradient boosting framework. It has gained traction for its speed and performance, particularly with large and complex datasets. Developed by Microsoft, this powerful algorithm is known for its unique ability to handle large volumes of data with significant ease compared to traditional methods.

In this post, we will experiment with LightGBM framework on the Ames Housing dataset. In particular, we will shed some light on its versatile boosting strategies—Gradient Boosting Decision Tree (GBDT) and Gradient-based One-Side Sampling (GOSS). These strategies offer distinct advantages. Through this post, we will compare their performance and characteristics.

Read on to learn more about LightGBM as an algorithm, as well as how to use it.

Leave a Comment

Plotting the ROC Curve in Microsoft Fabric

Tomaz Kastrun gets plotting:

ROC (Receiver Operation Characteristics) – curve is a graph that shows how classifiers performs by plotting the true positive and false positive rates. It is used to evaluate the performance of binary classification models by illustrating the trade-off between True positive rate (TPR) and False positive rate (FPR) at various threshold settings.

Read on to see how you can generate one in a Microsoft Fabric notebook. Tomaz also plots a density function for additional fun.

Leave a Comment

Recovering Power BI Reports You Cannot Download

Kurt Buhler grabs a report:

Below are some reasons why you might not be able to download your Power BI report or model from a workspace:

  • The report was created in the service:
    • Someone created the report manually (using the User Interface) and connects to a model in another workspace.
    • Someone created the report programmatically (for instance, using the REST APIs).
    • Power BI created the report automatically (for instance, it copied the report to a workspace that belongs to a later stage in a deployment pipeline)
  • You used the REST APIs to re-bind a report (changed which model it connects to as a data source).
  • The model has incremental refresh enabled.
  • The model uses automatic aggregations.
  • The model was modified via an XMLA endpoint.
  • Other scenarios described in the limitations in the Microsoft documentation.

When you encounter this scenario, you see something like the following image, which shows the Download this file option greyed out from the File menu of the Power BI report.

Read on to see how you can nonetheless recover these published reports using the semantic-link-labs library.

Leave a Comment

Managing SQL Agent Job History

Joe Gavin prunes some history:

The SQL Server Agent is a very powerful job scheduling and alerting tool that’s tightly integrated with SQL Server. It’s quite possible you’re only using it for basic maintenance tasks like database backups, index maintenance, DBCC checks, etc., and the default retention is fine. However, you may also be using it for much more, i.e., executing multiple step jobs that execute multiple SSIS packages, and you need to retain a longer than the default job history.

Read on to learn more, including how one really useful-sounding checkbox doesn’t quite work the way you’d expect.

Leave a Comment

Mounting Azure Data Factory in Fabric Data Factory

Andy Leonard takes up a factory job:

Thanks to the hard work of the Microsoft Fabric Data Factory Team, it’s now possible to mount an Azure Data Factory in Fabric Data Factory. This post describes one way to mount an existing Azure Data Factory in Fabric Data Factory. In this post, we will:

  • Mount an existing Azure Data Factory in Fabric Data Factory
  • Open the Azure Data Factory in Fabric Data Factory
  • Test-execute two ADF pipelines
  • Modify and publish an ADF pipeline

Read on to see how it all works. One of the odd things about Microsoft Fabric—and its predecessor, Azure Synapse Analytics—is the penchant for similar-but-not-quite-the-same services. Yes, we have Data Factory…but it’s not quite the same. Yes, we have Azure Data Explorer (and KQL)…but it’s not quite the same. I get that there are reasons for this (such as not having a resource group with a dozen separate services hanging around), but I’m sure it’s a bit frustrating working on several separate code bases and trying to keep them all approximately in sync.

Leave a Comment

Memoizing Functions with Snowflake

Kevin Wilkie is speaking my language (that is, the language of functional programming):

If you’ve been working with data for several years like I have – mostly using the SQL language – then I have a term for you that other languages, like JavaScript or Python, have had for a few years. The term is “memoizable” and it means, in a nutshell, to remember. A memoizable function caches the results so that it can return the resultset in record time, given the same parameters.

Yeah, it’s a fancy term that basically states, “Instead of calculating the result each time, I’ll just create a lookup table of all possible inputs and what the output is.” It’s really helpful when you have a small number of possible inputs and generating a result takes a while.

Read on to learn more about how this works in Snowflake, including several limitations.

Leave a Comment

Trying out the Databricks For-Each Task

Chen Hirsh goes in a loop:

Databricks recently added a for-each task to their workflow capability. Workflows are Databricks jobs, like Data factory pipelines, or SQL server jobs, a pipeline that you can schedule, that include a number of tasks that together complete some business logic.

Theoretically, the long-awaited for-each task should make the run of multiple processes easier. for example, one of the things I often do is run a list of notebooks, each processing a different table, with no dependencies between them. At the moment I use parallel notebooks – https://docs.databricks.com/en/notebooks/notebook-workflows.html#run-multiple-notebooks-concurrently

As you will see later, this use case is not supported yet. But before that, let’s see what we can do with the for-each task.

Read on to see what it currently can do, and what it cannot.

Leave a Comment

Programmatic Power BI Report Modification via semantic-link-labs

Kurt Buhler makes a change:

Whether building reports in Power BI Desktop or in the web browser via the Power BI service, you have limited options to batch or streamline changes. Put another way; it’s tedious and slow to make many small changes to one or more Power BI reports. It’s also easy to make mistakes

When initially designing or building a report, this is not so much of a problem. Unless you’re using a template, you want to control report layout and formatting, yourself. However, certain changes can be little more than a waste of time. Some examples include:

  • Replacing fields when there’s a broken reference due to i.e. renaming a model measure or column.
  • Swapping one measure or column for another in the report
  • Changing visual container styles, like background, border, and shadow/glow.
  • Changing text or text styles across multiple visuals, pages, or reports.
  • Changing chart formatting (like color) or properties (like edit interactions) across multiple visuals, pages, or reports.

Read on to see how you can make some of these changes in Python code using the semantic-link-labs library.

Leave a Comment

Monitoring for Blocked Processes and Deadlocks with Extended Events

Lori Brown’s speaking my language:

Here is a way to set up an extended events session that can be used to collect blocked processes and deadlock records.  I use similar code to collect this type of data each day and have a job that pulls the data into tables which can be used to evaluate the resources that are being blocked and deadlocked on.

When checking for blocked process records, you must set the blocked process threshold in the configuration.  I have mine configured with the threshold set to 30 seconds.  This means that blocked process reports are generated every 30 seconds. 

Click through for that configuration setting, as well as the Extended Events session to do the work, and even code to pull the results into a table. It doesn’t get much simpler than that.

Leave a Comment

Reviewing the DATE_BUCKET() and DATETRUNC() Commands

Chad Callihan checks out a pair of functions new to SQL Server 2022:

If you haven’t experimented much with SQL Server 2022, you may not be familiar with the new DATE_BUCKET and DATETRUNC functions. Both of these functions are useful when it comes to aggregating data. Let’s take a look at each function and walk through a few examples.

The DATE_BUCKET() function takes a bit of time getting used to, especially when you don’t use clean intervals like 1 hour or 1 minute. DATETRUNC(), meanwhile, is something I’ve wanted for quite some time.

Leave a Comment