Press "Enter" to skip to content

Category: Python

An Overview of the Naive Bayes Class of Algorithms

Harris Amjad takes us through a rather useful class of algorithms for classification:

As AI and Machine Learning have increased in popularity, especially Large Language Models, more professionals have explored how these systems work. Unfortunately, some put the cart before the horse, taking on more complex algorithms before laying the foundation, which results in fading interest in the topic. This tip will introduce a simple probabilistic, yet powerful classifier, the Naïve Bayes Model, and implement it in Python.

I like using the Naive Bayes variants, despite the fact that the approach is not Bayesian and arguably isn’t very naive. The reason I like this class of algorithm is that it’s fast, easy, and gives you a useful baseline for quality. If you need to meet some specific quality threshold (say, accuracy above 85% or an F1 score above 0.8), you can get an answer quickly with Naive Bayes. If that answer is anywhere near your threshold, the problem is likely solvable. If your answer is way below the threshold, it’s probably not worth spending the time or compute effort trying out a variety of other algorithms.
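
To make that concrete, here is a minimal baseline sketch using scikit-learn’s multinomial Naive Bayes on a two-category slice of the 20 newsgroups dataset. The dataset and feature choices are purely illustrative, not anything from Harris’s tip:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Two-category text classification as a quick baseline check.
    categories = ["rec.autos", "sci.space"]
    train = fetch_20newsgroups(subset="train", categories=categories)
    test = fetch_20newsgroups(subset="test", categories=categories)

    # Word counts feed the multinomial model's per-class likelihoods.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train.data, train.target)

    print(accuracy_score(test.target, model.predict(test.data)))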

Comments closed

Generating Effect Plots in Python and R

Michael Mayer builds some effect plots:

The plots show different types of feature effects relevant in modeling:

  • Average observed: Descriptive effect (also interesting without model).
  • Average predicted: Combined effect of all features. Also called “M Plot” (Apley 2020).
  • Partial dependence: Effect of one feature, keeping other feature values constant (Friedman 2001).
  • Number of observations or sum of case weights: Feature value distribution.
  • R only: Accumulated local effects, an alternative to partial dependence (Apley 2020).

Click through to see how they both work.
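
As a taste of one of those effect types, here is a minimal partial dependence sketch using scikit-learn. Michael’s post uses its own tooling in R and Python, so treat this as an illustration of the concept rather than his code:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.inspection import PartialDependenceDisplay

    X, y = load_diabetes(return_X_y=True, as_frame=True)
    model = GradientBoostingRegressor().fit(X, y)

    # Vary one feature while keeping the others at their observed values
    # (Friedman 2001), then average the model's predictions.
    PartialDependenceDisplay.from_estimator(model, X, features=["bmi"])
    plt.show()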

Comments closed

Obtaining VisualIDs for Visuals in a Power BI Report

Sandeep Pawar checks for ID:

Log Analytics and Workspace Monitoring in Fabric log all the activities of datasets in a workspace. These logs contain dataset, report, and visual IDs, which the user has to decipher to get the full picture. Dataset and report IDs are straightforward, but it’s not easy to get visual IDs programmatically. Chris Webb already has a blog on a couple of different ways to get the visual IDs. That blog was published in 2022, and in the Fabric world we now have a couple more options.

Read on for two additional methods you can use.
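
One long-standing approach (not necessarily one of the two in Sandeep’s post) is cracking open the .pbix file itself, which is a zip archive whose Report/Layout entry holds the report definition. The structure below is an observed convention rather than a documented API, and newer PBIR-format reports are laid out differently:

    import json
    import zipfile

    # A .pbix file is a zip archive; Report/Layout is UTF-16 LE encoded JSON.
    with zipfile.ZipFile("MyReport.pbix") as pbix:  # placeholder file name
        layout = json.loads(pbix.read("Report/Layout").decode("utf-16-le"))

    for section in layout.get("sections", []):
        for container in section.get("visualContainers", []):
            # Each container's config is itself a JSON string; its "name"
            # field is the visual ID that shows up in the logs.
            config = json.loads(container["config"])
            print(section.get("displayName"), config.get("name"))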

Comments closed

Creating a Parquet File in Python

Ed Pollack has part one of a two-parter:

This article dives into the Apache Parquet file format, how it works, and how it can be used to export and import data directly to SQL Server, even when a data platform that supports Parquet files natively is unavailable to assist.

In the second part of this article, customizations and more advanced options will be highlighted, showing the flexibility of Python as a tool to solve analytic data movement challenges.

I like how Ed covers the Parquet file format, as it’s not all that complicated but it does give you an idea of how so many operations on Parquet data can be so fast.
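
If you want to try the format before reading Ed’s article, here is a minimal sketch of a Parquet round trip with pyarrow (one common library for the job; the article walks through its own setup):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "id": [1, 2, 3],
        "name": ["alpha", "beta", "gamma"],
    })

    # Parquet stores data column by column, with schema and statistics in
    # the file's metadata, which is what makes so many reads so fast.
    pq.write_table(table, "sample.parquet")
    round_trip = pq.read_table("sample.parquet")
    print(round_trip.schema)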

Comments closed

E-Mailing Query Results in Snowflake

Kevin Wilkie gussies up an e-mail:

In our last post, we discussed the most basic way to send out an email in Snowflake. It was pretty simple: straight text, nothing to really grab the attention of our readers – which we know is the way to craft an email, right?

To do this, we’re going to have some fun in Python. Yes, delve deep into your bag of Python tricks as we get up to some shenanigans with Snowflake and Python.

Read on for a procedure to e-mail the prior result set in HTML format.
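
For a rough idea of the shape of such a procedure, here is a minimal Snowpark sketch that renders a query’s results as an HTML table and mails it via SYSTEM$SEND_EMAIL. The notification integration name is a placeholder, and this is not Kevin’s procedure:

    from snowflake.snowpark import Session

    def send_results_as_html(session: Session, query: str, recipient: str) -> str:
        # Render the result set as a basic HTML table via pandas.
        html_body = session.sql(query).to_pandas().to_html(index=False)
        session.call(
            "SYSTEM$SEND_EMAIL",
            "my_email_int",   # placeholder notification integration
            recipient,
            "Query results",
            html_body,
            "text/html",      # the mime type switches the body from plain text to HTML
        )
        return "sent"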

Comments closed

Debugging in Databricks

Chen Hirsh enables a debugger:

Do you know that feeling, when you write beautiful code and everything just works perfectly on the first try?

I don’t.

Every time I write code, it doesn’t work at first, and I have to debug it, make changes, test it…

Databricks introduced a debugger you can use on a code cell, and I’ve wanted to try it for quite some time now. Well, I guess the time is now.

I’m having trouble finding the utility of a debugger here. Notebooks are already set up for debugging: you can easily add or remove cells, and the underlying session maintains state between cells.
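
And if you do want a proper step-through debugger in a notebook, Python’s built-in one already works inside a cell without any new tooling. A trivial illustration:

    def running_total(values):
        total = 0
        for v in values:
            breakpoint()  # drops into pdb here; inspect total and v, then 'c' to continue
            total += v
        return total

    running_total([1, 2, 3])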

Comments closed

Querying a Fabric KQL Database via REST API

Sandeep Pawar grabs some data:

I have previously explained how to query a KQL database in a notebook using the Kusto Spark connector, the Kusto Python SDK, and KQLMagic. Now, let’s explore another method using the REST API. Although this is covered in the ADX documentation, it isn’t covered with an example for Fabric, so I wanted to write a quick blog to show how you can query a table from an Eventhouse using a REST API.

Click through to see how you can do it. Sandeep’s code is in Python, but because this is just hitting a REST API rather than using a library, you could also use a tool like Postman.
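
For reference, the bare-bones version of such a call with requests might look like the sketch below. The cluster URI, database, query, and token acquisition are all placeholders; the /v1/rest/query endpoint and response shape come from the ADX REST API:

    import requests

    cluster_uri = "https://<your-eventhouse>.kusto.fabric.microsoft.com"  # placeholder
    token = "<aad-access-token>"  # e.g. obtained via azure-identity

    response = requests.post(
        f"{cluster_uri}/v1/rest/query",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        json={"db": "MyDatabase", "csl": "MyTable | take 10"},  # the KQL goes in "csl"
    )
    response.raise_for_status()

    # The v1 API wraps results in a "Tables" array of columns and rows.
    print(response.json()["Tables"][0]["Rows"][:3])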

Comments closed

Analyzing Delta Table Measures in Microsoft Fabric

Sandeep Pawar has a script for us:

I have been sitting on this code for a long time. I shared the first version in one of my blogs on Direct Lake last year. I have been making updates to it since then as needed. I waited for the lakehouse schema to become available and then forgot to blog about it. Yesterday, someone reached out asking if the above could be used for warehouse delta tables in Fabric, so here you go. It’s 250+ lines, so let me just explain what’s going on here:

Read on for the explanation, the script itself, a demonstration, and several additional notes.
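
Sandeep’s script goes well beyond this, but if you just want to poke at a Delta table’s file-level metadata from plain Python, the deltalake package offers a starting point. The table path is a placeholder, and OneLake paths may need extra credential handling:

    from deltalake import DeltaTable

    dt = DeltaTable("/path/to/Tables/my_table")  # placeholder path

    print(dt.version())  # current table version
    print(dt.metadata().name, dt.metadata().partition_columns)

    # Per-file add actions carry size and row-count statistics.
    files = dt.get_add_actions(flatten=True).to_pandas()
    print(files[["path", "size_bytes", "num_records"]].head())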

Comments closed

Charting Microsoft Fabric Workspace Activity

Sandeep Pawar creates a chart:

Semantic Link Labs v0.8.3 has a list_activity_events method to get the list of all activities in your Fabric tenant. It uses the same Power BI Admin - Get Activity Events API, but this API now also includes Fabric activities. Note that this is an Admin API, so you need to be a Fabric administrator. Check the API details.

To answer the above question, I will use admin.list_activity_events, loop over the last 30 days, and plot the results by Fabric item type in my personal tenant:

Click through for that code. Though if you’re going to do something similar in your environment, I recommend not using a line chart for this visual, as line charts indicate a flow over time and this is definitely point-in-time categorical data. A bar chart or dot plot would be better in that case.
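
A rough sketch of that loop-and-plot pattern, with a bar chart instead of a line chart, might look like the following. It assumes list_activity_events accepts ISO start and end timestamps within a single day and returns a pandas DataFrame; the column name used for item type is a guess, so adjust it to the real output:

    import datetime as dt

    import pandas as pd
    import sempy_labs.admin as admin  # Semantic Link Labs, inside a Fabric notebook

    # The underlying Activity Events API only accepts ranges within one UTC
    # day, hence the day-by-day loop over the last 30 days.
    frames = []
    for day in range(30):
        end = dt.datetime.utcnow() - dt.timedelta(days=day)
        start = end - dt.timedelta(days=1)
        frames.append(admin.list_activity_events(
            start_time=start.isoformat(), end_time=end.isoformat()))

    activity = pd.concat(frames, ignore_index=True)

    # Horizontal bars suit categorical counts better than a line over categories.
    activity["Activity"].value_counts().plot(kind="barh")  # column name is a guess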

Comments closed

A Primer on SparkSQL and PySpark

Anurag K covers the basics of PySpark:

In the era of big data, efficient data processing is critical for insights-driven decision-making. PySpark SQL, a part of Apache Spark, enables data engineers and analysts to work with structured data at massive scale. Combining SQL’s simplicity with Spark’s processing power, it opens a gateway to handling vast datasets seamlessly. This comprehensive guide walks you through PySpark SQL, from foundational concepts to advanced querying techniques, with detailed code examples. Let’s dive in and master PySpark SQL for data-driven analytics.

Click through for examples covering a variety of operations you can perform.
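
The core pattern the guide builds on fits in a few lines: register a DataFrame as a temporary view, then query it with SQL. A minimal sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pyspark-sql-demo").getOrCreate()

    # Any DataFrame can be exposed to SQL as a temporary view.
    df = spark.createDataFrame([("alice", 34), ("bob", 41)], ["name", "age"])
    df.createOrReplaceTempView("people")

    spark.sql("SELECT name FROM people WHERE age > 35").show()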

Comments closed