
Category: Python

Classification with Random Forest

I have a new video:

In this video, I cover a powerful ensemble method for classification: random forests. We get an idea of how this differs from CART, learn the best possible metaphor for random forests, and dig into random search for hyperparameter optimization.
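
As a rough illustration of the random search idea mentioned above, here is a minimal sketch using only the standard library. The search space, the parameter names, and `toy_score` are all hypothetical stand-ins and are not taken from the video; in practice the scoring function would train a random forest and return something like cross-validated accuracy.

```python
import random

# Hypothetical search space. The names resemble common random forest
# settings, but they are assumptions made for this sketch.
SEARCH_SPACE = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 2, 5, 10],
}


def toy_score(params):
    """Stand-in for model evaluation; replace with real training and scoring."""
    # Treat None (unlimited depth) as a deep tree for scoring purposes.
    depth = 12 if params["max_depth"] is None else params["max_depth"]
    return (
        0.60
        + 0.0002 * params["n_estimators"]
        - 0.01 * abs(depth - 8)
        - 0.005 * params["min_samples_leaf"]
    )


def random_search(space, score_fn, n_iter, seed=0):
    """Sample n_iter random combinations and keep the best-scoring one."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently and at random.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    params, score = random_search(SEARCH_SPACE, toy_score, n_iter=20, seed=42)
    print(params, round(score, 4))
```

Unlike a grid search, this evaluates only `n_iter` combinations rather than every one, which is the usual reason to reach for it when the space is large.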

Click through to see the video in all its glory.

Comments closed

Classification Concepts and CART in Action

I have a new video series:

In this video, I explain some core concepts behind classification and introduce the first classification algorithm we will look at in CART.

CART, by the way, stands for Classification and Regression Trees, and is one of the easiest classification algorithms to understand as a concept: it’s a decision tree (aka, a series of if-else statements) where each terminal node is an outcome: either a class for classification or a value for regression.
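
To make the "series of if-else statements" description concrete, here is a small hand-written sketch. The features, thresholds, and outcomes are hypothetical; a real CART implementation learns the splits from training data rather than having them typed in.

```python
def classify_loan(income, has_defaulted):
    """Classification tree: every return is a terminal node holding a class."""
    if has_defaulted:
        return "deny"
    if income >= 50_000:
        return "approve"
    return "review"


def predict_price(square_meters, has_garden):
    """Regression tree: every return is a terminal node holding a value."""
    if square_meters < 60:
        return 150_000.0
    if has_garden:
        return 320_000.0
    return 260_000.0


if __name__ == "__main__":
    print(classify_loan(income=62_000, has_defaulted=False))
    print(predict_price(square_meters=85, has_garden=True))
```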


Visualizing a Spark Execution Plan

Gerhard Brueckl builds a very helpful tool:

I recently found myself in a situation where I had to optimize a Spark query. Coming from a SQL world originally, I knew how valuable a visual representation of an execution plan can be when it comes to performance tuning. Soon I realized that there is no easy-to-use tool or snippet which would allow me to do that. There are tools like DataFlint, the ubiquitous Spark monitoring UI, and the Spark explain() function, but they are either hard to use or hard to get up and running, especially as I was looking for something that works in both of my favorite Spark engines: Databricks and Microsoft Fabric.

Read on for Gerhard’s answer, including an example of it in action.


Exporting and Sharing Power BI Reports in Fabric

Sandeep Pawar distributes PDFs like candy:

With the proposed solution below, you will be able to:

  • Export a Power BI report, or a page of a report or a specific visual from any page as a PDF, PNG, PPTX or other supported file formats
  • Apply report level filters before exporting
  • Automate the extracts on a schedule
  • Save the exported reports to specific folders
  • Grant access to individual folders in the Lakehouse

Click through for the solution.


Documenting Table Columns with the Python SDK for Purview

Danaraj Ram Kumar breaks out the Python IDE:

There are several approaches to work with Microsoft Purview entities programmatically, especially when needing to perform bulk operations such as documenting a large number of tables and columns dynamically. 

This article shows how to use the Python SDK for Purview to programmatically document Purview table columns in bulk, assuming there are many tables and columns that need to be automatically documented based on a reference table: in this example, the data dictionary maintained in Excel.

Alternatively, the Purview REST APIs can be called directly, whereas the Python SDK for Purview is a wrapper that makes it easier to programmatically interact with the Purview Atlas REST APIs in the backend.

Click through for sample code and explanations.


Analyzing Aircraft Routes

Mark Litwintschik performs a deep dive into aircraft telemetry:

In November, I wrote a post on analysing aircraft position telemetry with adsb.lol. At the time, I didn’t have a clear way to turn a series of potentially thousands of position points for any one aircraft into a list of flight path trajectories and airport stop-offs.

Since then, I’ve been downloading adsb.lol’s daily feed and in this post, I’ll examine the flight routes taken by AirBaltic’s YL-AAX Airbus A220-300 aircraft throughout February. I flew on this aircraft on February 24th between Dubai and Riga. I used my memory and notes of the five-hour take-off delay to help validate the enriched dataset in this post.

Click through for Mark’s analysis using DuckDB and Python.


An Example of an MD5 Hash Collision

John Cook shares an example of a hash collision:

Marc Stevens gave an example of two alphanumeric strings that differ in only one byte that have the same MD5 hash value. It may seem like beating a dead horse to demonstrate weaknesses in MD5, but it’s instructive to study the flaws of broken methods. And despite the fact that MD5 has been broken for years, lawyers still use it.
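
If you want to check a claimed collision yourself, a minimal sketch with Python's standard `hashlib` module might look like the following. The helper names are hypothetical, and the colliding strings themselves are not reproduced here; the inputs in the usage section are ordinary non-colliding examples.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the input as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def is_collision(a: bytes, b: bytes) -> bool:
    """True only when the inputs differ but their MD5 digests match."""
    return a != b and md5_hex(a) == md5_hex(b)


def differing_positions(a: bytes, b: bytes) -> list:
    """Indexes at which two equal-length inputs hold different bytes."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    # Placeholder inputs: swap in the two strings from the linked post.
    first, second = b"hello", b"hellp"
    print(differing_positions(first, second))
    print(is_collision(first, second))
```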

Click through for the example.


Running SemPy from Microsoft Fabric Notebooks

Gilbert Quevauvilliers sets up an environment:

I hit an error when trying to run a notebook via a data pipeline, and the run failed. Below are the steps to get this working.

This was the error message I got:

Notebook execution failed at Notebook service with http status code – ‘200’, please check the Run logs on Notebook, additional details – ‘Error name – MagicUsageError, Error value – %pip magic command is disabled.’ :

Read on to see how you can fix this error and get SemPy running.


A Primer on Vector Similarity Search

Pavan Belagatti talks vectors:

In the realm of generative AI, vectors play a crucial role as a means of representing and manipulating complex data. Within this context, vectors are often high-dimensional arrays of numbers that encode significant amounts of information. For instance, in the case of image generation, each image can be converted into a vector representing its pixel values or more abstract features extracted through deep learning models.

These vectors become the language through which AI algorithms understand and generate new content. By navigating and modifying these vectors in a multidimensional space, generative AI produces new, synthetic instances of data — whether images, sounds or text — that mimic the characteristics of the original dataset. This vector manipulation is at the heart of AI’s ability to learn from data and generate realistic outputs based on that learning.
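
As a small, hedged illustration of what searching by vector similarity can look like, the sketch below ranks a few made-up vectors by cosine similarity to a query. Cosine similarity is one common choice of metric and is an assumption here, as are the toy three-dimensional "embeddings"; real systems use far higher dimensions and approximate indexes rather than a full sort.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def nearest(query, items, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        items.items(),
        key=lambda pair: cosine_similarity(query, pair[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up vectors for illustration only.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "truck": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print(nearest([1.0, 0.0, 0.0], EMBEDDINGS, k=2))
```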

Read on for a high-level overview of the topic.
