Press "Enter" to skip to content


Data Quality Issues in Python-Based Time Series Analysis

Hadi Fadlallah checks out the data:

Time-series data analysis is one of the most important analysis methods because it provides great insight into how situations change over time, which helps in understanding trends and making sound decisions. However, the value of that analysis depends heavily on the quality of the underlying data.

Data quality mistakes in time series datasets have far-reaching implications, affecting the accuracy and trustworthiness of analyses as well as their interpretation. Such mistakes can be introduced by the way data is collected, stored, and processed. Specialists working with these datasets must be aware of these data quality obstacles.

Read on for several examples of data quality issues you might run into in a time series dataset, as well as their fixes.
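
Not from Hadi's article, but as a rough illustration of the kinds of checks involved, here is a minimal pandas sketch that looks for duplicate timestamps, gaps, and crude outliers (the column names and hourly frequency are assumptions):

```python
import pandas as pd

# Hypothetical hourly sensor readings; column names are illustrative.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 01:00",
        "2024-01-01 01:00", "2024-01-01 03:00",
    ]),
    "value": [10.2, 10.4, 10.4, 98.7],
})

# Duplicate timestamps: the same moment recorded more than once.
duplicates = df[df["timestamp"].duplicated(keep=False)]

# Gaps: compare against the expected hourly range.
expected = pd.date_range(df["timestamp"].min(), df["timestamp"].max(), freq="h")
gaps = expected.difference(df["timestamp"])

# Crude outlier check: values far from the mean in standard-deviation terms.
z = (df["value"] - df["value"].mean()) / df["value"].std()
outliers = df[z.abs() > 3]

print(duplicates, gaps, outliers, sep="\n")
```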


Adding the Current Date and Time to a PySpark Data Frame

Gilbert Quevauvilliers wants to know what time it is:

How to add the current DateTime to an existing PySpark data frame in a Fabric Notebook

In the blog post below, I am going to describe how to add the current date and time to your existing Spark data frame.

This is really useful when I am inserting data into a Fabric Lakehouse table, and I want to know when the data got inserted.

Read on for the answer.
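
Gilbert covers the Fabric-specific details; as a minimal sketch of the core idea, Spark's built-in current_timestamp() function does the work (the column name here is an assumption):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Add the insertion time as a new column; every row in the batch gets the
# same timestamp, evaluated when the plan runs.
df_with_ts = df.withColumn("inserted_at", current_timestamp())
df_with_ts.show(truncate=False)
```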


Classification with Random Forest

I have a new video:

In this video, I cover a powerful ensemble method for classification: random forests. We get an idea of how this differs from CART, learn the best possible metaphor for random forests, and dig into random search for hyperparameter optimization.

Click through to see the video in all its glory.
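
This is not the code from the video, but as a sketch of the technique, here is a random forest tuned via random search in scikit-learn (the dataset and parameter ranges are arbitrary choices):

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Random search samples hyperparameter combinations rather than trying
# an exhaustive grid.
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))
```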


Classification Concepts and CART in Action

I have a new video series:

In this video, I explain some core concepts behind classification and introduce the first classification algorithm we will look at: CART.

CART, by the way, stands for Classification and Regression Trees, and is one of the easiest classification algorithms to understand as a concept: it’s a decision tree (aka, a series of if-else statements) where each terminal node is an outcome: either a class for classification or a value for regression.
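
To make that if-else structure concrete, here is a minimal scikit-learn sketch (again, not from the video; the dataset and depth are arbitrary) that fits a CART-style tree and prints its rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A shallow tree keeps the if-else structure easy to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)

# export_text prints the tree as the nested if-else rules described above.
print(export_text(tree, feature_names=load_iris().feature_names))
```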


Visualizing a Spark Execution Plan

Gerhard Brueckl builds a very helpful tool:

I recently found myself in a situation where I had to optimize a Spark query. Coming originally from the SQL world, I knew how valuable a visual representation of an execution plan can be when it comes to performance tuning. I soon realized that there is no easy-to-use tool or snippet that would let me do that. There are tools like DataFlint, the ubiquitous Spark monitoring UI, and the Spark explain() function, but they are either hard to use or hard to get up and running, especially as I was looking for something that works in both of my two favorite Spark engines: Databricks and Microsoft Fabric.

Read on for Gerhard’s answer, including an example of it in action.
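
For context, the text-based baseline Gerhard mentions is the built-in explain() function; a quick sketch of what that looks like (the query is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "n")
agg = df.groupBy((df.n % 10).alias("bucket")).count()

# "formatted" prints the physical plan as an indented tree of operators;
# this text output is what a visual representation improves upon.
agg.explain(mode="formatted")
```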


Exporting and Sharing Power BI Reports in Fabric

Sandeep Pawar distributes PDFs like candy:

With the proposed solution below, you will be able to:

  • Export a Power BI report, or a page of a report or a specific visual from any page as a PDF, PNG, PPTX or other supported file formats
  • Apply report level filters before exporting
  • Automate the extracts on a schedule
  • Save the exported reports to specific folders
  • Grant access to individual folders in the Lakehouse

Click through for the solution.
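
Sandeep's implementation runs in a Fabric notebook; purely as a hedged sketch of the Power BI ExportTo REST API such exports build on (endpoint shape per Microsoft's public documentation; the token and IDs are placeholders, and Sandeep's actual code may differ):

```python
import time
import requests

# Placeholders: supply a real AAD token, workspace (group) ID, and report ID.
TOKEN = "<access-token>"
GROUP = "<workspace-id>"
REPORT = "<report-id>"
BASE = f"https://api.powerbi.com/v1.0/myorg/groups/{GROUP}/reports/{REPORT}"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Kick off an asynchronous export job; PDF here, but PNG/PPTX also work.
job = requests.post(f"{BASE}/ExportTo", headers=HEADERS,
                    json={"format": "PDF"}).json()

# Poll until the export finishes, then download the file.
while True:
    status = requests.get(f"{BASE}/exports/{job['id']}", headers=HEADERS).json()
    if status["status"] in ("Succeeded", "Failed"):
        break
    time.sleep(5)

if status["status"] == "Succeeded":
    pdf = requests.get(f"{BASE}/exports/{job['id']}/file", headers=HEADERS)
    with open("report.pdf", "wb") as f:
        f.write(pdf.content)
```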


Documenting Table Columns with the Python SDK for Purview

Danaraj Ram Kumar breaks out the Python IDE:

There are several approaches to work with Microsoft Purview entities programmatically, especially when needing to perform bulk operations such as documenting a large number of tables and columns dynamically. 

This article shows how to use the Python SDK for Purview to programmatically document Purview table columns in bulk – assuming there are many tables and columns that need to be documented automatically based on a reference table – in this example, a data dictionary maintained in Excel.

Alternatively, you can work with the Purview REST APIs directly, whereas the Python SDK for Purview is a wrapper that makes it easier to programmatically interact with the underlying Purview Atlas REST APIs.

Click through for sample code and explanations.
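
As a hedged illustration of the general pattern – Danaraj's article may use a different SDK, and every credential and qualified name below is a placeholder – here is a sketch using pyapacheatlas, a common Python wrapper over the Purview Atlas APIs:

```python
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient

# Placeholder credentials; the article covers the real setup.
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
client = PurviewClient(account_name="<purview-account>", authentication=auth)

# Fetch a table entity (and its referred column entities) by qualified name.
# The type name and qualified name below are illustrative.
result = client.get_entity(
    qualifiedName="mssql://server/db/schema/MyTable",
    typeName="azure_sql_table",
)

# Stamp a description on each column, as if read from the Excel dictionary.
columns = list(result["referredEntities"].values())
for col in columns:
    col["attributes"]["userDescription"] = "Documented from the data dictionary"

client.upload_entities(batch=columns)
```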


Analyzing Aircraft Routes

Mark Litwintschik performs a deep dive into aircraft telemetry:

In November, I wrote a post on analysing aircraft position telemetry with adsb.lol. At the time, I didn’t have a clear way to turn a series of potentially thousands of position points for any one aircraft into a list of flight path trajectories and airport stop-offs.

Since then, I’ve been downloading adsb.lol’s daily feed and in this post, I’ll examine the flight routes taken by AirBaltic’s YL-AAX Airbus A220-300 aircraft throughout February. I flew on this aircraft on February 24th between Dubai and Riga. I used my memory and notes of the five-hour take-off delay to help validate the enriched dataset in this post.

Click through for Mark’s analysis using DuckDB and Python.
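
Not Mark's actual queries, but to give a flavor of the approach, a DuckDB sketch like this summarizes one aircraft's position reports by hour (the file layout, column names, and ICAO hex code are all assumptions):

```python
import duckdb

# Hypothetical layout: one Parquet file of adsb.lol position reports per day,
# with hex (ICAO transponder code), lat, lon, and ts columns.
rows = duckdb.sql("""
    SELECT date_trunc('hour', ts) AS hour,
           count(*) AS positions,
           min(lat) AS min_lat, max(lat) AS max_lat,
           min(lon) AS min_lon, max(lon) AS max_lon
    FROM read_parquet('positions/2024-02-*.parquet')
    WHERE hex = '<icao-hex>'  -- placeholder for YL-AAX's transponder code
    GROUP BY 1
    ORDER BY 1
""").fetchall()

for row in rows:
    print(row)
```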


An Example of an MD5 Hash Collision

John Cook shares an example of a hash collision:

Marc Stevens gave an example of two alphanumeric strings that differ in only one byte that have the same MD5 hash value. It may seem like beating a dead horse to demonstrate weaknesses in MD5, but it’s instructive to study the flaws of broken methods. And despite the fact that MD5 has been broken for years, lawyers still use it.

Click through for the example.
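
The colliding strings themselves are in the linked post; verifying such a pair takes only a few lines of Python's hashlib. The inputs below are placeholders, not the real collision:

```python
import hashlib

# Placeholders: substitute the two strings from Marc Stevens' example, which
# differ in a single byte yet produce the same MD5 digest.
s1 = b"<first string>"
s2 = b"<second string>"

print(hashlib.md5(s1).hexdigest())
print(hashlib.md5(s2).hexdigest())
print(hashlib.md5(s1).hexdigest() == hashlib.md5(s2).hexdigest())
```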
