Press "Enter" to skip to content

Category: Python

Tweedie Distributions and Generalized Linear Modeling

Christian Lorentzen talks about Tweedie distributions:

Tweedie distributions and Generalised Linear Models (GLM) have an intertwined relationship. While GLMs are, in my view, one of the best reference models for estimating expectations, Tweedie distributions lie at the heart of expectation estimation. In fact, basically all applied GLMs in practice use Tweedie distributions with three notable exceptions: the binomial, the multinomial and the negative binomial distribution.

Read on for a bit more about its history and how it ties in with several other distributions.

Comments closed

Getting the Top N Results in a PySpark Notebook

Gilbert Quevauvilliers only needs the top 1:

How to get the TopN rows using Python in Fabric Notebooks

When working with data there are sometimes weird and wonderful requirements which must be created in order to get to the desired solution.

In today’s blog post I had a situation where I wanted to get a single row with the highest duration.

Gilbert uses the Spark SQL version, specifically the Python function variant. You could also use Spark SQL and write a query using the LIMIT operator.

Comments closed

Dynamic Historical Partition Refresh in Power BI

Marc Lelijveld digs into partition refreshing:

I’ve heard the question pretty often from customers: “You told me to use incremental refresh, but how can I regularly run a full load or refresh onder partitions?” Well, there are perfect ways to do this using Tabular Editor or SQL Server Management Studio. But this often includes manual work to trigger the processing.

Today, this question was asked again to me. I thought, there should be a smarter way to do this. Since I recently explored more in the wonderful world of Fabric Notebooks and Python, decided to dive a bit deeper in this world and see if it is possible to script something like this using Semantic Link. And obviously, the answer is “Yes!”

Read on to learn how to do it with a bit of Python and Microsoft Fabric’s Semantic Link library (sempy).

Comments closed

Looping through Data in Microsoft Fabric PySpark Notebooks

Gilbert Quevauvilliers builds a loop:

Continuing with my existing blog series on what I’m learning with notebooks and PySpark.

Today, I’m going to explain to you how I found a way to loop through data in a notebook.

In this example, I’m going to show you how I loop through a range of dates, which can then be used in a subsequent query to extract data by passing through each date into a DAX query.

Click through for Gilbert’s example. Here’s an alternative using something called a list comprehension. First, build a function that does what you want to do—that’d be the innards of Gilbert’s Python code, lines 31-54.

def perform_dax_query(row):
    var_Date = row["Date"]
    ...
    display(df_DAX_QueryResult)

Then, call that function for each row:

[perform_dax_query(row) for row in data_collect]

In this particular scenario, I’d personally stick with Gilbert’s composition, but in cases where you’re transforming a list of elements into a new list—for example, if you’re performing some data cleanup for each row in a list and you want the output to be a new list with cleaned-up data—then the list comprehension works really well.

Comments closed

Accuracy is Not Enough for Classification

I have a new video:

In this video, I explain why accuracy is not the be-all, end-all measure for classification. After that, I introduce the confusion matrix, a mechanism for tracking predicted versus actual values. Then, I talk about a variety of measures and how we can derive them from the confusion matrix.

The trickiest part of the confusion matrix measures is just remembering which measures comport to which combinations in the matrix. The second-trickiest part of the confusion matrix is that R and Python invert them, so reading across the top row in R is equivalent to reading down the first column in Python.

Comments closed

Data Quality Issues in Python-Based Time Series Analysis

Hadi Fadlallah checks out the data:

Time-series data analysis is one of the most important analysis methods because it provides great insights into how situations change with time, which helps in understanding trends and making the right choices. However, there is a high dependence on its quality.

Data quality mistakes in time series data sets have implications that extend over a large area, such as the accuracy and trustworthiness of analyses, as well as their interpretation. For instance, mistakes can be caused by modes of data collection, storage, and processing. Specialists working on these data sets must acknowledge these data quality obstacles.

Read on for several examples of data quality issues you might run into in a time series dataset, as well as their fixes.

Comments closed

Adding the Current Date and Time to a PySpark Data Frame

Gilbert Quevauvilliers wants to know what time it is:

How to add current DateTime to existing PySpark data frame in a Fabric Notebook

In the blog post below, I am going to describe how to add the current Date Time to your existing Spark data frame.

This is really useful when I am inserting data into a Fabric Lakehouse table, and I want to know when the data got inserted.

Read on for the answer.

Comments closed

Classification with Random Forest

I have a new video:

In this video, I cover a powerful ensemble method for classification: random forests. We get an idea of how this differs from CART, learn the best possible metaphor for random forests, and dig into random search for hyperparameter optimization.

Click through to see the video in all its glory.

Comments closed