Category: Python

Combining Flink SQL, Streamlit, and Kafka

Lucia Cerchie has a pair of posts. First up, Lucia sets the stage:

In part 1 of this series, we’ll make an app, hosted on Streamlit, that allows a user to select a stock, in this case SPY, or the SPDR S&P 500 ETF Trust. Upon selection, a live chart of the stock’s bid prices, calculated every five seconds, will appear.

What are the pieces that go into making this work? The source of the data is the Alpaca Market Data API. We’ll hook up a Kafka producer to the websocket stream and send data to a Kafka topic in Confluent Cloud. Then we’ll use Flink SQL within Confluent Cloud’s Flink SQL workspace to tumble an average bid price every five seconds. Finally, we’ll use a Kafka consumer to receive that data and populate it to a Streamlit component in real time. This frontend component will be deployed on Streamlit as well.
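
To make the first hop concrete, here is a minimal sketch of a websocket-to-Kafka producer using the websockets and confluent-kafka packages. This is my own illustration, not Lucia’s actual code; the topic name, credentials, and message shapes are placeholders:

import asyncio
import json

import websockets
from confluent_kafka import Producer

# Placeholder connection details; real values come from Alpaca and Confluent Cloud
ALPACA_WS_URL = "wss://stream.data.alpaca.markets/v2/iex"
producer = Producer({"bootstrap.servers": "<confluent-bootstrap>:9092"})

async def stream_quotes():
    async with websockets.connect(ALPACA_WS_URL) as ws:
        # Authenticate and subscribe to SPY quotes (message shapes follow Alpaca's docs)
        await ws.send(json.dumps({"action": "auth", "key": "<key>", "secret": "<secret>"}))
        await ws.send(json.dumps({"action": "subscribe", "quotes": ["SPY"]}))
        async for message in ws:
            # Forward each raw quote to the Kafka topic that Flink SQL reads from
            producer.produce("SPY_quotes", value=message)
            producer.poll(0)  # serve delivery callbacks without blocking

asyncio.run(stream_quotes())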

Part 2 then closes the trap:

In part one of this series, we walked through how to use Streamlit, Apache Kafka®, and Apache Flink® to create a live data-driven user interface for a market data application to select a stock (e.g., SPY), and we discussed the structure of the app at a high level. First, data with information on stock bid prices arrives via an Alpaca websocket; then it’s produced to a Kafka topic in Confluent Cloud, where it is also processed with Flink SQL.

Now comes the tricky part: running the Kafka consumer and producer in the same application.
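
One common pattern (not necessarily Lucia’s exact approach) is to poll the consumer inside the Streamlit script itself and redraw a chart placeholder as each tumbled average arrives. In this sketch, the topic and field names are hypothetical:

import json

import streamlit as st
from confluent_kafka import Consumer

# Placeholder config; real values come from Confluent Cloud
consumer = Consumer({
    "bootstrap.servers": "<confluent-bootstrap>:9092",
    "group.id": "streamlit-app",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["tumble_output"])  # hypothetical topic holding the five-second averages

if "prices" not in st.session_state:
    st.session_state.prices = []

placeholder = st.empty()
while True:
    msg = consumer.poll(1.0)  # wait up to one second for the next record
    if msg is None or msg.error():
        continue
    # "avg_bid_price" is an assumed field name in the tumbled output
    st.session_state.prices.append(json.loads(msg.value())["avg_bid_price"])
    placeholder.line_chart(st.session_state.prices)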

Click through for a good demonstration of a practical solution. Lucia also has a GitHub repo with all of the code, a demo of the site in action, and some links to additional resources.

Tweedie Distributions and Generalized Linear Modeling

Christian Lorentzen talks about Tweedie distributions:

Tweedie distributions and Generalised Linear Models (GLM) have an intertwined relationship. While GLMs are, in my view, one of the best reference models for estimating expectations, Tweedie distributions lie at the heart of expectation estimation. In fact, basically all applied GLMs in practice use Tweedie distributions with three notable exceptions: the binomial, the multinomial and the negative binomial distribution.
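
As a quick reference, a Tweedie distribution is an exponential dispersion model whose variance is a power function of its mean:

Var(Y) = φ · μ^p

The power p picks out the familiar members of the family: p = 0 gives the normal distribution, p = 1 the Poisson, p = 2 the gamma, and p = 3 the inverse Gaussian, while 1 < p < 2 yields the compound Poisson-gamma distributions popular in insurance modeling.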

Read on for a bit more about its history and how it ties in with several other distributions.

Getting the Top N Results in a PySpark Notebook

Gilbert Quevauvilliers only needs the top 1:

How to get the TopN rows using Python in Fabric Notebooks

When working with data there are sometimes weird and wonderful requirements which must be met in order to get to the desired solution.

In today’s blog post I had a situation where I wanted to get a single row with the highest duration.

Gilbert uses the Spark SQL approach, specifically via its Python DataFrame functions. You could also write the query in Spark SQL itself using the LIMIT operator.
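
For illustration, both routes might look something like this, with made-up column and view names:

from pyspark.sql import functions as F

# DataFrame API: sort descending and keep only the top row
df_top1 = df.orderBy(F.col("Duration").desc()).limit(1)

# The equivalent in plain Spark SQL, using LIMIT
df.createOrReplaceTempView("durations")
df_top1_sql = spark.sql("SELECT * FROM durations ORDER BY Duration DESC LIMIT 1")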

Dynamic Historical Partition Refresh in Power BI

Marc Lelijveld digs into partition refreshing:

I’ve heard the question pretty often from customers: “You told me to use incremental refresh, but how can I regularly run a full load or refresh older partitions?” Well, there are perfect ways to do this using Tabular Editor or SQL Server Management Studio. But this often includes manual work to trigger the processing.

Today, this question was asked of me again. I thought there should be a smarter way to do this. Since I recently explored more of the wonderful world of Fabric Notebooks and Python, I decided to dive a bit deeper into this world and see if it is possible to script something like this using Semantic Link. And obviously, the answer is “Yes!”

Read on to learn how to do it with a bit of Python and Microsoft Fabric’s Semantic Link library (sempy).
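
As a taste of what that looks like, Semantic Link exposes a refresh_dataset function. Here is a minimal sketch; the model, table, and partition names are placeholders, and the exact parameters you need may differ:

import sempy.fabric as fabric

# Trigger a full process of one specific partition of a semantic model
fabric.refresh_dataset(
    dataset="SalesModel",  # placeholder semantic model name
    refresh_type="full",
    objects=[{"table": "FactSales", "partition": "FactSales-2021"}],
)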

Looping through Data in Microsoft Fabric PySpark Notebooks

Gilbert Quevauvilliers builds a loop:

Continuing with my existing blog series on what I’m learning with notebooks and PySpark.

Today, I’m going to explain to you how I found a way to loop through data in a notebook.

In this example, I’m going to show you how I loop through a range of dates, which can then be used in a subsequent query to extract data by passing through each date into a DAX query.

Click through for Gilbert’s example. Here’s an alternative using something called a list comprehension. First, build a function that does what you want to do—that’d be the innards of Gilbert’s Python code, lines 31-54.

def perform_dax_query(row):
    # Pull the date for this row to parameterize the DAX query
    var_Date = row["Date"]
    ...
    # Show the result set for this date
    display(df_DAX_QueryResult)

Then, call that function for each row:

# Run the function once per row collected from the data frame
[perform_dax_query(row) for row in data_collect]

In this particular scenario, I’d personally stick with Gilbert’s composition, but in cases where you’re transforming a list of elements into a new list—for example, if you’re performing some data cleanup for each row in a list and you want the output to be a new list with cleaned-up data—then the list comprehension works really well.
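
For instance, a hypothetical cleanup function that returns a value instead of displaying one:

def clean_row(row):
    # Hypothetical cleanup that returns a value rather than displaying output
    return str(row["Date"]).strip()

cleaned_dates = [clean_row(row) for row in data_collect]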

Accuracy is Not Enough for Classification

I have a new video:

In this video, I explain why accuracy is not the be-all, end-all measure for classification. After that, I introduce the confusion matrix, a mechanism for tracking predicted versus actual values. Then, I talk about a variety of measures and how we can derive them from the confusion matrix.

The trickiest part of the confusion matrix measures is just remembering which measures correspond to which combinations in the matrix. The second-trickiest part is that R and Python transpose the matrix relative to one another, so reading across the top row in R is equivalent to reading down the first column in Python.
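
To make the orientation concrete, scikit-learn’s confusion_matrix puts actual values on the rows and predictions on the columns, the transpose of what R’s caret package displays:

from sklearn.metrics import confusion_matrix

y_actual = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# scikit-learn layout for binary labels:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_actual, y_pred)
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)  # 2 / 3 here
recall = tp / (tp + fn)     # 2 / 3 here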

Data Quality Issues in Python-Based Time Series Analysis

Hadi Fadlallah checks out the data:

Time-series data analysis is one of the most important analysis methods because it provides great insight into how situations change over time, which helps in understanding trends and making the right choices. However, it depends heavily on data quality.

Data quality mistakes in time series datasets have far-reaching implications for the accuracy and trustworthiness of analyses, as well as their interpretation. Such mistakes can be introduced during data collection, storage, and processing, and specialists working with these datasets must acknowledge these data quality obstacles.

Read on for several examples of data quality issues you might run into in a time series dataset, as well as their fixes.
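
As a quick illustration of the kinds of checks involved, pandas makes several of them one-liners; the file, frequency, and column names here are hypothetical:

import pandas as pd

# Hypothetical file and column names
df = pd.read_csv("readings.csv", parse_dates=["timestamp"]).set_index("timestamp")

# Duplicate timestamps are often a sign of double ingestion
n_dupes = df.index.duplicated().sum()
df = df[~df.index.duplicated(keep="first")]

# Gaps: reindex to the expected frequency and count the holes
full_range = pd.date_range(df.index.min(), df.index.max(), freq="5min")
n_missing = df.reindex(full_range)["value"].isna().sum()

# Crude outlier screen: values more than three standard deviations from the mean
outliers = df[(df["value"] - df["value"].mean()).abs() > 3 * df["value"].std()]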

Adding the Current Date and Time to a PySpark Data Frame

Gilbert Quevauvilliers wants to know what time it is:

How to add current DateTime to existing PySpark data frame in a Fabric Notebook

In the blog post below, I am going to describe how to add the current Date Time to your existing Spark data frame.

This is really useful when I am inserting data into a Fabric Lakehouse table, and I want to know when the data got inserted.

Read on for the answer.
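
For reference, the heart of one common approach is a single withColumn call with current_timestamp(). This is a sketch, with a column name of my own choosing rather than necessarily Gilbert’s:

from pyspark.sql import functions as F

# Stamp every row with the processing time just before writing to the Lakehouse
df = df.withColumn("InsertedDateTime", F.current_timestamp())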
