Category: Data Science

Understanding the Poisson Distribution

Published 2022-06-23 by Kevin Feasel

Achim Zeileis shows off my favorite statistical distribution:

The Poisson distribution has many distinctive features, e.g., both its expectation and variance are equal and given by the parameter λλ. Thus, E(Y)=λE(Y)=λ and Var(Y)=λVar(Y)=λ. Moreover, the Poisson distribution is related to other basic probability distributions. Namely, it can be obtained as the limit of the binomial distribution when the number of attempts is high and the success probability low. Or the Poisson distribution can be approximated by a normal distribution when λλ is large. See Wikipedia (2002) for further properties and references.
Here, we leverage the distributions3 package (Hayes et al. 2022) to work with the Poisson distribution in R. In distributions3, Poisson distribution objects can be generated with the Poisson() function. Subsequently, methods for generic functions can be used print the objects; extract mean and variance; evaluate density, cumulative distribution, or quantile function; or simulate random samples.

Read on for a detailed tutorial. H/T R-bloggers.

Comments closed

Comparing Data Analysis in Java and Python

Published 2022-06-16 by Kevin Feasel

Manu Barriola does some data analysis in a pair of quite different languages:

Python is a dynamically typed language, very straightforward to work with, and is certainly the language of choice to do complex computations if we don’t have to worry about intricate program flows. It provides excellent libraries (Pandas, NumPy, Matplotlib, ScyPy, PyTorch, TensorFlow, etc.) to support logical, mathematical, and scientific operations on data structures or arrays.
Java is a very robust language, strongly typed, and therefore has more stringent syntactic rules that make it less prone to programmatic errors. Like Python provides plenty of libraries to work with data structures, linear algebra, machine learning, and data processing (ND4J, Mahout, Spark, Deeplearning4J, etc.).
In this article, we’re going to focus on a narrow study of how to do simple data analysis of large amounts of tabular data and compute some statistics using Java and Python. We’ll see different techniques on how to do the data analysis on each platform, compare how they scale, and the possibilities to apply parallel computing to improve their performance.

Read on to see how the two compare. Note that this is base Java and Python+Pandas, not Spark/PySpark, Koalas, etc.

Comments closed

An Overview of Clustering Algorithms

Published 2022-06-15 by Kevin Feasel

Gavita Regunath has a two-parter on clustering. First, an explanation of the concept:

Clustering, or cluster analysis, is an unsupervised machine learning method. As the name implies, unsupervised machine learning refers to how the model ‘learns’ the data. It is a learning process opposite to supervised learning. With supervised learning, models are trained or “supervised” using labelled datasets (a known function output to our data). An example of a supervised learning method is where a model is trained to recognise animals based on their labels of being a cat, dog and rabbit.
Unsupervised learning works with unlabelled data where there are no known function outputs, and the aim is to identify patterns within a dataset. There are many unsupervised learning algorithms, however, the three main types are clustering algorithms, dimensionality reduction and anomaly detection. The focus of this blog will be on clustering, as it is the most commonly used unsupervised learning technique.

Second, a review of ten clustering algorithms:

There are many clustering algorithms. In fact, there are more than 100 clustering algorithms that have been published so far. However, despite the various types of clustering algorithms, they can generally be categorised into four methods. Let’s look at these briefly:

Read on to learn more about clustering.

Comments closed

Lasso and Ridge Regression

Published 2022-06-14 by Kevin Feasel

Niraj Kumar explains how two regression techniques work:

Lasso Regression is a regularization technique used for feature selection using a Shrinkage method also referred to as the penalized regression method.
Lasso is short for Least Absolute Shrinkage and Selection Operator, which uses both for regularization and model selection.
If a model uses the L1 regularization technique, then known as lasso regression.

Click through for a summary of the two techniques.

Comments closed

Movie Color Swaps in R

Published 2022-05-09 by Kevin Feasel

Mark White does some coloration switcharoos:

I also love film, and I started thinking about ways I could generate color palettes from films that use color beautifully. There are a number of packages that can generate color palettes from images in R, but I wanted to try writing the code myself.
I also wanted to not just generate a color palette from an image, but then swapping it with a different color palette from a different film. This is similar to neural style transfer with TensorFlow, but much simpler. I’m one of those people that likes to joke how OLS is undefeated; I generally praise the use of simpler models over more complex ones. So instead of a neural network, I use k-means clustering to transfer a color palette of one still frame from a film onto another frame from a different movie.

There are some interesting outcomes in the post, including a mashup of 2001: A Space Odyssey’s color scheme onto Arrival, as well as Kill Bill and Dr. Strangelove. The latter reminds me of a still from the credits sequence to a 1970s movie. H/T R-Bloggers.

Comments closed

Discovering Data Drift with DVC

Published 2022-05-05 by Kevin Feasel

Milecia McGregor looks at a version control system for ML projects (and data):

What happens when the machine learning model you’ve worked so hard to get to production becomes stale? Machine learning engineers and data scientists face this problem all the time. You usually have to figure out where the data drift started so you can determine what input data has changed. Then you need to retrain the model with this new dataset.
Retraining could involve a number of experiments across multiple datasets, and it would be helpful to be able to keep track of all of them. In this tutorial, we’ll walk through how using DVC, an open source version control system for machine learning projects, can help you keep track of those experiments and how this will speed up the time it takes to get new models out to production, preventing stale ones from lingering too long.

My team is working on integrating DVC. It’s a really good project for analytics teams, as it extends the notion of version control to datasets and helps you tie in code (source control), models (tools like MLflow), and data.

Comments closed

Quantifying Model Uncertainty with Tensorflow Probability

Published 2022-04-29 by Kevin Feasel

Vini Jaiswal reviews the Tensorflow Probability library:

In this blog, we look at the topic of uncertainty quantification for machine learning and deep learning. By no means is this a new subject, but the introduction of tools such as Tensorflow Probability and Pyro have made it easy to perform probabilistic modeling to streamline uncertainty calculations. Consider the scenario in which we predict the value of an asset like a house, based on a number of features, to drive purchasing decisions. Wouldn’t it be beneficial to know how certain we are of these predicted prices? Tensorflow Probability allows you to use the familiar Tensorflow syntax and methodology but adds the ability to work with distributions. In this introductory post, we leave the priors and the Bayesian treatment behind and opt for a simpler probabilistic treatment to illustrate the basic principles. We use the likelihood principle to illustrate how an uncertainty measure can be obtained along with predicted values by applying them to a deep learning regression problem.

Read on for an interesting explanation and tutorial.

Comments closed

Custom Model Evaluation Metrics with MLflow

Published 2022-04-22 by Kevin Feasel

Mark Zhang shows off a new bit of functionality in MLflow:

According to an internal customer survey, 75% of respondents say they frequently or always use specialized, business-focused metrics in addition to basic ones like accuracy and loss. Data scientists often utilize these custom metrics as they are more descriptive of business objectives (e.g. conversion rate), and contain additional heuristics not captured by the model prediction itself.
In this blog, we introduce an easy and convenient way of evaluating MLflow models on user-defined custom metrics. With this functionality, a data scientist can easily incorporate this logic at the model evaluation stage and quickly determine the best-performing model without further downstream analysis

Click through to see how to use built-in metrics but also how to create your own.

Comments closed

Data Profiling in Python

Published 2022-04-05 by Kevin Feasel

Brendan Tierney uses Python to look at some data:

One of the most common, and sometimes boring, task when working with datasets is writing some code to profile the data. Most data scientists will have built a set of tools/scripts to help them with this regular and slightly boring task. As with most IT tasks we should be trying to automate what we can, to allow us to spend more time on more important tasks, such as deriving insights and delivering value to the business, instead of repeatedly writing code to produce various statistics about the data and drawing pretty pictures.
I’ve written previously about automating and using some data profiling libraries to help us with this task. There are lots of packages available on pypi.og and on GitHub. Below I give examples of 5 Python Data Profiling libraries, with links to their GitHubs.

Brendan includes some good examples of libraries here so check it out.

Comments closed

“Production” in Data Analytics

Published 2022-03-22 by Kevin Feasel

Joey Jablonski brings up an important point:

Data-driven environments have a fundamentally different set of needs around testing, deployment, and visibility then traditional business applications. Data driven environments need access to fresh data on a high level of update frequency to ensure that data engineers and data scientists are able to effect outputs and recommendations on a timeline that has a positive impact on business decisions and customer experiences.

My day job involves running a predictive analytics team. We train models on production data—there’s very little value in training models on artificial dev data (outside of understanding the parameters of the modeling process), so even our development data generally comes from production. I don’t know that I’m sold on data mesh as a solution to this but it’s worth investigation.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30