Category: Data Science

Time Series Stationarity Testing in R

Steven Sanderson isn’t just spinning in place:

Before we delve into the ts_adf_test() function, let’s understand the concept behind it. The Augmented Dickey-Fuller (ADF) test is a crucial tool in time series analysis. It’s like the Sherlock Holmes of time series data, helping us detect whether a series is stationary or not. Stationarity is a fundamental assumption in time series modeling because many models work best when applied to stationary data.

So, why “Augmented”? Well, it’s an extension of the original Dickey-Fuller test that accounts for more complex relationships within the time series data.

Click through to see how you can use the ts_adf_test() function to get a better feel for whether a time series is stationary.
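
To make the idea concrete, here is a minimal Python sketch of the plain (un-augmented) Dickey-Fuller regression. It is not the ts_adf_test() function, and the name dickey_fuller_stat and the simulated series are assumptions made for illustration. The sketch regresses the first difference of a series on its lagged level and returns the t-statistic on the lag coefficient. With a constant in the regression, a statistic below roughly -2.86 points toward stationarity at the 5% level. The augmented test adds lagged differences to the same regression.

```python
import math
import random


def dickey_fuller_stat(y):
    """t-statistic on rho in: y[t] - y[t-1] = alpha + rho * y[t-1] + error."""
    lagged = y[:-1]
    diffs = [b - a for a, b in zip(y[:-1], y[1:])]
    n = len(lagged)
    mean_x = sum(lagged) / n
    mean_d = sum(diffs) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    sxy = sum((x - mean_x) * (d - mean_d) for x, d in zip(lagged, diffs))
    rho = sxy / sxx                      # OLS slope on the lagged level
    alpha = mean_d - rho * mean_x        # OLS intercept
    rss = sum((d - alpha - rho * x) ** 2 for x, d in zip(lagged, diffs))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return rho / std_err


# Simulated data (hypothetical): an AR(1) series and a random walk
# built from the same shocks.
rng = random.Random(42)
shocks = [rng.gauss(0, 1) for _ in range(500)]
stationary, walk = [0.0], [0.0]
for e in shocks:
    stationary.append(0.5 * stationary[-1] + e)  # mean-reverting
    walk.append(walk[-1] + e)                    # unit root

stat_stationary = dickey_fuller_stat(stationary)
stat_walk = dickey_fuller_stat(walk)
print("AR(1) statistic:", round(stat_stationary, 2))
print("random walk statistic:", round(stat_walk, 2))
```

The statistic has to be compared against Dickey-Fuller critical values rather than the usual t-table, which is why a dedicated test function is worth having.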

A Primer on A/B Testing for Engineers

John Mount performs some testing:

I’d like to discuss a simple variation of A/B testing in an engineering style.
By “an engineering style” I mean:

  • We will work a simulated example to see that the system works as claimed.
  • We will exhibit examples of problems before trying to fix them.
  • We will demonstrate all of the top level claims as calculations, and not delegate these to references.
  • We will leave fundamental math to the references, and not try to re-derive it.

In my opinion far too few A/B testing treatments check soundness, even on simulated data. This makes it easy for such articles to leave out important steps. If a relied-on reference omits a step, the derived work may have to do the same.
We will implement the experiment design directly, instead of using a canned power calculator, so we have a place to discuss some of the issues in A/B test design.

This is an excellent dive into the topic and I highly recommend taking the time to read it.
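
For a sense of what implementing the design directly and then checking it on simulated data can look like, here is a rough Python sketch. It is not Mount's code: it assumes a two-sided two-proportion z-test with the usual normal approximation, and the function names and conversion rates are made up for illustration.

```python
import math
import random
from statistics import NormalDist


def sample_size_per_group(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate per-group size for a two-sided two-proportion z-test."""
    if p_a == p_b:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_a - p_b) ** 2)


def simulated_power(p_a, p_b, n, trials=300, alpha=0.05, seed=1):
    """Fraction of simulated experiments in which the z-test rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        rate_a = sum(rng.random() < p_a for _ in range(n)) / n
        rate_b = sum(rng.random() < p_b for _ in range(n)) / n
        std_err = math.sqrt(rate_a * (1 - rate_a) / n + rate_b * (1 - rate_b) / n)
        if std_err > 0 and abs(rate_a - rate_b) / std_err > z_crit:
            rejections += 1
    return rejections / trials


# Hypothetical rates: 10% baseline versus a 12% or 15% treatment.
n_small_lift = sample_size_per_group(0.10, 0.12)
n_large_lift = sample_size_per_group(0.10, 0.15)
power = simulated_power(0.10, 0.12, n_small_lift)
print("per-group n for 10% vs 12%:", n_small_lift)
print("per-group n for 10% vs 15%:", n_large_lift)
print("simulated power at the planned n:", power)
```

The simulated power should land near the planned 80%, which is the kind of soundness check the article argues for.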

New R Package: hstats

Michael Mayer has a new package:

The current version offers:

  • H statistics per feature, feature pair, and feature triple
  • multivariate predictions at no additional cost
  • a convenient API
  • other important tools from explainable ML:
    • performance calculations
    • permutation importance (e.g., to select features for calculating H-statistics)
    • partial dependence plots (including grouping, multivariate, multivariable)
    • individual conditional expectations (ICE)
  • Case-weights are available for all methods, which is important, e.g., in insurance applications.

Click through for an example of how it works, followed by some simple benchmarking to give you an idea of how it performs compared to similar tools.
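
As a rough illustration of one item on that list, here is a small Python sketch of permutation importance. It does not use hstats or its API. The toy model, the feature names x1 and x2, and the simulated data are all assumptions. The idea is to shuffle one feature column and measure how much the model's error grows.

```python
import random


def mean_squared_error(model, rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, feature, seed=0):
    """Increase in MSE after shuffling a single feature column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
    return mean_squared_error(model, permuted, targets) - baseline


def model(row):
    # Toy "fitted" model (hypothetical): depends on x1 and ignores x2.
    return 3.0 * row["x1"]


rng = random.Random(7)
rows = [{"x1": rng.uniform(-1, 1), "x2": rng.uniform(-1, 1)} for _ in range(200)]
targets = [3.0 * r["x1"] + rng.gauss(0, 0.1) for r in rows]

importance = {
    name: permutation_importance(model, rows, targets, name)
    for name in ("x1", "x2")
}
print(importance)
```

Because the toy model never reads x2, shuffling that column leaves the error unchanged, so its importance is zero and it could be dropped before computing more expensive interaction statistics.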

Reshaping Records using cdata

John Mount takes us through a common data wrangling problem:

In many data science projects we have the data, but it “is in the wrong format.” Fortunately re-formatting or reshaping data is a solved problem, with many different available tools.

For this note, I would like to show how to reshape data using the data algebra’s cdata data reshaping tool. This should give you familiarity with a tool to use on your own data.

Click through for an example in Python. Mount and Nina Zumel also have an R package for cdata.
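
To show the kind of problem being solved, here is a plain Python sketch of a wide-to-long reshape and its inverse. It does not use the data algebra or the cdata API. The helper names, column names, and records are hypothetical.

```python
def wide_to_long(records, id_key, value_keys, name_col="measure", value_col="value"):
    """Turn one row per id into one row per (id, measure) pair."""
    long_rows = []
    for rec in records:
        for key in value_keys:
            long_rows.append({id_key: rec[id_key], name_col: key, value_col: rec[key]})
    return long_rows


def long_to_wide(long_rows, id_key, name_col="measure", value_col="value"):
    """Inverse reshape: collect the measures for each id back into one row."""
    wide = {}
    for row in long_rows:
        target = wide.setdefault(row[id_key], {id_key: row[id_key]})
        target[row[name_col]] = row[value_col]
    return list(wide.values())


# Hypothetical records: one row per model, one column per metric.
wide_form = [
    {"id": 1, "AUC": 0.6, "R2": 0.2},
    {"id": 2, "AUC": 0.8, "R2": 0.3},
]
long_form = wide_to_long(wide_form, "id", ["AUC", "R2"])
round_trip = long_to_wide(long_form, "id")
for row in long_form:
    print(row)
```

A dedicated reshaping tool earns its keep once the records have more complicated shapes than this one-id, flat-column case.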

Plotting Decision Trees in R

Steven Sanderson builds a tree:

Decision trees are a powerful machine learning algorithm that can be used for both classification and regression tasks. They are easy to understand and interpret, and they can be used to build complex models without the need for feature engineering.

Once you have trained a decision tree model, you can use it to make predictions on new data. However, it can also be helpful to plot the decision tree to better understand how it works and to identify any potential problems.

In this blog post, we will show you how to plot decision trees in R using the rpart and rpart.plot packages. We will also provide an extensive example using the iris data set and explain the code blocks in simple terms.

Read on to see an example of how to do this.

Using DVC to Store Data Science Artifacts in Azure

I have a new video up:

In this video, we introduce DVC, a tool for version control management of data science and machine learning artifacts. We learn why Git isn’t the best place to store those large data files, how DVC integrates with Git, and how you can save your files in Azure Blob Storage.

Click through for the video, as well as a variety of links which helped me put it together.

Multivariate Histograms in R

Steven Sanderson wants multiple breakdowns:

Histograms are powerful tools for visualizing the distribution of a single variable, but what if you want to compare the distributions of two variables side by side? In this blog post, we’ll explore how to create a histogram of two variables in R, a popular programming language for data analysis and visualization.

We’ll cover various scenarios, from basic histograms to more advanced techniques, and explain the code step by step in simple terms. So, grab your favorite dataset or generate some random data, and let’s dive into the world of dual-variable histograms!

Click through for several techniques.

Initial Thoughts on the Microsoft Fabric Data Science Experience

Tori Tompkins shares some thoughts:

Fabric is Microsoft’s recently announced SaaS all-in-one analytics platform. It brings together Azure Data Factory, Azure Synapse Analytics and Power BI into a single cohesive platform without the overhead of setting up resources, maintenance, and configuration. Fabric wouldn’t be an end-to-end data analytics platform without data science, so in this blog we will explore the data science and machine learning capabilities of Microsoft Fabric and assess where the platform fits in the competitive data science landscape.

Click through for Tori’s overview of where Fabric does a good job in its preview and where it currently falls short.

Plotting SVM Decision Boundaries in R

Steven Sanderson goes right up to the edge:

Support Vector Machines (SVM) are a powerful tool in the world of machine learning and classification. They excel in finding the optimal decision boundary between different classes of data. However, understanding and visualizing these decision boundaries can be a bit tricky. In this blog post, we’ll explore how to plot an SVM object using the e1071 library in R, making it easier to grasp the magic happening under the hood.

Read on to see how you can perform this analysis as well.
