Press "Enter" to skip to content

Category: Data Science

De Moivre’s Equation and Sample Size-Based Variance

Holger von Jouanne-Diedrich demonstrates de Moivre’s equation:

Over one billion dollars have been spent in the US to split up big schools into smaller ones because small schools regularly show up in rankings as top performers.

In this post, I will show you why that money was wasted because of a widespread (but not so well known) statistical artifact, so read on!

Do read on to learn more about this paradox.

Comments closed

What is Pandas?

Lina Kovacheva starts a new series on Pandas:

First and foremost – what is Pandas?

Pandas is a popular Python library that allows users to easily analyse and manipulate data. It offers powerful and flexible data structures and is vastly popular among data scientists and analysts. As with any other library to be able to use Pandas you have to import the library. 

Click through to learn more.

Comments closed

Testing Stock Market Efficiency with Compression Algorithms

Holger von Jouanne-Diedrich has a clever test:

One of the most fiercely fought debates in quantitative finance is whether the stock market (or financial markets in general) is (are) efficient, i.e. whether you can find patterns in them that can be profitably used.

If you want to learn about an ingenious method (that is already present in anyone’s computer) to approach that question, read on!

As soon as I saw the post, my Eugene Fama senses were tingling. The results are not surprising (at least, to anyone who got my reference in the prior sentence), but I did enjoy the rather clever approach to the question.

Comments closed

Subgroup Analysis via Bayesian Hierarchical Modeling

Keith Goldfield ponders subgroup analysis:

Which got me thinking, of course, about subgroup analyses. In the context of a null hypothesis significance testing framework, it is well known that conducting numerous post hoc analyses carries the risk of dramatically inflating the probability of a Type 1 error – concluding there is some sort of effect when in fact there is none. So, if there is no overall effect, and you decide to look at a subgroup of the sample (say patients over 50), you may find that the treatment has an effect in that group. But, if you failed to adjust for multiple tests, than that conclusion may not be warranted. And if that second subgroup analysis was not pre-specified or planned ahead of time, that conclusion may be even more dubious.

If we use a Bayesian approach, we might be able to avoid this problem, and there might be no need to adjust for multiple tests. I have started to explore this a bit using simulated data under different data generation processes and prior distribution assumptions. It might all be a bit too much for a single post, so I am planning on spreading it out a bit.

Read on for two separate Bayesian model approaches to the problem. H/T R-Bloggers.

Comments closed

Using tsoutliers() to Detect Time Series Outliers

Rob J. Hyndman shows off a function in the forecast package in R:

The tsoutliers() function in the forecast package for R is useful for identifying anomalies in a time series. However, it is not properly documented anywhere. This post is intended to fill that gap.

The function began as an answer on CrossValidated and was later added to the forecast package because I thought it might be useful to other people. It has since been updated and made more reliable.

Read on to see how it works. This is one of the reasons I like the R programming language so much for data analysis and statistics: usually, somebody smarter than me has already built a solution to the problem and it’s just a matter of finding the right function. H/T R-Bloggers

Comments closed

Estimating the Likelihood of an Underdog Winning at Soccer

Holger von Jouanne-Diedrich lays out the math for us:

The Bundesliga is Germany’s primary football league. It is one of the most important football leagues in the world, broadcast on television in over 200 countries.

If you want to get your hands on a tool to forecast the result of any game (and perform some more statistical analyses), read on!

What I would like is a tool which has SC Freiburg utterly dominating Bayern. Said tool may be more mythological than scientific (or at least a copy of Football Manager and a little bit of save scumming…), but I’ll take it.

Comments closed

From API Call to ML Services Prediction

Tomaz Kastrun continues a series:

From the previous two blog posts:

Creating REST API for reading data from Microsoft SQL Server in web browser

Writing Data to Microsoft SQL Server from web browser using REST API and node.js

We have looked into the installation process of Node.js, setup of Microsoft SQL Server and made couple of examples on reading the data from database through REST API and how to insert data back to database.

In this post, we will be looking the R predictions using API calls against a sample dataset.

Click through to see it in action.

Comments closed

A Learning Path for Data Science with R

Holger von Jouanne-Diedrich has a greatest hits album:

Over the course of the last two and a half years, I have written over one hundred posts for my blog “Learning Machines” on the topics of data science, i.e. statistics, artificial intelligence, machine learning, and deep learning.

I use many of those in my university classes and in this post, I will give you the first part of a learning path for the knowledge that has accumulated on this blog over the years to become a well-rounded data scientist, so read on!

Read on for links to dozens of posts on interesting topics.

Comments closed

Diving into Prophet for Time Series Analysis

Dan Lantos continues a series on the Prophet library:

These plots give us a little insight into how the model is formed. The trend plot (top) exhibits a linear, piecewise function, with approximately appropriate values for our dataset throughout the years. This looks to be a baseline for predictions.
The weekly plot (middle) demonstrates some interesting behaviour – weekdays have a small negative impact on the predictions (approximately -50), and we see large spikes for the weekends. This appears peculiar, as we have no weekend data in our dataset, but it is a product of fitting a 7-day periodic function to only 5 days of data. Thankfully, this won’t be an issue as we have no need to forecast weekends.
The yearly plot (bottom) shows a much more volatile impact on predictions (-200 to +180) with frequent changepoints throughout. This points to a more sensitive and complex relationship between the time of year and the FTSE100 index than the day of the week.

If you’re already familiar with techniques like ARMA or ARIMA, this post will let you see immediately what the key differences are.

Comments closed

Orchestrating ML Pipelines with Amazon Managed Workflows for Airflow

Juston Leto, et al, show off MLOps capabilities in AWS:

The ability to scale machine learning operations (MLOps) at an enterprise is quickly becoming a competitive advantage in the modern economy. When firms started dabbling in ML, only the highest priority use cases were the focus. Businesses are now demanding more from ML practitioners: more intelligent features, delivered faster, and continually maintained over time. An effective MLOps strategy requires a unified platform that can orchestrate and automate complex data processing and ML tasks, and integrates with the latest tooling to best complete those tasks.

This post demonstrates the value of using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate an ML pipeline using the popular XGBoost (eXtreme Gradient Boosting) algorithm. For more advanced and comprehensive MLOps capabilities, including a purpose-built model orchestration framework and a continuous integration and continuous delivery (CI/CD) service for ML, readers are encouraged to check out Amazon SageMaker Pipelines.

Read on for a step-by-step tutorial on the process.

Comments closed