Press "Enter" to skip to content

Category: Data Science

Demand Forecasting with KNIME

Shubham Goyal walks us through using KNIME for a product demand forecasting scenario:

In this blog, we are going to see the importance of demand forecasting and how we can easily create these forecasting workflows with KNIME.

Market demand forecasting is a critical process for any business, but perhaps none more so than for those in consumer packaged goods. Inventory, production, storage, shipping, marketing – every aspect of CPG and retail organizations’ operations is influenced by accurate forecasting. Identifying shoppers’ preferences and their likelihood to buy helps these organizations make better decisions about product offerings, entering new markets, and their supply chains; it ensures that stores are stocked and limits the risk of stock shortages or overstock.

Click through for the process.


Optimizing a Poisson Survival Model

Joshua Entrop shows off optimx() in R to perform a survival analysis:

In this blog post, we will fit a Poisson regression model by maximising its likelihood function using optimx() in R. As an example we will use the lung cancer data set included in the {survival} package. The data set includes information on 228 lung cancer patients from the North Central Cancer Treatment Group (NCCTG). Specifically, we will estimate the survival of lung cancer patients by sex and age using a simple Poisson regression model. You can download the code that I will use throughout this post here.

Read the whole thing. H/T R-bloggers
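
If you want a feel for the mechanics before diving in, here is a bare-bones sketch in R of the general idea – my own code, not Joshua’s – fitting a Poisson model to the {survival} lung data by minimising a negative log-likelihood with optimx():

# Minimal sketch, assuming the standard lung data coding (status: 1 = censored, 2 = dead)
library(survival)   # provides the lung data set
library(optimx)

d <- na.omit(lung[, c("time", "status", "age", "sex")])
d$event <- as.numeric(d$status == 2)
d$age_c <- d$age - mean(d$age)            # centre age for numerical stability

# Negative log-likelihood of a Poisson model with rate exp(eta) and follow-up
# time as exposure: sum of event*eta - exp(eta)*time (terms constant in beta dropped)
neg_loglik <- function(beta) {
  eta <- beta[1] + beta[2] * d$sex + beta[3] * d$age_c
  -sum(d$event * eta - exp(eta) * d$time)
}

fit <- optimx(par = c(-6, 0, 0), fn = neg_loglik, method = c("Nelder-Mead", "BFGS"))
fit[, 1:3]   # estimated intercept and coefficients for sex and age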


Splitting Data with T-SQL

Chris Hyde shows a few techniques for splitting out data into training, testing, and validation sets:

We see right away that this method failed horribly as all of the data was placed into the same dataset. This holds true no matter how many times we execute the code, and it happens because the RAND() function is only evaluated once for the whole query, and not individually for each row. To correct this we’ll instead use a method that Jeff Moden taught me at a SQL Saturday in Detroit several years ago – generating a NEWID() for each row, using the CHECKSUM() function to turn it into a random number, and then the % (modulus) function to turn it into a number between 0 and 99 inclusive.

I’d have to test it out, but I’d think you could modify method 3 to include a CROSS APPLY to perform one ABS(CHECKSUM(NEWID())) and get exact counts that way without a temp table.
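
The bucketing trick itself isn’t tied to T-SQL, either; as a quick sanity check of the idea, here is the same approach sketched in R (my own toy data): give every row a random number from 0 to 99 and carve the three sets out of that.

set.seed(42)
df <- data.frame(id = 1:1000, x = rnorm(1000))
bucket <- sample(0:99, nrow(df), replace = TRUE)     # one random 0-99 value per row

training   <- df[bucket < 70, ]                      # roughly 70%
testing    <- df[bucket >= 70 & bucket < 85, ]       # roughly 15%
validation <- df[bucket >= 85, ]                     # roughly 15%
sapply(list(training, testing, validation), nrow)    # actual counts per set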


The Basics of Randomized Response

Holger von Jouanne-Diedrich explains how randomized response can protect any single person’s opinion from a pollster while providing insight on the whole population:

So, is there a method to find the respective proportion of people without putting them on the spot? Actually, there is! If you want to learn about randomized response (and how to create flowcharts in R along the way) read on!

The question is how you can get a truthful result overall without being able to attribute a certain answer to any single individual. As it turns out, there is a very elegant and ingenious method, called randomized response. The big idea is to, as the name suggests, add noise to every answer without compromising the overall proportion too much, i.e. add noise to every answer so that it cancels out overall!

Click through for the process. It’s definitely a clever idea.
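
Here is a small R sketch of the classic coin-flip version of the idea (my own example, not necessarily Holger’s exact setup): half the respondents answer truthfully, the other half answer at random, and the true proportion falls out of the aggregate.

set.seed(1)
n <- 1e5
true_opinion <- rbinom(n, 1, 0.3)    # 30% truly hold the sensitive opinion
truthful     <- rbinom(n, 1, 0.5)    # first coin flip: answer honestly?
random_ans   <- rbinom(n, 1, 0.5)    # second coin flip: random answer otherwise
answer <- ifelse(truthful == 1, true_opinion, random_ans)

# P(answer = yes) = 0.5 * p_true + 0.25, so invert it:
p_hat <- 2 * mean(answer) - 0.5
p_hat   # close to 0.3, yet no single answer pins down anyone's opinion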


Contrasting ANOVA against Regression

Stephanie Glen contrasts ANOVA against typical regression techniques using a picture:

If you scour the internet for “ANOVA vs Regression”, you might be confused by the results. Are they the same? Or aren’t they? The answer is that they can be the same procedure, if you set them up to be that way. But there are differences between the two methods. This one picture sums up those differences.

Click through to see that image.
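
As a quick code-level illustration of the “can be the same procedure” point (my own example, not from the post): a one-way ANOVA is a linear regression on a factor, and the F tests agree.

fit_lm  <- lm(weight ~ group, data = PlantGrowth)    # regression on a factor
fit_aov <- aov(weight ~ group, data = PlantGrowth)   # classic one-way ANOVA
anova(fit_lm)      # F statistic and p-value...
summary(fit_aov)   # ...match the ANOVA table exactly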


Comparing Gradient Descent to the Normal Equation for Small Data Sets

Pushkara Sharma compares two techniques for regression:

In this article, we will see the actual difference between gradient descent and the normal equation through a practical approach. Most newbie machine learning enthusiasts learn about gradient descent while studying linear regression and move on without even knowing about the underestimated Normal Equation, which is far less complex and provides very good results for small to medium-sized datasets.

If you are new to machine learning, or not familiar with the normal equation or gradient descent, don’t worry; I’ll try my best to explain these in layman’s terms. So, I will start by explaining a little about the regression problem.

I was surprised by the results.
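
For a taste of the contrast, here is a small sketch in R on simulated data (mine, not the article’s code): the normal equation solves for the coefficients in one step, while batch gradient descent iterates toward the same answer.

set.seed(7)
n <- 200
X <- cbind(1, rnorm(n))                        # intercept column plus one feature
y <- X %*% c(2, 3) + rnorm(n, sd = 0.5)

beta_ne <- solve(t(X) %*% X, t(X) %*% y)       # normal equation: (X'X)^-1 X'y

beta_gd <- c(0, 0); alpha <- 0.1               # batch gradient descent
for (i in 1:5000) {
  grad <- t(X) %*% (X %*% beta_gd - y) / n     # gradient of the squared-error cost
  beta_gd <- beta_gd - alpha * grad
}
cbind(normal_equation = as.vector(beta_ne), gradient_descent = as.vector(beta_gd))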


Understanding the Bayesian Nature of Kalman Filters

Holger von Jouanne-Diedrich gives us an interesting interpretation of Kalman filters:

The Kalman filter is a very powerful algorithm to optimally include uncertain information from a dynamically changing system to come up with the best educated guess about the current state of the system. Applications include (car) navigation and stock forecasting. If you want to understand how a Kalman filter works and build a toy example in R, read on!

The following post is based on the post “Das Kalman-Filter einfach erklärt” which is written in German and uses Matlab code (so basically two languages nobody is interested in any more 😉 ). This post is itself based on an online course “Artificial Intelligence for Robotics” by my colleague Professor Sebastian Thrun of Stanford University.

In fairness, I regret only one thing about learning German: that I’ve forgotten so much over the years.
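
If you’d like to see the Bayesian updating in miniature, here is a toy one-dimensional sketch in R (my own, not Holger’s code): each measurement step combines the prior with the observation, weighted by their variances, and each motion step inflates the uncertainty again.

kalman_update <- function(mu, var, z, meas_var) {    # measurement step (Bayes update)
  k <- var / (var + meas_var)                        # Kalman gain
  list(mu = mu + k * (z - mu), var = (1 - k) * var)
}
kalman_predict <- function(mu, var, motion, motion_var) {   # motion step (prediction)
  list(mu = mu + motion, var = var + motion_var)
}

state <- list(mu = 0, var = 1e4)                     # vague prior about the position
for (z in c(5, 6, 7, 9, 10)) {
  state <- kalman_update(state$mu, state$var, z, meas_var = 4)
  state <- kalman_predict(state$mu, state$var, motion = 1, motion_var = 2)
}
state   # posterior mean and variance after the last motion step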


The Basics of Autoregressive Models

Holger von Jouanne-Diedrich explains some of the principles of autoregressive models through a demonstration:

Well, this seems to be good news for the sales team: rising sales! Yet, how does this model arrive at those numbers? To understand what is going on we will now rebuild the model. Basically, everything is in the name already: auto-regressive, i.e. a (linear) regression on (a delayed copy of) itself (auto from Ancient Greek self)!

So, what we are going to do is create a delayed copy of the time series and run a linear regression on it. We will use the lm() function from base R for that (see also Learning Data Science: Modelling Basics).

Read on for some additional understanding.
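
The delayed-copy trick takes only a few lines of R; here is a minimal sketch on simulated data (not the post’s sales series).

set.seed(123)
sales <- as.numeric(arima.sim(list(ar = 0.8), n = 100)) + 50   # toy AR(1) series

lagged  <- head(sales, -1)                # the delayed copy of the series
current <- tail(sales, -1)
ar1 <- lm(current ~ lagged)               # regression of the series on itself
coef(ar1)                                 # slope is the autoregressive coefficient (~0.8)

predict(ar1, newdata = data.frame(lagged = tail(sales, 1)))    # one-step-ahead forecast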


Using INLA for Spatial Regression in R

Lionel Hertzog continues a series on spatial regression:

INLA is a package that allows fitting a broad range of models; it uses Laplace approximation to fit Bayesian models much, much faster than algorithms such as MCMC. INLA allows for fitting geostatistical models via stochastic partial differential equations (SPDE); good places for more background information on this are these two gitbooks: spde-gitbook and inla-gitbook.

This is not the gentlest introduction, so if you’re new to the concept go back and read part 1.
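
If you just want to see the basic interface before the SPDE machinery, here is a bare-bones sketch of fitting a simple Bayesian regression with inla() (my own toy example; note that INLA is installed from its own repository rather than from CRAN).

# install.packages("INLA",
#   repos = c(getOption("repos"), INLA = "https://inla.r-inla-download.org/R/stable"))
library(INLA)

set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 1 + 2 * df$x + rnorm(100, sd = 0.5)

fit <- inla(y ~ x, family = "gaussian", data = df)
summary(fit)   # posterior summaries via Laplace approximation, no MCMC needed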
