Category: Data Science

A Bayesian Approach to Pricing

John Cook has some thoughts on an interesting problem:

Suppose you want to determine how to price a product and you initially don’t know what the market is willing to pay. This post outlines some of the things you might think about, and how Bayesian modeling might help.

This post is not the final word on the subject, or even my final word on the subject. It is essentially a reply to a friend’s question turned into a blog post rather than an email.

This is a really interesting problem. Price is ultimately a signal rather than the solution, and John’s post is an attempt to understand how you build up enough information to start setting effective prices. Over a long enough timeframe, we’ll eventually land on a commonly accepted price, but that’s an economist’s way of hand-waving away the process.
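
To make the idea of building up pricing information concrete, here is a minimal sketch. It is not taken from John's post: it assumes a simple Beta-Binomial model in which each candidate price has its own unknown purchase probability, and the offer counts are invented for illustration. The names `PricePoint` and `offers` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PricePoint:
    """Beta posterior over the chance a customer buys at a given price."""
    price: float
    alpha: float = 1.0  # prior pseudo-count of purchases (uniform Beta(1, 1) prior)
    beta: float = 1.0   # prior pseudo-count of refusals

    def update(self, purchases: int, refusals: int) -> None:
        # Conjugate update: add observed counts to the prior pseudo-counts.
        self.alpha += purchases
        self.beta += refusals

    @property
    def buy_probability(self) -> float:
        # Posterior mean of the purchase probability.
        return self.alpha / (self.alpha + self.beta)

    @property
    def expected_revenue(self) -> float:
        # Expected revenue per offer under the posterior mean.
        return self.price * self.buy_probability


# Hypothetical experiment: price -> (purchases, refusals). Invented numbers.
offers = {10.0: (40, 10), 15.0: (25, 25), 20.0: (10, 40)}

points = []
for price, (bought, declined) in offers.items():
    point = PricePoint(price)
    point.update(bought, declined)
    points.append(point)

best = max(points, key=lambda p: p.expected_revenue)
print(f"Best price so far: {best.price} "
      f"(expected revenue per offer {best.expected_revenue:.2f})")
```

With only a handful of observations the posterior stays close to the prior, which is one way to express "we don't know what the market will pay yet" in a model.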

Finding Performance Regressions

Angus Croll embraces the power of control charts:

Technically, “performance” metrics are those relating to the responsiveness or latency of the app, including start up time.

But TV devices also tend to be more memory constrained than other devices, and as such are more liable to crash during a memory spike — so for Netflix TV we actually care about memory at least as much as performance, maybe more so.

At Netflix the term “performance” usually encompasses both performance metrics (in the strict meaning) and memory metrics, and that’s how we’re using the term here.

Click through for a primer on how they use changepoint and anomaly detection to find performance regressions.
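
For readers new to control charts, a rough sketch of the basic idea follows. It is not Netflix's method: it assumes a plain Shewhart-style chart with limits at three standard deviations around a baseline mean, and the timings are invented. The function names `control_limits` and `out_of_control` are hypothetical.

```python
import statistics


def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits at k sample standard deviations from the mean."""
    mean = statistics.fmean(baseline)
    sd = statistics.stdev(baseline)
    return mean - k * sd, mean + k * sd


def out_of_control(values, limits):
    """Return the indexes of values that fall outside the control limits."""
    lower, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Invented start-up times in milliseconds for known-good builds.
baseline_ms = [201, 198, 203, 199, 202, 200, 197, 204, 200, 196]
# Invented timings for newer builds; the last two simulate a regression.
new_builds_ms = [199, 202, 201, 230, 228]

limits = control_limits(baseline_ms)
flagged = out_of_control(new_builds_ms, limits)
print(f"limits={limits}, flagged build indexes={flagged}")
```

A fixed baseline like this is the simplest variant; changepoint detection, as described in the linked post, instead looks for a sustained shift in the series.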

Handling Categorical Data in R

The RSquared Academy blog has a two-parter on handling categorical data in R. Part 1 elaborates on kinds of categorical data and introduces a case study:

While we can rank the categories, we cannot assign a value to them. For example, in satisfaction ranking, we cannot say that like is twice as positive as dislike i.e. we are unable to say how much they differ from each other. While the order or rank of data is meaningful, the difference between two pieces of data cannot be measured/determined or are meaningless. Ordinal data provide information about relative comparisons, but not the magnitude of the differences.
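
The guides use R, but the point about ordinal data can be sketched in a few lines of Python. This assumes a three-level satisfaction scale; the "neutral" level and the responses are invented for illustration. Sorting and medians rely only on order, so they are meaningful, whereas averaging the rank codes would quietly assume equal spacing between levels.

```python
import statistics
from collections import Counter

# Hypothetical ordered levels, lowest to highest.
LEVELS = ["dislike", "neutral", "like"]
RANK = {level: i for i, level in enumerate(LEVELS)}

# Invented survey responses.
responses = ["like", "dislike", "neutral", "like", "like", "neutral", "dislike"]

# Order-based operations are valid for ordinal data.
ordered = sorted(responses, key=RANK.__getitem__)
median_response = LEVELS[statistics.median_low(RANK[r] for r in responses)]

# A frequency table, reported in level order rather than alphabetically.
counts = Counter(responses)
table = [(level, counts[level]) for level in LEVELS]

print(ordered)
print("median:", median_response)
print(table)
```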

Part 2 shows off ways to work with categorical data in tables:

In this section, we will explore the above ways of summarizing categorical data. We will also spend some time learning about tables as you will be using them extensively while working with categorical data. R has many packages for tabulating data and we list and explore all of them in the R scripts shared in the GitHub repository.

Click through for both guides. H/T R-Bloggers.

Anomaly Detection in Two Ways

Muhammad Asad Iqbal Khan shows how you can use isolation forests and kernel density estimation for outlier detection:

Just like the random forests, isolation forests are built using decision trees. They are implemented in an unsupervised fashion as there are no pre-defined labels. Isolation forests were designed with the idea that anomalies are “few and distinct” data points in a dataset.

Recall that decision trees are built using information criteria such as Gini index or entropy. The obviously different groups are separated at the root of the tree and deeper into the branches, the subtler distinctions are identified. Based on randomly picked characteristics, an isolation forest processes the randomly subsampled data in a tree structure. Samples that reach further into the tree and require more cuts to separate them have very little probability of being anomalies. Likewise, samples that are found on the shorter branches of the tree are more likely to be anomalies, since the tree found it simpler to distinguish them from the other data.
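
The path-length intuition above can be sketched from scratch. This is a simplified illustration, not the linked author's code and not a full isolation forest: it re-draws random cuts per query instead of storing trees, and it skips the usual average-path-length adjustment at truncated leaves. The function names, the tree count, the sample size, and the synthetic data are all assumptions.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=10):
    """Number of random axis-aligned cuts needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))          # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                             # cannot split on a constant feature
        return depth
    cut = rng.uniform(lo, hi)                # pick a random split value
    # Keep only the rows that fall on the same side of the cut as the point.
    if point[dim] < cut:
        same_side = [row for row in data if row[dim] < cut]
    else:
        same_side = [row for row in data if row[dim] >= cut]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=200, sample_size=64, seed=0):
    """Average isolation depth over many random subsamples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(point, sample, rng)
    return total / n_trees


# Synthetic data: a Gaussian blob plus one far-away point.
data_rng = random.Random(42)
inliers = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = inliers + [outlier]

outlier_depth = mean_path_length(outlier, data)
inlier_depth = mean_path_length((0.0, 0.0), data)
print(f"outlier depth={outlier_depth:.2f}, inlier depth={inlier_depth:.2f}")
```

The far-away point should be isolated in fewer cuts on average than a point in the middle of the blob, which is the signal an isolation forest turns into an anomaly score.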

Click through for descriptions and the code.

Choosing a Statistical Test

Antoine Soetewey has a handy chart for us:

Being a teaching assistant in statistics for students with diverse backgrounds, I have the chance to see what is globally not well understood by students.

I have realized that it is usually not a problem for students to do a specific statistical test when they are told which one to use (as long as they have good resources and they have been attentive during classes, of course). However, it appears that the task is much more difficult for them when they need to choose what test to do.

Click through for the chart, as well as a PDF version. H/T R-Bloggers.

Solving Linear Constraints with Python

Luke Menzies and Gavita Regunath create a schedule:

Linear optimisation (often referred to as linear programming) is not cutting edge or new. It has been around for a very long time. It was first introduced within the field of operational research during World War II, where it was used to help minimise costings. The method proposed for solving these problems is known as the simplex method, and it hasn’t changed much today. Although this method hasn’t changed significantly, what has changed significantly is the computing power and accessibility of this technique, allowing these methods to be used on very complex scenarios with almost a click of a button. Convenient libraries have allowed the intricate complexities of setting these problems up on a computer to be simplified.

Read on for an example of linear programming. I’ve always enjoyed this technique but haven’t had many places to use it in my professional career. That said, shout out to everyone who’s ever used LINGO.
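
As a small taste of what a linear program looks like, here is a toy sketch. It is not the scheduling model from the linked post and it is not the simplex method: for a two-variable problem with a bounded feasible region, the optimum sits at a vertex, so this simply enumerates the vertices and keeps the best one. The constraints and objective are invented.

```python
from itertools import combinations

# Each constraint (a, b, c) means a*x + b*y <= c. Invented toy problem.
CONSTRAINTS = [
    (1, 1, 4),    # x + y  <= 4
    (1, 3, 6),    # x + 3y <= 6
    (1, 0, 3),    # x      <= 3
    (-1, 0, 0),   # x >= 0
    (0, -1, 0),   # y >= 0
]
OBJECTIVE = (3, 2)  # maximise 3x + 2y


def solve_2d_lp(constraints, objective, tol=1e-9):
    """Maximise a linear objective by checking every feasible vertex.

    Assumes the feasible region is non-empty and bounded.
    """
    best_point, best_value = None, float("-inf")
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue  # parallel boundary lines never intersect
        # Cramer's rule for the intersection of the two boundary lines.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        # Keep the intersection only if it satisfies every constraint.
        if all(a * x + b * y <= c + tol for a, b, c in constraints):
            value = objective[0] * x + objective[1] * y
            if value > best_value:
                best_point, best_value = (x, y), value
    return best_point, best_value


best_point, best_value = solve_2d_lp(CONSTRAINTS, OBJECTIVE)
print(f"optimum at {best_point} with objective value {best_value}")
```

Vertex enumeration grows combinatorially with the number of constraints, which is why real problems are handed to a solver library instead.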

Monotonic Constraints on Random Forests

Michael Mayer has some interesting R and Python code for us:

On ML competition platforms like Kaggle, complex and unintuitively behaving models dominate. In this respect, reality is completely different. There, the majority of models do not serve as pure prediction machines but rather as a fruitful source of information. Furthermore, even if used as prediction machines, the users of the models might expect a certain degree of consistency when “playing” with input values.

Classic examples are statistical house appraisal models. An additional bathroom or an additional square foot of ground area is expected to raise the appraisal, everything else being fixed (ceteris paribus). The user might lose trust in the model if the opposite happens.

One way to enforce such consistency is to monitor the signs of coefficients of a linear regression model. Another useful strategy is to impose monotonicity constraints on selected model effects.

Certain types of regression algorithms make this easy, but random forests? Not so much. That’s where Michael steps in.
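
The consistency expectation described in the quote can be checked for any model with a ceteris paribus sweep. The sketch below is not Michael's technique for constraining a random forest; it only tests whether a fitted model behaves monotonically in one feature. The helper `is_monotone_increasing` and the deliberately flawed `toy_appraisal` model are hypothetical.

```python
def is_monotone_increasing(predict, base_row, feature, grid, tol=1e-12):
    """Vary one feature over `grid`, hold the rest fixed, and check predictions never drop."""
    previous = None
    for value in sorted(grid):
        row = dict(base_row)       # copy so the caller's row is left untouched
        row[feature] = value
        current = predict(row)
        if previous is not None and current < previous - tol:
            return False
        previous = current
    return True


def toy_appraisal(row):
    """Hypothetical model: value rises with area but dips at three bathrooms."""
    bath_effect = {1: 0, 2: 15_000, 3: 12_000}[row["bathrooms"]]
    return 50_000 + 120 * row["sqft"] + bath_effect


house = {"sqft": 1500, "bathrooms": 2}

sqft_ok = is_monotone_increasing(toy_appraisal, house, "sqft", [1000, 1500, 2000])
bath_ok = is_monotone_increasing(toy_appraisal, house, "bathrooms", [1, 2, 3])
print(f"monotone in sqft: {sqft_ok}, monotone in bathrooms: {bath_ok}")
```

A check like this only inspects one starting row at a time, so a passing result on a few rows is evidence of consistency rather than a guarantee; a constraint built into the model is what provides the guarantee.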
