Category: Data Science

Solving The Monty Hall Problem With R

Miroslav Rajter builds a Monty Hall problem simulator using R:

The original and simplest scenario of the Monty Hall problem is this: You are in a prize contest, and in front of you there are three doors (A, B and C). Behind one of the doors is a prize (Car), while behind the others is a loss (Goat). You first choose a door (let’s say door A). The contest host then opens another door behind which is a goat (let’s say door B), and then he asks whether you will stay with your original choice or switch doors. The question is: which is the better strategy?

This is something that puzzled me for a very long time. This is fundamentally a Bayesian problem built around processing new information, and once I understood that, the answer was a lot clearer. H/T R-Bloggers.
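The result is easy to check by simulation. The following is a minimal Python sketch, not the R code from the linked post; the function names `play` and `win_rate` are made up for illustration. It assumes the standard rules: the host knows where the car is and always opens a goat door other than the contestant's pick.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is still closed and was not picked.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, trials=100_000, seed=42):
    """Estimate the probability of winning for a fixed strategy."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials

if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Staying wins only when the first pick was right (probability 1/3), so switching wins the remaining 2/3 of the time. The host's choice carries information because he never reveals the car.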

codecentric.ai Bootcamp

Shirin Glander announces a free German-language bootcamp:

This bootcamp is a free online course for everyone who wants to learn hands-on machine learning and AI techniques, from basic algorithms to deep learning, computer vision and NLP. The course language is German only, but for every chapter I did, you will find an English R-version here on my blog (see below for links).

Right now, the course is in beta phase, so we are happy about everyone who tests our content and leaves feedback. Also, the curriculum is not yet finished; we will update and extend the course during the next months. If there are specific topics you’d like to have us cover, just let us know!

If you understand German and want to learn about data science, check this out and leave feedback.

Misinterpretation and Misuse of P-Values and Confidence Intervals

Dave Giles has some good details on common problems of misinterpretation:

There are so many things in statistics (and hence in econometrics) that are easily, and frequently, misinterpreted. Two really obvious examples are p-values and confidence intervals.

I’ve devoted some space in earlier posts to each of these concepts, and their mis-use. For instance, in the case of p-values, see the posts here and here; and for confidence intervals, see here and here.

Click through for more in this vein, including a reference to an interesting-looking paper.

Gartner Advanced Analytics Magic Quadrant Updates

William Vorhies summarizes the changes to the Gartner Advanced Analytics magic quadrant:

The Gartner Magic Quadrant for Data Science and Machine Learning Platforms is just out and once again there are big changes in the leaderboard. Say what you will about our profession but as a platform developer you certainly can’t rest on your laurels. Some traditional leaders have fallen (SAS, KNIME, H2O.ai, IBM) and some challengers have risen (Alteryx, TIBCO, RapidMiner).

Databricks is making a big push and there’s more movement than usual in this year’s chart. Check it out.

Market Basket Analysis With arulesSequences

Allison Koenecke takes us through the arulesSequences package in R:

In the following tutorial, we answer both questions using the R package arulesSequences [4], which implements the SPADE algorithm [5]. Concretely, given data in an Excel spreadsheet containing historical customer service purchase data, we produce two separate Excel sheet deliverables: a list of service bundles, and a set of temporal rules showing how service bundles evolve over time.  We will focus on interpreting the latter result by showing how to use temporal rules in making predictive sales recommendations.

Our running example below is inspired by the need for Microsoft’s Azure Services salespeople to suggest which additional products to recommend to customers, given the customers’ current cloud product consumption services mix.  We’d like to know, for instance, if customers who have implemented web services also purchase web analytics within the next month.  Actual Azure Service names have been removed for confidentiality reasons.

Market basket analysis is an interesting topic, though in my limited experience it falls apart when you have a large number of products to compare. It tends to work better with toy examples or limited product selections: with a 50,000+ SKU inventory, the lift of any individual combination of products rarely rises above the level of noise.
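For reference, lift compares how often two items appear together with how often they would appear together if they were independent. Here is a minimal Python sketch of that calculation; the `lift` function and the `baskets` data are made up for illustration and have nothing to do with the arulesSequences package or the Azure data in the linked post.

```python
def lift(transactions, a, b):
    """Lift of items a and b: support(a and b) / (support(a) * support(b))."""
    n = len(transactions)
    support_a = sum(a in t for t in transactions) / n
    support_b = sum(b in t for t in transactions) / n
    support_ab = sum(a in t and b in t for t in transactions) / n
    if support_a == 0 or support_b == 0:
        raise ValueError("lift is undefined when an item never appears")
    return support_ab / (support_a * support_b)

# Hypothetical toy baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"butter", "bread"},
]

if __name__ == "__main__":
    print(lift(baskets, "bread", "butter"))  # above 1: bought together more than chance
    print(lift(baskets, "bread", "milk"))    # below 1: bought together less than chance
```

A lift near 1 means the pair co-occurs about as often as independence would predict. With a very large catalog, most pairs show up in only a handful of baskets, so the support estimates that feed this ratio rest on tiny counts.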

Building Credit Scorecards

Andre Violante uses SAS to build credit scorecards and analyze credit data:

For this analysis I’m using the SAS Open Source library called SWAT (Scripting Wrapper for Analytics Transfer) to code in Python and execute SAS CAS Action Sets. SWAT acts as a bridge between the Python language and CAS Action Sets. CAS Action Sets are analogous to libraries in Python or packages in R. The one main difference and benefit is that the algorithms within these action sets have been highly parallelized to run on a CAS (Cloud Analytic Services) server. The CAS server is a distributed in-memory engine where I can do all my heavy lifting or computations. The code and Jupyter Notebook are available on GitHub.

Click through for the analysis.

P-Hacking and Multiple Comparison Bias

Patrick David has a great article on hypothesis testing, p-hacking, and multiple comparison bias:

The most important part of hypothesis testing is being clear what question we are trying to answer. In our case we are asking:
“Could the most extreme value happen by chance?”
The most extreme value we define as the greatest absolute AMVR deviation from the mean. This question forms our null hypothesis.

Give this one a careful read and try out the code. This is an important topic for anyone who analyzes data to understand.
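To see why multiple comparisons matter, consider how quickly the chance of at least one false positive grows with the number of tests. The Python sketch below is not the article's code. It assumes the tests are independent, and it shows the Bonferroni correction as one standard (and conservative) adjustment rather than as the method the article uses. The function names are made up for illustration.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests,
    each run at significance level alpha, when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / m

if __name__ == "__main__":
    for m in (1, 5, 20, 100):
        raw = familywise_error_rate(0.05, m)
        corrected = familywise_error_rate(bonferroni_alpha(0.05, m), m)
        print(f"{m:>3} tests: uncorrected {raw:.3f}, Bonferroni {corrected:.3f}")
```

Picking the most extreme value out of many candidates and then testing it as if it were the only one examined is the same problem in disguise: the more candidates there are, the more extreme the largest deviation will be by chance alone.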

Feature And Text Classification Using Naive Bayes In R

I wrap up my series on the Naive Bayes class of algorithms, finally writing some code along the way:

Now we’re going to look at movie reviews and predict whether a movie review is a positive or a negative review based on its words. If you want to play along at home, grab the data set, which is under 3MB zipped and contains 2,000 reviews in total.

Unlike last time, I’m going to break this out into sections with commentary in between. If you want the full script with notebook, check out the GitHub repo I put together for this talk.

Assuming I ever get a chance to do this talk again, I’m probably going to change the data sets in the example given how overplayed iris is.

Classifying Texts With Naive Bayes

I continue my series on Naive Bayes with another hand-calculation post:

Step two is, on the surface, pretty tough: how do we figure out if a set of words is a business phrase or a baseball phrase? We could try to think up a set of features. For example, how long is the phrase? How many unique words does it have? Is there a pile of sunflower seeds near the phrase? But there’s an easier way.

Remember the “naive” part of Naive Bayes: all features are independent. And in this case, we can use as features the individual words. Therefore, the probability of a word being a baseball-related word or a business-related word is what matters, and we multiply those probabilities together to determine if the overall phrase is a baseball phrase or a business phrase.

Click through for a sports-heavy example and a bonus Nate Barkerson reference.
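The word-by-word calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the hand calculation from the post: the training phrases are made up, the `train` and `classify` names are hypothetical, and add-one (Laplace) smoothing is an assumption added here so that a word unseen in one class does not zero out the whole product. Summing logarithms is equivalent to multiplying the probabilities and avoids underflow on long phrases.

```python
import math
from collections import Counter

def train(labeled_phrases):
    """Count words per class and phrases per class."""
    word_counts = {}
    class_counts = Counter()
    for phrase, label in labeled_phrases:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(phrase.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(phrase, model):
    """Return the class with the highest prior times product of word likelihoods."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        n_words = sum(counts.values())
        # Start from the log prior, then add one log likelihood per word.
        score = math.log(class_counts[label] / total)
        for word in phrase.lower().split():
            # Add-one smoothing (an assumption of this sketch).
            score += math.log((counts[word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data for illustration only.
training = [
    ("stock price earnings", "business"),
    ("quarterly earnings report", "business"),
    ("home run pitcher", "baseball"),
    ("pitcher threw a strike", "baseball"),
]
model = train(training)

if __name__ == "__main__":
    print(classify("earnings report", model))
    print(classify("pitcher strike", model))
```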

Where Machine Learning And Econometrics Collide

Dave Giles shares some thoughts on how machine learning and econometrics relate:

What is Machine Learning (ML), and how does it differ from Statistics (and hence, implicitly, from Econometrics)?

Those are big questions, but I think that they’re ones that econometricians should be thinking about. And if I were starting out in Econometrics today, I’d take a long, hard look at what’s going on in ML.

Click through for some quick thoughts and several resources on the topic.
