Press "Enter" to skip to content

Category: Data Science

Most Business Ideas Fail

Eric Colson et al. have a humbling thought for us:

The introduction of data science into the business world has contributed far more than recommendation algorithms; it has also taught us a lot about the efficacy with which we manage our businesses. Specifically, data science has introduced rigorous methods for measuring the outcomes of business ideas. These are the strategic ideas that we implement in order to achieve our business goals. For example, “We’ll lower prices to increase demand by 10%” and “we’ll implement a loyalty program to improve retention by 5%.” Many companies simply execute on their business ideas without measuring if they delivered the impact that was expected. But, science-based organizations are rigorously quantifying this impact and have learned some sobering lessons:

1. The vast majority of business ideas fail to generate a positive impact.

2. Most companies are unaware of this.

3. It is unlikely that companies will increase the success rate for their business ideas.

Read the whole thing. It gives a lot of perspective on a difficult problem: there aren’t as many “free wins” in a business as you might expect. To paraphrase Adam Smith, there is a lot of ruin in a company…but that doesn’t mean you know exactly what it is or exactly how to fix it. Coming in with appropriate humility and a flexible mind (by which I mean a willingness to see reality even when it doesn’t comport with the mental model you’ve built over time) can help improve those odds.


Document Classification in Python

Brendan Tierney performs a bit of document classification with scikit-learn and nltk:

Text mining is a popular topic for exploring what text you have in documents etc. Text mining and NLP can help you discover different patterns in the text, from uncovering certain words or phrases which are commonly used, to identifying certain patterns and linkages between different texts/documents. Combining this work on Text Mining, you can use Word Clouds, time-series analysis, etc. to discover other aspects and patterns in the text. Check out my previous blog posts (post 1, post 2) on performing Text Mining on documents (manifestos from some of the political parties from the last two national government elections in Ireland). These two posts give you a simple indication of what is possible.

We can build upon these Text Mining examples to include other machine learning algorithms like those for Classification. With Classification we want to predict or label a record or document to have a particular value. This could involve labeling a document as being positive or negative (movie or book reviews), or determining if a document is for a particular domain such as Technology, Sports, Entertainment, etc.

Click through for a walkthrough of this process.
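
As a rough sketch of the idea (my own toy example rather than Brendan’s pipeline), here is what a minimal scikit-learn document classifier can look like: TF-IDF features feeding a Naive Bayes model on a handful of labeled documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled documents; in practice these would be full documents or manifestos.
docs = [
    "the team won the match in extra time",
    "new processor doubles battery life",
    "the striker scored a late goal",
    "the phone ships with a faster chip",
]
labels = ["Sports", "Technology", "Sports", "Technology"]

# TF-IDF turns each document into a weighted term vector; Naive Bayes then learns
# which terms are indicative of each label.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["a chip that improves battery performance"]))
# expected: ['Technology']
```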


DBScan for Clustering in Python

Brendan Tierney takes us through the DBScan algorithm:

Let’s illustrate the use of DBScan (Density Based Spatial Clustering of Applications with Noise), using the scikit-learn Python package, for a “manufactured” dataset. This example will illustrate how this density based algorithm works (See my other blog post which compares different Clustering algorithms for this same dataset). DBSCAN is better suited for datasets that have disproportional cluster sizes (or densities), and whose data can be separated in a non-linear fashion.

Click through for an interesting read on a dataset which is historically difficult to cluster (unless you know the general shape and translate everything to polar coordinates).
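
To make the shape point concrete, here is a quick sketch (not the post’s code) of DBSCAN pulling apart two concentric rings, exactly the kind of non-linear separation k-means struggles with:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles

# Two noisy concentric circles; factor controls the gap between the rings.
X, _ = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=42)

# eps is the neighborhood radius and min_samples the density threshold;
# points that sit in sparse regions get the noise label -1.
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

print(set(labels))  # typically {0, 1}, plus -1 if any points land in sparse spots
```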


Understanding Support Vector Machines

Luis Valencia takes us through the algorithm for support vector machines:

A support vector machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group classification problems. Compared to newer algorithms like neural networks, they have two main advantages: higher speed and better performance with a limited number of samples (in the thousands).

Pepperidge Farm remembers when we used genetic algorithms to solve problems because support vector machines were too slow.
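
For anyone who wants to kick the tires, here is a minimal two-class SVM sketch with scikit-learn (my illustration, not Luis’s code):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A small synthetic two-class problem, in the "thousands of samples" range
# where SVMs tend to do well.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C trades margin width against training errors; the RBF kernel lets the
# decision boundary bend around the data.
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```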


Word Stemming and Text Processing in R

Genrikh Ananiev takes us through some examples of text processing in R:

First, there are a lot of classes (in fact, as many classes as you have products). And if in this process you have to work not only with the company’s products but also with competitors’, new classes can appear every day, so it becomes meaningless to train a model once and reuse it to predict new products.

Second, the number of documents (different variations of the same product) in each class is not very balanced: some classes may have only a single document, while others have many more.

Click through for an example of the classical technique versus a classification-based technique.
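
The post itself works in R, but just to illustrate what stemming buys you before classification, here is a quick Python analogue with nltk’s SnowballStemmer (my own sketch, not the author’s code):

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
words = ["running", "runs", "easily", "products", "production"]

# Stemming strips suffixes so related word forms collapse to a shared root,
# shrinking the feature space the classifier has to learn.
print([stemmer.stem(w) for w in words])
# ['run', 'run', 'easili', 'product', 'product']
```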


Understanding Logistic Regression

Luis Valencia explains the idea of logistic regression:

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.

However, unlike ordinary linear regression, in its most basic form logistic regression’s target value is a binary variable instead of a continuous value.

Read on to learn more about logistic regression. The point I like to make about logistic regression is that people brand new to it say it’s regression, because hey, it has “regression” in its name! People who are more familiar with it say that’s a misnomer and it’s really a classification algorithm, not a regression algorithm. But as Luis shows, people who are very familiar with it understand that it is a regression algorithm, which just happens to have nice classification properties because in many cases, elements get pushed to the edges (0 and 1).
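
To see that behavior in code, here is a small sketch (mine, not Luis’s) showing the model fitting P(Y=1|X) first and only then thresholding it into classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One informative feature; the class flips around x = 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# The regression part: continuous probabilities, mostly pushed toward 0 or 1.
print(model.predict_proba([[-2.0], [0.0], [2.0]])[:, 1].round(3))

# The classification part: just a 0.5 threshold on those probabilities.
print(model.predict([[-2.0], [0.0], [2.0]]))
```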


Rolling Means with MazamaRollUtils

Jonathan Callahan has an interesting R package for us:

The initial release of MazamaRollUtils provides all the basic rolling functions with features like alignment and missing value removal, along with additional capabilities for smoothing, damping and outlier detection — all common activities in time series analysis.

Click through for an explanation of the process, and then check out the package itself on GitHub. H/T R-Bloggers.
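
MazamaRollUtils is an R package, so purely to illustrate the rolling-mean idea (and not the package’s own API), here is a pandas analogue that smooths a short series while skipping a missing value:

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 6.0, None, 10.0, 12.0, 14.0])

# A centered window of width 3; min_periods=1 lets the mean work around the
# missing value instead of propagating NaN.
smoothed = values.rolling(window=3, center=True, min_periods=1).mean()
print(smoothed)
```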


De Moivre’s Equation and Sample Size-Based Variance

Holger von Jouanne-Diedrich demonstrates de Moivre’s equation:

Over one billion dollars have been spent in the US to split up big schools into smaller ones because small schools regularly show up in rankings as top performers.

In this post, I will show you why that money was wasted because of a widespread (but not so well known) statistical artifact, so read on!

Do read on to learn more about this paradox.
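
The statistical artifact in question is de Moivre’s equation, the fact that the standard error of a mean shrinks with sample size: SE = sigma / sqrt(n). A quick simulation (my own, not from the post) shows why small schools crowd both ends of a ranking even when every student is drawn from the same distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
n_schools = 1000

# Every "school" samples students from the identical distribution;
# only the school size differs.
small = [rng.normal(loc=100, scale=15, size=50).mean() for _ in range(n_schools)]
large = [rng.normal(loc=100, scale=15, size=2000).mean() for _ in range(n_schools)]

print(np.std(small))  # about 15 / sqrt(50)   ~ 2.1
print(np.std(large))  # about 15 / sqrt(2000) ~ 0.34
```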


What is Pandas?

Lina Kovacheva starts a new series on Pandas:

First and foremost – what is Pandas?

Pandas is a popular Python library that allows users to easily analyse and manipulate data. It offers powerful and flexible data structures and is vastly popular among data scientists and analysts. As with any other library, to be able to use Pandas you have to import the library.

Click through to learn more.
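
To round it out, here is a first-contact example (mine, not Lina’s): import pandas, build a small DataFrame, and run a couple of basic operations on it.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["widget", "gadget", "gizmo"],
    "units_sold": [120, 85, 40],
    "price": [2.50, 3.75, 9.99],
})

print(df.head())               # the first rows of the DataFrame
print(df["units_sold"].sum())  # column-level operations: 245
print(df.describe())           # summary statistics for the numeric columns
```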
