
Category: Data Science

Solving Linear Constraints with Python

Luke Menzies and Gavita Regunath create a schedule:

Linear optimisation (often referred to as linear programming) is not cutting edge or new. It has been around for a very long time. It was first introduced within the field of operational research during World War II, where it was used to help minimise costs. The method proposed for solving these problems is known as the simplex method, and it hasn’t changed much since. What has changed significantly is the computing power available and the accessibility of the technique, allowing these methods to be used on very complex scenarios with almost a click of a button. Convenient libraries have simplified the intricate work of setting these problems up on a computer.

Read on for an example of linear programming. I’ve always enjoyed this technique but haven’t had many places to use it in my professional career. That said, shout out to everyone who’s ever used LINGO.
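To make the idea concrete, here is a minimal sketch of a two-variable linear program solved in plain Python. It is not the simplex method and not the approach from the linked post: it simply checks every corner of the feasible region, which works because a bounded linear program has an optimum at a vertex. The constraints and objective are made-up numbers for illustration, and a real problem would normally go through a dedicated solver library.

```python
from itertools import combinations

# Each constraint is (a, b, c), meaning a*x + b*y <= c. Values are illustrative.
CONSTRAINTS = [
    (1, 1, 4),    # x + y <= 4
    (1, 3, 6),    # x + 3y <= 6
    (1, 0, 3),    # x <= 3
    (-1, 0, 0),   # x >= 0
    (0, -1, 0),   # y >= 0
]
OBJECTIVE = (3, 2)  # maximise 3x + 2y


def intersection(c1, c2):
    """Point where two constraint boundaries cross, or None if they are parallel."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    # Cramer's rule for the 2x2 system
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)


def is_feasible(point, constraints, tol=1e-9):
    x, y = point
    return all(a * x + b * y <= c + tol for a, b, c in constraints)


def solve_lp(constraints, objective):
    """Return (best_point, best_value) by checking every feasible vertex.

    Assumes the feasible region is non-empty and bounded.
    """
    def value(p):
        return objective[0] * p[0] + objective[1] * p[1]

    vertices = []
    for c1, c2 in combinations(constraints, 2):
        p = intersection(c1, c2)
        if p is not None and is_feasible(p, constraints):
            vertices.append(p)
    best = max(vertices, key=value)
    return best, value(best)


best_point, best_value = solve_lp(CONSTRAINTS, OBJECTIVE)
print(best_point, best_value)
```

Vertex enumeration grows quickly with the number of constraints, which is exactly why methods like simplex exist; the sketch is only meant to show what "optimising a linear objective under linear constraints" looks like.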


Monotonic Constraints on Random Forests

Michael Mayer has some interesting R and Python code for us:

On ML competition platforms like Kaggle, complex and unintuitively behaving models dominate. In this respect, reality is completely different. There, the majority of models do not serve as pure prediction machines but rather as a fruitful source of information. Furthermore, even if used as a prediction machine, the users of the models might expect a certain degree of consistency when “playing” with input values.

A classic example is the statistical house appraisal model. An additional bathroom or an additional square foot of ground area is expected to raise the appraisal, everything else being fixed (ceteris paribus). The user might lose trust in the model if the opposite happens.

One way to enforce such consistency is to monitor the signs of coefficients of a linear regression model. Another useful strategy is to impose monotonicity constraints on selected model effects.

Certain types of regression algorithm make this easy, but random forest? Not so much. That’s where Michael steps in.
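As a small illustration of the consistency idea, the sketch below checks (rather than enforces) monotonicity: it varies one feature while holding the others fixed and confirms the prediction never drops. The two appraisal functions and their coefficients are hypothetical stand-ins, not anything from Michael's post.

```python
def is_monotone_increasing(predict, base_row, feature, values):
    """True if predictions never decrease as `feature` rises, others held fixed."""
    preds = []
    for v in sorted(values):
        row = dict(base_row)   # copy so the base row is not modified
        row[feature] = v
        preds.append(predict(row))
    return all(a <= b for a, b in zip(preds, preds[1:]))


# Hypothetical appraisal model with positive coefficients on both features.
def linear_appraisal(row):
    return 50_000 + 120 * row["sqft"] + 8_000 * row["bathrooms"]


# Hypothetical model with a quirk: the appraisal dips at exactly three bathrooms.
def quirky_appraisal(row):
    penalty = 20_000 if row["bathrooms"] == 3 else 0
    return linear_appraisal(row) - penalty


house = {"sqft": 1500, "bathrooms": 2}
linear_ok = is_monotone_increasing(linear_appraisal, house, "bathrooms", range(1, 6))
quirky_ok = is_monotone_increasing(quirky_appraisal, house, "bathrooms", range(1, 6))
print(linear_ok, quirky_ok)
```

A check like this only tests the grid of values you give it around one base row, so it can flag a violation but cannot prove a model is monotone everywhere. That gap is why building the constraint into the model itself is attractive.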


Replacing p-values with Bootstrapped Confidence Intervals

Florent Buisson has an interesting post on avoiding p-value calculations:

And indeed, I worked with highly-skilled data scientists who had a very sharp understanding of statistics. But after years of designing and analyzing experiments, I grew dissatisfied with the way we communicated results to decision-makers. I felt that the over-reliance on p-values led to sub-optimal decisions. After talking to colleagues in other companies, I realized that this was a broader problem, and I set out to write a guide to better data analysis. In this article, I’ll present one of the biggest recommendations of the book, which is to ditch p-values and use Bootstrap confidence intervals instead.

I’m a committed Bayesian (or at least a Bayesian who should be committed—depends on who you ask), so I’d consider this a big step forward.
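For anyone who hasn't seen one, here is a rough sketch of a percentile bootstrap confidence interval for the difference in means between two groups. The numbers are invented, the function name and defaults are my own, and the percentile method is only one of several bootstrap interval flavors; I'm not claiming it is the exact procedure from Florent's article.

```python
import random
import statistics


def bootstrap_diff_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(statistics.mean(t) - statistics.mean(c))
    diffs.sort()
    lower = diffs[int(alpha / 2 * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up experiment data
control = [10.1, 9.8, 10.4, 9.9, 10.0, 10.3, 9.7, 10.2]
treatment = [10.9, 11.2, 10.8, 11.0, 11.4, 10.7, 11.1, 11.3]

lower, upper = bootstrap_diff_ci(control, treatment)
print(f"95% interval for the lift: [{lower:.2f}, {upper:.2f}]")
```

The appeal for decision-makers is that the output is a range in the units they care about ("the lift is probably between this and that") instead of a probability statement about a null hypothesis.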


When to Start Using a Database with R or Python

Roel Hogervorst thinks about data sizes in R and Python:

Your dataset becomes so big and unwieldy that operations take a long time. How long is too long? That depends on you. I get annoyed if I don’t get feedback within 20 seconds (and I love it when a program shows me a progress bar at that point, at least I know how long it will take!); your boundary may lie at some other point. When you reach that point of annoyance, or the point of no longer being able to do your work, you should improve your workflow.

I will show you how to do some speedups by using other R packages, in Python moving from pandas to polars, or leveraging databases. I see some hesitancy about moving to a database for analytical work, and that is too bad. Bad for two reasons: one, it is super simple; two, it will save you a lot of time.

I definitely agree with Roel’s bottom line here. Granted, part of that is domain knowledge, but databases are extremely good at handling data and both languages have plenty of database accessibility.

One last tip, though: if you’re on the data science or data analytics track, learn SQL. Yes, libraries like dbplyr in R or ORMs in Python can cover up a lot, but that comes at a cost, typically in terms of performance. Building these skills will make your life considerably easier.
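To show how low the barrier is, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module, with the aggregation written directly in SQL. The `sales` table and its rows are invented for the example, and an in-memory database stands in for whatever database you would really use.

```python
import sqlite3

# In-memory database purely for illustration; a file path works the same way
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

rows = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 60.0),
    ("east", 45.5),
    ("south", 20.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Let the database do the grouping and sorting instead of pulling rows into memory
query = """
    SELECT region, COUNT(*) AS n, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
"""
summary = conn.execute(query).fetchall()
for region, n, total in summary:
    print(region, n, total)

conn.close()
```

With five rows this is obviously overkill, but the same query shape keeps working when the table no longer fits comfortably in a data frame.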


Most Business Ideas Fail

Eric Colson, et al., have a humbling thought for us:

The introduction of data science into the business world has contributed far more than recommendation algorithms; it has also taught us a lot about the efficacy with which we manage our businesses. Specifically, data science has introduced rigorous methods for measuring the outcomes of business ideas. These are the strategic ideas that we implement in order to achieve our business goals. For example, “We’ll lower prices to increase demand by 10%” and “we’ll implement a loyalty program to improve retention by 5%.” Many companies simply execute on their business ideas without measuring if they delivered the impact that was expected. But, science-based organizations are rigorously quantifying this impact and have learned some sobering lessons:

1. The vast majority of business ideas fail to generate a positive impact.

2. Most companies are unaware of this.

3. It is unlikely that companies will increase the success rate for their business ideas.

Read the whole thing. It gives a lot of perspective to a difficult problem: there aren’t as many “free wins” in a business as you might expect. To paraphrase Adam Smith, there is a lot of ruin in a company…but that doesn’t mean you know what exactly it is or how exactly to fix it. Coming in with appropriate humility and a flexible mind (by which I mean a willingness to see reality even when it doesn’t comport with the mental model you’ve built over time) can help improve those odds.


Document Classification in Python

Brendan Tierney performs a bit of document classification with scikit-learn and nltk:

Text mining is a popular topic for exploring what text you have in documents, etc. Text mining and NLP can help you discover different patterns in the text, from uncovering certain words or phrases which are commonly used to identifying certain patterns and linkages between different texts/documents. Combining this work on Text mining, you can use Word Clouds, time-series analysis, etc. to discover other aspects and patterns in the text. Check out my previous blog posts (post 1, post 2) on performing Text Mining on documents (manifestos from some of the political parties from the last two national government elections in Ireland). These two posts give you a simple indication of what is possible.

We can build upon these Text Mining examples to include other machine learning algorithms like those for Classification. With Classification we want to predict or label a record or document to have a particular value. With Classification this could involve labeling a document as being positive or negative (movie or book reviews), or determining if a document is for a particular domain such as Technology, Sports, Entertainment, etc.

Click through for a walkthrough of this process.
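If you want a feel for what labeling a document as positive or negative involves before clicking through, here is a bare-bones multinomial Naive Bayes classifier in plain Python. Brendan's post uses scikit-learn and nltk; this hand-rolled version is my own stand-in, and the four training "reviews" are invented.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(docs):
    """docs: list of (text, label) pairs. Returns the counts needed to predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing so unseen word/label pairs are not zero
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training reviews
reviews = [
    ("a wonderful and moving film", "positive"),
    ("great acting and a great story", "positive"),
    ("dull plot and terrible acting", "negative"),
    ("a boring and terrible film", "negative"),
]
model = train(reviews)
print(predict(model, "great story"))
print(predict(model, "terrible and boring"))
```

A real pipeline would add proper tokenization, stop-word handling, and a held-out test set, which is where libraries like the ones in the post earn their keep.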


DBScan for Clustering in Python

Brendan Tierney takes us through the DBScan algorithm:

Let’s illustrate the use of DBScan (Density Based Spatial Clustering of Applications with Noise), using the scikit-learn Python package, for a “manufactured” dataset. This example will illustrate how this density based algorithm works (See my other blog post which compares different Clustering algorithms for this same dataset). DBSCAN is better suited for datasets that have disproportional cluster sizes (or densities), and whose data can be separated in a non-linear fashion.

Click through for an interesting read on a dataset which is historically difficult to cluster (unless you know the general shape and translate everything to polar coordinates).
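For the curious, here is a compact sketch of the DBSCAN idea in plain Python, run on two concentric rings plus one stray point. The post itself uses scikit-learn; this version, the ring data, and the `eps` and `min_samples` values are my own assumptions, picked so the toy example separates cleanly.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point, keep expanding
    return labels


def ring(radius, n):
    """n evenly spaced points on a circle centred at the origin."""
    return [
        (radius * math.cos(2 * math.pi * k / n), radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


# Toy data: inner ring, outer ring, and one far-away outlier
points = ring(1.0, 20) + ring(3.0, 40) + [(10.0, 10.0)]
labels = dbscan(points, eps=0.6, min_samples=3)
print(labels)
```

Because clusters grow by chaining together nearby dense points, each ring ends up as its own cluster and the outlier is flagged as noise, something a centroid-based method would struggle with on this shape. The brute-force neighbour search is quadratic, so treat this as a teaching aid only.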


Understanding Support Vector Machines

Luis Valencia takes us through the algorithm for support vector machines:

A support vector machine (SVM) is a supervised machine learning model that uses classification algorithms for two-group classification problems. Compared to newer algorithms like neural networks, they have two main advantages: higher speed and better performance with a limited number of samples (in the thousands).

Pepperidge Farm remembers when we used genetic algorithms to solve problems because support vector machines were too slow.
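As a rough illustration of the two-group case, here is a linear SVM trained by subgradient descent on the regularized hinge loss, in plain Python. This is one simple way to fit a linear SVM and not necessarily the formulation Luis walks through; the toy points, learning rate, and regularization strength are all assumptions chosen for the example.

```python
def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Fit weights w and bias b by minimising lam/2 * |w|^2 + hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin or misclassified: hinge-loss subgradient step
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Comfortably correct: only the regularisation shrink applies
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b


def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data centred on the origin; labels are -1 and +1
X = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
y = [-1, -1, -1, 1, 1, 1]

w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
print(predict(w, b, (3, 3)), predict(w, b, (-3, -3)))
```

Only points that sit inside the margin trigger a real update, which is the "support vector" intuition in miniature: the boundary is shaped by the hard cases, not by the points that are already far on the correct side.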


Word Stemming and Text Processing in R

Genrikh Ananiev takes us through some examples of text processing in R:

First, there are a lot of classes (in fact, as many classes as you have products). And if you have to work not only with the company’s products but also with competitors’ products, new classes can appear every day, so it becomes pointless to train a model once and reuse it repeatedly to predict new products.

Secondly, the number of documents (different variations of the same product) per class is not well balanced: a class may contain just one document, or it may contain more.

Click through for an example of the classical technique versus a classification-based technique.
