Mixed Integer Optimization

David Smith discusses the ompr package in R:

Counterintuitively, numerical optimizations are easiest (though rarely actually easy) when all of the variables are continuous and can take any value. When integer variables enter the mix, optimization becomes much, much harder. This typically happens when the optimization is constrained by a limited selection of objects, for example packages in a weight-limited cargo shipment, or stocks in a portfolio constrained by sector weightings and transaction costs. For tasks like these, you often need an algorithm for a specialized type of optimization: Mixed Integer Programming.

For problems like these, Dirk Schumacher has created the ompr package for R. This package provides a convenient syntax for describing the variables and constraints in an optimization problem. For example, take the classic “knapsack” problem of maximizing the total value of objects in a container subject to its maximum weight limit.

Read the whole thing.
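
To make the knapsack problem above concrete, here is a minimal sketch in Python. It is not ompr and not a mixed integer programming solver: it solves the 0/1 knapsack by dynamic programming, which works only when weights are non-negative integers. The function name `knapsack` and the item values and weights are made up for illustration.

```python
from typing import List, Tuple


def knapsack(values: List[int], weights: List[int], capacity: int) -> Tuple[int, List[int]]:
    """Solve the 0/1 knapsack problem by dynamic programming.

    Assumes non-negative integer weights and capacity.
    Returns (best total value, indices of the chosen items).
    """
    n = len(values)
    # best[i][w] = maximum value using only the first i items with total weight <= w
    best = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, weight = values[i - 1], weights[i - 1]
        for w in range(capacity + 1):
            best[i][w] = best[i - 1][w]  # option 1: skip item i-1
            if weight <= w:  # option 2: take item i-1 if it fits
                best[i][w] = max(best[i][w], best[i - 1][w - weight] + value)

    # Walk the table backwards to recover which items were taken.
    chosen = []
    w = capacity
    for i in range(n, 0, -1):
        if best[i][w] != best[i - 1][w]:
            chosen.append(i - 1)
            w -= weights[i - 1]
    chosen.reverse()
    return best[n][capacity], chosen


# Hypothetical items: (value, weight) pairs and a weight limit of 50.
total, items = knapsack(values=[60, 100, 120], weights=[10, 20, 30], capacity=50)
print(total, items)  # 220 [1, 2]
```

The binary "take it or leave it" choice per item is what makes this an integer problem. A modeling package expresses the same choice as a binary decision variable per item and hands it to a solver, which scales to constraints a hand-written table cannot handle.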

Deploying VMs To Azure Using PowerShell

Rob Sewell shows how to use PowerShell to create your own Azure VM instance of the Microsoft data science virtual machine:

First, an annoyance. To be able to deploy Data Science virtual machines in Azure programmatically, you first have to log in to the portal and click some buttons.

In the Portal click new and then marketplace and then search for data science. Choose the Windows Data Science Machine and under the blue Create button you will see a link which says “Want to deploy programmatically? Get started”. Clicking this will lead to the following blade.

Click through for a screenshot-laden explanation which leaves you with a working VM in Azure.

Bandit Algorithms

Tanner Thompson describes usage of a multi-armed bandit algorithm to drive conversions:

The functional idea behind a bandit algorithm is that you make an informed decision every time you assign a visitor to a test arm. Several bandit-type algorithms have been proved to be mathematically optimal; that is, they obtain the maximum future revenue given the data they have at any given point. Gittins indexing is perhaps the foremost of these algorithms. However, the trade-off of these methods is that they tend to be very computationally intensive.

This article doesn’t show any code, but it is useful for thinking about the problem.
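
For readers who want something concrete, here is a minimal sketch of a bandit in Python. It is epsilon-greedy, a much simpler strategy than the Gittins indexing mentioned in the excerpt, and it is not taken from the article. The function name, the conversion rates, and the parameters are hypothetical.

```python
import random
from typing import List


def epsilon_greedy(true_rates: List[float], rounds: int,
                   epsilon: float = 0.1, seed: int = 0) -> List[int]:
    """Simulate an epsilon-greedy bandit and return how often each arm was shown.

    true_rates are the (normally unknown) conversion rates of each test arm;
    they are used here only to simulate visitor behavior.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms  # visitors assigned to each arm
    wins = [0] * n_arms   # conversions observed on each arm

    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: assign the visitor to a random arm.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: assign the visitor to the arm with the best observed rate.
            estimates = [wins[a] / pulls[a] if pulls[a] else 0.0 for a in range(n_arms)]
            arm = max(range(n_arms), key=lambda a: estimates[a])
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:  # simulated conversion
            wins[arm] += 1
    return pulls


# Hypothetical test with three arms converting at 5%, 10%, and 30%.
print(epsilon_greedy([0.05, 0.10, 0.30], rounds=10_000))
```

Each assignment uses the data gathered so far, which is the "informed decision" the excerpt describes. The fixed exploration rate is the price of keeping the computation trivial.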

Data Science Languages

Alessandro Piva provides preliminary metrics on language usage among self-described data scientists:

Programming is one of the five main competence areas at the base of the skill set for a Data Scientist, even if it is not the most relevant in terms of expertise (see What is the right mix of competences for Data Scientists?). Considering the results of the survey, which has involved more than 200 Data Scientists worldwide to date, there isn’t a prevailing choice among the programming languages used in data science activities. However, the choice appears to be limited mainly to a small set of alternatives: almost 96% of respondents say they use at least one of R, SQL or Python.

These results don’t surprise me much.  R has slightly more traction than Python, but the percentage of people using both is likely to increase.  SQL, meanwhile, is vital for getting data, and as we’re seeing in the Hadoop space, as data platform products get more mature, they tend to gravitate toward a SQL or SQL-like language.  Cf. Hive, Spark SQL, Phoenix, etc.

The Data Science Delusion

Anand Ramanathan has a strong critique of “data science” as it stands today:

Illustration: Consider the sentiment-tagging task again. A Q1 resource uses an off-the-shelf model for movie reviews, and applies it to a new task (say, tweets about a customer service organization). Business is so blinded by spectacular charts [14] and anecdotal correlations (“Look at that spiteful tweet from a celebrity … so that’s why the sentiment is negative!”), that even questions about predictive accuracy are rarely asked until a few months down the road when the model is obviously floundering. Then too, there is rarely anyone to challenge the assumptions, biases and confidence intervals (Does the language in the tweets match the movie reviews? Do we have enough training data? Does the importance of tweets change over time?).

Overheard: “Survival analysis? Never heard of it … Wait … There is an R package for that!”

This is a really interesting article and I recommend reading it.

Price Optimization Using Decision Trees

Bernard Antwi Adabankah uses a decision tree to model price changes:

The sample included N = 262 individual orders for the Interlocking Hearts Design Cake Knife/Server set, with OrderItemSKU 2401, from the period ranging from 1st March 2014 to 20th April 2016, with an ecommerce company which sells on Amazon.co.uk.

The Profit response variable is measured as the product sale price on amazon.co.uk which includes amazon.co.uk commission and any applicable postage costs less the purchase price of the Hearts Design Cake Knife/Server set from the supplier.

Read the comments for a couple good critiques of the article.

Multi-Model Time Series Analysis

The folks at ELEKS discuss what to do when a single time series model just won’t cut it:

With the emergence of powerful forecasting methods based on Machine Learning, future predictions have become more accurate. In general, forecasting techniques can be grouped into two categories: qualitative and quantitative. Qualitative forecasts are applied when there is no data available and prediction is based only on expert judgement. Quantitative forecasts are based on time series modeling. This kind of model uses historical data and is especially efficient in forecasting events that occur over periods of time: for example prices, sales figures, volume of production, etc.

The existing models for time series prediction include the ARIMA models that are mainly used to model time series data without directly handling seasonality; VAR models, Holt-Winters seasonal methods, TAR models and others. Unfortunately, these algorithms may fail to deliver the required level of prediction accuracy, as they can involve raw data that might be incomplete, inconsistent or contain some errors. As quality decisions are based only on quality data, it is crucial to perform preprocessing to prepare entry information for further processing.

Treating time series data as a set of waveform functions can generate some very interesting results.

Data Curation

Christina Prevalsky makes the case for data curation:

The growing popularity of self-service analytical tools such as Tableau increases the necessity of having curated data in your database. These tools aim to allow the end users to intuitively query data “at the speed of thought” from the data warehouse and visualize the results quickly. That type of capability allows users to go through several different iterations of the data to really explore the data and generate unique insights. These tools do not work well when the underlying database tables have not been curated properly.

This is a difficult and lengthy process, but it’s vital; data minus context is a lot less relevant than you’d hope.

Analyzing The Simpsons

Todd Schneider has a fun analysis of The Simpsons:

Per Wikipedia:

While later seasons would focus on Homer, Bart was the lead character in most of the first three seasons

I’ve heard this argument before, that the show was originally about Bart before switching its focus to Homer, but the actual scripts only seem to partially support it.

Bart accounted for a significantly larger share of the show’s dialogue in season 1 than in any future season, but Homer’s share has always been higher than Bart’s. Dialogue share might not tell the whole story about a character’s prominence, but the fact is that Homer has always been the most talkative character on the show.

My reading is that it took a couple seasons for show writers to realize that Homer is the funniest character and that Bart’s character was too context-sensitive to be consistently funny.  It took quite a bit more time before merchandisers figured that out, to the extent that they ever did.

Levenshtein Distances

Peter Coates provides an extremely fast estimate of Levenshtein Distance:

If your application requires a precise LD value, this heuristic isn’t for you, but the estimates are typically within about 0.05 of the true distance, which is more than enough accuracy for such tasks as:

  • Confirming suspected near-duplication.

  • Estimating how much two documents vary.

  • Filtering through large numbers of documents to look for a near-match to some substantial block of text.

The estimation process is pretty interesting.  Worth a read.
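
For reference, the exact value that the heuristic approximates can be computed with the standard dynamic-programming algorithm. The Python sketch below is that textbook method, not the estimation technique from the article. The `normalized` helper reflects an assumption that the "0.05" figure refers to a distance scaled to the 0 to 1 range; the excerpt does not say so explicitly.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact Levenshtein distance: minimum single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized(a: str, b: str) -> float:
    """Distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))  # 3
```

The work grows with the product of the two lengths, which is why a fast estimate is attractive when comparing large documents.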
