Counterintuitively, numerical optimizations are easiest (though rarely actually easy) when all of the variables are continuous and can take any value. When integer variables enter the mix, optimization becomes much, much harder. This typically happens when the optimization is constrained by a limited selection of objects, for example packages in a weight-limited cargo shipment, or stocks in a portfolio constrained by sector weightings and transaction costs. For tasks like these, you often need an algorithm for a specialized type of optimization: Mixed Integer Programming.
For problems like these, Dirk Schumacher has created the ompr package for R. This package provides a convenient syntax for describing the variables and contraints in an optimization problem. For example, take the classic “knapsack” problem of maximizing the total value of objects in a container subject to its maximum weight limit.
Read the whole thing.
First, an annoyance. To be able to deploy Data Science virtual machines in Azure programmatically you first have to login to the portal and click some buttons.
In the Portal click new and then marketplace and then search for data science. Choose the Windows Data Science Machine and under the blue Create button you will see a link which says “Want to deploy programmatically? Get started” Clicking this will lead to the following blade.
Click through for a screenshot-laden explanation which leaves you with a working VM in Azure.
The functional idea behind a bandit algorithm is that you make an informed decision every time you assign a visitor to a test arm. Several bandit-type algorithms have been proved to be mathematically optimal; that is, they obtain the maximum future revenue given the data they have at any given point. Gittins indexing is perhaps the foremost of these algorithms. However, the trade-off of these methods is that they tend to be very computationally intensive.
This article doesn’t show any code, but it is useful for thinking about the problem.
Programming is one of the five main competence areas at the base of the skill set for a Data Scientist, even if is not the most relevant in term of expertise (see What is the right mix of competences for Data Scientists?). Considering the results of the survey, that involved more than 200 Data Scientist worldwide until today, there isn’t a prevailing choice among the programming languages used during the data science’s activities. However, the choice appears to be addressed mainly to a limited set of alternatives: almost 96% of respondents affirm to use at least one of R, SQL or Python.
These results don’t surprise me much. R has slightly more traction than Python, but the percentage of people using both is likely to increase. SQL, meanwhile, is vital for getting data, and as we’re seeing in the Hadoop space, as data platform products get more mature, they tend to gravitate toward a SQL or SQL-like language. Cf. Hive, Spark SQL, Phoenix, etc.
Illustration: Consider the sentiment-tagging task again. A Q1 resource uses an off-the-shelf model for movie reviews, and applies it to a new task (say, tweets about a customer service organization). Business is so blinded by spectacular charts  and anecdotal correlations (“Look at that spiteful tweet from a celebrity … so that’s why the sentiment is negative!”), that even questions about predictive accuracy are rarely asked until a few months down the road when the model is obviously floundering. Then too, there is rarely anyone to challenge the assumptions, biases and confidence intervals (Does the language in the tweets match the movie reviews? Do we have enough training data? Does the importance of tweets change over time?).
Overheard: “Survival analysis? Never heard of it … Wait … There is an R package for that!”
This is a really interesting article and I recommend reading it.
The sample included N = 262 individual orders for Interlocking Hearts Design Cake Knife/Server set with OrderItemSKU as 2401 from the period ranging from 1st March 2014 to 20th April 2016 with an ecommerce company which sells on Amazon.co.uk
The Profit response variable is measured as the product sale price on amazon.co.uk which includes amazon.co.uk commission and any applicable postage costs less the purchase price of the Hearts Design Cake Knife/Server set from the supplier.
Read the comments for a couple good critiques of the article.
With the emergence of the powerful forecasting methods based on Machine Learning, future predictions have become more accurate. In general, forecasting techniques can be grouped into two categories: qualitative and quantitative. Qualitative forecasts are applied when there is no data available and prediction is based only on expert judgement. Quantitative forecasts are based on time series modeling. This kind of models uses historical data and is especially efficient in forecasting some events that occur over periods of time: for example prices, sales figures, volume of production etc.
The existing models for time series prediction include the ARIMA models that are mainly used to model time series data without directly handling seasonality; VAR models, Holt-Winters seasonal methods, TAR modelsand other. Unfortunately, these algorithms may fail to deliver the required level of the prediction accuracy, as they can involve raw data that might be incomplete, inconsistent or contain some errors. As quality decisions are based only on quality data, it is crucial to perform preprocessing to prepare entry information for further processing.
Treating time series data as a set of waveform functions can generate some very interesting results.
The gaining popularity of self-service analytical tools such as Tableau increases the necessity of having curated data in your database. These tools aim to allow the end users to intuitively query data “at the speed of thought” from the data warehouse and visualize the results quickly. That type of capability allows users to go through several different iterations of the data to really explore the data and generate unique insights. These tools do not work well when the underlying database tables have not been curated properly.
This is a difficult and lengthy process, but it’s vital; data minus context is a lot less relevant than you’d hope.
While later seasons would focus on Homer, Bart was the lead character in most of the first three seasons
I’ve heard this argument before, that the show was originally about Bart before switching its focus to Homer, but the actual scripts only seem to partially support it.
Bart accounted for a significantly larger share of the show’s dialogue in season 1 than in any future season, but Homer’s share has always been higher than Bart’s. Dialogue share might not tell the whole story about a character’s prominence, but the fact is that Homer has always been the most talkative character on the show.
My reading is that it took a couple seasons for show writers to realize that Homer is the funniest character and that Bart’s character was too context-sensitive to be consistently funny. It took quite a bit more time before merchandisers figured that out, to the extent that they ever did.
If your application requires a precise LD value, this heuristic isn’t for you, but the estimates are typically within about 0.05 of the true distance, which is more than enough accuracy for such tasks as:
Confirming suspected near-duplication.
Estimating how much two document vary.
Filtering through large numbers of documents to look for a near-match to some substantial block of text.
The estimation process is pretty interesting. Worth a read.