Mastering Tools

The folks at Sharp Sight Labs explain that future obsolescence of a tool does not mean you should not master it:

The heart of his critique is this: data science is changing very fast, and any tool that you learn will eventually become obsolete.

This is absolutely true.

Every tool has a shelf life.

Every. single. one.

Moreover, it’s possible that tools are going to become obsolete more rapidly than in the past, because the world has just entered a period of rapid technological change. We can’t be certain, but if we’re in a period of rapid technological change, it seems plausible that toolset-changes will become more frequent.

The thing I would tie it to is George Stigler’s paper on information theory.  There’s a cost of knowing—which the commenter notes—but there’s also a cost to search, given the assumption that you know where to look.  Being effective in any role, be it data scientist or anything else, involves understanding the marginal benefit of pieces of information.  This blog post gives you a concrete example of that in the realm of data science.

vtreat

John Mount introduces vtreat, an R package for data preparation:

Our group is distributing a detailed write up of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form.

Read more about vtreat on the package page or the vtreat vignette.

Linux Data Science Virtual Machine

David Smith mentions the Linux data science virtual machine on Azure:

The Linux Data Science Virtual Machine includes all of the tools a modern data scientist needs, in one easy-to-launch package. With it, you can try exploring data with Apache Drill, train deep neural networks for computer vision with MXNet, develop AI applications with the Cognitive Toolkit, or create statistical models with big data in R with Microsoft R Server 9.0.

They also offer a free trial, so check it out.

Learning Versus Remembering

Via R-Bloggers, a discussion on learning versus remembering with respect to data science:

If you’re like most aspiring data scientists, you’ll try to learn this code by using the copy-and-paste method. You’ll take this code from a blog post like this, copy it into RStudio and run it.

Most aspiring data scientists do the exact same thing with online courses. They’ll watch a few videos, open the course’s sample code, and then copy-and-paste the code.

Watching videos, reading books, and copy-and-pasting code do help you learn, at least a little. If you watch a video about ggplot2, you’ll probably learn how it works pretty quickly. And if you copy-and-paste some ggplot2 code, you’ll probably learn a little bit about how the code works.

Here’s the problem: if you learn code like this, you’ll probably forget it within a day or two.

This is a thought-provoking article that applies to all disciplines, not just data science.

Mixed Integer Optimization

David Smith discusses the ompr package in R:

Counterintuitively, numerical optimizations are easiest (though rarely actually easy) when all of the variables are continuous and can take any value. When integer variables enter the mix, optimization becomes much, much harder. This typically happens when the optimization is constrained by a limited selection of objects, for example packages in a weight-limited cargo shipment, or stocks in a portfolio constrained by sector weightings and transaction costs. For tasks like these, you often need an algorithm for a specialized type of optimization: Mixed Integer Programming.

For problems like these, Dirk Schumacher has created the ompr package for R. This package provides a convenient syntax for describing the variables and contraints in an optimization problem. For example, take the classic “knapsack” problem of maximizing the total value of objects in a container subject to its maximum weight limit.

Read the whole thing.

Deploying VMs To Azure Using Powershell

Rob Sewell shows how to use Powershell to create your own Azure VM instance of the Microsoft data science virtual machine:

First, an annoyance. To be able to deploy Data Science virtual machines in Azure programmatically  you first have to login to the portal and click some buttons.

In the Portal click new and then marketplace and then search for data science. Choose the Windows Data Science Machine and under the blue Create button you will see a link which says “Want to deploy programmatically? Get started” Clicking this will lead to the following blade.

Click through for a screenshot-laden explanation which leaves you with a working VM in Azure.

Bandit Algorithms

Tanner Thompson describes usage of a multi-armed bandit algorithm to drive conversions:

The functional idea behind a bandit algorithm is that you make an informed decision every time you assign a visitor to a test arm. Several bandit-type algorithms have been proved to be mathematically optimal; that is, they obtain the maximum future revenue given the data they have at any given point. Gittins indexing is perhaps the foremost of these algorithms. However, the trade-off of these methods is that they tend to be very computationally intensive.

This article doesn’t show any code, but it is useful for thinking about the problem.

Data Science Languages

Alessandro Piva provides preliminary metrics on language usage among self-described data scientists:

Programming is one of the five main competence areas at the base of the skill set for a Data Scientist, even if is not the most relevant in term of expertise (see What is the right mix of competences for Data Scientists?). Considering the results of the survey, that involved more than 200 Data Scientist worldwide until today, there isn’t a prevailing choice among the programming languages used during the data science’s activities. However, the choice appears to be addressed mainly to a limited set of alternatives: almost 96% of respondents affirm to use at least one of R, SQL or Python.

These results don’t surprise me much.  R has slightly more traction than Python, but the percentage of people using both is likely to increase.  SQL, meanwhile, is vital for getting data, and as we’re seeing in the Hadoop space, as data platform products get more mature, they tend to gravitate toward a SQL or SQL-like language.  Cf. Hive, Spark SQL, Phoenix, etc.

The Data Science Delusion

Anand Ramanathan has a strong critique of “data science” as it stands today:

Illustration: Consider the sentiment-tagging task again. A Q1 resource uses an off-the-shelf model for movie reviews, and applies it to a new task (say, tweets about a customer service organization). Business is so blinded by spectacular charts [14] and anecdotal correlations (“Look at that spiteful tweet from a celebrity … so that’s why the sentiment is negative!”), that even questions about predictive accuracy are rarely asked until a few months down the road when the model is obviously floundering. Then too, there is rarely anyone to challenge the assumptions, biases and confidence intervals (Does the language in the tweets match the movie reviews? Do we have enough training data? Does the importance of tweets change over time?).

Overheard“Survival analysis? Never heard of it … Wait … There is an R package for that!”

This is a really interesting article and I recommend reading it.

Price Optimization Using Decision Trees

Bernard Antwi Adabankah uses a decision tree to model price changes:

The sample included N = 262 individual orders for Interlocking Hearts Design Cake Knife/Server set with OrderItemSKU as 2401 from the period ranging from 1st March 2014 to 20th April 2016 with an ecommerce company which sells on Amazon.co.uk

The Profit response variable is measured as the product sale price on amazon.co.uk which includes amazon.co.uk commission and any applicable postage costs less the purchase price of the Hearts Design Cake Knife/Server set from the supplier.

Read the comments for a couple good critiques of the article.

Categories

March 2019
MTWTFSS
« Feb  
 123
45678910
11121314151617
18192021222324
25262728293031