Press "Enter" to skip to content

Category: Data Science

Forecasting Restaurant Inspection Failures

David Smith writes about an R model that predicts which restaurants are more likely to fail inspection:

Chicago’s Department of Public Health used the R language to build and deploy the model, and made the code available as an open source project on GitHub. The reasons given are twofold:

An open source approach helps build a foundation for other models attempting to forecast violations at food establishments. The analytic code is written in R, an open source, widely-known programming language for statisticians. There is no need for expensive software licenses to view and run this code.

Read on for more details and check out their GitHub repo.


Basic Non-Linear Regression In R

Renata Ghisloti Duarte de Souza gives an example of running a non-linear regression in R:

Now, suppose you were able to find a good function to model your data. With that, we are able to predict future values for our small dataset.

One important thing about the predict() function in R is that it expects a similar dataframe with the same column name and type as the one you used in your model.
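
To make that point concrete, here is a minimal sketch of my own (not taken from the linked post): fit a non-linear model with nls() and then call predict() on a new data frame whose column name and type match the predictor used in the model formula.

# Toy data: a noisy exponential decay
set.seed(42)
df <- data.frame(x = 1:20)
df$y <- 5 * exp(-0.3 * df$x) + rnorm(20, sd = 0.1)

# nls() fits a non-linear least squares model and needs starting values
fit <- nls(y ~ a * exp(b * x), data = df, start = list(a = 4, b = -0.2))

# predict() expects a data frame with the same column name ("x") and type
newdata <- data.frame(x = 21:25)
predict(fit, newdata = newdata)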

Click through for several examples.


Mastering Tools

The folks at Sharp Sight Labs explain that future obsolescence of a tool does not mean you should not master it:

The heart of his critique is this: data science is changing very fast, and any tool that you learn will eventually become obsolete.

This is absolutely true.

Every tool has a shelf life.

Every. single. one.

Moreover, it’s possible that tools are going to become obsolete more rapidly than in the past, because the world has just entered a period of rapid technological change. We can’t be certain, but if we’re in a period of rapid technological change, it seems plausible that toolset-changes will become more frequent.

The thing I would tie it to is George Stigler’s paper on the economics of information.  There’s a cost of knowing (which the commenter notes), but there’s also a cost of search, even when you already know where to look.  Being effective in any role, be it data scientist or anything else, involves understanding the marginal benefit of each additional piece of information.  This blog post gives you a concrete example of that in the realm of data science.


vtreat

John Mount introduces vtreat, an R package for data preparation:

Our group is distributing a detailed write up of the theory and operation behind our R realization of a set of sound data preparation and cleaning procedures called vtreat here: arXiv:1611.09477 [stat.AP]. This is where you can find out what vtreat does, decide if it is appropriate for your problem, or even find a specification allowing the use of the techniques in non-R environments (such as Python/Pandas/scikit-learn, Spark, and many others).

We have submitted this article for formal publication, so it is our intent you can cite this article (as it stands) in scientific work as a pre-print, and later cite it from a formally refereed source.

Or alternately, below is the tl;dr (“too long; didn’t read”) form.

Read more about vtreat on the package page or the vtreat vignette.
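
For orientation, here is a minimal sketch of the basic workflow (my own toy data, not an example from the write-up): design a treatment plan on training data, then use that plan to prepare data for modeling.

library(vtreat)

# Toy data: a categorical variable, a numeric variable with an NA,
# and a binary outcome
d <- data.frame(
  x1 = c("a", "b", NA, "c", "a", "b"),
  x2 = c(1.2, NA, 3.1, 0.5, 2.2, 1.8),
  y  = c(1, 0, 1, 0, 1, 0)
)

# designTreatmentsC() builds a treatment plan for a binary ("C") outcome
plan <- designTreatmentsC(d, varlist = c("x1", "x2"),
                          outcomename = "y", outcometarget = 1)

# prepare() returns an all-numeric frame with NAs and novel levels handled
d_treated <- prepare(plan, d, pruneSig = NULL)
head(d_treated)

The write-up linked above explains what vtreat is doing behind the scenes and when it is appropriate for your problem.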


Linux Data Science Virtual Machine

David Smith mentions the Linux Data Science Virtual Machine on Azure:

The Linux Data Science Virtual Machine includes all of the tools a modern data scientist needs, in one easy-to-launch package. With it, you can try exploring data with Apache Drill, train deep neural networks for computer vision with MXNet, develop AI applications with the Cognitive Toolkit, or create statistical models with big data in R with Microsoft R Server 9.0.

They also offer a free trial, so check it out.


Learning Versus Remembering

Via R-Bloggers, a discussion on learning versus remembering with respect to data science:

If you’re like most aspiring data scientists, you’ll try to learn this code by using the copy-and-paste method. You’ll take this code from a blog post like this, copy it into RStudio and run it.

Most aspiring data scientists do the exact same thing with online courses. They’ll watch a few videos, open the course’s sample code, and then copy-and-paste the code.

Watching videos, reading books, and copy-and-pasting code do help you learn, at least a little. If you watch a video about ggplot2, you’ll probably learn how it works pretty quickly. And if you copy-and-paste some ggplot2 code, you’ll probably learn a little bit about how the code works.

Here’s the problem: if you learn code like this, you’ll probably forget it within a day or two.

This is a thought-provoking article that applies to all disciplines, not just data science.


Mixed Integer Optimization

David Smith discusses the ompr package in R:

Counterintuitively, numerical optimizations are easiest (though rarely actually easy) when all of the variables are continuous and can take any value. When integer variables enter the mix, optimization becomes much, much harder. This typically happens when the optimization is constrained by a limited selection of objects, for example packages in a weight-limited cargo shipment, or stocks in a portfolio constrained by sector weightings and transaction costs. For tasks like these, you often need an algorithm for a specialized type of optimization: Mixed Integer Programming.

For problems like these, Dirk Schumacher has created the ompr package for R. This package provides a convenient syntax for describing the variables and constraints in an optimization problem. For example, take the classic “knapsack” problem of maximizing the total value of objects in a container subject to its maximum weight limit.
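
To give a flavor of that syntax, here is a rough sketch of the knapsack formulation (my own made-up values, weights, and capacity; the linked post has the full treatment). It assumes the ompr.roi and ROI.plugin.glpk packages are available to supply a solver.

library(dplyr)
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

n        <- 5
value    <- c(10, 13, 18, 31, 7)   # value of each object
weight   <- c(2, 3, 4, 6, 1)       # weight of each object
capacity <- 10                     # weight limit of the container

result <- MIPModel() %>%
  # one binary decision variable per object: take it (1) or leave it (0)
  add_variable(x[i], i = 1:n, type = "binary") %>%
  # maximize the total value of the chosen objects
  set_objective(sum_expr(value[i] * x[i], i = 1:n), "max") %>%
  # the total weight must not exceed the capacity
  add_constraint(sum_expr(weight[i] * x[i], i = 1:n) <= capacity) %>%
  solve_model(with_ROI(solver = "glpk"))

get_solution(result, x[i])   # which objects made it into the knapsack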

Read the whole thing.


Deploying VMs To Azure Using PowerShell

Rob Sewell shows how to use PowerShell to create your own Azure VM instance of the Microsoft Data Science Virtual Machine:

First, an annoyance. To be able to deploy Data Science virtual machines in Azure programmatically, you first have to log in to the portal and click some buttons.

In the Portal, click New, then Marketplace, and then search for data science. Choose the Windows Data Science Machine, and under the blue Create button you will see a link which says “Want to deploy programmatically? Get started”. Clicking this will lead to the following blade.

Click through for a screenshot-laden explanation which leaves you with a working VM in Azure.


Bandit Algorithms

Tanner Thompson describes using a multi-armed bandit algorithm to drive conversions:

The functional idea behind a bandit algorithm is that you make an informed decision every time you assign a visitor to a test arm. Several bandit-type algorithms have been proved to be mathematically optimal; that is, they obtain the maximum future revenue given the data they have at any given point. Gittins indexing is perhaps the foremost of these algorithms. However, the trade-off of these methods is that they tend to be very computationally intensive.

This article doesn’t show any code, but it is useful for thinking about the problem.
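
If you want something concrete to play with, below is a toy sketch of one common bandit approach, Thompson sampling with Beta priors (my own illustration, not the Gittins indexing the article discusses). Each visitor is assigned to whichever arm has the highest sampled conversion rate, so the allocation shifts toward the better arm as evidence accumulates.

set.seed(1)
true_rates <- c(A = 0.05, B = 0.08)   # unknown in a real test
successes  <- c(A = 0, B = 0)
failures   <- c(A = 0, B = 0)

for (visitor in 1:5000) {
  # draw a plausible conversion rate for each arm from its Beta posterior
  draws <- rbeta(2, successes + 1, failures + 1)
  arm   <- which.max(draws)

  # simulate whether this visitor converts on the chosen arm
  converted <- rbinom(1, 1, true_rates[arm])
  successes[arm] <- successes[arm] + converted
  failures[arm]  <- failures[arm] + (1 - converted)
}

successes / (successes + failures)   # observed conversion rate per arm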


Data Science Languages

Alessandro Piva provides preliminary metrics on language usage among self-described data scientists:

Programming is one of the five main competence areas at the base of the skill set for a Data Scientist, even if it is not the most relevant in terms of expertise (see What is the right mix of competences for Data Scientists?). Considering the results of the survey, which has involved more than 200 Data Scientists worldwide to date, there isn’t a prevailing choice among the programming languages used for data science activities. However, the choice appears to come down mainly to a limited set of alternatives: almost 96% of respondents say they use at least one of R, SQL, or Python.

These results don’t surprise me much.  R has slightly more traction than Python, but the percentage of people using both is likely to increase.  SQL, meanwhile, is vital for getting data, and as we’re seeing in the Hadoop space, as data platform products get more mature, they tend to gravitate toward a SQL or SQL-like language.  Cf. Hive, Spark SQL, Phoenix, etc.
