Press "Enter" to skip to content

Category: R

Analyzing Taxi Data With Microsoft R Server

Ali Zaidi builds a Spark cluster to analyze 1.1 billion taxi cab rides using Microsoft R Server:

In a similar spirit to how sparklyr allowed us to reuse our functions from the dplyr package to manipulate Spark DataFrames, the RxSpark API allows a data scientist to develop code that can be deployed in a multitude of environments. This allows the developer to shift their focus from writing code that’s specific to a certain environment, and instead focus on the complex analysis of their data science problem. We call this flexibility Write Once, Deploy Anywhere, or WODA for the acronym lovers.

For a deeper dive into the RevoScaleR package, I recommend you take a look at the online course, Analyzing Big Data with Microsoft R Server. Much of this blogpost follows along the last section of the course, on deployment to Spark.
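To make WODA concrete, here is a minimal RevoScaleR sketch (the file paths, variable names, and formula are hypothetical): the rxLinMod call is identical in both environments, and only the compute context and data source change.

```r
library(RevoScaleR)

# Develop locally against a sample of the data
rxSetComputeContext("local")
taxiLocal <- RxXdfData("taxi_sample.xdf")  # hypothetical local sample
model <- rxLinMod(tip_amount ~ trip_distance + passenger_count,
                  data = taxiLocal)

# Deploy the same modeling code to a Spark cluster: only the
# compute context and the data source location change
rxSetComputeContext(RxSpark(consoleOutput = TRUE))
taxiHdfs <- RxXdfData("/user/RevoShare/taxi",
                      fileSystem = RxHdfsFileSystem())
model <- rxLinMod(tip_amount ~ trip_distance + passenger_count,
                  data = taxiHdfs)
```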

R isn’t just for small, one-off jobs anymore.


Data Science Languages

Alessandro Piva provides preliminary metrics on language usage among self-described data scientists:

Programming is one of the five main competence areas at the base of the skill set for a Data Scientist, even if it is not the most relevant in terms of expertise (see What is the right mix of competences for Data Scientists?). Considering the results of the survey, which has involved more than 200 Data Scientists worldwide to date, there isn’t a prevailing choice among the programming languages used in data science work. However, the choice appears to be confined mainly to a limited set of alternatives: almost 96% of respondents report using at least one of R, SQL, or Python.

These results don’t surprise me much.  R has slightly more traction than Python, but the percentage of people using both is likely to increase.  SQL, meanwhile, is vital for getting data, and as we’re seeing in the Hadoop space, data platform products tend to gravitate toward a SQL or SQL-like language as they mature.  Cf. Hive, Spark SQL, Phoenix, etc.


Interactive Decision Trees

Longhow Lam describes the interactive decision tree in Microsoft R Server 9.0:

Despite all the more modern machine learning algorithms, a good old single decision tree can still be useful. Moreover, in a business analytics context they can still keep up in predictive power. In the last few months I have created different predictive response and churn models. I usually just try different learners: logistic regression models, single trees, boosted trees, several neural nets, random forests. In my experience a single decision tree is usually ‘not bad’, often with only slightly less predictive power than the fancier algorithms.

An important thing in analytics is that you can ‘sell‘ your predictive model to the business. A single decision tree is a good way to do just that, and with an interactive decision tree (created by Microsoft R) this becomes even easier.
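As a rough sketch of the workflow (the churn model variables here are hypothetical), you fit a tree with RevoScaleR’s rxDTree and then hand it to the RevoTreeView package that ships with Microsoft R Server to get the interactive HTML rendering:

```r
library(RevoScaleR)
library(RevoTreeView)

# Fit a single decision tree with RevoScaleR's scalable tree learner
churnTree <- rxDTree(churned ~ tenure + monthly_spend + support_calls,
                     data = churnData, maxDepth = 5)

# Render the fitted tree as an interactive HTML page in the browser
plot(createTreeView(churnTree))
```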

I’d like the labels in Longhow’s tree to be a little clearer, but I do like this from the perspective of giving end users something to experience.


Microsoft R Server 9.0

David Smith reports that Microsoft R Server 9.0 is now available:

Microsoft R Server 9.0, Microsoft’s R distribution with added big-data, in-database, and integration capabilities, was released today and is now available for download to MSDN subscribers. This latest release is built on Microsoft R Open 3.3.2, and adds new machine-learning capabilities, new ways to integrate R into applications, and additional big-data support for Spark 2.0.

There are also new versions of Microsoft R Client and Microsoft R Open.


R + Power Query

Ryan Wade makes his argument that R can be more powerful than M inside Power Query:

I want to leave you with two more things. If you look at the trade balance data set you will notice that it is not in a good format for data analysis. Here is a link to the file if you want to take a closer look. When you are doing data analysis you want your data to be in a “tidy” format. A “tidy” format means that each column represents a variable and each row represents an observation. To make this data set “tidy” you need to reformat the data into the following format: Country, Year, Trade Balance, Exports, and Imports.
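As a rough sketch of that reshaping in R (the countries and figures below are invented for illustration), tidyr’s gather and spread do the work: gather melts the year columns into rows, and spread pivots the measure names back out into columns.

```r
library(tidyr)
library(dplyr)

# Hypothetical wide layout: one row per country and measure, years as columns
raw <- data.frame(
  Country = c("Canada", "Canada", "Canada"),
  Measure = c("Exports", "Imports", "Trade Balance"),
  `2015`  = c(280, 296, -16),
  `2016`  = c(266, 278, -12),
  check.names = FALSE
)

tidy <- raw %>%
  gather(key = "Year", value = "Value", `2015`, `2016`) %>%  # years into rows
  spread(key = "Measure", value = "Value")                   # measures into columns
# Columns are now: Country, Year, Exports, Imports, Trade Balance
```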

This was an interesting example.


Multivariate Analysis In R

Mala Mahadevan looks at using R to describe data sets with two explanatory variables:

From the plot we can see that type 3 trees have the smallest circumference while type 4 have the largest, with type 2 close to type 4. We can also see that type 1 trees have the thinnest dispersion of circumference while type 4 has the highest, closely followed by type 2.  We can also see that there are no significant outliers in this data.
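For reference, this kind of comparison is a grouped boxplot, which is a one-liner in base R. The built-in Orange data set (tree number versus trunk circumference) has roughly the same shape as the data described above, so it serves as a stand-in here:

```r
# One box per tree type: the box shows the spread of circumference,
# the whiskers and any points beyond them flag potential outliers
boxplot(circumference ~ Tree, data = Orange,
        xlab = "Tree type", ylab = "Circumference (mm)")
```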

Understanding whether variables are categorical or continuous is vital to understanding what you can and should do with them.


Custom R Visuals In Power BI

Ginger Grant notes that there are R-powered custom visuals for Power BI:

Interacting with R visuals works differently than with other report visualizations, as you cannot click on elements within the visualization and filter other items on the page. Other visuals on the page will filter the data contained within the R visual. For example, let’s say my report contains a total field, a slicer which contains years, and a correlation plot which contains products. If the slicer is changed to select a year, the total field and the data within the R visual will change to reflect that. If, on the other hand, I click on the R visual to select one of the product categories, neither the total field nor the R visual will change in any way.
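For context, an R visual in Power BI is just an R script that receives the fields you drag onto it as a data frame named dataset and draws a static plot. A correlation plot like the one in Ginger’s example might look something like this sketch (the numeric-column filtering is illustrative):

```r
# Power BI passes the visual's fields in as a data frame called `dataset`
library(corrplot)

# Keep only the numeric fields before computing correlations
numeric_cols <- dataset[sapply(dataset, is.numeric)]
corrplot(cor(numeric_cols, use = "pairwise.complete.obs"))
```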

Read on for more.


Data Wrangling: R Versus M

Ryan Wade argues that R is a better language choice for working with data in Power BI than M:

Now let’s do something that I think is pretty slick. Let’s create a data set that combines the home games of the Pacers (IND) and the home games of the Hawks (ATL). Given the naming convention used by the files, we will have to identify the files in our working directory that start with eight numeric digits, then a period, then a three-character team abbreviation for the away team, then either “ATL” or “IND”, then finally “.csv”. We can create a regular expression to find the files that match that pattern. I did so in the code below:
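Ryan’s actual code is in the linked post, but a sketch of the pattern he describes might look like this (the sample filename in the comment is hypothetical):

```r
# Filenames look like "20161102.CHIIND.csv": eight digits, a period,
# the three-letter away team, the home team (ATL or IND), then ".csv"
home_games <- list.files(pattern = "^[0-9]{8}\\.[A-Z]{3}(ATL|IND)\\.csv$")

# Read each matching file and stack them into a single data frame
combined <- do.call(rbind, lapply(home_games, read.csv))
```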

I’m interested in catching the rest of the series.  This is a controversial statement that I’m not entirely sold on yet, but Ryan does set the stage for his full argument.


Tabulizer

Troy Walters uses the Tabulizer package to extract tables from a PDF and turn them into R matrices or data frames:

Next we will use the extract_tables() function from tabulizer. First, I specify the url of the pdf file from which I want to extract a table. This pdf link includes the most recent data, covering the period from July 1, 2016 to November 25, 2016. I am using the default parameters for extract_tables. These are guess and method. I’ll leave guess set to TRUE, which tells tabulizer that we want it to figure out the locations of the tables on its own. We could set this to FALSE if we want to have more granular control, but for this application we don’t need to. We leave the method argument set to “matrix”, which will return a list of matrices (one for each pdf page). This could also be set to return data frames instead.
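A minimal sketch of that call, with a placeholder URL standing in for the real report:

```r
library(tabulizer)

# guess = TRUE lets tabulizer locate the tables on each page;
# method = "matrix" returns a list with one matrix per PDF page
pdf_url <- "https://example.com/report.pdf"  # placeholder
tables  <- extract_tables(pdf_url, guess = TRUE, method = "matrix")

# Convert the first page's matrix into a data frame for further cleanup
first_page <- as.data.frame(tables[[1]], stringsAsFactors = FALSE)
```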

This is nice.  I have to imagine it only works for text-based PDFs and not ones which are generated from a series of images.


Solving The German Tank Problem

Frank Portman shows how to figure out how many taxicabs—or tanks—there are:

For the uninitiated, the Taxicab / German Tank problem is as follows:

Viewing a city from the train, you see a taxi numbered x. Assuming taxicabs are consecutively numbered, how many taxicabs are in the city?

This was also applied to counting German tanks in World War II to know when/if to attack. Statistical methods ended up being accurate to within a few tanks (on a scale of 200-300) while “intelligence” (unintelligence) operations overestimated numbers by about 6-7x. Read the full details on Wikipedia here (and donate while you’re over there).
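The classic frequentist answer is the minimum-variance unbiased estimator N̂ = m + m/k − 1, where m is the largest serial number observed and k is the sample size. A quick sketch in R (Frank’s post goes much deeper):

```r
# Minimum-variance unbiased estimator for the German tank problem:
# N_hat = m + m/k - 1, with m the sample maximum and k the sample size
estimate_total <- function(serials) {
  m <- max(serials)
  k <- length(serials)
  m + m / k - 1
}

# Example: observing serial numbers 19, 40, 42 and 60 suggests ~74 in total
estimate_total(c(19, 40, 42, 60))  # 74
```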

Click through for the solution and how to implement it in R.
