Category: R

ggplot2 Basics

Published 2017-10-30 by Kevin Feasel

Bharani Akella has an introduction to ggplot2:

Plot10: Scatter-plot

ggplot(data = mtcars,aes(x=mpg,y=hp,col=factor(cyl)))+geom_point()

mpg(miles/galloon) is assigned to the x-axis
hp(Horsepower) is assigned to the y-axis
factor(cyl) {Number of cylinders} determines the color
The geometry used is scatter plot. We can create a scatter plot by using the geom_point() function.

He has a number of similar examples showing several variations on bar, line, and scatterplot charts.

Comments closed

Data Set Robustness

Published 2017-10-26 by Kevin Feasel

Tomaz Kastrun shows how robust the iris data set is:

Conclusion, IRIS dataset is – due to the nature of the measurments and observations – robust and rigid; one can get very good accuracy results on a small training set. Everything beyond 30% for training the model, is for this particular case, just additional overload.

The general concept here is, how small can you arbitrarily slice the data and still come up with the same result as the overall data set? Or, phrased differently, how much data do you need to collect before predictions stabilize? Read on to see how Tomaz solves the problem.

Comments closed

The Correct Way To Load Libraries In R

Published 2017-10-24 by Kevin Feasel

Gerald Belton opens a can of worms:

When I was an R newbie, I was taught to load packages by using the command library(package). In my Linear Models class, the instructor likes to use require(package). This made me wonder, are the commands interchangeable? What’s the difference, and which command should I use?

Interchangeable commands . . .

The way most users will use these commands, most of the time, they are actually interchangeable. That is, if you are loading a library that has already been installed, and you are using the command outside of a function definition, then it makes no difference if you use “require” or “library.” They do the same thing.

… Well, almost interchangeable

Read on to understand the differences between the two. I end up doing something very similar to his code snippet for exactly the reason he describes. H/T R-Bloggers

Comments closed

More R Services Internals Spelunking

Published 2017-10-23 by Kevin Feasel

Niels Berglund continues his series on R Services internals:

What happens is that straight after the AuthenticateConnection you will hit the WriteAsync breakpoint twice, the same way as we see in Code Snippet 4. The first hit at WriteAsync sort of makes sense, as it ties up with the call-stack we see in Code Snippet 5. But what about the second WriteAsync (the one that causes the second package to be sent), where does that come from? To try to figure that out, we start with the call-stack for that particular WriteAsync.

Execute the code in Code Snippet 1 again, and continue to the second WriteAsync. When you hit the breakpoint do a kc again. The call-stack should now look somewhat like this (this time it is the full call-stack):

It’s interesting the kind of stuff you can find with Wireshark and a debugger.

Comments closed

Using Service Broker To Queue Up External Script Calls

Published 2017-10-20 by Kevin Feasel

Arvind Shyamsundar shows how to use Service Broker to run external R or Python scripts based on new data coming into a transactional system:

Here, we will show you how you can use the asynchronous execution mechanism offered by SQL Server Service Broker to ‘queue’ up data inside SQL Server which can then be asynchronously passed to a Python script, and the results of that Python script then stored back into SQL Server.

This is effectively similar to the external message queue pattern but has some key advantages:

The solution is integrated within the data store, leading to fewer moving parts and lower complexity

Because the solution is in-database, we don’t need to make copies of the data. We just need to know what data has to be processed (effectively a ‘pointer to the data’ is what we need).

Service Broker also offers options to govern the number of readers of the queue, thereby ensuring predictable throughput without affecting core database operations.

There are several interconnected parts here, and Arvind walks through the entire scenario.

Comments closed

Reasons For Using Docker With R

Published 2017-10-17 by Kevin Feasel

Jeroen Ooms gives us a few reasons why we might want to containerize our R-based products:

The flagship of the OpenCPU system is the OpenCPU server: a mature and powerful Linux stack for embedding R in systems and applications. Because OpenCPU is completely open source we can build and ship on DockerHub. A ready-to-go linux server with both OpenCPU and RStudio can be started using the following (use port 8004 or 80):
docker run -t -p 8004:8004 opencpu/rstudio
Now simply open http://localhost:8004/ocpu/ and http://localhost:8004/rstudio/ in your browser! Login via rstudio with user: opencpu (passwd: opencpu) to build or install apps. See the readme for more info.

This is in the context of one particular product, but the reasons fit other scenarios too. H/T R-Bloggers

Comments closed

Linear Discriminant Analysis

Published 2017-10-13 by Kevin Feasel

Jake Hoare explains Linear Discriminant Analysis:

Linear Discriminant Analysis takes a data set of cases (also known as observations) as input. For each case, you need to have a categorical variable to define the class and several predictor variables (which are numeric). We often visualize this input data as a matrix, such as shown below, with each case being a row and each variable a column. In this example, the categorical variable is called “class” and the predictive variables (which are numeric) are the other columns.

Following this is a clear example of how to use LDA. This post is also the second time this week somebody has suggested The Elements of Statistical Learning, so I probably should make time to look at the book.

Comments closed

Azure Data Lake Store File Management With httr

Published 2017-10-12 by Kevin Feasel

Leila Etaati shows how to generate RESTful statements in R using httr:

In this post, I am going to share my experiment in how to do file management in ADLS using R studio,

to do this you need to have below items

1. An Azure subscription

2. Create an Azure Data Lake Store Account

3. Create an Azure Active Directory Application (for the aim of service-to-service authentication).

4. An Authorization Token from Azure Active Directory Application

It’s pretty easy to do, as Leila shows.

Comments closed

Using R In Azure Data Lake Analytics

Published 2017-10-12 by Kevin Feasel

David Smith links to a tutorial which shows how to use R against Azure Data Lake Analytics:

The Azure Data Lake store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data within directly via the API, by connecting the filesystem directly to Azure HDInsight services, or via HDFS-compatible open-source applications. And for data science applications, you can also access the data directly from R, as this tutorial explains.

To interface with Azure Data Lake, you’ll use U-SQL, a SQL-like language extensible using C#. The R Extensions for U-SQL allow you to reference an R script from a U-SQL statement, and pass data from Data Lake into the R Script. There’s a 500Mb limit for the data passed to R, but the basic idea is that you perform the main data munging tasks in U-SQL, and then pass the prepared data to R for analysis. With this data you can use any function from base R or any R package. (Several common R packages are provided in the environment, or you can upload and install other packages directly, or use the checkpoint package to install everything you need.) The R engine used is R 3.2.2.

Click through for the details.

Comments closed

Linear Regression With Deducer

Published 2017-10-12 by Kevin Feasel

Sunil Kappal demonstrates how to use Deducer, a GUI for R, to perform a simple linear regression:

Selecting the variables in the Deducer GUI:

Outcome variable: Y, or the dependent variable, should be put on this list
As numeric: Independent variables that should be treated as covariates should be put in this section. Deducer automatically converts a factor into a numeric variable, so make sure that the order of the factor level is correct
As factor: Categorically independent variables (language, ethnicity, etc.).
Weights: This option allows the users to apply sampling weights to the regression model.
Subset: Helps to define if the analysis needs to be done within a subset of the whole dataset.

Deducer is open source and looks like a pretty decent way of seeing what’s available to you in R.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31