Somehow I have become the R DBA at my job which I don’t mind, I plan on taking Microsoft’s Professional Program on Data Science to be familiar with it. But recently I’ve had to upload files to our R servers which the first time wasn’t too bad. Copy these files to six different servers but come the second time around it became apparent that the Predictive Analytics Manger was going to be asking me to do this more frequently than I wanted to to it manually. So I wrote a quick PowerShell function to take care of this added to our module we use in house. It unzips the file provided to the correct location. It does assume you have administrative rights to your server i.e. you can use the admin shares (c$) for example on the server. You will need to get the function Get-CMSHost from my Running SQL Scripts Against Multiple Servers Using PowerShell post to run the code below.
Click through for the script. This is particularly useful for deploying in-house packages and you don’t want to set up a miniCRAN.
In the formula above the interaction effect is, of course, dosegendertype. The ANOVA results can be seen below (we have only kept the line presenting the third-order interaction effect).Df Sum Sq Mean Sq F value Pr(>F) dose:gender:type 2 187 93.4 22.367 3.81e-10
The interaction effect is statistically significant: F(2)=22.367, p<0.01. In other words, we do have a third-order interaction effect. In this situation, it is not advisable to report and interpret the second-order interaction effects (they could be misleading). Therefore, we are going to compute the simple second-order interaction effects.
This is definitely not a trivial article, but there are useful techniques in it.
It has been a long held dream of mine to create a spinning globe using nothing but R (I wish I was joking, but I’m not). Thanks to the brilliant mapmate package created by Matt Leonawicz and shed loads of computing power, today that dream became a reality. The globe below took 19 hours and 30 processors to produce from a relatively low resolution NASA black marble data, and so I accept R is not the best software to be using for this – but it’s amazing that you can do this in R at all!
Now all that is missing is a giant TV and an evil lair.
Basically speaking, the “http user” will be used to authenticate through the HDInsight gateway, which is used to protect the HDInsight clusters you created. This user is used to access the Ambari UI, YARN UI, as well as many other UI components.
The “ssh user” will be used to access the cluster through secure shell. This user is actually a user in the Linux system in all the head nodes, worker nodes, edge nodes, etc., so you can use secure shell to access the remote clusters.
For Microsoft R Server on HDInsight type cluster, it’s a bit more complex, because we put R Studio Server Community version in HDInsight, which only accepts Linux user name and password as login mechanisms (it does not support passing tokens), so if you have created a new cluster and want to use R Studio, you need to first login using the http user’s credential and login through the HDInsight Gateway, and then use the ssh user’s credential to login to RStudio.
It’s a good read and also includes a sample Spark-R job.
I have been a practicing data scientist with an emphasis on predictive modeling for about 16 years. I know enough R to be dangerous but when I want to build a model I reach for my SAS Enterprise Miner (could just as easily be SPSS, Rapid Miner or one of the other complete platforms).
The key issue is that I can clean, prep, transform, engineer features, select features, and run 10 or more model types simultaneously in less than 60 minutes (sometimes a lot less) and get back a nice display of the most accurate and robust model along with exportable code in my selection of languages.
The reason I can do that is because these advanced platforms now all have drag-and-drop visual workspaces into which I deploy and rapidly adjust each major element of the modeling process without ever touching a line of code.
I have almost exactly the opposite thought on the matter: that drag-and-drop development is intolerably slow; I can drag and drop and connect and click and click and click for a while, or I can write a few lines of code. Nevertheless, I think Bill’s post is well worth reading.
Simon Jackson has a couple posts on how to use ggplot2 to make graphs prettier. First, histograms:
Time to jazz it up with colour! The method I’ll present was motivated by my answer to this StackOverflow question.
We can add colour by exploiting the way that ggplot2 stacks colour for different groups. Specifically, we fill the bars with the same variable (
cutinto multiple categories:
Shape and size
There are many ways to tweak the
sizeof the points. Here’s the combination I settled on for this post:
There are some nice tricks here around transparency, color scheme, and gradients, making it a great series. As a quick note, this color scheme in the histogram headliner photo does not work at all for people with red-green color-blindness. Using a URL color filter like Toptal’s is quite helpful in discovering these sorts of issues.
The new Visual Studio 2017 has built-in support for programming in R and Python. For older versions of Visual Studio, support for these languages has been available via the RTVS and PTVS add-ins, but the new Data Science Workloads in Visual Studio 2017 make them available without a separate add-in. Just choose the “Data Science and analytical applications” option during installation to install everything you need, including Microsoft R Client and the Anaconda Python distribution.
I’m personally going to wait a little bit before jumping onto Visual Studio 2017, but I’m glad that RTVS is now available.
Looking at this data, the first thing I thought was untidy. There has to be a better way. When I think of tidy data, I think of the tidyr package, which is used to help make data tidy, easier to work with. Specifically, I thought of the
spread()function, where I could break things up. Once data was spread into appropriate columns, I figure I can operate on the data a bit better.
Sarah has also made the data set available in case you’re interested in following along.
Call Detail Record (CDR) is the information captured by the telecom companies during Call, SMS, and Internet activity of a customer. This information provides greater insights about the customer’s needs when used with customer demographics. Most of the telecom companies use CDR information for fraud detection by clustering the user profiles, reducing customer churn by usage activity, and targeting the profitable customers by using RFM analysis.
In this blog, we will discuss about clustering of the customer activities for 24 hours by using unsupervised K-means clustering algorithm. It is used to understand segment of customers with respect to their usage by hours.
For example, customer segment with high activity may generate more revenue. Customer segment with high activity in the night hours might be fraud ones.
This article won’t really explain k-means clustering in any detail, but it does give you an example to apply the technique using R.
Looking at package names, one strategy that I commonly observe is to use small words, a verb or noun, and add the letter R to it. A good example is
dstands for dataframe, ply is just a tool, and R is, well, you know. In a conventional sense, the name of this popular tool is informative and easy to remember. As always, the extremes are never good. A couple of bad examples of package naming are
BBand so on. Googling the package name is definitely not helpful. On the other end, package
samplesizelogisticcasecontrolprovides a lot of information but it is plain unattractive!
Another strategy that I also find interesting is developers using names that, on first sight, are completely unrelated to the purpose of the package. But, there is a not so obvious link. One example is package
sandwich. At first sight, I challenge anyone to figure out what it does. This is an econometric package that computes robust standard errors in a regression model. These robust estimates are also called sandwich estimators because the formula looks like a sandwich. But, you only know that if you studied a bit of econometric theory. This strategy works because it is easier to remember things that surprise us. Another great example is package
janitor. I’m sure you already suspect that it has something do to with data cleaning. And you are right! The message of the name is effortless and it works! The author even got the privilege of using letter R in the name.
Marcelo uses word and character analysis to come up with his conclusions, making this a good way of seeing how to graph and slice data. h/t R-bloggers