I have been a practicing data scientist with an emphasis on predictive modeling for about 16 years. I know enough R to be dangerous but when I want to build a model I reach for my SAS Enterprise Miner (could just as easily be SPSS, Rapid Miner or one of the other complete platforms).
The key issue is that I can clean, prep, transform, engineer features, select features, and run 10 or more model types simultaneously in less than 60 minutes (sometimes a lot less) and get back a nice display of the most accurate and robust model along with exportable code in my selection of languages.
The reason I can do that is because these advanced platforms now all have drag-and-drop visual workspaces into which I deploy and rapidly adjust each major element of the modeling process without ever touching a line of code.
I have almost exactly the opposite thought on the matter: that drag-and-drop development is intolerably slow; I can drag and drop and connect and click and click and click for a while, or I can write a few lines of code. Nevertheless, I think Bill’s post is well worth reading.
Simon Jackson has a couple posts on how to use ggplot2 to make graphs prettier. First, histograms:
Time to jazz it up with colour! The method I’ll present was motivated by my answer to this StackOverflow question.
We can add colour by exploiting the way that ggplot2 stacks colour for different groups. Specifically, we fill the bars with the same variable, `cut` into multiple categories:
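Simon's post has his exact code; as a rough sketch of the stacking trick (made-up data and bin counts, not his settings), filling the bars by a banded version of the x variable might look like:

```r
library(ggplot2)

# Hypothetical data; the trick is filling each bar by a cut()
# of the same variable that is mapped to the x axis.
d <- data.frame(x = rnorm(1000))

p <- ggplot(d, aes(x = x, fill = cut(x, breaks = 10))) +
  geom_histogram(bins = 30, show.legend = FALSE)
```

Because the fill groups follow the x axis, each bar picks up a colour according to its position, giving the gradient-like effect.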
Shape and size
There are many ways to tweak the `shape` and `size` of the points. Here's the combination I settled on for this post:
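His post shows the values he chose; a hypothetical combination (my numbers, not Simon's) looks like:

```r
library(ggplot2)

# Hypothetical settings: shape 21 is a fillable circle, so outline
# colour and fill can be tweaked independently of each other.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(shape = 21, size = 4, colour = "black",
             fill = "skyblue", alpha = 0.7)
```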
There are some nice tricks here around transparency, color scheme, and gradients, making it a great series. As a quick note, this color scheme in the histogram headliner photo does not work at all for people with red-green color-blindness. Using a URL color filter like Toptal’s is quite helpful in discovering these sorts of issues.
The new Visual Studio 2017 has built-in support for programming in R and Python. For older versions of Visual Studio, support for these languages has been available via the RTVS and PTVS add-ins, but the new Data Science Workloads in Visual Studio 2017 make them available without a separate add-in. Just choose the “Data Science and analytical applications” option during installation to install everything you need, including Microsoft R Client and the Anaconda Python distribution.
I’m personally going to wait a little bit before jumping onto Visual Studio 2017, but I’m glad that RTVS is now available.
Looking at this data, the first thing I thought was: untidy. There has to be a better way. When I think of tidy data, I think of the tidyr package, which helps make data tidy and easier to work with. Specifically, I thought of the `spread()` function, which I could use to break things up. Once the data was spread into appropriate columns, I figured I could operate on it a bit better.
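Sarah's actual columns differ; as a minimal sketch of what `spread()` does, with hypothetical data:

```r
library(tidyr)

# Hypothetical long-format data, not the post's actual dataset
long <- data.frame(
  id      = c(1, 1, 2, 2),
  measure = c("height", "weight", "height", "weight"),
  value   = c(180, 75, 165, 62)
)

# spread() turns each distinct 'measure' into its own column
wide <- spread(long, key = measure, value = value)
#   id height weight
# 1  1    180     75
# 2  2    165     62
```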
Sarah has also made the data set available in case you’re interested in following along.
A Call Detail Record (CDR) is the information captured by telecom companies during the call, SMS, and Internet activity of a customer. Combined with customer demographics, this information provides greater insight into the customer's needs. Most telecom companies use CDR information for fraud detection by clustering user profiles, for reducing customer churn based on usage activity, and for targeting profitable customers using RFM analysis.
In this blog, we will discuss clustering customer activity across the 24 hours of the day using the unsupervised K-means clustering algorithm, in order to understand customer segments with respect to their usage by hour.
For example, a customer segment with high activity may generate more revenue, while a segment with high activity during the night hours might contain fraudulent users.
This article won’t really explain k-means clustering in any detail, but it does give you an example to apply the technique using R.
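The article walks through the real CDR workflow; a minimal base-R sketch of the hourly-usage clustering idea, on simulated data, might be:

```r
set.seed(42)

# Simulated stand-in for CDR data: one row per customer,
# 24 columns of activity counts, one per hour of the day
activity <- matrix(rpois(200 * 24, lambda = 3), nrow = 200)
colnames(activity) <- paste0("hour_", 0:23)

# Scale so no single hour dominates, then segment into 4 clusters
km <- kmeans(scale(activity), centers = 4, nstart = 25)

# Cluster sizes; inspect km$centers to label the segments
# (e.g. high night-hour centres could flag suspicious usage)
table(km$cluster)
```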
Looking at package names, one strategy that I commonly observe is to use small words, a verb or noun, and add the letter R to it. A good example is `dplyr`: d stands for dataframe, ply is just a tool, and R is, well, you know. In a conventional sense, the name of this popular tool is informative and easy to remember. As always, the extremes are never good. A couple of bad examples of package naming are `BB` and so on. Googling the package name is definitely not helpful. On the other end, package `samplesizelogisticcasecontrol` provides a lot of information but is plain unattractive!
Another strategy that I also find interesting is developers using names that, on first sight, are completely unrelated to the purpose of the package. But, there is a not so obvious link. One example is package
sandwich. At first sight, I challenge anyone to figure out what it does. This is an econometric package that computes robust standard errors in a regression model. These robust estimates are also called sandwich estimators because the formula looks like a sandwich. But, you only know that if you studied a bit of econometric theory. This strategy works because it is easier to remember things that surprise us. Another great example is package
`janitor`. I'm sure you already suspect that it has something to do with data cleaning. And you are right! The message of the name is effortless and it works! The author even got the privilege of using the letter R in the name.
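For the curious, the "sandwich" shape mentioned above is the robust covariance estimator for OLS coefficients, roughly:

```latex
\widehat{\operatorname{Var}}(\hat{\beta})
  = (X^\top X)^{-1}\,\bigl(X^\top \hat{\Omega}\, X\bigr)\,(X^\top X)^{-1}
```

where the two $(X^\top X)^{-1}$ terms are the bread and the middle term is the meat.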
Marcelo uses word and character analysis to come up with his conclusions, making this a good way of seeing how to graph and slice data. h/t R-bloggers
You may already know the trick about making the date dynamic to whatever date the report gets rendered on by using the inline R execution mode of rmarkdown to insert a value.
---
title: "My report"
date: "`r Sys.Date()`"
output: pdf_document
---
What you may not already know is that YAML fields get evaluated sequentially, so you can create a value further up, in the `params` section, and use it later in the block.
Click through to see how it’s done.
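For instance (hypothetical field names, not the linked post's exact YAML), a `params` value defined first can be reused further down:

```yaml
---
title: "My report"
params:
  report_date: "2017-05-01"
date: "`r params$report_date`"
output: pdf_document
---
```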
If you’re getting the following error when you’ve installed R 3.4.0 on Windows, you’re not alone.
Error in if (file.exists(dest) && file.mtime(dest) > file.mtime(lib) && :
missing value where TRUE/FALSE needed
Read on for the solution.
Then we need to add the GitHub repository to our project. I use the git command line for this:

git remote add origin email@example.com:stephlocke/datasauRus.git
git push --set-upstream origin master
With just these things, I have a package that contains the unit test framework, documentation stubs, continuous integration and test coverage, and source control.
That is all you need to do to get things going!
This is great timing for me, as I’m starting to look at packaging internal code. Also, it’s great timing because it includes dinosaurs.
In shiny, you can use `fileInput` with the parameter `multiple = TRUE` to enable you to upload multiple files at once. But how do you process those multiple files in shiny and consolidate them into a single dataset?
The bit we need from shiny is the
Read the whole thing.
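The linked post has the shiny-specific details; the consolidation step itself can be sketched in base R (hypothetical helper name, assuming CSV uploads). In a shiny server, the value of a `fileInput` is a data frame whose `datapath` column holds one temp-file path per uploaded file:

```r
# Hypothetical helper: read each uploaded CSV and stack the rows.
# In shiny, paths would come from input$files$datapath, where
# "files" is the inputId of a fileInput(..., multiple = TRUE).
read_all_csvs <- function(paths) {
  do.call(rbind, lapply(paths, read.csv))
}
```

Inside `server()`, something like `combined <- read_all_csvs(input$files$datapath)`, wrapped in a `reactive()` and guarded with `req(input$files)`, would yield a single data frame.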