Category: R

When Binomials Converge

Mala Mahadevan shows an example of the central limit theorem in action, as a large enough sample from a binomial distribution approximates the normal:

An easier way to do it is to use the normal distribution, or central limit theorem. My post on the theorem illustrates that a sample will follow a normal distribution if the sample size is large enough. We will use that, as well as the rules around determining probabilities in a normal distribution, to arrive at the probability in this case.
Problem: I have a group of 100 friends who are smokers. The probability of a random smoker having lung disease is 0.3. What are the chances that a maximum of 35 people wind up with lung disease?

Click through for the example.
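
As a rough sketch of the calculation in R (not the code from the linked post), the exact binomial probability for the quoted problem can be compared with the normal approximation. The variable names and the continuity-corrected variant are additions for illustration.

```r
# Problem from the excerpt: n = 100 smokers, p = 0.3, at most k = 35 cases
n <- 100
p <- 0.3
k <- 35

# Exact answer from the binomial distribution: P(X <= 35)
exact <- pbinom(k, size = n, prob = p)

# Normal approximation: mean n*p, standard deviation sqrt(n*p*(1-p))
mu <- n * p
sigma <- sqrt(n * p * (1 - p))
approx_plain <- pnorm(k, mean = mu, sd = sigma)

# Same approximation with a continuity correction (an addition, not from the post)
approx_cc <- pnorm(k + 0.5, mean = mu, sd = sigma)

print(c(exact = exact, normal = approx_plain, normal_cc = approx_cc))
```

With these inputs the mean is 30 and the standard deviation is sqrt(21), about 4.58, so 35 sits roughly 1.09 standard deviations above the mean.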

Logistic Regression With R

Raghavan Madabusi runs through a sample logistic regression:

Input Variables: These variables are called predictors or independent variables.

  • Customer Demographics (Gender and Senior citizenship)
  • Billing Information (Monthly and Annual charges, Payment method)
  • Product Services (Multiple line, Online security, Streaming TV, Streaming Movies, and so on)
  • Customer relationship variables (Tenure and Contract period)

Output Variables: These variables are called response or dependent variables. Since the output variable (Churn value) takes the binary form “0” or “1”, this is a classification problem in supervised machine learning.

One of the interesting things in this post was the use of missmap, which is part of Amelia.
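
A minimal sketch of this kind of model in base R follows. It is not the code from the linked post: the data is simulated, the column names (tenure, monthly_charges, senior, churn) are hypothetical stand-ins for the variable groups listed above, and a plain count of missing values per column is used in place of Amelia's missmap.

```r
set.seed(42)
n <- 500

# Simulated customers; every column here is a hypothetical stand-in
customers <- data.frame(
  tenure = sample(1:72, n, replace = TRUE),
  monthly_charges = runif(n, 20, 120),
  senior = rbinom(n, 1, 0.2)
)

# Simulated binary churn flag: longer tenure lowers the odds of churning
log_odds <- -0.5 - 0.05 * customers$tenure +
  0.02 * customers$monthly_charges + 0.4 * customers$senior
customers$churn <- rbinom(n, 1, plogis(log_odds))

# Blank out a few values so the missing-data check has something to find
customers$monthly_charges[sample(n, 10)] <- NA

# Base R alternative to a missingness map: count NAs per column
missing_counts <- colSums(is.na(customers))
print(missing_counts)

# Logistic regression: binomial family with the default logit link.
# Rows with missing predictors are dropped by the default na.action.
model <- glm(churn ~ tenure + monthly_charges + senior,
             data = customers, family = binomial)
print(summary(model)$coefficients)

# Predicted churn probabilities (NA where a predictor is missing)
churn_prob <- predict(model, newdata = customers, type = "response")
```

The fitted coefficients are on the log-odds scale, so exp(coef(model)) gives odds ratios.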

Tidyverse Updates

Hadley Wickham has two announcements.  First, for a slew of tidyverse packages:

Over the past couple of months there have been a bunch of smaller releases to packages in the tidyverse. This includes:

  • forcats 0.2.0, for working with factors.
  • readr 1.1.0, for reading flat-files from disk.
  • stringr 1.2.0, for manipulating strings.
  • tibble 1.3.0, a modern re-imagining of the data frame.

This blog post summarises the most important new features, and points to the full release notes where you can learn more.

Second, a new version of dplyr is coming:

dplyr 0.6.0 is a major release including over 100 bug fixes and improvements. There are three big changes that I want to touch on here:

  • Databases
  • Improved encoding support (particularly for CJK on Windows)
  • Tidyeval, a new framework for programming with dplyr

You can see a complete list of changes in the draft release notes.

You can already get a tech preview of the new dplyr if you’re interested in trying it out.

The Basics Of SparkR

Yanbo Liang has an introductory article on what SparkR is and why you might want to use it:

However, data analysis using R is limited by the amount of memory available on a single machine, and because R is single threaded, it is often impractical to use R on large datasets. To address R’s scalability issue, the Spark community developed the SparkR package, which is based on a distributed data frame that enables structured data processing with a syntax familiar to R users. Spark provides a distributed processing engine, data sources, and off-memory data structures. R provides a dynamic environment, interactivity, packages, and visualization. SparkR combines the advantages of both Spark and R.

In the following section, we will illustrate how to integrate SparkR with R to solve some typical data science problems from a traditional R user’s perspective.

This is a fairly introductory article, but it gives an idea of what SparkR can accomplish.

Basics Of R Plotting

Aman Tsegai shows some basic ways to customize R’s plot function:

We’re going to be using the cars dataset that is built into R. To follow along with real code, here’s an interactive R Notebook. Feel free to copy it and play around with the code as you read along.

So if we were to simply plot the dataset using just the data as the only parameter, it’d look like this:

plot(dataset)

The plot function is great for cases where you don’t much care how the visual looks, and the simplicity is great for throwaway visuals.
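
For a sense of what customizing that call can look like, here is a small sketch in base R. It assumes that `dataset` in the excerpt refers to the built-in cars data frame. The title, axis labels, colors, and trend line are illustrative choices, not taken from the linked post.

```r
# cars ships with base R: speed in mph, stopping distance (dist) in feet
plot(cars,
     main = "Stopping distance vs. speed",
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)",
     pch  = 19,            # solid circles instead of the default open ones
     col  = "steelblue")

# Overlay a dashed least-squares trend line on the existing plot
trend <- lm(dist ~ speed, data = cars)
abline(trend, lty = 2, col = "firebrick")
```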

R Plots In Power BI

Leila Etaati has a three-part series on displaying R visuals in Power BI.  Part 1 shows how to create a scatter plot:

In the picture above, we can see three different fields shown in the chart: highway and city speed on the y and x axes, while the car’s cylinder variable is shown as different circle sizes. However, you may need bigger circles to differentiate 8 cylinders from 4, and we are able to do that by adding another layer with a function named ...

Part 2 shows how to use facet_grid to show multiple plots in one visual:

Now I want to add another layer to this chart by adding the year and car drive options. To do that, first choose year and drv from the data fields in Power BI. As I have mentioned before, the dataset variable will now hold data about speed in city, speed on highway, number of cylinders, year of the car, and type of drive.

I am going to use another function in the ggplot2 package, named “facet_grid”, that helps me show the different facets in my scatter chart. In this function, year and drv (drive) will be shown against each other.
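
A hedged sketch of what such a faceted scatter chart might look like follows. It is not the code from the series. It assumes the ggplot2 package is installed and uses ggplot2's bundled mpg data as a stand-in for the `dataset` data frame that a Power BI R visual supplies. The column names cty, hwy, cyl, year, and drv come from mpg.

```r
library(ggplot2)

# Stand-in for the data frame Power BI passes to an R visual as `dataset`
dataset <- as.data.frame(mpg[, c("cty", "hwy", "cyl", "year", "drv")])

# City vs. highway values, with point size mapped to the number of cylinders
p <- ggplot(dataset, aes(x = cty, y = hwy, size = cyl)) +
  geom_point(alpha = 0.5) +
  facet_grid(year ~ drv)  # one row of panels per year, one column per drive type

print(p)
```

The formula passed to facet_grid puts the row variable on the left of the tilde and the column variable on the right.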

Part 3 shows how to place charts on a map in R:

Now I have to merge the data to get the location information from “sPDF” into “ddf”. To do that, I am going to use the “merge” function. As you can see in the code below, the first argument is our first dataset, “ddf”, and the second is the data on the Lat and Lon of each location (sPDF). The third and fourth arguments give the key variables for joining the two datasets: “country” in “ddf” (x) and “Admin” in “sPDF”. The result will be stored in the “df” dataset.

Aside from my strong dislike of bar/pie charts on maps, this is good to know, particularly if there is not a built-in or custom Power BI visual to replicate something you can do in R.
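
The merge step described in the Part 3 excerpt can be sketched with base R's merge function. The two small data frames below are hypothetical stand-ins that reuse the names from the excerpt (ddf, sPDF, country, Admin). The values, the LON and LAT column names, and the rough coordinates are made up for illustration.

```r
# Hypothetical measure data keyed by country name
ddf <- data.frame(
  country = c("Australia", "New Zealand", "Japan"),
  value   = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Hypothetical location data keyed by "Admin" (rough illustrative coordinates)
sPDF <- data.frame(
  Admin = c("Japan", "Australia", "New Zealand"),
  LON   = c(138, 134, 172),
  LAT   = c(36, -25, -41),
  stringsAsFactors = FALSE
)

# Join on differently named key columns: "country" in x, "Admin" in y
df <- merge(ddf, sPDF, by.x = "country", by.y = "Admin")
print(df)
```

By default merge keeps only rows whose keys appear in both inputs. Passing all.x = TRUE would keep unmatched rows from the first data frame.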

Logging R Scripts

Tomaz Kastrun shows the places where you might be able to track R scripts running on your system:

The Extensibility Log will store information about the session, but it will not store R or R environment information or data, just session information and data. Navigate to:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\LOG\ExtensibilityLog

to check the contents and see whether there is anything useful for your needs.

It’s not a great answer today.

Microsoft R Open 3.3.3

David Smith reports that Microsoft R Open 3.3.3 is now available:

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.3.3, and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to R 3.3.3, upgrades the installer, and updates the bundled packages.

R 3.3.3 makes just a few minor fixes compared to R 3.3.2 (see the full list of changes here), so you shouldn’t encounter any compatibility issues when upgrading from MRO 3.3.2. For CRAN packages, MRO 3.3.3 points to a CRAN snapshot taken on March 15, 2017, but as always, you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).

Click through for more details.  As a side note, CRAN R 3.4 is scheduled for release this month, so given their recent cadence, I’d guess MRO 3.4 to be out late this year.

Using OLS To Fit Rational Functions

Srini Kumar and Bob Horton show how to use the lm function to fit functions using the Padé approximation:

Now we have a form that lm can work with. We just need to specify a set of inputs that are powers of x (as in a traditional polynomial fit), and a set of inputs that are y times powers of x. This may seem like a strange thing to do, because we are making a model where we would need to know the value of y in order to predict y. But the trick here is that we will not try to use the fitted model to predict anything; we will just take the coefficients out and rearrange them in a function. The fit_pade function below takes a dataframe with x and y values, fits an lm model, and returns a function of x that uses the coefficients from the model to predict y:

The lm function does more than just fit straight lines.
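
To make the trick concrete, here is a minimal sketch of the idea with both the numerator and the denominator fixed at second order. It is not the authors' fit_pade function. The name fit_pade_sketch, the fixed orders, and the test curve are assumptions for illustration. Starting from y = (a0 + a1 x + a2 x^2) / (1 + b1 x + b2 x^2), multiplying through by the denominator and rearranging gives y = a0 + a1 x + a2 x^2 - b1 (x y) - b2 (x^2 y), which is linear in the unknown coefficients.

```r
fit_pade_sketch <- function(df) {
  # Inputs are powers of x plus y times powers of x, as in the linear form above
  fit <- lm(y ~ x + I(x^2) + I(x * y) + I(x^2 * y), data = df)
  co <- unname(coef(fit))

  a <- co[1:3]    # numerator coefficients a0, a1, a2
  b <- -co[4:5]   # denominator coefficients b1, b2 (sign flipped back)

  # Rearranged rational function of x alone; y is no longer needed
  function(x) (a[1] + a[2] * x + a[3] * x^2) / (1 + b[1] * x + b[2] * x^2)
}

# Noise-free points from a known rational function (illustrative test curve)
x <- seq(0, 5, length.out = 50)
y <- (1 + 2 * x) / (1 + 0.5 * x + 0.25 * x^2)

pade_fn <- fit_pade_sketch(data.frame(x = x, y = y))
max_error <- max(abs(pade_fn(x) - y))
print(max_error)
```

Because the points come from a curve that has exactly this form, the fit should reproduce it up to numerical error. With noisy data the least-squares fit on the rearranged equation is only an approximation to fitting the rational function directly.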

New RTVS Instructions

Ginger Grant has updated her instructions for installing R Tools for Visual Studio and getting R Services to work on SQL Server:

In addition to having an SQL Server 2016 instance with R Server installed, the following components are needed on a client machine:

  • The Comprehensive R Archive Network
  • RStudio (optional)
  • Visual Studio 2015 R Tools

This list is a change from the previous list I provided: because RTVS contains an installation of R Client, there is no need to download that as well. You do not need to download Microsoft R Open if you are using R Server either. Once RTVS is installed, there is a menu option on the R Tools window. Selecting Install R Client from the menu will handle the installation. Unfortunately, the menu option does not change once R Client is installed, so it always looks like you should install it. To find out if R Client has been installed, look in the Workspaces window.

In other words, fewer dependencies and an easier installation process.  Read the whole thing to avoid RevoScaleR errors in your code post-upgrade.
