Press "Enter" to skip to content

Category: R

Analyzing The StackLite Dataset

Marco Pasin looks at the StackLite data set:

According to Stack Overflow documentation, these are the categories of questions that may be closed by the community users:

  • duplicated
  • off topic
  • unclear
  • too broad
  • primarily opinion-based
Not everyone in the Stack Overflow community is able to close a question. In fact users need to have certain reputation expressed in points (more details here).

To calculate the overall website closure rate is easy. Just use the original “questions_2016” dataset and count how many questions have the field “Closed Date” populated. Over 10% of questions made in 2016 have been closed so far.

If you’re interested in learning more about data analysis, walk through the exercise as well and play around with the data set too.  Hat tip, R-Bloggers.

Comments closed

Running A Model On Separate Groups Of Data

Simon Jackson shows how to run the same model against separate groups of data in R:

Now that we can separate data for each group(s), we can fit a model to each tibble in data using map() from the purrr package (also tidyverse). We’re going to add the results to our existing tibble using mutate() from the dplyr package (again, tidyverse). Here’s a generic version of our pipe with adjustable parts in caps:

Read the whole thing.  Hat tip, R-Bloggers.

Comments closed

Lubridate Updates

Hadley Wickham reports on a Lubridate update:

  • Date time rounding (with round_date()floor_date() and ceiling_date()) now supports unit multipliers, like “3 days” or “2 months”:

    ceiling_date(ymd_hms("2016-09-12 17:10:00"), unit = "5 minutes")
    #> [1] "2016-09-12 17:10:00 UTC"

If you handle date and time data in R, Lubridate is a tremendous asset.

Comments closed

Predictive Maintenance Solution Template

Jaya Mathew has a SQL Server R Services template for predictive maintenance:

To illustrate the scenario, we will focus on companies who operate machines which encounter mechanical failures. These failures lead to downtime which has cost implications on any business, hence most companies are interested in predicting the failures ahead of time so that they can proactively prevent them. This scenario is aligned with an existing R Notebook published in the Cortana Intelligence Gallery but works with a larger dataset where we will focus on predicting component failures of a machine using raw telemetry, maintenance logs, previous errors/failures and additional information about the make/model of the machine. This scenario is widely applicable for almost any industry which uses machines that need maintenance. A quick overview of typical feature engineering techniques as well as how to build a model will be discussed below.

Understanding when machines are likely to break down is a very interesting statistical problem.  Check out the template.

Comments closed

Chi Square Tests

Mala Mahadevan discusses how to perform a Chi Square test:

For any dataset to lend itself to the Chi Square test it has to fit the following conditions  –

1 Both  variables are categorical (in this case – exposure to smoking – yes/no, and health condition – sick/not sick are both categorical).
2 Researchers used a random sample to collect data.
3 Researchers had an adequate sample size.Generally the sample size should be at least 100.
4 The number of respondents in each cell should be at least 5.

This is an easy case for using R over T-SQL—the Chi Square test is built in, whereas you have to roll your own T-SQL code.  Mala does show you how to do this from within SQL Server R Services as well.

Comments closed

Shiny 0.14 Released

Winston Chang reports that Shiny version 0.14 is now available:

If your Shiny app contains computations that take a long time to complete, a progress bar can improve the user experience by communicating how far along the computation is, and how much is left. Progress bars were added in Shiny 0.10.2. In Shiny 0.14, we’ve changed them to use the notifications system, which gives them a different look.

Important note: If you were already using progress bars and had customized them with your own CSS, you can add the style = "old" argument to yourwithProgress() call (or Progress$new()). This will result in the same appearance as before. You can also call shinyOptions(progress.style = "old") in your app’s server function to make all progress indicators use the old styling.

It looks like they’ve made some good progress with Shiny.

Comments closed

Pearson’s Correlation Coefficient

Mala Mahadevan explains correlation coefficients:

The statistical definition of Pearson’s R Coefficient, as it is called, can be found in detail here for those interested. A value of 1 indicates that there is a strong positive correlation(the two variables in question increase together), 0 indicates no correlation between them, and -1 indicates a strong negative correlation (the two variables decrease together). But you rarely get a perfect -1, 0 or 1. Most values are fractional and interpreted as follows:
High correlation: .5 to 1.0 or -0.5 to 1.0.
Medium correlation: .3 to .5 or -0.3 to .5.
Low correlation: .1 to .3 or -0.1 to -0.3.

Mala includes R and T-SQL code so you can follow along.

Comments closed

Data Quality

Milind Paradkar discusses clean data:

We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google finance captured every 1-minute bar during this period for each of the 143 stocks.

NSE’s trading session starts at 9:15 am and ends at 15:30 pm IST, thus comprising of 375 minutes. For 14 trading sessions, we should have 5250 data points for each of these stocks. We wrote a simple code in R to perform the check.

I like this post because it exposes a data quality issue people don’t tend to think about very often:  when all of the data is legitimate and correctly-structured, but there are gaps in the available data set.  This is one of many data quality problems you’ll run into, so it may be important to have a plan in place in case you hit this scenario.

Comments closed

Graphing Customer Churn

Fang Zhou and Wee Hyong Tok have released a case study on a telephone company’s customer churn:

In the case of telco customer churn, we collected a combination of the call detail record data and customer profile data from a mobile carrier, and then followed the data science process —  data exploration and visualization, data pre-processing and feature engineering, model training, scoring and evaluation — in order to achieve the churn prediction. With a churn indicator in the dataset taking value 1 when the customer is churned and taking value 0 when the customer is non-churned, we addressed the problem as a binary classification problem and tried varioustree-based models along with methods like bagging, random forests and boosting. Because the number of churned customers is much less than that of non-churned customers (making the data set quite unbalanced), SMOTE (Synthetic Minority Oversampling Technique) was applied to adjust the proportion of majority class over minority class in the training data set, thus further improving model performance, especially precision and recall.

All the above data science procedures could be implemented with base R. Rather than moving the data out from the database to an external machine running R, we instead run R scripts directly on SQL Server data by leveraging the in-database analytics capability provided by SQL Server R Services, taking advantage of the rich and powerful CRAN R packages plus the parallel external memory algorithms in the RevoScaleR library. In what follows, we will describe the specific R packages and algorithms that we used to implement the data science solution for predicting telco customer churn.

They have provided the relevant materials in GitHub as well.

Comments closed

Visualizing NFL Data

Allison Tharp looks at NFL play-by-play data using R:

Lets look at how teams played on offense depending on where they were on the field (their yardline) and the down they were on.  The fields in our dataframe that we will care about here are yfog (yards from own goal), type (rush or pass), dwn (current down number: 1,2,3, or 4).  We will want a table with each of these columns as well as a sum column.  That way, we can see how many times a pass attempt was done on the 4th down when a team was X yards from their own goal.

To do this, we will use a package called plyr.  The Internet says that this package makes it easy for us to split data, mess with it, and then put it back together.  I am not convinced the tool is easy, but I haven’t spent too much time with it.

Check it out for some ideas on what you can do with R.

Comments closed