We decided to do a quick check and took a sample of 143 stocks listed on the National Stock Exchange of India Ltd (NSE). For these stocks, we downloaded the 1-minute intraday data for the period 1/08/2016 – 19/08/2016. The aim was to check whether Google Finance captured every 1-minute bar during this period for each of the 143 stocks.
NSE’s trading session starts at 9:15 am and ends at 3:30 pm IST, thus comprising 375 minutes. For 14 trading sessions, we should have 5250 data points (14 × 375) for each of these stocks. We wrote some simple R code to perform the check.
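As a rough illustration of that kind of check (not the original authors' code), the sketch below builds the expected 1-minute bar labels for each session and reports which ones are absent from a downloaded set. It assumes each bar is labelled by its start time (09:15 through 15:29); the trading days and the observed vector are made up for the example, and timestamps are treated as exchange-local wall-clock time.

```r
# Expected bar labels for each session: 375 one-minute bars starting at 09:15.
# Assumption: a bar is labelled by its start minute, so the last label is 15:29.
expected_bars <- function(days) {
  unlist(lapply(days, function(day) {
    start <- as.POSIXct(paste(day, "09:15:00"), tz = "UTC")
    format(start + 60 * (0:374), "%Y-%m-%d %H:%M")
  }))
}

# Hypothetical inputs: two trading days and a download with three bars missing.
trading_days <- c("2016-08-01", "2016-08-02")
expected <- expected_bars(trading_days)
observed <- expected[-c(10, 11, 400)]

# Any expected label that was not downloaded is a gap.
missing_bars <- setdiff(expected, observed)

cat("expected:", length(expected),
    "observed:", length(observed),
    "missing:", length(missing_bars), "\n")
print(missing_bars)
```

Running the same comparison per stock over all 14 sessions would give the 5250 expected labels mentioned above.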
I like this post because it exposes a data quality issue people don’t tend to think about very often: when all of the data is legitimate and correctly structured, but there are gaps in the available data set. This is one of many data quality problems you’ll run into, so it may be important to have a plan in place in case you hit this scenario.
In the case of telco customer churn, we collected a combination of the call detail record data and customer profile data from a mobile carrier, and then followed the data science process (data exploration and visualization, data pre-processing and feature engineering, model training, scoring and evaluation) in order to achieve the churn prediction. With a churn indicator in the dataset taking value 1 when the customer has churned and value 0 when the customer has not, we addressed the problem as a binary classification problem and tried various tree-based models along with methods like bagging, random forests and boosting. Because the number of churned customers is much smaller than that of non-churned customers (making the data set quite unbalanced), SMOTE (Synthetic Minority Oversampling Technique) was applied to adjust the proportion of majority class over minority class in the training data set, thus further improving model performance, especially precision and recall.
All the above data science procedures could be implemented with base R. Rather than moving the data out from the database to an external machine running R, we instead run R scripts directly on SQL Server data by leveraging the in-database analytics capability provided by SQL Server R Services, taking advantage of the rich and powerful CRAN R packages plus the parallel external memory algorithms in the RevoScaleR library. In what follows, we will describe the specific R packages and algorithms that we used to implement the data science solution for predicting telco customer churn.
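To make the SMOTE step concrete, here is a minimal base R sketch of the core idea: each synthetic minority row is an interpolation between a real minority row and one of its nearest minority neighbours. This is a hand-rolled illustration for numeric features only, not the implementation the authors used; the function name smote_sketch and the columns calls and charge are hypothetical.

```r
# SMOTE-style oversampling sketch for numeric features (illustrative only).
# X_min: matrix of minority-class rows; n_new: number of synthetic rows;
# k: number of nearest neighbours to choose from.
smote_sketch <- function(X_min, n_new, k = 5) {
  X_min <- as.matrix(X_min)
  n <- nrow(X_min)
  k <- min(k, n - 1)
  d <- as.matrix(dist(X_min))  # pairwise distances among minority rows
  synth <- matrix(NA_real_, n_new, ncol(X_min),
                  dimnames = list(NULL, colnames(X_min)))
  for (i in seq_len(n_new)) {
    base <- sample.int(n, 1)                 # pick a real minority row
    nn <- order(d[base, ])
    nn <- nn[nn != base][seq_len(k)]         # its k nearest neighbours
    nb <- nn[sample.int(length(nn), 1)]      # pick one neighbour at random
    gap <- runif(1)                          # interpolation weight in [0, 1]
    synth[i, ] <- X_min[base, ] + gap * (X_min[nb, ] - X_min[base, ])
  }
  synth
}

# Hypothetical minority class: 20 churned customers with two numeric features.
set.seed(42)
minority <- cbind(calls = rnorm(20, 50, 10), charge = rnorm(20, 30, 5))
synthetic <- smote_sketch(minority, n_new = 60)
dim(synthetic)
```

The synthetic rows would be appended to the training set only, so that evaluation still reflects the real class balance.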
They have provided the relevant materials on GitHub as well.
Let’s look at how teams played on offense depending on where they were on the field (their yardline) and the down they were on. The fields in our dataframe that we will care about here are yfog (yards from own goal), type (rush or pass), and dwn (current down number: 1, 2, 3, or 4). We will want a table with each of these columns as well as a sum column. That way, we can see how many times a pass was attempted on 4th down when a team was X yards from their own goal.
To do this, we will use a package called plyr. The Internet says that this package makes it easy for us to split data, mess with it, and then put it back together. I am not convinced the tool is easy, but I haven’t spent too much time with it.
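The original post does this with plyr, whose ddply() would express the same split-apply-combine step. As a dependency-free sketch of the table being described, the example below uses base R's aggregate() instead; the plays data frame is invented sample data that only reuses the three field names from the text.

```r
# Hypothetical play-by-play sample with the three fields named above.
plays <- data.frame(
  yfog = c(20, 20, 20, 35, 35, 35, 35),
  type = c("PASS", "PASS", "RUSH", "PASS", "RUSH", "RUSH", "PASS"),
  dwn  = c(1, 1, 1, 4, 2, 2, 4)
)

# Give every play a weight of 1, then sum within each yfog/type/dwn group.
plays$n <- 1
play_counts <- aggregate(n ~ yfog + type + dwn, data = plays, FUN = sum)
print(play_counts)

# How many passes were attempted on 4th down from the 35?
fourth_down_passes <- play_counts$n[play_counts$yfog == 35 &
                                    play_counts$type == "PASS" &
                                    play_counts$dwn == 4]
```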
Check it out for some ideas on what you can do with R.
It is important to note that the SQL statements generated in the background are not executed unless explicitly requested by the command as.data.frame. Hence, you can merge, filter and aggregate your dataset on the database side and load only the result set into memory for R.
In general the design principle behind RDBL is to keep the models as close as possible to the usual data.frame logic, including (as shown later in detail) commands like aggregate, referencing columns by the $ operator and features like logical indexing using the [] operator.
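For readers less familiar with the data.frame idioms named here, this is what they look like on an ordinary in-memory data frame. It is plain base R, not RDBL code; the orders data is made up. According to the excerpt, RDBL aims to support the same style against database tables while deferring SQL execution until as.data.frame is called.

```r
# Ordinary in-memory data frame (hypothetical data), not a database table.
orders <- data.frame(
  customer = c("a", "b", "a", "c", "b"),
  amount   = c(10, 25, 5, 40, 15)
)

# Column reference with $ and logical indexing with [ ]: keep orders above 8.
large <- orders[orders$amount > 8, ]

# Aggregate the filtered rows: total amount per customer.
totals <- aggregate(amount ~ customer, data = large, FUN = sum)
print(totals)
```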
Check it out. I’m not particularly excited about this for one simple reason: SQL is a better data retrieval and connection DSL than an R-based mapper. I get the value of sticking to one language as much as possible. I also get that not all queries need to be well-optimized—for example, you might be running queries on a local machine or against a slice of data which is not connected to an operational production environment. But I’m a big fan of using the right tool for the job, and the right tool for working with relational databases (and the “relational” part there is perhaps optional) is SQL.
This is a very exciting project with great interest from the R and more general data science community. In the short two months since we opened registration for the conference:
More than 160 people from 17 countries have signed up and paid for attendance so far (around a 50-50% mix of academic and industry tickets, and a 30-70% mix of foreign and Hungarian attendees)
We received almost 40 voluntary talk proposals in the few weeks the CfP was open
25 selected & awesome speakers agreed to present at the conference
I’d like to see this take off, similar to SQL Saturdays.
Our real-world scenario involves R scripts that process raw smoke monitoring data that is updated hourly. The raw data comes from various instruments, set up by different agencies and transmitted over at least two satellites before eventually arriving on our computers.
Data can be missing, delayed or corrupted for a variety of reasons before it gets to us. And then our R scripts perform QC based on various columns available in the raw (aka “engineering level”) data.
Logging is one of the differences between toy code (even very useful toy code) and production-quality code. Read on for an easy way to do this in R.
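The linked post describes its own approach; purely as an illustration of the idea, here is a small leveled logger in base R. The names make_logger and LOG_LEVELS are hypothetical, and a production script would more likely rely on an established logging package.

```r
# Severity levels, lowest to highest (hypothetical names for this sketch).
LOG_LEVELS <- c(DEBUG = 1, INFO = 2, WARN = 3, ERROR = 4)

# Returns a logging function that drops messages below `threshold` and writes
# timestamped lines to `con` (a file path, or a connection such as stderr()).
make_logger <- function(threshold = "INFO", con = stderr()) {
  function(level, fmt, ...) {
    if (LOG_LEVELS[[level]] < LOG_LEVELS[[threshold]]) return(invisible(NULL))
    line <- sprintf("%s [%s] %s",
                    format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
                    level, sprintf(fmt, ...))
    cat(line, "\n", sep = "", file = con, append = TRUE)
    invisible(line)
  }
}

# Usage: log QC findings for a (made-up) monitor to a temporary file.
log_file <- tempfile(fileext = ".log")
log_msg <- make_logger(threshold = "INFO", con = log_file)
log_msg("DEBUG", "raw row count: %d", 1200L)  # below threshold, dropped
log_msg("WARN", "monitor %s: %d missing hourly values", "unit-07", 3L)
log_lines <- readLines(log_file)
```

Writing to a file rather than the console matters for hourly scheduled jobs, where nobody is watching the output when something goes wrong.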
Power BI, Microsoft’s data visualization and reporting platform, has made great strides in the past year integrating the R language. This Computerworld article describes the recent advances with Power BI and R. In short, you can:
import data into Power BI by using an R script
cleanse and transform other data sources coming into Power BI using R functions
Click through for more things you can do, as well as additional links and resources.
Our first data frame consists of seven vectors: Customer_Id, loan_type, First_Name, Last_name, Gender, Zip_code and amount.
NOTE: R is case sensitive. That is why I have used lower and upper case for you to practice.
After we run the lines, we want to see how our first data frame looks. The following command does that:
If you’re coming from a SQL background, data frames are tables. Well-formed (“clean”) data frames more or less follow first normal form.
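A minimal sketch of such a data frame, using the seven column names from the excerpt with invented values (the data frame name loans is also an assumption), might look like this. Note that the mixed capitalization is kept on purpose, since R treats Last_name and last_name as different names.

```r
# Seven vectors, one per column; all values here are made up for illustration.
Customer_Id <- c(101, 102, 103)
loan_type   <- c("home", "auto", "personal")
First_Name  <- c("Ana", "Ben", "Chi")
Last_name   <- c("Diaz", "Evans", "Fong")
Gender      <- c("F", "M", "F")
Zip_code    <- c("20850", "20852", "20854")  # kept as text to preserve leading zeros
amount      <- c(250000, 18000, 5000)

# Combine the vectors column-wise into a data frame.
loans <- data.frame(Customer_Id, loan_type, First_Name, Last_name, Gender,
                    Zip_code, amount, stringsAsFactors = FALSE)

str(loans)   # column names and types
head(loans)  # first rows, much like SELECT TOP in SQL
```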
The general approach behind each of the examples that we’ll cover below is to:
Fit a regression model to predict variable (Y).
Obtain the predicted and residual values associated with each observation on (Y).
Plot the actual and predicted values of (Y) so that they are distinguishable, but connected.
Use the residuals to make an aesthetic adjustment (e.g. red colour when the residual is very high) to highlight points which are poorly predicted by the model.
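The four steps above can be sketched with base R graphics as follows. The choice of the built-in mtcars data, the mpg ~ hp model, and the two-standard-deviation cutoff for "very high" are assumptions made for this example, and the linked post may use different data and plotting tools.

```r
# 1. Fit a regression model to predict Y (here: mpg from hp in mtcars).
fit <- lm(mpg ~ hp, data = mtcars)

# 2. Collect the predicted and residual values for each observation.
d <- data.frame(hp = mtcars$hp, mpg = mtcars$mpg,
                predicted = predict(fit), residual = residuals(fit))

# 4 (prepared first). Flag poorly predicted points: |residual| above 2 SDs.
big <- abs(d$residual) > 2 * sd(d$residual)

# When run as a script, draw to a null device instead of writing Rplots.pdf.
if (!interactive()) pdf(NULL)

# 3. Plot actual (filled) and predicted (hollow) values, connected by segments.
plot(d$hp, d$mpg, pch = 19, col = ifelse(big, "red", "grey40"),
     xlab = "hp", ylab = "mpg", main = "Actual vs predicted mpg")
points(d$hp, d$predicted, pch = 1)
segments(d$hp, d$mpg, d$hp, d$predicted, col = ifelse(big, "red", "grey70"))
```

The length of each segment is the absolute residual, so long red segments mark the observations the model fits worst.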
The post is about 10% understanding what residuals are and 90% showing how to visualize them and spot major discrepancies.
RStudio has several ways to import data. One of the easiest ways is to import via URL. This link (https://data.montgomerycountymd.gov/api/views/6rqk-pdub/rows.csv?accessType=DOWNLOAD) gives us the salaries of all of the government employees for Montgomery County, MD in a CSV format. To import this into RStudio, copy the URL and go to Tools -> Import Dataset -> From Web URL…
R and Python are both good languages to learn for data analysis. I lean just a little bit toward R, but they’re both strong choices in this space.