Press "Enter" to skip to content

Category: R

Generating Basic Features From Text Data In R With textfeatures

Abdul Majed Raja demonstrates the textfeatures package in R:

Michael Kearney, Assistant Professor in University of Missouri, well known in the R community for the modern twitter package rtweet, has come up with a new R packaged called textfeatures that basically generates a bunch of features for any text data that you supply. Before you dream of Deep Learning based Package for Automated Text Feature Engineering, This isn’t that. This uses very simple Text Analysis principles and generates features like Number of Upper Case letters, Number of Punctuations – plain simple stuff and nothing fancy but pretty useful ones.

It’s a start for text analysis, though there’s a lot more after this.

Comments closed

Real-Time Data Visualization With R And SQL Server

Tomaz Kastrun shows how simple it can be to plot real(ish)-time data from SQL Server using R:

In the previous post, I have showed how to visualize near real-time data using Python and Dash module.  And it is time to see one of the many ways, how to do it in R. This time, I will not use any additional frames for visualization, like shiny, plotly or any others others, but will simply use base R functions and RODBC package to extract data from SQL Server.

Extracting data from SQL Server will and simulating inserts in SQL Server table will primarily simulate the near real-time data. If you have followed the previous post, you will notice that I am using same T-SQL table and query to extract real-time data.

Tomaz is using the base plot library, but if you want something nicer, there are several good alternatives.

Comments closed

Plotting ML Results In R

Bernardo Lares shows off the plots he creates in R to compare ML models:

Split and compare quantiles

This parameter is the easiest to sell to the C-level guys. “Did you know that with this model, if we chop the worst 20% of leads we would have avoided 60% of the frauds and only lose 8% of our sales?” That’s what this plot will give you.

The math behind the plot might be a bit foggy for some readers so let me try and explain further: if you sort from the lowest to the highest score all your observations / people / leads, then you can literally, for instance, select the top 5 or bottom 15% or so. What we do now is split all those “ranked” rows into similar-sized-buckets to get the best bucket, the second best one, and so on. Then, if you split all the “Goods” and the “Bads” into two columns, keeping their buckets’ colours, we still have it sorted and separated, right? To conclude, if you’d say that the worst 20% cases (all from the same worst colour and bucket) were to take an action, then how many of each label would that represent on your test set? There you go!

Read on to see what else he uses and how you can build it yourself.

Comments closed

Scatterplots For Multivariate Analysis

Neil Saunders declutters a complicated visual with a simple scatterplot:

Sydney’s congestion at ‘tipping point’ blares the headline and to illustrate, an interactive chart with bars for city population densities, points for commute times and of course, dual-axes.

Yuck. OK, I guess it does show that Sydney is one of three cities that are low density, but have comparable average commute times to higher-density cities. But if you’re plotting commute time versus population density…doesn’t a different kind of chart come to mind first? y versus x. C’mon.

Let’s explore.

Simple is typically better, and that adage holds here.

Comments closed

Using ggpairs To Find Correlations Between Variables In R

Akshay Mahale shows how to use the ggpairs function in R to see the correlation between different pairs of variables:

From the above matrix for iris we can deduce the following insights:

  • Correlation between Sepal.Length and Petal.Length is strong and dense.
  • Sepal.Length and Sepal.Width seems to show very little correlation as datapoints are spreaded through out the plot area.
  • Petal.Length and Petal.Width also shows strong correlation.

Note: The insights are made from the interpretation of scatterplots(with no absolute value of the coefficient of correlation calculated). Some more examination will be required to be done once significant variables are obtained for linear regression modeling. (with help of residual plots, the coefficient of determination i.e Multiplied R square we can reach closer to our results)

Click through to read the whole thing.

Comments closed

Testing Spatial Equilibrium Concepts With tidycensus

Ignacio Sarmiento Barbieri walks us through the concept of spatial equilibrium and tests using data from the tidycensus package:

Let’s take the model to the data and reproduce figures 2.1. and 2.2 of “Cities, Agglomeration, and Spatial Equilibrium”. The focus are two cities, Chicago and Boston. These cities are chosen because both differ in how easy is to access to their city centers. Chicago is fairly easy, Boston is more complicated. Our model then implies that gradients then should reflect the differential costs to access the city centers.

So let’s begin, the first step is to get some data. To do so I’m are going to use the “tidycensus” package. This package will allow me to get data from the census website using their API. We are also going to need the help of three other packages: “sf” to handle spatial data, “dplyr” my go-to package to wrangle data, and “ggplot2” to plot my results.

require("tidycensus", quietly=TRUE)
require("sf", quietly=TRUE)
require("dplyr", quietly=TRUE)
require("ggplot2", quietly=TRUE)

In order to get access to the Census API, I need to supply a key, which can be obtained from http://api.census.gov/data/key_signup.html.

Read on for theory and a test.  H/T R-bloggers

Comments closed

Partitioning Data For Performance Improvement In R

John Mount shares a few examples of partitioning and parallelizing data operations in R:

In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatabledata.table, and dplyr. The methods shown will also work with base-R and other packages.

For each of the above packages we speed up work by using wrapr::execute_parallel which in turn uses wrapr::partition_tables to partition un-related data.frame rows and then distributes them to different processors to be executed. rqdatatable::ex_data_table_parallelconveniently bundles all of these steps together when working with rquery pipelines.

There were some interesting results.  I expected data.table to be fast, but did not expect dplyr to parallelize so well.

Comments closed

Sharing R Notebooks

Hanyu Cui and Hossein Falaki show how to share a notebook using RMarkdown:

RMarkdown is the dynamic document format RStudio uses. It is normal Markdown plus embedded R (or any other language) code that can be executed to produce outputs, including tables and charts, within the document. Hence, after changing your R code, you can just rerun all code in the RMarkdown file rather than redo the whole run-copy-paste cycle. And an RMarkdown file can be directly exported into multiple formats, including HTML, PDF,  and Word.

Click through for the demo.

Comments closed

Pipe-Friendly Functions In R

William Doane gives some tips on writing pipe-friendly functions in R:

Languages that don’t begin by supporting pipes often eventually implement some version of them. In R, the magrittr package introduced the %>% infix operator as a pipe operator and is most often pronounced as “then”. For example, “take the mtcarsdata.frame, THEN take the head of it, THEN…” and so on.

For a function to be pipe friendly, it should at least take a data object (often named .data) as its first argument and return an object of the same type—possibly even the same, unaltered object. This contract ensures that your pipe-friendly function can exist in the middle of a piped workflow, accepting the input from its left-hand side and passing along output to its right-hand side.

Click through for a couple of examples.  H/T R-Bloggers

Comments closed

Grouping And Aggregating In SQL, R, And Python

Dejan Sarka has a few examples of aggregation in different languages, including SQL, R, and Python:

The query calculates the coefficient of variation (defined as the standard deviation divided the mean) for the following groups, in the order as they are listed in the GROUPING SETS clause:

  • Country and education – expression (g.EnglishCountryRegionName, c.EnglishEducation)
  • Country only – expression (g.EnglishCountryRegionName)
  • Education only – expression (c.EnglishEducation)
  • Over all dataset- expression ()

Note also the usage of the GROUPING() function in the query. This function tells you whether the NULL in a cell comes because there were NULLs in the source data and this means a group NULL, or there is a NULL in the cell because this is a hyper aggregate. For example, NULL in the Education column where the value of the GROUPING(Education) equals to 1 indicates that this is aggregated in such a way that education makes no sense in the context, for example aggregated over countries only, or over the whole dataset. I used ordering by NEWID() just to shuffle the results. I executed query multiple times before I got the desired order where all possibilities for the GROUPING() function output were included in the first few rows of the result set. Here is the result.

GROUPING SETS is an underappreciated bit of SQL syntax.

Comments closed