Data Cleaning Tips

Kevin Feasel

2017-07-12

R

Michael Grogan has a few tips for data cleaning with R:

6. Delete observations using head and tail functions

The head and tail functions can be used if we wish to delete certain observations from a variable, e.g. Sales. The head function allows us to delete the first 30 rows, while the tail function allows us to delete the last 30 rows.

When it comes to using a variable edited in this way for calculation purposes, e.g. a regression, the as.matrix function is also used to convert the variable into matrix format:

Salesminus30days←head(Sales,-30)
X1=as.matrix(Salesminus30days)
X1

Salesplus30days<-tail(Sales,-30)
X2=as.matrix(Salesplus30days)
X2

Some of these tips are for people familiar with Excel but fairly new to R.  These also use the base library rather than the tidyverse packages (e.g., using merge instead of dplyr’s join or as.date instead of lubridate).  You may consider that a small negative, but if it is, it’s a very small one.

Useful dplyr Functions

Kevin Feasel

2017-07-12

R

S. Richter-Walsh explains seven important dplyr functions with plenty of examples:

There are many useful functions contained within the dplyr package. This post does not attempt to cover them all but does look at the major functions that are commonly used in data manipulation tasks. These are:

select()
filter()
mutate()
group_by()
summarise()
arrange()
join()

The data used in this post are taken from the UCI Machine Learning Repository and contain census information from 1994 for the USA. The dataset can be used for classification of income class in a machine learning setting and can be obtained here.

That’s probably the bare minimum you should know about dplyr, but knowing just these seven can make data analysis in R much easier.

Tibbles In R

Kevin Feasel

2017-07-11

R

Tristan Mahr explains what tibbles and tribbles are and how they compare to built-in data frames:

The name “tribble” is short for “transposed tibble” (the transposed part referring to change from column-wise creation in tibble() to row-wise creation in tribble()).

I like to use light-weight tribbles for two particular tasks:

  • Recoding: Create a tribble of, say, labels for a plot and join it onto a dataset.

  • Exclusion: Identify observations to exclude, and remove them with an anti-join.

I’ve been more used to data frames than tibbles, but this post shows some interesting things you can do with tibbles a lot more easily than with data frames.  It’s enough to make me want to use tibbles more frequently.  H/T R-bloggers

Plotly And Power BI

Leila Etaati shows how to use Plotly to generate interactive R charts in Power BI:

In the last two posts (Part 1 and 2), I have explained the main process of creating the R custom Visual Packages in Power BI. there are some parts that still need improvement which I will do in next posts. In this post, I am going to show different R charts that can be used in power BI and when we should used them for what type of data, these are Facet jitter chart, Pie chart, Polar Scatter Chart, Multiple Box Plot, and Column Width Chart. I follow the same process I did in Post 1 and Post 2. I just change the R scripts  and will explain how to use these graphs

Leila includes several examples of chart types and shows that it’s pretty easy to get this working.

R’s iGraph + SQL Server Graphs

Dennes Torres has a post which shows how to use R’s iGraph library to visualize graphs created in SQL Server 2017:

The possibility to use both technologies together is very interesting. Using graph objects we can store relationships between elements, for example, relationships between forum members. Using R scripts we can build a cluster graph from the stored graph information, illustrating the relationships in the graph.

The script below creates a database for our example with a subset of the objects used in my article and a few more relationship records between the forum members.

Click through for the script.

Tidygraph

Kevin Feasel

2017-07-07

Graph, R

Thomas Lin Pedersen announces tidygraph, a tidyverse library for dealing with graphs and trees in R:

One of the simplest concepts when computing graph based values is that of
centrality, i.e. how central is a node or edge in the graph. As this
definition is inherently vague, a lot of different centrality scores exists that
all treat the concept of central a bit different. One of the famous ones is
the pagerank algorithm that was powering Google Search in the beginning.
tidygraph currently has 11 different centrality measures and all of these are
prefixed with centrality_* for easy discoverability. All of them returns a
numeric vector matching the nodes (or edges in the case of
centrality_edge_betweenness()).

This is a big project and is definitely interesting if you’re looking at analyzing graph data.

Don’t Fear The Tidyverse

Kevin Feasel

2017-07-06

Learning, R

David Robinson explains why he prefers to explain the tidyverse version of R first rather than base R:

I’d summarize the two “competing” curricula as follows:

  • Base R first: teach syntax such as $ and [[]], loops and conditionals, data types (numeric, character, data frame, matrix), and built-in functions like ave and tapply. Possibly follow up by introducing dplyr or data.table as alternatives.
  • Tidyverse first: Start from scratch with the dplyr package for manipulating a data frame, and introduce others like ggplot2, tidyr and purrr shortly afterwards. Introduce the %>% operator from magrittr immediately, but skip syntax like [[]] and $ or leave them for late in the course. Keep a single-minded focus on data frames.

I’ve come to strongly prefer the “tidyverse first” educational approach. This isn’t a trivial decision, and this post is my attempt to summarize my opinions and arguments for this position. Overall, they mirror my opinions about ggplot2: packages like dplyr and tidyr are not “advanced”; they’re suitable as a first introduction to R.

I think this is the better position of the two, particularly for people who already have some experience with languages like SQL.

Neural Nets With R And Power BI

Leila Etaati continues her series on using neural nets in Power BI:

we are going to predict the concrete strength using neural network. neural network can be used for predict a value or class, or it can be used for predicting multiple items. In this example, we are going to predict a value, that is concrete strength.

I have loaded the data in power bi first, and in “Query Editor” I am going to write some R codes. First we need to do some data transformations. As you can see in the below picture number 2,3 and 4,data is not in a same scale, we need to do some data normalization before applying any machine learning. I am going to write a code for that (Already explained the normalization in post KNN). So to write some R codes, I just click on the R transformation component (number 5).

There’s a lot going on in this demo; check it out.

dbplyr

Kevin Feasel

2017-06-29

R

Hadley Wickham announces dbplyr version 1.1.0:

Since you’ve read this far, I also wanted to touch on RStudio’s vision for databases. Many analysts have most of their data in databases, and making it as easy as possible to get data out of the database and into R makes a huge difference. Thanks to the community, R already has strong tools for talking to the popular open source databases. But support for connecting to enterprise databases and solving enterprise challenges has lagged somewhat. At RStudio we are actively working to solve these problems.

As well as dbplyr and DBI, we are working on many other pain points in the database ecosystem. You’ll hear much more about these packages in the future, but I wanted to touch on the highlights so you can see where we are heading. These pieces are not yet as integrated as they should be, but they are valuable by themselves, and we will continue to work to make a seamless database experience, that is as good as (or better than!) any other environment.

There’s some very interesting vision talk at the end, showing how Wickham and the RStudio group are dedicated to enterprise-grade R.

Running H2O In R On Azure HDInsight

Daisy Deng shows how to configure HDInsight to be able to run the H2O package in R rather than Python or Scala:

We provide a few script actions for installing rsparkling on Azure HDInsight. When creating the HDInsight cluster, you can run the following script action for header node:

https://bostoncaqs.blob.core.windows.net/scriptaction/scriptaction-head.sh

And run the following action for the worker node:

https://bostoncaqs.blob.core.windows.net/scriptaction/scriptaction-worker.sh

Please consult Customize Linux-based HDInsight clusters using Script Action for more details.

Click through for the full process.

Categories

August 2017
MTWTFSS
« Jul  
 123456
78910111213
14151617181920
21222324252627
28293031