6. Delete observations using head and tail functions
The head and tail functions can be used if we wish to delete certain observations from a variable, e.g. Sales. With a negative argument, tail(x, -30) deletes the first 30 rows, while head(x, -30) deletes the last 30 rows.
When it comes to using a variable edited in this way for calculation purposes, e.g. a regression, the as.matrix function is also used to convert the variable into matrix format:
Some of these tips are for people familiar with Excel but fairly new to R. These also use the base library rather than the tidyverse packages (e.g., using merge instead of dplyr’s join or as.date instead of lubridate). You may consider that a small negative, but if it is, it’s a very small one.
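For flavor, here is a minimal sketch of the head/tail trick and the as.matrix conversion; the data frame and column names are invented for illustration, not taken from the original post:

```r
# Toy data frame; Sales and Adverts are hypothetical columns.
df <- data.frame(Sales = rnorm(100), Adverts = rnorm(100))

tail(df$Sales, -30)   # drop the first 30 observations
head(df$Sales, -30)   # drop the last 30 observations

# For a regression on the trimmed series, convert to matrix format:
y <- as.matrix(head(df$Sales, -30))
x <- as.matrix(head(df$Adverts, -30))
fit <- lm(y ~ x)
```

Note that both trimmed vectors must cover the same rows for the regression to line up.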
There are many useful functions contained within the dplyr package. This post does not attempt to cover them all but does look at the major functions that are commonly used in data manipulation tasks. These are: select(), filter(), mutate(), group_by(), summarise(), arrange(), and join().
The data used in this post are taken from the UCI Machine Learning Repository and contain census information from 1994 for the USA. The dataset can be used for classification of income class in a machine learning setting and can be obtained here.
That’s probably the bare minimum you should know about dplyr, but knowing just these seven can make data analysis in R much easier.
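As a quick taste of how those seven verbs chain together, here is a sketch on a made-up stand-in for the census data (the column names are my own, not the dataset's actual schema):

```r
library(dplyr)

# Toy stand-in for the census data; columns are assumptions.
census <- data.frame(
  age       = c(39, 50, 38, 53),
  education = c("Bachelors", "Bachelors", "HS-grad", "11th"),
  stringsAsFactors = FALSE
)
labels <- data.frame(
  education = c("Bachelors", "HS-grad", "11th"),
  level     = c("Degree", "School", "School"),
  stringsAsFactors = FALSE
)

census %>%
  select(age, education) %>%               # pick columns
  filter(age > 35) %>%                     # pick rows
  mutate(near_retirement = age >= 50) %>%  # derive a column
  left_join(labels, by = "education") %>%  # join on a key
  group_by(level) %>%                      # group...
  summarise(mean_age = mean(age)) %>%      # ...and aggregate
  arrange(desc(mean_age))                  # sort the result
```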
The name “tribble” is short for “transposed tibble” (the transposed part referring to the change from column-wise creation in tibble() to row-wise creation in tribble()).
I like to use lightweight tribbles for two particular tasks:
Recoding: Create a tribble of, say, labels for a plot and join it onto a dataset.
Exclusion: Identify observations to exclude, and remove them with an anti-join.
I’ve been more used to data frames than tibbles, but this post shows some interesting things you can do with tibbles a lot more easily than with data frames. It’s enough to make me want to use tibbles more frequently. H/T R-bloggers
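The two tasks the author names, recoding via a join and exclusion via an anti-join, can be sketched like this (the data and lookup values are invented):

```r
library(tibble)
library(dplyr)

surveys <- tibble(respondent = 1:4, q1 = c("a", "b", "a", "c"))

# Recoding: a small lookup tribble, joined onto the dataset.
labels <- tribble(
  ~q1, ~q1_label,
  "a", "Agree",
  "b", "Neutral",
  "c", "Disagree"
)
surveys %>% left_join(labels, by = "q1")

# Exclusion: observations to drop, removed with an anti-join.
exclude <- tribble(~respondent, 2L, 4L)
surveys %>% anti_join(exclude, by = "respondent")
```

The row-wise tribble() syntax keeps the lookup table readable right in the script, which is the appeal over building the same thing column by column.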
In the last two posts (Part 1 and 2), I explained the main process of creating R custom visual packages in Power BI. There are some parts that still need improvement, which I will address in future posts. In this post, I am going to show different R charts that can be used in Power BI and when we should use them for particular types of data: the facet jitter chart, pie chart, polar scatter chart, multiple box plot, and column width chart. I follow the same process I used in Posts 1 and 2; I just change the R scripts and will explain how to use these graphs.
Leila includes several examples of chart types and shows that it’s pretty easy to get this working.
The possibility of using both technologies together is very interesting. Using graph objects, we can store relationships between elements, for example, relationships between forum members. Using R scripts, we can build a cluster graph from the stored graph information, illustrating the relationships in the graph.
The script below creates a database for our example with a subset of the objects used in my article and a few more relationship records between the forum members.
Click through for the script.
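The article builds the graph on the SQL Server side; on the R side, turning relationship records into a cluster graph might look something like this sketch using igraph (the edge list here is invented, standing in for rows pulled from the graph tables):

```r
library(igraph)

# Hypothetical forum-member relationships, one row per edge.
edges <- data.frame(
  from = c("Ann", "Ann", "Bob", "Cid", "Dan"),
  to   = c("Bob", "Cid", "Cid", "Dan", "Eve")
)

g <- graph_from_data_frame(edges, directed = FALSE)
communities <- cluster_louvain(g)  # community detection
plot(communities, g)               # cluster graph of forum members
```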
One of the simplest concepts when computing graph-based values is that of centrality, i.e. how central a node or edge is in the graph. As this definition is inherently vague, a lot of different centrality scores exist that all treat the concept of central a bit differently. One of the most famous is the PageRank algorithm that powered Google Search in the beginning. tidygraph currently has 11 different centrality measures, and all of these are prefixed with centrality_* for easy discoverability. All of them return a numeric vector matching the nodes (or edges, for the edge-based measures).
This is a big project and is definitely interesting if you’re looking at analyzing graph data.
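A minimal sketch of what those centrality_* functions look like in practice, computed on a random graph:

```r
library(tidygraph)
library(dplyr)

# Random graph via tidygraph's play_erdos_renyi() helper.
graph <- play_erdos_renyi(20, p = 0.2) %>%
  activate(nodes) %>%
  mutate(
    degree   = centrality_degree(),
    pagerank = centrality_pagerank()
  )

# Most central nodes first.
graph %>% as_tibble() %>% arrange(desc(pagerank))
```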
I’d summarize the two “competing” curricula as follows:
- Base R first: teach syntax such as [], loops and conditionals, data types (numeric, character, data frame, matrix), and built-in functions like tapply. Possibly follow up by introducing dplyr or data.table as alternatives.
- Tidyverse first: start from scratch with the dplyr package for manipulating a data frame, and introduce others like ggplot2, tidyr and purrr shortly afterwards. Introduce the %>% operator from magrittr immediately, but skip syntax like $ or leave it for late in the course. Keep a single-minded focus on data frames.
I’ve come to strongly prefer the “tidyverse first” educational approach. This isn’t a trivial decision, and this post is my attempt to summarize my opinions and arguments for this position. Overall, they mirror my opinions about ggplot2: packages like dplyr and tidyr are not “advanced”; they’re suitable as a first introduction to R.
I think this is the better position of the two, particularly for people who already have some experience with languages like SQL.
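To make the contrast concrete, here is the same grouped summary written both ways, using the built-in mtcars data:

```r
# Base-R-first style: indexing with $, subsetting, and tapply.
high_mpg <- mtcars[mtcars$mpg > 20, ]
tapply(high_mpg$mpg, high_mpg$cyl, mean)

# Tidyverse-first style: one pipeline, a data frame in and out.
library(dplyr)
mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))
```

The tidyverse version reads top to bottom as a sequence of verbs, which is much of the pedagogical argument; the base version requires understanding several distinct syntax ideas at once.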
We are going to predict concrete strength using a neural network. A neural network can be used to predict a value or a class, or it can be used to predict multiple items. In this example, we are going to predict a value, namely concrete strength.
I have loaded the data into Power BI first, and in the Query Editor I am going to write some R code. First we need to do some data transformation. As you can see in the picture below (numbers 2, 3 and 4), the data is not on the same scale, so we need to do some data normalization before applying any machine learning. I am going to write code for that (I already explained normalization in the KNN post). To write the R code, I just click on the R transformation component (number 5).
There’s a lot going on in this demo; check it out.
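The normalization step can be sketched as follows. Inside Power BI's R transformation, the incoming table is exposed as dataset; min-max scaling is one common choice, and this is my sketch rather than her exact code:

```r
# Min-max normalization: rescale each numeric column to [0, 1].
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# `dataset` is the name Power BI gives the input table in an R transform.
dataset <- as.data.frame(lapply(dataset, function(col) {
  if (is.numeric(col)) normalize(col) else col
}))
```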
Since you’ve read this far, I also wanted to touch on RStudio’s vision for databases. Many analysts have most of their data in databases, and making it as easy as possible to get data out of the database and into R makes a huge difference. Thanks to the community, R already has strong tools for talking to the popular open source databases. But support for connecting to enterprise databases and solving enterprise challenges has lagged somewhat. At RStudio we are actively working to solve these problems.
As well as dbplyr and DBI, we are working on many other pain points in the database ecosystem. You’ll hear much more about these packages in the future, but I wanted to touch on the highlights so you can see where we are heading. These pieces are not yet as integrated as they should be, but they are valuable by themselves, and we will continue to work toward a seamless database experience that is as good as (or better than!) any other environment.
There’s some very interesting vision talk at the end, showing how Wickham and the RStudio group are dedicated to enterprise-grade R.
We provide a few script actions for installing rsparkling on Azure HDInsight. When creating the HDInsight cluster, you can run the following script action for the head node:
And run the following action for the worker node:
Please consult Customize Linux-based HDInsight clusters using Script Action for more details.
Click through for the full process.