R – Page 135 – Curated SQL

Custom Power BI Shapes Using R

Published 2017-02-08 by Kevin Feasel

Koen Verbeeck uses R to create dynamically changing images in Power BI:

You can insert images into Power BI Desktop, but these are static images. If you want them to dynamically change, you need the Image Viewer custom visual. Unfortunately, it doesn’t support measures, only columns. Since we want dynamic changes, fixed column values are not going to work. Someone proposed a workaround on the Power BI forums, but this only works if you have a fixed set of attributes you want to slice on (for example, 4 categories). I want a totally flexible solution (e.g. each month we have a couple of new weeks to slice on), so again, not possible.

The only solution I could think of that would still work: using R visuals.

Read on for the solution.

Using Sparklyr To Analyze Flight Data

Published 2017-02-07 by Kevin Feasel

Aki Ariga uses sparklyr on Apache Spark 2.0 to analyze flight data living in S3:

Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark a backend for dplyr. You can create tidy data from huge messy data, plot complex maps from this big data the same way as small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and more easily on large-scale data. Let’s try!

You can see the R Markdown of this analysis on RPubs. With RStudio, you can share R Markdown easily on RPubs.

Sparklyr is an exciting technology for distributed data analysis.

RevoScaleR

Published 2017-02-07 by Kevin Feasel

Tomaz Kastrun explains how the RevoScaleR package is useful:

The RevoScaleR package and its computational functions were designed for parallel computation without memory limitations, mainly because the package introduced its own file format, called XDF. The eXternal Data Frame format was designed for fast processing of smaller chunks of data, and it gains its efficiency when reading and writing XDF data by loading chunks of data into RAM one at a time, and only what is needed. Done this way, the size of RAM is no longer a limit, and computations run much faster (because the algorithms are written in C++, which is faster than the originals, which were written in an interpreted language). Data scientists still make a single R call, but R will use the DistributedR component to determine how many cores, sockets, and threads are available, then launch a smaller portion of the load on each thread and analyze the data a bit at a time. With XDF, data is retrieved many times, but since it is 5-10 times smaller (as I have already shown in previous blog posts when compared to *.txt or *.csv files), and since it is written and stored in the XDF file the same way it was extracted from memory, it enables faster computations: no parsing of data chunks is required, and the way the data is stored minimizes its retrieval time.

If you’re using SQL Server R Services, these rx functions will become very important to you.
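To make the chunk-at-a-time idea concrete, here is a minimal sketch in Python rather than R. It is not how RevoScaleR or XDF is implemented; the names `read_in_chunks` and `chunked_mean` are hypothetical, and a generator stands in for a data source too large to load at once. The point is only that a summary statistic can be built from per-chunk partial results, so memory use depends on the chunk size and not on the size of the data.

```python
from typing import Iterable, Iterator, List


def read_in_chunks(values: Iterable[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield successive chunks so only one chunk is held in memory at a time."""
    chunk: List[float] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter chunk
        yield chunk


def chunked_mean(values: Iterable[float], chunk_size: int = 1000) -> float:
    """Compute a mean by accumulating per-chunk sums and counts."""
    total = 0.0
    count = 0
    for chunk in read_in_chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no data")
    return total / count


# Hypothetical data: a generator standing in for a file too large to load at once.
data = (float(i) for i in range(1, 10_001))
mean_value = chunked_mean(data, chunk_size=256)
```

Here `mean_value` is the mean of the numbers 1 through 10,000, computed without ever materializing the full sequence. The same accumulate-then-combine pattern is what lets separate chunks be handed to separate threads.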

Superheat

Published 2017-02-06 by Kevin Feasel

David Smith shows off a very cool heatmap package called superheat:

While the superheat package uses the ggplot2 package internally, it doesn’t itself follow the grammar of graphics paradigm: the function is more like a traditional base R graphics function with a couple of dozen options, and it creates a plot directly rather than returning a ggplot2 object that can be further customized. But as long as the options cover your heatmap needs (and that’s likely), you should find it a useful tool next time you need to represent data on a grid.

The superheat package apparently works with any R version after 3.1 (and I can confirm it works on the most recent, R 3.3.2). This arXiv paper provides some details and several case studies, and you can find more examples here. Check out the vignette for detailed usage instructions, and download it from its GitHub repository linked below.

Click through for some great-looking examples.

Data Frame Serialization In R

Published 2017-02-03 by Kevin Feasel

David Smith shows a new contender for serializing data frames in R, fst:

And now there’s a new package to add to the list: the fst package. Like the data.table package (the fast data.frame replacement for R), the primary focus of the fst package is speed. The chart below compares the speed of reading and writing data to/from CSV files (with fwrite/fread), feather, fst, and the native R RDS format. The vertical axis is throughput in megabytes per second (more is better). As you can see, fst outperforms the other options for both reading (orange) and writing (green).

These early numbers look great, so this is a project worth keeping an eye on.

10,000 R Packages

Published 2017-01-31 by Kevin Feasel

David Smith notes that CRAN is now up to 10,000 packages:

Having so many packages available can be a double-edged sword though: it can take some searching to find the package you need. Luckily, there are some resources available to help you:

MRAN (the Microsoft R Application Network) provides a search tool for R packages on CRAN.
To find the most popular packages, Rdocumentation.org provides a leaderboard of packages by number of downloads. It also provides lists of newly-released and recently-updated packages.

R is a big language; having good heuristics for figuring out where to find appropriate packages is extremely important.

Principal Component Analysis Using R

Published 2017-01-26 by Kevin Feasel

Francisco Lima explains what principal component analysis is and shows how to do it in R:

Three lines of code and we see a clear separation among grape vine cultivars. In addition, the data points are evenly scattered over relatively narrow ranges in both PCs. We could next investigate which parameters contribute the most to this separation and how much variance is explained by each PC, but I will leave it for pcaMethods. We will now repeat the procedure after introducing an outlier in place of the 10th observation.

PCA is extremely useful when you have dozens of contributing factors, as it lets you narrow in on the big contributors quickly.
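As a rough illustration of "how much variance is explained by each PC", here is a small sketch in Python (not the R code from the linked post). It handles only two variables, where the eigenvalues of the covariance matrix have a closed form; the function name `pca_2d` and the sample data are made up for this example.

```python
import math
from statistics import fmean
from typing import List, Tuple


def pca_2d(xs: List[float], ys: List[float]) -> Tuple[List[float], List[float]]:
    """Return (eigenvalues, explained-variance ratios) for two variables."""
    n = len(xs)
    mx, my = fmean(xs), fmean(ys)
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (sxx + syy) / 2
    radius = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    eigenvalues = [half_trace + radius, half_trace - radius]
    # Each eigenvalue is the variance along one principal component.
    total = sum(eigenvalues)
    ratios = [ev / total for ev in eigenvalues]
    return eigenvalues, ratios


# Made-up data: ys is roughly twice xs, so one component should dominate.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
eigenvalues, ratios = pca_2d(xs, ys)
```

With two strongly correlated variables, nearly all of the variance lands on the first component. With dozens of variables the same idea applies, but the eigenvalues come from a numerical routine rather than a closed form.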

R Visuals In Power BI

Published 2017-01-25 by Kevin Feasel

Ryan Wade ties ggplot2 visuals into Power BI:

The package that we are going to use to develop our custom visualization is ggplot2. The ggplot2 package is arguably the most popular data visualization package in R. It is based on the “grammar of graphics” concept that was created by the statistician Leland Wilkinson. The ggplot2 package allows you to approach creating charts and graphs in the same manner that Bob Ross approached painting trees in the forest. With ggplot2 you are able to start with a blank canvas and add layers upon layers via short code snippets that build on each other until you end up with the desired visualization.

The pbix file that is being used in this blog can be found here: http://bit.ly/2jwoCyP. The GentleIntroToR_ChartExample.pbix file contains an example of using R to create a box plot chart that shows the distribution of player scores for the L.A. Lakers. Chiclet slicers were added that allow you to filter by division and/or opponent. The R visualization was created in four steps.

Check out the PBIX file.

Shredding Excel With R

Published 2017-01-18 by Kevin Feasel

John MacKintosh shows how to use R for wrangling + ETL:

I had over 140 files to process. That’s not usually a big deal – I normally use SQL Server Integration Services to loop through network folders, connect to hundreds of spreadsheets and extract the source data.

But this relies on the data being in a tabular format (like a dataframe or database table).

A quick glance at the first few sheets confirmed I could not use this approach – the data was not in tabular format. Instead it was laid out in a format suited to viewing the data on screen – with the required data scattered in different ranges throughout each sheet (over 100 rows and many columns). It wasn’t going to be feasible to point SSIS at different locations within each sheet. (It can be done, but it’s pretty complex and I didn’t have time to experiment).

The other challenge was that over time, changes to design meant that data moved location e.g. dates that were originally in cell C2 moved to D7, then moved again as requirements evolved. There were 14 different templates in all, each with subtle changes. Each template was going to need a custom solution to extract the data.

This is a good look at how R can be about more than “just” statistical analysis.
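One way to picture the "each template needs its own solution" problem is a lookup from template version to cell addresses. The sketch below is in Python, not the R approach from the linked post. Only the move of the date cell from C2 to D7 comes from the excerpt; the template names, the `total` field, its addresses, and the dictionaries standing in for worksheets are all hypothetical.

```python
from typing import Dict

# Hypothetical layouts: for each template version, where each field lives.
# Only the C2 -> D7 move of the date cell is taken from the excerpt.
TEMPLATE_LAYOUTS: Dict[str, Dict[str, str]] = {
    "v1": {"report_date": "C2", "total": "F10"},
    "v2": {"report_date": "D7", "total": "H12"},
}


def extract_fields(sheet: Dict[str, object], template: str) -> Dict[str, object]:
    """Pull the scattered cells for one template into a flat, tabular record."""
    layout = TEMPLATE_LAYOUTS[template]
    return {field: sheet.get(cell) for field, cell in layout.items()}


# Dictionaries of cell address -> value stand in for real worksheets.
old_sheet = {"C2": "2016-11-30", "F10": 42}
new_sheet = {"D7": "2016-12-31", "H12": 57}

rows = [
    extract_fields(old_sheet, "v1"),
    extract_fields(new_sheet, "v2"),
]
```

Both sheets end up as records with the same field names, which is the tabular shape a loader expects. Supporting another template then means adding one more layout entry instead of writing new extraction logic.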

Support Vector Machines In R

Published 2017-01-18 by Kevin Feasel

Deepanshu Bhalla explains what support vector machines are:

The main idea of a support vector machine is to find the optimal hyperplane (a line in 2D, a plane in 3D, and a hyperplane in more than 3 dimensions) which maximizes the margin between two classes. In this case, the two classes are red and blue balls. In layman’s terms, it is finding the optimal separating boundary to separate two classes (events and non-events).

Deepanshu then goes on to implement this in R.
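For reference, the "maximize the margin" idea is usually written as the standard hard-margin optimization problem below. This is the textbook formulation for linearly separable data, not something taken from the linked post; the symbols are assumptions of this sketch, with class labels $y_i \in \{-1, +1\}$, weight vector $w$, and offset $b$.

```latex
% Separating hyperplane: w^T x + b = 0
% Margin width between the two classes: 2 / ||w||
\min_{w,\,b} \; \frac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, n
```

Minimizing $\lVert w \rVert$ is the same as maximizing the margin $2 / \lVert w \rVert$, and the constraints keep every training point on the correct side of the boundary.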

Category: R