R – Page 74 – Curated SQL

PDF Scraping with R

Published 2019-09-26 by Kevin Feasel

Jennifer Cooper shows how you can use R to scrape data from a text-based PDF:

Here’s a diagram of the workflow I used:
1. Start with PDF
2. Use tabulizer to extract tables
3. Clean up data into “tidy” format using tidyverse (mainly dplyr)
4. Visualize trends with ggplot2

Read on for more detail on each step in the process. H/T R-Bloggers.

Comments closed

Record Transformation with cdata

Published 2019-09-24 by Kevin Feasel

John Mount shows off one of the advantages of using cdata to define data-driven record transformation specifications:

We have a tutorial on how to design such transforms by writing down the shape your incoming data records are arranged in, and also the shape you wish your outgoing data records to be arranged in.
This simple data transform is in fact not a single pivot/un-pivot, as the result records spread data-values over multiple rows and multiple columns at the same time. We call the transform simple, because from a user point of view: it takes records of one form to another form (with the details left to the implementation).

Read the whole thing.

Comments closed

Linear Regression in Power BI

Published 2019-09-23 by Kevin Feasel

Joseph Yeates shows how to implement linear regression in Power BI:

The goal of a simple linear model is to fit a line onto this plot to summarize the shape of the data using the equation above.
The “a” value is the slope of the fitted line (rise over run) and the “b” value is the intercept on the y-axis (when x is equal to zero).
In the gapminder example, the life expectancy column was assigned as the “y” variable, as it is the outcome that we are interested in predicting or understanding. The year1950 column was assigned as the “x” variable, as it is what we are using to try and measure the change in life expectancy.

This is a little more complicated than adding a regression line to a scatterplot (the “normal” way to do linear regression with Power BI) but this method lets you work with the outputs in a way that the normal method doesn’t.

Comments closed

WVPlots

Published 2019-09-17 by Kevin Feasel

Nina Zumel announces a new version of WVPlots on CRAN:

WVPlots was originally a catch-all package of ggplot2 visualizations that we at Win-Vector tended to use repeatedly, and wanted to turn into “one-liners.” A consequence of this is that the older visualizations had our preferred color schemes hard-coded in. More recent additions to the package sometimes had palette or color controls, but not in a consistent way. Making color controls more consistent has been a “todo” for a while—one that I’d been putting off. A recent request from user Brice Richard (thanks Brice!) has pushed me to finally make the changes.

Click through to see what’s changed and for an example vignette.

Comments closed

Icon Maps in R

Published 2019-09-13 by Kevin Feasel

Laura Ellis shows how you can build maps full of little icons:

That was ok, but we should try to make the images more aesthetically pleasing using the magick package. We make each image transparent with the image_transparent() function. We can also make the resulting image a specific color with image_colorize().
I then saved the images using the image_write() function. I manually re-uploaded them to GH.

This was a great example of where laying icons on a map works.

Comments closed

R User Salaries By Country

Published 2019-09-11 by Kevin Feasel

Capri Granville shares a chart showing a box plot of salaries for professional R users by country:

Interesting analysis done in R, about salaries of R developers broken down by country, featuring salary range and median salary.
The dataset consists of survey answers from nearly 90,000 respondents. About 5,000 of them reported using R for “extensive development work over the past year”. The first filter used reduces the dataset from 88,883 respondents to 5,048. The second filter excludes students, hobby programmers and former developers. This reduces the dataset to 4,047 respondents. The third filter excludes unemployed and retired respondents and the dataset is further reduced to 3,871 respondents. Finally, we exclude respondents from an unknown country and respondents with unknown or zero salary.

Check out Tomaz Weiss’s detailed post which dives into these numbers for United States respondents.

Comments closed

Python and R Data Reshaping

Published 2019-09-10 by Kevin Feasel

John Mount takes us through a couple of data shaping packages:

The advantages of data_algebra and cdata are:
– The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
– The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
– The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).

Let’s re-work a small R cdata example, using the Python package data_algebra.

Click through for the example.

Comments closed

Exploratory Data Analysis with ExPanDaR

Published 2019-09-04 by Kevin Feasel

Joachim Gassen walks us through the ExPanDaR package in R:

The ‘ExPanDaR’ package offers a toolbox for interactive exploratory data analysis (EDA). You can read more about it here. The ‘ExPanD’ shiny app allows you to customize your analysis to some extent but often you might want to continue and extend your analysis with additional models and visualizations that are not part of the ‘ExPanDaR’ package.
Thus, I am currently developing an option to export the ‘ExPanD’ data and analysis to an R Notebook. While it is not ready for CRAN yet, it seems to work reasonably well and I would love to see some people trying it and letting me know about any bugs or other issues that they encounter. Hence, this blog post.

Looks like an interesting package. H/T R-bloggers

Comments closed

Calculating Consistency of Ratings

Published 2019-09-03 by Kevin Feasel

Sebastian Sauer looks at computing reliability between raters:

Computing inter-rater reliability is a well-known, albeit maybe not very frequent task in data analysis. If there’s only one criteria and two raters, the proceeding is straigt forward; Cohen’s Kappa is the most widely used coefficient for that purpose. It is more challenging to compare multiple raters on one criterion; Fleiss’ Kappa is one way to get a coefficient. If there are multiple criteria, one way is to compute the mean of multiple Fleiss’ coefficients.
However, a different way, and the way presented in this post, consists of checking of all raters agree on one given item (and repeating that for all items). If rater A assigns two tags/criteria (tag1, tag2) to item A, then the other raters may not assign different tags (eg tag3, tag4) to that item, if a match should be scored. Note that this proceeding allows for different numbers of tags/criteria for the items (eg., item 1 with only 1 tag, but item 2 with 3 tags etc.). However, our grading should give some points, if, say, rater1 assigns tag1 and tag2, but raters 2 and 3 only assign tag1.

Read the whole thing.

Comments closed

Update to ggraph

Published 2019-09-03 by Kevin Feasel

Thomas Lin Pedersen has an update to ggraph:

If you are new to ggraph, a short description follows: It is an extension of ggplot2 that implement an extended grammar for relational data (e.g. trees and networks). It provides a huge variety of geoms for drawing nodes and edges, along with an assortment of layouts making it possible to produce a very wide range of network visualization types. It is to my knowledge the most feature packed network visualization framework available in R (and potentially in other languages as well), all building on top of the familiar ggplot2 API. If you want to learn more I invite you to browse the new pkgdown website that has been made available.

It looks really nice.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Category: R