R – Page 50 – Curated SQL

And indeed, I worked with highly-skilled data scientists who had a very sharp understanding of statistics. But after years of designing and analyzing experiments, I grew dissatisfied with the way we communicated results to decision-makers. I felt that the over-reliance on p-values led to sub-optimal decisions. After talking to colleagues in other companies, I realized that this was a broader problem, and I set out to write a guide to better data analysis. In this article, I’ll present one of the biggest recommendations of the book, which is to ditch p-values and use Bootstrap confidence intervals instead.

I’m a committed Bayesian (or at least a Bayesian who should be committed—depends on who you ask), so I’d consider this a big step forward.
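
As a rough illustration of the idea in the excerpt, here is a minimal Python sketch of a percentile bootstrap confidence interval for a difference in means. The sample data, the function name `bootstrap_ci`, the 90% level, and the 5,000 resamples are all assumptions made for this sketch and do not come from the linked article.

```python
import random
import statistics


def bootstrap_ci(control, treatment, n_resamples=5000, level=0.90, seed=42):
    """Percentile bootstrap interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original sizes.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    # Cut off an equal share of resampled differences in each tail.
    alpha = (1.0 - level) / 2.0
    lower = diffs[int(alpha * n_resamples)]
    upper = diffs[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


# Hypothetical experiment data (made up for this sketch).
control = [10.1, 9.8, 10.4, 10.0, 9.7, 10.3, 9.9, 10.2]
treatment = [10.6, 10.9, 10.2, 11.0, 10.7, 10.4, 10.8, 10.5]

low, high = bootstrap_ci(control, treatment)
print(f"90% bootstrap CI for the difference in means: [{low:.2f}, {high:.2f}]")
```

The interval gives a decision-maker a range of plausible effect sizes, which is usually easier to act on than a single p-value.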

When to Start Using a Database with R or Python

Published 2021-11-12 by Kevin Feasel

Roel Hogervorst thinks about data sizes in R and Python:

Your dataset becomes so big and unwieldy that operations take a long time. How long is too long? That depends on you. I get annoyed if I don’t get feedback within 20 seconds (and I love it when a program shows me a progress bar at that point, at least I know how long it will take!); your boundary may lie at some other point. When you reach that point of annoyance, or the point of no longer being able to do your work, you should improve your workflow.
I will show you how to do some speedups by using other R packages, by moving from pandas to polars in Python, or by leveraging databases. I see some hesitancy about moving to a database for analytical work, and that is too bad. Bad for two reasons: one, it is super simple; two, it will save you a lot of time.

I definitely agree with Roel’s bottom line here. Granted, part of that is domain knowledge, but databases are extremely good at handling data and both languages have plenty of database accessibility.

One last tip, though: if you’re on the data science or data analytics track, learn SQL. Yes, libraries like dbplyr in R or ORMs in Python can cover up a lot, but that comes at a cost, typically in terms of performance. Building these skills will make your life considerably easier.
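
To make the point concrete, here is a minimal sketch of pushing an aggregation into a database with plain SQL, using Python's built-in `sqlite3` module. The `sales` table and its rows are hypothetical, and an in-memory SQLite database stands in for whatever database you would actually use.

```python
import sqlite3

# In-memory SQLite database; a real workflow would point at a file or server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 20.0), ("east", 50.0)],
)

# Let the database do the grouping and return only the small summary,
# instead of pulling every row into memory and aggregating there.
query = """
    SELECT region, COUNT(*) AS n, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY region
"""
summary = conn.execute(query).fetchall()
conn.close()

for region, n, total in summary:
    print(region, n, total)
```

Only the summary rows cross the boundary between the database and the analysis session, which is where most of the time savings come from on large tables.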

Voronoi Diagrams with R and x11()

Published 2021-11-01 by Kevin Feasel

Tomaz Kastrun creates a Voronoi diagram:

Yes. Finally, Voronoi diagrams with the use of the x11() function. This diagram is a presentation of a plane that is partitioned every time a user clicks on the x11 canvas. The plane is partitioned into smaller regions that are close to a given set of points.
Partitioning into smaller regions or convex polygons happens in such a manner that each polygon contains only one generating point and every point in a given polygon is closer to its generating point than to any other.

I had to take a look out of curiosity, and yes, the x11() function does work on Windows as well.
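
The defining property in the excerpt (every point belongs to the region of its closest generating point) can be sketched without any plotting at all. The following Python snippet is a brute-force illustration on a small integer grid, not the approach from the linked post. The site coordinates and the function names are hypothetical, and ties go to the site listed first.

```python
import math


def nearest_site(point, sites):
    """Return the index of the generating point closest to `point`."""
    return min(range(len(sites)), key=lambda i: math.dist(point, sites[i]))


def voronoi_grid(sites, width, height):
    """Label every integer grid cell with the index of its nearest site."""
    return [[nearest_site((x, y), sites) for x in range(width)] for y in range(height)]


# Hypothetical generating points, standing in for mouse clicks on the canvas.
sites = [(1, 1), (6, 2), (3, 5)]
grid = voronoi_grid(sites, width=8, height=7)

# Print the grid as text; each digit is the index of the owning site.
for row in grid:
    print("".join(str(label) for label in row))
```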

Showing Off R’s plot Function

Published 2021-10-19 by Kevin Feasel

Tomaz Kastrun unleashes the power of R’s native plotting function:

The plot() function is R’s most generic function for plotting different types of graphs. And making an animation of sample graphs with it is as useless as it can be useful for educational purposes.

Click through for the code to build an animated image of various plots you can draw in R without importing any other libraries.

Building a D3 Visualization in R

Published 2021-10-13 by Kevin Feasel

The Jumping Rivers team show how to create a D3 visual in R:

D3.js, or just D3 as it’s more often referred to, is a JavaScript library used for creating interactive data visualisations optimised for the web. D3 stands for Data-Driven Documents. It is commonly used by those who enjoy making creative or otherwise unusual visualisations as it offers you a great deal of freedom as well as options for interactivity such as animated transitions and plot zooming.

Click through for the blog post and also check out the associated GitHub repo. D3 is an incredibly powerful framework, but is almost as complex as it is powerful.

Word Stemming and Text Processing in R

Published 2021-10-07 by Kevin Feasel

Genrikh Ananiev takes us through some examples of text processing in R:

First, there are a lot of classes (in fact, as many classes as you have products). And if in this process you have to work not only with the company’s products but also with competitors’ products, new classes can appear every day, so it becomes pointless to train a model once and reuse it repeatedly to predict new products.
Secondly, the number of documents (different variations of the same product) per class is not well balanced: a class may contain a single document, or it may contain more.

Click through for an example of the classical technique versus a classification-based technique.

Working with Wide Data in R

Published 2021-10-05 by Kevin Feasel

Andrew Collier works with some wide data:

The concept of “wide data” is relative. In some domains 100 columns is considered “wide”, while in others that’s perfectly normal and you’d need to have thousands (or tens of thousands!) of columns for it to be considered even remotely “wide”. The data that we work with at Fathom Data generally lies in the first domain, but from time to time we do work on data that is considerably wider.
This post touches on a couple of approaches for dealing with that sort of data. We’ll be using some HCRIS (Healthcare Cost Report Information System) data, which are available for download here. Specifically, we’ll be working with an extract from the hcris2552_10_2017.csv file, which contains “select variables in flat shape”.

Click through for one example which has 1700 columns. H/T R-Bloggers.

Constraint Programming with R and MiniZinc

Published 2021-10-01 by Kevin Feasel

Holger von Jouanne-Diedrich solves a classic puzzle:

The following puzzle is a well-known meme in social networks. It is said to have been invented by young Einstein and back in the days I was ambitious enough to solve it by hand (you should try too!).
Yet, even simpler is to use Constraint Programming (CP). An excellent choice for doing that is MiniZinc, a free and open-source constraint modelling language. And the best thing is that you can control it by R! If you want to see how, read on!

I’d solved it once by hand as well, but here we get to see a much easier route. Constraint-based programming is one of those things which doesn’t show up very often in the business world, but I think part of the reason is that most programming languages lack the capacity to implement constraints really well. It could also be that people are usually pretty mushy about laying out proper constraints.
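
For a sense of what "laying out proper constraints" looks like, here is a tiny Python sketch. It is a made-up three-house puzzle, not the Einstein puzzle from the linked post, and it uses plain generate-and-test over permutations rather than a real constraint solver such as MiniZinc. The names and clues are hypothetical.

```python
from itertools import permutations

COLOURS = ("red", "green", "blue")
OWNERS = ("Ann", "Bob", "Cat")


def satisfies(colours, owners):
    """Check the toy constraints for one candidate assignment of houses 0..2."""
    ann, bob, cat = (owners.index(name) for name in OWNERS)
    return (
        colours[ann] == "red"  # Ann lives in the red house
        and colours.index("green") + 1 == colours.index("blue")  # green is just left of blue
        and colours[bob] == "blue"  # Bob lives in the blue house
        and abs(ann - cat) == 1  # Cat lives next door to Ann
    )


# Try every ordering of colours and owners and keep the ones that pass.
solutions = [
    (colours, owners)
    for colours in permutations(COLOURS)
    for owners in permutations(OWNERS)
    if satisfies(colours, owners)
]

for colours, owners in solutions:
    print(list(zip(owners, colours)))
```

Brute force is fine for 36 candidates. A constraint solver earns its keep when the search space is far too large to enumerate, because it prunes assignments as soon as a constraint is violated.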

What is Parquet and Why Use It?

Published 2021-09-29 by Kevin Feasel

The folks at Jumping Rivers explain what the Parquet file format is and how you can use it in R:

Apache Parquet is a popular column storage file format used by Hadoop systems, such as Pig, Spark, and Hive. The file format is language independent and has a binary representation. Parquet is used to efficiently store large data sets and has the extension .parquet. This blog post aims to understand how parquet works and the tricks it uses to efficiently store data.

Read on for that explanation and plenty of sample code.
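
The excerpt does not say which tricks are meant. Two that columnar formats such as Parquet are known to use are dictionary encoding and run-length encoding, and the toy Python sketch below shows why they work well once values are stored column by column. This is an illustration of the ideas only, not Parquet's actual on-disk layout. The rows and function names are hypothetical.

```python
from itertools import groupby

# Hypothetical rows, as they might arrive from a CSV file.
rows = [
    ("NL", 2021), ("NL", 2021), ("NL", 2021),
    ("US", 2021), ("US", 2021), ("DE", 2021),
]

# Column layout: store each column's values together instead of row by row.
country = [r[0] for r in rows]


def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary = list(dict.fromkeys(values))  # distinct values, first-seen order
    lookup = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [lookup[v] for v in values]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)
print(dictionary, codes, runs)
```

Repeated values sit next to each other in a column, so both encodings shrink the data far more than they could on row-oriented storage.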

Rolling Means with MazamaRollUtils

Published 2021-09-28 by Kevin Feasel

Jonathan Callahan has an interesting R package for us:

The initial release of MazamaRollUtils provides all the basic rolling functions with features like alignment and missing value removal along with additional capabilities for smoothing, damping and outlier detection, all common activities in time series analysis.

Click through for an explanation of the process, and then check out the package itself on GitHub. H/T R-Bloggers.
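
To show what alignment and missing value removal mean for a rolling function, here is a small standalone Python sketch. It is not the MazamaRollUtils API. The function name `rolling_mean` and its parameters are hypothetical, and `math.nan` stands in for a missing value.

```python
import math


def rolling_mean(values, width, align="center", na_rm=True):
    """Rolling mean over a list of floats; math.nan marks missing values."""
    if width < 1 or width > len(values):
        raise ValueError("width must be between 1 and len(values)")
    # How far the window starts before the current index for each alignment.
    offsets = {"left": 0, "center": width // 2, "right": width - 1}
    start_offset = offsets[align]
    result = []
    for i in range(len(values)):
        start = i - start_offset
        stop = start + width
        if start < 0 or stop > len(values):
            result.append(math.nan)  # incomplete window at the edge
            continue
        window = values[start:stop]
        if na_rm:
            # Drop missing values and average whatever is left in the window.
            window = [v for v in window if not math.isnan(v)]
        result.append(sum(window) / len(window) if window else math.nan)
    return result


series = [1.0, 2.0, math.nan, 4.0, 5.0, 6.0]
print(rolling_mean(series, width=3, align="center"))
print(rolling_mean(series, width=3, align="right"))
```

With `align="right"` each result only uses the current and earlier values, which is the usual choice when the rolling mean must not look into the future.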
