Press "Enter" to skip to content

Category: R

Generating Fake Data with R

Dave Mason takes a look at generating fake PII in R:

I’ve been thinking about R and how it can be used by developers, DBAs, and other SQL Server professionals that aren’t data scientists per se. A recent article about generating a data set of fake transactional data got me thinking about this again and I wondered, can R be used to obfuscate PII data?

In a word, yes. Well, mostly. (More on this in a bit.) As with anything R-related, there are probably multiple packages that are useful for any given task. For this one, I’ll focus on the “generator” package.

Click through to see what it does and Dave’s thoughts on the topic. It would also be possible to generate fake data in R by hitting a web API like Daniel Hutmacher’s service.

Comments closed

Fun with Palindromic Dates

Tomaz Kastrun has a bit of fun with the date February 2, 2020:

As of writing this blog-post, today is February 2nd, 2020. Or as I would say it, 2nd of February, 2020. There is nothing magical about it, it is just a sequence of numbers. On a boring Sunday evening, what could be more thrilling to look into this little bit further 🙂

Let’s kick R Studio and start writing a lot of useless stuff.

Tomaz also compares US versus EU palindromic dates and visualizes the different distributions.

Comments closed

Cloudera and R

Ian Cook shows us how Cloudera works for R users:

Our customers love it when they can use familiar syntax to work with data regardless of its size or its source. The popularity of sparklyr is a case in point: it enables R users to use either SQL or dplyr—both familiar to most R users—to work with large-scale data using Apache Spark. Two R packages developed at Cloudera—implyr and tidyquery—aim to provide this same choice of either SQL or dplyr when querying tables with Apache Impala and when manipulating R data frames.

The implyr package is new to me, but looks interesting.

Comments closed

Explaining Black Box Models with LIME

Holger von Jouanne-Diedrich takes us through the intuition of LIME:

There is a new hot area of research to make black-box models interpretable, called Explainable Artificial Intelligence (XAI), if you want to gain some intuition on one such approach (called LIME), read on!

Before we dive right into it it is important to point out when and why you would need interpretability of an AI. While it might be a desirable goal in itself it is not necessary in many fields, at least not for users of an AI, e.g. with text translation, character and speech recognition it is not that important why they do what they do but simply that they work.

In other areas, like medical applications (determining whether tissue is malignant), financial applications (granting a loan to a customer) or applications in the criminal-justice system (gauging the risk of recidivism) it is of the utmost importance (and sometimes even required by law) to know why the machine arrived at its conclusions.

One approach to make AI models explainable is called LIME for Local Interpretable Model-Agnostic Explanations. There is already a lot in this name!

LIME is not trivial to use and it can be very slow, but it is a great way to visualize models.

Comments closed

Generating Synthetic Data with R

Sidharth Macherla uses the conjurer package in R to generate synthetic data:

If you are building data science applications and need some data to demonstrate the prototype to a potential client, you will most likely need synthetic data. In this article, we discuss the steps to generating synthetic data using the R package ‘conjurer’. 

One of the toughest problems of generating data is making it look realistic enough. It’s one level of difficulty to build “steady-state” data, but if you want data to follow a combination of trend and random walk…that’s when things get dicey. H/T R-Bloggers

Comments closed

Concepts in Support Vector Machines

Abhijit Telang takes us through the calculations involved in Support Vector Machines and then gives us an example in R:

So, let’s take that out and we are back to old, classical vector algebra. It’s like a person with a bunch of sticks to figure out which one to lay where in a 2-D plane to separate one class of objects from another, provided class definitions are already known. 

The problem is which particular shape and length must be chosen to show maximum contrast between classes.

We need to arrive at a function definition, in such a way that the value a given function takes changes drastically (e.g. from a large positive value to a large negative value).

SVM is often great for two-class classification problems, and different variants also work well for multi-class problems.

Comments closed

Customizing Your Rprofile

Colin Gillespie shows how you can customize R via the .Rprofile file:

Every time R starts, it runs through a couple of R scripts. One of these scripts is the .Rprofile. This allows users to customise their particular set-up. However, some care has to be taken, as if this script is broken, this can cause R to break. If this happens, just delete the script!

Full details of how the .Rprofile works can be found in my book with Robin on Efficient R programming. However, roughly R will look for a file called .Rprofile first in your current working directory, then in your home area. Crucially, it will only load the first file found. This means you can have per project Rprofile.

Click through for a sample R profile which has a lot going on.

Comments closed

Simulating Feller’s Coin-Tossing Puzzle in R

David Robinson has another fun puzzle:

Mathematician William Feller posed the following problem:

If you flip a coin times, what is the probability there are no streaks of heads in a row?

Note that while the number of heads in a sequence is governed by the binomial distribution, the presence of consecutive heads is a bit more complicated, because the presence of a streak at various points in the sequence isn’t independent

Click through for a solution in R.

Comments closed

Cluster-Based Image Analysis and Reduction

Sebastian Sauer takes an image and reduces it to a group of colors:

This post is a remake of this casestudy: https://fallstudien.netlify.com/fallstudie_bildanalyse/bildanalyse

brought to you by Karsten Lübke.

The main purpose is to replace the base R command that Karsten used with a more tidyverse-friendly style. I think that’s easier (for me).

We will compute a cluster analysis to find the typical RGB color per cluster.

Click through for quite a bit of R code and a couple interesting turns.

Comments closed