
Category: R

Learning R Or Python?

David Smith tackles the age-old question:

If your interests lean more towards traditional statistical analysis and inference as used within industries like manufacturing, finance, and the life sciences, I’d lean towards R. If you’re more interested in machine learning and artificial intelligence applications, I’d lean towards Python. But even that’s not a hard-and-fast rule: R has excellent support for machine learning and deep learning frameworks, and Python is often used for traditional data science applications.

One thing I am quite sure of though: neither Python nor R is inherently better than the other, and arguments on that front are ultimately futile. (Trust me, I’ve been there.) Which is better for any given person depends on a wide variety of factors, and for some, it may even be worthwhile to learn both. Brian Ray recently posted a good overview of the factors that may lead you towards R or Python for data science: their history, the community, performance, third-party support, use cases, and even how to use them together. It’s great food for thought if you’re trying to decide which community to invest in.

Embrace the power of “and.” The whole R versus Python bit is fun for purposes of arguing with people, but they’re both powerful languages and we’re seeing more and more overlap. For example, the Keras package David mentions runs Python’s TensorFlow under the covers.


Graphics In R

David Smith is following the kerfuffle that Edward Tufte unleashed on Twitter recently:

While graphics guru Edward Tufte recently claimed that “R coders and users just can’t do words on graphics and typography” and need additional tools to make graphics that aren’t “clunky”, data journalists at major publications beg to differ. The BBC has been creating graphics “purely in R” for some time, with a typography style matching that of the BBC website. Senior BBC Data Journalist Christine Jeavans offers several examples, including this chart of life expectancy differences between men and women:

I think Tufte’s off base here.


Counting Arguments In R

Neil Saunders shares methods for interrogating argument lists in R:

“Some R functions have an awful lot of arguments”, you think to yourself. “I wonder which has the most?”

It’s not an original thought: the same question as applied to the R base package is an exercise in the Functions chapter of the excellent Advanced R. Much of the information in this post came from there.

There are lots of R packages. We’ll limit ourselves to those packages which ship with R, and which load on startup. Which ones are they?

It’s a fun exercise and helpful for learning a bit more about how to work with arguments when metaprogramming in R.
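
For a rough idea of what that looks like, here is a minimal sketch in base R. The helper names `n_args` and `count_args` are my own, not Neil's, and the approach is only loosely modeled on the Advanced R exercise: it counts formals for every function in one attached package and sorts the result.

```r
# Number of formal arguments of a single function.
# Primitives such as `sum` have no formals(), so they count as 0 here.
n_args <- function(f) length(formals(f))

# Count arguments for every function in an attached package
# (hypothetical helper; assumes the package is on the search path).
count_args <- function(pkg) {
  env  <- as.environment(paste0("package:", pkg))
  objs <- mget(ls(env, all.names = TRUE), envir = env)
  funs <- Filter(is.function, objs)
  sort(vapply(funs, n_args, integer(1)), decreasing = TRUE)
}

# The functions in base with the longest argument lists
counts <- count_args("base")
head(counts, 5)
```

Swapping "base" for "stats" or "utils" covers the other packages that load on startup.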


Analyzing Federal Reserve Data With Ordinary Least Squares

Sam Shum has a tutorial walking us through extracting and analyzing data from the St. Louis Federal Reserve’s FRED economic database:

Download specific macroeconomic data from FRED St. Louis economic databases and ETL the data. Many other data series can be found at the FRED’s website.

# get unemployment data time series from FRED St. Louis
dfunrate <- get_fred_series("UNRATE", "unrate", observation_start = startdate, observation_end = enddate)

# get University of Michigan consumer sentiment index data time series from FRED St. Louis
dfumcsent <- get_fred_series("UMCSENT", "umcsent", observation_start = startdate, observation_end = enddate)

# combine the two time series data into one data frame
dfall <- cbind(dfunrate,dfumcsent)

# strip or remove redundant month field from data downloaded from FRED St. Louis
dfall <- dfall[,c(1,2,4)]

# build an index vector 1..n, one entry per row of the data frame
mdx <- seq_len(nrow(dfall))

# convert FRED date field from string to R's date type
dfall$date <- as.Date(dfall$date)

There’s a nice chart builder on the FRED website too, but it’s good to be able to grab the data on your own.
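
The excerpt stops before the regression itself, so here is a hedged sketch of what the OLS step might look like. The numbers are simulated placeholders, not real FRED observations, and the column names `date`, `unrate`, and `umcsent` are assumptions carried over from the excerpt. It also joins the two series with `merge()` on the date rather than `cbind()`, which is a swap on my part to avoid relying on row order.

```r
set.seed(123)

# Simulated stand-ins for the two downloaded series (NOT real FRED data)
dates     <- seq(as.Date("2010-01-01"), by = "month", length.out = 60)
dfunrate  <- data.frame(date = dates, unrate = runif(60, min = 4, max = 10))
dfumcsent <- data.frame(
  date    = dates,
  umcsent = 100 - 3 * dfunrate$unrate + rnorm(60, sd = 2)
)

# Join on date so a missing month in either series cannot misalign rows
dfall <- merge(dfunrate, dfumcsent, by = "date")

# Ordinary least squares: consumer sentiment regressed on unemployment
fit <- lm(umcsent ~ unrate, data = dfall)
summary(fit)$coefficients
```

With real data, `summary(fit)` gives the slope, standard errors, and R-squared to judge how well unemployment explains sentiment.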


Converting Factors To Numbers In R

Sebastian Sauer shows us a pitfall of brute-force conversion of factors to integers:

Oh no! That’s not what we wanted! R has messed the thing up (?). The reason is that R sees the first factor level internally as the number 1. The second level as number 2. What’s the first factor level in our case? Let’s see:

factor(tips$sex) %>% head()
#> [1] Female Male   Male   Male   Female Male  
#> Levels: Female Male
factor(tips$sex_r) %>% head()
#> [1] 1 0 0 0 1 0
#> Levels: 0 1

That’s confusing: “0” is the first level of sex_r – internally for R represented by “1”. The second level of sex_r is “1” – internally represented by “2”.

Fortunately, we get the easy answer at the end of the post.
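
For reference, one common way around this (which may or may not be the exact fix Sebastian lands on) is to go through the level labels rather than the internal codes. A small sketch, using a made-up vector in place of `tips$sex_r`:

```r
# Made-up stand-in for tips$sex_r: a factor whose labels are "0" and "1"
sex_r <- factor(c(1, 0, 0, 0, 1, 0))

# Brute-force conversion returns the internal level codes, not the labels
codes <- as.integer(sex_r)

# Converting to character first recovers the original numbers
via_character <- as.numeric(as.character(sex_r))

# Equivalent: index the numeric levels by the factor
via_levels <- as.numeric(levels(sex_r))[sex_r]

codes          # 2 1 1 1 2 1
via_character  # 1 0 0 0 1 0
```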


Parallelizing Linear Regression With MapReduce

Arthur Charpentier shows us the math behind using MapReduce to parallelize a linear regression:

Sometimes, with big data, matrices are too big to handle, and it is possible to use tricks to numerically still do the map. Map-Reduce is one of those. With several cores, it is possible to split the problem, to map on each machine, and then to aggregate it back at the end.

Arthur gives us an interesting example in R to boot.
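
To make the idea concrete, here is a minimal sketch of one standard way to do this. It is not necessarily Arthur's exact approach, and the data is simulated. The map step computes the cross-products X'X and X'y on each chunk of rows, and the reduce step sums them before solving the normal equations.

```r
set.seed(1)
n <- 1000
X <- cbind(1, rnorm(n), rnorm(n))               # design matrix with intercept
y <- as.numeric(X %*% c(2, -1, 0.5) + rnorm(n)) # simulated response

# Split the row indices into 4 chunks, as if each lived on its own worker
chunks <- split(seq_len(n), rep(1:4, length.out = n))

# Map: each chunk contributes its own X'X and X'y
mapped <- lapply(chunks, function(idx) {
  Xi <- X[idx, , drop = FALSE]
  list(xtx = crossprod(Xi), xty = crossprod(Xi, y[idx]))
})

# Reduce: the sums of the pieces equal the full-data cross-products
xtx <- Reduce(`+`, lapply(mapped, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(mapped, `[[`, "xty"))

beta_mr <- as.numeric(solve(xtx, xty))

# Sanity check against a single-pass fit on all of the data
beta_lm <- as.numeric(coef(lm(y ~ X - 1)))
all.equal(beta_mr, beta_lm)
```

The pieces being summed are only p-by-p and p-by-1, so no single worker ever needs to hold the full matrix.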


Granting Non-Admin Users Access To Run ML Services

Niels Berglund walks through the rights needed for a non-administrative user to execute an external script using SQL Server Machine Learning Services:

Oops, something did go wrong, as it turns out that if you try to grant permissions on extended stored procedures, which SPEES (sp_execute_external_script) is, you need to do it from the master database. Cool, let us switch to master and do it there. Well, if you try to do that – then you get another error: the user does not exist in master, sigh!

At this stage you have a couple of options:

  • Add the login for the user to the sysadmin role, or the user to the db_owner role in the actual database. No, do not do that, I am only kidding! Do.Not.Do.That!

  • Create the user in master and grant the permission. That would work.

  • Grant the permission to public.

Check it out, as there are two parts to the process.


Using DALEX To Explain Black-Box Models

Przemyslaw Biecek explains that there’s more than LIME for explaining black-box models:

I’ve heard about a number of consulting companies that decided to use a simple linear model instead of a black-box model with higher performance, because “the client wants to understand the factors that drive the prediction.”
And usually the discussion goes as follows: “We have tried LIME for our black-box model. It is great, but it is not working in our case.” “Have you tried other explainers?” “What other explainers?”

So here you have a map of different visual explanations for black-box models.

Check out DALEX, which includes a Jupyter notebook example. H/T R-Bloggers
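
As a taste of what a model-agnostic explainer does, here is a base R sketch of permutation variable importance. This is not DALEX's API: the helpers `rmse` and `perm_importance` are hypothetical, the data is simulated, and a plain `lm` stands in for the black box. The idea is to shuffle one variable at a time and see how much the error grows.

```r
set.seed(42)
n   <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
dat$y <- 3 * dat$x1 + 0.5 * dat$x2 + rnorm(n, sd = 0.1)

# Stand-in for a black-box model; anything with a predict() method would do
model <- lm(y ~ x1 + x2 + noise, data = dat)

# Root mean squared error of the model on a data frame with a `y` column
rmse <- function(model, data) {
  sqrt(mean((data$y - predict(model, newdata = data))^2))
}

# Increase in RMSE after shuffling each variable in turn
perm_importance <- function(model, data, vars) {
  base <- rmse(model, data)
  vapply(vars, function(v) {
    shuffled <- data
    shuffled[[v]] <- sample(shuffled[[v]])
    rmse(model, shuffled) - base
  }, numeric(1))
}

imp <- perm_importance(model, dat, c("x1", "x2", "noise"))
sort(imp, decreasing = TRUE)
```

Variables the model leans on heavily produce a large jump in error when shuffled, while irrelevant ones barely move it.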


Comparing Keras In Python Versus R

Dmitry Kisler performs image classification using Keras in both Python and R:

From the plots above, one can see that:

  • the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only train accuracy, but the model doesn’t have high variance and the bias accuracy is around 99% as well).

  • even though 10 measurements may not be convincing, Python would reduce (by up to 15%) the time required to train your CNN model. This is somewhat expected because R uses Python under the hood when it executes Keras functions.

This is just one example, but the results are about what I’d expect.
