Press "Enter" to skip to content

Category: R

The Intuition Behind Averaging

The Stats Guy takes a look at averages:

In this diagram, there are a bunch of numbers and a single question mark. Behind the question, is also a number. The known numbers are the same as in our friend v above.

Our task is as follows:

– Make a guess on what that mystery number could be. And,
– If we can’t get it right, then reduce, as much as possible, the error we incur on our guess.

This is a well-written explanation of an important concept. H/T R-Bloggers

Comments closed

Building an Azure Function in R

David Smith has a demo for us:

It’s important to note that the model prediction is not being generated by the Shiny app: rather, it’s being generated by an Azure Function running R in the cloud. That means you could integrate the model estimate into any application written in any language: a mobile app, or an IoT service, or anything that can call an HTTP endpoint. Furthermore, you don’t need to worry how many apps are running or how often estimates will be requested by the app: Azure Functions will automatically scale to meet the demand as needed.

Read the whole thing. Given that R isn’t naturally supported by Azure Functions, I think this is quite interesting.

Comments closed

sparklyr 1.5 Released

Yitao Li announces version 1.5 of sparklyr:

A large fraction of pull requests that went into the sparklyr 1.5 release were focused on making Spark dataframes work with various dplyr verbs in the same way that R dataframes do. The full list of dplyr-related bugs and feature requests that were resolved in sparklyr 1.5 can be found in here.

In this section, we will showcase three new dplyr functionalities that were shipped with sparklyr 1.5.

Read on to learn more about this update. H/T R-Bloggers

Comments closed

Coalesce in SQL and R

John MacKintosh gives us a primer on the COALESCE function in both SQL and R:

What does coalesce mean? In the English language, it is generally used to convey a coming together, or creating one whole body, mass or system. How does that help us when working with data? We spend a lot of time cleaning our data, surely the last thing we want to do is lump it all together?

Click through for detail on the nuances of COALESCE(). H/T R-Bloggers.

Comments closed

Web Scraping in SQL Server Machine Learning Services

Rajendra Gupta shows us how we can use SQL Server Machine Learning Services and the R programming language to perform website scraping:

You can manually copy data from a website; however, if you regularly use it for your analysis, it requires automation. For this automation, usually, we depend on the developers to read the data from the website and insert it into SQL tables.

SQL Machine Learning language helps you in web scrapping with a small piece of code. In the previous articles for SQL Server R scripts, we explored the useful open-source libraries for adding new functionality in R.

Read on for a demo.

Comments closed

Thoughts on R’s New Pipe

John Mount has thoughts on the upcoming pipe operator in R:

There is a current active discussion on this prototype and some interesting points come up. Note the current proposal appears to disallow a |> f -> f(a), a currently popular transform.

1. This is a language feature presented as a soon-to-be-user-visible prototype, not an RFC.
2. Some are objecting to the term “pipe.”
3. Some call this sort of pipe function composition.
4. It is noticed that this sort of substitution is generally thought of as a “macro.”
5. There is a claim the proposed pipe seems to violate the beta-reduction rule of the lambda calculus: variables should be substitutable for values.

Read on for John’s take on this. I particularly appreciate his response to point number 2: other functional languages have pipes (in fact, |> is the F# pipe operator). Pipes are not unique to UNIX. John has a lot of interesting comments, so check them out.

Comments closed

ETL with R in SQL Server

Rajendra Gupta shows one reason for using R inside SQL Server:

Data professionals get requests to import, export data into various formats. These formats can be such as Comma-separated data(.CSV), Excel, HTML, JSON, YAML, Tab-separated data(.TSV). Usually, we use SQL Server integration service ETL packages for data transformations, import or export data.

SQL Machine Learning can be useful in dealing with various file formats. In the article, External packages in R SQL Server, we explored the R services and the various external packages for performing tasks using R scripts.

In this article, we explore the useful Rio package developed by Thomas J. Leeper to simplify the data export and import process.

One thing I’d like to reiterate is that even though you’re using R to move this data, you don’t need to perform any data science activities on the data—R can be the easiest approach for getting and cleaning up certain types of data.

Comments closed

Stochastic Processes in R

David Robinson takes us through simulation of a random walk in R:

What’s fun about this problem is that it’s an example of a random walk: a stochastic process made up of a sequence of random steps (in this case, left or right). What makes this a fun variation is that it’s a random walk in a circle- passing 5 to the left is the same as passing 15 to the right. I wasn’t previously familiar with a random walk in a circle, so I approached it through simulation to learn about its properties.

Click through for a simulation. Or 50,000 of them.

Comments closed

Understanding Decision Trees

Ram Tavva walks us through the algorithm to design decision trees:

A decision tree is made up of several nodes:

1.Root Node: A Root Node represents the entire data and the starting point of the tree. From the above example the
First Node where we are checking the first condition, whether the movie belongs to Hollywood or not that is the
Rood node from which the entire tree grows
2.Leaf Node: A Leaf Node is the end node of the tree, which can’t split into further nodes.
From the above example ‘watch movie’ and ‘Don’t watch ‘are leaf nodes.
3.Parent/Child Nodes: A Node that splits into a further node will be the parent node for the successor nodes. The
nodes which are obtained from the previous node will be child nodes for the above node.

Read on for an example of implementation in R.

Comments closed

The Effects of Undersampling and Oversampling on Predicted Probability

Bryan Shalloway has an interesting article for us:

In classification problems, under and over sampling techniques shift the distribution of predicted probabilities towards the minority class. If your problem requires accurate probabilities you will need to adjust your predictions in some way during post-processing (or at another step) to account for this.

Bryan has a clear example showing this problem in action.

Comments closed