R – Page 64 – Curated SQL

More Scraping Web Pages

Published 2020-07-15 by Kevin Feasel

Dave Mason continues scraping web pages for fun and profit:

In the last post, we looked at a way to scrape HTML table data from web pages, and save the data to a table in SQL Server. One of the drawbacks is the need to know the schema of the data that gets scraped–you need a SQL Server table to store the data, after all. Another shortcoming is if there are multiple HTML tables, you need to identify which one(s) you want to save.
For this post, we’ll revisit web scraping with Machine Learning Services and R. This time, we’ll take a schema-less approach that returns JSON data. As before, this web page will be scraped: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players, the second is a listing of games played during the regular season.
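
To make the schema-less idea concrete, here is a minimal sketch in R. It is not Dave's code: the helper name `scrape_tables_as_json` is hypothetical, and it assumes the rvest and jsonlite packages are installed. It reads every HTML table on a page and serializes the lot to JSON, so no table has to be defined ahead of time.

```r
# Hypothetical helper; requires the rvest and jsonlite packages to be installed
scrape_tables_as_json <- function(url) {
  page <- rvest::read_html(url)

  # html_table() on a whole document returns one data frame per <table>,
  # so no table has to be picked out (or its columns known) in advance
  tables <- rvest::html_table(page)

  # Serialize every table, whatever its shape, into a single JSON string
  as.character(jsonlite::toJSON(tables, dataframe = "rows", pretty = TRUE))
}
```

Returning one JSON string sidesteps the need for a matching SQL Server table, at the cost of having to parse the JSON later.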

Click through to see how Dave manages this feat.

Understanding the Bayesian Nature of Kalman Filters

Published 2020-07-14 by Kevin Feasel

Holger von Jouanne-Diedrich gives us an interesting interpretation of Kalman filters:

The Kalman filter is a very powerful algorithm to optimally include uncertain information from a dynamically changing system to come up with the best educated guess about the current state of the system. Applications include (car) navigation and stock forecasting. If you want to understand how a Kalman filter works and build a toy example in R, read on!
The following post is based on the post “Das Kalman-Filter einfach erklärt” which is written in German and uses Matlab code (so basically two languages nobody is interested in any more 😉 ). This post is itself based on an online course “Artificial Intelligence for Robotics” by my colleague Professor Sebastian Thrun of Stanford University.
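
For a sense of what such a toy example might look like, here is a minimal one-dimensional sketch in base R. It is not Holger's code. The function names, measurements, motions, and noise variances are all illustrative assumptions. The measurement update is Bayes' rule applied to two Gaussians (prior belief times measurement likelihood), and the prediction step adds the expected motion along with its uncertainty.

```r
# Measurement update: combine the prior belief with a noisy measurement
kalman_update <- function(mu, var, z, meas_var) {
  gain <- var / (var + meas_var)   # Kalman gain: how much to trust the measurement
  list(mu = mu + gain * (z - mu), var = (1 - gain) * var)
}

# Prediction: shift the belief by the expected motion and add motion noise
kalman_predict <- function(mu, var, motion, motion_var) {
  list(mu = mu + motion, var = var + motion_var)
}

measurements <- c(5, 6, 7, 9, 10)  # illustrative noisy position readings
motions      <- c(1, 1, 2, 1, 1)   # illustrative movement after each reading
meas_var     <- 4                  # assumed measurement noise variance
motion_var   <- 2                  # assumed motion noise variance

belief <- list(mu = 0, var = 10000)  # deliberately vague initial belief
for (i in seq_along(measurements)) {
  belief <- kalman_update(belief$mu, belief$var, measurements[i], meas_var)
  belief <- kalman_predict(belief$mu, belief$var, motions[i], motion_var)
}
belief
```

Despite the vague starting belief, the estimate ends up close to position 11 with a variance of roughly 4, because each measurement shrinks the uncertainty and each motion step inflates it again.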

In fairness, I regret only one thing about learning German: that I’ve forgotten so much over the years.

R and the TIOBE Index

Published 2020-07-13 by Kevin Feasel

Alex Woodie notices a change in fortunes for R:

Don’t look now, but R, which some had written off as a language in terminal decline in lieu of Python’s immense and growing popularity, appears to be staging a furious comeback the likes of which IT has rarely seen.
According to the TIOBE Index, which tracks the popularity of programming languages (as expressed in Web searches), R has risen an unprecedented 12 spots, up from number 20 in the summer of 2019 to number 8 on its list today.

I’m happy to see this, as frankly, I think R’s a better language for statistical analysis and data visualization than Python and it’s not close. That’s the advantage of being a DSL: you get to focus on doing one or two things really well, and for R that’s statistical analysis and data visualization.

Web Page Scraping with R and ML Services

Published 2020-07-13 by Kevin Feasel

Dave Mason shows how you can scrape webpages with R and pull the resulting data into SQL Server using Machine Learning Services:

For this post, it might make more sense to skip ahead to the end result, and then work our way backwards. Here is a web page with some data: Boston Celtics 2016-2017. It shows two HTML tables (grids) of data for the Boston Celtics, a professional basketball team. The first grid lists the roster of players. We will scrape the web page, and write the data from the “roster” grid to a SQL Server table.

Read on for a demonstration of the process.

The Basics of Autoregressive Models

Published 2020-07-01 by Kevin Feasel

Holger von Jouanne-Diedrich explains some of the principles of autoregressive models through a demonstration:

Well, this seems to be good news for the sales team: rising sales! Yet, how does this model arrive at those numbers? To understand what is going on we will now rebuild the model. Basically, everything is in the name already: auto-regressive, i.e. a (linear) regression on (a delayed copy of) itself (auto from Ancient Greek self)!
So, what we are going to do is create a delayed copy of the time series and run a linear regression on it. We will use the lm() function from base R for that (see also Learning Data Science: Modelling Basics).
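
As a rough illustration of that idea (not Holger's actual code), the sketch below simulates an AR(1) series, builds a one-step delayed copy, and regresses the series on that copy with lm(). The series, the seed, and the coefficient 0.7 are assumptions chosen purely for demonstration.

```r
set.seed(42)

# Simulate a hypothetical AR(1) series: x[t] = 0.7 * x[t-1] + noise
n <- 500
x <- numeric(n)
for (t in 2:n) {
  x[t] <- 0.7 * x[t - 1] + rnorm(1)
}

# Pair each observation with its one-step delayed copy
current <- x[2:n]
lagged  <- x[1:(n - 1)]

# Autoregression is a linear regression of the series on its own lag
fit <- lm(current ~ lagged)
coef(fit)

# One-step-ahead forecast from the last observed value
next_value <- predict(fit, newdata = data.frame(lagged = x[n]))
next_value
```

The estimated slope on `lagged` should land near the 0.7 used to generate the data, which is the whole trick behind an order-1 autoregressive model.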

Read on for some additional understanding.

R 4.0.2 Available

Published 2020-06-26 by Kevin Feasel

David Smith notes important changes in R 4.0.2:

R 4.0.2 is now available for download for Windows, Mac and Linux platforms. This update addresses a few minor bugs included in the R 4.0.0 release, and also a significant bug introduced in R 4.0.1 on the Windows platform.

Read on for the rest of David’s report.

Using Specific R Package Versions in Docker Images

Published 2020-06-26 by Kevin Feasel

Roman Lustrik shares how to fix package versions in Docker images:

Using packages in R is easy. You install from CRAN using install.packages("packagename"), it resolves dependencies and you’re good to go. What R natively doesn’t handle so well is installing a particular package version without jumping through hoops. Technically you need the source file of the package version you want to install AND all source files of the dependencies (in the correct version, of course). This has been made almost seamless with the packages packrat and, more recently, renv.
This comes in handy when you are constructing a Docker file to run in production. Usually you want to run this defensively and do not want things to change from one image build to another. To get there, you can save all your package names and versions into a file (renv.lock) and use that to reconstruct the now defined package structure with predictable versions (see renv vignette here).
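
The lockfile workflow can be sketched roughly as follows. This is not Roman's code: the two wrapper names are hypothetical, and the sketch assumes the renv package is installed wherever the functions are called. The renv calls themselves (`init()`, `snapshot()`, `restore()`) are the package's standard entry points.

```r
# Hypothetical helper: run once on the development machine to pin versions.
# Requires the renv package.
pin_packages <- function() {
  renv::init()                    # create a project-local library
  renv::snapshot(prompt = FALSE)  # record package names and versions in renv.lock
}

# Hypothetical helper: run during the Docker image build, with renv.lock
# already copied into the project directory. Requires the renv package.
restore_packages <- function() {
  renv::restore(prompt = FALSE)   # reinstall exactly what renv.lock records
}
```

Because the image build only ever calls the restore step, rebuilding the image months later should yield the same package versions, provided those versions are still downloadable.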

This is quite useful as R package developers tend not to covet backwards compatibility, and one of the key benefits of containers is to have the option to keep the same code base and configuration in all environments.

Using INLA for Spatial Regression in R

Published 2020-06-25 by Kevin Feasel

Lionel Hertzog continues a series on spatial regression:

INLA is a package that allows fitting a broad range of models; it uses the Laplace approximation to fit Bayesian models much, much faster than algorithms such as MCMC. INLA allows for fitting geostatistical models via stochastic partial differential equations (SPDE). A good place for more background information on this is these two gitbooks: spde-gitbook and inla-gitbook.

This is not the gentlest introduction, so if you’re new to the concept go back and read part 1.

Simulating Data from a Gamma Distribution

Published 2020-06-24 by Kevin Feasel

Sebastian Sauer takes us through generating data which follows a Gamma distribution:

A Gamma distribution is useful for modeling positive, right skewed data such as waiting times; it is a continuous distribution.
In this post, we’ll illustrate some properties of the Gamma distribution by simulating a toy example.
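
A minimal base R sketch of such a simulation might look like the following. The shape and rate values, the seed, and the sample size are arbitrary assumptions, not taken from Sebastian's post. It draws from a Gamma distribution with rgamma() and compares the sample moments with the theoretical mean (shape / rate) and variance (shape / rate^2).

```r
set.seed(123)

shape <- 2     # assumed shape parameter
rate  <- 0.5   # assumed rate parameter (scale = 1 / rate)

# Draw a hypothetical sample of waiting times
waits <- rgamma(10000, shape = shape, rate = rate)

# Compare sample moments with the theoretical values
c(sample_mean = mean(waits), theory_mean = shape / rate)
c(sample_var  = var(waits),  theory_var  = shape / rate^2)

# All draws are positive, and the right skew shows up as mean > median
all(waits > 0)
mean(waits) > median(waits)
```

A quick `hist(waits)` makes the long right tail visible.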

Click through for the example.

Data Visualization in R

Published 2020-06-19 by Kevin Feasel

Dan Fitton provides an introductory overview to several visualization tools in R:

The other way to communicate data with R is to produce an interactive dashboard or web application within R using Shiny. Whereas Markdown reports are most useful for explanatory analysis, Shiny, in my opinion, is useful for exploratory data analysis. This is when you want to display information for investigative purposes, allowing the user to gain greater familiarity by having the ability to interact with data, filter it, and dig deeper into the underlying details.
Shiny is incredibly flexible, providing the user the capability of turning their R code and objects, including tables, plots, and analysis, into a comprehensive and interactive web page or app, without requiring a fully-fledged web development skillset. Although there is a steep learning curve, the freedom and precision Shiny brings means that for the most part you are limited only by your skillset rather than the tool itself.

I’ve seen some really useful Shiny dashboards. Dan is right that there can be a lot of work put into getting them right, but if you do, the results can be outstanding.

Category: R