Press "Enter" to skip to content

Category: R

Multiple Code Panes in R Studio

Tomaz Kastrun has good news for us:

On R studio home page, make sure to download the version 2021.09 Preview (as of writing of this blogpost, this is still in preview) and install this version on your client machine (supported windows machine, MacOS and Linux).

Once installation is completed, head to global options (Tools->Global options) and select Pane Layout. You will have a new set of buttons available (Add Column; Remove Column). With Add column an additional pane will be added to layout.

It’s not as convenient as the right-click –> “Split horizontally” or “Split vertically” that we get in tools like SSMS and VS Code, but I’m happy to see this change in R Studio.

Comments closed

Creating Fireworks with R

Tomaz Kastrun is ready for Silvester:

New Year’s eve is almost here and what best way to celebrate with fireworks. Snap, pop, crack, boom. This is the most peaceful, animal friendly, harmless, eco, children friendly, no-fire-needed, educative and nifty fireworks.

To get the fireworks, fire up the following R function.

I mean, but I enjoy fire… Though you could launch these in R and save the good stuff for the 4th of July.

Comments closed

Wrapping up a Spark Advent Calendar

Tomaz Kastrun did it: 25 posts in 25 days on Spark. Part 23 looks at Delta Live Tables:

Delta Live Tables is a framework for building reliable, maintainable, and testable data processing pipelines. User defines the transformations to be performed on the datasources and data, and the framework manages all the data engineering tasks: task orchestrations, cluster management, monitoring, data quality, and event error handling.

Delta Live Tables framework helps and manages how data is being transformed with help of target schema and can is a slight different experience with Databricks Tasks (with Apache Spark tasks in the background).

Part 24 takes us through a bit of visualization:

You can use any of the popular Python packages to do the visualisation; Plotly, Dash, Seaborn, Matplotlib, Bokeh, Leather, Glam, to name the couple and many others. Once the data is persisted in dataframe, you can use any of the packages. With the use of PySpark, plugin the Matplotlib. Here is an example

And part 25 wraps things up with links to additional resources:

To wrap up this year’s Advent of Spark 2021 – series of blogposts on Spark – it is essential to look at the list of additional learning resources for you to continue with this journey. Let’s divide this list not by type of the resource (book, on-line documentation, on-line courses, articles, Youtube channels, Discord channels, and others) but rather divide them by language flavour. Scala/Spark, R, and Python.

Great job on Tomaz’s part for gutting it out.

Comments closed

Diving into Spark Streaming

Tomaz Kastrun continues a series on Spark and is well into a section on Spark Streaming. Part 17 looks at watermarks:

Streaming data is considered as continuously ingested data with particular frequency and latency. It is considered “big data” and data that has no discrete beginning nor end.

The primary goal of any real-time stream processing system is to process the streaming data within a window frame (or considered this as frequency). Usually this frequency is “as soon as it arrives”. On the other hand, latency in streaming processing model is considered to have the means to work or deal with all the possible latencies (one second or one minute) and provides an end-to-end low latency system. If frequency of data analysing is on user’s side (destination), latency is considered on the device’s side (source).

Part 18 enumerates the supported types of windows:

Tumbling windows are fixed sized and static. They are non-overlapping and are contiguous intervals. Every ingested data can be (must be) bound to a singled window.

Sliding windows are also fixed sized and also static. Windows will overlap when the duration of the slide is smaller than the duration of the window. Ingested data can therefore be bound to two or more windows

Session windows are dynamic in size of the window length. The size depends on the ingested data. A session starts with an input and expands if the following input expands if the next ingested record has fallen within the gap duration.

Part 19 includes good information on how data engineers can work with streams of data:

Streaming data can be used in conjunction with other datasets. You can have Joining streaming data, joining data with watermarking, deduplication, outputting the data, applying foreach logic, using triggers and creating Stream API Tables.

All of the functions are available in Python, Scala and Java and some are not available with R. We will be focusing on Python and R.

Check out all three of these posts.

Comments closed

Spark DataFrames

Tomaz Kastrun continues a series on working with Apache Spark. Part 10 looks at the DataFrame construct:

We have looked in datasets and seen that a dataset is distributed collection of data. A dataset can be constructer from JVM object and later manipulated with transformation operations (e.g.: filter(), Map(),…). API for these datasets are also available in Scala and in Java. But in both cases of Python and R, you can also access the columns or rows from datasets.

On the other hand, dataframe is organised dataset with named columns. It offers much better optimizations and computations and still resembles a typical table (as we know it from database world). Dataframes can be constructed from arrays or from matrices from variety of files, SQL tables, and datasets (RDDs). Dataframe API is available in all flavours: Java, Scala, R and Python and hence it’s popularity.

Part 11 looks at external R and Python packages and DataFrame support:

When you install Spark, the extension of not only languages but also other packages, systems is huge. For example with R, not only that you can harvest the capabilities of distributed and parallel computations, you can also extend the use of R language.

Part 12 gets into Spark SQL:

Spark SQL is a one of the Spark modules for structured data processing and analysing. Spark provides Spark SQL and also API for execution of SQL queries. Spark SQL can read data from Hive instance, but also from datasets and dataframe. The communication between Spark SQL and execution engine will always result in a dataset or datafrane.

These formats are interchangeable. So interacting with SQL against result from a different API is possible, respectively. Plugging in the Java JDBD or standard ODBC drivers will also give your SQL interface access to different sources. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.

With API unification, user can access Spark SQL using Scala spark-shell, using Python pyspark or using R sparkR shell.

DataFrames are so popular that they’ve become the de facto standard for working with data in Spark, and .NET languages only work with DataFrames, not with the raw RDDs.

Comments closed

Learning about RDDs in Spark

Tomaz Kastrun continues a series on Spark. Part 7 ties in R and gives us sample plotting in R and Python:

Let’s look into the local use of Spark. For R language, sparklyr package is availble and for Python pyspark is availble.

Part 8 gets us into the key data structure behind Spark’s success, the Resilient Distributed Dataset:

Spark is created around the concept of resilient distributed datasets (RDD). RDD is a fault-tolerant collection of files that can be used in parallel. RDDs can be created in two ways:
– parallelising an existing data collection in driver program
– referencing a datasets in external storage (HDFS, blob storage, shared filesystem, Hadoop InputFormat,…)

In a simple way, Spark RDD has two opeartions:
– transformations – creating a new RDD dataset on top of already existing one with the last transformation
– actions – to the action, and return a value to the driver program after running a computation on the dataset.

Part 9 looks a bit more at transformations and actions:

Two types of operations are available with RDD; transformations and actions. Transformations are lazy operations, meaning that they prepare the new RDD with every new operation but now show or return anything. We can say, that transformations are lazy because of updating existing RDD, these operations create another RDD. Actions on the other hand trigger the computations on RDD and show (return) the result of transformations.

Most modern work in Spark won’t directly use RDDs, though everything is built on top of them and it’s good to understand the foundation even if you don’t need to write all of those map(), fold(), and reduceByKey() operations yourself.

Comments closed

Monotonic Constraints on Random Forests

Michael Mayer has some interesting R and Python code for us:

On ML competition platforms like Kaggle, complex and unintuitively behaving models dominate. In this respect, reality is completely different. There, the majority of models do not serve as pure prediction machines but rather as fruitful source of information. Furthermore, even if used as prediction machine, the users of the models might expect a certain degree of consistency when “playing” with input values.

A classic example are statistical house appraisal models. An additional bathroom or an additional square foot of ground area is expected to raise the appraisal, everything else being fixed (ceteris paribus). The user might lose trust in the model if the opposite happens.

One way to enforce such consistency is to monitor the signs of coefficients of a linear regression model. Another useful strategy is to impose monotonicity constraints on selected model effects.

Certain types of regression algorithm make this easy, but random forest? Not so much. That’s where Michael steps in.

Comments closed

Replacing p-values with Bootstrapped Confidence Intervals

Florent Buisson has an interesting post on avoiding p-value calculations:

And indeed, I worked with highly-skilled data scientists who had a very sharp understanding of statistics. But after years of designing and analyzing experiments, I grew dissatisfied with the way we communicated results to decision-makers. I felt that the over-reliance on p-values led to sub-optimal decisions. After talking to colleagues in other companies, I realized that this was a broader problem, and I set up to write a guide to better data analysis. In this article, I’ll present one of the biggest recommendations of the book, which is to ditch p-values and use Bootstrap confidence intervals instead.

I’m a committed Bayesian (or at least a Bayesian who should be committed—depends on who you ask), so I’d consider this a big step forward.

Comments closed