
Category: Data Science

Network Analysis in R via netUtils

David Schoch has an R package for us:

During the last 5 years, I have accumulated various scripts with (personal) convenience functions for network analysis, and I also implemented new methods from time to time which I could not find in any other package in R. The package netUtils gathers all these functions and makes them available for anyone who may also need to apply “non-standard” network analytic tools. In this post, I will briefly highlight some of the most prominent functions of the package. All available functions are listed in the README on GitHub.

Click through to see what’s available in the package. H/T R-Bloggers.


What’s in a Name?

Benjamin Smith analyzes a name change:

Recently, RStudio announced its name change to Posit. For many, this name change was accepted with open arms, but for some, not so much. Being the statistician that I am, I decided to post a poll on LinkedIn to see the sentiment of my network. After running the poll for a week, the results were in:

Read on for the responses as well as an analysis using RStan.
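As a rough illustration of the kind of Bayesian analysis one might run on poll results with RStan (a minimal sketch with made-up counts, not Benjamin's actual model or data), a simple beta-binomial model for the share of respondents approving of the name change could look like this:

library(rstan)

stan_code <- "
data {
  int<lower=0> n;                  // total poll respondents
  int<lower=0, upper=n> y;         // respondents approving of the name change
}
parameters {
  real<lower=0, upper=1> theta;    // underlying approval rate
}
model {
  theta ~ beta(1, 1);              // flat prior
  y ~ binomial(n, theta);
}
"

poll <- list(n = 100, y = 62)      # hypothetical counts, not the real poll
fit <- stan(model_code = stan_code, data = poll,
            chains = 4, iter = 2000, refresh = 0)
print(fit, pars = "theta")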


Anomaly Detection over Delta Live Tables

Avinash Sooriyarachchi and Sathish Gangichetty show off an interesting scenario:

Anomaly detection poses several challenges. The first is the data science question of what an ‘anomaly’ looks like. Fortunately, machine learning has powerful tools to learn how to distinguish usual from anomalous patterns in data. In the case of anomaly detection, it is impossible to know what all anomalies look like, so it’s impossible to label a data set for training a machine learning model, even if resources for doing so are available. Thus, unsupervised learning has to be used to detect anomalies, where patterns are learned from unlabelled data.

Even with the perfect unsupervised machine learning model for anomaly detection figured out, in many ways the real problems have only begun. What is the best way to put this model into production so that each observation is ingested, transformed, and finally scored with the model as soon as the data arrives from the source system, in near real time or at short intervals, e.g., every 5-10 minutes? This involves building a sophisticated extract, load, and transform (ELT) pipeline and integrating it with an unsupervised machine learning model that can correctly identify anomalous records. This end-to-end pipeline also has to be production-grade, always running while ensuring data quality from ingestion to model inference, and the underlying infrastructure has to be maintained.

Click through to see their solution using Databricks and Delta Lake.
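The Databricks post builds a full streaming pipeline; as a language-neutral illustration of the core unsupervised idea (learn what “usual” looks like from unlabelled data, then score each arriving record against it), here is a minimal base-R sketch using Mahalanobis distance rather than their actual model:

set.seed(42)
historical <- matrix(rnorm(2000), ncol = 2)    # unlabelled "usual" behaviour
center <- colMeans(historical)
covmat <- cov(historical)

score_batch <- function(new_rows, threshold = qchisq(0.999, df = 2)) {
  d2 <- mahalanobis(new_rows, center, covmat)  # squared distance from "usual"
  data.frame(new_rows, score = d2, is_anomaly = d2 > threshold)
}

incoming <- rbind(matrix(rnorm(10), ncol = 2), c(6, -5))  # last row is far out
score_batch(incoming)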


Understanding Decision Trees

Durgesh Gupta provides a primer on the humble decision tree:

A decision tree is a graphical representation of all possible solutions to a decision.

The objective of using a decision tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from the training data.

It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.

The way I like to describe decision trees, especially to developers, is that a tree is a set of if-else statements which leads to a conclusion. The nice part about decision trees is that once you understand how they work, you’re halfway to gradient boosting (e.g., XGBoost) and random forests.
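To make the if-else framing concrete, here is a minimal sketch with the rpart package (my own illustration, not from Durgesh's post): printing the fitted tree shows the learned splits as a nested set of rules.

library(rpart)

# Each internal node is effectively an if-else test on one feature;
# each leaf is a predicted class.
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)
predict(fit, iris[c(1, 51, 101), ], type = "class")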


Worrying over Columns or Rows

John Mount explains an attitude difference:

I say: if you are a data scientist or working on an analytics project, worry over columns, not rows.

In analytics “rows” are instances, and “columns” are possible measurements. For example: each click on a website might generate a row recording the visit, and this row would be populated with columns describing what was clicked on (and if you are lucky there are more records recording what else was presented and not clicked on).

Read the whole thing. This is also why formats like Parquet and ORC are so popular for data analysis. The same goes for business intelligence people, who reason mostly over columns, which is why columnstore indexes are so useful.
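As a small illustration of why column orientation helps (a sketch using the arrow package, not from John's post): with Parquet you can pull just the measurements you care about instead of reading every row in full.

library(arrow)

clicks <- data.frame(user_id  = 1:5,
                     clicked  = c("a", "b", "a", "c", "b"),
                     duration = c(12, 30, 7, 45, 22))
write_parquet(clicks, "clicks.parquet")

# Read back only two of the three columns
read_parquet("clicks.parquet", col_select = c(user_id, clicked))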


Mapping Income vs Rent in Counties

Rick Pack updates a package to support a project:

I am happy to announce a contribution to the biscale package that makes printing shorter labels using SI prefixes (e.g., 1,000,003 => 1M and 1,324 => 1.3k) far easier. This makes printing an attractive legend easier, although you can tell by the picture above that I still struggle with optimal uses of the cowplot package’s draw_plot(). I would love for the legend and map to be centered under the title.

The new si_levels argument for bi_class_breaks() takes a logical value of TRUE or FALSE, as either a single-element or two-element vector; a single-element vector causes the specified value to be applied to both the X and Y variables. This matches Prener’s convenient functionality for the number-of-digits argument dig_lab, as he requested in the GitHub issue I created for this addition. Note that si_levels rounds the input number, if appropriate, based on the digits indicated by dig_lab, which defaults to 3.

Click through to get access to the update, as well as to see some of the visuals Rick put together with it.
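A hypothetical call shape, assuming a biscale build that includes Rick's si_levels contribution (the example data and remaining arguments follow the package's usual stl_race_income example; the exact argument layout beyond si_levels and dig_lab is my assumption):

library(biscale)

# si_levels = TRUE applies SI prefixes (e.g., 1,324 -> 1.3k) to both axes;
# dig_lab (default 3) controls the rounding before the prefix is added.
bi_class_breaks(stl_race_income,
                x = pctWhite, y = medInc,
                style = "quantile", dim = 3,
                si_levels = TRUE, dig_lab = 3)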


Extracting Numbers from a Stacked Density Plot

Derek Jones digs into an image:

A month or so ago, I found a graph showing the percentage of PCs having a given range of memory installed, between March 2000 and April 2020, on a TechTalk page of PC Matic; it had the form of a stacked density plot. This kind of installed-memory data is rare; how could I get the underlying values (a previous post covers extracting data from a heatmap)?

Read on for an interesting attempt at reverse-engineering the original numbers used to create an image. H/T R-Bloggers.
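One way such reverse engineering can start (a generic sketch with the png package and a hypothetical file name, not necessarily Derek's approach): in a stacked density/area chart, the share of each memory band at a given date equals the share of that band's colour in the corresponding pixel column.

library(png)

img <- readPNG("pc_memory_chart.png")        # hypothetical local copy of the chart

# Collapse each pixel's RGB values into a single colour id
col_ids <- apply(round(img[, , 1:3], 2), c(1, 2),
                 function(px) paste(px, collapse = "-"))

# Proportion of each colour within one pixel column (i.e., one point in time)
prop.table(table(col_ids[, 200]))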


Customer Segmentation via Databricks Solution Accelerator

Gavita Regunath discovers customer segments in a dataset:

We will be using the German Credit dataset, a publicly available dataset provided by Dr. Hans Hofmann of the University of Hamburg. The German Credit dataset contains features describing 1000 loan applicants who have taken credit from the bank. Using this dataset, our aim will be to answer the following question: “How should the bank personalise its products for its customers?”

Click through to see an example of clustering to generate customer segments.
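As a generic, single-machine sketch of the clustering step (the Solution Accelerator itself runs on Databricks; this assumes the copy of the German Credit data shipped with the caret package and its column names):

data(GermanCredit, package = "caret")

# Standardise a few numeric features, then cluster applicants into segments
feats <- scale(GermanCredit[, c("Duration", "Amount", "Age")])
set.seed(123)
segments <- kmeans(feats, centers = 4, nstart = 25)

table(segments$cluster)                      # applicants per segment
aggregate(GermanCredit[, c("Duration", "Amount", "Age")],
          by = list(segment = segments$cluster), FUN = mean)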


Understanding the Poisson Distribution

Achim Zeileis shows off my favorite statistical distribution:

The Poisson distribution has many distinctive features, e.g., both its expectation and variance are equal and given by the parameter λ. Thus, E(Y) = λ and Var(Y) = λ. Moreover, the Poisson distribution is related to other basic probability distributions. Namely, it can be obtained as the limit of the binomial distribution when the number of attempts is high and the success probability low. Or the Poisson distribution can be approximated by a normal distribution when λ is large. See Wikipedia (2002) for further properties and references.

Here, we leverage the distributions3 package (Hayes et al. 2022) to work with the Poisson distribution in R. In distributions3, Poisson distribution objects can be generated with the Poisson() function. Subsequently, methods for generic functions can be used to print the objects; extract the mean and variance; evaluate the density, cumulative distribution, or quantile function; or simulate random samples.

Read on for a detailed tutorial. H/T R-Bloggers.
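A minimal sketch of that workflow with distributions3 (the lambda value here is arbitrary):

library(distributions3)

Y <- Poisson(lambda = 1.5)   # a Poisson distribution object

mean(Y)              # expectation, equal to lambda
variance(Y)          # variance, also equal to lambda
pdf(Y, 0:5)          # P(Y = 0), ..., P(Y = 5)
cdf(Y, 3)            # P(Y <= 3)
quantile(Y, 0.95)    # smallest y with P(Y <= y) >= 0.95
set.seed(1)
random(Y, 10)        # ten simulated draws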
