Press "Enter" to skip to content

Category: Data Science

Random Number Generation in R

Adrian Tam rolls the dice:

Whether working on a machine learning project, a simulation, or other models, you need to generate random numbers in your code. R as a programming language, has several functions for random number generation. In this post, you will learn about them and see how they can be used in a larger program. Specifically, you will learn

  • How to generate Gaussian random numbers into a vector
  • How to generate uniform random numbers
  • How to manipulate random vectors and random matrices

And, of course, these are pseudo-random numbers because we’re still dealing with computers and random seeds, after all.

Comments closed

Visualizing Univariate Data Distributions in R

Steven Sanderson reviews the shape of the data:

Understanding the distribution of your data is a fundamental step in any data analysis process. It gives you insights into the spread, central tendency, and overall shape of your data. In this blog post, we’ll explore two popular functions in R for visualizing data distribution: density() and hist(). We’ll use the classic Iris dataset for our examples. Additionally, we will introduce the {TidyDensity} library and show how it can be used to create distribution plots.

Click through for three different functions for visualizing the density of a variable.

Comments closed

Adding Mean to Box Plots in R

Steven Sanderson tracks the sixth number of a five-number summary:

Data visualization is a powerful tool for understanding and interpreting data. In this blog post, we will explore how to create box plots with mean values using both base R and ggplot2. We will use the famous iris dataset as an example. So, grab your coding tools and let’s dive into the world of box plots!

Note that this is mean in addition to median in these visuals, not replacing the median.

Comments closed

Omitted Variables and Logistic Regression

John Mount misses a variable:

I would like to illustrate a way which omitted variables interfere in logistic regression inference (or coefficient estimation). These effects are different than what is seen in linear regression, and possibly different than some expectations or intuitions.

This is an interesting article and there’s a really good comment helping to explain this effect in epidemiology.

Comments closed

Trying Fabric Data Wrangler

Reza Rad looks at a new tool:

There is a tool (or you can consider it as an editor) in Fabric for data scientists. As a data scientist, you must work with the data, clean it, group it, aggerate it, and do other data preparation work. This might be needed to understand the data or be part of the process you do to prepare the data and load it into a table for further analysis. Data Wrangler is a tool that gives you such ability. You can use it to transform data and prepare and even generate Python code to make this process part of a bigger data analytics project.

Data Wrangler has a simple-to-use graphical user interface that makes the job of a data scientist easier.

Read on for a video as well as a demo in written format.

Comments closed

Model Diagnostics in Python

Christian Lorentzen has released a new package:

Version 1.0.0 of the new Python package for model-diagnostics was just released on PyPI. If you use (machine learning or statistical or other) models to predict a mean, median, quantile or expectile, this library offers tools to assess the calibration of your models and to compare and decompose predictive model performance scores.

This looks like a really useful package, so check it out.

Comments closed

Detecting AI-Generated Profile Photos

Shivansh Mundra, et al, report on some research:

With the rise of AI-generated synthetic media and text-to-image generated media, fake profiles have grown more sophisticated. And we’ve found that most members are generally unable to visually distinguish real from synthetically-generated faces, and future iterations of synthetic media are likely to contain fewer obvious artifacts, which might show up as slightly distorted facial features. To protect members from inauthentic interactions online, it is important that the forensic community develop reliable techniques to distinguish real from synthetic faces that can operate on large networks with hundreds of millions of daily users, like LinkedIn. 

There are some interesting findings here.

Comments closed

Using SHAP to Gauge Geographic Effects in R or Python

Michael Mayer runs an analysis:

This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

This post is heavily based on the new {shapviz} vignette.

I appreciate the effort to include both R and Python code in this analysis, and recommend you peruse both sets of code listings.

Comments closed

Poisson Hidden Markov Models in SAS

Ji Shen shows off how to perform discrete time series in SAS:

The HMM procedure in SAS Viya supports hidden Markov models (HMMs) and other models embedded with HMM. PROC HMM supports finite HMM, Poisson HMM, Gaussian HMM, Gaussian mixture HMM, the regime-switching regression model, and the regime-switching autoregression model. This post introduces Poisson HMM, the latest addition to PROC HMM in the SAS Viya 2023.03 release.

Count time series is ill-suited for most traditional time series analysis techniques, which assume that the time series values are continuously distributed. This can present unique challenges for organizations that need to model and forecast them. As a popular discrete probability distribution to handle the count time series, the Poisson distribution or the mixed Poisson distribution might not always be suitable. This is because both assume that the events occur independently of each other and at a constant rate. In time series data, however, the occurrence of an event at one point in time might be related to the occurrence of an event at another point in time, and the rates at which events occur might vary over time.

HMM is a valuable tool that can handle overdispersion and serial dependence in the data. This makes it an effective solution for modeling and forecasting count time series. We will explain how the Poisson HMM can handle count time series by modeling different states by using distinct Poisson distributions while considering the probability of transitioning between them.

Read on for an overview of Hidden Markov Models (in general and the Poisson variation in particular) and some of the challenges you can run into when performing this test.

Comments closed

Paper Review: Moving Fast with Broken Data

Adnan Masood reviews a paper:

I recently came across an insightful research paper titled “Moving Fast With Broken Data” by Shreya Shankar, Labib Fawaz, Karl Gyllstrom, and Aditya G. Parameswaran from UC Berkeley and Meta. The paper addresses the significant issue of data corruption in machine learning (ML) pipelines, which often leads to decreased model accuracy. The authors present an automatic data validation system implemented at Meta that aims to solve this problem.

Sounds like I have some beach reading.

Ed. Note: He’s kidding, right?

Ed. 2 Note: About going to the beach maybe.

Ed. & Ed. 2 Note: HAHAHAHAHAH.

Yeah, I hired Statler and Waldorf as my editors. Worst Best decision of my life.

Comments closed