Category: Data Science

Tidy Simulation of Stochastic Processes in R

David Robinson shows off my favorite distribution:

The Riddler puzzle describes a Poisson process, which is one of the most important stochastic processes. A Poisson process models the intuitive concept of “an event is equally likely to happen at any moment.” It’s named because the number of events occurring in a time interval of length x is distributed according to Pois(λx), for some rate parameter λ (for this puzzle, the rate is described as one per day, λ = 1).

How can we simulate a Poisson process? This is an important connection between distributions. The waiting time for the next event in a Poisson process has an exponential distribution, which can be simulated with rexp().
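As a minimal sketch of that connection in base R (the seed, the ten-day window, and the variable names are illustrative assumptions, not taken from the original post), accumulating exponential waiting times gives the event times of a Poisson process:

```r
set.seed(2020)   # arbitrary seed, only for reproducibility

rate    <- 1     # events per day, as in the puzzle
horizon <- 10    # hypothetical observation window, in days

# Draw far more waiting times than the window should need,
# then accumulate them into event times.
waits       <- rexp(100, rate = rate)
event_times <- cumsum(waits)

# Keep only the events that fall inside the window.
event_times <- event_times[event_times <= horizon]

# The count of events in the window behaves like one Poisson(rate * horizon) draw.
n_events <- length(event_times)
```

Repeating this many times and tabulating n_events should give counts that look like draws from rpois() with mean rate * horizon.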

Read on to learn about the Poisson distribution and Yule processes.

Distributed XGBoost in Cloudera

Harshal Patil walks us through the XGBoost algorithm and shows how we can use it in Cloudera Machine Learning:

DASK is an open-source parallel computing framework – written natively in Python – that integrates well with popular Python packages such as Numpy, Pandas, and Scikit-Learn. Dask was initially released around 2014 and has since built a significant following and support.

DASK uses Python natively, distinguishing it from Spark, which is written in Java, and has the overhead of running JVMs and context switching between Python and Java. It is also much harder to debug Spark errors vs. looking at a Python stack trace that comes from DASK.

We will run XGBoost on DASK to train in parallel on CML. The source code for this blog can be found here.

Click through for the process.

Generating Random Numbers with R

The folks at Data Sharkie walk us through random number generation in R:

Why is random number generation important and where is it used?

Random number generation has applications in various fields like statistical sampling, simulation, test designs, and so on. Generally, when a data scientist is in need of a set of random numbers, they will have in mind

The R programming language allows users to generate randomly distributed numbers with a set of built-in functions: runif(), rnorm(), rbinom().
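A short sketch of those three functions in base R follows; the sample sizes, parameters, and seed are arbitrary choices for illustration rather than values from the original post:

```r
set.seed(42)  # fix the seed so the draws are reproducible

u <- runif(5, min = 0, max = 1)        # uniform draws on [0, 1]
z <- rnorm(5, mean = 0, sd = 1)        # standard normal draws
b <- rbinom(5, size = 10, prob = 0.5)  # successes out of 10 fair coin flips

# Resetting the seed replays the same stream of random numbers.
set.seed(42)
u_again <- runif(5, min = 0, max = 1)
identical(u, u_again)
```

Setting the seed is what makes a simulation repeatable: without it, each run produces different values.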

Read on to generate random numbers across two separate distributions.

Handling Missing Data

Marina Wyss explains various techniques for handling missing data in data sets:

Missing or incomplete data can have a huge negative impact on any data science project. This is particularly relevant for companies in the early stages of developing solid data collection and management systems.

While the best solution for missing data is to avoid it in the first place by developing good data-collection and stewardship policies, often we have to make do with what’s available.

This blog covers the different kinds of missing data, and what we can do about missing data once we know what we’re dealing with. These strategies range from simple – for example, choosing models that handle missings automatically, or simply deleting problematic observations – to (probably superior) methods for estimating what those missing values may be, otherwise known as imputation.
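To make the two ends of that range concrete, here is a minimal base R sketch on a made-up data frame (the column names and values are hypothetical). It contrasts deleting incomplete observations with mean imputation, which is only the simplest form of imputation and not necessarily one the original post recommends:

```r
# Toy data frame with hypothetical columns; NA marks a missing value.
df <- data.frame(
  age    = c(23, NA, 31, 45, NA),
  income = c(50, 62, NA, 80, 55)
)

# Option 1: listwise deletion, keeping only fully observed rows.
complete_rows <- df[complete.cases(df), ]

# Option 2: mean imputation, replacing each NA with its column mean.
imputed <- df
for (col in names(imputed)) {
  missing <- is.na(imputed[[col]])
  imputed[[col]][missing] <- mean(imputed[[col]], na.rm = TRUE)
}
```

Deletion leaves two of the five rows here, while mean imputation keeps all five but shrinks the variance of each column, which is one reason model-based imputation is usually preferred.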

I like the distinction in form Marina draws, and we also get a good set of techniques for filling the gaps.

Visualizing a Single Variable in R

Michaelino Mervisiano takes us through the types of visuals we can create to understand a single variable in R:

How do we create a histogram in R, and what information can we get from it?
A histogram shows a frequency distribution. It is a great graph for showing the mode, the spread, and the symmetry (skewness) of your data. Here is a histogram of 1,000 random points drawn from a normal distribution with a mean of 2.5.
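A histogram like the one described can be sketched in base R as follows. The standard deviation, the seed, and the suggested number of bins are assumptions, since the excerpt only gives the sample size and the mean:

```r
set.seed(1)  # arbitrary seed for reproducibility

# 1,000 draws from a normal distribution with mean 2.5.
# sd = 1 is an assumption; the excerpt does not state it.
x <- rnorm(1000, mean = 2.5, sd = 1)

# Draw the histogram. breaks = 30 is only a suggestion to hist(),
# which may pick a nearby "pretty" number of bins.
h <- hist(x, breaks = 30, main = "Histogram of x", xlab = "x")

# The returned object holds the bin edges and counts behind the plot.
sum(h$counts)
```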

Of course I don’t like option number 4 and would replace it with something else (column/bar charts, Cleveland dot plots, or stacked column/bar depending on what you’re trying to observe). But this is a good way of thinking about how you can visualize a variable.

Tuning Random Forest HyperParameters with R

Julia Silge gives us an idea of how to tune random forest hyperparameters in R:

Our modeling goal here is to predict the legal status of the trees in San Francisco in the #TidyTuesday dataset. This isn’t this week’s dataset, but it’s one I have been wanting to return to. Because it seems almost wrong not to, we’ll be using a random forest model! 🌳

Let’s build a model to predict which trees are maintained by the San Francisco Department of Public Works and which are not. We can use parse_number() to get a rough estimate of the size of the plot from the plot_size column. Instead of trying any imputation, we will just keep observations with no NA values.
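The cleaning step described there might look roughly like this on a toy stand-in for the real data. The column names follow the excerpt, but the values are invented, and the sketch assumes the readr package is installed:

```r
library(readr)  # provides parse_number()

# Toy stand-in for the tree data; values are made up for illustration.
trees <- data.frame(
  legal_status = c("DPW Maintained", "Permitted Site", "DPW Maintained"),
  plot_size    = c("3x3", "Width 4ft", NA),
  stringsAsFactors = FALSE
)

# parse_number() keeps the first number it finds in each string,
# so "3x3" becomes 3: a rough size estimate, not an area.
trees$plot_size <- parse_number(trees$plot_size)

# Drop rows with any NA instead of imputing.
trees_complete <- na.omit(trees)
```

Dropping incomplete rows is the simplest choice and matches the excerpt; it trades away some data for not having to justify an imputation model.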

Click through to some data exploration, the initial model, and a process for using Grid Search with the caret package.

Installing Apache Airflow

Achilleus walks us through a process to install Apache Airflow on a machine:

Airflow is an amazing tool by Airbnb and is kind of a de facto standard for ETL deployments in the Data Engineering domain nowadays. But at the same time, you can also use Airflow to schedule ML pipelines and automate (almost) the whole ML pipeline.

This is my attempt to install and set up a fairly robust Apache Airflow deployment for my needs. I am pretty sure there are better ways of doing it, or enhancements that could be added. Any comments or suggestions are highly appreciated!

This is an easy-to-follow set of steps, so check it out.

Using the Tune Package in R for Hyperparameter Optimization

Abderrahim Lyoubi-Idrissi takes us through a Bayesian approach to tune hyperparameters:

In contrast to the model parameters, which are discovered by the learning algorithm of the ML model, the so-called hyperparameters (HP) are not learned during the modeling process, but specified prior to training.

Hyperparameter tuning is the task of finding the optimal hyperparameter(s) of a learning algorithm for a specific data set, with the ultimate goal of improving model performance.

Abderrahim contrasts two different methods here: Grid Search and Bayesian Optimization. Definitely an interesting read if you develop data science models.
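To show what grid search means mechanically, here is a minimal base R sketch that does not use the tune package. The simulated data, the hold-out split, and the choice of polynomial degree as the hyperparameter are all illustrative assumptions:

```r
set.seed(123)  # arbitrary seed for reproducibility

# Simulated data: a cubic signal plus noise (purely illustrative).
n   <- 200
x   <- runif(n, -2, 2)
y   <- x^3 - x + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Hold out a quarter of the rows for validation.
val_idx <- sample(n, size = n / 4)
train   <- dat[-val_idx, ]
valid   <- dat[val_idx, ]

# The hyperparameter is the polynomial degree; the grid lists every candidate.
grid <- 1:6

# Fit one model per grid point and score it on the validation set.
rmse <- sapply(grid, function(d) {
  fit  <- lm(y ~ poly(x, d), data = train)
  pred <- predict(fit, newdata = valid)
  sqrt(mean((valid$y - pred)^2))
})

best_degree <- grid[which.min(rmse)]
```

Grid search evaluates every candidate regardless of earlier results. Bayesian optimization instead uses the scores seen so far to decide which candidate to try next, which matters when each fit is expensive.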

Correlation in easystats

The easystats team announces a new R package:

The easystats project continues to grow with its more recent addition, a package devoted to correlations. Check out its webpage here!

It’s lightweight, easy to use, and allows for the computation of many different kinds of correlations, such as partial correlations, Bayesian correlations, multilevel correlations, polychoric correlations, biweight, percentage bend or Shepherd’s Pi correlations (types of robust correlation), distance correlation (a type of non-linear correlation) and more, also allowing for combinations between them (for instance, Bayesian partial multilevel correlation).
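For readers unfamiliar with the distinction, the sketch below contrasts an ordinary correlation with a partial correlation. It uses base R and the built-in mtcars data as a stand-in, not the easystats package itself, and the choice of variables is arbitrary:

```r
# Built-in mtcars data: fuel economy (mpg), horsepower (hp), weight (wt).
d <- mtcars[, c("mpg", "hp", "wt")]

# Ordinary Pearson and rank-based Spearman correlation between mpg and hp.
r_pearson  <- cor(d$mpg, d$hp, method = "pearson")
r_spearman <- cor(d$mpg, d$hp, method = "spearman")

# Partial correlation of mpg and hp controlling for wt:
# correlate what is left of each variable after regressing out wt.
res_mpg   <- resid(lm(mpg ~ wt, data = d))
res_hp    <- resid(lm(hp ~ wt, data = d))
r_partial <- cor(res_mpg, res_hp)
```

The partial correlation answers a narrower question than the plain one: how mpg and hp move together among cars of the same weight.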

I’d recommend reading the examples on the GitHub repo due to formatting. Looks quite interesting. H/T R-Bloggers.
