Press "Enter" to skip to content


Handling Missing Data

Marina Wyss explains various techniques for handling missing data in data sets:

Missing or incomplete data can have a huge negative impact on any data science project. This is particularly relevant for companies in the early stages of developing solid data collection and management systems.

While the best solution for missing data is to avoid it in the first place by developing good data-collection and stewardship policies, often we have to make do with what’s available.

This post covers the different kinds of missing data and what we can do about them once we know what we’re dealing with. These strategies range from simple ones, such as choosing models that handle missing values automatically or simply deleting problematic observations, to (probably superior) methods for estimating what those missing values might be, otherwise known as imputation.
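To make the deletion-versus-imputation contrast concrete, here is a minimal R sketch on made-up data; it shows the two simplest strategies from the post, not Marina’s code:

```r
# Toy data frame with missing values (illustrative only)
df <- data.frame(x = c(1, 2, NA, 4), y = c(10, NA, 30, 40))

# Strategy 1: delete problematic observations (listwise deletion)
complete_rows <- na.omit(df)

# Strategy 2: simple mean imputation, replacing each NA with the column mean
imputed <- df
imputed$x[is.na(imputed$x)] <- mean(df$x, na.rm = TRUE)
imputed$y[is.na(imputed$y)] <- mean(df$y, na.rm = TRUE)
```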

I like the distinction Marina draws between the forms of missing data, and we also get a good set of techniques for filling the gaps.


Visualizing a Single Variable in R

Michaelino Mervisiano takes us through the types of visuals we can create to understand a single variable in R:

How do you create a histogram in R, and what information can we get from one?
A histogram shows a frequency distribution. It is a great graph for showing the mode, the spread, and the symmetry (skewness) of your data. Here is a histogram of 1,000 random points drawn from a normal distribution with a mean of 2.5.
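Reproducing a histogram like the one described takes a single call to base R’s hist(); the seed and bin count below are arbitrary choices of mine:

```r
set.seed(42)                  # arbitrary seed, for reproducibility
x <- rnorm(1000, mean = 2.5)  # 1,000 draws from a normal with mean 2.5
hist(x,
     breaks = 30,
     main = "1,000 random normal points (mean = 2.5)",
     xlab = "Value")
```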

Of course I don’t like option number 4 and would replace it with something else (column/bar charts, Cleveland dot plots, or stacked column/bar charts, depending on what you’re trying to observe). But this is a good way of thinking about how you can visualize a variable.


Tuning Random Forest Hyperparameters with R

Julia Silge gives us an idea of how to tune random forest hyperparameters in R:

Our modeling goal here is to predict the legal status of the trees in San Francisco in the #TidyTuesday dataset. This isn’t this week’s dataset, but it’s one I have been wanting to return to. Because it seems almost wrong not to, we’ll be using a random forest model! 🌳

Let’s build a model to predict which trees are maintained by the San Francisco Department of Public Works and which are not. We can use parse_number() to get a rough estimate of the size of the plot from the plot_size column. Instead of trying any imputation, we will just keep observations with no NA values.
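For flavor, here is a hedged sketch of the kind of tidymodels specification this post builds, with tune() placeholders marking the hyperparameters to be searched (illustrative, not Julia’s exact code):

```r
library(tidymodels)

# Random forest with two tunable hyperparameters, using the ranger engine
rf_spec <- rand_forest(
  mtry  = tune(),   # number of predictors sampled at each split
  min_n = tune(),   # minimum number of observations in a node
  trees = 1000
) %>%
  set_mode("classification") %>%
  set_engine("ranger")
```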

Click through for some data exploration, the initial model, and a process for using grid search with the tune package.


Installing Apache Airflow

Achilleus walks us through a process to install Apache Airflow on a machine:

Airflow is an amazing tool by Airbnb and is something of a de facto standard for ETL deployments in the data engineering domain nowadays. At the same time, you can also use Airflow to schedule ML pipelines and automate the whole pipeline (almost).

This is my attempt to install and set up a fairly robust Apache Airflow deployment for my needs. I am pretty sure there are better ways of doing it, or enhancements that could be made. Any comments or suggestions are highly appreciated!

This is an easy-to-follow set of steps, so check it out.


Using the Tune Package in R for Hyperparameter Optimization

Abderrahim Lyoubi-Idrissi takes us through a Bayesian approach to tune hyperparameters:

In contrast to model parameters, which are discovered by the learning algorithm of the ML model, so-called hyperparameters (HPs) are not learned during the modeling process, but specified prior to training.

Hyperparameter tuning is the task of finding optimal hyperparameters for a learning algorithm on a specific data set, with the end goal of improving model performance.
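As a self-contained toy example of what Bayesian search looks like with the tune package (this uses mtcars and arbitrary settings; it is a sketch of the API, not the author’s code):

```r
library(tidymodels)   # attaches tune, rsample, yardstick, parsnip, etc.

# A small regression workflow with one tunable hyperparameter
rf_spec <- rand_forest(min_n = tune(), trees = 200) %>%
  set_mode("regression") %>%
  set_engine("ranger")

wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(mpg ~ .)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

# Bayesian optimization over min_n, guided by cross-validated RMSE
search_res <- tune_bayes(
  wf,
  resamples = folds,
  iter      = 10,
  metrics   = metric_set(rmse)
)
```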

Abderrahim contrasts two different methods here: Grid Search and Bayesian Optimization. Definitely an interesting read if you develop data science models.


Correlation in easystats

The easystats team announces a new R package:

The easystats project continues to grow with its most recent addition, a package devoted to correlations. Check out its webpage here!

It’s lightweight, easy to use, and allows for the computation of many different kinds of correlations, such as partial correlations, Bayesian correlations, multilevel correlations, polychoric correlations, biweight, percentage bend, or Shepherd’s Pi correlations (types of robust correlation), distance correlation (a type of non-linear correlation), and more, also allowing for combinations between them (for instance, Bayesian partial multilevel correlation).
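Here is a quick taste of the API, using iris as a stand-in data set; the arguments below come from the package’s documentation:

```r
library(correlation)

correlation(iris)                       # all pairwise correlations, with tests
correlation(iris, partial = TRUE)       # partial correlations
correlation(iris, bayesian = TRUE)      # Bayesian correlations
correlation(iris, method = "biweight")  # one of the robust correlation types
```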

I’d recommend reading the examples on the GitHub repo, as the formatting is better there. This looks quite interesting. H/T R-Bloggers.


Evaluating Regression Models in Azure ML

Dan Fitton continues a series on model evaluation with Azure Machine Learning:

The initial go-to metric for understanding a regression model is the R squared (or R2) value, also known as the coefficient of determination. R squared measures how well the model is fitted to the data – the goodness of fit. It indicates how much of the variation of y (the target) is explained by the variation in x (the features).
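The definition is easy to verify by hand. Here is a small R sketch computing R squared from scratch on made-up predictions:

```r
actual    <- c(3, 5, 7, 9)
predicted <- c(2.8, 5.3, 6.6, 9.4)

ss_res <- sum((actual - predicted)^2)     # residual sum of squares
ss_tot <- sum((actual - mean(actual))^2)  # total sum of squares

r_squared <- 1 - ss_res / ss_tot          # proportion of variance explained
r_squared
```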

The measures are bog standard if you’ve worked with regressions before, and Dan does a good job explaining them.


Python Cross-Validation

John Mount has some advice if you’re doing cross-validation in Python:

Here is a quick, simple, and important tip for doing machine learning, data science, or statistics in Python: don’t use the default cross-validation settings. The defaults can produce a deterministic, even ordered, split, which is not in general what one wants or expects from a statistical point of view. From a software engineering point of view the defaults may be sensible: since they don’t touch the pseudo-random number generator, they are repeatable, deterministic, and side-effect free.

This issue falls under “read the manual”, but it is always frustrating when the defaults are not sufficiently generous.

Click through to see the problem and how you can fix it.


The Hype Cycle for Artificial Intelligence

William Vorhies takes a look at Gartner’s hype cycle for AI (among other things):

Supposing you’re a business leader and supposing you’re trying to make an intelligent decision about prioritizing your AI adoption plans. It’s likely that, like many of us, the first thing you’d reach for would be one of Gartner’s many hype cycle or magic quadrant analyses.

What you might not know is that you now need an expert just to guide you through the expert literature. There has been such a proliferation of hype cycles and magic quadrants that you could easily be looking in the wrong place.

The hype cycle is definitely opinion-based, but I think it’s a useful look at the relative maturity of different segments of an industry or technology cluster. Do read the whole thing, though, as these things aren’t perfect.


Converting Odds to Probabilities with R

Jonas Christoffer Lindstrom has a new package:

Now you might think that converting decimal odds to probabilities should be easy: you can just use the definition above and take the inverse of the odds to recover the probability. But it is not that simple, since in practice this simple formula will give you improper probabilities. They will not sum to 1, as they should, but will be slightly larger. This gives the bookmakers an edge, and the probabilities (which aren’t real probabilities) cannot be considered fair, so different methods for correcting this exist.
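The problem and its most basic correction, proportional normalization, are easy to see in a few lines of R (the odds here are made up, and this naive scaling is just the simplest of the correction methods the post discusses):

```r
odds <- c(2.10, 3.40, 3.60)  # decimal odds for a three-way market (made-up numbers)
raw  <- 1 / odds             # naive inverse-odds "probabilities"
sum(raw)                     # slightly greater than 1: the bookmaker's margin

probs <- raw / sum(raw)      # proportional scaling so the values sum to 1
sum(probs)                   # exactly 1
```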

Read on to learn more about the problem and a few solutions. H/T R-Bloggers.
