Press "Enter" to skip to content

Category: Data Science

Inferring Median from a Few Values

Holger von Jouanne-Diedrich is stuck in the middle with you:

Let us dive directly into the matter; the Small Data Rule states:

In a sample of five numerical values from any unknown population, the median of this population lies between the smallest and the largest sample value with 94 percent certainty.

The “population” can be anything, like data about age in a population, income in a country, television consumption, donation amounts, body sizes, temperatures and so on.

This is a very interesting concept. Five values won’t give you the median, but they will bound it with high likelihood. And check out the comments: adding a few more data points increases that likelihood even further.
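
Where does the 94 percent come from? The population median can fall outside the sample range only if all five values land on the same side of it, which happens with probability 2 * (1/2)^5 = 1/16, or about 6.25 percent, leaving roughly 93.75 percent coverage. A minimal simulation sketch in R, assuming a standard normal population, bears this out:

```r
# The median escapes the sample range only when all five draws land on the
# same side of it: 2 * (1/2)^5 = 1/16, so coverage is 1 - 1/16 = 93.75%.
set.seed(42)
pop_median <- 0  # the standard normal's median
covered <- replicate(100000, {
  s <- rnorm(5)
  min(s) <= pop_median && pop_median <= max(s)
})
mean(covered)  # ~0.9375
```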


Two Ways to Access Kafka Topics from R

Patrick Neff shows us a couple of ways to build a Kafka-to-R pipeline:

In Data Science projects, we distinguish between descriptive analytics and statistical models running in production. Overall, these can be seen as one process. You start with analyzing historical data to gain insights, find correlations, and finally develop and optimize your model. Then you transfer it and use it in your running system. A key point for every data scientist is not just the mathematical skills themselves, but also how to get the data into your analytics program.

In this blog post, we focus exactly on this crucial step: retrieving the data. In a second article, we’ll talk about running your model on real-time data.

Click through for the techniques.
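
Not necessarily either of the article's two techniques, but as one hedged illustration of the plumbing involved: if your cluster exposes Confluent's REST Proxy, you can drive a consumer from R with {httr} and {jsonlite}. The host, consumer group, and topic names below are hypothetical:

```r
library(httr)
library(jsonlite)

base_url <- "http://localhost:8082"  # hypothetical REST Proxy endpoint

# Step 1: create a consumer instance inside a consumer group.
res <- POST(
  paste0(base_url, "/consumers/r_group"),
  content_type("application/vnd.kafka.v2+json"),
  body = toJSON(list(name = "r_consumer",
                     format = "json",
                     "auto.offset.reset" = "earliest"),
                auto_unbox = TRUE)
)
consumer_uri <- fromJSON(content(res, as = "text"))$base_uri

# Step 2: subscribe the instance to a topic.
POST(paste0(consumer_uri, "/subscription"),
     content_type("application/vnd.kafka.v2+json"),
     body = toJSON(list(topics = list("my_topic")), auto_unbox = TRUE))

# Step 3: poll for records and parse them for analysis.
res <- GET(paste0(consumer_uri, "/records"),
           add_headers(Accept = "application/vnd.kafka.json.v2+json"))
records <- fromJSON(content(res, as = "text"))
```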


The Benefits of Cluster Sampling

Muhammad Touhidul Islam explains what cluster sampling is and why it can be useful:

Cluster sampling is defined as a sampling method where multiple clusters of people are created from a population, where each cluster exhibits homogeneous characteristics and each has an equal chance of being part of the sample. In this sampling method, a simple random sample is created from the different clusters in the population. This is a probability sampling procedure.

Click through for a few examples of where this can be useful.
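
To make the two stages concrete, here is a minimal sketch of one-stage cluster sampling on made-up data: take a simple random sample of clusters, then keep every unit inside the chosen clusters:

```r
# 50 clusters of 20 units each; 'value' stands in for whatever we measure.
set.seed(1)
population <- data.frame(
  id      = 1:1000,
  cluster = rep(1:50, each = 20),
  value   = rnorm(1000)
)

chosen <- sample(unique(population$cluster), size = 10)  # SRS of clusters
cluster_sample <- subset(population, cluster %in% chosen)
nrow(cluster_sample)  # 200 units, all drawn from just 10 clusters
```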


Time Series Estimation with Facebook’s Prophet

Dan Lantos looks at the Prophet library:

This article (part of a short series) aims to introduce the Prophet library, discuss it at a high level and run through a basic example of forecasting the FTSE 100 index. Future articles will discuss exactly how Prophet achieves its results, how to interpret the output and how to improve the model.
Please see this article (by my talented colleague Gavita) for an introduction to time-series forecasting algorithms.

Click through for part one in an ongoing series.
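
For a feel of the API before diving in, Prophet's R interface needs nothing more than a data frame with ds and y columns. A minimal sketch on synthetic data (the article itself forecasts the FTSE 100):

```r
library(prophet)

# A synthetic daily series standing in for a real one like the FTSE 100.
df <- data.frame(
  ds = seq(as.Date("2020-01-01"), by = "day", length.out = 365),
  y  = 100 + cumsum(rnorm(365))
)

m      <- prophet(df)                             # fit the model
future <- make_future_dataframe(m, periods = 30)  # extend 30 days ahead
fcst   <- predict(m, future)
plot(m, fcst)                                     # history plus forecast band
```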


Reasons to Use Tidymodels

Roel Hogervorst explains when we may or may not want to use tidymodels versus rolling our own models in R:

When not

– If you are always using GLM models (they are very flexible!), it makes no sense to me to go for the extra {parsnip} layer when you are always using the same models. You could still consider using {recipes} to feature engineer.

– If you are familiar with the kind of data and with what models will work on that data. Basically, you are an expert in this field and have worked on it for many years. There is no need to experiment.

Read on for concrete examples of when it does make sense. H/T R-Bloggers.
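
To see the trade-off Roel describes, here is a minimal sketch contrasting a plain glm() call with the same logistic regression expressed through {parsnip}, using built-in data:

```r
library(parsnip)

cars <- transform(mtcars, am = factor(am))  # parsnip wants a factor outcome

# Base R: direct, and sufficient if GLMs are all you ever fit.
fit_base <- glm(am ~ mpg + wt, data = cars, family = binomial)

# The same model through {parsnip}: one extra layer, but swapping in a
# different model type later means changing only logistic_reg().
fit_tidy <- logistic_reg() |>
  set_engine("glm") |>
  fit(am ~ mpg + wt, data = cars)
```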


A Summary of Time Series Algorithms

Gavita Regunath and Dan Lantos give an overview of time series algorithms:

Time series forecasting is a data science task that is critical to a variety of activities within any business organisation. Time series forecasting is a useful tool that can help to understand how historical data influences the future. This is done by looking at past data, defining the patterns, and producing short or long-term predictions.

Click through for an overview, as well as ten examples of algorithms you can use for handling time series data.
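
As a taste of that family, here is a minimal sketch (mine, not the authors') fitting one classic algorithm, ARIMA, with the {forecast} package on built-in data:

```r
library(forecast)

# Fit an ARIMA model to the monthly AirPassengers series and forecast
# two years ahead with prediction intervals.
fit <- auto.arima(AirPassengers)  # picks the (p, d, q) orders automatically
fc  <- forecast(fit, h = 24)
plot(fc)
```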


Decile Analysis and Logistic Regression

Ridhima Kumar (re-)introduces us to decile analysis:

Decile analysis was once a popularly used technique; however, the convention of teaching and bucketing machine learning problems into either “classification” or “regression” types has led people to forget decile-style analyses. I am pretty sure most freshly minted data scientists would not even have heard of decile analysis. So, coming back to what decile analysis is.

Decile analysis is used to order a dataset from highest to lowest values, or vice versa, based on predicted probabilities.

As the name suggests, the analysis involves dividing the dataset into ten equal groups. Each group should have the same number of observations/customers.

It ranks customers from most likely to respond to least likely to respond.

Read on to learn the steps and how this ties in with the fact that logistic regression is regression.
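
The mechanics are simple enough to sketch on hypothetical data: score the observations with a logistic regression, then cut the predicted probabilities into ten equal buckets:

```r
set.seed(7)
prospects <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
prospects$responded <- rbinom(500, 1,
                              plogis(0.8 * prospects$x1 - 0.5 * prospects$x2))

# Score every customer with a logistic regression.
model <- glm(responded ~ x1 + x2, data = prospects, family = binomial)
prospects$prob <- predict(model, type = "response")

# Ten equal groups by predicted probability; decile 1 = most likely to respond.
prospects$decile <- dplyr::ntile(-prospects$prob, 10)

# The observed response rate should fall as the decile number rises.
aggregate(responded ~ decile, data = prospects, FUN = mean)
```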


Building QQ plots in R

The folks at finnstats explain the notion of a Quantile-Quantile plot and show how to create one in R:

To build Q-Q plots in R, we first need to understand the Q-Q plot itself. The Q-Q plot is a graphical tool that helps us examine whether a set of data plausibly came from some theoretical distribution, such as a Normal.

Suppose we are executing a statistical analysis and the test falls under parametric methods, which assume the variable is Normally distributed; we can make use of a Q-Q plot to check that assumption.

It’s just a visual check, not proof, so we can make use of some other statistical test as well. But a Q-Q plot lets us see at a glance whether our assumption is valid.

Click through to learn more. H/T R-bloggers.
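
In base R, the whole visual check is two function calls. A minimal sketch on simulated data:

```r
set.seed(3)
x <- rnorm(200, mean = 10, sd = 2)  # a sample that really is Normal

qqnorm(x)  # sample quantiles against theoretical Normal quantiles
qqline(x)  # points hugging this line support the normality assumption
```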


Plotting Correlation Analyses in R

Finnstats shows a few techniques for plotting correlation in R:

In correlation analysis, correlation is a measure of the strength of the relationship between two variables.

Pearson’s Product-Moment Correlation

One of the most common measures of correlation is Pearson’s product-moment correlation, which is commonly referred to simply as the correlation, or just the letter r.

Correlation shows the strength of a relationship between two variables and is expressed numerically by the correlation coefficient.

Click through for examples from several packages. H/T R-Bloggers.
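
As a quick taste, Pearson's r is one function call in base R, and a whole correlation matrix can be plotted with, for example, {corrplot} (whether that package is among the ones finnstats covers is an assumption on my part):

```r
# Pearson's product-moment correlation between two mtcars variables.
cor(mtcars$mpg, mtcars$wt, method = "pearson")  # roughly -0.87

# One way to visualize the full correlation matrix.
library(corrplot)
corrplot(cor(mtcars), method = "circle")
```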


Fitting Excel Macros into Data Science Pipelines

Bryan Shalloway has a process for us:

While I no longer use it regularly for the purposes of analysis, I will always have a soft spot in my heart for Excel. Furthermore, using a “correct” set of data science tools often requires a bridge. Integrating a rigorous component into a messy spreadsheet-based pipeline can be an initial step toward the pipeline or team or organization starting on a path of continuous improvement in its processes. Also, spreadsheets are foundational to many (probably most) BizOps teams and therefore are sometimes unavoidable…

In this post I will walk through a short example and some considerations for when you might decide (perhaps against your preferences) to integrate your work with extant spreadsheets or shadow “pipelines” within your organization.

Click through for Bryan’s thoughts on the topic.
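
As a hedged sketch of such a bridge (the file and sheet names are hypothetical, and Bryan's own workflow may differ), R can sit in the middle of a spreadsheet pipeline by reading with {readxl}, doing the rigorous work, and writing results back with {openxlsx}:

```r
library(readxl)    # read the extant spreadsheet as-is
library(openxlsx)  # write results back for the spreadsheet-based consumers

raw <- read_excel("monthly_report.xlsx", sheet = "Data")  # hypothetical file

# The rigorous component lives in R; dropping incomplete rows stands in
# for whatever validation or modeling the pipeline actually needs.
cleaned <- raw[complete.cases(raw), ]

write.xlsx(cleaned, "monthly_report_clean.xlsx")
```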
