
Category: Data Science

The Central Limit Theorem

Mala Mahadevan explains the Central Limit Theorem with an example:

The central limit theorem states that the sampling distribution of the mean of any independent, random variable will be normal or nearly normal, if the sample size is large enough. How large is “large enough”? The answer depends on two factors.

  • Requirements for accuracy. The more closely the sampling distribution needs to resemble a normal distribution, the more sample points will be required.
  • The shape of the underlying population. The more closely the original population resembles a normal distribution, the fewer sample points will be required. (from stattrek.com).

The main use of the sampling distribution is to verify the accuracy of many statistics and the populations they were based upon.

Read on for an example and to see how to calculate this in T-SQL.
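To get a feel for the theorem before clicking through, here is a minimal simulation sketch in Python (not the T-SQL from the linked post). The exponential population, the sample sizes, and the helper names `sample_means` and `skewness` are my own assumptions, chosen only because an exponential distribution is clearly not normal.

```python
import random
import statistics

def sample_means(sample_size, num_samples, seed=0):
    """Draw repeated samples from a skewed population and return their means."""
    rng = random.Random(seed)
    # Exponential population with mean 1.0: strongly right-skewed, not normal.
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

def skewness(values):
    """Population skewness; close to 0 for a symmetric, normal-looking shape."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

# As the sample size grows, the means cluster more tightly and lose their skew.
for n in (1, 5, 50):
    means = sample_means(sample_size=n, num_samples=5000)
    print(f"n={n:>2}  mean={statistics.fmean(means):.3f}  "
          f"stdev={statistics.stdev(means):.3f}  skew={skewness(means):.3f}")
```

The spread of the sample means should shrink roughly with the square root of the sample size, and the skew should drift toward zero, which is the "nearly normal" behavior the quote describes.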

Comments closed

Linear Support Vector Machines

Ananda Das explains how linear Support Vector Machines work in classifying spam messages:

Linear SVM assumes that the two classes are linearly separable, that is, a hyperplane can separate the two classes and the data points from the two classes do not get mixed up. Of course, this is not an ideal assumption, and we will discuss later how linear SVM handles the case of non-linear separability. But for a reader with some experience, here I pose a question: Linear SVM creates a discriminant function, but so does LDA. Yet the two are different classifiers. Why? (Hint: LDA is based on Bayes' Theorem while Linear SVM is based on the concept of margin. In the case of LDA, one has to make an assumption about the distribution of the data per class. For a newbie, please ignore the question. We will discuss this point in detail in some other post.)

This is a pretty math-heavy post, so get your coffee first. h/t R-Bloggers.
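If you want something to poke at while reading, here is a rough sketch of the margin idea in plain Python. It is not the derivation from the linked post: it swaps in simple subgradient descent on the regularized hinge loss, and the six toy points, the hyperparameters, and the function names `train_linear_svm` and `predict` are all hypothetical.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.01, epochs=1000):
    """Fit a hyperplane w.x + b = 0 by subgradient descent on the hinge loss."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w, which favors a wider margin.
            w = [wi - lr * lam * wi for wi in w]
            # Only points inside the margin (or misclassified) contribute
            # a hinge-loss subgradient.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Classify x by which side of the hyperplane it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy data: +1 and -1 stand in for two classes.
points = [(2, 2), (3, 3), (3, 2), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
print("weights:", w, "bias:", b)
print("prediction for (4, 4):", predict(w, b, (4, 4)))
```

Note that nothing here assumes a distribution for either class, which is the contrast with LDA that the hint in the quote points at.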


Introduction To Probability

Mala Mahadevan covers some basics of probability:

Probability is an important statistical and mathematical concept to understand. In simple terms, probability refers to the chance of a particular outcome of an event occurring within the domain of multiple possible outcomes. Probability is indicated by a number between 0 and 1, with 0 meaning that the outcome has no chance of occurring and 1 meaning that the outcome is certain to happen. So it is mathematically represented as P(event) = (# of outcomes in event / total # of outcomes). In addition to understanding this simple thing, we will also look at a basic example of conditional probability and independent events.

It’s a good intro to a critical topic in statistics. If I were to add one thing to this, it would be to state that probability is always conditional upon something. It’s fair to write something as P(Event) understanding that it’s a shortcut, but in reality, it’s always P(Event | Conditions), where Conditions is the set of assumptions we made in collecting this sample.
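Here is a small counting sketch of both ideas, assuming two fair six-sided dice as the sample space. The dice, the events, and the helper name `probability` are my own illustration rather than the example in the linked post.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def probability(event, given=lambda o: True):
    """P(event | given) = matching outcomes / outcomes satisfying the condition."""
    conditioned = [o for o in outcomes if given(o)]
    favorable = [o for o in conditioned if event(o)]
    return Fraction(len(favorable), len(conditioned))

def total_is_8(o):
    return o[0] + o[1] == 8

def first_is_6(o):
    return o[0] == 6

def second_is_even(o):
    return o[1] % 2 == 0

def both(o):
    return first_is_6(o) and second_is_even(o)

print("P(total is 8)              =", probability(total_is_8))
print("P(total is 8 | first is 6) =", probability(total_is_8, given=first_is_6))

# Independent events: the joint probability equals the product of the parts.
print("independent:",
      probability(both) == probability(first_is_6) * probability(second_is_even))
```

Even the "unconditional" call quietly conditions on something, namely that both dice are fair and every outcome is equally likely.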


Time Series Errors

Alex Smolyanskaya explains some common errors when doing time series analysis:

Non-zero model error indicates that our model is missing explanatory features. In practice, we don’t expect to get rid of all model error—there will be some error in the forecast from unavoidable natural variation. Natural variation should reflect all the stuff we will probably never capture with our model, like measurement error, unpredictable external market forces, and so on. The distribution of error should be close to normal and, ideally, have a small mean. We get evidence that an important explanatory variable is missing from the model when we find that the model error doesn’t look like simple natural variation—if the distribution of errors skews one way or another, there are more outliers than expected, or if the mean is unpleasantly large. When this happens we should try to identify and correct any missing or incorrect model features.

It’s an interesting article, especially the bit about cross-validation, which is a perfectly acceptable technique in non-time series models.
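The checks described in the quote (mean, skew, and outliers of the forecast error) can be sketched in a few lines of Python. The numbers below are made up, the function name `residual_summary` is hypothetical, and the three-standard-deviation outlier cutoff is an assumption of mine, not something from the article.

```python
import statistics

def residual_summary(actual, forecast, outlier_sds=3.0):
    """Summarize forecast errors: mean, spread, skew, and outlier count."""
    residuals = [a - f for a, f in zip(actual, forecast)]
    mean = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)
    if sd > 0:
        skew = statistics.fmean([((r - mean) / sd) ** 3 for r in residuals])
    else:
        skew = 0.0  # all errors identical: no shape to measure
    outliers = sum(1 for r in residuals if abs(r - mean) > outlier_sds * sd)
    return {"mean": mean, "stdev": sd, "skew": skew, "outliers": outliers}

# Hypothetical daily values: this forecast runs consistently low.
actual = [10, 12, 11, 13, 12]
forecast = [9, 10, 10, 11, 10]
print(residual_summary(actual, forecast))
```

A mean error that sits well away from zero, as in this toy data, is the kind of signal that suggests a missing explanatory feature rather than natural variation.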


Building A Python Project Template

Henk Griffioen shows how to create a standardized project in Python, focusing on data science scenarios:

Project structures often organically grow to suit people’s needs, leading to different project structures within a team. You can consider yourself lucky if at some point in time you find, or someone in your team finds, an obscure blog post with a somewhat sane structure and enforces it in your team.

Many years ago I stumbled upon ProjectTemplate for R. Since then I’ve tried to get people to use a good project structure. More recently DrivenData (what’s in a name?) released their more generic Cookiecutter Data Science.

The main philosophies of those projects are:

  • A consistent and well-organized structure allows people to collaborate more easily.

  • Your analyses should be reproducible and your structure should enable that.

  • A project starts from raw data that should never be edited; consider raw data immutable and only edit derived sources.

This is a set of prescriptions and focuses on the phase before the project actually kicks off.


Frequency Tables

Mala Mahadevan shows how to generate a frequency table in T-SQL and in R:

My results are as below. I have 1000 records in the table. This tells me that I have 82 occurrences of age cohort 0-5, and 8.2% of my dataset is from this bracket; the cumulative frequency is again 82, since this is the first record, and the cumulative percent is 8.2. For the next bracket, 06-12, I have 175 occurrences, or 17.5%; cumulatively, there are 257 occurrences of age 12 and below, which is 25.7% of my data. And so on.

Click through for the T-SQL and R scripts.
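For readers who think in Python rather than T-SQL or R, the same four columns can be sketched as below. The counts 82 and 175 and the total of 1000 come from the quote; the "other" bucket that lumps the remaining 743 records together is my own placeholder, as is the function name `frequency_table`.

```python
from collections import Counter
from itertools import accumulate

def frequency_table(labels, order):
    """One row per bracket: frequency, percent, and their running totals."""
    counts = Counter(labels)
    total = len(labels)
    frequencies = [counts[bracket] for bracket in order]
    running = list(accumulate(frequencies))
    return [
        {
            "bracket": bracket,
            "frequency": freq,
            "percent": 100 * freq / total,
            "cumulative_frequency": cum,
            "cumulative_percent": 100 * cum / total,
        }
        for bracket, freq, cum in zip(order, frequencies, running)
    ]

# 82 and 175 are the counts quoted above; "other" is a placeholder bucket
# for every remaining bracket, which the excerpt does not list.
labels = ["0-5"] * 82 + ["06-12"] * 175 + ["other"] * 743
table = frequency_table(labels, order=["0-5", "06-12", "other"])
for row in table:
    print(row)
```

The bracket order is passed in explicitly because cumulative columns only make sense when the brackets are sorted in a meaningful way.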


Prophet

Rodrigo Agundez tries out Prophet, Facebook’s new forecasting API, on store sales data:

The data is of a current client, therefore I won’t be disclosing any details of it.

Our models make forecasts for different shops of this company. In particular I took 2 shops, one which contains the easiest transactions to predict from all shops, and another with a somewhat more complicated history.

The data consists of real transactions since 2014. Data is daily with the target being the number of transactions executed during a day. There are missing dates in the data when the shop closed, for example New Year’s day and Christmas.

The holidays provided to the API are the same ones I use in our model. They range from school vacations or other long periods to single holidays like Christmas Eve. In total, the data contains 46 different holidays.

It looks like Prophet has some limitations but can already make some nice predictions.


Removing Time Series Auto-Correlation

Vincent Granville shows a simple technique for removing auto-correlation from time series data:

A deeper investigation consists in isolating the auto-correlations to see whether the remaining values, once decorrelated, behave like white noise, or not. If departure from white noise is found, then it means that the time series in question exhibits unusual patterns not explained by trends, seasonality or auto-correlations. This can be useful knowledge in some contexts such as high frequency trading, random number generation, cryptography or cyber-security. The analysis of decorrelated residuals can also help identify change points and instances of slope changes in time series.

Dealing with serial correlation is a big issue in econometrics; if you don’t deal with it in an Ordinary Least Squares regression, your regression will appear to have more explanatory power than it really does.
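As a rough illustration of what "decorrelating" can look like, here is a Python sketch that removes lag-1 auto-correlation only. It is not necessarily the technique from the linked post: it assumes a simple AR(1) structure, the simulated series with coefficient 0.8 is made up, and the function names are hypothetical.

```python
import random
import statistics

def lag1_autocorrelation(series):
    """Correlation between each value and the one immediately before it."""
    mean = statistics.fmean(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

def decorrelate(series):
    """Subtract the part of each value predicted by its predecessor (AR(1))."""
    rho = lag1_autocorrelation(series)
    mean = statistics.fmean(series)
    centered = [x - mean for x in series]
    return [b - rho * a for a, b in zip(centered, centered[1:])]

# Simulated series: each value is 0.8 times the previous one plus white noise.
rng = random.Random(42)
series = [0.0]
for _ in range(2000):
    series.append(0.8 * series[-1] + rng.gauss(0, 1))

residuals = decorrelate(series)
print("lag-1 autocorrelation before:", round(lag1_autocorrelation(series), 3))
print("lag-1 autocorrelation after: ", round(lag1_autocorrelation(residuals), 3))
```

If the residuals still showed clear auto-correlation after a step like this, that would be the "departure from white noise" the quote is talking about.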


Getting Started With Azure Cognitive Services

Rolf Tesmer has a demo app showing what Azure Cognitive Services Text Analytics can do:

Each execution of the application on any input file will generate 3 text output files with the results of the assessment.  The application runs at a rate of about 1-2 calls per second (the max send rate cannot exceed 100/min as this is the API limit).

  • File 1 [AzureTextAPI_SentimentText_YYYYMMDDHHMMSS.txt] – the sentiment score between 0 and 1 for each individual line in the Source Text File.  The entire line in the file is graded as a single data point.  0 is negative, 1 is positive.

  • File 2 [AzureTextAPI_SentenceText_YYYYMMDDHHMMSS.txt] – if the “Split Document into Sentences” option was selected then this contains each individual sentence in each individual line with the sentiment score of that sentence between 0 and 1.  0 is negative, 1 is positive.  RegEx is used to split the line into sentences.

  • File 3 [AzureTextAPI_KeyPhrasesText_YYYYMMDDHHMMSS.txt] – the key phrases identified within the text on each individual line in the Source Text File.

Rolf has also put his code on GitHub, so read on and check out his repo.


Time Series Aggregation

Steph Locke answers an important question related to time series:

Additive or multiplicative?

It’s important to understand what the difference is between a multiplicative time series and an additive one before we go any further.

There are three components to a time series:

  • trend: how things are overall changing
  • seasonality: how things change within a given period, e.g. a year, month, week, day
  • error/residual/irregular: activity not explained by the trend or the seasonal value

How these three components interact determines the difference between a multiplicative and an additive time series.

Click through to learn how to spot an additive time series versus a multiplicative one.  There is a good bit of very important detail here.
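A quick way to see the difference is to build both kinds of series from the same trend and compare how large the seasonal swing is from one year to the next. The sketch below is my own illustration in Python with made-up numbers; it leaves out the error component to keep the pattern easy to see, and the helper name `seasonal_swing` is hypothetical.

```python
import math

periods = 24  # two hypothetical years of monthly data
trend = [100 + 10 * t for t in range(periods)]

# Additive: the seasonal effect is a fixed amount added to the trend.
additive = [tr + 20 * math.sin(2 * math.pi * t / 12)
            for t, tr in enumerate(trend)]

# Multiplicative: the seasonal effect is a percentage of the trend.
multiplicative = [tr * (1 + 0.2 * math.sin(2 * math.pi * t / 12))
                  for t, tr in enumerate(trend)]

def seasonal_swing(series, trend, start, length=12):
    """Range of the deviations from trend over one seasonal cycle."""
    deviations = [series[t] - trend[t] for t in range(start, start + length)]
    return max(deviations) - min(deviations)

for name, series in (("additive", additive), ("multiplicative", multiplicative)):
    year1 = seasonal_swing(series, trend, start=0)
    year2 = seasonal_swing(series, trend, start=12)
    print(f"{name:>14}: year 1 swing = {year1:.1f}, year 2 swing = {year2:.1f}")
```

In the additive series the swing stays the same size as the trend rises, while in the multiplicative series it grows with the level of the series.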
