Data Science – Page 76

Project structures often organically grow to suit people’s needs, leading to different project structures within a team. You can consider yourself lucky if at some point in time you find, or someone in your team finds, a obscure blog post with a somewhat sane structure and enforces it in your team.

Many years ago I stumbled upon ProjectTemplate for R. Since then I’ve tried to get people to use a good project structure. More recently DrivenData (what’s in a name?) released their more generic Cookiecutter Data Science.

The main philosophies of those projects are:

A consistent and well-organized structure allows people to collaborate more easily.
Your analyses should be reproducible and your structure should enable that.
A projects starts from raw data that should never be edited; consider raw data immutable and only edit derived sources.

This is a set of prescriptions and focuses on the phase before the project actually kicks off.

Comments closed

Frequency Tables

Published 2017-03-06 by Kevin Feasel

Mala Mahadevan shows how to generate a frequency table in T-SQL and in R:

My results are as below. I have 1000 records in the table. This tells me that I have 82 occurences of age cohort 0-5, 8.2% of my dataset is from this bracket, 82 again is the cumulative frequency since this is the first record and 8.2 cumulative percent. For the next bracket 06-12 I have 175 occurences, 17.5 %, 257 occurences of age below 12, and 25.7 % of my data is in this age bracket. And so on.

Click through for the T-SQL and R scripts.

Comments closed

Prophet

Published 2017-03-01 by Kevin Feasel

Rodrigo Agundez looks at Prophet, Facebook’s new API for store sales forecasting:

The data is of a current client, therefore I won’t be disclosing any details of it.

Our models make forecasts for different shops of this company. In particular I took 2 shops, one which contains the easiest transactions to predict from all shops, and another with a somewhat more complicated history.

The data consists of real transactions since 2014. Data is daily with the target being the number of transactions executed during a day. There are missing dates in the data when the shop closed, for example New Year’s day and Christmas.

The holidays provided to the API are the same I use in our model. They contain from school vacations or large periods, to single holidays like Christmas Eve. In total, the data contains 46 different holidays.

It looks like Prophet has some limitations but can already make some nice predictions.

Comments closed

Removing Time Series Auto-Correlation

Published 2017-02-24 by Kevin Feasel

Vincent Granville shows a simple technique for removing auto-correlation from time series data:

A deeper investigation consists in isolating the auto-correlations to see whether the remaining values, once decorrelated, behave like white noise, or not. If departure from white noise is found, then it means that the time series in question exhibits unusual patterns not explained by trends, seasonality or auto correlations. This can be useful knowledge in some contexts such as high frequency trading, random number generation, cryptography or cyber-security. The analysis of decorrelated residuals can also help identify change points and instances of slope changes in time series.

Dealing with serial correlation is a big issue in econometrics; if you don’t deal with it in an Ordinary Least Squares regression, your regression will appear to have more explanatory power than it really does.

Comments closed

Getting Started With Azure Cognitive Services

Published 2017-02-23 by Kevin Feasel

Rolf Tesmer has a demo app showing what Azure Cognitive Services Text Analytics can do:

Each execution of the application on any input file will generate 3 text output files with the results of the assessment. The application runs at a rate of about 1-2 calls per second (the max send rate cannot exceed 100/min as this is the API limit).

File 1 [AzureTextAPI_SentimentText_YYYYMMDDHHMMSS.txt] – the sentiment score between 0 and 1 for each individual line in the Source Text File. The entire line in the file is graded as a single data point. 0 is negative, 1 is positive.
File 2 [AzureTextAPI_SentenceText_YYYYMMDDHHMMSS.txt] – if the “Split Document into Sentences” option was selected then this contains each individual sentence in each individual line with the sentiment score of that sentence between 0 and 1. 0 is negative, 1 is positive. RegEx is used to split the line into sentences.
File 3 [AzureTextAPI_KeyPhrasesText_YYYYMMDDHHMMSS.txt] – the key phrases identified within the text on each individual line in the Source Text File.

Rolf has also put his code on GitHub, so read on and check out his repo.

Comments closed

Time Series Aggregation

Published 2017-02-20 by Kevin Feasel

Steph Locke answers an important question related to time series:

Additive or multiplicative?

It’s important to understand what the difference between a multiplicative time series and an additive one before we go any further.

There are three components to a time series:
– trend how things are overall changing
– seasonality how things change within a given period e.g. a year, month, week, day
– error/residual/irregular activity not explained by the trend or the seasonal value

How these three components interact determines the difference between a multiplicative and an additive time series.

Click through to learn how to spot an additive time series versus a multiplicative. There is a good bit of very important detail here.

Comments closed

Dplyr Tutorial

Published 2017-02-09 by Kevin Feasel

Deepanshu Bhalla has a nice dplyr tutorial:

What is dplyr?

dplyr is a powerful R-package to manipulate, clean and summarize unstructured data. In short, it makes data exploration and data manipulation easy and fast in R.

What’s special about dplyr?

The package “dplyr” comprises many functions that perform mostly used data manipulation operations such as applying filter, selecting specific columns, sorting data, adding or deleting columns and aggregating data. Another most important advantage of this package is that it’s very easy to learn and use dplyr functions. Also easy to recall these functions. For example, filter() is used to filter rows.

dplyr is a core package when it comes to data cleansing in R, and the more of it you can internalize, the faster you’ll move in that language.

Comments closed

Using Sparklyr To Analyze Flight Data

Published 2017-02-07 by Kevin Feasel

Aki Ariga uses sparklyr on Apache Spark 2.0 to analyze flight data living in S3:

Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark as a backend database of dplyr. You can create tidy data from huge messy data, plot complex maps from this big data the same way as small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and easier on large-scale data. Let’s try!

You can see the Rmarkdown of this analysis on RPubs. With RStudio, you can share Rmarkdown easily on RPubs.

Sparklyr is an exciting technology for distributed data analysis.

Comments closed

Principal Component Analysis Using R

Published 2017-01-26 by Kevin Feasel

Francisco Lima explains what principal component analysis is and shows how to do it in R:

Three lines of code and we see a clear separation among grape vine cultivars. In addition, the data points are evenly scattered over relatively narrow ranges in both PCs. We could next investigate which parameters contribute the most to this separation and how much variance is explained by each PC, but I will leave it for pcaMethods. We will now repeat the procedure after introducing an outlier in place of the 10th observation.

PCA is extremely useful when you have dozens of contributing factors, as it lets you narrow in on the big contributors quickly.

Comments closed

Support Vector Machines In R

Published 2017-01-18 by Kevin Feasel

Deepanshu Bhalla explains what support vector machines are:

The main idea of support vector machine is to find the optimal hyperplane (line in 2D, plane in 3D and hyperplane in more than 3 dimensions) which maximizes the margin between two classes. In this case, two classes are red and blue balls. In layman’s term, it is finding the optimal separating boundary to separate two classes (events and non-events).

Deepanshu then goes on to implement this in R.

Comments closed

M	T	W	T	F	S	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Category: Data Science

Building A Python Project Template

Frequency Tables

Prophet

Removing Time Series Auto-Correlation

Getting Started With Azure Cognitive Services

Time Series Aggregation

Additive or multiplicative?

Dplyr Tutorial

Using Sparklyr To Analyze Flight Data

Principal Component Analysis Using R

Support Vector Machines In R