Vincent Granville shows a simple technique for removing auto-correlation from time series data:

A deeper investigation consists in isolating the auto-correlations to see whether the remaining values, once decorrelated, behave like white noise, or not. If departure from white noise is found, then it means that the time series in question exhibits unusual patterns not explained by trends, seasonality or auto correlations. This can be useful knowledge in some contexts such as high frequency trading, random number generation, cryptography or cyber-security. The analysis of decorrelated residuals can also help identify change points and instances of slope changes in time series.

Dealing with serial correlation is a big issue in econometrics; if you don’t deal with it in an Ordinary Least Squares regression, your regression will appear to have more explanatory power than it really does.
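As a rough sketch of the idea (simulated data, not Granville's actual method): fit a naive OLS model to an AR(1) series, test the residuals for serial correlation, then decorrelate by regressing the series on its own lag and test again:

```r
set.seed(42)

# Simulate an AR(1) series: each value carries over 0.7 of the previous one
n <- 200
e <- rnorm(n)
y <- numeric(n)
for (i in 2:n) y[i] <- 0.7 * y[i - 1] + e[i]

# Naive OLS of the series on a time trend
t <- seq_len(n)
fit <- lm(y ~ t)

# Ljung-Box test on the residuals: a small p-value flags autocorrelation
lb_raw <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box")

# Remove the AR(1) structure by regressing on the lagged series;
# the residuals of this fit should now look much closer to white noise
fit_ar <- lm(y[-1] ~ y[-n])
lb_decor <- Box.test(residuals(fit_ar), lag = 10, type = "Ljung-Box")
```

If the decorrelated residuals still fail the white-noise test, that is the signal Granville describes: structure beyond trend, seasonality, and autocorrelation.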

Rolf Tesmer has a demo app showing what Azure Cognitive Services Text Analytics can do:

Each execution of the application on any input file will generate 3 text output files with the results of the assessment. The application runs at a rate of about 1-2 calls per second (the max send rate cannot exceed 100/min as this is the API limit).

**File 1 [AzureTextAPI_SentimentText_YYYYMMDDHHMMSS.txt]** – the sentiment score between 0 and 1 for each individual line in the Source Text File. The entire line in the file is graded as a single data point. 0 is negative, 1 is positive.

**File 2 [AzureTextAPI_SentenceText_YYYYMMDDHHMMSS.txt]** – if the “*Split Document into Sentences*” option was selected, this contains each individual sentence in each individual line, with the sentiment score of that sentence between 0 and 1. 0 is negative, 1 is positive. RegEx is used to split the line into sentences.

**File 3 [AzureTextAPI_KeyPhrasesText_YYYYMMDDHHMMSS.txt]** – the key phrases identified within the text on each individual line in the Source Text File.

Rolf has also put his code on GitHub, so read on and check out his repo.

Steph Locke answers an important question related to time series:

## Additive or multiplicative?

It’s important to understand the difference between a multiplicative time series and an additive one before we go any further.

There are three components to a time series:

– trend: how things are overall changing

– seasonality: how things change within a given period, e.g. a year, month, week, day

– error/residual/irregular: activity not explained by the trend or the seasonal value

How these three components interact determines the difference between a multiplicative and an additive time series.

Click through to learn how to spot an additive time series versus a multiplicative one. There is a good bit of very important detail here.
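As a quick illustration of the distinction (using the built-in AirPassengers data rather than anything from Locke's post), base R's `decompose` can extract the three components under either model:

```r
# AirPassengers (monthly airline passengers, 1949-1960) is a classic
# multiplicative series: the seasonal swings grow with the trend.
mult <- decompose(AirPassengers, type = "multiplicative")
addv <- decompose(AirPassengers, type = "additive")

# Under the multiplicative model the seasonal component is a ratio
# centered near 1; under the additive model it is an offset near 0.
mean(mult$seasonal)
mean(addv$seasonal)
```

A handy corollary: taking the log of a multiplicative series turns it into an additive one, since log(trend × seasonal × error) = log(trend) + log(seasonal) + log(error).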

Deepanshu Bhalla has a nice dplyr tutorial:

What is dplyr? dplyr is a powerful R package to manipulate, clean, and summarize unstructured data. In short, it makes data exploration and data manipulation easy and fast in R.

What’s special about dplyr? The package comprises many functions that perform commonly used data manipulation operations, such as filtering rows, selecting specific columns, sorting data, adding or deleting columns, and aggregating data. Another important advantage of this package is that its functions are very easy to learn, use, and recall. For example, filter() is used to filter rows.

dplyr is a core package when it comes to data cleansing in R, and the more of it you can internalize, the faster you’ll move in that language.
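A minimal sketch of the verb-based style the tutorial covers, run against the built-in mtcars data:

```r
library(dplyr)

# Filter, select, aggregate, and sort with dplyr's verb functions
result <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%              # keep 4- and 6-cylinder cars
  select(cyl, mpg, wt) %>%                  # keep only these columns
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), n = n()) %>%
  arrange(desc(avg_mpg))                    # sort by average mpg, descending
```

Each verb takes a data frame and returns a data frame, which is what makes the pipeline style so easy to read and recall.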

Aki Ariga uses sparklyr on Apache Spark 2.0 to analyze flight data living in S3:

Using sparklyr enables you to analyze big data on Amazon S3 with R smoothly. You can build a Spark cluster easily with Cloudera Director. sparklyr makes Spark a backend for dplyr. You can create tidy data from huge messy data, plot complex maps from this big data the same way as with small data, and build a predictive model from big data with MLlib. I believe sparklyr helps all R users perform exploratory data analysis faster and easier on large-scale data. Let’s try!

You can see the Rmarkdown of this analysis on RPubs. With RStudio, you can share Rmarkdown easily on RPubs.

Sparklyr is an exciting technology for distributed data analysis.

Francisco Lima explains what principal component analysis is and shows how to do it in R:

Three lines of code and we see a clear separation among grape vine cultivars. In addition, the data points are evenly scattered over relatively narrow ranges in both PCs. We could next investigate which parameters contribute the most to this separation and how much variance is explained by each PC, but I will leave it for pcaMethods. We will now repeat the procedure after introducing an outlier in place of the 10th observation.

PCA is extremely useful when you have dozens of contributing factors, as it lets you narrow in on the big contributors quickly.
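A minimal PCA sketch with base R's `prcomp`, using the built-in iris measurements as a stand-in for the wine data in Francisco's post:

```r
# PCA on the four iris measurements, standardized so no single
# variable dominates because of its scale
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# The scores in pca$x can be plotted to look for group separation:
# plot(pca$x[, 1:2], col = iris$Species)
```

Here the first component alone captures most of the variance, which is exactly the "narrowing in on the big contributors" payoff described above.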

Deepanshu Bhalla explains what support vector machines are:

The main idea of support vector machines is to find the optimal hyperplane (line in 2D, plane in 3D, and hyperplane in more than 3 dimensions) which maximizes the margin between two classes. In this case, the two classes are red and blue balls. In layman’s terms, it is finding the optimal separating boundary to separate two classes (events and non-events).

Deepanshu then goes on to implement this in R.
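A minimal sketch of the idea using the e1071 package (two iris species standing in for the red and blue balls; this is not Deepanshu's exact code):

```r
library(e1071)

# Two classes only, so the "maximum-margin boundary" picture applies
two <- droplevels(subset(iris, Species != "setosa"))

# Fit a linear-kernel SVM separating the two species
fit <- svm(Species ~ ., data = two, kernel = "linear")

# In-sample accuracy of the fitted boundary
acc <- mean(predict(fit) == two$Species)
```

Swapping `kernel = "linear"` for `"radial"` is the usual next step when the two classes are not linearly separable.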

Sibanjan Das provides some of the basics of data analysis using R:

Let’s think through, in a logical way, the steps one should perform once the data have been imported into R.

- The first step would be to discover what’s in the data file that was imported. To do this, we can:

- Use the `head` function to view a few rows from the data set. By default, head shows the first 6 rows. Ex: `head(mtcars)`

- Use `str` to view the structure of the imported data. Ex: `str(mtcars)`

- Use `summary` to view the data summary. Ex: `summary(mtcars)`
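The steps above can be run in a single first-look session:

```r
# First look at a freshly imported data set, using mtcars as a stand-in
head(mtcars)          # first 6 rows (head's default)
head(mtcars, n = 10)  # or ask for a specific number of rows
str(mtcars)           # dimensions and the type of each column
summary(mtcars)       # min / quartiles / mean / max for each column
```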

There’s a lot to data analysis, but this is a good start.

Martin Willcox offers some advice for people getting into the real-time analytics game:

**Clarify who will be making the decision – man, or machine?** Humans have powers of discretion that machines sometimes lack, but are much slower than a silicon-based system, and only able to make decisions one-at-a-time, one-after-another. If we choose to put a human in the loop, we are normally in “please-update-my-dashboard-faster-and-more-often” territory.

**It is important to be clear about decision-latency.** Think about how soon after a business event you need to take a decision and then implement it. You also need to understand whether decision-latency and data-latency are the same. Sometimes a good decision can be made now on the basis of older data. But sometimes you need the latest, greatest, and most up-to-date information to make the right choices.

There are some good insights here.

Angelika Stefan and Felix Schönbrodt explain the concept of priors:

When reading about Bayesian statistics, you regularly come across terms like “objective priors”, “prior odds”, “prior distribution”, and “normal prior”. However, it may not be intuitively clear that the meaning of “prior” differs in these terms. In fact, there are two meanings of “prior” in the context of Bayesian statistics: (a) prior plausibilities of models, and (b) the quantification of uncertainty about model parameters. As this often leads to confusion for novices in Bayesian statistics, we want to explain these two meanings of priors in the next two blog posts*. The current blog post covers the first meaning of priors.

Priors are a big differentiator between the Bayesian statistical model and the classical/frequentist statistical model.
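A tiny worked example of the first meaning, prior plausibilities of models: prior odds get updated by a Bayes factor to give posterior odds (the numbers here are made up purely for illustration):

```r
# Prior plausibility: we give model M1 probability 0.2, model M2 0.8
prior_odds <- 0.2 / 0.8                       # 0.25 in favor of M1

# Suppose the observed data are 6x more likely under M1 than M2
bayes_factor <- 6

# Bayes' rule in odds form: posterior odds = prior odds * Bayes factor
posterior_odds <- prior_odds * bayes_factor   # 0.25 * 6 = 1.5

# Convert the odds back to a posterior probability for M1
posterior_prob <- posterior_odds / (1 + posterior_odds)  # 1.5 / 2.5 = 0.6
```

Note how a skeptical prior (odds of 0.25 against M1) tempers even fairly strong evidence: the data favor M1 six-to-one, yet the posterior probability only reaches 0.6.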

Kevin Feasel

2017-02-24

Data Science
