Category: Data Science

To be able to do the next stage I was interested in a solution that would give me an API or a bulk-upload interface for free or cheap. Google Ads has a Keyword Planner but having to create a campaign before I could even get started wasn’t fun. Another recommendation I’d had for keyword analysis was ahrefs.com and I noticed they had a $7 trial and an API that looked well documented. The API didn’t seem to cover keywords but it did have a bulk upload capability.

Some of what Steph’s doing is possible within AdWords reporting, but they will tend toward finding helpful ways of maximizing your spend.

Comments closed

Analyzing Autosteer Data (Or Lack Thereof)

Published 2019-03-05 by Kevin Feasel

Elliot Williams has an interesting analysis of the NHTSA report on Tesla’s Autosteer capabilities:

But the NHTSA report went a step further. Based on the data that Tesla provided them, they noted that since the addition of Autosteer to Tesla’s confusingly named “Autopilot” suite of functions, the rate of crashes severe enough to deploy airbags declined by 40%. That’s a fantastic result.
Because it was so spectacular, a private company with a history of investigating automotive safety wanted to have a look at the data. The NHTSA refused because Tesla claimed that the data was a trade secret, so Quality Control Systems (QCS) filed a Freedom of Information Act lawsuit to get the data on which the report was based. Nearly two years later, QCS eventually won.
Looking into the data, QCS concluded that crashes may have actually increased by as much as 60% on the addition of Autosteer, or maybe not at all.

This is a great exercise in statistical analysis and the problem of garbage in, garbage out.

Comments closed

Robust Regressions in R

Published 2019-02-28 by Kevin Feasel

Michael Grogan shows how you can find and re-weigh outliers when performing regressions:

A useful way of dealing with outliers is by running a robust regression, or a regression that adjusts the weights assigned to each observation in order to reduce the skew resulting from the outliers.
In this particular example, we will build a regression to analyse internet usage in megabytes across different observations. You will see that we have several outliers in this dataset. Specifically, we have three incidences where internet consumption is vastly higher than other observations in the dataset.
Let’s see how we can use a robust regression to mitigate for these outliers.

Click through for a demonstration.

Comments closed

Conjoint Analysis In R

Published 2019-02-21 by Kevin Feasel

Abhijit Telang introduces the concept of conjoint analysis and shows how you can implement this in R:

We will need to typically transform the problem of utility modeling from its intangible, abstract form to something that is measurable. That is, we wish to assign a numeric value to the perceived utility by the consumer, and we want to measure that accurately and precisely (as much as possible).
This is where survey design comes in, where, as a market researcher, we must design inputs (in the form of questionnaires) to have respondents do the hard work of transforming their qualitative, habitual, perceptual opinions into simplified, summarized aggregate values which are expressed either as a numeric value or on a rank scale.

I tend to shy away from this kind of analysis because it runs a huge risk of trying to turn ordinal utility rankings into cardinal functions.

Comments closed

Bayesian Modeling Of Hardware Failure Rates

Published 2019-02-20 by Kevin Feasel

Sean Owen shows how you can use Bayesian statistical approaches with Spark Streaming, using the example of hard drive failure rates:

This data doesn’t arrive all at once, in reality. It arrives in a stream, and so it’s natural to run these kind of queries continuously. This is simple with Apache Spark’s Structured Streaming, and proceeds almost identically.
Of course, on the first day this streaming analysis is rolled out, it starts from nothing. Even after two quarters of data here, there’s still significant uncertainty about failure rates, because failures are rare.
An organization that’s transitioning this kind of offline data science to an online streaming context probably does have plenty of historical data. This is just the kind of prior belief about failure rates that can be injected as a prior distribution on failure rates!

Bayesian approaches work really well with streaming data if you think of the streams as sampling events used to update your priors to a new posterior distribution.

Comments closed

Handling Definitional Changes In Predictive Variables

Published 2019-02-20 by Kevin Feasel

Vincent Granville explains how you can blend two different definitions of a variable of interest together:

The reasons why scores can become meaningless over time is because data evolves. New features (variables) are added that were not available before, the definition of a metric is suddenly changed (for instance, the way income is measured) resulting in new data not compatible with prior data, and faulty scores. Also, when external data is gathered across multiple sources, each source may compute it differently, resulting in incompatibilities: for instance, when comparing individual credit scores from two people that are costumers at two different banks, each bank computes base metrics (income, recency, net worth, and so on) used to build the score, in a different way. Sometimes the issue is caused by missing data, especially when users with missing data are very different from those with full data attached to them.

Click through for a description of the approach and links showing how it works in practice.

Comments closed

Using Convolutional Neural Networks To Recognize Features In Images

Published 2019-02-19 by Kevin Feasel

Michael Grogan shows how you can use Keras to perform image recognition with a convolutional neural network:

VGG16 is a built-in neural network in Keras that is pre-trained for image recognition.
Technically, it is possible to gather training and test data independently to build the classifier. However, this would necessitate at least 1,000 images, with 10,000 or greater being preferable.
In this regard, it is much easier to use a pre-trained neural network that has already been designed for image classification purposes.

This is probably the best generally available technique for image classification.

Comments closed

Combining Stream Analytics And Azure ML With Power BI

Published 2019-02-19 by Kevin Feasel

Brad Llewellyn shows us how to feed Azure ML predictions into Power BI via Azure Stream Analytics:

Today, we’re going to talk about combining Stream Analytics with Azure Machine Learning Studio within Power BI. If you haven’t read the earlier posts in this series, Introduction, Getting Started with R Scripts, Clustering, Time Series Decomposition, Forecasting, Correlations, Custom R Visuals, R Scripts in Query Editor, Python, Azure Machine Learning Studio and Stream Analytics, they may provide some useful context. You can find the files from this post in our GitHub Repository. Let’s move on to the core of this post, Stream Analytics.

This post is going to build directly on what we created in the two previous posts, Azure Machine Learning Studio and Stream Analytics. As such, we recommend that you read them before proceeding.

Read on for the demo.

Comments closed

Kafka And The Differing Aims Of Data Professionals

Published 2019-02-15 by Kevin Feasel

Kai Waehner argues that there is an impedence mismatch between data engineers, data scientists, and ML production engineers:

Data scientists love Python, period. Therefore, the majority of machine learning/deep learning frameworks focus on Python APIs. Both the stablest and most cutting edge APIs, as well as the majority of examples and tutorials use Python APIs. In addition to Python support, there is typically support for other programming languages, including JavaScript for web integration and Java for platform integration—though oftentimes with fewer features and less maturity. No matter what other platforms are supported, chances are very high that your data scientists will build and train their analytic models with Python.
There is an impedance mismatch between model development using Python, its tool stack and a scalable, reliable data platform with low latency, high throughput, zero data loss and 24/7 availability requirements needed for data ingestion, preprocessing, model deployment and monitoring at scale. Python in practice is not the most well-known technology for these requirements. However, it is a great client for a data platform like Apache Kafka.

Click through for the full argument as well as where Kafka can help mitigate some of the issues.

Comments closed

Solving The Monty Hall Problem With R

Published 2019-02-13 by Kevin Feasel

Miroslav Rajter builds a Monty Hall problem simulator using R:

The original and most simple scenario of the Monty Hall problem is this: You are in a prize contest and in front of you there are three doors (A, B and C). Behind one of the doors is a prize (Car), while behind others is a loss (Goat). You first choose a door (let’s say door A). The contest host then opens another door behind which is a goat (let’s say door B), and then he ask you will you stay behind your original choice or will you switch the door. The question behind this is what is the better strategy?

This is something that puzzled me for a very long time. This is fundamentally a Bayesian problem built around processing new information, and once I understood that, the answer was a lot clearer. H/T R-Bloggers.

Comments closed

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31