Press "Enter" to skip to content

Category: Data Science

Identifying Distributions with knn in R

Abhijit Telang has an interesting post on identifying arbitrary distributions with the k-nearest-neighbor algorithm in R:

You can easily see how arbitrary shapes can be almost magically discovered through the principle of nearest-neighbor search.

The magic happens because the methodical approach of meeting and greeting the neighbors discovers more and more neighbors within the body of the shape (and hence the visualization becomes denser and denser) and, on the other hand, fewer and fewer as the traversal approaches the contours of those very shapes. The sparseness around the dense shapes provides the much-needed contrast to discover hidden shapes.

Read on for a very interesting explanation.
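If you want to experiment with the idea yourself, here is a minimal sketch in R (my own toy example, not Abhijit’s code, and it assumes the FNN package is installed): points sampled from a ring plus background noise, with the average distance to the k nearest neighbors providing the density contrast that reveals the shape.

# Toy example: reveal a hidden ring via k-nearest-neighbor distances (assumes FNN is installed)
library(FNN)

set.seed(42)
theta <- runif(500, 0, 2 * pi)
ring  <- cbind(cos(theta), sin(theta)) + matrix(rnorm(1000, sd = 0.05), ncol = 2)
noise <- cbind(runif(200, -2, 2), runif(200, -2, 2))
pts   <- rbind(ring, noise)

# Average distance to the k nearest neighbors: small inside the dense shape,
# large in the sparse region surrounding it
k <- 10
avg_dist <- rowMeans(knn.dist(pts, k = k))

# Color by that contrast to make the shape pop out
plot(pts, col = ifelse(avg_dist < median(avg_dist), "blue", "grey70"), pch = 19, cex = 0.5)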


Shoestring Budget Keyword Bidding

Steph Locke shares some experience on keyword bidding without breaking the piggy bank:

To be able to do the next stage I was interested in a solution that would give me an API or a bulk-upload interface for free or cheap. Google Ads has a Keyword Planner but having to create a campaign before I could even get started wasn’t fun. Another recommendation I’d had for keyword analysis was ahrefs.com and I noticed they had a $7 trial and an API that looked well documented. The API didn’t seem to cover keywords but it did have a bulk upload capability.

Some of what Steph’s doing is possible within AdWords reporting, but Google will tend toward finding helpful ways of maximizing your spend.


Analyzing Autosteer Data (Or Lack Thereof)

Elliot Williams has an interesting analysis of the NHTSA report on Tesla’s Autosteer capabilities:

But the NHTSA report went a step further. Based on the data that Tesla provided them, they noted that since the addition of Autosteer to Tesla’s confusingly named “Autopilot” suite of functions, the rate of crashes severe enough to deploy airbags declined by 40%. That’s a fantastic result.

Because it was so spectacular, a private company with a history of investigating automotive safety wanted to have a look at the data. The NHTSA refused because Tesla claimed that the data was a trade secret, so Quality Control Systems (QCS) filed a Freedom of Information Act lawsuit to get the data on which the report was based. Nearly two years later, QCS eventually won.

Looking into the data, QCS concluded that crashes may have actually increased by as much as 60% on the addition of Autosteer, or maybe not at all. 

This is a great exercise in statistical analysis and the problem of garbage in, garbage out.


Robust Regressions in R

Michael Grogan shows how you can find and re-weight outliers when performing regressions:

A useful way of dealing with outliers is by running a robust regression, or a regression that adjusts the weights assigned to each observation in order to reduce the skew resulting from the outliers.

In this particular example, we will build a regression to analyse internet usage in megabytes across different observations. You will see that we have several outliers in this dataset. Specifically, we have three incidences where internet consumption is vastly higher than other observations in the dataset.

Let’s see how we can use a robust regression to mitigate for these outliers.

Click through for a demonstration.
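If you just want the mechanics, here is a minimal sketch using MASS::rlm on simulated data (not the internet-usage dataset from the post); rlm performs Huber M-estimation via iteratively re-weighted least squares, so the outliers end up with low weights.

# Simulated data with three extreme outliers, not the dataset from the linked post
library(MASS)

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
y[c(5, 50, 95)] <- y[c(5, 50, 95)] + 25   # three vastly higher observations

ols    <- lm(y ~ x)    # ordinary least squares, pulled toward the outliers
robust <- rlm(y ~ x)   # robust regression down-weights the outliers

coef(ols)
coef(robust)
round(robust$w[c(5, 50, 95)], 3)   # weights assigned to the outlying observations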


Conjoint Analysis In R

Abhijit Telang introduces the concept of conjoint analysis and shows how you can implement this in R:

We will need to typically transform the problem of utility modeling from its intangible, abstract form to something that is measurable. That is, we wish to assign a numeric value to the perceived utility by the consumer, and we want to measure that accurately and precisely (as much as possible).

This is where survey design comes in, where, as market researchers, we must design inputs (in the form of questionnaires) to have respondents do the hard work of transforming their qualitative, habitual, perceptual opinions into simplified, summarized aggregate values which are expressed either as a numeric value or on a rank scale.
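Mechanically, the estimation can be as simple as regressing reversed ranks on dummy-coded attribute levels. Here is a hypothetical single-respondent sketch (the profiles and ranks are made up, not taken from the post):

# Hypothetical product profiles and one respondent's made-up rankings
profiles <- expand.grid(
  brand = c("A", "B"),
  price = c("low", "high"),
  size  = c("small", "large")
)
profiles$rank <- c(1, 3, 5, 7, 2, 4, 6, 8)   # 1 = most preferred, 8 = least preferred

# Reverse the rank so larger values mean higher utility, then read the
# coefficients as part-worth utilities relative to the baseline levels
profiles$utility <- 8 - profiles$rank
fit <- lm(utility ~ brand + price + size, data = profiles)
coef(fit)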

I tend to shy away from this kind of analysis because it runs a huge risk of trying to turn ordinal utility rankings into cardinal functions.


Bayesian Modeling Of Hardware Failure Rates

Sean Owen shows how you can use Bayesian statistical approaches with Spark Streaming, using the example of hard drive failure rates:

This data doesn’t arrive all at once, in reality. It arrives in a stream, and so it’s natural to run these kinds of queries continuously. This is simple with Apache Spark’s Structured Streaming, and proceeds almost identically.

Of course, on the first day this streaming analysis is rolled out, it starts from nothing. Even after two quarters of data here, there’s still significant uncertainty about failure rates, because failures are rare.

An organization that’s transitioning this kind of offline data science to an online streaming context probably does have plenty of historical data. This is just the kind of prior belief about failure rates that can be injected as a prior distribution on failure rates!

Bayesian approaches work really well with streaming data if you think of the streams as sampling events used to update your priors to a new posterior distribution.
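As a toy illustration of that updating idea (made-up numbers, not Sean’s Spark Streaming code), a conjugate Beta prior on the per-drive-day failure probability can be refreshed as each micro-batch arrives:

# Conjugate Beta-Binomial updating with made-up numbers
prior_alpha <- 2
prior_beta  <- 5000    # hypothetical prior belief: failures are rare

# Each element stands in for one streamed micro-batch: c(failures, drive-days observed)
batches <- list(c(3, 40000), c(1, 42000), c(5, 41000))

alpha <- prior_alpha
beta  <- prior_beta
for (b in batches) {
  alpha <- alpha + b[1]              # add observed failures
  beta  <- beta  + (b[2] - b[1])     # add observed non-failures
  cat(sprintf("posterior mean failure rate: %.6f\n", alpha / (alpha + beta)))
}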


Handling Definitional Changes In Predictive Variables

Vincent Granville explains how you can blend two different definitions of a variable of interest together:

The reason scores can become meaningless over time is that data evolves. New features (variables) are added that were not available before, the definition of a metric is suddenly changed (for instance, the way income is measured), resulting in new data not compatible with prior data, and faulty scores. Also, when external data is gathered across multiple sources, each source may compute it differently, resulting in incompatibilities: for instance, when comparing individual credit scores from two people who are customers at two different banks, each bank computes the base metrics (income, recency, net worth, and so on) used to build the score in a different way. Sometimes the issue is caused by missing data, especially when users with missing data are very different from those with full data attached to them.
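One generic way to reconcile two incompatible definitions, and this is a simple illustration rather than the approach Vincent describes, is to take a period where both versions of the metric were recorded for the same people and map one onto the other, for example by matching empirical quantiles:

# Quantile mapping between an old and a new definition of a metric (simulated data)
set.seed(7)
overlap_old <- rexp(1000, rate = 1 / 50000)                  # metric under the old definition
overlap_new <- overlap_old * 1.2 + rnorm(1000, sd = 2000)    # same people, new definition

# Translate a value measured under the old definition onto the new definition's scale
old_to_new <- function(x) {
  q <- ecdf(overlap_old)(x)
  as.numeric(quantile(overlap_new, probs = q))
}

old_to_new(c(30000, 60000, 90000))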

Click through for a description of the approach and links showing how it works in practice.


Using Convolutional Neural Networks To Recognize Features In Images

Michael Grogan shows how you can use Keras to perform image recognition with a convolutional neural network:

VGG16 is a built-in neural network in Keras that is pre-trained for image recognition.

Technically, it is possible to gather training and test data independently to build the classifier. However, this would necessitate at least 1,000 images, with 10,000 or greater being preferable.

In this regard, it is much easier to use a pre-trained neural network that has already been designed for image classification purposes.
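For reference, loading the pre-trained network through the keras package in R takes only a few lines. This is a rough sketch that assumes keras and its Python backend are installed; "elephant.jpg" is a placeholder path.

# Classify an image with the pre-trained VGG16 network ("elephant.jpg" is a placeholder)
library(keras)

model <- application_vgg16(weights = "imagenet")   # pre-trained on ImageNet

img <- image_load("elephant.jpg", target_size = c(224, 224))
x   <- image_to_array(img)
x   <- array_reshape(x, c(1, dim(x)))              # add the batch dimension
x   <- imagenet_preprocess_input(x)

preds <- predict(model, x)
imagenet_decode_predictions(preds, top = 3)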

Using a pre-trained network like this is probably the best generally available technique for image classification.


Combining Stream Analytics And Azure ML With Power BI

Brad Llewellyn shows us how to feed Azure ML predictions into Power BI via Azure Stream Analytics:

Today, we’re going to talk about combining Stream Analytics with Azure Machine Learning Studio within Power BI. If you haven’t read the earlier posts in this series (Introduction, Getting Started with R Scripts, Clustering, Time Series Decomposition, Forecasting, Correlations, Custom R Visuals, R Scripts in Query Editor, Python, Azure Machine Learning Studio, and Stream Analytics), they may provide some useful context. You can find the files from this post in our GitHub Repository. Let’s move on to the core of this post, Stream Analytics.

This post is going to build directly on what we created in the two previous posts, Azure Machine Learning Studio and Stream Analytics.  As such, we recommend that you read them before proceeding.

Read on for the demo.


Kafka And The Differing Aims Of Data Professionals

Kai Waehner argues that there is an impedance mismatch between data engineers, data scientists, and ML production engineers:

Data scientists love Python, period. Therefore, the majority of machine learning/deep learning frameworks focus on Python APIs. Both the stablest and most cutting edge APIs, as well as the majority of examples and tutorials use Python APIs. In addition to Python support, there is typically support for other programming languages, including JavaScript for web integration and Java for platform integration—though oftentimes with fewer features and less maturity. No matter what other platforms are supported, chances are very high that your data scientists will build and train their analytic models with Python.

There is an impedance mismatch between model development using Python, its tool stack and a scalable, reliable data platform with low latency, high throughput, zero data loss and 24/7 availability requirements needed for data ingestion, preprocessing, model deployment and monitoring at scale. Python in practice is not the most well-known technology for these requirements. However, it is a great client for a data platform like Apache Kafka.

Click through for the full argument as well as where Kafka can help mitigate some of the issues.
