The Costs of Specialization within Data Science

Eric Colson argues in favor of data science generalists rather than specialists:

But the goal of data science is not to execute. Rather, the goal is to learn and develop profound new business capabilities. Algorithmic products and services like recommendations systemsclient engagement banditsstyle preference classificationsize matchingfashion design systemslogistics optimizersseasonal trend detection, and more can’t be designed up-front. They need to be learned. There are no blueprints to follow; these are novel capabilities with inherent uncertainty. Coefficients, models, model types, hyper parameters, all the elements you’ll need must be learned through experimentation, trial and error, and iteration. With pins, the learning and design are done up-front, before you produce them. With data science, you learn as you go, not before you go.

In the pin factory, when learning comes first we do not expect, nor do we want, the workers to improvise on any aspect the product, except to produce it more efficiently. Organizing by function makes sense since task specialization leads to process efficiencies and production consistency (no variations in the end product).

I think this article captures the downside risk of specialization, but not the downside risks of generalization: some people simply aren’t very good at some things, leading to huge amounts of technical debt down the road or failing a project due to the lack of requisite knowledge or skills. To give a personal example, I have a generalist team, but I still control the data flows (at the very least doing thorough code reviews of any database changes), my application specialist controls app architecture, my statistician reviews algorithms, etc. I don’t claim that this is the best strategy, but a group of pure generalists will have their own set of problems too.

Accidentally Building a Population Graph

Neil Saunders shares an example of a newspaper headline which ultimately just shows us population sizes:

Some poking around in the NSW Transport Open Data portal reveals how many people enter every Sydney train station on a “typical” day in 2016, 2017 and 2018. We could manipulate those numbers in various ways to estimate total, unique passengers for FY 2017-18 but I’m going to argue that the value as-is serves as a proxy variable for “station busyness”.

When working with spatial data cases, it’s important to differentiate an effect you see because it’s actually unique or interesting versus an effect you see because that’s where all of the people are.

Aspect-Based Sentiment Analysis

Federico Pascual explains aspect-based sentiment analysis and then shows how to implement it with MonkeyLearn:

Imagine you have a large dataset of customer feedback from different sources such as NPS, satisfaction surveys, social media, and online reviews. Some positive, some negative and others that contain mixed feelings. You’d use sentiment analysis to classify the polarity of each text, right? After all, it’s already proven to be a highly efficient tool.

But, what if you wanted to pick customer feedback apart, hone in on the details, get down to the nitty-gritty of each review for a more accurate analysis of your customers’ opinions?

Cue aspect-based sentiment analysis (ABSA). A text analysis technique that breaks down text into aspects (attributes or components of a product or service) and allocates each one a sentiment level. This technique can help businesses become customer-centric, which means putting their customers at the heart of everything they do. It’s about listening to their customers, understanding their voice, analyzing their feedback and learning more about customer experiences, as well as their expectations for products or services.

Click through for the demo.

Identifying Distributions with knn in R

Abhijit Telang has an interesting post on identifying arbitrary distributions with the k-nearest-neighbor algorithm in R:

You can easily see how arbitrary the shapes can be almost magically discovered, through the principle of the nearest neighbor search.

The magic happens because the methodical approach of meeting and greeting the neighbors discovers more and more neighbors (and hence the visualization becomes denser and denser) as per the formation of the shape, and on the other hand, sparser and sparser as the traversal approaches the contours of those very shapes. The sparseness around the dense shapes provides the much-needed contrast to discover hidden shapes.

Read on for a very interesting explanation.

Shoestring Budget Keyword Bidding

Steph Locke shares some experience on keyword bidding without breaking the piggy bank:

To be able to do the next stage I was interested in a solution that would give me an API or a bulk-upload interface for free or cheap. Google Ads has a Keyword Planner but having to create a campaign before I could even get started wasn’t fun. Another recommendation I’d had for keyword analysis was and I noticed they had a $7 trial and an API that looked well documented. The API didn’t seem to cover keywords but it did have a bulk upload capability.

Some of what Steph’s doing is possible within AdWords reporting, but they will tend toward finding helpful ways of maximizing your spend.

Analyzing Autosteer Data (Or Lack Thereof)

Elliot Williams has an interesting analysis of the NHTSA report on Tesla’s Autosteer capabilities:

But the NHTSA report went a step further. Based on the data that Tesla provided them, they noted that since the addition of Autosteer to Tesla’s confusingly named “Autopilot” suite of functions, the rate of crashes severe enough to deploy airbags declined by 40%. That’s a fantastic result.

Because it was so spectacular, a private company with a history of investigating automotive safety wanted to have a look at the data. The NHTSA refused because Tesla claimed that the data was a trade secret, so Quality Control Systems (QCS) filed a Freedom of Information Act lawsuit to get the data on which the report was based. Nearly two years later, QCS eventually won.

Looking into the data, QCS concluded that crashes may have actually increased by as much as 60% on the addition of Autosteer, or maybe not at all. 

This is a great exercise in statistical analysis and the problem of garbage in, garbage out.

Robust Regressions in R

Michael Grogan shows how you can find and re-weigh outliers when performing regressions:

A useful way of dealing with outliers is by running a robust regression, or a regression that adjusts the weights assigned to each observation in order to reduce the skew resulting from the outliers.

In this particular example, we will build a regression to analyse internet usage in megabytes across different observations. You will see that we have several outliers in this dataset. Specifically, we have three incidences where internet consumption is vastly higher than other observations in the dataset.

Let’s see how we can use a robust regression to mitigate for these outliers.

Click through for a demonstration.

Conjoint Analysis In R

Abhijit Telang introduces the concept of conjoint analysis and shows how you can implement this in R:

We will need to typically transform the problem of utility modeling from its intangible, abstract form to something that is measurable. That is, we wish to assign a numeric value to the perceived utility by the consumer, and we want to measure that accurately and precisely (as much as possible).

This is where survey design comes in, where, as a market researcher, we must design inputs (in the form of questionnaires) to have respondents do the hard work of transforming their qualitative, habitual, perceptual opinions into  simplified, summarized aggregate values which are expressed either as a numeric value or on a rank scale.

I tend to shy away from this kind of analysis because it runs a huge risk of trying to turn ordinal utility rankings into cardinal functions.

Bayesian Modeling Of Hardware Failure Rates

Sean Owen shows how you can use Bayesian statistical approaches with Spark Streaming, using the example of hard drive failure rates:

This data doesn’t arrive all at once, in reality. It arrives in a stream, and so it’s natural to run these kind of queries continuously. This is simple with Apache Spark’s Structured Streaming, and proceeds almost identically.

Of course, on the first day this streaming analysis is rolled out, it starts from nothing. Even after two quarters of data here, there’s still significant uncertainty about failure rates, because failures are rare.

An organization that’s transitioning this kind of offline data science to an online streaming context probably does have plenty of historical data. This is just the kind of prior belief about failure rates that can be injected as a prior distribution on failure rates!

Bayesian approaches work really well with streaming data if you think of the streams as sampling events used to update your priors to a new posterior distribution.

Handling Definitional Changes In Predictive Variables

Vincent Granville explains how you can blend two different definitions of a variable of interest together:

The reasons why scores can become meaningless over time is because data evolves. New features (variables) are added that were not available before, the definition of a metric is suddenly changed (for instance, the way income is measured) resulting in new data not compatible with prior data, and faulty scores. Also, when external data is gathered across multiple sources, each source may compute it differently, resulting in incompatibilities: for instance, when comparing individual credit scores from two people that are costumers at two different banks, each bank computes base metrics (income, recency, net worth, and so on) used to build the score, in a different way. Sometimes the issue is caused by missing data, especially when users with missing data are very different from those with full data attached to them.

Click through for a description of the approach and links showing how it works in practice.


March 2019
« Feb