Bayesian Modeling Of Hardware Failure Rates

Sean Owen shows how you can use Bayesian statistical approaches with Spark Streaming, using the example of hard drive failure rates:

This data doesn’t arrive all at once, in reality. It arrives in a stream, and so it’s natural to run these kind of queries continuously. This is simple with Apache Spark’s Structured Streaming, and proceeds almost identically.

Of course, on the first day this streaming analysis is rolled out, it starts from nothing. Even after two quarters of data here, there’s still significant uncertainty about failure rates, because failures are rare.

An organization that’s transitioning this kind of offline data science to an online streaming context probably does have plenty of historical data. This is just the kind of prior belief about failure rates that can be injected as a prior distribution on failure rates!

Bayesian approaches work really well with streaming data if you think of the streams as sampling events used to update your priors to a new posterior distribution.

Related Posts

Lasso and Ridge Regression in Python

Kristian Larsen shows off a few regression techniques using Python: Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model. Variables with non-zero regression coefficients variables are most strongly associated with the response variable. Therefore, when you conduct a regression model it can be helpful to do a […]

Read More

Controlling Partition and File Counts in Spark

Landon Robinson shows how we can control the number of partitions (and therefore the number of output files) on reduce-style jobs in Spark: Whatever the case may be, the desire to control the number of files for a job or query is reasonable – within, ahem, reason – and in general is not too complicated. And, it’s often […]

Read More


February 2019
« Jan Mar »