This data doesn’t arrive all at once, in reality. It arrives in a stream, and so it’s natural to run these kind of queries continuously. This is simple with Apache Spark’s Structured Streaming, and proceeds almost identically.
Of course, on the first day this streaming analysis is rolled out, it starts from nothing. Even after two quarters of data here, there’s still significant uncertainty about failure rates, because failures are rare.
An organization that’s transitioning this kind of offline data science to an online streaming context probably does have plenty of historical data. This is just the kind of prior belief about failure rates that can be injected as a prior distribution on failure rates!
Bayesian approaches work really well with streaming data if you think of the streams as sampling events used to update your priors to a new posterior distribution.
VGG16 is a built-in neural network in Keras that is pre-trained for image recognition.
Technically, it is possible to gather training and test data independently to build the classifier. However, this would necessitate at least 1,000 images, with 10,000 or greater being preferable.
In this regard, it is much easier to use a pre-trained neural network that has already been designed for image classification purposes.
This is probably the best generally available technique for image classification.
Using Cloudera Data Science Workbench with Apache NiFi, we can easily call functions within our deployed models from Apache NiFi as part of flows. I am working against CDSW on HDP (https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_hdp.html), but it will work for all CDSW regardless of install type.
In my simple example, I built a Python model that uses TextBlob to run sentiment analysis against a passed-in sentence. It returns Sentiment Polarity and Subjectivity, which we can immediately act upon in our flow.
CDSW is extremely easy to work with and I was up and running in a few minutes. For my model, I created a python 3 script and a shell script for install details. Both of these artifacts are available here: https://github.com/tspannhw/nifi-cdsw.
The “no code” portion was less interesting to me than the scalable ML portion, as “no code” either drops into tedium or ends up being replaced by code.
There are far more options when using Faker. Looking at the official documentation you’ll see the list of different data types you can generate as well as options such as region specific data.
Go have fun trying this, it’s a small setup for a large amount of time saved.
These types of tools can be great for generating a bunch of data but come with a couple of risks. One is that in the fake addresses Rich shows, ZIP codes don’t match their states at all, so if your application needs valid combos, it can cause issues. The other problem comes from distributions: generated data often gets created off of a uniform distribution, so you might not find skewness-related problems (e.g., parameter sniffing issues) strictly in your test data.
That said, easily generating test data is powerful and I don’t want to let the good be the enemy of the great.
This bootcamp is a free online course for everyone who wants to learn hands-on machine learning and AI techniques, from basic algorithms to deep learning, computer vision and NLP. However, the course language is German only, but for every chapter I did, you will find an English R-version here on my blog (see below for links).
Right now, the course is in beta phase, so we are happy about everyone who tests our content and leaves feedback. Also, not the entire curriculum is finished yet, we will update and extend the course during the next months. If there are specific topics you’d like to have us cover, just let us know!
If you understand German and want to learn about data science, check this out and leave feedback.
For this analysis I’m using the SAS Open Source library called SWAT (Scripting Wrapper for Analytics Transfer) to code in Python and execute SAS CAS Action Sets. SWAT acts as a bridge between the python language to CAS Action Sets. CAS Action Sets are synonymous to libraries in Python or packages in R. The one main difference and benefit is that the algorithms within these action sets have been highly parallelized to run on a CAS (Cloud Analytic Services) server. The CAS server is a distributed in-memory engine where I can do all my heavy lifting or computations. The code and Jupyter Notebook are available on GitHub.
Click through for the analysis.
After a small bit of research I discovered the concept of monkey patching (modifying a program to extend its local execution) the DataFrame object to include a transform function. This function is missing from PySpark but does exist as part of the Scala language already.
The following code can be used to achieve this, and can be stored in a generic wrapper functions notebook to separate it out from your main code. This can then be called to import the functions whenever you need them.
Things which make Python more of a functional language are fine by me. Even though I’d rather use Scala.
The most important part of hypothesis testing is being clear what question we are trying to answer. In our case we are asking:
“Could the most extreme value happen by chance?”
The most extreme value we define as the greatest absolute AMVR deviation from the mean. This question forms our null hypothesis.
Give this one a careful read and try out the code. This is an important topic for anyone who analyzes data to understand.
Convolutional Neural Nets are usually abbreviated either CNNs or ConvNets. They are a specific type of neural network that has very particular differences compared to MLPs. Basically, you can think of CNNs as working similarly to the receptive fields of photoreceptors in the human eye. Receptive fields in our eyes are small connected areas on the retina where groups of many photo-receptors stimulate much fewer ganglion cells. Thus, each ganglion cell can be stimulated by a large number of receptors, so that a complex input is condensed into a compressed output before it is further processed in the brain.
If you’re interested in understanding why a CNN will classify the way it does, chapter 5 of Deep Learning with R is a great reference.
H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, however, the latter follows the conventions of Apache Spark’s MLlib library and allows you to build machine learning pipelines that include MLlib transformers. We will focus on the latter API in this post.
H2OAutoML supports classification and regression. The ML models built and tuned by H2OAutoML include Random Forests, Gradient Boosting Machines, Deep Neural Nets, Generalized Linear Models, and Stacked Ensembles.
The post only has a few lines of code but there are a lot of working parts under the surface.