Monitoring Car Data With Spark And Kafka

Carol McDonald builds a model to determine where Uber cars are clustered:

Uber trip data is published to a MapR Streams topic using the Kafka API. A Spark streaming application, subscribed to the topic, enriches the data with the cluster Id corresponding to the location using a k-means model, and publishes the results in JSON format to another topic. A Spark streaming application subscribed to the second topic analyzes the JSON messages in real time.

This is a fairly detailed post, well worth the read.
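
To make the enrichment step concrete, here is a minimal sketch in plain Python rather than Spark. It assumes a k-means model has already been trained. The centroid coordinates, the field names (`id`, `lat`, `lon`, `cluster_id`), and the sample trip are all hypothetical and are not taken from Carol's post.

```python
import json
import math

# Hypothetical cluster centers (latitude, longitude) standing in for a trained k-means model.
CENTROIDS = [
    (40.75, -73.99),
    (40.68, -73.95),
    (40.77, -73.87),
]


def nearest_cluster(lat, lon, centroids=CENTROIDS):
    """Return the index of the centroid closest to (lat, lon)."""
    return min(
        range(len(centroids)),
        key=lambda i: math.hypot(lat - centroids[i][0], lon - centroids[i][1]),
    )


def enrich(trip):
    """Return a JSON string holding the trip fields plus its cluster id."""
    enriched = dict(trip)
    enriched["cluster_id"] = nearest_cluster(trip["lat"], trip["lon"])
    return json.dumps(enriched)


# A made-up trip record; in the real pipeline this would arrive from the first topic.
message = enrich({"id": 1, "lat": 40.76, "lon": -73.98})
```

In the streaming version, the `enrich` step would run on each incoming message and the resulting JSON would be published to the second topic.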

Understanding Naive Bayes

Ahmet Taspinar explains the Naive Bayes classification algorithm and writes Python code to implement it:

Within Machine Learning many tasks are – or can be reformulated as – classification tasks.

In classification tasks we are trying to produce a model which can give the correlation between the input data $X$ and the class $C$ each input belongs to. This model is formed with the feature-values of the input-data. For example, the dataset contains datapoints belonging to the classes Apples, Pears and Oranges and based on the features of the datapoints (weight, color, size, etc.) we are trying to predict the class.

Ahmet has his entire post saved as a Jupyter notebook.
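
As a rough illustration of the idea, and not Ahmet's implementation, here is a small categorical Naive Bayes sketch with add-one (Laplace) smoothing. The fruit rows and feature names are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(rows):
    """rows is a list of (features_dict, label) pairs; returns the counts prediction needs."""
    class_counts = Counter(label for _, label in rows)
    # value_counts[label][feature][value] holds how often that value occurs within the class.
    value_counts = defaultdict(lambda: defaultdict(Counter))
    values_seen = defaultdict(set)
    for features, label in rows:
        for name, value in features.items():
            value_counts[label][name][value] += 1
            values_seen[name].add(value)
    return class_counts, value_counts, values_seen


def predict(model, features):
    """Return the class with the highest log posterior under the independence assumption."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for name, value in features.items():
            # Add-one smoothing keeps unseen values from zeroing out the product.
            numerator = value_counts[label][name][value] + 1
            denominator = count + len(values_seen[name])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data, only to exercise the functions.
rows = [
    ({"color": "red", "size": "medium"}, "apple"),
    ({"color": "green", "size": "medium"}, "apple"),
    ({"color": "green", "size": "medium"}, "pear"),
    ({"color": "yellow", "size": "medium"}, "pear"),
    ({"color": "orange", "size": "medium"}, "orange"),
    ({"color": "orange", "size": "large"}, "orange"),
]
model = train(rows)
guess = predict(model, {"color": "orange", "size": "large"})
```

The "naive" part is the loop over features: each feature contributes its own log probability, as if the features were independent given the class.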

Calling Cognitive Services With R

David Smith has written a go-to guide for connecting to Azure Cognitive Services using R:

There’s no official R package (yet!) for calling Cognitive Services APIs. But since every Cognitive Service API is just a standard REST API, we can use the httr package to call the API. Input and output is standard JSON, which we can create and extract using the jsonlite package.

(There’s also an independent R interface to the text APIs. And there are already Python SDKs for many of the services, including the Face API.)

This is also useful for other REST APIs for times when there isn’t already a pre-built package to do most of the translation work for you.
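
The same pattern carries over to other languages. Here is a hedged sketch in Python (not R, so `urllib` and `json` stand in for httr and jsonlite) that builds a JSON POST request. The endpoint URL, the key, and the `{"url": ...}` payload shape are placeholders and assumptions; check them against the documentation for the specific service you are calling.

```python
import json
import urllib.request

# Hypothetical placeholders: substitute the endpoint and key from your own subscription.
ENDPOINT = "https://example.invalid/face/v1.0/detect"
API_KEY = "<your-subscription-key>"


def build_request(image_url, endpoint=ENDPOINT, key=API_KEY):
    """Build (but do not send) a JSON POST request for an image URL."""
    body = json.dumps({"url": image_url}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": key,
        },
        method="POST",
    )


def call(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("https://example.invalid/photo.jpg")
# call(request) would perform the network round trip once real values are filled in.
```

Keeping the request construction separate from the network call makes it easy to inspect the headers and body before spending any API quota.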

Machine Learning With R Q&A

Ginger Grant answers a series of questions about R and machine learning:

Question: Is it possible to run R processes in different boxes other than SQL Server itself for scalability reasons?

You have the option of installing the R Server on another server. Just keep in mind that you do have to account for the additional overhead of moving all the data over the network, which needs to weigh in on your decision to move processing to a different server.

Click through for plenty more questions and answers.

Using Spark MLlib For Categorization

Taras Matyashovskyy uses Apache Spark MLlib to categorize songs in different genres:

The roadmap for implementation was pretty straightforward:

  • Collect the raw data set of the lyrics (~65k sentences in total):

    • Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
    • Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
  • Create training set, i.e. label (0 for metal | 1 for pop) + features (represented as double vectors)

  • Train logistic regression, which is the obvious choice for classification

This is a supervised learning problem, and is pretty fun to walk through.
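
For a feel of the roadmap without a Spark cluster, here is a toy sketch in plain Python. It swaps Spark MLlib for a hand-rolled logistic regression trained with stochastic gradient descent, and the four "lyric" lines are invented stand-ins, not the real data set. Labels follow the quoted convention: 0 for metal, 1 for pop.

```python
import math


def vectorize(text, vocab):
    """Turn a line of text into word counts, one double per vocabulary word."""
    words = text.lower().split()
    return [float(words.count(word)) for word in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=200, lr=0.5):
    """Fit weights and bias with stochastic gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the linear score
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(model, x):
    """Return 1 (pop) if the predicted probability is at least 0.5, else 0 (metal)."""
    weights, bias = model
    return 1 if sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5 else 0


# Invented lines standing in for the lyrics data set.
lines = [
    ("darkness and fire in the night", 0),
    ("iron and thunder and fire", 0),
    ("baby dance with me tonight", 1),
    ("love me baby one more dance", 1),
]
vocab = sorted({word for text, _ in lines for word in text.split()})
samples = [vectorize(text, vocab) for text, _ in lines]
labels = [label for _, label in lines]
model = train(samples, labels)

metal_guess = predict(model, vectorize("fire and thunder", vocab))
pop_guess = predict(model, vectorize("dance baby dance", vocab))
```

With roughly 65k sentences you would want a real library and a proper train/test split, but the shape of the problem (label plus feature vector, then a linear classifier) is the same.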

Machine Learning Algorithms In R

Ginger Grant has a list of machine learning algorithms and their implementations in R:

Oftentimes determining which algorithm to use can take a while. Here is a pretty good flowchart for determining which algorithm should be used given some examples of what the desired outcomes and data contain. The diagram lists the algorithms, which are implemented in Azure ML. The same algorithms can be implemented in R. In R there are libraries to help with nearly every task. Here’s a list of libraries and their accompanying links which can be used in Machine Learning. This list is by no means comprehensive as there are libraries and functions other than the ones listed here, but if you are trying to write a Machine Learning Experiment in R, and are looking at the flowchart, these R functions and Libraries will provide the tools to do the types of Machine Learning Analysis listed.

I think algorithm determination is one of the most difficult parts of machine learning.  Even if you don’t mean to go there, the garden of forking paths is dangerous.

SKLearn To Azure ML

David Crook shows how to build a model using Python’s SciKit library and then operationalize it in Azure ML:

Why Model Outside Azure ML?

Sometimes you run into things like various limitations, speed, data size or perhaps you just iterate better on your own workstation.  I find myself significantly faster on my workstation or in a jupyter notebook that lives on a big ol’ server doing my experiments.  Modelling outside Azure ML allows me to use the full capabilities of whatever infrastructure and framework I want for training.

So Why Operationalize with Azure ML?

AzureML has several benefits such as auto-scale, token generation, high speed python execution modules, api versioning, sharing, tight PaaS integration with things like Stream Analytics among many other things.  This really does make life easier for me.  Sure I can deploy a flask app via docker somewhere, but then, I need to worry about things like load balancing, and then security and I really just don’t want to do that.  I want to build a model, deploy it, and move to the next one.  My value is A.I. not web management, so the more time I spend delivering my value, the more impactful I can be.

Read the whole thing.

Cortana Intelligence Solutions

James Serra gives an introductory walkthrough to Cortana Intelligence Solutions:

Cortana Intelligence Solutions is a new tool just released in public preview that enables users to rapidly discover, easily provision, quickly experiment with, and jumpstart production grade analytical solutions using the Cortana Intelligence Suite (CIS).  It does so using preconfigured solutions, reference architectures and design patterns (I’ll just call all these solutions “patterns” for short).  At the heart of each Cortana Intelligence Solution pattern is one or more ARM Templates which describe the Azure resources to be provisioned in the user’s Azure subscription.  Cortana Intelligence Solution patterns can be complex with multiple ARM templates, interspersed with custom tasks (Web Jobs) and/or manual steps (such as Power BI authorization in Stream Analytics job outputs).

So instead of having to manually go to the Azure web portal and provision many sources, these patterns will do it for you automatically.  Think of a pattern as a way to accelerate the process of building an end-to-end demo on top of CIS.  A deployed solution will provision your subscription with necessary CIS components (i.e. Event Hub, Stream Analytics, HDInsight, Data Factory, Machine Learning, etc.) and build the relationships between them.

James also walks through an entire solution, so check it out.

Using Xgboost In Azure ML Studio

Koos van Strien wants to use the xgboost model in Azure ML Studio:

Because the high-level path of bringing trained R models from the local R environment towards the cloud Azure ML is almost identical to the Python one I showed two weeks ago, I use the same four steps to guide you through the process:

  1. Export the trained model

  2. Zip the exported files

  3. Upload to the Azure ML environment

  4. Embed in your Azure ML solution

Read the whole thing.
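
Steps 1 and 2 are easy to sketch locally. The version below is the Python analogue (Koos's post works with R), using `pickle` and `zipfile` from the standard library. The file names and the toy "model" dictionary are hypothetical. Steps 3 and 4 happen inside Azure ML and are not shown.

```python
import pickle
import tempfile
import zipfile
from pathlib import Path


def export_and_zip(model, workdir, model_name="model.pkl", zip_name="model.zip"):
    """Step 1: serialize the model to disk. Step 2: wrap the file in a zip archive."""
    workdir = Path(workdir)
    model_path = workdir / model_name
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)
    zip_path = workdir / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(model_path, arcname=model_name)
    return zip_path


def load_from_zip(zip_path, model_name="model.pkl"):
    """Read the serialized model back out of the archive, as the consuming side would."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(model_name) as handle:
            return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    toy_model = {"weights": [0.1, -0.4], "bias": 0.2}  # stand-in for a trained model
    archive_path = export_and_zip(toy_model, tmp)
    restored = load_from_zip(archive_path)
```

Round-tripping the archive locally before uploading is a cheap way to catch serialization problems early.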


Koos van Strien moves from Python to R to run an xgboost algorithm:

Note that the parameters of xgboost used here fall in three categories:

  • General parameters

    • nthread (number of threads used, here 8 = the number of cores in my laptop)
  • Booster parameters

    • max.depth (of tree)
    • eta
  • Learning task parameters

    • objective: type of learning task (softmax for multiclass classification)
    • num_class: needed for the “softmax” algorithm: how many classes to predict?
  • Command Line Parameters

    • nround: number of rounds for boosting

Read the whole thing.
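
For readers coming from the Python side, here is how the same grouping might look as a parameter dictionary for the xgboost Python package. Only `nthread = 8` comes from the quoted text. The other values are illustrative assumptions, and the round count is passed to the training call as `num_boost_round` rather than placed in the dictionary.

```python
# Parameters grouped the same way as the quoted list.
PARAMS_BY_CATEGORY = {
    "general": {"nthread": 8},                  # value taken from the quoted post
    "booster": {"max_depth": 6, "eta": 0.3},    # illustrative values
    "learning_task": {
        "objective": "multi:softmax",           # multiclass classification
        "num_class": 3,                         # illustrative class count
    },
}
NUM_BOOST_ROUND = 10  # illustrative; the R interface calls this the number of rounds


def flatten(groups):
    """Merge the per-category dictionaries into the single flat dict xgboost expects."""
    flat = {}
    for group in groups.values():
        flat.update(group)
    return flat


params = flatten(PARAMS_BY_CATEGORY)
# With xgboost installed and a DMatrix named dtrain (not defined here), training would be:
#   booster = xgboost.train(params, dtrain, num_boost_round=NUM_BOOST_ROUND)
```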

