Uber trip data is published to a MapR Streams topic using the Kafka API. A Spark streaming application, subscribed to the topic, enriches the data with the cluster Id corresponding to the location using a k-means model, and publishes the results in JSON format to another topic. A Spark streaming application subscribed to the second topic analyzes the JSON messages in real time.
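The enrichment step can be sketched in a few lines of plain Python. This is a rough illustration only, not the post's Spark code: the cluster centers, the field names, and the `cid` key are assumptions made up for the example, and a real pipeline would read from and write to the stream topics instead of printing.

```python
import json
import math

# Hypothetical cluster centers (cluster id -> (lat, lon)). A real pipeline
# would take these from the trained k-means model.
CENTERS = {
    0: (40.75, -73.99),
    1: (40.65, -73.78),
    2: (40.70, -74.01),
}


def nearest_cluster(lat, lon, centers=CENTERS):
    """Return the id of the center closest to (lat, lon) by Euclidean distance."""
    return min(centers, key=lambda cid: math.dist((lat, lon), centers[cid]))


def enrich(trip):
    """Copy a trip record, add its cluster id, and serialize it as JSON."""
    enriched = dict(trip)
    enriched["cid"] = nearest_cluster(trip["lat"], trip["lon"])
    return json.dumps(enriched, sort_keys=True)


# Hypothetical trip record; the field names are assumptions.
message = enrich({"dt": "2014-08-01 00:00:00", "lat": 40.729, "lon": -73.9422, "base": "B02598"})
print(message)
```

Assigning a point to its nearest center is what a fitted k-means model does at prediction time, which is why the enrichment is cheap enough to run per message.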
This is a fairly detailed post, well worth the read.
Within Machine Learning many tasks are – or can be reformulated as – classification tasks.
In classification tasks we try to produce a model that captures the relationship between the input data and the class each input belongs to. The model is built from the feature values of the input data. For example, given a dataset whose datapoints belong to the classes Apples, Pears and Oranges, we try to predict the class from each datapoint's features (weight, color, size, etc.).
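To make the fruit example concrete, here is a minimal sketch of a one-nearest-neighbor classifier. The training values are invented for illustration, and nearest-neighbor is just one simple technique chosen for brevity; it is not necessarily what the linked post uses.

```python
import math

# Made-up training data: (weight in grams, diameter in cm) -> class label.
TRAINING = [
    ((150.0, 7.0), "Apple"),
    ((170.0, 7.5), "Apple"),
    ((180.0, 6.0), "Pear"),
    ((200.0, 6.5), "Pear"),
    ((130.0, 8.0), "Orange"),
    ((140.0, 8.5), "Orange"),
]


def predict(features, training=TRAINING):
    """Return the label of the training point closest to `features`."""
    _, label = min(training, key=lambda row: math.dist(row[0], features))
    return label


print(predict((160.0, 7.2)))
print(predict((190.0, 6.2)))
```

Because weight is on a much larger scale than diameter, it dominates the distance here; in practice the features would usually be scaled first.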
Ahmet has his entire post saved as a Jupyter notebook.
There’s no official R package (yet!) for calling Cognitive Services APIs. But since every Cognitive Service API is just a standard REST API, we can use the httr package to call the API. Input and output is standard JSON, which we can create and extract using the jsonlite package.
This is also useful for other REST APIs for times when there isn’t already a pre-built package to do most of the translation work for you.
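The same pattern carries over to other languages. As a hedged sketch in Python using only the standard library: the endpoint URL, the key, and the shape of the `documents` payload below are placeholders assumed for illustration, so check the documentation of whichever API you are calling. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical endpoint and key; substitute the values for your own service.
ENDPOINT = "https://example.invalid/text/analytics/v2.0/sentiment"
API_KEY = "your-subscription-key"


def build_request(documents, endpoint=ENDPOINT, api_key=API_KEY):
    """Build (but do not send) a JSON POST request for a REST API."""
    body = json.dumps({"documents": documents}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Header that Cognitive Services uses to carry the subscription key.
        "Ocp-Apim-Subscription-Key": api_key,
    }
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")


def parse_response(raw_bytes):
    """Decode a JSON response body into Python objects."""
    return json.loads(raw_bytes.decode("utf-8"))


request = build_request([{"id": "1", "language": "en", "text": "I love this post."}])
# To actually send it:
#   with urllib.request.urlopen(request) as resp:
#       result = parse_response(resp.read())
```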
Question: Is it possible to run R processes on different boxes other than SQL Server itself for scalability reasons?
You have the option of installing the R Server on another server. Just keep in mind that you do have to account for the additional overhead of moving all the data over the network, which needs to weigh in on your decision to move processing to a different server.
Click through for plenty more questions and answers.
The roadmap for implementation was pretty straightforward:
Collect the raw data set of the lyrics (~65k sentences in total):
- Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.
- Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.
Create the training set, i.e. label (0 for metal | 1 for pop) + features (represented as double vectors)
Train a logistic regression model, the obvious choice for this classification task
This is a supervised learning problem, and is pretty fun to walk through.
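The roadmap above can be sketched end to end in plain Python. This is a toy stand-in under stated assumptions, not the author's implementation: the six sentences are invented, the features are simple word counts, and the model is fit with hand-written stochastic gradient descent rather than a library.

```python
import math
from collections import Counter

# Tiny made-up sentences standing in for the lyrics data set (0 = metal, 1 = pop).
SENTENCES = [
    ("darkness falls upon the iron throne", 0),
    ("fear of the dark and the flames", 0),
    ("thunder and steel in the night of doom", 0),
    ("baby baby love me one more time", 1),
    ("dancing queen young and sweet", 1),
    ("i want it that way my love", 1),
]

VOCAB = sorted({word for text, _ in SENTENCES for word in text.split()})


def featurize(text):
    """Turn a sentence into a vector of word counts (doubles) over VOCAB."""
    counts = Counter(text.split())
    return [float(counts[word]) for word in VOCAB]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, epochs=200, lr=0.1):
    """Fit logistic regression weights with plain stochastic gradient descent."""
    weights = [0.0] * len(VOCAB)
    bias = 0.0
    for _ in range(epochs):
        for features, label in rows:
            pred = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = pred - label
            bias -= lr * error
            weights = [w - lr * error * x for w, x in zip(weights, features)]
    return weights, bias


def predict(text, weights, bias):
    """Return 0 for metal or 1 for pop."""
    z = bias + sum(w * x for w, x in zip(weights, featurize(text)))
    return 1 if sigmoid(z) >= 0.5 else 0


training_set = [(featurize(text), label) for text, label in SENTENCES]
weights, bias = train(training_set)
print(predict("love me baby", weights, bias))
print(predict("iron and doom in the dark", weights, bias))
```

Words that never appear in the training sentences contribute nothing to the score, so on real lyrics the vocabulary size matters a great deal.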
Determining which algorithm to use can often take a while. Here is a pretty good flowchart for choosing an algorithm, given some examples of the desired outcomes and the data involved. The diagram lists the algorithms which are implemented in Azure ML. The same algorithms can be implemented in R, where there are libraries to help with nearly every task. Here’s a list of libraries and their accompanying links which can be used in Machine Learning. This list is by no means comprehensive, as there are libraries and functions other than the ones listed here, but if you are trying to write a Machine Learning experiment in R and are looking at the flowchart, these R functions and libraries will provide the tools to do the types of Machine Learning analysis listed.
I think algorithm determination is one of the most difficult parts of machine learning. Even if you don’t mean to go there, the garden of forking paths is dangerous.
Why Model Outside Azure ML?
Sometimes you run into limitations such as speed or data size, or perhaps you just iterate better on your own workstation. I find myself significantly faster doing my experiments on my workstation or in a Jupyter notebook that lives on a big ol’ server. Modelling outside Azure ML allows me to use the full capabilities of whatever infrastructure and framework I want for training.
So Why Operationalize with Azure ML?
Azure ML has several benefits, such as auto-scale, token generation, high-speed Python execution modules, API versioning, sharing, and tight PaaS integration with things like Stream Analytics, among many other things. This really does make life easier for me. Sure, I can deploy a Flask app via Docker somewhere, but then I need to worry about things like load balancing and security, and I really just don’t want to do that. I want to build a model, deploy it, and move to the next one. My value is A.I., not web management, so the more time I spend delivering my value, the more impactful I can be.
Read the whole thing.
Cortana Intelligence Solutions is a new tool just released in public preview that enables users to rapidly discover, easily provision, quickly experiment with, and jumpstart production grade analytical solutions using the Cortana Intelligence Suite (CIS). It does so using preconfigured solutions, reference architectures and design patterns (I’ll just call all these solutions “patterns” for short). At the heart of each Cortana Intelligence Solution pattern is one or more ARM Templates which describe the Azure resources to be provisioned in the user’s Azure subscription. Cortana Intelligence Solution patterns can be complex with multiple ARM templates, interspersed with custom tasks (Web Jobs) and/or manual steps (such as Power BI authorization in Stream Analytics job outputs).
So instead of having to manually go to the Azure web portal and provision many resources, these patterns will do it for you automatically. Think of a pattern as a way to accelerate the process of building an end-to-end demo on top of CIS. A deployed solution will provision your subscription with the necessary CIS components (e.g. Event Hub, Stream Analytics, HDInsight, Data Factory, Machine Learning) and build the relationships between them.
James also walks through an entire solution, so check it out.
Because the high-level path of bringing trained R models from the local R environment towards the cloud Azure ML is almost identical to the Python one I showed two weeks ago, I use the same four steps to guide you through the process:
Export the trained model
Zip the exported files
Upload to the Azure ML environment
Embed in your Azure ML solution
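The first two steps can be sketched on the Python side with the standard library alone. This is only an illustration under assumptions: a plain dict stands in for a trained model, and the file names are made up. The upload and embed steps happen inside Azure ML and are not shown.

```python
import os
import pickle
import tempfile
import zipfile


def export_and_zip(model, workdir, model_name="model.pkl", zip_name="model.zip"):
    """Serialize `model` to disk, then wrap the file in a zip archive."""
    model_path = os.path.join(workdir, model_name)
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)

    zip_path = os.path.join(workdir, zip_name)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Store the file at the archive root so it is easy to find after upload.
        archive.write(model_path, arcname=model_name)
    return zip_path


def load_from_zip(zip_path, model_name="model.pkl"):
    """Read the serialized model back out of the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(model_name) as handle:
            return pickle.load(handle)


# A plain dict stands in for a trained model here.
toy_model = {"weights": [0.25, -1.5], "bias": 0.1}
with tempfile.TemporaryDirectory() as workdir:
    restored = load_from_zip(export_and_zip(toy_model, workdir))
print(restored == toy_model)
```

Round-tripping the model locally before uploading is a cheap way to catch serialization problems early.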
Read the whole thing.
Note that the parameters of xgboost used here fall in three categories:
General parameters
- nthread (number of threads used, here 8 = the number of cores in my laptop)
Booster parameters
- max.depth (of tree)
Learning task parameters
- objective: type of learning task (softmax for multiclass classification)
- num_class: needed for the “softmax” algorithm: how many classes to predict?
Command Line Parameters
- nround: number of rounds for boosting
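For comparison, the parameters listed above could be collected like this for the xgboost Python package. The grouping follows the category names in the xgboost documentation (general, booster, learning task). The tree depth, class count, and number of rounds are assumed values, since the excerpt does not give them, and `build_params` is a hypothetical helper rather than part of xgboost.

```python
# Python spellings differ slightly from the R ones in the excerpt:
# max.depth -> max_depth, nround -> num_boost_round.
GENERAL = {"nthread": 8}                 # threads, as in the excerpt
BOOSTER = {"max_depth": 6}               # assumed depth; not given in the excerpt
LEARNING_TASK = {
    "objective": "multi:softmax",        # softmax for multiclass classification
    "num_class": 3,                      # assumed number of classes
}
NUM_BOOST_ROUND = 50                     # assumed number of boosting rounds


def build_params(*groups):
    """Merge parameter groups into one dict, rejecting duplicate keys."""
    merged = {}
    for group in groups:
        overlap = merged.keys() & group.keys()
        if overlap:
            raise ValueError(f"duplicate parameters: {sorted(overlap)}")
        merged.update(group)
    return merged


params = build_params(GENERAL, BOOSTER, LEARNING_TASK)
print(params)

# With xgboost installed and a DMatrix `dtrain` prepared, training would look like:
#   import xgboost as xgb
#   booster = xgb.train(params, dtrain, num_boost_round=NUM_BOOST_ROUND)
```

The number of rounds is passed to the training call rather than placed in the parameter dict, which mirrors the way the excerpt lists it separately from the other parameters.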
Read the whole thing.