When choosing a train and testset, you’ll implicitly introduce a new bias: it could be that the model you just trained predicts well for this particular testset, when trained for this particular trainset. To reduce this bias, you could “cross-validate” your results.
Cross-validation (often abbreviated as just “cv”) splits the dataset into n folds. Each fold is used once as a testset, using all other folds together as a training set. So in our pizza example with 100 records, with 5 folds we will have 5 test runs:
This isn’t Azure ML-specific, and is good reading.
Before I started, I was already quite comfortable programming Python and did some R programming in the past. This turned out pretty handy, though not really needed to start off with – because starting with Azure ML, the data flow can be created much like BI specialists are used to in SSIS.
A good place to start for me was the Tutorial competition (Iris Petal Competition). It provides you with a pre-filled workspace with everything in place to train and test your first ML model:
I’d like to see Azure ML get more traction; I’m not optimistic that it will.
The Linked Service for ML is going to need some information from the Web Service, the URL and the API key. Chances are neither of these have been committed to memory, instead open up Azure ML, go to Web Service and copy them. For the URL, look under the API Help Pagegrid, there are two options, Request/Response and Batch Execution. Clicking on Batch Execution loads a new page Batch Execution API Document. The URL can be found under Request URI. When copying the URL, you do not need to include any text after the word “jobs”. The rest of the URL, “?api-version=2.0”. Copying the entire URL will cause an error. Going back to the web Services page, The API Key appears on the dashboard section of Azure ML and there is a convenient button for copying it. Using these two pieces of information, it is now possible to create the Data Factory Linked Service to make the connection to the web service, which here I called AzureMLLinkedService
Read the whole thing.
The Amazon ML endpoint created earlier can be invoked using an API call. This is very handy for building an application for end users who can interact with the ML model in real time.
Create a similar application and host it as a static website on Amazon S3. This feature of S3 allows you to host websites without any web servers and takes away the complexities of scaling hardware based on traffic routed to your application. The following is a screenshot from the application:
I think that Azure ML is still ahead of Amazon’s ML solution, but I’m happy to see the competition.
ML studio now gives you even more flexibility, with new language engines supported in the language modules. Within the Execute Python Script module, you can now choose to use Python 2.7.11 or Python 3.5, both of which run within the Acaconda 4.0 distribution. And within the Execute R Script module, you can now choose Microsoft R Open 3.2.2 as your R engine, in addition to the existing CRAN R 3.1.0 engine. Microsoft R Open 3.2.2 not only gives you a newer R language engine, it also gives you access to a wealth of new R packages for use within ML Studio. Over 400 packages are pre-installed for use with the R Script module, and you can install and use any other R package (including CRAN packages and your own R packages) via the Script Bundle input port.
I’m interested in the Microsoft R Open language support, as Azure ML’s still using a relatively older version of R (3.1.0).
Even if your target variable is a numeric one, sometimes it’s better to use classification methods instead of approximation ones. For instance if you have mostly zero target values and just a few non-zero values. Change the latter to 1, in this case you’ll have two categories: 1 (positive value of your target variable ) and 0. You can also split numerical variable into multiple subgroups : apartment prices for low, medium and high by equal subset width and predict them using classification algorithms. This process is called discretization.
Both types of models are common in machine learning, so a good understanding of when to use which is important.
In this post, we focus on sourcing R and Python’s external dependencies, such as R libraries and Python modules, which are not already installed on Azure ML and require code compilation. Commonly the compiled code comes from a variety of other languages such as C, C++ and Fortran. One could also use this approach to wrap their compiled code with R or Python wrappers and run it on Azure ML.
To illustrate the process, we will build two MurmurHash modules from C++ for R and Python using the following two implementations on GitHub, and link them to Azure ML from a zipped folder
Link via David Smith. I knew it was possible to call compiled C code from Python and R, but didn’t expect to be able to do it within Azure ML, so that’s good to know.
If missing values are something which haunts you then
MICEpackage is the real friend of yours.
When we face an issue of missing values we generally go ahead with basic imputations such as replacing with 0, replacing with mean, replacing with mode etc. but each of these methods are not versatile and could result into a possible data discrepancy.
MICEpackage helps you to impute missing values by using multiple techniques, depending on the kind of data you are working with.
I’d heard of a couple of these, but most of them are new to me.
In each case, a number of different models are trained in R (decision forests, boosted decision trees, multinomial models, neural networks and poisson regression) and compared for performance; the best model is automatically selected for predictions.
On a related note, Microsoft recently teamed up with aircraft engine manufacturer Rolls-Royceto help airlines get the most out of their engines. Rolls-Royce is turning to Microsoft’s Azure cloud-based services — Stream Analytics, Machine Learning and Power BI — to make recommendations to airline executives on the most efficient way to use their engines in flight and on the ground. This short video gives an overview.
Check out the data set and play around a bit.
When explaining what machine learning is, I’m giving the example of predicting the country someone lives in from their first name. So John might be American and Johannes might be German.
In this case, it’s really easy to imagine what data you might want to do a good job at this — just get the first names and current countries of every person in the world! Then count up which countries Julias live in (Canada? The US? Germany?), pick the most likely one, and you’re done!
This is a super simple modelling process, but I think it’s a good illustration — if you don’t include any data from China when training your computer to recognize names, it’s not going to get any Chinese names right!
Machine learning projects are like any other development projects, with more complex algorithms. There’s no magic and there’s a lot of perspiration (hopefully figuratively rather than literally) involved in getting a program which behaves correctly.