ML Studio now gives you even more flexibility, with new language engines supported in the language modules. Within the Execute Python Script module, you can now choose to use Python 2.7.11 or Python 3.5, both of which run within the Anaconda 4.0 distribution. And within the Execute R Script module, you can now choose Microsoft R Open 3.2.2 as your R engine, in addition to the existing CRAN R 3.1.0 engine. Microsoft R Open 3.2.2 not only gives you a newer R language engine, it also gives you access to a wealth of new R packages for use within ML Studio. Over 400 packages are pre-installed for use with the R Script module, and you can install and use any other R package (including CRAN packages and your own R packages) via the Script Bundle input port.
I’m interested in the Microsoft R Open language support, as Azure ML is still using a relatively old version of R (3.1.0).
Even if your target variable is numeric, sometimes it’s better to use classification methods instead of approximation (regression) ones. For instance, if you have mostly zero target values and just a few non-zero values, change the latter to 1; you’ll then have two categories: 1 (a positive value of your target variable) and 0. You can also split a numerical variable into multiple subgroups, such as low, medium and high apartment prices, using bins of equal width, and predict them using classification algorithms. This process is called discretization.
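The two tricks described here can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the linked article: the function names `binarize` and `equal_width_bins` and the sample numbers are made up for the example.

```python
def binarize(values):
    """Map a mostly-zero numeric target to two classes: 0 stays 0, anything else becomes 1."""
    return [0 if v == 0 else 1 for v in values]


def equal_width_bins(values, labels):
    """Assign each value to one of len(labels) intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    binned = []
    for v in values:
        if width == 0:
            # All values are identical, so everything falls in the first bin.
            index = 0
        else:
            # The maximum value would land one past the last bin, so clamp it.
            index = min(int((v - lo) / width), len(labels) - 1)
        binned.append(labels[index])
    return binned


# Hypothetical target with mostly zeros and a few positive values.
claims = [0, 0, 0, 250.0, 0, 0, 1200.5, 0]
claim_classes = binarize(claims)

# Hypothetical apartment prices split into three equal-width groups.
prices = [100, 150, 220, 310, 400]
price_classes = equal_width_bins(prices, ["low", "medium", "high"])

print(claim_classes)
print(price_classes)
```

Equal-width bins are easy to explain, but skewed data can leave some bins nearly empty; equal-frequency (quantile) bins are a common alternative in that case.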
Both types of models are common in machine learning, so a good understanding of when to use which is important.
In this post, we focus on sourcing R and Python’s external dependencies, such as R libraries and Python modules, which are not already installed on Azure ML and require code compilation. Commonly the compiled code comes from a variety of other languages such as C, C++ and Fortran. One could also use this approach to wrap their compiled code with R or Python wrappers and run it on Azure ML.
To illustrate the process, we will build two MurmurHash modules from C++ for R and Python using the following two implementations on GitHub, and link them to Azure ML from a zipped folder
Link via David Smith. I knew it was possible to call compiled C code from Python and R, but didn’t expect to be able to do it within Azure ML, so that’s good to know.
If missing values are something which haunts you, then the MICE package is a real friend of yours.
When we face an issue of missing values, we generally go ahead with basic imputations such as replacing with 0, replacing with the mean, replacing with the mode, etc., but none of these methods is versatile, and each could result in data discrepancies.
The MICE package helps you to impute missing values by using multiple techniques, depending on the kind of data you are working with.
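To see why the basic imputations can distort data, here is a small Python sketch of the three simple fills mentioned above. It does not reproduce what the MICE package does (MICE is an R package that imputes through chained equations); the `impute` helper and the sample `ages` column are assumptions made for illustration.

```python
from statistics import mean, mode, pvariance


def impute(values, fill):
    """Replace every missing entry (None) with a single fill value."""
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [23, 25, None, 31, None, 25, 40]
observed = [v for v in ages if v is not None]

zero_filled = impute(ages, 0)
mean_filled = impute(ages, mean(observed))
mode_filled = impute(ages, mode(observed))

# Zero filling drags the average down; mean filling keeps the average
# but shrinks the spread, because the filled points add no deviation.
spread_observed = pvariance(observed)
spread_mean_filled = pvariance(mean_filled)

print(mean(observed), mean(zero_filled))
print(spread_observed, spread_mean_filled)
```

Every single-value fill treats the missing entries as if they were all the same, which is the kind of discrepancy the excerpt warns about.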
I’d heard of a couple of these, but most of them are new to me.
In each case, a number of different models are trained in R (decision forests, boosted decision trees, multinomial models, neural networks and poisson regression) and compared for performance; the best model is automatically selected for predictions.
On a related note, Microsoft recently teamed up with aircraft engine manufacturer Rolls-Royce to help airlines get the most out of their engines. Rolls-Royce is turning to Microsoft’s Azure cloud-based services (Stream Analytics, Machine Learning and Power BI) to make recommendations to airline executives on the most efficient way to use their engines in flight and on the ground. This short video gives an overview.
Check out the data set and play around a bit.
When explaining what machine learning is, I’m giving the example of predicting the country someone lives in from their first name. So John might be American and Johannes might be German.
In this case, it’s really easy to imagine what data you might want to do a good job at this — just get the first names and current countries of every person in the world! Then count up which countries Julias live in (Canada? The US? Germany?), pick the most likely one, and you’re done!
This is a super simple modelling process, but I think it’s a good illustration — if you don’t include any data from China when training your computer to recognize names, it’s not going to get any Chinese names right!
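The counting model described above fits in a few lines of Python. This is a rough sketch with a handful of made-up records rather than real census data; `train` and `predict` are illustrative names, not taken from the original post.

```python
from collections import Counter, defaultdict


def train(records):
    """Count, for each first name, how often it appears in each country."""
    counts = defaultdict(Counter)
    for name, country in records:
        counts[name][country] += 1
    return counts


def predict(counts, name):
    """Return the most frequent country for a name, or None if the name was never seen."""
    if name not in counts:
        return None
    return counts[name].most_common(1)[0][0]


# Made-up training data: (first name, country) pairs.
records = [
    ("John", "US"), ("John", "US"), ("John", "Canada"),
    ("Johannes", "Germany"),
    ("Julia", "Germany"), ("Julia", "US"), ("Julia", "Germany"),
]
model = train(records)

print(predict(model, "John"))
print(predict(model, "Julia"))
print(predict(model, "Wei"))  # no such name in the training data
```

The last call shows the point about missing data: a name that never appears in the training records cannot be predicted at all.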
Machine learning projects are like any other development projects, with more complex algorithms. There’s no magic and there’s a lot of perspiration (hopefully figuratively rather than literally) involved in getting a program which behaves correctly.
In this article we used an artificial neural network (ANN) from the Spark machine learning library as a classifier to predict emergency department deaths due to heart disease. We discussed a high-level process for feature selection and for choosing the number of hidden layers of the network and the number of computational units. Based on that process, we found a model that achieved very good performance on test data. We observed that the Spark MLlib API is simple and easy to use for training the classifier and calculating its performance metrics. In reference to Hastie et al., we have some final comments.
Articles like this are what got me interested in data analysis to begin with.
This kind of attribution of an adjective to a subject is something an accurate parser can do automatically. Rather than laboriously going through just a hundred examples, it’s easy to set up Parsey McParseface and run through millions of sentences. The parser isn’t perfect, but at 94% accuracy on one metric, it’s pretty close to humans, who get 96%.
Even better, having the computer do the heavy lifting means that it’s possible to explore many other relationships in the data, to uncover all sorts of unknown statistical relationships in the language we use. There are bound to be other words that are skewed in similar or opposite ways to ‘bossy’, and I’d love to know what they are!
Looks like one more time sink for me… Check this out if you’re at all interested in parsers.