I prefer it to R mostly because I don’t have to create the csv-file(names) in advance before I import data to it. This is particularly important for scenarios where I want to append data to an existing file. The key for this task is NOT to use the append-option that Python offers, because M-scripts will be executed multiple times and this would create a total mess in my file. Instead I create a new file with the context to append and use the Import-from-folder method instead to stitch all csvs back together. Therefore I have to dynamically create new filenames for each import. So when the M-Python-scripts are executed repetitively here, the newly created file will just be overwritten – which doesn’t do any harm.
Click through for the code as well as a few caveats.
The parallel processing holds two varieties of execution: Synchronous and Asynchronous.
In synchronous execution, once a process starts execution, it puts a lock over the main program until its get accomplished.
While the asynchronous execution doesn’t require locking, it performs a task quickly but the outcome can be in the rearranged order.
Click through for a few examples using Pool.
This post came about due to a question on the Microsoft Machine Learning Server forum. The question was if there are any plans by Microsoft to support more the one input dataset (
sp_execute_external_script. My immediate reaction was that if you want more than one dataset, you can always connect from the script back into the database, and retrieve data.
However, the poster was well aware of that, but due to certain reasons he did not want to do it that way – he wanted to push in the data, fair enough. When I read this, I seemed to remember something from a while ago, where, instead of retrieving data from inside the script, they pushed in the data, serialized it as an output parameter and then used the binary representation as in input parameter (yeah – this sounds confusing, but bear with me). I did some research (read Googling), and found this StackOverflow question, and answer. So for future questions, and for me to remember, I decided to write a blog post about it.
This has been a point of frustration for me. We can name the one input data set, so I’d really like to see true support for input multiple data sets without the need for hacks.
Now that we’ve seen our data, it’s a relatively simple task to convert the R script to a Python script. There are a few major differences. First, Python is a general purpose programming language, whereas R is a statistical programming language. This means that some of the functionality provided in Base R requires additional libraries in Python. Pandas is a good library for data manipulation, but is already included by default in Power BI. Scikit-learn (also known as sklearn) is a good library for build predictive models. Finally, Seaborn and Matplotlib are good libraries for creating data visualizations.
In addition, there are some scenarios where Python is a bit more verbose than R, resulting in additional coding to achieve the same result. For instance, fitting a regression line to our data using the sklearn.linear_model.LinearRegression().fit() function required much more coding than the corresponding lm() function in R. Of course, there are plenty of situations where the opposite is true and R becomes the more verbose language.
Click through for the full example.
Let’s look at how Gradient Boosting works. Most of the magic is described in the name: “Gradient” plus “Boosting”.
Boosting builds models from individual so called “weak learners” in an iterative way. In the Random Forests part, I had already discussed the differences between Bagging and Boostingas tree ensemble methods. In boosting, the individual models are not built on completely random subsets of data and features but sequentially by putting more weight on instances with wrong predictions and high errors. The general idea behind this is that instances, which are hard to predict correctly (“difficult” cases) will be focused on during learning, so that the model learns from past mistakes. When we train each ensemble on a subset of the training set, we also call this Stochastic Gradient Boosting, which can help improve generalizability of our model.
The gradient is used to minimize a loss function, similar to how Neural Nets utilize gradient descent to optimize (“learn”) weights. In each round of training, the weak learner is built and its predictions are compared to the correct outcome that we expect. The distance between prediction and truth represents the error rate of our model. These errors can now be used to calculate the gradient. The gradient is nothing fancy, it is basically the partial derivative of our loss function – so it describes the steepness of our error function. The gradient can be used to find the direction in which to change the model parameters in order to (maximally) reduce the error in the next round of training by “descending the gradient”.
Along with neural networks, gradient boosting has become one of the dominant algorithms for machine learning, and is well worth learning about.
What Are Convolutional Neural Networks?
Convolutional Neural Networks, like neural networks, are made up of neurons with learnable weights and biases. Each neuron receives several inputs, takes a weighted sum over them, pass it through an activation function and responds with an output.
The whole network has a loss function and all the tips and tricks that we developed for neural networks still apply on Convolutional Neural Networks.
Pretty straightforward, right?
Neural networks, as its name suggests, is a machine learning technique which is modeled after the brain structure. It comprises of a network of learning units called neurons.
These neurons learn how to convert input signals (e.g. picture of a cat) into corresponding output signals (e.g. the label “cat”), forming the basis of automated recognition.
Let’s take the example of automatic image recognition. The process of determining whether a picture contains a cat involves an activation function. If the picture resembles prior cat images the neurons have seen before, the label “cat” would be activated.
Hence, the more labeled images the neurons are exposed to, the better it learns how to recognize other unlabelled images. We call this the process of training neurons.
I (finally) finished chapter 5 of Deep Learning in R, which is all about CNNs. It’s interesting just how open CNNs are for post hoc understanding, totally at odds with the classic neural network reputation for being a black box full of dark magic.
Now R has some useful packages like ggmap, mapdata and ggplot2 which allow you to source you map satellite images directly from google maps, but this does require a free google API key to source from the cloud. These packages can also plot the map around the data as I am currently trimming the map to fit the data. But for a fair test I also used a simplistic pre-built map in R. This was from the package rworldmap, which allows plotting at a country level with defined borders. Axes can be scaled to act like a zoom function but without a higher resolutions map or raster satellite image map it is pointless to go past a country level.
There’s a lot more you can do with both languages, but when you just want a plot in a few lines of code, both are up to the task.
If you read Part Two, then you know these are the steps I used for anomaly detection with K-means:
Segmentation – the process of splitting your time series data into small segments with a horizontal translation.
Windowing – the action of multiplying your segmented data by a windowing function to truncate the dataset before and after the window. The term windowing gets its name from its functionality: it allows you to only see the data in the window range since everything before and after (or outside the window) is multiplied by zero. Windowing allows you to seamlessly stitch your reconstructed data together.
Clustering – the task of grouping similar windowed segments and finding the centroids in the clusters. A centroid is at the center of a cluster. Mathematically, it is defined by the arithmetic mean position of all the points in the cluster.
Reconstruction – the process of rebuilding your time series data. Essentially, you are matching your normal time series data to the closest centroid (the predicted centroid) and stitching those centroids together to produce the reconstructed data.
Normal Error – The purpose of the Reconstruction is to calculate the normal error associated with the output of your time series prediction.
Anomaly Detection – Since you know what the normal error for reconstruction is, you can now use it as a threshold for anomaly detection. Any reconstruction error above that normal error can be considered an anomaly.
Read the whole thing. This is a really cool use case of a set of technologies along with a venerable (if sometimes troublesome) algorithm.
As we have created four different models, we would also like to have the accuary of the model visually represented using SSRS.
Showing plots created with Python might not be as straight forward, as with R Language.
Following procedure will extract the data from database and generate plot, that can be used and visualized in SSRS.
Tomaz shows us examples of displaying data as well as visuals generated in Python.
Once we’ve done the hard work of building and testing a model we need to put it to some use! Excel is a great front-end tool for playing with data interactively. It’s used virtually everywhere and so being able to deliver your model in Excel to non-developer users massively opens up opportunities for how it can be used in your business. Even if the model is being used as part of a real-time or batch system, being able to call the model interactively can be really helpful when trying to understand the behaviour of a system.
Fortunately now the model is written in Python getting it into Excel is extremely simple. PyXLL, the Python Excel Add-In has everything we need to write Python for Excel. All we need to do is add a few @xl_func decorators from the pyxll module and configure the PyXLL add-in to load the module containing our model.
If you’re not already familiar with PyXLL, check out the introduction to PyXLL from the user guide.
I mean, if the data’s going to live in Excel spreadsheets anyhow…