As important as these concepts are to working with Python and MLS, covering them was meant only to provide you with a foundation for doing what’s really important in MLS, that is, using Python (or the R language) to analyze data and present the results in a meaningful way. In this article, we start digging into the analytics side of Python by stepping through a script that identifies anomalies in a data set, which can occur as a result of fraud, demographic irregularities, network or system intrusion, or any number of other reasons.
The article uses a single example to demonstrate how to generate training and test data, create a support vector machine (SVM) data model based on the training data, score the test data using the SVM model, and create a scatter plot that shows the scoring results.
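To make that workflow concrete, here is a minimal sketch of the same sequence (generate training and test data, fit an SVM, score the test data). It uses scikit-learn's OneClassSVM as a stand-in, which is an assumption on my part; the article's own script may well use different functions. The data, variable names, and parameter values are all invented for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic data (an assumption, not the article's data set).
rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(200, 2))        # "normal" observations
test_normal = rng.normal(0, 1, size=(20, 2))   # more of the same
test_outliers = rng.uniform(6, 8, size=(5, 2)) # points far from the training data
test = np.vstack([test_normal, test_outliers])

# Fit a one-class SVM on the training data only.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(train)

# Score the test data: predict() returns +1 for inliers and -1 for anomalies;
# decision_function() returns a signed score (negative means anomalous).
labels = model.predict(test)
scores = model.decision_function(test)

print("flagged as anomalies:", int((labels == -1).sum()), "of", len(test))
```

The scores could then feed a scatter plot, colored by label, along the lines the article describes.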
Click through to see the scenario that Robert has laid out as an example.
One of the new features announced with SQL Server 2017 is support for the Python language. This is big! In SQL Server 2016, Microsoft announced support for the R language – an open source language ideally suited for statistical analysis and machine learning (ML). Recognizing that many data scientists use Python with ML libraries, the easy-to-learn-hard-to-forget language has now been added to the SQL Server ML suite.
There’s a big difference between R and Python though: R is a domain-specific language while Python is general purpose. That means that the full power of Python is available within SQL Server. This article leaves ML aside for the moment and explores a few of the other possibilities.
Gerald has two good cases for using Python with SQL Server. Funny enough, they’re both also easily supported in R, so you could do this in 2016 as well.
One of the most useful modules is the matplotlib library, which provides an extensive codebase for plotting data and creating rich, customized visualizations. You can use matplotlib components to generate a wide range of graphics, including bar charts, pie charts, scatter plots, histograms, and many others. For example, you can generate a series of line charts that aggregate inventory or sales data in your SQL Server database and then save those charts to .png or .pdf files.
This article includes several examples that demonstrate how to create matplotlib visualizations and save them to .pdf files, using data from the AdventureWorks2017 sample database. The article assumes that you know how to use the sp_execute_external_script stored procedure to run Python scripts in SQL Server. If you’re not familiar with the stored procedure, you should review the first two articles in this series before continuing with this one.
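As a rough idea of what such a script looks like, here is a small sketch that draws a line chart and saves it to a .pdf file. The sales figures, labels, and file name are made up; in SQL Server the data would arrive from a query rather than a hard-coded list, and the output path would need to be one the service can write to.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable when no display is attached
import matplotlib.pyplot as plt

# Illustrative data only (not from AdventureWorks2017).
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]

fig, ax = plt.subplots()
ax.plot(months, sales, marker="o")
ax.set_title("Monthly sales (illustrative data)")
ax.set_xlabel("Month")
ax.set_ylabel("Sales")

# savefig picks the output format from the file extension (.pdf here, .png also works).
pdf_path = os.path.join(tempfile.gettempdir(), "monthly_sales.pdf")
fig.savefig(pdf_path)
plt.close(fig)
```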
If you’re already familiar with matplotlib, using it within SQL Server is pretty easy, as Robert shows. If you’re not familiar, this is a useful introduction to the library.
This article focuses on using data frames in Python. It is the second article in a series about MLS and Python. The first article introduced you briefly to data frames. This article continues that discussion, describing how to work with data frame objects and the data within those objects.
Data frames and the functions they support are available to MLS and Python through the pandas library. The library is available as a Python module that provides tools for analyzing and manipulating data, including the ability to generate data frame objects and work with data frame data. The pandas library is included by default in MLS, so the functions and data structures available to pandas are ready to use, without having to manually install pandas in the MLS library.
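For readers new to pandas, a short sketch of the kind of data frame work involved might help. The column names and values below are invented. Inside sp_execute_external_script, the query results are handed to the script as a pandas data frame (named InputDataSet by default), so you would operate on that object rather than building one by hand.

```python
import pandas as pd

# Build a small data frame from a dictionary of columns (made-up data).
df = pd.DataFrame({
    "product": ["bike", "helmet", "gloves", "bike"],
    "quantity": [2, 5, 3, 1],
    "unit_price": [500.0, 40.0, 15.0, 500.0],
})

# Add a computed column, then aggregate by product.
df["line_total"] = df["quantity"] * df["unit_price"]
totals = df.groupby("product")["line_total"].sum()

print(df.shape)
print(totals)
```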
There’s quite a bit to this article, making it an interesting read.
As you can see, I collected the brand (Peugeot 106), the type (1.0, 1.1, …), the color of the car (black, blue, …), the construction year of the car, the odometer of the car (which is the distance in kilometers (km) traveled with the car at this point in space and time), the asking price of the car (in euros), the days until the MOT (Ministry of Transport test, a required periodical check-up of your car) and the horsepower (HP) of the car. Feel free to use your own variables/units!
It’s an interesting example of how you can approach a real problem.
The above code is correct, but it is inefficient. List comprehensions are meant for cases like this. The idea of a comprehension is to move the for loop inside the construction of the result list:

def get_square_list(x):
    return [xi*xi for xi in x]
This is both simpler and faster than the previous code.
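The "previous code" is not part of this excerpt, so the loop version below is my assumed reconstruction of it (the name get_square_list_loop is hypothetical). The sketch checks that both versions return the same list and times them with timeit; the numbers depend on your machine and Python version, so none are quoted here.

```python
import timeit

def get_square_list_loop(x):
    # Assumed "before" version: explicit loop with repeated append calls.
    result = []
    for xi in x:
        result.append(xi * xi)
    return result

def get_square_list(x):
    # Comprehension version from the excerpt: the loop lives inside the list construction.
    return [xi * xi for xi in x]

data = list(range(1000))
assert get_square_list_loop(data) == get_square_list(data)

# Time 200 calls of each version on the same input.
loop_time = timeit.timeit(lambda: get_square_list_loop(data), number=200)
comp_time = timeit.timeit(lambda: get_square_list(data), number=200)
print(f"loop: {loop_time:.4f}s, comprehension: {comp_time:.4f}s")
```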
Definitely worth reading if you come at Python from a C-like language background.
Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.
The sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use) looks like it could be useful here, if I can work out how to do the set-up correctly!
Sankey diagrams are on my list of dangerous visuals: done right, they are informative, but it’s easy to try to put too much into the diagram and thereby confuse everybody.
The first thing to notice is that our values are not normalized. The number of visitors is a raw count and gets larger and larger. To normalize it, we simply divide it by 100, so that all values fall below 1. The same holds for the lag: most of the lags are lower than 30, so I will divide the lag size by 30.
Note that there are many more approaches for normalizing the data! This is just a quick normalization of the data, but feel free to use your own normalization method. My normalization process is closely related to the MinMaxScaler normalization which can be found in sklearn (scikit-learn).
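Here is a small sketch of that idea in plain Python, with a hand-rolled min-max function next to it for comparison. The visitor and lag values are made up, and the function names are my own. The min-max version follows the usual (x - min) / (max - min) formula, which is what scikit-learn's MinMaxScaler applies by default; the fixed-divisor approach differs in that it neither subtracts the minimum nor adapts to the observed maximum.

```python
def scale_by_constant(values, divisor):
    # Quick normalization from the excerpt: divide every value by a fixed constant.
    return [v / divisor for v in values]

def min_max_scale(values):
    # Min-max scaling to the range [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative data only.
visitors = [12, 35, 48, 77, 90]
lags = [1, 7, 14, 21, 30]

visitors_scaled = scale_by_constant(visitors, 100)
lags_scaled = scale_by_constant(lags, 30)
visitors_minmax = min_max_scale(visitors)

print(visitors_scaled)
print(lags_scaled)
print(visitors_minmax)
```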
With just a few lines of Python code we can create a Multi-Layer Perceptron (MLP):
Click through for the code.
I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.
Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python. There are packages that do this, such as pmml. Unfortunately, many times, this is more challenging because the production system might demand that the entire end-to-end workflow is built exclusively in Python. That’s sometimes tough because there are aspects of statistical model building in R which are more intuitive than in Python.
Python has many strengths, such as its robust data structures (Dictionaries, for example), compatibility with Deep Learning and Spark, and its ability to be a multipurpose language. However, many scenarios in enterprise analytics require people to go back to basic statistics and Machine Learning, for which the classic Data Science packages in Python are not as intuitive as R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft, revoscalepy.
Having worked with both, my loyalties tend to lie with R for a couple of reasons. But this might help some people bridge the gap.
There are two installation options: In-Database or Standalone. If you are evaluating Machine Learning Services and you have no knowledge of what the load may be, start by selecting Machine Learning Services (In-Database). There are several reasons why you want to select the In-Database option by default. One of the problems that Microsoft was looking to solve by incorporating advanced data analytics was to improve performance of the native code by greatly reducing data latency. If you are analyzing a lot of data which is stored within SQL Server, the performance will be improved if the data does not need to be moved around on a network. The licensing costs of installing R Server standalone also need to be evaluated with a Microsoft representative. An evaluation of the resource load on the network, as well as analysis of the code running on SQL Server, should be performed prior to the decision to install the Machine Learning Server Standalone.
Read the whole thing.