Looking at the first 5 records of the RDD
This output is difficult to read because we are asking PySpark to show us data that is in the RDD format. PySpark has DataFrame functionality. If the Python version is 2.7 or higher, you can utilize the pandas package. However, pandas doesn’t work on Python version 2.6, so we use the Spark SQL functionality to create DataFrames for exploration.
The full example is a fairly simple k-means clustering process, which is a great introduction to PySpark.
According to the results of the 2016 survey, R is the preferred tool for 42% of analytics professionals, followed by SAS at 39% and Python at 20%. While Python’s placing may at first appear to relegate the language to Bronze Medal status, it’s the delta here that really matters.
It’s interesting to see the breakdowns of who uses which language, comparing across industry, education, work experience, and geographic lines.
ML Studio now gives you even more flexibility, with new language engines supported in the language modules. Within the Execute Python Script module, you can now choose to use Python 2.7.11 or Python 3.5, both of which run within the Anaconda 4.0 distribution. And within the Execute R Script module, you can now choose Microsoft R Open 3.2.2 as your R engine, in addition to the existing CRAN R 3.1.0 engine. Microsoft R Open 3.2.2 not only gives you a newer R language engine, it also gives you access to a wealth of new R packages for use within ML Studio. Over 400 packages are pre-installed for use with the R Script module, and you can install and use any other R package (including CRAN packages and your own R packages) via the Script Bundle input port.
I’m interested in the Microsoft R Open language support, as Azure ML is still using a relatively old version of R (3.1.0).
In this post, we focus on sourcing R and Python’s external dependencies, such as R libraries and Python modules, which are not already installed on Azure ML and require code compilation. Commonly the compiled code comes from a variety of other languages such as C, C++ and Fortran. One could also use this approach to wrap their compiled code with R or Python wrappers and run it on Azure ML.
To illustrate the process, we will build two MurmurHash modules from C++ for R and Python using the following two implementations on GitHub, and link them to Azure ML from a zipped folder.
Link via David Smith. I knew it was possible to call compiled C code from Python and R, but didn’t expect to be able to do it within Azure ML, so that’s good to know.
K-Means takes in an unlabeled data set and a positive whole number, k, which is the number of centroids, or clusters, you wish to find. If you do not know how many clusters there should be, it is possible to do some pre-processing to find that more automatically; however, that is out of the scope of this article. Once you have a data set and have defined the size of k, K-Means begins its iterative process. It starts by selecting initial centroids and grouping each data point with its nearest centroid. It then moves each centroid to the average of the data associated with it and reshuffles all of the data into new groups based on proximity to each centroid, repeating until the groups stop changing.
This is a big and detailed post, and worth reading in its totality.
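The K-Means loop described in the excerpt above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code from the linked post: the function names `kmeans` and `sq_dist`, the choice of random data points as initial centroids, and the sample data are all assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Cluster points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # No point changed group, so the process has settled.
        assignments = new_assignments
        # Update step: move each centroid to the average of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # An empty group keeps its previous centroid.
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, assignments


# Hypothetical sample data: two loose groups of 2-D points.
sample = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
found_centroids, found_labels = kmeans(sample, k=2)
print(found_centroids)
print(found_labels)
```

Which group ends up labeled 0 or 1 depends on the random starting centroids, so only the grouping itself is meaningful. A production workload would more likely lean on a library implementation than on a hand-rolled loop like this one.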
Buck Woody’s back to blogging, and his focus is data science. Over the past month, he’s looked at R and Python.
In future notebook entries we’ll explore working with R, but for now, we need to install it. That really isn’t that difficult, but it does bring up something we need to deal with first. While the R environment is truly amazing, it has some limitations. Its most glaring issue is that the data you want to work with is loaded into memory as a frame, which of course limits the amount of data you can process for a given task. It’s also not terribly suited for parallelism – many things are handled as in-line tasks. And if you use a package in your script, you have to ensure others load that script, and at the right version.
Enter Revolution Analytics – a company that changed R to include more features and capabilities to correct these issues, along with a few others. They have a great name in the industry, bright people, and great products – so Microsoft bought them. That means the “RRE” engine they created is going to start popping up in all sorts of places, like SQL Server 2016, Azure Machine Learning, and many others. But the “stand-alone” RRE products are still available, and at the current version. So that’s what we’ll install.
Python has some distinct differences that make it attractive for working in data analytics. It scales well, is fairly easy to learn and use, has an extensible framework, has support for almost every platform around, and you can use it to write extensive programs that work with almost any other system and platform.
R and Python are the two biggest languages in this slice of the field, and you’ll gain a lot from learning at least one of these languages.