Press "Enter" to skip to content

Category: Python

Learning R Or Python?

David Smith tackles the age-old question:

If your interests lean more towards traditional statistical analysis and inference as used within industries like manufacturing, finance, and the life sciences, I’d lean towards R. If you’re more interested in machine learning and artificial intelligence applications, I’d lean towards Python. But even that’s not a hard-and-fast rule: R has excellent support for machine learning and deep learning frameworks, and Python is often used for traditional data science applications.

One thing I am quite sure of though: neither Python nor R is inherently better than the other, and arguments on that front are ultimately futile. (Trust me, I’ve been there.) Which is better for any given person depends on a wide variety of factors, and for some, it may even be worthwhile to learn both. Brian Ray recently posted a good overview of the factors that may lead you towards R or Python for data science: their history, the community, performance, third-party support, use cases, and even how to use them together. It’s great food for thought if you’re trying to decide which community to invest in.

Embrace the power of “and.”  The whole R versus Python bit is fun for purposes of arguing with people, but they’re both powerful languages and we’re seeing more and more overlap—for example, the Keras package David mentions runs Python’s TensorFlow under the covers.

Neural Topic Models On Amazon SageMaker

David Ping, et al, show off topic modeling on Amazon SageMaker:

Topic Modeling is used to organize a corpus of documents into “topics” which is a grouping based on a statistical distribution of words within the documents themselves. Amazon Comprehend, our fully managed text analytics service, provides a pre-configured topic modeling API that is best suited for the most popular use cases like organizing customer feedback, support incidents or workgroup documents. Amazon Comprehend is the suggested topic modeling choice for customers as it removes a lot of the most routine steps associated with topic modeling like tokenization, training a model and adjusting parameters. Amazon SageMaker’s Neural Topic Model (NTM) caters to the use cases where a finer control of the training, optimization, and/or hosting of a topic model is required, such as training models on text corpus of particular writing style or domain, or hosting topic models as part of a web application. While Amazon SageMaker NTM provides a starting point of state-of-the-art topic modeling, customers have the flexibility to modify the network architecture as well as hyperparameters to accommodate the idiosyncrasies of their data sets as well as to tune the trade-off between a multitude of metrics such as document modeling accuracy, human interpretability and granularity of the learned topics, based on their applications. In addition, Amazon SageMaker NTM leverages the full power of the Amazon SageMaker platform: easily configurable training and hosting infrastructure, automatic hyperparameter optimization, and fully-managed hosting with auto-scaling.

They walk through the entire topic modeling process, so check it out.
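
As a rough illustration of the SDK side of that process, here is a minimal sketch assuming the SageMaker Python SDK's built-in NTM estimator; the role lookup, instance types, topic count, and the synthetic bag-of-words matrix are placeholders for illustration, not code from the post.

    import numpy as np
    import sagemaker
    from sagemaker import NTM, get_execution_role

    session = sagemaker.Session()
    role = get_execution_role()  # assumes this runs inside a SageMaker notebook instance

    # Built-in NTM estimator; num_topics is the main knob, instance settings are placeholders
    ntm = NTM(role=role,
              train_instance_count=1,
              train_instance_type="ml.c4.xlarge",
              num_topics=20,
              sagemaker_session=session)

    # Stand-in bag-of-words matrix: (documents x vocabulary) term counts
    bow_vectors = np.random.randint(0, 5, size=(1000, 2000)).astype("float32")

    ntm.fit(ntm.record_set(bow_vectors))
    predictor = ntm.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")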

Granting Non-Admin Users Access To Run ML Services

Niels Berglund walks through the rights needed for a non-administrative user to execute an external script using SQL Server Machine Learning Services:

Oops, something did go wrong, as it turns out that if you try to grant permissions on extended stored procedures, which SPEES is, you need to do it from the master database. Cool, let us switch to master and do it there. Well, if you try to do that – then you get another error: the user does not exist in master, sigh!

At this stage you have a couple of options:

  • Add the login for the user to the sysadmin role, or the user to the db_owner role in the actual database. No do not do that, I am only kidding! Do.Not.Do.That!

  • Create the user in master and grant the permission. That would work.

  • Grant the permission to public.

Check it out, as there are two parts to the process.

Solving A Problem In TensorFlow Using SoftMax

Kiran Gutha gives us a fairly simple solution to the MNIST digit recognition problem using softmax regression:

In this tutorial, we will train a machine learning model for predicting numbers in pictures. Our goal is not to design a world-class complex model (although we will give you the source code to implement first-rate predictive models later). Rather, this tutorial is to introduce how to use TensorFlow. So, we start here with a very simple mathematical model called Softmax Regression.

The implementation code for this tutorial is short, and the really interesting content is only contained in three lines of code. However, it is very important to understand the design ideas contained in these codes: the basic concepts of TensorFlow workflow and machine learning. Therefore, this tutorial will explain in detail the implementation of these codes.

This is about as easy as it gets with neural networks, but easy doesn’t mean wrong.
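
For reference, the heart of that approach looks roughly like the classic TensorFlow 1.x MNIST example sketched below; this is a generic sketch rather than Kiran's exact code.

    import tensorflow as tf
    from tensorflow.examples.tutorials.mnist import input_data

    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

    # Softmax regression: a single linear layer followed by softmax
    x = tf.placeholder(tf.float32, [None, 784])
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b)

    # Cross-entropy loss against the one-hot labels
    y_ = tf.placeholder(tf.float32, [None, 10])
    cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), axis=1))
    train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(1000):
            batch_xs, batch_ys = mnist.train.next_batch(100)
            sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})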

Comparing Keras In Python Versus R

Dmitry Kisler performs image classification using Keras in both Python and R:

From the plots above, one can see that:

  • the accuracy of your model doesn’t depend on the language you use to build and train it (the plot shows only train accuracy, but the model doesn’t have high variance and the bias accuracy is around 99% as well).

  • even though 10 measurements may not be convincing, Python would reduce (by up to 15%) the time required to train your CNN model. This is somewhat expected because R uses Python under the hood when it executes Keras functions.

This is just one example, but the results are about what I’d expect.
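
For context, a Keras image classifier of the sort being compared looks something like the Python sketch below; the architecture and hyperparameters are illustrative, not Dmitry's.

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
    from keras.datasets import mnist
    from keras.utils import to_categorical

    (x_train, y_train), _ = mnist.load_data()
    x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255
    y_train = to_categorical(y_train, 10)

    # Small CNN; the R keras package builds the same graph by calling into Python
    model = Sequential([
        Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(128, activation="relu"),
        Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, batch_size=128, validation_split=0.1)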

Auto-Encoders And KernelML

Rohan Kotwani gives us an example where KernelML might be better than TensorFlow or PyTorch:

So what’s the point of using KernelML?

1. The parameters in each layer can be non-linear
2. Each parameter can be sampled from a different random distribution
3. The parameters can be transformed to meet certain constraints
4. Network combinations are defined in terms of numpy operations
5. Parameters are probabilistically updated
6. Each parameter update samples the loss function around a local or global minima

KernelML Specs

KernelML is a brute-force optimizer that can be used to train machine learning algorithms. The package uses a combination of machine learning and Monte Carlo simulations to optimize a parameter vector with a user-defined loss function. Using kernelml creates a high computational cost for large, complex networks because it samples the loss function using a subspace for each parameter in the parameter vector, which requires many random simulations. The computational cost was reduced by enabling parallel computations with ipyparallel. The decision to use this package was made because it effectively utilizes the cores on a machine.
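
To make the idea concrete, here is a toy numpy sketch of that kind of Monte Carlo parameter search against a user-defined loss; it illustrates the concept only and is not the kernelml package's API.

    import numpy as np

    rng = np.random.RandomState(0)

    # Toy data: y = 2x + 1 plus noise
    X = rng.uniform(-1, 1, size=200)
    y = 2 * X + 1 + rng.normal(scale=0.1, size=200)

    def loss(params):
        # User-defined loss over a parameter vector (slope, intercept)
        w, b = params
        return np.mean((y - (w * X + b)) ** 2)

    best = np.zeros(2)
    best_loss = loss(best)
    scale = 1.0
    for _ in range(100):
        # Sample candidate parameter vectors around the current best
        candidates = best + rng.normal(scale=scale, size=(50, 2))
        losses = np.array([loss(c) for c in candidates])
        if losses.min() < best_loss:
            best, best_loss = candidates[losses.argmin()], losses.min()
        scale *= 0.95  # shrink the sampling neighborhood over time

    print(best, best_loss)  # should land near (2, 1)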

It’s an interesting use case, though I would have liked to see a direct comparison to other frameworks.

Visualizing Data In Real Time With SQL Server And Dash

Tomaz Kastrun shows how to use Python Dash to visualize data living in SQL Server in real time:

The need for visualizing real-time data (or near-real-time data) has been and still is a very important daily driver for many businesses. Microsoft SQL Server has many capabilities to visualize streaming data, and this time I will tackle this issue using Python and the Python Dash package for building web applications and visualizations. Dash is built on top of Flask, React, and Plotly, and gives a wide range of capabilities to create interactive web applications, interfaces, and visualizations.

Tomaz’s example hits SQL Server every half-second to grab the latest changes and gives us an example of roll-your-own streaming.
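
The pattern is roughly the one sketched below: a dcc.Interval component fires every 500 ms and a callback re-queries SQL Server via pyodbc. The connection string, table, and column names are placeholders rather than Tomaz's.

    import dash
    import dash_core_components as dcc
    import dash_html_components as html
    from dash.dependencies import Input, Output
    import pandas as pd
    import pyodbc
    import plotly.graph_objs as go

    app = dash.Dash(__name__)

    app.layout = html.Div([
        dcc.Graph(id="live-graph"),
        dcc.Interval(id="poll", interval=500, n_intervals=0),  # fire every half-second
    ])

    @app.callback(Output("live-graph", "figure"), [Input("poll", "n_intervals")])
    def refresh(n):
        # Placeholder connection string and table
        conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                              "SERVER=localhost;DATABASE=Sandbox;Trusted_Connection=yes;")
        df = pd.read_sql("SELECT event_time, value FROM dbo.LiveMetrics", conn)
        conn.close()
        return {"data": [go.Scatter(x=df["event_time"], y=df["value"], mode="lines")]}

    if __name__ == "__main__":
        app.run_server(debug=True)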

Flattening JSON Data With Databricks

Ivan Vazharov gives us a Databricks notebook to parse and flatten JSON using PySpark:

With Databricks you get:

  • An easy way to infer the JSON schema and avoid creating it manually
  • Subtle changes in the JSON schema won’t break things
  • The ability to explode nested lists into rows in a very easy way (see the Notebook below)
  • Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe.

Click through for the notebook.
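
The core trick is a pair of explode calls, something like the sketch below; the column names mirror the people/vendors/models structure described above but are otherwise made up.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.getOrCreate()

    # The schema is inferred automatically from the nested JSON
    df = spark.read.json("/mnt/data/people.json")  # placeholder path

    # explode() turns each element of a nested array into its own row
    flat = (df
            .select(col("name").alias("person"), explode("vendors").alias("vendor"))
            .select("person",
                    col("vendor.name").alias("vendor"),
                    explode(col("vendor.models")).alias("model")))

    flat.show()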

Scatterplot Matrices

The Plotly folks show off scatterplot matrices in Python:

The scatterplot matrix, known acronymically as SPLOM, is a relatively uncommon graphical tool that uses multiple scatterplots to determine the correlation (if any) between a series of variables.

These scatterplots are then organized into a matrix, making it easy to look at all the potential correlations in one place.

SPLOMs, invented by John Hartigan in 1975, allow data aficionados to quickly realize any interesting correlations between parameters in the data set.

In this post, we’ll go over how to make SPLOMs in Plotly with Python. For extra insights, check out our SPLOM tutorial in Python and R.
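
A minimal offline example of a go.Splom trace looks something like this; the toy DataFrame stands in for a real data set.

    import numpy as np
    import pandas as pd
    import plotly.graph_objs as go
    import plotly.offline as pyo

    # Toy data with one deliberately correlated pair of columns
    rng = np.random.RandomState(42)
    df = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
    df["c"] = 0.5 * df["a"] + rng.normal(scale=0.5, size=100)

    # One Splom trace: each column becomes a dimension of the scatterplot matrix
    splom = go.Splom(
        dimensions=[dict(label=name, values=df[name]) for name in df.columns],
        marker=dict(size=5),
    )

    pyo.plot(go.Figure(data=[splom]), filename="splom.html")  # writes an interactive HTML file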

Python And The Tidyverse

Leo at Locke Data looks at a couple of Python packages that implement Tidyverse concepts:

The Dplython README provides some clear examples of how the package can be used. Below is a summary of the common functions:

  • select() – used to get specific columns of the data-frame.

  • sift() – used to filter out rows based on the value of a variable in that row.

  • sample_n() and sample_frac() – used to provide a random sample of rows from the data-frame.

  • arrange() – used to sort results.

  • mutate() – used to create new columns based on existing columns.
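
Putting a few of those together, a minimal sketch based on the Dplython README looks like this (the toy DataFrame is made up, and dplython is assumed to be installed):

    import pandas as pd
    from dplython import DplyFrame, X, select, sift, arrange, mutate

    # Wrap an ordinary pandas DataFrame so it works with the >> pipe operator
    df = DplyFrame(pd.DataFrame({
        "carat": [0.23, 0.21, 0.29, 1.01],
        "price": [326, 327, 334, 4000],
    }))

    result = (df >>
              sift(X.carat > 0.22) >>                        # filter rows
              mutate(price_per_carat=X.price / X.carat) >>   # add a derived column
              arrange(X.price_per_carat) >>                  # sort
              select(X.carat, X.price_per_carat))

    print(result)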

I think the Tidyverse is immediately accessible for data platform professionals, so it’s good to see these concepts making their way to Python as well as R.
