Creating Seaborn Plots With R

Abdul Majed Raja shows how to call Python from R and build plots using the Seaborn Python package:

The reticulate package provides a comprehensive set of tools for interoperability between Python and R. The package includes facilities for:

  • Calling Python from R in a variety of ways including R Markdown, sourcing Python scripts, importing Python modules, and using Python interactively within an R session.
  • Translation between R and Python objects (for example, between R and Pandas data frames, or between R matrices and NumPy arrays).
  • Flexible binding to different versions of Python including virtual environments and Conda environments.

Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability.

The most common use of reticulate I’ve seen is running TensorFlow neural networks from R.

Natural Language Generation With Markov Chains

Abdul Majed Raja shows off Markovify, a Python package which builds sentences using Markov chains:

Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another. For example, if you made a Markov chain model of a baby’s behavior, you might include “playing,” “eating,” “sleeping,” and “crying” as states, which together with other behaviors could form a “state space”: a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probability of hopping, or “transitioning,” from one state to any other state (e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first). Read more about how Markov chains work in this interactive article by Victor Powell.

Click through for a fun example of headline generation.
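
To make the state-and-transition idea concrete, here is a minimal word-level sketch in plain Python. It is not Markovify's implementation: the `<START>` and `<END>` markers, the function names, and the three sample headlines are all made up for illustration.

```python
import random
from collections import defaultdict


def build_chain(sentences):
    """Map each word (a state) to the list of words observed right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        # Hypothetical boundary markers so the walk knows where to begin and stop.
        words = ["<START>"] + sentence.split() + ["<END>"]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_words=20):
    """Walk the chain from <START> until <END> or max_words is reached."""
    word, output = "<START>", []
    while len(output) < max_words:
        # Repeated followers appear several times in the list, so a uniform
        # choice reproduces the observed transition probabilities.
        word = rng.choice(chain[word])
        if word == "<END>":
            break
        output.append(word)
    return " ".join(output)


# Made-up training headlines, purely for illustration.
headlines = [
    "markets rally as rates fall",
    "rates fall as markets wobble",
    "markets wobble as storm nears",
]
chain = build_chain(headlines)
print(generate(chain, random.Random(0)))
```

Every adjacent pair of words in the output was seen somewhere in the training text, which is why the results tend to look plausible locally even when the whole sentence is nonsense.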

TensorFlow Lite

Laurence Moroney explains TensorFlow Lite:

TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices. It enables on-device machine learning inference with low latency and a small binary size. TensorFlow Lite also supports hardware acceleration with the Android Neural Networks API.

It’s designed to be low-latency, with optimized kernels for mobile apps, pre-fused activations and much more. It’s also *really* easy to use, and there’s a great demo app that will get you up and running with image classification from the device camera on both Android and iOS.

It comes in two parts:

  • A set of tools that you can use to prepare your models for use on mobile. These let you freeze your model to make it smaller, and then optimize and convert it in a process also called flattening the model, so that it will run happily on mobile.

  • A mobile runtime with an easy API that lets you pass data to the model and get classifications back.

You don’t build the neural network on a phone, but the fact that you can run one on your phone is pretty crazy.

Push-Based Alerting With Kafka Streams

Robin Moffatt shows how to take syslog data and create a notification app using Python and Kafka Streams:

Now we can query from it and show the aggregate window timestamp alongside the result:

ksql> SELECT ROWTIME, TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss'), \
HOST, INVALID_LOGIN_COUNT \
FROM INVALID_USERS_LOGINS_PER_HOST;
1521644100000 | 2018-03-21 14:55:00 | rpi-03 | 1
1521646620000 | 2018-03-21 15:37:00 | rpi-03 | 2
1521649080000 | 2018-03-21 16:18:00 | rpi-03 | 1
1521649260000 | 2018-03-21 16:21:00 | rpi-03 | 4
1521649320000 | 2018-03-21 16:22:00 | rpi-03 | 2
1521649080000 | 2018-03-21 16:38:00 | rpi-03 | 2

In the above query I’m displaying the aggregate window start time, ROWTIME (which is epoch), and converting it also to a display string, using TIMESTAMPTOSTRING. We can use this to easily query the stream for a given window of interest. For example, for the window beginning at 2018-03-21 16:21:00 we can see there were four invalid user login attempts. We can easily check the source data for this, using the ROWTIME in the above output for the window (16:21 – 16:22) as the bounds for the predicate:

It’s a very interesting use case.
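
For readers who want to see the windowing logic outside of KSQL, here is a rough Python sketch of the same idea: bucket events into fixed windows, count per host, and print the window start both as epoch milliseconds and as a display string. The one-minute window size is my assumption (inferred from the 16:21 to 16:22 bounds in the excerpt), the event list is invented, and the function names are hypothetical. Formatting is done in UTC, which matches the sample output above.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_MS = 60_000  # assumed one-minute tumbling window


def window_start(epoch_ms):
    """Round an event time down to the start of its window."""
    return epoch_ms - (epoch_ms % WINDOW_MS)


def format_ts(epoch_ms):
    """Rough equivalent of TIMESTAMPTOSTRING(..., 'yyyy-MM-dd HH:mm:ss') in UTC."""
    moment = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def count_per_window(events):
    """Count events per (window start, host); events are (epoch_ms, host) pairs."""
    counts = Counter()
    for epoch_ms, host in events:
        counts[(window_start(epoch_ms), host)] += 1
    return counts


# Invented invalid-login events: four in the 16:21 window, one in the 16:22 window.
events = [
    (1521649265000, "rpi-03"),
    (1521649270500, "rpi-03"),
    (1521649299000, "rpi-03"),
    (1521649315000, "rpi-03"),
    (1521649325000, "rpi-03"),
]
counts = count_per_window(events)
for (start, host), n in sorted(counts.items()):
    print(start, "|", format_ts(start), "|", host, "|", n)
```

A real Kafka Streams or KSQL job also has to deal with late-arriving records and state retention, which this sketch ignores entirely.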

Accessing BigQuery Data From Python And R

Eleni Markou shows how to connect to Google’s BigQuery service using Python and then R:

Some time ago we discussed how you can access data that are stored in Amazon Redshift and PostgreSQL with Python and R. Let’s say you did find an easy way to store a pile of data in your BigQuery data warehouse and keep them in sync. Now you want to start messing with it using statistical techniques, maybe build a model of your customers’ behavior, or try to predict your churn rate.

To do that, you will need to extract your data from BigQuery and use a framework or language that is well suited for data analysis; the most popular so far are Python and R. In this small tutorial we will see how we can extract data that is stored in Google BigQuery to load it with Python or R, and then use the numerous analytic libraries and algorithms that exist for these two languages.

Read on to see how easy it is for either language.

Working With Jupyter Notebooks And Airflow On Hadoop

Mark Litwintschik shows us an interesting demonstration of running Jupyter Notebooks as well as automating tasks with Airflow on Hadoop:

The following will create a ~/airflow folder, set up a SQLite 3 database used to store Airflow’s state and configuration set via the Web UI, upgrade the configuration schema and create a folder for the Python-based jobs code Airflow will run.

$ cd ~
$ airflow initdb
$ airflow upgradedb
$ mkdir -p ~/airflow/dags

By default Presto’s Web UI, Spark’s Web UI and Airflow’s Web UI all use TCP port 8080. If you launch Presto after Spark then Presto will fail to start. If you start Spark after Presto then Presto will launch on 8080 and the Spark Master Server will take 8081 and keep trying higher ports until it finds one that is free. Spark will then pick an even higher port number for the Spark Worker Web UI. This overlap normally isn’t an issue as in a production setting these services would normally live on separate machines.
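
The "keep trying higher ports until one is free" behavior described above can be sketched in a few lines of Python. This is only an illustration of the probing pattern, not Spark's actual code; the function name, the loopback host, and the limit of 50 attempts are my own assumptions. It can also be handy for checking what is already sitting on 8080 before launching one of these services.

```python
import socket


def first_free_port(start=8080, host="127.0.0.1", attempts=50):
    """Return the first port at or above `start` that can be bound on `host`."""
    # Cap the range so we never try a port number above 65535.
    for port in range(start, min(start + attempts, 65536)):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port already taken, try the next one
            return port
    raise RuntimeError(f"no free port found starting from {start}")


print(first_free_port())
```

Note that the port is released again as soon as the function returns, so another process could still grab it before you do.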

Read the whole thing.

Contrasting Plotly And Seaborn

Natasha Sharma contrasts the Seaborn and Plotly libraries for visualizing data:

It was important to use a library which can provide easy and high-class interactivity. Before embedding the plots into my website code, I tested a few different libraries like Matplotlib and Seaborn in order to visualize the results and to see how different they can look. After a few trials, I came across the Plotly library and found it valuable for my project because of its built-in functionality, which gives the user high-class interactivity.

In this post, I am going to compare Seaborn and Plotly using a bar chart and a heatmap. I will be using the Breast Cancer dataset to visualize these plots. But before jumping into the comparison, the dataset I used needed preprocessing such as data cleaning, so I followed a few steps.

In this case, the contrast is mostly lines of code versus visual quality; read on for more.

Using Python Within R

Kevin Feasel

2018-04-02

Python, R

David Smith points out the new reticulate package:

With reticulate, you can:

  • Import objects from Python, automatically converted into their equivalent R types. (For example, Pandas data frames become R data.frame objects, and NumPy arrays become R matrix objects.)

  • Import Python modules, and call their functions from R

  • Source Python scripts from R

  • Interactively run Python commands from the R command line

  • Combine R code and Python code (and output) in R Markdown documents, as shown in the snippet below

The first thing that came to mind when reading this was the implementation of the keras package in R and how it calls out to TensorFlow (written in Python).  The ability to make R vs Python an “and” instead of an “or” proposition is quite powerful.

Tips For Processing Large Data Sets With Python

Kevin Feasel

2018-03-30

Python

Julien Heiduk has a few tips for people looking to process large data sets within Python:

In order to aggregate our data, we have to use chunksize. This option of read_csv allows you to load a massive file as small chunks in Pandas. We decide to take 10% of the total length for the chunksize, which corresponds to 40 million rows.
Be careful: it is not necessarily a good idea to take a small value. The time between each iteration can be too long with a small chunksize. In order to find the best “memory usage vs. time” trade-off, you can try different chunksize values and select the one which consumes the least memory and runs the fastest.
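
As a small illustration of the chunked pattern, here is a sketch that sums one column per group while only ever holding one chunk in memory. The column names, the tiny in-memory CSV, and the `sum_by_key` helper are made up for the example; with a real file you would pass a path and a far larger chunksize.

```python
import io

import pandas as pd


def sum_by_key(source, key, value, chunksize):
    """Sum `value` per `key` while reading the CSV one chunk at a time."""
    total = None
    for chunk in pd.read_csv(source, chunksize=chunksize):
        partial = chunk.groupby(key)[value].sum()
        # Merge this chunk's partial sums into the running total; keys that
        # are missing on either side are treated as 0.
        total = partial if total is None else total.add(partial, fill_value=0)
    return total


# Tiny made-up data set; chunksize=2 forces three separate chunks.
csv_text = "city,amount\nparis,10\nlyon,5\nparis,7\nlyon,1\nnice,3\n"
result = sum_by_key(io.StringIO(csv_text), "city", "amount", chunksize=2)
print(result.sort_index())
```

This works because a sum can be combined across chunks. Aggregates such as a median cannot be merged this way and need a different approach.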

Click through for more tips.

Multi-Class Text Classification In Python

Susan Li has a series on multi-class text classification in Python.  First up is analysis with PySpark:

Our task is to classify San Francisco Crime Description into 33 pre-defined categories. The data can be downloaded from Kaggle.

When a new crime description comes in, we want to assign it to one of 33 categories. The classifier makes the assumption that each new crime description is assigned to one and only one category. This is a multi-class text classification problem.

    • Input: Descript
    • Example: “STOLEN AUTOMOBILE”
    • Output: Category
    • Example: VEHICLE THEFT

To solve this problem, we will use a variety of feature extraction techniques along with different supervised machine learning algorithms in Spark. Let’s get started!

Then, she looks at multi-class text classification with scikit-learn:

The classifiers and learning algorithms cannot directly process the text documents in their original form, as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.

One common approach for extracting features from the text is to use the bag of words model: a model where for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.

Specifically, for each term in our dataset, we will calculate a measure called Term Frequency, Inverse Document Frequency, abbreviated to tf-idf.
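
To show what that measure looks like in practice, here is a bare-bones sketch using the textbook definition: term frequency (count divided by document length) multiplied by the natural log of (number of documents divided by number of documents containing the term). The sample documents are invented, and scikit-learn's TfidfVectorizer applies smoothing and normalization on top of this, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using unsmoothed tf-idf."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented complaint-style snippets, purely for illustration.
docs = ["loan payment late", "loan denied", "credit card loan fee"]
weights = tf_idf(docs)
for doc, doc_weights in zip(docs, weights):
    print(doc, "->", doc_weights)
```

A word like "loan" that appears in every document gets a weight of zero, while a word that is rare across the collection but present in one document scores highest there, which is the behavior that makes tf-idf useful for telling categories apart.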

This is a nice pair of articles on the topic.  Natural Language Processing (and dealing with text in general) is one place where Python is well ahead of R in terms of functionality and ease of use.
