Press "Enter" to skip to content

Category: Python

The Elitist Shuffle And Recommenders

Rodrigo Agundez shows us a way of displaying fresh recommendations without retraining the recommender system:

Suppose you have 10,000 items in total that can be recommended to your user, you run the recommendation system over all the items and those 10,000 items get ranked in order of relevance of the content.

The application shows 5 items on the entry screen. The first time the user opens the application after the re-scoring process the top 5 ranked items are shown. It is decided that from now on (based on user control groups, investigation, AB testing, etc.) until the next re-scoring process the entry screen should not be the same every time and remain relevant for the user.

Based on an investigation from the data scientist it turns out that somewhat relevant items appear until item 100. Then the idea is to somehow shuffle those 100 items such that the top 5 items shown are still relevant but not the same.

Click through for an example in Python and how it compares favorably to a couple other shuffling algorithms.

Comments closed

Creating Seaborn Plots With R

Abdul Majed Raja shows how to call Python from R and build plots using the Seaborn Python package:

The reticulate package provides a comprehensive set of tools for interoperability between Python and R. The package includes facilities for:

  • Calling Python from R in a variety of ways including R Markdown, sourcing Python scripts, importing Python modules, and using Python interactively within an R session.
  • Translation between R and Python objects (for example, between R and Pandas data frames, or between R matrices and NumPy arrays).
  • Flexible binding to different versions of Python including virtual environments and Conda environments.

Reticulate embeds a Python session within your R session, enabling seamless, high-performance interoperability.

The more common use of reticulate I’ve seen is running TensorFlow neural networks from R.

Comments closed

Natural Language Generation With Markov Chains

Abdul Majed Raja shows off Markovify, a Python package which builds sentences using Markov chains:

Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another. For example, if you made a Markov chain model of a baby’s behavior, you might include “playing,” “eating”, “sleeping,” and “crying” as states, which together with other behaviors could form a ‘state space’: a list of all possible states. In addition, on top of the state space, a Markov chain tells you the probability of hopping, or “transitioning,” from one state to any other state — -e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first. Read more about how Markov Chain works in this interactive article by Victor Powell.

Click through for a fun example of headline generation.

Comments closed

TensorFlow Lite

Laurence Maroney explains TensorFlow Lite:

TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices. It enables on-device machine learning inference with low latency and a small binary size. TensorFlow Lite also supports hardware acceleration with the Android Neural Networks API.

It’s designed to be low-latency, with optimized kernels for mobile apps, pre-fused activations and much more. It’s also *really* easy to use, and there’s a great demo app that will get you up and running with image classification from the device camera on both Android and iOS.

It comes in two parts:

  • A set of tools that you can use to prepare your models for use on mobile. These let you freeze your model to make it smaller, and then optimize and convert it in a process also called flattening the model, so that it will run happily on mobile

  • A mobile runtime with an easy API that lets you pass data to the model and get classifications back.

You don’t build the neural network on a phone, but the fact that you can run one on your phone is pretty crazy.

Comments closed

Push-Based Alerting With Kafka Streams

Robin Moffatt shows how to take syslog data and create a notification app using Python and Kafka Streams:

Now we can query from it and show the aggregate window timestamp alongside the result:

ksql> SELECT ROWTIME, TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss'), \
HOST, INVALID_LOGIN_COUNT \
FROM INVALID_USERS_LOGINS_PER_HOST;
1521644100000 | 2018-03-21 14:55:00 | rpi-03 | 1
1521646620000 | 2018-03-21 15:37:00 | rpi-03 | 2
1521649080000 | 2018-03-21 16:18:00 | rpi-03 | 1
1521649260000 | 2018-03-21 16:21:00 | rpi-03 | 4
1521649320000 | 2018-03-21 16:22:00 | rpi-03 | 2
1521649080000 | 2018-03-21 16:38:00 | rpi-03 | 2

In the above query I’m displaying the aggregate window start time, ROWTIME (which is epoch), and converting it also to a display string, using TIMESTAMPTOSTRING. We can use this to easily query the stream for a given window of interest. For example, for the window beginning at 2018-03-21 16:21:00 we can see there were four invalid user login attempts. We can easily check the source data for this, using the ROWTIME in the above output for the window (16:21 – 16:22) as the bounds for the predicate:

It’s a very interesting use case.

Comments closed

Accessing BigQuery Data From Python And R

Eleni Markou shows how to connect to Google’s BigQuery service using Python and then R:

Some time ago we discussed how you can access data that are stored in Amazon Redshift and PostgreSQL with Python and R. Let’s say you did find an easy way to store a pile of data in your BigQuery data warehouse and keep them in sync. Now you want to start messing with it using statistical techniques, maybe build a model of your customers’ behavior, or try to predict your churn rate.

To do that, you will need to extract your data from BigQuery and use a framework or language that is best suited for data analysis and the most popular so far are Python and R. In this small tutorial we will see how we can extract data that is stored in Google BigQuery to load it with Python or R, and then use the numerous analytic libraries and algorithms that exist for these two languages.

Read on to see how easy it is for either language.

Comments closed

Working With Jupyter Notebooks And Airflow On Hadoop

Mark Litwintschik shows us an interesting demonstration of running Jupyter Notebooks as well as automating tasks with Airflow on Hadoop:

The following will create a ~/airflow folder, setup a SQLite 3 database used to store Airflow’s state and configuration set via the Web UI, upgrade the configuration schema and create a folder for the Python-based jobs code Airflow will run.

$ cd ~
$ airflow initdb
$ airflow upgradedb
$ mkdir -p ~/airflow/dags

By default Presto’s Web UI, Spark’s Web UI and Airflow’s Web UI all use TCP port 8080. If you launch Presto after Spark then Presto will fail to start. If you start Spark after Presto then Presto will launch on 8080 and the Spark Master Server will take 8081 and keep trying higher ports until it finds one that is free. Spark will then pick an even higher port number for the Spark Worker Web UI. This overlap normally isn’t an issue as in a production setting these services would normally live on separate machines.

Read the whole thing.

Comments closed

Contrasting Plotly And Seaborn

Natasha Sharma contrasts the Seaborn and Plotly libraries for visualizing data:

It was important to use a library which can provide easy and high-class interactivity. Before embedding the plots into my website code, I tested a few different libraries like Matplotlib and Seaborn in order to visualize the results and to see how different they can look. After few trials, I came across Plotly library and found it valuable for my project because of its inbuilt functionality which gives user a high class interactivity.

In this post, I am going to compare Seaborn and Plotly using – Bar Chart and Heatmap diagram. I will be using Breast cancer dataset to visualize these plots. But before jumping into the comparison, the dataset I used needed preprocessing like data cleaning so, I followed steps.

In this case, the contrast is mostly lines of code versus visual quality; read on for more.

Comments closed

Using Python Within R

David Smith points out new reticulate package:

With reticulate, you can:

  • Import objects from Python, automatically converted into their equivalent R types. (For example, Pandas data frames become R data.frame objects, and NumPy arrays become R matrix objects.)

  • Import Python modules, and call their functions from R

  • Source Python scripts from R

  • Interactively run Python commands from the R command line

  • Combine R code and Python code (and output) in R Markdown documents, as shown in the snippet below

The first thing that came to mind when reading this was the implementation of the keras package in R and how it calls out to TensorFlow (written in Python).  The ability to make R vs Python an “and” instead of an “or” proposition is quite powerful.

Comments closed

Tips For Processing Large Data Sets With Python

Julien Heiduk has a few tips for people looking to process large data sets within Python:

In order to aggregate our data, we have to use chunksize. This option of read_csvallows you to load massive file as small chunks in Pandas. We decide to take 10% of the total length for the chunksize which corresponds to 40 Million rows.
Be careful it is not necessarily interesting to take a small value. The time between each iteration can be too long with a small chaunksize. In order to find the best trade-off “Memory usage – Time” you can try different chunksize and select the best which will consume the lesser memory and which will be the faster.

Click through for more tips.

Comments closed