Press "Enter" to skip to content

Category: Python

TensorFlow Lite

Laurence Moroney explains TensorFlow Lite:

TensorFlow Lite is TensorFlow’s lightweight solution for mobile and embedded devices. It enables on-device machine learning inference with low latency and a small binary size. TensorFlow Lite also supports hardware acceleration with the Android Neural Networks API.

It’s designed to be low-latency, with optimized kernels for mobile apps, pre-fused activations and much more. It’s also *really* easy to use, and there’s a great demo app that will get you up and running with image classification from the device camera on both Android and iOS.

It comes in two parts:

  • A set of tools that you can use to prepare your models for use on mobile. These let you freeze your model to make it smaller, and then optimize and convert it in a process also called flattening the model, so that it will run happily on mobile.

  • A mobile runtime with an easy API that lets you pass data to the model and get classifications back.

You don’t build the neural network on a phone, but the fact that you can run one on your phone is pretty crazy.
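
If you're curious what that conversion step looks like in practice, here's a minimal sketch of turning a SavedModel into a .tflite flatbuffer (the paths are placeholders, and depending on your TensorFlow version the converter may live under tf.contrib.lite rather than tf.lite):

import tensorflow as tf

# Convert a SavedModel directory into a TensorFlow Lite flatbuffer
# ("saved_model_dir" is a placeholder path)
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)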


Push-Based Alerting With Kafka Streams

Robin Moffatt shows how to take syslog data and create a notification app using Python and Kafka Streams:

Now we can query from it and show the aggregate window timestamp alongside the result:

ksql> SELECT ROWTIME, TIMESTAMPTOSTRING(ROWTIME, 'yyyy-MM-dd HH:mm:ss'), \
HOST, INVALID_LOGIN_COUNT \
FROM INVALID_USERS_LOGINS_PER_HOST;
1521644100000 | 2018-03-21 14:55:00 | rpi-03 | 1
1521646620000 | 2018-03-21 15:37:00 | rpi-03 | 2
1521649080000 | 2018-03-21 16:18:00 | rpi-03 | 1
1521649260000 | 2018-03-21 16:21:00 | rpi-03 | 4
1521649320000 | 2018-03-21 16:22:00 | rpi-03 | 2
1521650280000 | 2018-03-21 16:38:00 | rpi-03 | 2

In the above query I’m displaying the aggregate window start time, ROWTIME (which is in epoch milliseconds), and also converting it to a display string using TIMESTAMPTOSTRING. We can use this to easily query the stream for a given window of interest. For example, for the window beginning at 2018-03-21 16:21:00 we can see there were four invalid user login attempts. We can easily check the source data for this, using the ROWTIME in the above output for the window (16:21 – 16:22) as the bounds for the predicate.

It’s a very interesting use case.


Accessing BigQuery Data From Python And R

Eleni Markou shows how to connect to Google’s BigQuery service using Python and then R:

Some time ago we discussed how you can access data stored in Amazon Redshift and PostgreSQL with Python and R. Let’s say you did find an easy way to store a pile of data in your BigQuery data warehouse and keep it in sync. Now you want to start messing with it using statistical techniques, maybe build a model of your customers’ behavior, or try to predict your churn rate.

To do that, you will need to extract your data from BigQuery and use a framework or language best suited for data analysis; the most popular so far are Python and R. In this small tutorial we will see how to extract data stored in Google BigQuery, load it with Python or R, and then use the numerous analytic libraries and algorithms that exist for these two languages.

Read on to see how easy it is for either language.
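
To give a sense of how little code it takes on the Python side, here's a minimal sketch using the google-cloud-bigquery client against one of Google's public datasets (the project ID is a placeholder, and this assumes GOOGLE_APPLICATION_CREDENTIALS points at a service account key):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

# Run the query and pull the results straight into a Pandas data frame
df = client.query(sql).to_dataframe()
print(df.head())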


Working With Jupyter Notebooks And Airflow On Hadoop

Mark Litwintschik shows us an interesting demonstration of running Jupyter Notebooks as well as automating tasks with Airflow on Hadoop:

The following will create a ~/airflow folder, set up a SQLite 3 database used to store Airflow’s state and configuration set via the Web UI, upgrade the configuration schema and create a folder for the Python-based job code Airflow will run.

$ cd ~
$ airflow initdb
$ airflow upgradedb
$ mkdir -p ~/airflow/dags

By default, Presto’s Web UI, Spark’s Web UI and Airflow’s Web UI all use TCP port 8080. If you launch Presto after Spark, then Presto will fail to start. If you start Spark after Presto, then Presto will launch on 8080 and the Spark Master Server will take 8081, trying higher ports until it finds one that is free. Spark will then pick an even higher port number for the Spark Worker Web UI. This overlap normally isn’t an issue, as in a production setting these services would live on separate machines.

Read the whole thing.
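
For context, a job dropped into that ~/airflow/dags folder is just a Python file defining a DAG; here's a minimal sketch (a made-up example, not one of Mark's actual jobs):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# A trivial daily DAG with a single task that lists the HDFS root
dag = DAG(
    dag_id="hello_hadoop",
    start_date=datetime(2018, 3, 1),
    schedule_interval="@daily",
)

list_hdfs_root = BashOperator(
    task_id="list_hdfs_root",
    bash_command="hdfs dfs -ls /",
    dag=dag,
)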


Contrasting Plotly And Seaborn

Natasha Sharma contrasts the Seaborn and Plotly libraries for visualizing data:

It was important to use a library which can provide easy, high-class interactivity. Before embedding the plots into my website code, I tested a few different libraries, like Matplotlib and Seaborn, in order to visualize the results and to see how different they can look. After a few trials, I came across the Plotly library and found it valuable for my project because of its built-in functionality, which gives the user a high degree of interactivity.

In this post, I am going to compare Seaborn and Plotly using a bar chart and a heatmap. I will be using the breast cancer dataset to visualize these plots. But before jumping into the comparison, the dataset needed some preprocessing, like data cleaning, so I followed a few cleaning steps first.

In this case, the contrast is mostly lines of code versus visual quality; read on for more.
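
To make the lines-of-code contrast concrete, here's roughly what the same bar chart looks like in each library (a sketch on a two-row toy data frame, not Natasha's actual preprocessing):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go

df = pd.DataFrame({"diagnosis": ["benign", "malignant"], "count": [357, 212]})

# Seaborn: a static image in one line of plotting code
sns.barplot(x="diagnosis", y="count", data=df)
plt.savefig("seaborn_bar.png")

# Plotly: an interactive HTML page, at the cost of a bit more setup
fig = go.Figure(data=[go.Bar(x=df["diagnosis"], y=df["count"])],
                layout=go.Layout(title="Diagnosis counts"))
py.plot(fig, filename="plotly_bar.html", auto_open=False)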


Using Python Within R

David Smith points out the new reticulate package:

With reticulate, you can:

  • Import objects from Python, automatically converted into their equivalent R types. (For example, Pandas data frames become R data.frame objects, and NumPy arrays become R matrix objects.)

  • Import Python modules, and call their functions from R

  • Source Python scripts from R

  • Interactively run Python commands from the R command line

  • Combine R code and Python code (and output) in R Markdown documents, as shown in the snippet below

The first thing that came to mind when reading this was the implementation of the keras package in R and how it calls out to TensorFlow (written in Python).  The ability to make R vs Python an “and” instead of an “or” proposition is quite powerful.


Tips For Processing Large Data Sets With Python

Julien Heiduk has a few tips for people looking to process large data sets within Python:

In order to aggregate our data, we have to use chunksize. This option of read_csv allows you to load a massive file as small chunks in Pandas. We decided to take 10% of the total length for the chunksize, which corresponds to 40 million rows.
Be careful: it is not necessarily a good idea to take a small value. The time between each iteration can be too long with a small chunksize. In order to find the best memory-time trade-off, you can try different chunksizes and select the one that consumes the least memory and runs the fastest.

Click through for more tips.
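
The chunked-aggregation pattern Julien describes looks roughly like this (the file name and column names are made up for illustration):

import pandas as pd

# Stream the file in 40-million-row chunks instead of loading it all at once
chunks = pd.read_csv("events.csv", chunksize=40_000_000)

# Aggregate each chunk, then combine the partial results
partials = [chunk.groupby("user_id")["amount"].sum() for chunk in chunks]
totals = pd.concat(partials).groupby(level=0).sum()
print(totals.head())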


Multi-Class Text Classification In Python

Susan Li has a series on multi-class text classification in Python.  First up is analysis with PySpark:

Our task is to classify San Francisco Crime Description into 33 pre-defined categories. The data can be downloaded from Kaggle.

When a new crime description comes in, we want to assign it to one of 33 categories. The classifier makes the assumption that each new crime description is assigned to one and only one category. This is a multi-class text classification problem.

    • Input: Descript
    • Example: “STOLEN AUTOMOBILE”
    • Output: Category
    • Example: VEHICLE THEFT

To solve this problem, we will use a variety of feature extraction techniques along with different supervised machine learning algorithms in Spark. Let’s get started!
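
A representative Spark ML pipeline for this kind of problem looks something like the following sketch (train_df and test_df are assumed splits of the Kaggle data; Susan's actual feature choices differ):

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import CountVectorizer, IDF, RegexTokenizer, StringIndexer

# Turn the free-text Descript column into tf-idf features and index the label
tokenizer = RegexTokenizer(inputCol="Descript", outputCol="words", pattern="\\W")
count_vec = CountVectorizer(inputCol="words", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="features")
label_indexer = StringIndexer(inputCol="Category", outputCol="label")
lr = LogisticRegression(maxIter=20)

pipeline = Pipeline(stages=[tokenizer, count_vec, idf, label_indexer, lr])
model = pipeline.fit(train_df)          # train_df: DataFrame with Descript, Category
predictions = model.transform(test_df)  # adds prediction and probability columns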

Then, she looks at multi-class text classification with scikit-learn:

The classifiers and learning algorithms cannot directly process the text documents in their original form, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.

One common approach for extracting features from the text is to use the bag of words model: a model where for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.

Specifically, for each term in our dataset, we will calculate a measure called Term Frequency-Inverse Document Frequency, abbreviated to tf-idf.
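
On the scikit-learn side, the core of that tf-idf-plus-classifier approach fits in a few lines (a sketch with generic texts and labels rather than Susan's consumer complaints data):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# texts: list of documents; labels: list of category names (assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

clf = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, min_df=5, stop_words="english"),
    LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000),
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))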

This is a nice pair of articles on the topic.  Natural Language Processing (and dealing with text in general) is one place where Python is well ahead of R in terms of functionality and ease of use.


XGBoost With Python

Fisseha Berhane looked at Extreme Gradient Boosting with R and now covers it in Python:

In both R and Python, the default base learners are trees (gbtree), but we can also specify gblinear for linear models and dart for both classification and regression problems.
In this post, I will optimize only three of the parameters shown above; you can try optimizing the other parameters. You can see the list of parameters and their details on the website.

It’s hard to overstate just how valuable XGBoost is as an algorithm.
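
As a quick taste of the Python API, here's a minimal classification sketch using the scikit-learn-style wrapper (the hyperparameter values are arbitrary; swap the booster for "gblinear" or "dart" to try the other base learners):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = xgb.XGBClassifier(
    booster="gbtree",      # the default; "gblinear" and "dart" are also available
    max_depth=4,
    learning_rate=0.1,
    n_estimators=200,
)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))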


Using Python In SQL Server 2017

Emma Stewart has a post covering setup and configuration of SQL Server 2017 Machine Learning Services and using Python within SQL Server:

One of the new features of SQL Server 2017 was the ability to execute Python scripts within SQL Server. For anyone who hasn’t heard of Python, it is the language of choice for data analysis. It has a lot of libraries for data analysis and predictive modelling, offers power and flexibility for various machine learning tasks and is also a much simpler language to learn than others.

The release of SQL Server 2016 saw the integration of the database engine with R Services, a data science language. By extending this support to Python, Microsoft have renamed R Services to ‘Machine Learning Services’ to include both R and Python.

The benefits of being able to run Python from SQL Server are that you can keep analytics close to the data (if your data is held within a SQL Server database) and reduce any unnecessary data movement. In a production environment you can simply execute your Python solution via a T-SQL Stored Procedure and you can also deploy the solution using the familiar development tool, Visual Studio.

ML Services is a great addition to SQL Server.
