Press "Enter" to skip to content

Multi-Class Text Classification In Python

Susan Li has a series on multi-class text classification in Python.  First up is analysis with PySpark:

Our task is to classify San Francisco Crime Description into 33 pre-defined categories. The data can be downloaded from Kaggle.

When a new crime description comes in, we want to assign it to one of the 33 categories. The classifier makes the assumption that each new crime description is assigned to one and only one category. This is a multi-class text classification problem.

    • Input: Descript
    • Example: “STOLEN AUTOMOBILE”
    • Output: Category
    • Example: VEHICLE THEFT

To solve this problem, we will use a variety of feature extraction techniques along with different supervised machine learning algorithms in Spark. Let’s get started!
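
To give a flavor of what this looks like, here is a minimal sketch (mine, not Susan Li's exact code) of a PySpark classification pipeline over the Descript and Category columns; the file name and parameter values are illustrative:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, StringIndexer
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("crime-classification").getOrCreate()

# Load the Kaggle data; "train.csv" is a placeholder for wherever it lives.
df = spark.read.csv("train.csv", header=True, inferSchema=True).select("Descript", "Category")

# Tokenize the description, drop stop words, and build term-count features.
tokenizer = RegexTokenizer(inputCol="Descript", outputCol="words", pattern="\\W")
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features")

# Turn the 33 category strings into numeric labels and fit a classifier.
indexer = StringIndexer(inputCol="Category", outputCol="label")
lr = LogisticRegression(maxIter=20, regParam=0.3)

pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, indexer, lr])
train, test = df.randomSplit([0.7, 0.3], seed=42)
predictions = pipeline.fit(train).transform(test)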

Then, she looks at multi-class text classification with scikit-learn:

The classifiers and learning algorithms cannot directly process the text documents in their original form, as most of them expect numerical feature vectors with a fixed size rather than raw text documents with variable length. Therefore, during the preprocessing step, the texts are converted to a more manageable representation.

One common approach for extracting features from the text is to use the bag of words model: a model where for each document, a complaint narrative in our case, the presence (and often the frequency) of words is taken into consideration, but the order in which they occur is ignored.

Specifically, for each term in our dataset, we will calculate a measure called Term Frequency-Inverse Document Frequency, abbreviated to tf-idf.
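
As a rough sketch of that preprocessing step in scikit-learn (the complaint texts and parameter values below are illustrative, not taken from her article):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder complaint narratives and product labels.
texts = ["I was billed twice for the same loan payment",
         "A debt collector calls me at work repeatedly"]
labels = ["Consumer Loan", "Debt collection"]

# Bag of words with tf-idf weighting; word order is ignored.
vectorizer = TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2), stop_words="english")
model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
model.fit(texts, labels)
print(model.predict(["they keep calling about a debt"]))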

This is a nice pair of articles on the topic.  Natural Language Processing (and dealing with text in general) is one place where Python is well ahead of R in terms of functionality and ease of use.

XGBoost With Python

Fisseha Berhane looked at Extreme Gradient Boosting with R and now covers it in Python:

In both R and Python, the default base learners are trees (gbtree) but we can also specify gblinear for linear models and dart for both classification and regression problems.
In this post, I will optimize only three of the parameters shown above; you can try optimizing the other parameters. You can see the list of parameters and their details on the website.
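
For anyone who hasn't used the Python package, a minimal sketch of how the booster gets specified (synthetic data; the parameter values are illustrative, not tuned):

import numpy as np
import xgboost as xgb

# Synthetic binary classification data.
rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = (X[:, 0] + X[:, 1] > 1).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "booster": "gbtree",          # or "gblinear" / "dart"
    "objective": "binary:logistic",
    "max_depth": 4,
    "eta": 0.1,
}
model = xgb.train(params, dtrain, num_boost_round=50)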

It’s hard to overstate just how valuable XGBoost is as an algorithm.

Using Python In SQL Server 2017

Emma Stewart has a post covering setup and configuration of SQL Server 2017 Machine Learning Services and using Python within SQL Server:

One of the new features of SQL Server 2017 was the ability to execute Python Scripts within SQL Server. For anyone who hasn’t heard of Python, it is the language of choice for data analysis. It has a lot of libraries for data analysis and predictive modelling, offers power and flexibility for various machine learning tasks and is also a much simpler language to learn than others.

The release of SQL Server 2016 saw the integration of the database engine with R Services, supporting the R data science language. By extending this support to Python, Microsoft have renamed R Services to ‘Machine Learning Services’ to include both R and Python.

The benefits of being able to run Python from SQL Server are that you can keep analytics close to the data (if your data is held within a SQL Server database) and reduce any unnecessary data movement. In a production environment you can simply execute your Python solution via a T-SQL Stored Procedure and you can also deploy the solution using the familiar development tool, Visual Studio.
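
The hook for that stored procedure approach is sp_execute_external_script. A minimal sketch of a call (the input query is a placeholder):

-- Requires Machine Learning Services with Python installed and
-- the "external scripts enabled" configuration option set to 1.
EXEC sp_execute_external_script
    @language = N'Python',
    @script = N'
OutputDataSet = InputDataSet
print("Rows received: " + str(len(InputDataSet)))
',
    @input_data_1 = N'SELECT 42 AS answer';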

ML Services is a great addition to SQL Server.

Microsoft ML Server 9.3 Released

Nagesh Pabbisetty announces Microsoft Machine Learning Server 9.3:

In ML Server 9.3, we have added support for SQL compute context in ML Server and in R Client running on Linux platforms, so data scientists who work on Linux workstations can directly use in-database analytics with SQL Server compute context. Additionally, the SQLRUtils package can now be used to package R scripts into T-SQL stored procedures and run them from the R environment on Linux clients.

An interesting scenario enabled by the addition of SQL Server compute context in ML Server running on Linux is that organizations can now provide a browser-based interface for in-database analytics: RStudio Server and ML Server run on a Linux machine which connects to SQL Server.

Since introducing the revoscalepy library in the last release of ML Server and SQL Server 2017, we have shipped several additions and improvements in the Python APIs as part of CU releases of SQL Server 2017. We have added APIs like rx_create_col_info and rx_get_var_info that make it easier to get column information, especially with a large number of columns. We added rx_serialize_model for easy model serialization. We have also improved performance when working with string data in different scenarios.
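
A hedged sketch of what a couple of those APIs look like in use; I'm going from the revoscalepy documentation here, so double-check signatures against your installed version:

import pandas as pd
from revoscalepy import rx_lin_mod, rx_get_var_info, rx_serialize_model

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [2.1, 3.9, 6.2, 8.1]})

# Inspect column metadata without scanning all of the data.
print(rx_get_var_info(df))

# Fit a simple model and serialize it, e.g. for storage in a SQL Server table.
model = rx_lin_mod("y ~ x", data=df)
serialized_bytes = rx_serialize_model(model, realtime_scoring_only=False)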

This also gets you up to R 3.4.3. H/T David Smith.

Looping In Python And R

Dmitry Kisler has a quick comparison of looping speed in Python and R:

This post is about R versus Python in terms of the time they require to loop and generate pseudo-random numbers. To accomplish the task, the following steps were performed in Python and R:

    1. Loop 100k times (ii is the loop index)
    2. Generate a random integer number out of the array of integers from 1 to the current loop index ii (ii+1 for Python)
    3. Output elapsed time at the probe loop steps: ii (ii+1 for Python) in [10, 100, 1000, 5000, 10000, 25000, 50000, 75000, 100000]
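
The Python half of that procedure looks something like this sketch (my reconstruction from the description above, not Dmitry's exact code):

import random
import time

# Probe steps at which elapsed time gets reported.
probes = {10, 100, 1000, 5000, 10000, 25000, 50000, 75000, 100000}

start = time.time()
for ii in range(100000):
    value = random.randint(1, ii + 1)   # random integer from 1 to ii+1
    if ii + 1 in probes:
        print(f"{ii + 1} iterations: {time.time() - start:.4f} s")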

The findings were mostly unsurprising to me, though there was one unexpected twist.

Visual Studio Code In Anaconda 5.1

George Leopold reports that Anaconda 5.1 will now include Visual Studio Code as an optional IDE:

Microsoft and Python data science platform vendor Anaconda have extended their partnership by adding the software giant’s code editor to the latest Anaconda distribution.

The addition of Microsoft’s Visual Studio Code (VS Code) expands its support for the latest release of the Python data science platform, Anaconda 5.1. The Python platform has attracted more than 4.5 million users running the programming language on Windows, Mac and Linux.

Along with editing and debugging features, the partners said the cross-platform code editor includes custom features for Anaconda users. For example, a Python extension customizes VS Code for the Python development environment.

Read on for more information.

Optimal Image Colorization With Python

Sandipan Dey walks through a paper on colorization and shows some examples:

Colorization is a computer-assisted process of adding color to a monochrome image or movie. In the paper the authors presented an optimization-based colorization method that is based on a simple premise: neighboring pixels in space-time that have similar intensities should have similar colors.

This premise is formulated using a quadratic cost function and posed as an optimization problem. In this approach, an artist only needs to annotate the image with a few color scribbles, and the indicated colors are automatically propagated in both space and time to produce a fully colorized image or sequence.

In this article, the formulation of the optimization problem and the way to solve it to obtain the automatically colorized image will be described, for still images only.
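
For reference, the quadratic cost in question (this looks to be the Levin, Lischinski, and Weiss "Colorization Using Optimization" formulation) has roughly this form, where U(r) is the chrominance at pixel r, Y(r) its intensity, N(r) its neighbors, and w_rs weights that are large when neighboring intensities are similar:

J(U) = \sum_{r} \Big( U(r) - \sum_{s \in N(r)} w_{rs}\, U(s) \Big)^2,
\qquad
w_{rs} \propto \exp\!\left( -\frac{(Y(r) - Y(s))^2}{2\sigma_r^2} \right)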

It’s an interesting approach.

PySpark DataFrame Transformations

Vincent-Philippe Lauzon shows how to perform data frame transformations using PySpark:

We wanted to look at some more Data Frames with a bigger data set and, more precisely, some transformation techniques.  We often say that most of the leg work in machine learning is data cleansing.  Similarly, we can affirm that a clever & insightful aggregation query performed on a large dataset can only be executed after a considerable amount of work has gone into formatting, filtering & massaging the data:  data wrangling.

Here, we’ll look at an interesting dataset, the H-1B Visa Petitions 2011-2016 (from Kaggle) and find some good insights with just a few queries, but also some data wrangling.

It is important to note that just about everything in this article isn’t specific to Azure Databricks and would work with any distribution of Apache Spark.

The notebook used for this article is persisted on GitHub.
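
As a taste of the kind of wrangling involved, a minimal sketch against the Kaggle file (the file name and column names follow the H-1B dataset, but treat them as assumptions):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("h1b-wrangling").getOrCreate()

# Load the Kaggle H-1B petitions file; the name is a placeholder.
df = spark.read.csv("h1b_kaggle.csv", header=True, inferSchema=True)

# One example insight: top employers among certified petitions.
(df.filter(F.col("CASE_STATUS") == "CERTIFIED")
   .groupBy("EMPLOYER_NAME")
   .count()
   .orderBy(F.desc("count"))
   .limit(10)
   .show(truncate=False))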

Read on for explanation, or check out the notebook to work on it at your own pace.

R Or Python

Tomaz Kastrun shares his thoughts on the topic of R versus Python:

Imagine I ask you, would you prefer an Apple iPhone over a Samsung Galaxy? Or if I were to ask you, would you prefer a BMW over an Audi? In all these cases, both phones or both cars will get the job done. So will Python or R, R or Python. So instead of asking which one I prefer, ask yourself which one suits your environment better. If your background is more statistics and less programming, take R; if you are more into programming and less into statistics, take Python. In both cases you will accomplish results faster with your preferred language. If you ask me, can I do gradient boosting or ANOVA or MDS in Python or in R, the answer will be yes: you can do both in either of the languages.

This graf hits the crux of my opinion, but as I’ve gone deeper into the topic over the past year, I think the correct answer is probably “both” for a mature organization and “pick the one which suits you better” for beginners.

Leveraging Hive In PySpark

Fisseha Berhane shows how to use Spark to connect Python to Hive:

If we are using earlier Spark versions, we have to use HiveContext, which is a variant of Spark SQL that integrates with data stored in Hive. Even when we do not have an existing Hive deployment, we can still enable Hive support.
In this tutorial, I am using standalone Spark. When not configured by hive-site.xml, the context automatically creates metastore_db in the current directory.

As shown below, initially we do not have metastore_db, but after we instantiate a SparkSession with Hive support, we see that metastore_db has been created. Further, when we execute a create database command, spark-warehouse is created.
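
In case you want to try it before clicking through, a minimal sketch of the Hive-enabled session (standalone Spark with no hive-site.xml, per the excerpt):

from pyspark.sql import SparkSession

# With no hive-site.xml, this creates metastore_db (and, once a database
# is created, spark-warehouse) in the current directory.
spark = (SparkSession.builder
         .appName("hive-example")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")
spark.sql("SHOW DATABASES").show()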

Click through for a bunch of examples.
