Press "Enter" to skip to content

Category: Python

Learning about RDDs in Spark

Tomaz Kastrun continues a series on Spark. Part 7 ties in R and gives us sample plotting in R and Python:

Let’s look into the local use of Spark. For the R language, the sparklyr package is available, and for Python, pyspark is available.

Part 8 gets us into the key data structure behind Spark’s success, the Resilient Distributed Dataset:

Spark is created around the concept of resilient distributed datasets (RDD). An RDD is a fault-tolerant collection of elements that can be operated on in parallel. RDDs can be created in two ways:
– parallelising an existing data collection in the driver program
– referencing a dataset in external storage (HDFS, blob storage, a shared filesystem, a Hadoop InputFormat,…)

Put simply, a Spark RDD supports two types of operations:
– transformations – operations that create a new RDD on top of an already existing one
– actions – operations that run a computation on the dataset and return a value to the driver program
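As a rough local sketch of both creation routes and both operation types (this is not code from Tomaz's post, and the storage path is purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# 1) Parallelise an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2) Reference a dataset in external storage (the path here is made up)
# lines = sc.textFile("hdfs:///data/sample.txt")

# A transformation describes a new RDD built from an existing one...
squares = numbers.map(lambda x: x * x)

# ...while an action runs the computation and returns a value to the driver
print(squares.collect())   # [1, 4, 9, 16, 25]
print(squares.sum())       # 55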

Part 9 looks a bit more at transformations and actions:

Two types of operations are available with RDDs: transformations and actions. Transformations are lazy operations, meaning that they prepare the new RDD with every new operation but do not show or return anything. We can say that transformations are lazy because, rather than updating an existing RDD, these operations create another RDD. Actions, on the other hand, trigger the computation on the RDD and show (return) the result of the transformations.

Most modern work in Spark won’t directly use RDDs, though everything is built on top of them and it’s good to understand the foundation even if you don’t need to write all of those map(), fold(), and reduceByKey() operations yourself.
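If you do want to write a few of those yourself, a classic word count shows the lazy/eager split in a handful of lines; this is just a sketch on a local SparkContext:

from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

# Transformations only describe the computation...
words = sc.parallelize(["spark makes rdds", "rdds make spark"])
counts = (words.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

# ...nothing actually runs until an action asks for a result
print(counts.collect())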


SQL Server Backend for Django

Warren Chu announces a new version of the SQL Server 3rd Party Backend for Django:

We have released version 1.1 of the SQL Server 3rd Party Backend for Django. This release contains support for the upcoming release of Django 4.0, as well as a number of issue fixes.

Our plan is to time releases to coincide with major releases of Django and SQL Server, to ensure users of this project can keep up to date with Django while continuing to use SQL Server as a backend.

Read on to see what this entails.
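For reference, wiring the backend into a project is roughly a settings.py entry like the one below; the server, database, and credential values are placeholders, so check the project's documentation for the details that apply to your environment:

# settings.py (connection values are placeholders)
DATABASES = {
    "default": {
        "ENGINE": "mssql",                 # engine provided by the mssql-django package
        "NAME": "mydatabase",
        "HOST": "localhost",
        "PORT": "1433",
        "USER": "django_user",
        "PASSWORD": "change-me",
        "OPTIONS": {"driver": "ODBC Driver 17 for SQL Server"},
    }
}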


Solving Linear Constraints with Python

Luke Menzies and Gavita Regunath create a schedule:

Linear optimisation (often referred to as linear programming) is not cutting edge or new. It has been around for a very long time. It was first introduced within the field of operational research during World War II, where it was used to help minimise costs. The method proposed for solving these problems is known as the simplex method, and it hasn’t changed much to this day. Although the method itself hasn’t changed significantly, what has changed significantly is the computing power and accessibility of the technique, allowing it to be used on very complex scenarios with almost a click of a button. Convenient libraries have simplified the intricate complexities of setting these problems up on a computer.

Read on for an example of linear programming. This is something I’ve always enjoyed, but haven’t had many places to use this technique in my professional career. That said, shout out to everyone who’s ever used LINGO.
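If you want a quick taste of the technique without LINGO, here is a minimal sketch using scipy.optimize.linprog with a made-up objective and constraints (not the scheduling problem from the article):

from scipy.optimize import linprog

# Maximise 3x + 2y subject to x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
# linprog minimises, so the objective coefficients are negated.
c = [-3, -2]
A_ub = [[1, 1], [1, 3]]
b_ub = [4, 6]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x, -result.fun)   # the optimal (x, y) and the maximised objective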


Monotonic Constraints on Random Forests

Michael Mayer has some interesting R and Python code for us:

On ML competition platforms like Kaggle, complex and unintuitively behaving models dominate. In this respect, reality is completely different: there, the majority of models do not serve as pure prediction machines but rather as a fruitful source of information. Furthermore, even if used as a prediction machine, the users of the model might expect a certain degree of consistency when “playing” with input values.

A classic example are statistical house appraisal models. An additional bathroom or an additional square foot of ground area is expected to raise the appraisal, everything else being fixed (ceteris paribus). The user might lose trust in the model if the opposite happens.

One way to enforce such consistency is to monitor the signs of coefficients of a linear regression model. Another useful strategy is to impose monotonicity constraints on selected model effects.

Certain types of regression algorithm make this easy, but random forest? Not so much. That’s where Michael steps in.
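To give a flavour of what imposing such a constraint looks like, here is an illustrative sketch using LightGBM in its random forest mode on made-up appraisal-style data; it is not the setup from Michael's post:

import numpy as np
from lightgbm import LGBMRegressor

# Made-up data: column 0 = living area, column 1 = building age
rng = np.random.default_rng(42)
X = rng.uniform(size=(1000, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=1000)

# boosting_type="rf" gives a random forest; monotone_constraints forces a
# non-decreasing effect of area (+1) and a non-increasing effect of age (-1)
model = LGBMRegressor(
    boosting_type="rf",
    n_estimators=500,
    subsample=0.63,        # bagging is required for rf mode
    subsample_freq=1,
    monotone_constraints=[1, -1],
)
model.fit(X, y)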


When to Start Using a Database with R or Python

Roel Hogervorst thinks about data sizes in R and Python:

Your dataset becomes so big and unwieldy that operations take a long time. How long is too long? That depends on you; I get annoyed if I don’t get feedback within 20 seconds (and I love it when a program shows me a progress bar at that point, because at least I know how long it will take!), but your boundary may lie at some other point. When you reach that point of annoyance, or the point of no longer being able to do your work, you should improve your workflow.

I will show you how to get some speedups by using other R packages, by moving from pandas to polars in Python, or by leveraging databases. I see some hesitancy about moving to a database for analytical work, and that is too bad. Bad for two reasons: one, it is super simple; two, it will save you a lot of time.

I definitely agree with Roel’s bottom line here. Granted, part of that is domain knowledge, but databases are extremely good at handling data and both languages have plenty of database accessibility.

One last tip, though: if you’re on the data science or data analytics track, learn SQL. Yes, libraries like dbplyr in R or ORMs in Python can cover up a lot, but that comes at a cost, typically in terms of performance. Building these skills will make your life considerably easier.
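As a small illustration of how low the barrier is, the sketch below pushes a pandas DataFrame into SQLite and lets the database do the aggregation; swap in DuckDB or Postgres and the idea stays the same:

import sqlite3
import pandas as pd

df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 30]})

# Write the data to a local SQLite file and aggregate in SQL
con = sqlite3.connect("sales.db")
df.to_sql("sales", con, if_exists="replace", index=False)

totals = pd.read_sql("SELECT store, SUM(sales) AS total FROM sales GROUP BY store", con)
print(totals)
con.close()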


GPU-Accelerated Analysis on Databricks using PyTorch + Huggingface

Srijith Rajamohan walks us through an example of sentiment analysis using the PyTorch and Huggingface libraries on Databricks:

Sentiment analysis is commonly used to analyze the sentiment present within a body of text, which could range from a review, an email or a tweet. Deep learning-based techniques are one of the most popular ways to perform such an analysis. However, these techniques tend to be very computationally intensive and often require the use of GPUs, depending on the architecture and the embeddings used. Huggingface (https://huggingface.co) has put together a framework with the transformers package that makes accessing these embeddings seamless and reproducible. In this work, I illustrate how to perform scalable sentiment analysis by using the Huggingface package within PyTorch and leveraging the ML runtimes and infrastructure on Databricks.

Click through for a description of the process, as well as a link to a notebook you can walk through yourself.
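If you just want the shortest possible taste of what the transformers package makes easy, independent of the Databricks setup in the post, something like this works (the default model is downloaded on first use):

from transformers import pipeline

# Uses a default pretrained sentiment model
classifier = pipeline("sentiment-analysis")
print(classifier(["I loved this notebook.", "The cluster kept crashing."]))
# e.g. [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]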


Document Classification in Python

Brendan Tierney performs a bit of document classification with scikit-learn and nltk:

Text mining is a popular way to explore the text you have in documents. Text mining and NLP can help you discover different patterns in the text, from uncovering certain words or phrases which are commonly used to identifying patterns and linkages between different texts/documents. Building on this text mining work, you can use word clouds, time-series analysis, and so on to discover other aspects and patterns in the text. Check out my previous blog posts (post 1, post 2) on performing text mining on documents (manifestos from some of the political parties from the last two national government elections in Ireland). These two posts give you a simple indication of what is possible.

We can build upon these text mining examples to include other machine learning algorithms, like those for classification. With classification, we want to predict or label a record or document as having a particular value. This could involve labeling a document as positive or negative (movie or book reviews), or determining whether a document belongs to a particular domain such as Technology, Sports, Entertainment, etc.

Click through for a walkthrough of this process.
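For a sense of the scikit-learn side of that workflow, a minimal bag-of-words classifier looks something like the sketch below, using toy documents rather than the manifesto corpus from the posts:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["the match went to extra time",
        "shares fell after the earnings call",
        "a thrilling final set at the open",
        "the central bank raised rates"]
labels = ["Sports", "Finance", "Sports", "Finance"]

# TF-IDF features feeding a naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["extra time in the final"]))   # likely ['Sports']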


DBScan for Clustering in Python

Brendan Tierney takes us through the DBScan algorithm:

Let’s illustrate the use of DBScan (Density Based Spatial Clustering of Applications with Noise), using the scikit-learn Python package, for a “manufactured” dataset. This example will illustrate how this density based algorithm works (See my other blog post which compares different Clustering algorithms for this same dataset). DBSCAN is better suited for datasets that have disproportional cluster sizes (or densities), and whose data can be separated in a non-linear fashion.

Click through for an interesting read on a dataset which is historically difficult to cluster (unless you know the general shape and translate everything to polar coordinates).
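That shape is easy to manufacture with scikit-learn if you want to see the difference for yourself; here is a quick sketch (not the exact dataset from the post) comparing k-means with DBSCAN on two concentric rings:

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles

# Two concentric rings: awkward for k-means, easy for a density-based method
X, _ = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# k-means splits the plane in half; DBSCAN recovers the two rings
print(set(kmeans_labels), set(dbscan_labels))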


Detecting Hard-to-Classify Data

Kaushal Mukherjee takes us through a new Python package:

The article explains the algorithm behind the recently introduced Python package named PyHard, based on the concept of Instance Space Analysis. It helps in assessing the quality of a dataset and identifying which instances are hard or easy to classify. With the help of this algorithm, we can separate out noisy instances. It also provides an interactive visualization tool for deep dives into the instance space.

Click through for the details. I’m going to wait for PyHard 2: PyHarder. Or maybe PyHardWithAVengeance. But it’ll all go downhill by the time we get to PyHard 5.
