Press "Enter" to skip to content

Category: Python

Finding Near-Duplicates in a Corpus

Estelle Wang de-dupes text data:

Building a large high-quality corpus for Natural Language Processing (NLP) is not for the faint of heart. Text data can be large, cumbersome, and unwieldy, and unlike clean numbers or categorical data in rows and columns, discerning differences between documents can be challenging. In organizations where documents are shared, modified, and shared again before being saved in an archive, the problem of duplication can become overwhelming.

To find exact duplicates, matching all string pairs is the simplest approach, but it is neither efficient nor sufficient. Using the MD5 or SHA-1 hash algorithms gets us a correct outcome much faster, yet near-duplicates would still not be on the radar. Text similarity is useful for finding files that look alike. There are various approaches, and each has its own way of defining which documents are considered duplicates. That definition, in turn, has implications for the type of processing and the results produced. Below are some of the options.

Click through for solutions in SAS.
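The linked solutions are in SAS, but the ideas translate readily to Python. As a rough sketch (the sample documents and shingle size below are invented for illustration), exact duplicates can be caught by hashing each document, while near-duplicates need a similarity measure such as Jaccard similarity over character shingles:

    import hashlib

    def md5_fingerprint(text):
        # Exact-duplicate check: identical byte content gives identical hashes.
        return hashlib.md5(text.encode("utf-8")).hexdigest()

    def shingles(text, k=5):
        # Character k-shingles: overlapping substrings of length k, whitespace-normalized.
        text = " ".join(text.lower().split())
        return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

    def jaccard(a, b):
        # Near-duplicate check: overlap of the shingle sets, between 0.0 and 1.0.
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / len(sa | sb)

    docs = ["The quick brown fox", "The quick brown fox.", "Something else entirely"]
    print(md5_fingerprint(docs[0]) == md5_fingerprint(docs[1]))  # False: hashing misses near-duplicates
    print(round(jaccard(docs[0], docs[1]), 2))                   # ~0.94: flagged as a near-duplicate

At corpus scale you would not compare every pair directly; techniques such as MinHash with locality-sensitive hashing approximate the same Jaccard comparison without the quadratic blow-up.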


Deploying a Streamlit App to RStudio Connect

Parisa Gregg wraps up a series:

RStudio Connect is a platform well known for providing the ability to deploy and share R applications such as Shiny apps and Plumber APIs, as well as plots, models, and R Markdown reports. However, despite the name, it is not just for R developers (hence their recent announcement). RStudio Connect also supports a growing number of Python applications, API services including Flask and FastAPI, and interactive web-based apps such as Bokeh and Streamlit.

In this post we will look at how to deploy a Streamlit application to RStudio Connect. Streamlit is a framework for creating interactive web apps for data visualisation in Python. Its API makes it very easy and quick to display data and create interactive widgets from just a regular Python script.
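For a sense of how little code that takes, here is a minimal, hypothetical Streamlit script (the data and widgets are placeholders, not the app from the linked post); you would run it with streamlit run app.py:

    # app.py -- a minimal Streamlit sketch with made-up data
    import pandas as pd
    import streamlit as st

    st.title("Monthly sales")

    # Stand-in for whatever you would normally load from a file or database
    sales = pd.DataFrame({
        "product": ["A", "B", "C", "A", "B"],
        "units": [10, 4, 7, 3, 8],
    })

    product = st.selectbox("Product", sorted(sales["product"].unique()))
    st.dataframe(sales[sales["product"] == product])
    st.bar_chart(sales.groupby("product")["units"].sum())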

Click through for the step-by-step process.


RStudio Connect and Python’s FastAPI

Parisa Gregg continues a series on RStudio Connect and Python:

FastAPI is a light web framework and, as you can probably tell by the name, it’s fast. It provides similar functionality to Flask in that it allows the building of web applications and APIs; however, it is newer and uses the ASGI (Asynchronous Server Gateway Interface) standard. One of the nice features of FastAPI is that it is built on the OpenAPI and JSON Schema standards, which means it can provide automatic interactive API documentation with Swagger UI. You also get validation for most Python data types with Pydantic. FastAPI is therefore another popular choice for data scientists when creating APIs to interact with and visualize data.

In this blog post we will go through how to deploy a simple machine learning API to RStudio Connect.
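To give a rough idea of what such an API looks like, here is a minimal FastAPI sketch with a placeholder "model" (the endpoint names and scoring logic are invented; the linked post deploys a real trained model). Running it with uvicorn main:app serves the API and exposes the Swagger UI documentation at /docs:

    # main.py -- a minimal FastAPI sketch; the "model" is a stand-in
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Features(BaseModel):
        # Pydantic validates incoming JSON against these types
        sepal_length: float
        sepal_width: float

    @app.get("/")
    def health():
        return {"status": "ok"}

    @app.post("/predict")
    def predict(features: Features):
        # A real app would load a trained model and call its predict method here
        score = 0.5 * features.sepal_length + 0.25 * features.sepal_width
        return {"prediction": score}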

I’ve taken pretty well to FastAPI for rapid API development. I haven’t had to worry about scaling it out too much, so I’m not sure how well that works in practice. Still, for single-user or few-user apps, FastAPI definitely works well.


Developing a Flask App with RStudio Connect

Parisa Gregg crosses the language barrier:

One of the Python applications you can deploy to RStudio Connect is Flask. Flask is a WSGI (Web Server Gateway Interface) web application framework that provides a Python interface for building web APIs. It is useful to data scientists for building, for example, interactive web dashboards and visualisations of data, as well as APIs for machine learning models. Deploying a Flask app to a publishing platform such as RStudio Connect means it can then be used from anywhere and easily shared with clients.

This blog post focuses on how to deploy a Flask app to RStudio Connect. We will use a simple example but won’t go into detail on how to create Flask apps. If you are getting started in Flask you may find this tutorial useful.
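For context, a minimal Flask API looks something like the sketch below (the routes are hypothetical, not the example from the post); running the script serves the endpoints locally:

    # app.py -- a minimal Flask API sketch
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/health")
    def health():
        return jsonify(status="ok")

    @app.route("/square", methods=["POST"])
    def square():
        # Expects JSON like {"x": 3} and returns the squared value
        payload = request.get_json(force=True)
        return jsonify(result=payload["x"] ** 2)

    if __name__ == "__main__":
        app.run(debug=True)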

Read on for a demo.


Trying out Shiny Python

Jamie Owen kicks the tires on Py-shiny:

We would posit (see what we did there) that R-{shiny} has been a boon for data science practitioners using the R language over the last decade. We know that in our Python work, we have certainly been clamouring for something of the same ilk. And whilst there are other frameworks that we also like, streamlit and dash to name a couple, neither of them has filled us with the same excitement and confidence that shiny did in R for building both simple and complex bespoke web applications. With RStudio Posit conf in action, the big news from July 27th was the alpha release of Py-{shiny}, which was a source of great interest for us, so we couldn’t resist installing it and starting to build.

If you are familiar with R-shiny already, then much of the py-shiny package will feel familiar to you (albeit with a couple of things having been renamed). However, we will approach the rest of this post assuming that the reader does not have that prior experience, and take you through building a simple shiny application to display plots on subsets of a dataset.
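To give a flavour of the syntax, here is a minimal py-shiny sketch (not the plotting app the post builds, and the decorator details have shifted between releases, so treat it as illustrative); it runs with shiny run app.py:

    # app.py -- a minimal py-shiny sketch
    from shiny import App, render, ui

    app_ui = ui.page_fluid(
        ui.input_slider("n", "Number of points", min=10, max=200, value=50),
        ui.output_text_verbatim("summary"),
    )

    def server(input, output, session):
        @output
        @render.text
        def summary():
            # Reactive: re-runs whenever the slider value changes
            return f"You asked for {input.n()} points."

    app = App(app_ui, server)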

I’m curious how much take-up there will be for the library, given that there are several good competitors in the Python space.


Case-Sensitive String Comparisons and Case-Insensitive Tables

Meagan Longoria reminds us that case sensitivity was a huge mistake:

Here’s the scenario: You are using Python, perhaps in Azure Databricks, to manipulate data before inserting it into a SQL Database. Your source data is a flattened data extract and you need to create a unique list of values for an entity found in the data. For example, you have a dataset containing sales for the last month and you want a list of the unique products that have been sold. Then you insert the unique product values into a SQL table with a unique constraint, but you encounter issues on the insert related to unique values.

Click through for an example and how to extricate yourself from this scenario. Python certainly is not the only language to do this, so it’s good to know even if you don’t plan on using or supporting Python.
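Not the fix from the post, but the shape of the problem (and one common way out) fits in a few lines: Python’s set treats differently-cased strings as distinct values, while a case-insensitive unique constraint does not, so normalizing case before building the unique list avoids the collision.

    products = ["Widget", "widget", "WIDGET", "Gadget"]

    python_unique = set(products)                         # 4 distinct values to Python
    case_insensitive = {p.casefold() for p in products}   # 2 distinct values to a CI constraint
    print(len(python_unique), len(case_insensitive))      # 4 2 -- the insert would hit duplicates

    # One workaround: pick a canonical casing before inserting into the case-insensitive table
    canonical = sorted({p.casefold() for p in products})
    print(canonical)                                      # ['gadget', 'widget']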


Bulk Insert into Azure SQL DB using Python

Jose Manuel Jurado Diaz shares some customer notes:

Today, I’ve been working on a service request in which our customer wants to improve the performance of a bulk insert process. In what follows, I would like to share my experience working on that.

Our customer mentioned that inserting data (100,000 rows) takes 14 seconds in a database in the Business Critical tier. I was able to reproduce this timing using a single thread against a table with 20 columns.

A lot of this advice also applies to on-premises SQL Server and relates to using bulk inserts and picking good batch sizes. Similar advice to what we’d be doing with SQL Server Integration Services or any other ETL/ELT process, tailored to Python.
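For a rough idea of what that looks like in Python, here is a hedged sketch using pyodbc with fast_executemany and explicit batches (the connection string, table name, and batch size are placeholders, and the post’s exact approach may differ):

    # Batched inserts with pyodbc; tune batch_size for your row width and service tier
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=your-server.database.windows.net;"
        "DATABASE=your_db;UID=your_user;PWD=your_password"
    )
    cursor = conn.cursor()
    cursor.fast_executemany = True  # send parameter arrays instead of row-by-row round trips

    rows = [(i, f"product {i}") for i in range(100_000)]  # stand-in data
    batch_size = 10_000

    for start in range(0, len(rows), batch_size):
        cursor.executemany(
            "INSERT INTO dbo.Sales (Id, ProductName) VALUES (?, ?)",
            rows[start:start + batch_size],
        )
        conn.commit()

    conn.close()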


Anomaly Detection over Delta Live Tables

Avinash Sooriyarachchi and Sathish Gangichetty show off an interesting scenario:

Anomaly detection poses several challenges. The first is the data science question of what an ‘anomaly’ looks like. Fortunately, machine learning has powerful tools to learn how to distinguish usual from anomalous patterns in data. In the case of anomaly detection, it is impossible to know what all anomalies look like, so it’s impossible to label a data set for training a machine learning model, even if resources for doing so are available. Thus, unsupervised learning has to be used to detect anomalies, where patterns are learned from unlabelled data.

Even with the perfect unsupervised machine learning model for anomaly detection figured out, in many ways, the real problems have only begun. What is the best way to put this model into production such that each observation is ingested, transformed and finally scored with the model, as soon as the data arrives from the source system? That too, in a near real-time manner or at short intervals, e.g. every 5-10 minutes? This involves building a sophisticated extract, load, and transform (ELT) pipeline and integrating it with an unsupervised machine learning model that can correctly identify anomalous records. Also, this end-to-end pipeline has to be production-grade, always running while ensuring data quality from ingestion to model inference, and the underlying infrastructure has to be maintained.

Click through to see their solution using Databricks and Delta Lake.
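Their pipeline runs on Databricks, but the unsupervised scoring step itself can be sketched in a few lines; the example below uses scikit-learn’s IsolationForest on made-up data (an assumed library choice, not necessarily what the post uses). In production, the same fitted model would then score each micro-batch as it arrives.

    # Unsupervised anomaly detection on toy data with an isolation forest
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(42)
    normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))   # "usual" observations
    outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))   # a handful of anomalous points
    X = np.vstack([normal, outliers])

    model = IsolationForest(contamination=0.01, random_state=0).fit(X)
    labels = model.predict(X)  # 1 = inlier, -1 = anomaly

    print("flagged as anomalies:", int((labels == -1).sum()))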


Understanding Decision Trees

Durgesh Gupta provides a primer on the humble decision tree:

A decision tree is a graphical representation of all possible solutions to a decision.

The objective of using a decision tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from training data.

It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.

The way I like to describe decision trees, especially to developers, is that a tree is a set of if-else statements that leads to a conclusion. The nice part about decision trees is that once you understand how they work, you’re halfway to gradient boosting (e.g., XGBoost) and random forests.
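That if-else framing is easy to see by printing a fitted tree’s rules; here is a small scikit-learn sketch on the iris data set (my own example, not one from the post):

    # Fit a shallow decision tree and print its rules as nested if-else conditions
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

    print(export_text(tree, feature_names=list(iris.feature_names)))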


Apache eCharts for Python

Mark Litwintschik looks at another charting library:

The Apache eCharts project is a web-based charting library. It was started in 2013 and built using 77.5K lines of TypeScript. It is well documented and has over 200 examples of its API’s usage. The examples allow you to toggle between light and dark mode, and there is a cheat sheet and a theme builder with several tasteful presets to choose from.

This is a library I hadn’t heard of before, but Mark shows it off a bit.
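One way to drive eCharts from Python is the pyecharts package (an assumption on my part; the linked post may take a different route). A minimal sketch that writes an interactive chart to a standalone HTML file:

    # Build a simple bar chart with pyecharts and render it to HTML
    from pyecharts import options as opts
    from pyecharts.charts import Bar

    chart = (
        Bar()
        .add_xaxis(["Mon", "Tue", "Wed", "Thu", "Fri"])
        .add_yaxis("Sessions", [120, 200, 150, 80, 170])
        .set_global_opts(title_opts=opts.TitleOpts(title="Sessions per day"))
    )

    chart.render("sessions.html")  # open in a browser to view the interactive chart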
