Press "Enter" to skip to content

Category: Python

Learning about RDDs in Spark

Tomaz Kastrun continues a series on Spark. Part 7 ties in R and gives us sample plotting in R and Python:

Let’s look into the local use of Spark. For the R language, the sparklyr package is available, and for Python, pyspark is available.

Part 8 gets us into the key data structure behind Spark’s success, the Resilient Distributed Dataset:

Spark is created around the concept of resilient distributed datasets (RDD). An RDD is a fault-tolerant collection of elements that can be operated on in parallel. RDDs can be created in two ways:
– parallelising an existing data collection in the driver program
– referencing a dataset in external storage (HDFS, blob storage, a shared filesystem, a Hadoop InputFormat,…)

Put simply, a Spark RDD supports two types of operations:
– transformations – operations that create a new RDD on top of an already existing one
– actions – operations that run a computation on the dataset and return a value to the driver program
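As a rough local sketch of both creation routes and both operation types (this is not code from Tomaz's post, and the storage path is purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# 1) Parallelise an existing collection in the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2) Reference a dataset in external storage (the path here is made up)
# lines = sc.textFile("hdfs:///data/sample.txt")

# A transformation describes a new RDD built from an existing one...
squares = numbers.map(lambda x: x * x)

# ...while an action runs the computation and returns a value to the driver
print(squares.collect())   # [1, 4, 9, 16, 25]
print(squares.sum())       # 55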

Part 9 looks a bit more at transformations and actions:

Two types of operations are available with RDDs: transformations and actions. Transformations are lazy operations, meaning that they prepare the new RDD with every new operation but do not show or return anything. We can say that transformations are lazy because, rather than updating an existing RDD, these operations create another RDD. Actions, on the other hand, trigger the computation on the RDD and show (return) the result of the transformations.

Most modern work in Spark won’t directly use RDDs, though everything is built on top of them and it’s good to understand the foundation even if you don’t need to write all of those map(), fold(), and reduceByKey() operations yourself.
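If you do want to write a few of those yourself, a classic word count shows the lazy/eager split in a handful of lines; this is just a sketch on a local SparkContext:

from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

# Transformations only describe the computation...
words = sc.parallelize(["spark makes rdds", "rdds make spark"])
counts = (words.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

# ...nothing actually runs until an action asks for a result
print(counts.collect())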


SQL Server Backend for Django

Warren Chu announces a new version of the SQL Server 3rd Party Backend for Django:

We have released version 1.1 of the SQL Server 3rd Party Backend for Django. This release contains support for the upcoming release of Django 4.0, as well as a number of issue fixes.

Our plan is to time releases to coincide with major releases of Django and SQL Server, to ensure users of this project can keep up to date with Django while continuing to use SQL Server as a backend.

Read on to see what this entails.
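For reference, wiring the backend into a project is roughly a settings.py entry like the one below; the server, database, and credential values are placeholders, so check the project's documentation for the details that apply to your environment:

# settings.py (connection values are placeholders)
DATABASES = {
    "default": {
        "ENGINE": "mssql",                 # engine provided by the mssql-django package
        "NAME": "mydatabase",
        "HOST": "localhost",
        "PORT": "1433",
        "USER": "django_user",
        "PASSWORD": "change-me",
        "OPTIONS": {"driver": "ODBC Driver 17 for SQL Server"},
    }
}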


Solving Linear Constraints with Python

Luke Menzies and Gavita Regunath create a schedule:

Linear optimisation (often referred to as linear programming) is not cutting edge or new. It has been around for a very long time. It was first introduced within the field of operational research during World War II, where it was used to help minimise costs. The method proposed for solving these problems is known as the simplex method, and it hasn’t changed much to this day. Although the method itself hasn’t changed significantly, what has changed significantly is the computing power and accessibility of the technique, allowing it to be used on very complex scenarios with almost a click of a button. Convenient libraries have simplified the intricate complexities of setting these problems up on a computer.

Read on for an example of linear programming. This is something I’ve always enjoyed, but haven’t had many places to use this technique in my professional career. That said, shout out to everyone who’s ever used LINGO.
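If you want a quick taste of the technique without LINGO, here is a minimal sketch using scipy.optimize.linprog with a made-up objective and constraints (not the scheduling problem from the article):

from scipy.optimize import linprog

# Maximise 3x + 2y subject to x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
# linprog minimises, so the objective coefficients are negated.
c = [-3, -2]
A_ub = [[1, 1], [1, 3]]
b_ub = [4, 6]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x, -result.fun)   # the optimal (x, y) and the maximised objective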


Monotonic Constraints on Random Forests

Michael Mayer has some interesting R and Python code for us:

On ML competition platforms like Kaggle, complex and unintuitively behaving models dominate. In this respect, reality is completely different: there, the majority of models do not serve as pure prediction machines but rather as a fruitful source of information. Furthermore, even if used as a prediction machine, the users of the model might expect a certain degree of consistency when “playing” with input values.

A classic example are statistical house appraisal models. An additional bathroom or an additional square foot of ground area is expected to raise the appraisal, everything else being fixed (ceteris paribus). The user might lose trust in the model if the opposite happens.

One way to enforce such consistency is to monitor the signs of coefficients of a linear regression model. Another useful strategy is to impose monotonicity constraints on selected model effects.

Certain types of regression algorithm make this easy, but random forest? Not so much. That’s where Michael steps in.
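To give a flavour of what imposing such a constraint looks like, here is an illustrative sketch using LightGBM in its random forest mode on made-up appraisal-style data; it is not the setup from Michael's post:

import numpy as np
from lightgbm import LGBMRegressor

# Made-up data: column 0 = living area, column 1 = building age
rng = np.random.default_rng(42)
X = rng.uniform(size=(1000, 2))
y = 3 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=1000)

# boosting_type="rf" gives a random forest; monotone_constraints forces a
# non-decreasing effect of area (+1) and a non-increasing effect of age (-1)
model = LGBMRegressor(
    boosting_type="rf",
    n_estimators=500,
    subsample=0.63,        # bagging is required for rf mode
    subsample_freq=1,
    monotone_constraints=[1, -1],
)
model.fit(X, y)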


When to Start Using a Database with R or Python

Roel Hogervorst thinks about data sizes in R and Python:

Your dataset becomes so big and unwieldy that operations take a long time. How long is too long? That depends on you; I get annoyed if I don’t get feedback within 20 seconds (and I love it when a program shows me a progress bar at that point, because at least I know how long it will take!), but your boundary may lie at some other point. When you reach that point of annoyance, or the point of no longer being able to do your work, you should improve your workflow.

I will show you how to get some speedups by using other R packages, by moving from pandas to polars in Python, or by leveraging databases. I see some hesitancy about moving to a database for analytical work, and that is too bad. Bad for two reasons: one, it is super simple; two, it will save you a lot of time.

I definitely agree with Roel’s bottom line here. Granted, part of that is domain knowledge, but databases are extremely good at handling data and both languages have plenty of database accessibility.

One last tip, though: if you’re on the data science or data analytics track, learn SQL. Yes, libraries like dbplyr in R or ORMs in Python can cover up a lot, but that comes at a cost, typically in terms of performance. Building these skills will make your life considerably easier.
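As a small illustration of how low the barrier is, the sketch below pushes a pandas DataFrame into SQLite and lets the database do the aggregation; swap in DuckDB or Postgres and the idea stays the same:

import sqlite3
import pandas as pd

df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 30]})

# Write the data to a local SQLite file and aggregate in SQL
con = sqlite3.connect("sales.db")
df.to_sql("sales", con, if_exists="replace", index=False)

totals = pd.read_sql("SELECT store, SUM(sales) AS total FROM sales GROUP BY store", con)
print(totals)
con.close()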


GPU-Accelerated Analysis on Databricks using PyTorch + Huggingface

Srijith Rajamohan walks us through an example of sentiment analysis using the PyTorch and Huggingface libraries on Databricks:

Sentiment analysis is commonly used to analyze the sentiment present within a body of text, which could range from a review, an email or a tweet. Deep learning-based techniques are one of the most popular ways to perform such an analysis. However, these techniques tend to be very computationally intensive and often require the use of GPUs, depending on the architecture and the embeddings used. Huggingface (https://huggingface.co) has put together a framework with the transformers package that makes accessing these embeddings seamless and reproducible. In this work, I illustrate how to perform scalable sentiment analysis by using the Huggingface package within PyTorch and leveraging the ML runtimes and infrastructure on Databricks.

Click through for a description of the process, as well as a link to a notebook you can walk through yourself.
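If you just want the shortest possible taste of what the transformers package makes easy, independent of the Databricks setup in the post, something like this works (the default model is downloaded on first use):

from transformers import pipeline

# Uses a default pretrained sentiment model
classifier = pipeline("sentiment-analysis")
print(classifier(["I loved this notebook.", "The cluster kept crashing."]))
# e.g. [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]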


Document Classification in Python

Brendan Tierney performs a bit of document classification with scikit-learn and nltk:

Text mining is a popular way to explore the text you have in documents. Text mining and NLP can help you discover different patterns in the text, from uncovering certain words or phrases which are commonly used to identifying patterns and linkages between different texts/documents. Building on this text mining work, you can use word clouds, time-series analysis, and so on to discover other aspects and patterns in the text. Check out my previous blog posts (post 1, post 2) on performing text mining on documents (manifestos from some of the political parties from the last two national government elections in Ireland). These two posts give you a simple indication of what is possible.

We can build upon these text mining examples to include other machine learning algorithms, like those for classification. With classification, we want to predict or label a record or document as having a particular value. This could involve labeling a document as positive or negative (movie or book reviews), or determining whether a document belongs to a particular domain such as Technology, Sports, Entertainment, etc.

Click through for a walkthrough of this process.
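For a sense of the scikit-learn side of that workflow, a minimal bag-of-words classifier looks something like the sketch below, using toy documents rather than the manifesto corpus from the posts:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["the match went to extra time",
        "shares fell after the earnings call",
        "a thrilling final set at the open",
        "the central bank raised rates"]
labels = ["Sports", "Finance", "Sports", "Finance"]

# TF-IDF features feeding a naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["extra time in the final"]))   # likely ['Sports']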


DBScan for Clustering in Python

Brendan Tierney takes us through the DBScan algorithm:

Let’s illustrate the use of DBScan (Density Based Spatial Clustering of Applications with Noise), using the scikit-learn Python package, for a “manufactured” dataset. This example will illustrate how this density based algorithm works (See my other blog post which compares different Clustering algorithms for this same dataset). DBSCAN is better suited for datasets that have disproportional cluster sizes (or densities), and whose data can be separated in a non-linear fashion.

Click through for an interesting read on a dataset which is historically difficult to cluster (unless you know the general shape and translate everything to polar coordinates).
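That shape is easy to manufacture with scikit-learn if you want to see the difference for yourself; here is a quick sketch (not the exact dataset from the post) comparing k-means with DBSCAN on two concentric rings:

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles

# Two concentric rings: awkward for k-means, easy for a density-based method
X, _ = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# k-means splits the plane in half; DBSCAN recovers the two rings
print(set(kmeans_labels), set(dbscan_labels))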


Detecting Hard-to-Classify Data

Kaushal Mukherjee takes us through a new Python package:

The article explains the algorithm behind the recently introduced Python package named PyHard, based on the concept of Instance Space Analysis. It helps in assessing the quality of a dataset and identifying which instances are hard or easy to classify. With the help of this algorithm, we can separate out noisy instances. It also provides an interactive visualization tool for deep dives into the instance space.

Click through for the details. I’m going to wait for PyHard 2: PyHarder. Or maybe PyHardWithAVengeance. But it’ll all go downhill by the time we get to PyHard 5.
