Category: Python

Re-Introducing rquery

John Mount has a new introduction to rquery:

rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of R’s base::transform(), or dplyr’s dplyr::mutate() and uses a pipe in the style popularized in R with magrittr. The operators themselves follow the selections in Codd’s relational algebra, with the addition of the traditional SQL “window functions.” More on the background and context of rquery can be found here.

The R/rquery version of this introduction is here, and the Python/data_algebra version of this introduction is here.

Check it out.

A New Notebook Tool: Polynote

Jeremy Smith, et al, announce a new product:

We are pleased to announce the open-source launch of Polynote: a new, polyglot notebook with first-class Scala support, Apache Spark integration, multi-language interoperability including Scala, Python, and SQL, as-you-type autocomplete, and more.

Polynote provides data scientists and machine learning researchers with a notebook environment that allows them the freedom to seamlessly integrate our JVM-based ML platform — which makes heavy use of Scala — with the Python ecosystem’s popular machine learning and visualization libraries. It has seen substantial adoption among Netflix’s personalization and recommendation teams, and it is now being integrated with the rest of our research platform.

There are some nice pieces to it, especially around language interop.

PySpark DataFrame Joining

Monika Rathor shows the various ways you can join DataFrames with PySpark:

PySpark provides multiple ways to combine dataframes, e.g., join, merge, union, SQL interface, etc. In this article, we will take a look at how the PySpark join function is similar to SQL join, where two or more tables or dataframes can be combined based on conditions.

One join type you don’t directly get in SQL Server is the left anti join. We can build something quite similar with NOT EXISTS, though.
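For illustration, here is a rough sketch of the two patterns side by side; the table and column names are made up for the example, not taken from the article:

```python
# Left anti join in PySpark vs. the NOT EXISTS pattern you'd use in SQL Server.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(1, 101), (2, 102), (3, 999)], ["order_id", "customer_id"]
)
customers = spark.createDataFrame(
    [(101, "Ann"), (102, "Bo")], ["customer_id", "name"]
)

# Left anti join: orders with no matching customer.
orders.join(customers, on="customer_id", how="left_anti").show()

# Roughly the same result through the SQL interface with NOT EXISTS.
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")
spark.sql("""
    SELECT o.*
    FROM orders o
    WHERE NOT EXISTS (
        SELECT 1 FROM customers c WHERE c.customer_id = o.customer_id
    )
""").show()
```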

Installing Python Libraries on EMR Clusters with Notebooks

Parag Chaudhari shows how we can install Python libraries on existing ElasticMapReduce clusters using EMR Notebooks:

The notebook-scoped libraries discussed previously require your EMR cluster to have access to a PyPI repository. If you cannot connect your EMR cluster to a repository, use the Python libraries pre-packaged with EMR Notebooks to analyze and visualize your results locally within the notebook. Unlike the notebook-scoped libraries, these local libraries are only available to the Python kernel and are not available to the Spark environment on the cluster. To use these local libraries, export your results from your Spark driver on the cluster to your notebook and use the notebook magic to plot your results locally. Because you are using the notebook and not the cluster to analyze and render your plots, the dataset that you export to the notebook has to be small (recommend less than 100 MB).
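As a rough sketch of both paths (the package names and cell magics below are assumptions based on the Sparkmagic kernel that EMR Notebooks uses, not code lifted from the post):

```python
# Notebook-scoped libraries: only an option if the cluster can reach a PyPI
# repository. These helpers are exposed on the SparkContext in EMR Notebooks.
sc.list_packages()                     # what is already available
sc.install_pypi_package("matplotlib")  # installed for this notebook session only

# Fallback when PyPI is unreachable: aggregate on the cluster, pull a small
# result set into the notebook's local Python kernel, and plot it there with
# the pre-packaged libraries, e.g. via Sparkmagic cell magics:
#
#   %%sql -o top_rows -n 1000
#   SELECT category, COUNT(*) AS n FROM events GROUP BY category
#
#   %%local
#   top_rows.plot(kind="bar", x="category", y="n")
```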

Read the whole thing.

Topic Modeling

Federico Pascual has an article on topic modeling and topic classification:

Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents. It’s known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans.

Since topic modeling doesn’t require training, it’s a quick and easy way to start analyzing your data. However, you can’t guarantee you’ll receive accurate results, which is why many businesses opt to invest time training a topic classification model.
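As a minimal sketch of that unsupervised workflow using scikit-learn's LDA implementation (the toy documents and parameter choices here are illustrative, not from the article):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the pitcher threw a fastball in the ninth inning",
    "the team won the championship game last night",
    "the senate passed the budget bill after a long debate",
    "voters head to the polls for the national election",
]

# Bag-of-words counts; note there are no labels or pre-classified training data.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Ask LDA for two latent topics, then inspect the top words per topic.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top_words = [terms[j] for j in weights.argsort()[-5:][::-1]]
    print(f"topic {i}: {top_words}")
```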

The article is long but worth the read, with examples in Python and additional notes for R.

K-Means Clustering with Python

Abhinav Choudhary walks us through k-means clustering using scikit-learn:

K Means Clustering tries to group your data into clusters based on similarity. In this algorithm, we have to specify the number of clusters (which is a hyperparameter) we want the data to be grouped into. Hyperparameters are the variables whose values need to be set before applying the algorithm to the dataset; they are adjustable parameters you choose to train a model, and they govern the training process itself.
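A minimal scikit-learn sketch of the idea, with the cluster count chosen up front as the hyperparameter (the data here is synthetic, not from the walkthrough):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groupings.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters is the hyperparameter we must set before fitting.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.inertia_)          # within-cluster sum of squared distances
```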

Read on for a demo.

Logistic Regression Defaults and sklearn

Giovanni Lanzani shares some thoughts on scikit-learn defaults for Logistic Regression:

If you read the post, you can see that the biggest problem with the choice is that, unless your data is regularized, you will train a model that probably underperforms: you are unnecessarily penalizing it by making it learn less than it could from the data.

The second problem with the default behavior of LogisticRegression is about choosing a regularization constant that is — in effect — a magic number (equal to 1.0). This hides the fact that the regularization constant should be tuned by hyperparameter search, not set in advance without knowing what the data and problem look like.
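To make that concrete, here is a hedged sketch of tuning C instead of accepting the default (the data and grid values are arbitrary, chosen only to illustrate the point):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default behavior: L2 regularization with the magic number C=1.0.
default_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Treat C as a hyperparameter and search over it instead.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
)
grid.fit(X_train, y_train)

print("default C=1.0 accuracy:", default_model.score(X_test, y_test))
print("best C:", grid.best_params_["C"], "accuracy:", grid.score(X_test, y_test))
```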

Knowledge is power. Also read the post Giovanni links to in order to learn more about the issue.

Python and R Data Reshaping

John Mount takes us through a couple of data shaping packages:

The advantages of data_algebra and cdata are:

– The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
– The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
– The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).

Let’s re-work a small R cdata example, using the Python package data_algebra.

Click through for the example.

Develop BDC PySpark Jobs in Visual Studio Code

Jenny Jiang announces a new capability in Visual Studio Code:

With the Visual Studio Code extension, you can enjoy native Python programming experiences such as linting, debugging support, language service, and so on. You can run the current line, run selected lines of code, or run all of your PY file. You can import and export a .ipynb notebook and perform notebook-like queries, including Run Cell, Run Above, or Run Below. You can also enjoy a notebook-like interactive experience that includes your source code and markdown comments along with the running results and output. You can remove the unneeded sections, enter comments, or type additional code in the interactive results window. Moreover, you can visualize your results in a graphic format through matplotlib, like in a Jupyter Notebook. The integration with SQL Server 2019 Big Data Clusters empowers you to quickly submit a PySpark batch job to the big data cluster and monitor job progress.

This is rather useful for developers, though I greatly prefer the Azure Data Studio notebook interface.

Eliminating Tail Calls in Python

John Mount shows how you can eliminate tail calls in Python:

I was working through Kyle Miller‘s excellent note: “Tail call recursion in Python”, and decided to experiment with variations of the techniques.

The idea is: one may want to eliminate use of the Python language call-stack in the case of a "tail call" (a function call where the result is not used by the calling function, but instead immediately returned). Tail call elimination can both speed up programs and cut down on the overhead of maintaining intermediate stack frames and environments that will never be used again.
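One common way to get this effect (not necessarily the specific technique in John's note) is a trampoline: the function returns a thunk describing the next tail call instead of making it, and a driver loop keeps invoking thunks until a plain value comes back, so the call stack never grows. A rough sketch:

```python
def trampoline(thunk):
    # Keep calling thunks until we get a non-callable result.
    result = thunk
    while callable(result):
        result = result()
    return result

def factorial(n, acc=1):
    if n <= 1:
        return acc
    # Tail position: return a zero-argument thunk instead of recursing.
    return lambda: factorial(n - 1, acc * n)

# Far past Python's default recursion limit, yet no RecursionError.
result = trampoline(lambda: factorial(20000))
print(len(str(result)), "digits")
```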

Click through for John’s riff on the topic.
