Press "Enter" to skip to content

Category: Python

Installing Python Libraries on EMR Clusters with Notebooks

Parag Chaudhari shows how we can install Python libraries on existing ElasticMapReduce clusters using EMR Notebooks:

The notebook-scoped libraries discussed previously require your EMR cluster to have access to a PyPI repository. If you cannot connect your EMR cluster to a repository, use the Python libraries pre-packaged with EMR Notebooks to analyze and visualize your results locally within the notebook. Unlike the notebook-scoped libraries, these local libraries are only available to the Python kernel and are not available to the Spark environment on the cluster. To use these local libraries, export your results from your Spark driver on the cluster to your notebook and use the notebook magic to plot your results locally. Because you are using the notebook and not the cluster to analyze and render your plots, the dataset that you export to the notebook has to be small (recommend less than 100 MB).

Read the whole thing.

Comments closed

Topic Modeling

Federico Pascual has an article on topic modeling and topic classification:

Topic modeling is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents. It’s known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that’s been previously classified by humans.

Since topic modeling doesn’t require training, it’s a quick and easy way to start analyzing your data. However, you can’t guarantee you’ll receive accurate results, which is why many businesses opt to invest time training a topic classification model.

The article is long but worth the read, with examples in Python and additional notes for R.

Comments closed

K-Means Clustering with Python

Abhinav Choudhary walks us through k-means clustering using scikit-learn:

K Means Clustering tries to cluster your data into clusters based on their similarity. In this algorithm, we have to specify the number of clusters (which is a hyperparameter) we want the data to be grouped into. Hyperparameters are the variables whose value need to be set before applying value to the dataset. Hyperparameters are adjustable parameters you choose to train a model that carries out the training process itself.

Read on for a demo.

Comments closed

Logistic Regression Defaults and sklearn

Giovanni Lanzani shares some thoughts on scikit-learn defaults for Logistic Regression:

If you read the post, you can see that the biggest problem with the choice is that, unless your data is regularized, you will train a model that probably under performs: you are unnecessarily penalizing it by making it learn less than what it could from the data.

The second problem with the default behavior of LogisticRegression is about choosing a regularization constant that is — in effect — a magic number (equal to 1.0). This hides the fact that the regularization constant should be tuned by hyperparameter search, and not set in advance without knowing how the data and problem looks like.

Knowledge is power. Also read the post Giovanni links to in order to learn more about the issue.

Comments closed

Python and R Data Reshaping

John Mount takes us through a couple of data shaping packages:

The advantages of data_algebra and cdata are:

– The user specifies their desired transform declaratively by example and in data. What one does is: work an example, and then write down what you want (we have a tutorial on this here).
– The transform systems can print what a transform is going to do. This makes reasoning about data transforms much easier.
– The transforms, as they themselves are written as data, can be easily shared between systems (such as R and Python).

Let’s re-work a small R cdata example, using the Python package data_algebra.

Click through for the example.

Comments closed

Develop BDC PySpark Jobs in Visual Studio Code

Jenny Jiang announces a new capability in Visual Studio Code:

With the Visual Studio Code extension, you can enjoy native Python programming experiences such as linting, debugging support, language service, and so on. You can run current linerun selected lines of code, or run all for your PY file. You can import and export a .ipynb notebook and perform a notebook like query including Run Cell, Run Above, or Run Below. You can also enjoy a notebook like interactive experience that includes your source code and markdown comments along with the running results and output. You can remove the unneeded sections, enter comments, or type additional code in the interactive results window. Moreover, you can visualize your results in a graphic format through a matplotlib like Jupyter Notebook. The integration with SQL Server 2019 Big Data Clusters empowers you to quickly submit a PySpark batch job to the big data cluster and monitor job progress.

This is rather useful for developers, though I greatly prefer the Azure Data Studio notebook interface.

Comments closed

Eliminating Tail Calls in Python

John Mount shows how you can eliminate tail calls in Python:

I was working through Kyle Miller‘s excellent note: “Tail call recursion in Python”, and decided to experiment with variations of the techniques.

The idea is: one may want to eliminate use of the Python language call-stack in the case of a “tail calls” (a function call where the result is not used by the calling function, but instead immediately returned). Tail call elimination can both speed up programs, and cut down on the overhead of maintaining intermediate stack frames and environments that will never be used again.

Click through for John’s riff on the topic.

Comments closed

The Basics of Apache Airflow

Divyansh Jain explains what Apache Airflow is and takes us through a sample solution:

Airflow is a platform to programmatically author, schedule & monitor workflows or data pipelines. These functions achieved with Directed Acyclic Graphs (DAG) of the tasks. It is an open-source and still in the incubator stage. It was initialized in 2014 under the umbrella of Airbnb since then it got an excellent reputation with approximately 800 contributors on GitHub and 13000 stars. The main functions of Apache Airflow is to schedule workflow, monitor and author.

It’s another interesting product in the Hadoop ecosystem and has additional appeal outside of that space.

Comments closed

Python versus R (Again)

Alex Woodie looks at whether Python is dominating R in the data science space:

There is some evidence that Python’s popularity is hurting R usage. According to the TIOBE Index, Python is currently the third most popular language in the world, behind perennial heavyweights Java and C. From August 2018 to August 2019, Python usage surged by more than 3% to achieve a 10% rating (TIOBE’s proprietary metric that primarily measures search activity), easily the biggest gain among the 20 most popular languages.

R, by contrast, has not fared well lately on the TIOBE Index, where it dropped from 8th place in January 2018 to become the 20th most popular language today, behind Perl, Swift, and Go. At its peak in January 2018, R had a popularity rating of about 2.6%. But today it’s down to 0.8%, according to the TIOBE index.

I’ll say that rumors of R’s demise are premature.

Comments closed

Microsoft ML Server 9.4

Jeroen Ter Heerdt announces Microsoft Machine Learning Server 9.4:

Today we’re excited to announce our latest Microsoft Machine Learning Server 9.4 release, which addresses popular customer requests as well as developments in the R and Python community.

Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process – from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features and algorithmic innovation that brings the best of open source and proprietary worlds together.

This is the best way to bind new versions of R and Python to your SQL Server ML Services installation.

Comments closed