Press "Enter" to skip to content

Category: Python

Hyperparameter Tuning with MLflow

Joseph Bradley shows how you can perform hyperparameter tuning of an MLlib model with MLflow:

Apache Spark MLlib users often tune hyperparameters using MLlib’s built-in tools CrossValidator and TrainValidationSplit.  These use grid search to try out a user-specified set of hyperparameter values; see the Spark docs on tuning for more info.

Databricks Runtime 5.3 and 5.3 ML and above support automatic MLflow tracking for MLlib tuning in Python.

With this feature, PySpark CrossValidator and TrainValidationSplit will automatically log to MLflow, organizing runs in a hierarchy and logging hyperparameters and the evaluation metric.  For example, calling CrossValidator.fit() will log one parent run.  Under this run, CrossValidator will log one child run for each hyperparameter setting, and each of those child runs will include the hyperparameter setting and the evaluation metric.  Comparing these runs in the MLflow UI helps with visualizing the effect of tuning each hyperparameter.
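To give a sense of what gets tracked, here is a minimal sketch of the kind of PySpark tuning code the feature applies to, with a hypothetical training DataFrame. On Databricks Runtime 5.3 ML and above, the fit() call below would log to MLflow with no extra code:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

# train_df is a hypothetical DataFrame with 'features' and 'label' columns.
# With automatic tracking, this logs one parent MLflow run plus one child
# run per hyperparameter combination (six here), each with its metric.
cv_model = cv.fit(train_df)
```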

Hyperparameter tuning is critical for some of the more complex algorithms like random forests, gradient boosting, and neural networks.

Linear Regression With Python In Power BI

Emanuele Meazzo builds a linear regression in Power BI using a Python visual:

As a prerequisite, of course, you’ll need to have Python installed on your machine. I recommend using an external IDE like Visual Studio Code to write your Python code, as the Power BI window offers zero coding assistance.

You can follow this article to configure Python correctly for Power BI.

Step 2 is to add a Python Visual to the page, and let the magic happen.

Click through for the step-by-step instructions, including quite a bit of Python code and a few warnings and limitations.
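To give a flavor of what that code looks like, here is a minimal sketch of a Python visual script (not Emanuele’s exact code). Power BI hands the visual’s fields to the script as a pandas DataFrame named dataset; the column names below are hypothetical:

```python
# Inside a Power BI Python visual, `dataset` is supplied automatically.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

df = dataset.dropna()                 # hypothetical columns: 'x' and 'y'
X = df[["x"]].values
y = df["y"].values

lm = LinearRegression().fit(X, y)
xs = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)

plt.scatter(X, y, alpha=0.5)
plt.plot(xs, lm.predict(xs), color="red")
plt.title(f"y = {lm.coef_[0]:.2f}x + {lm.intercept_:.2f}")
plt.show()                            # Power BI renders the matplotlib figure
```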

Learning R Versus Python

Andy Kirk shares the results of a rather informal Twitter poll:

Yesterday I ran a simple Twitter poll about the relative ease of learning R vs. Python. Although a correct answer to this query will ALWAYS depend on nuances like pre-existing skills and the scope of need, it originates from people telling me they encounter job or career profiles that list a need for R and/or Python. If they don’t have either and prioritised the pursuit of just one, with which would it be possible to develop a degree of competency more easily, more quickly, and more efficiently?

Andy has also created a Twitter moment from the responses.

My thought, based only on the question itself, is that R would be better than Python because the hypothetical person has no additional programming skills. For someone with additional programming skills, the breakdown for me starts like this: if your background is statistics, database development, or functional programming, you probably want R; if your background is object-oriented development or imperative programming, you probably want Python. And then it gets nuanced.

Linear Programming in Python

Francisco Alvarez shows us an example of linear programming in Python:

The first two constraints, x1 ≥ 0 and x2 ≥ 0, are called nonnegativity constraints. The other constraints are then called the main constraints. The function to be maximized (or minimized) is called the objective function. Here, the objective function is x1 + x2.

Two classes of problems, called here the standard maximum problem and the standard minimum problem, play a special role. In these problems, all variables are constrained to be nonnegative, and all main constraints are inequalities.
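As a concrete illustration of a standard maximum problem, here is a small sketch using SciPy’s linprog, with made-up main constraints since the post’s exact numbers aren’t reproduced here. Note that linprog minimizes, so we negate the objective to maximize x1 + x2:

```python
from scipy.optimize import linprog

c = [-1, -1]                     # minimize -(x1 + x2), i.e., maximize x1 + x2
A_ub = [[1, 2],                  # hypothetical main constraints:
        [4, 2]]                  #   x1 + 2*x2 <= 4
b_ub = [4, 12]                   #   4*x1 + 2*x2 <= 12
bounds = [(0, None), (0, None)]  # nonnegativity constraints

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x, -res.fun)           # optimal (x1, x2) and objective value
```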

That post spurred me on to look up LINGO and see that it’s actually still around.

From pandas to Spark with koalas

Achilleus tries out Koalas:

Python is a widely used programming language for data science workloads, and it has plenty of libraries to back that up. Most data scientists are familiar with Python and pandas. But the main issue with pandas is that it works great for small and medium datasets but not so well on big-data workloads. The challenge then becomes converting existing pandas code to PySpark code. That is not straightforward and can carry a lot of performance hits if Python UDFs are used without much care.

Koalas tries to address the first problem, i.e., lessening the friction of learning a different API to port existing pandas code to PySpark. With Koalas, we can directly replace existing pandas code with Koalas. As far as performance goes, there are no numbers yet, as the project is still in its initial phase, but it definitely looks promising.
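The pitch is that the port can be almost mechanical. A minimal sketch, assuming a Spark-backed environment and a hypothetical CSV path; the only change from pandas is the import:

```python
import databricks.koalas as ks               # instead of: import pandas as pd

kdf = ks.read_csv("/data/reviews.csv")       # hypothetical path
kdf["rating_pct"] = kdf["rating"] / 5 * 100  # familiar pandas-style syntax
print(kdf.head())                            # executed by Spark under the hood
```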

Read on for some initial thoughts on the product, including a few gotchas.

Vectors for Programmers

John Mount has a couple of videos available:

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with which ideas programmers find interesting about vectors, which concepts they consider safe starting points, and how to condense and present the material.

Click through for the links, one with Python examples and the other with R examples.

Sentiment Analysis with Python

Bruno Stecanella shows us how to use MonkeyLearn to perform sentiment analysis in Python:

Sentiment analysis is a set of Natural Language Processing (NLP) techniques that takes a text (in more academic circles, a document) written in natural language and extracts the opinions present in the text.

In a more practical sense, our objective here is to take a text and produce a label (or labels) that summarizes the sentiment of this text, e.g., positive, neutral, and negative.

For example, if we were dealing with hotel reviews, we would want the sentence ‘The staff were lovely’ to be labeled as Positive, and the sentence ‘The shared bathroom was absolutely disgusting’ labeled as Negative.
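For reference, the MonkeyLearn Python client makes this a short script. A hedged sketch, with the API key and model id as placeholders you would replace with your own:

```python
from monkeylearn import MonkeyLearn

ml = MonkeyLearn("your-api-key-here")        # placeholder API key
model_id = "cl_pi3C7JiL"                     # placeholder sentiment model id
data = ["The staff were lovely",
        "The shared bathroom was absolutely disgusting"]

result = ml.classifiers.classify(model_id, data)
print(result.body)                           # labels plus confidence scores
```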

Click through for a demo.

Predicting Database Growth

James Livingston uses linear regression to plot database growth over time:

Utilizing the equation for a line, instead of solving for y we will solve for x, where:
– x corresponds to the day we will hit capacity based on current growth rate
– y corresponds to drive capacity in GB
– m is the slope of our regression line, provided by the model via lm.coef_
– b is the intercept of the regression line, also provided by the model via lm.intercept_
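In code, that rearrangement is a one-liner once the model is fit. A minimal sketch with synthetic growth data and a hypothetical 500 GB drive:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for daily database size samples.
days = np.arange(1, 91).reshape(-1, 1)
size_gb = 100 + 1.5 * days.ravel() + np.random.normal(0, 2, 90)

lm = LinearRegression().fit(days, size_gb)
m, b = lm.coef_[0], lm.intercept_

capacity_gb = 500                    # y: hypothetical drive capacity
day_full = (capacity_gb - b) / m     # solve y = m*x + b for x
print(f"Projected to reach {capacity_gb} GB around day {day_full:.0f}")
```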

Click through for an example. This is one of the areas where DBAs can gain a lot by learning a bit of data science.

Sentiment Analysis with Spark on Qubole

Jonathan Day, et al, have a tutorial on using Qubole to build a sentiment analysis model:

This post covers the use of Qubole, Zeppelin, PySpark, and H2O PySparkling to develop a sentiment analysis model capable of providing real-time alerts on customer product reviews. In particular, this model allows users to monitor any natural language text (such as social media posts or Amazon reviews) and receive alerts when customers post extremely nice (high sentiment) or extremely negative (low sentiment) comments about their products.

In addition to introducing the frameworks used, we will also discuss the concepts of embedding spaces, sentiment analysis, deep neural networks, grid search, stop words, data visualization, and data preparation.
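Two of those concepts, stop words and embedding spaces, are easy to sketch with plain PySpark ML; the Qubole and H2O specifics are in the linked tutorial. Column names here are assumptions:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover, Tokenizer, Word2Vec

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="review", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="filtered"),
    Word2Vec(inputCol="filtered", outputCol="embedding", vectorSize=100),
])

# reviews_df is a hypothetical DataFrame with a 'review' string column.
model = pipeline.fit(reviews_df)
features = model.transform(reviews_df)   # adds a 100-dimensional embedding
```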

Click through for the demo.

Running Spark MLlib to Feed Power BI

Brad Llewellyn shows how you can take Spark MLlib results and feed them into Power BI:

MLlib is one of the primary extensions of Spark, along with Spark SQL, Spark Streaming and GraphX.  It is a machine learning framework built from the ground up to be massively scalable and operate within Spark.  This makes it an excellent choice for machine learning applications that need to crunch extremely large amounts of data.  You can read more about Spark MLlib here.

In order to leverage Spark MLlib, we obviously need a way to execute Spark code.  In our minds, there’s no better tool for this than Azure Databricks.  In the previous post, we covered the creation of an Azure Databricks environment.  We’re going to reuse that environment for this post as well.  We’ll also use the same dataset that we’ve been using, which contains information about individual customers.  This dataset was originally designed to predict Income based on a number of factors.  However, we left the income out of this dataset a few posts back for reasons that were important then.  So, we’re actually going to use this dataset to predict “Hours Per Week” instead.
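The shape of that MLlib code is roughly this: assemble features, fit a regression with hours per week as the label, and score the data. A hedged sketch (column names assumed, not Brad’s exact code):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["age", "education_num", "capital_gain"],
                            outputCol="features")
train = assembler.transform(customers_df)   # customers_df: the census-style data

lr_model = LinearRegression(labelCol="hours_per_week").fit(train)
predictions = lr_model.transform(train)     # these predictions feed Power BI
```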

Check it out. And Brad’s not joking when he says the resulting model is terrible. But that’s okay, because it was never about the model.
