Press "Enter" to skip to content

Category: Python

Visual Studio Code In Anaconda 5.1

George Leopold reports that Anaconda 5.1 will now include Visual Studio Code as an optional IDE:

Microsoft and Python data science platform vendor Anaconda have extended their partnership by adding the software giant’s code editor to the latest Anaconda distribution.

The addition of Microsoft’s Visual Studio Code (VS Code) expands its support for the latest release of the Python data science platform, Anaconda 5.1. The Python platform has attracted more than 4.5 million users running the programming language on Windows, Mac and Linux.

Along with editing and debugging features, the partners said the cross-platform code editor includes custom features for Anaconda users. For example, a Python extension customizes VS Code for the Python development environment.

Read on for more information.

Comments closed

Optimal Image Colorization With Python

Sandipan Dey walks through a paper on colorization and shows some examples:

Colorization is a computer-assisted process of adding color to a monochrome image or movie. In the paper the authors presented an optimization-based colorization method that is based on a simple premise: neighboring pixels in space-time that have similar intensities should have similar colors.

This premise is formulated using a quadratic cost function  and as an optimization problem. In this approach an artist only needs to annotate the image with a few color scribbles, and the indicated colors are automatically propagated in both space and time to produce a fully colorized image or sequence.

In this article the optimization problem formulation and the way to solve it to obtain the automatically colored image will be described for the images only.

It’s an interesting approach.

Comments closed

PySpark DataFrame Transformations

Vincent-Philippe Lauzon shows how to perform data frame transformations using PySpark:

We wanted to look at some more Data Frames, with a bigger data set, more precisely some transformation techniques.  We often say that most of the leg work in Machine learning in data cleansing.  Similarly we can affirm that the clever & insightful aggregation query performed on a large dataset can only be executed after a considerable amount of work has been done into formatting, filtering & massaging data:  data wrangling.

Here, we’ll look at an interesting dataset, the H-1B Visa Petitions 2011-2016 (from Kaggle) and find some good insights with just a few queries, but also some data wrangling.

It is important to note that about everything in this article isn’t specific to Azure Databricks and would work with any distribution of Apache Spark.

The notebook used for this article is persisted on GitHub.

Read on for explanation, or check out the notebook to work on it at your own pace.

Comments closed

R Or Python

Tomaz Kastrun shares his thoughts on the topic of R versus Python:

Imag[in]e I ask you, would you prefer Apple iPhone over Samsung Galaxy, respectively? Or if I would ask you, would you prefer BMW over Audi, respectively? In all the cases, both phones or both cars will get the job done. So will Python or R, R or Python. So instead of asking which one I prefer, ask your self, which one suits my environment better? If your background is more statistics and less programming, take R, if you are more into programming and less into statistics, take Python; in both cases you will have faster time to accomplish results with your preferred language. If you ask me, can I do gradient boosting or ANOVA or MDS in Python or in R, the answer will be yes, you can do both in any of the languages.

This graf hits the crux of my opinion on the topic, but as I’ve gone deeper into the topic over the past year, I think the correct answer is probably “both” for a mature organization and “pick the one which suits you better” for beginners.

Comments closed

Leveraging Hive In Pyspark

Fisseha Berhane shows how to use Spark to connect Python to Hive:

If we are using earlier Spark versions, we have to use HiveContext which is variant of Spark SQL that integrates with data stored in Hive. Even when we do not have an existing Hive deployment, we can still enable Hive support.
In this tutorial, I am using standalone Spark. When not configured by the Hive-site.xml, the context automatically creates metastore_db in the current directory.

As shown below, initially, we do not have metastore_db but after we instantiate SparkSession with Hive support, we see that metastore_db has been created. Further, when we execute create database command, spark-warehouse is created.

Click through for a bunch of examples.

Comments closed

Markov Chains In Python

Sandipan Dey shows off various uses of Markov chains as well as how to create one in Python:

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov chains are a venerable statistical technique and formed the basis of a lot of text processing (especially text generation) due to the algorithm’s relatively low computational requirements.

Comments closed

Anomaly Detection With Python

Robert Sheldon continues his SQL Server Machine Learning Series:

As important as these concepts are to working Python and MLS, the purpose in covering them was meant only to provide you with a foundation for doing what’s really important in MLS, that is, using Python (or the R language) to analyze data and present the results in a meaningful way. In this article, we start digging into the analytics side of Python by stepping through a script that identifies anomalies in a data set, which can occur as a result of fraud, demographic irregularities, network or system intrusion, or any number of other reasons.

The article uses a single example to demonstrate how to generate training and test data, create a support vector machine (SVM) data model based on the training data, score the test data using the SVM model, and create a scatter plot that shows the scoring results.

Click through to see the scenario that Robert has laid out as an example.

Comments closed

Inventive Uses Of Python In SQL Server 2017

Gerald Britton has a couple non-ML uses for Python in SQL Server 2017:

One of the new features announced with SQL Server 2017 is support for the Python language. This is big! In SQL Server 2016, Microsoft announced support for the R language – an open source language ideally suited for statistical analysis and machine learning (ML). Recognizing that many data scientists use Python with ML libraries, the easy-to-learn-hard-to-forget language has now been added to the SQL Server ML suite.

There’s a big difference between R and Python though: R is a domain-specific language while Python is general purpose. That means that the full power of Python is available within SQL Server. This article leaves ML aside for the moment and explores a few of the other possibilities.

Gerald has two good cases for using Python with SQL Server.  Funny enough, they’re both also easily supported in R, so you could do this in 2016 as well.

Comments closed

Plotting Data With Python In ML Services

Robert Sheldon continues his Python in Machine Learning Services series:

One of the most useful modules is the matplotlib library, which provides an extensive codebase for plotting data and creating rich, customized visualizations. You can use matplotlib components to generate a wide range of graphics, including bar charts, pie charts, scatter plots, histograms, and many others. For example, you can generate a series of line charts that aggregate inventory or sales data in your SQL Server database and then save those charts to .png or .pdf files.

This article includes several examples that demonstrate how to create matplotlib visualizations and save them to .pdf files, using data from the AdventureWorks2017 sample database. The article assumes that you know how to use the sp_execute_external_script stored procedure to run Python scripts in SQL Server. If you’re not familiar with the stored procedure, you should review the first two articles in this series before continuing with this one.

If you’re already familiar with matplotlib, using it within SQL Server is pretty easy, as Robert shows.  If you’re not familiar, this is a useful introduction to the library.

Comments closed

Python Data Frames In ML Services

Robert Sheldon continues his SQL Server Machine Learning Services series by looking at Python data frames:

This article focuses on using data frames in Python. It is the second article in a series about MLS and Python. The first article introduced you briefly to data frames. This article continues that discussion, describing how to work with data frame objects and the data within those objects.

Data frames and the functions they support are available to MLS and Python through the pandas library. The library is available as a Python module that provides tools for analyzing and manipulating data, including the ability to generate data frame objects and work with data frame data. The pandas library is included by default in MLS, so the functions and data structures available to pandas are ready to use, without having to manually install pandas in the MLS library.

There’s quite a bit to this article, making it an interesting read.

Comments closed