Press "Enter" to skip to content

Category: Python

R Or Python

Tomaz Kastrun shares his thoughts on the topic of R versus Python:

Imag[in]e I ask you, would you prefer Apple iPhone over Samsung Galaxy, respectively? Or if I would ask you, would you prefer BMW over Audi, respectively? In all the cases, both phones or both cars will get the job done. So will Python or R, R or Python. So instead of asking which one I prefer, ask your self, which one suits my environment better? If your background is more statistics and less programming, take R, if you are more into programming and less into statistics, take Python; in both cases you will have faster time to accomplish results with your preferred language. If you ask me, can I do gradient boosting or ANOVA or MDS in Python or in R, the answer will be yes, you can do both in any of the languages.

This graf hits the crux of my opinion on the topic, but as I’ve gone deeper into the topic over the past year, I think the correct answer is probably “both” for a mature organization and “pick the one which suits you better” for beginners.

Comments closed

Leveraging Hive In Pyspark

Fisseha Berhane shows how to use Spark to connect Python to Hive:

If we are using earlier Spark versions, we have to use HiveContext which is variant of Spark SQL that integrates with data stored in Hive. Even when we do not have an existing Hive deployment, we can still enable Hive support.
In this tutorial, I am using standalone Spark. When not configured by the Hive-site.xml, the context automatically creates metastore_db in the current directory.

As shown below, initially, we do not have metastore_db but after we instantiate SparkSession with Hive support, we see that metastore_db has been created. Further, when we execute create database command, spark-warehouse is created.

Click through for a bunch of examples.

Comments closed

Markov Chains In Python

Sandipan Dey shows off various uses of Markov chains as well as how to create one in Python:

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov chains are a venerable statistical technique and formed the basis of a lot of text processing (especially text generation) due to the algorithm’s relatively low computational requirements.

Comments closed

Anomaly Detection With Python

Robert Sheldon continues his SQL Server Machine Learning Series:

As important as these concepts are to working Python and MLS, the purpose in covering them was meant only to provide you with a foundation for doing what’s really important in MLS, that is, using Python (or the R language) to analyze data and present the results in a meaningful way. In this article, we start digging into the analytics side of Python by stepping through a script that identifies anomalies in a data set, which can occur as a result of fraud, demographic irregularities, network or system intrusion, or any number of other reasons.

The article uses a single example to demonstrate how to generate training and test data, create a support vector machine (SVM) data model based on the training data, score the test data using the SVM model, and create a scatter plot that shows the scoring results.

Click through to see the scenario that Robert has laid out as an example.

Comments closed

Inventive Uses Of Python In SQL Server 2017

Gerald Britton has a couple non-ML uses for Python in SQL Server 2017:

One of the new features announced with SQL Server 2017 is support for the Python language. This is big! In SQL Server 2016, Microsoft announced support for the R language – an open source language ideally suited for statistical analysis and machine learning (ML). Recognizing that many data scientists use Python with ML libraries, the easy-to-learn-hard-to-forget language has now been added to the SQL Server ML suite.

There’s a big difference between R and Python though: R is a domain-specific language while Python is general purpose. That means that the full power of Python is available within SQL Server. This article leaves ML aside for the moment and explores a few of the other possibilities.

Gerald has two good cases for using Python with SQL Server.  Funny enough, they’re both also easily supported in R, so you could do this in 2016 as well.

Comments closed

Plotting Data With Python In ML Services

Robert Sheldon continues his Python in Machine Learning Services series:

One of the most useful modules is the matplotlib library, which provides an extensive codebase for plotting data and creating rich, customized visualizations. You can use matplotlib components to generate a wide range of graphics, including bar charts, pie charts, scatter plots, histograms, and many others. For example, you can generate a series of line charts that aggregate inventory or sales data in your SQL Server database and then save those charts to .png or .pdf files.

This article includes several examples that demonstrate how to create matplotlib visualizations and save them to .pdf files, using data from the AdventureWorks2017 sample database. The article assumes that you know how to use the sp_execute_external_script stored procedure to run Python scripts in SQL Server. If you’re not familiar with the stored procedure, you should review the first two articles in this series before continuing with this one.

If you’re already familiar with matplotlib, using it within SQL Server is pretty easy, as Robert shows.  If you’re not familiar, this is a useful introduction to the library.

Comments closed

Python Data Frames In ML Services

Robert Sheldon continues his SQL Server Machine Learning Services series by looking at Python data frames:

This article focuses on using data frames in Python. It is the second article in a series about MLS and Python. The first article introduced you briefly to data frames. This article continues that discussion, describing how to work with data frame objects and the data within those objects.

Data frames and the functions they support are available to MLS and Python through the pandas library. The library is available as a Python module that provides tools for analyzing and manipulating data, including the ability to generate data frame objects and work with data frame data. The pandas library is included by default in MLS, so the functions and data structures available to pandas are ready to use, without having to manually install pandas in the MLS library.

There’s quite a bit to this article, making it an interesting read.

Comments closed

Estimating Used Car Prices

Kevin Jacobs wants to estimate the value of his car and shows how to set up a machine learning job to do this:

As you can see, I collected the brand (Peugeot 106), the type (1.0, 1.1, …), the color of the car (black, blue, …) the construction year of the car, the odometer of the car (which is the distance in kilometers (km) traveled with the car at this point in space and time), the ask price of the car (in Euro’s), the days until the MOT (Ministry of Transport test, a required periodical check-up of your car) and the horse power (HP) of the car. Feel free to use your own variables/units!

It’s an interesting example of how you can approach a real problem.

Comments closed

Working With Lists In Python

Jean-Francois Puget shows that iterating through lists in a for loop is not always the most efficient for Python code:

The above code is correct, but it is inefficient.  List comprehensions are meant for case like this.  The idea of comprehension is to move the for loop inside the construction of the result list:

def get_square_list(x):
    return [xi*xi for xi in x]

This is both simpler and faster than the previous code.

Definitely worth reading if you come at Python from a C-like language background.

Comments closed

Network And Sankey Diagrams In Python And R

Tony Hirst has a roundup of various R and Python packages which build network charts or Sankey diagrams:

Another way we might be able to look at the data “out of time” to show flow between modules is to use a Sankey diagram that allows for the possibility of feedback loops.

The Python sankeyview package (described in Hybrid Sankey diagrams: Visual analysis of multidimensional data for understanding resource use looks like it could be useful here, if I can work out how to do the set-up correctly!

Sankey diagrams are on my list of dangerous visuals:  done right, they are informative, but it’s easy to try to put too much into the diagram and thereby confuse everybody.

Comments closed