
Category: Python

Vectorized UDFs For PySpark

Li Jin talks about a performance optimization coming in Apache Spark 2.3:

To enable data scientists to leverage the value of big data, Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead. As a result, many data pipelines define UDFs in Java and Scala, and then invoke them from Python.

Vectorized UDFs built on top of Apache Arrow bring you the best of both worlds—the ability to define low-overhead, high performance UDFs entirely in Python.

This looks like a good performance improvement coming to PySpark, bringing it closer to Scala/Java performance with respect to UDFs.
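To make the overhead argument concrete, here is a minimal sketch in plain Python. It is not Spark's API or its internals: `pickle` is only a stand-in for whatever serialization happens between the engine and the Python worker, and `plus_one` and `plus_one_batch` are hypothetical functions invented for the illustration. The point is simply that handing data over in batches cuts the number of serialize-and-invoke round trips.

```python
import pickle


def plus_one(x):
    # Hypothetical scalar UDF: called once per row.
    return x + 1


def plus_one_batch(batch):
    # Hypothetical vectorized UDF: called once per batch of rows.
    return [x + 1 for x in batch]


def row_at_a_time(values):
    """Serialize, invoke, and deserialize once per row."""
    out, round_trips = [], 0
    for v in values:
        arg = pickle.loads(pickle.dumps(v))        # ship one value over
        result = plus_one(arg)
        out.append(pickle.loads(pickle.dumps(result)))  # ship the result back
        round_trips += 1
    return out, round_trips


def batched(values, batch_size=1000):
    """Serialize, invoke, and deserialize once per batch."""
    out, round_trips = [], 0
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        arg = pickle.loads(pickle.dumps(chunk))    # ship a whole batch over
        result = plus_one_batch(arg)
        out.extend(pickle.loads(pickle.dumps(result)))
        round_trips += 1
    return out, round_trips


values = list(range(10_000))
row_out, row_trips = row_at_a_time(values)
batch_out, batch_trips = batched(values)
print("row-at-a-time round trips:", row_trips)
print("batched round trips:", batch_trips)
```

Both paths compute the same answer; only the number of hand-offs differs, which is the cost the quoted passage says vectorized UDFs reduce.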


Kaggle Data Science Report For 2017

Mark McDonald rounds up a few notebooks covering a recent Kaggle survey:

In 2017 we conducted our first ever extra-large, industry-wide survey to capture the state of data science and machine learning.

As the data science field booms, so has our community. In 2017 we hit a new milestone of reaching over 1M registered data scientists from almost every country in the world. Representing many different backgrounds, skill levels, and professions, we were excited to ask our community a wide range of questions about themselves, their skills, and their path to data science. We asked them everything from “what’s your yearly salary?” to “what are your favorite data science podcasts?” to “what barriers are faced at work?”, letting us piece together key insights about the people and the trends behind the machine learning models.

Without further ado, we’d love to share everything with you. Over 16,000 survey responses were submitted, with over 6 full months of aggregated time spent completing them (an average response time of more than 16 minutes).

Click through for a few reports.  Something interesting to me is that the top languages/tools were, in order, Python, R, and SQL.  For the particular market niche that Kaggle competitions fit, that makes a lot of sense:  I tend to like R more for data exploration and data cleansing, but much of that work is already done by the time you get the dataset.


Data Set Robustness

Tomaz Kastrun shows how robust the iris data set is:

Conclusion: the IRIS dataset is – due to the nature of the measurements and observations – robust and rigid; one can get very good accuracy results on a small training set. Everything beyond 30% for training the model is, for this particular case, just additional overload.

The general concept here is, how small can you arbitrarily slice the data and still come up with the same result as the overall data set?  Or, phrased differently, how much data do you need to collect before predictions stabilize?  Read on to see how Tomaz solves the problem.
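The question of how much training data is enough can be sketched as a small experiment. The code below is not Tomaz's method and does not use the iris data: it assumes two synthetic, well-separated 2-D classes and a hypothetical nearest-centroid classifier, and it reports test accuracy as the training fraction grows.

```python
import random


def make_data(n_per_class, rng):
    """Two synthetic 2-D classes (an assumption; this is not the iris data)."""
    data = []
    for label, center in ((0, 0.0), (1, 5.0)):
        for _ in range(n_per_class):
            data.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return data


def fit_centroids(train):
    """Average the training points of each class."""
    sums = {}
    for (x, y), label in train:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


def predict(centroids, point):
    """Return the label of the closest centroid."""
    x, y = point
    return min(
        centroids,
        key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
    )


def accuracy_by_fraction(data, fractions, rng):
    """Stratified split per class, then held-out accuracy per training fraction."""
    by_class = {}
    for point, label in data:
        by_class.setdefault(label, []).append(point)
    for points in by_class.values():
        rng.shuffle(points)

    results = {}
    for frac in fractions:
        train, test = [], []
        for label, points in by_class.items():
            cut = max(1, round(len(points) * frac))
            train += [(p, label) for p in points[:cut]]
            test += [(p, label) for p in points[cut:]]
        centroids = fit_centroids(train)
        hits = sum(predict(centroids, p) == label for p, label in test)
        results[frac] = hits / len(test)
    return results


rng = random.Random(42)
fractions = (0.1, 0.3, 0.5, 0.7, 0.9)
results = accuracy_by_fraction(make_data(100, rng), fractions, rng)
for frac in fractions:
    print(f"train fraction {frac:.0%}: accuracy {results[frac]:.3f}")
```

On data this cleanly separated, accuracy is already high at the smallest fraction, which is the kind of plateau the post describes for iris.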


Using Service Broker To Queue Up External Script Calls

Arvind Shyamsundar shows how to use Service Broker to run external R or Python scripts based on new data coming into a transactional system:

Here, we will show you how you can use the asynchronous execution mechanism offered by SQL Server Service Broker to ‘queue’ up data inside SQL Server which can then be asynchronously passed to a Python script, and the results of that Python script then stored back into SQL Server.

This is effectively similar to the external message queue pattern but has some key advantages:

  • The solution is integrated within the data store, leading to fewer moving parts and lower complexity
  • Because the solution is in-database, we don’t need to make copies of the data. We just need to know what data has to be processed (effectively a ‘pointer to the data’ is what we need).

Service Broker also offers options to govern the number of readers of the queue, thereby ensuring predictable throughput without affecting core database operations.

There are several interconnected parts here, and Arvind walks through the entire scenario.
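The pattern described above (queue a pointer to the data, cap the number of readers, write results back) can be sketched in plain Python. This is only an analogy, not Service Broker or Arvind's code: the `orders` dict stands in for a table, `score` stands in for the external Python script, and `MAX_READERS` is a hypothetical setting playing the role of the queue reader limit.

```python
import queue
import threading

# Hypothetical stand-in for a transactional table keyed by order ID.
orders = {i: {"amount": i * 10.0, "score": None} for i in range(1, 21)}
table_lock = threading.Lock()
work_queue = queue.Queue()

MAX_READERS = 2  # assumed cap on concurrent queue readers


def score(amount):
    # Hypothetical stand-in for the external script's computation.
    return amount * 0.5


def reader():
    """Pull order IDs off the queue, score them, and store the result."""
    while True:
        order_id = work_queue.get()
        try:
            if order_id is None:  # sentinel: no more work for this reader
                return
            with table_lock:
                amount = orders[order_id]["amount"]
            result = score(amount)
            with table_lock:
                orders[order_id]["score"] = result
        finally:
            work_queue.task_done()


threads = [threading.Thread(target=reader) for _ in range(MAX_READERS)]
for t in threads:
    t.start()

# Enqueue only the key (the "pointer to the data"), never a copy of the row.
for order_id in list(orders):
    work_queue.put(order_id)
for _ in threads:
    work_queue.put(None)

work_queue.join()
for t in threads:
    t.join()

print(sum(row["score"] is not None for row in orders.values()), "rows scored")
```

Because the queue carries only keys, the readers always fetch the current row from the store, and the fixed reader count bounds how much work runs at once.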


Measuring Semantic Relatedness

Sandipan Dey re-works a university assignment on semantic relatedness in Python:

Let’s define the semantic relatedness of two WordNet nouns x and y as follows:

  • A = set of synsets in which x appears
  • B = set of synsets in which y appears
  • distance(x, y) = length of shortest ancestral path of subsets A and B
  • sca(x, y) = a shortest common ancestor of subsets A and B

This is the notion of distance that we need to use to implement the distance() and sca() methods in the WordNet data type.

It looks like a helpful assignment for understanding natural language processing a little better.
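The definitions above can be sketched with two breadth-first searches. This is one possible approach under stated assumptions, not Sandipan's implementation: the toy `graph` maps each node to its hypernyms (parents), plain strings stand in for synset IDs, and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, sources):
    """Distance from the nearest source to every node reachable by following parents."""
    dist = {s: 0 for s in sources}
    pending = deque(sources)
    while pending:
        node = pending.popleft()
        for parent in graph.get(node, ()):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                pending.append(parent)
    return dist


def shortest_ancestral_path(graph, a, b):
    """Return (length, ancestor) for subsets a and b, or (None, None) if none exists."""
    dist_a = bfs_distances(graph, a)
    dist_b = bfs_distances(graph, b)
    common = dist_a.keys() & dist_b.keys()
    if not common:
        return None, None
    # Break ties by name so the result is deterministic.
    ancestor = min(common, key=lambda n: (dist_a[n] + dist_b[n], n))
    return dist_a[ancestor] + dist_b[ancestor], ancestor


# Toy hypernym graph (an assumption for illustration): child -> list of parents.
graph = {
    "dog": ["canine"],
    "wolf": ["canine"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
}

print(shortest_ancestral_path(graph, {"dog"}, {"wolf"}))
print(shortest_ancestral_path(graph, {"dog", "wolf"}, {"cat"}))
```

Starting each search from the whole subset at distance zero is what lets the same routine handle nouns that appear in several synsets.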


R And Python: Two Growing Languages

David Smith notes that as fast as Python is growing, R is as well:

Python has been getting some attention recently for its impressive growth in usage. Since both R and Python are used for data science, I sometimes get asked if R is falling by the wayside, or if R developers should switch course and learn Python. My answer to both questions is no.

First, while Python is an excellent general-purpose data science tool, for applications where comparative inference and robust predictions are the main goal, R will continue to be the prime repository of validated statistical functions and cutting-edge research for a long time to come. Secondly, R and Python are both top-10 programming languages, and while Python has a larger userbase, R and Python are both growing rapidly — and at similar rates.

I had a discussion about this last night.  I like the language diversity:  R is more statistician-oriented, whereas Python is more developer-oriented.  They both can solve the same set of problems, but there are certainly cases where one beats the other.  I think Python will end up being the more popular language for data science because of the number of application developers moving into the space, but for the data analysts and academicians moving to this field, R will likely remain the more interesting language.


Configuring Visual Studio To Execute Python Code

Dave Mason shows us how to install Python support in Visual Studio 2015 and hook it up to the SQL Server 2017 Machine Learning Services installation of Python:

I’m starting to experiment with Python scripts in SQL Server 2017 using Machine Learning Services (In-Database). The problem is, I don’t know Python. If I run into a Python error, the output I get from SSMS is not looking too helpful. My instincts tell me I’ll be much better off developing and debugging Python code from a development tool. What I settled on was to use Visual Studio along with the Python interpreter that comes with SQL Server 2017 Machine Learning Services. I ran into a few issues that I’ll review here.

The first thing I did was Install Python support in Visual Studio on Windows. This article from Microsoft was simple enough. It worked for me with Visual Studio Community 2015. I quickly created a “PythonApplication1” project and tried Hello World. But I got an error telling me Visual Studio couldn’t find any interpreters.

Click through to read more.  With Visual Studio 2017, it’s a bit easier to get started:  select the Data Science pack on installation and you’ll get both Python and R support out of the box.


Fun With The Beta Distribution

John D. Cook shows how one chaotic equation just happens to follow a beta distribution:

Indeed the points do bounce all over the unit interval, though they more often bounce near one of the ends.

Does that distribution look familiar? You might recognize it from Bayesian statistics. It’s a beta distribution. It’s symmetric, so the two beta distribution parameters are equal. There’s a vertical asymptote on each end, so the parameters are less than 1. In fact, it’s a beta(1/2, 1/2) distribution. It comes up, for example, as the Jeffreys prior for Bernoulli trials.

The graph below adds the beta(1/2, 1/2) density to the histogram to show how well it fits.

It’s an interesting bit of math and statistics, and John provides some Python demo code at the end.
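As a rough check of the claim, the sketch below iterates a chaotic map and compares the empirical distribution of the points against the beta(1/2, 1/2) CDF, which has the closed form (2/π)·arcsin(√x). The excerpt does not name the equation, so the logistic map x → 4x(1 − x) used here is an assumption, as are the starting value and iteration count; this is not John's demo code.

```python
import math


def logistic_orbit(x0, n):
    """Iterate x -> 4x(1 - x) n times (assumed to be the chaotic equation in question)."""
    xs = []
    x = x0
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
        xs.append(x)
    return xs


def beta_half_half_cdf(x):
    """CDF of beta(1/2, 1/2), also known as the arcsine distribution."""
    return 2.0 / math.pi * math.asin(math.sqrt(x))


def empirical_cdf(xs, cut):
    """Fraction of points at or below the cut."""
    return sum(x <= cut for x in xs) / len(xs)


orbit = logistic_orbit(0.1, 100_000)  # arbitrary starting point
for cut in (0.1, 0.25, 0.5, 0.9):
    print(
        f"x <= {cut}: empirical {empirical_cdf(orbit, cut):.3f}, "
        f"beta(1/2, 1/2) {beta_half_half_cdf(cut):.3f}"
    )
```

If the two columns track each other closely, that matches the histogram-versus-density comparison described in the excerpt.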


Getting Started With TensorFlow

Vivek Kalyanrangan shows us how to install TensorFlow:

Installing TensorFlow with GPU support requires you to have an NVIDIA GPU. AMD video cards are not supported by TensorFlow. NVIDIA uses a low-level GPU computing system called CUDA, which is proprietary NVIDIA software.

One can go the OpenCL way with AMD, but as of now it won’t work with TensorFlow.

Also, not all NVIDIA devices are supported. Here is a list from the NVIDIA documentation listing the supported GPUs.

By the end of it, Vivek also shows us a simple trained model.


R Versus Python

Vincent Granville believes that Python is overtaking R in the realm of data science:

We use the app in question to compare search interest for R Data Science versus Python Data Science (see above chart).  It looks like until December 2016, R dominated, but fell below Python by early 2017. The above chart displays an interest index, 100 being maximum and 0 being minimum. Click here to access this interactive chart on Google, and check the results for countries other than the US, or even for specific regions such as California or New York.

Note that Python always dominated R by a long shot, because it is a general-purpose language, while R is a specialized language. But here, we compare R and Python in the niche context of data science. The map below shows interest for Python (general purpose) per region, using the same Google index in question.

It’s an interesting look at the relative shift between R and Python as a primary language for statistical analysis.
