Press "Enter" to skip to content

Category: Python

Defining Result Sets With ML Services

Dave Mason covers a pain point in SQL Server Machine Learning Services:

The example above is so simple, defining the RESULT SETS poses no problems. But what if the format of the output isn’t known at design time? R (or Python) might take the input data set and add, remove, or change columns conditionally. Further, the input data set might not even be known at design time. How would you define the RESULT SETS at run time?

WITH RESULT SETS needs a MAKE_A_GUESS or FIGURE_IT_OUT option. If there’s some other type of “easy button” for this, I haven’t found it.

It would be nice if the service could the ability to read the data frame columns and use those by default.

Comments closed

Scraping SQL Saturday Statistics

Tomaz Kastrun shows how to use rvest to read the SQL Saturday website and parse schedule details:

I wanted to check a simple query: How many times has a particular topic been presented and from how many different presenters.

Sounds interesting, tackling the problem should not be a problem, just that the end numbers may vary, since there will be some text analysis included.

Read on for the code and some analysis.

Comments closed

Quickly Computing Area Under The Curve

Jean-Francois Puget has a fast method for computing Area Under the Curve in Python:

When the target only takes two values we have a binary classification problem at hand.  Example of binary classification are very common. For instance fraud detection where examples are credit card transactions, features are time, location, amount, merchant id, etc., and target is fraud or not fraud.  Spam detection is also a binary classification where examples are emails, features are the email content as a string of words, and target is spam or not spam.  Without loss of generality we can assume that the target values are 0 and 1, for instance 0 means no fraud or no spam, whiloe 1 means fraud or spam.

For binary classification, predictions are also binary.  Therefore, a prediction is either equal to the target, or is off the mark.  A simple way to evaluate model performance is accuracy: how many predictions are right? For instance, if our test set has 100 examples in it, how many times is the prediction correct?  Accuracy seems a logical way to evaluate performance: a higher accuracy obviously means a better model.  At least this is what people think when they are exposed to the first time to binary classification problems.  Issue is that accuracy can be extremely misleading.

Read Jean-Francois’ explanation and scroll down for the Python sample.

Comments closed

Offline Installation Of SQL Server 2017 ML Services

Jan Mulkens shows how to install SQL Server 2017 Machine Learning Services when your the server hosting SQL Server doesn’t have outbound internet access:

That’s when you remember it… Your server isn’t connected to the internet!
Pretty normal, but in your enthusiasm you completely forgot that SQL Server needs to download some binaries for the R and Python components you so desperately want on your precious machine!

Luckily, the installer comes to your rescue and shows you where to download those binaries it needs.
Turns out however… This link only is for one R component and the installer won’t let you pass to the next screen!

Read on for the answer.

Comments closed

TensorFlow Tutorial

Ashish Bakshi has a TensorFlow tutorial:

As shown in the image above, tensors are just multidimensional arrays, that allows you to represent data having higher dimensions. In general, Deep Learning you deal with high dimensional data sets where dimensions refer to different features present in the data set. In fact, the name “TensorFlow” has been derived from the operations which neural networks perform on tensors. It’s literally a flow of tensors. Since, you have understood what are tensors, let us move ahead in this TensorFlow tutorial and understand – what is TensorFlow?

The sample here is Python, though there is an R library as well.

Comments closed

Creating A Poekr AI In Python

Kevin Jacobs has a fairly simple framework for building poker-playing bots:

The bot uses Monte Carlo simulations running from a given state. Suppose you start with 2 high cards (two Kings for example), then the chances are high that you will win. The Monte Carlo simulation then simulates a given number of games from that point and evaluates which percentage of games you will win given these cards. If another King shows during the flop, then your chance of winning will increase. The Monte Carlo simulation starting at that point, will yield a higher winning probability since you will win more games on average.

If we run the simulations, you can see that the bot based on Monte Carlo simulations outperforms the always calling bot. If you start with a stack of $100,-, you will on average end with a stack of $120,- (when playing against the always-calling bot).

It’s a start, and an opening for more sophisticated logic and analysis.

Comments closed

Python For The DBA: Copying SQL Logins

David Fowler has an example of how DBAs can use Python to do something interesting:

There are plenty of times when you might want to copy your SQL logins (including the SID) from one server to another.  Perhaps you’re running an AG and need to make sure that all users exist on all your secondaries with the correct SID, perhaps you’re migrating servers and need all the logins on your new server.  Whatever the reason, there are a number of different ways in which you can do this but they usually require scripting out on one server and then running the script into another server, or of course there’s Powershell.

The below script will use Python to copy all or specified logins from one server to another, including the password and SID.

Click through for the script.  The main use case for a SQL Server DBA to learn Python as a DBA scripting language would be if you run SQL on Linux—I don’t think Powershell on Linux is far enough developed to handle the full range of DBA tasks.  Otherwise, I’d use Powershell and dbatools.

Comments closed

Sentiment Analysis With Python In SQL Server

Nellie Gustafsson has a quick example of sentiment analysis using SQL Server Machine Learning Services:

You don’t have to be a data scientist to use machine learning in SQL Server. You can use pre-trained models available for usage out of the box to do your analysis. The following example shows you how you quickly get started and do text sentiment analysis.

Before starting to use this model, you need to install it. The installation is quick and instructions for installing the model can be found here: How to install the models on SQL Server

Once you have SQL Server installed with Machine Learning Services, enabled external script execution, and installed the pre-trained model, you can execute the following  script to create a stored procedure that uses Python and the microsoftml function get_sentiment with the pre-trained model to determine the probability of positive sentiment of a text:

Click through to read the whole thing.

Comments closed

Vectorized UDFs For PySpark

Li Jin talks about a performance optimization coming in Apache Spark 2.3:

To enable data scientists to leverage the value of big data, Spark added a Python API in version 0.7, with support for user-defined functions. These user-defined functions operate one-row-at-a-time, and thus suffer from high serialization and invocation overhead. As a result, many data pipelines define UDFs in Java and Scala, and then invoke them from Python.

Vectorized UDFs built on top of Apache Arrow bring you the best of both worlds—the ability to define low-overhead, high performance UDFs entirely in Python.

This looks like a good performance improvement coming to PySpark, bringing it closer to Scala/Java performance with respect to UDFs.

Comments closed

Kaggle Data Science Report For 2017

Mark McDonald rounds up a few notebooks covering a recent Kaggle survey:

In 2017 we conducted our first ever extra-large, industry-wide survey to captured the state of data science and machine learning.

As the data science field booms, so has our community. In 2017 we hit a new milestone of reaching over 1M registered data scientists from almost every country in the world. Representing many different backgrounds, skill levels, and professions, we were excited to ask our community a wide range of questions about themselves, their skills, and their path to data science. We asked them everything from “what’s your yearly salary?” to “what’s your favorite data science podcasts?” to “what barriers are faced at work?”, letting us piece together key insights about the people and the trends behind the machine learning models.

Without further ado, we’d love to share everything with you. Over 16,000 responses surveys were submitted, with over 6 full months of aggregated time spent completing it (an average response time of more than 16 minutes).

Click through for a few reports.  Something interesting to me is that the top languages/tools were, in order, Python, R, and SQL.  For the particular market niche that Kaggle competitions fit, that makes a lot of sense:  I tend to like R more for data exploration and data cleansing, but much of that work is already done by the time you get the dataset.

Comments closed