To put the graph database to the test, I took a bunch of emails from a particular MVP SQL Server distribution list (content will not be shown and all names will be anonymized). From my Gmail account, I downloaded some 90 MiB of emails in mbox format. With some Python scripting, only the FROM and SUBJECT fields were extracted:

writer.writerow(['from', 'subject'])
for index, message in enumerate(mailbox.mbox(infile)):
    content = get_content(message)
    row = [
        message['from'].strip('>').split('<')[-1],
        decode_header(message['subject']),
        "|"
    ]
    writer.writerow(row)
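Fleshed out, that extraction might look like the following runnable sketch. The `decode_subject` helper is my own stand-in for the original post's `get_content`/`decode_header` plumbing, so treat the details as assumptions:

```python
import csv
import mailbox
from email.header import decode_header, make_header

def decode_subject(raw):
    # Decode RFC 2047 encoded subject lines ("=?utf-8?..." style); raw may be None
    if raw is None:
        return ''
    return str(make_header(decode_header(raw)))

def extract_from_and_subject(infile, outfile):
    # Walk every message in the mbox and write sender address + subject to CSV
    with open(outfile, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['from', 'subject'])
        for message in mailbox.mbox(infile):
            # "Alice <alice@example.com>" -> "alice@example.com"
            sender = (message['from'] or '').strip('>').split('<')[-1]
            writer.writerow([sender, decode_subject(message['subject'])])
```

From the CSV, loading the sender/subject pairs into graph nodes and edges is straightforward.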
This post mostly walks you through loading the data, but at the end you can see how easy it is to find who replied to whose e-mails.
Python supports a limited number of data types in comparison to SQL Server. As a result, whenever you use data from SQL Server in Python scripts, the data might be implicitly converted to a type compatible with Python. However, often an exact conversion cannot be performed automatically, and an error is returned. This table lists the implicit conversions that are provided. Other data types are not supported.
This article will get you started, and from there, the wide world of Anaconda awaits you.
Unfortunately, although it gave me better results locally it got a worse score on the unseen data, which I figured meant I’d overfitted the model.
I wasn’t really sure how to work out if that theory was true or not, but by chance, I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!
There’s a nagging voice in my head saying “Principal Component Analysis” as I read this post.
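The technique the post describes boils down to reading `feature_importances_` off a fitted forest. A minimal sketch, assuming scikit-learn and using a synthetic dataset rather than the competition data:

```python
# Sketch of inspecting random forest feature importances (assumes scikit-learn);
# the dataset is synthetic, standing in for the competition data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ sums to 1.0; a higher score means the feature
# contributed more to the forest's splits
ranked = sorted(zip([f'f{i}' for i in range(6)], model.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f'{name}: {score:.3f}')
```

Low-importance features are candidates for pruning, which is one way to fight the overfitting described above.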
Apparently, the data consists of 28 variables (V1, …, V28), an “Amount” field, a “Class” field, and a “Time” field. We do not know the exact meanings of the variables (due to privacy concerns). The Class field takes the value 0 when a transaction is not fraudulent and 1 when it is. The data is unbalanced: the number of non-fraudulent transactions (Class = 0) far exceeds the number of fraudulent ones (Class = 1). Further inspection of the Time field shows that its values are integers, starting from 0.
There is a small trick for getting more information than just the raw records. We can use the following code:

print(df.describe())

This code gives a statistical summary of all the columns. It shows, for example, that the Amount field ranges between 0.00 and 25691.16; thus, there are no negative transactions in the data.
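A toy sketch of that summary, assuming pandas; the tiny frame below is made-up stand-in data, not the actual Kaggle set:

```python
# Sketch of pandas' describe() summary (assumes pandas); toy stand-in data
import pandas as pd

df = pd.DataFrame({'Time': [0, 0, 1, 2],
                   'Amount': [149.62, 2.69, 378.66, 123.50],
                   'Class': [0, 0, 0, 1]})

# describe() reports count, mean, std, min, quartiles, and max per column
summary = df.describe()
print(summary)
print(summary.loc['min', 'Amount'])  # smallest transaction amount
```

Reading the `min` row of `Amount` is exactly how you confirm there are no negative transactions.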
The Kaggle competition data set is available, so you can follow along.
With a variance score of 0.43, linear regression did not do a good job overall. When the x values are close to 0, linear regression gives a good estimate of y, but near the end of the x range the predicted y values are far from the actual values and become essentially meaningless.

This is where Quantile Regression comes to the rescue. I have used the Python package statsmodels 0.8.0 for Quantile Regression.
Let us begin with finding the regression coefficients for the conditioned median, 0.5 quantile.
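A sketch of that median fit, assuming statsmodels' `quantreg` formula API; the data here is synthetic (with noise that grows with x, the situation where quantile regression shines) and only illustrates the shape of the call:

```python
# Median (0.5-quantile) regression sketch (assumes statsmodels and pandas);
# synthetic heteroscedastic data stands in for the post's dataset
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + rng.normal(0, 1 + x)     # noise variance grows with x
df = pd.DataFrame({'x': x, 'y': y})

model = smf.quantreg('y ~ x', df)
res = model.fit(q=0.5)                 # q=0.5 fits the conditional median
print(res.params)
```

Fitting several quantiles (say q = 0.1, 0.5, 0.9) with the same `model` object gives a band that shows how the spread of y changes across x.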
The article doesn’t render the code very well at all, but Gopi does have the example code on Github, so you can follow along that way.
In the call to the produce method, both the key and value parameters need to be either a byte-like object (in Python 2.x this includes strings), a Unicode object, or None. In Python 3.x, strings are Unicode and will be converted to a sequence of bytes using the UTF-8 encoding. In Python 2.x, objects of type unicode will be encoded using the default encoding. Often, you will want to serialize objects of a particular type before writing them to Kafka. A common pattern for doing this is to subclass Producer and override the produce method with one that performs the required serialization.

The produce method returns immediately without waiting for confirmation that the message has been successfully produced to Kafka (or otherwise). The flush method blocks until all outstanding produce commands have completed, or the optional timeout (specified as a number of seconds) has been exceeded. You can test whether all produce commands have completed by checking the value returned by the flush method: if it is greater than zero, there are still produce commands that have yet to complete. Note that you should typically call flush only at application teardown, not during the normal flow of execution, as it will prevent requests from being streamlined in a performant manner.
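The subclass-and-override pattern can be sketched as follows. To keep the example runnable without a broker, the base class here is an in-memory stand-in for confluent_kafka.Producer (an assumption on my part; in real code you would subclass the real Producer):

```python
import json

# _InMemoryProducer is a stand-in for confluent_kafka.Producer so this
# sketch runs without a Kafka broker; swap in the real class in production.
class _InMemoryProducer:
    def __init__(self):
        self.sent = []

    def produce(self, topic, value=None, key=None):
        # Like the real produce(): accepts bytes-like objects, str, or None,
        # and returns immediately without waiting for delivery confirmation
        self.sent.append((topic, value, key))

    def flush(self, timeout=None):
        return 0  # number of produce commands still outstanding

class JsonProducer(_InMemoryProducer):
    """Serialize Python objects to UTF-8 JSON bytes before producing."""

    def produce(self, topic, value=None, key=None):
        payload = None if value is None else json.dumps(value).encode('utf-8')
        super().produce(topic, payload, key)

producer = JsonProducer()
producer.produce('logins', {'user': 'alice', 'ok': True})
# flush() returning 0 means no produce commands are left outstanding
print(producer.flush())
```

Callers keep passing plain dicts; the serialization detail stays in one place.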
This is a fairly gentle introduction to the topic if you’re already familiar with Python and with message broker systems.
The winning strategy of the last round, submitted by Vince Vatter, was (0, 1, 2, 16, 21, 3, 2, 1, 32, 22), with an official record of 751 wins, 175 losses, and 5 ties. Naturally, the top-performing strategies look similar. This should not be surprising; winning strategies exploit common vulnerabilities among submissions.
I’ve downloaded the submitted strategies for the second round (I already have the first round’s strategies). Let’s load them in and start analyzing them.
This is a great blog post, which looks at using evolutionary algorithms to evolve a winning strategy.
Earlier today, two new command line tools were announced for SQL Server: first, an experimental Linux tool, DBFS, which enables access to live DMVs without using a UI like SSMS; and second, a tool that enables script generation of objects within SQL Server, rather like the Generate SQL Scripts option in SSMS.

In this post I’m going to run through the installation of the script generator tool and provide a very quick demo. The reason I’m going through this is that in order to install the tool, we need to use something called pip. pip is a package management system that enables us to install and use packages written in Python. Yeah, Python again!
I’m pretty interested in DBFS, as it seems well-placed to make crusty Linux sysadmins happier with SQL Server, and that’s a big positive in my book.
The new Visual Studio 2017 has built-in support for programming in R and Python. For older versions of Visual Studio, support for these languages has been available via the RTVS and PTVS add-ins, but the new Data Science Workloads in Visual Studio 2017 make them available without a separate add-in. Just choose the “Data Science and analytical applications” option during installation to install everything you need, including Microsoft R Client and the Anaconda Python distribution.
I’m personally going to wait a little bit before jumping onto Visual Studio 2017, but I’m glad that RTVS is now available.
For the rest of this post, I assume that you have some basic familiarity with Python, Pandas and Jupyter.
On your machine, you will need all of the following installed:
Python 2 or 3 with pip
Amit shows two separate methods for retrieving data, so check it out.