It’s never been easy to get TensorFlow installed on a Pi though. I had created a makefile script that let you build the C++ part from scratch, but it took several hours to complete and didn’t support Python. Sam Abrahams, an external contributor, did an amazing job maintaining a Python pip wheel for major releases, but building it required you to add swap space on a USB device for your Pi, and took even longer to compile than the makefile approach. Snips managed to get TensorFlow cross-compiling for Rust, but it wasn’t clear how to apply this to other languages.
Plenty of people on the team are Pi enthusiasts, and happily Eugene Brevdo dived in to investigate how we could improve the situation. We knew we wanted to have something that could be run as part of TensorFlow’s Jenkins continuous integration system, which meant building a completely automatic solution that would run with no user intervention. Since having a Pi plugged into a machine to run something like the makefile build would be hard to maintain, we did try using a hosted server from Mythic Beasts. Eugene got the makefile built going after a few hiccups, but the Python version required more RAM than was available, and we couldn’t plug in a USB drive remotely!
Read the whole thing, even if for the science experiment aspect.
Recursion is a topic in mathematics and computer science. In computer programming languages, the term recursion refers to a function that calls itself. Another way of putting it would be a function definition that includes the function itself in its definition. One of the first warnings I received when my computer science professor talked about recursion was that you can accidentally create an infinite loop that will make your application hang. This can happen because when you use recursion, your function may end up invoking itself infinitely. So, as with any other potential infinite loop, you need to make sure you have a way to break out of the loop. The idea in most recursive functions is to break up the procedure being done into smaller pieces that we can still process with the same function.
Read on for a couple quick recursion scenarios.
Tim Sweester and Aaron Bradley announce Diamond, a Python library which solves certain kinds of generalized linear models. In a two-part series, they explain more. Part 1 covers the mathematical principles behind it:
Many computational problems in data science and statistics can be cast as convex problems. There are many advantages to doing so:
- Convex problems have a unique global solution, i.e. there is one best answer
- There are well-known, efficient, and reliable algorithms for finding it
One ubiquitous example of a convex problem in data science is finding the coefficients of an-regularized logistic regression model using maximum likelihood. In this post, we’ll talk about some basic algorithms for convex optimization, and discuss our attempts to make them scale up to the size of our models. Unlike many applications, the “scale” challenge we faced was not the number of observations, but the number of features in our datasets. First, let’s review the model we want to fit.
In this example, GLMMs allow you to pool information across different brands, while still learning individual effects for each brand. It breaks the problem into sets of fixed and random effects. The fixed effects are similar to what you would find in a traditional logistic regression model, while the random effects allow the regression relationship to vary for each brand. One of the advantages of GLMMs is that they learn how different brands are from each other. Brands that are very similar to the overall average will have small random effect estimates. Because of the regularization of these models, brands with few observations will also have small random effect estimates, and be treated more like the overall average. In contrast, for brands that are very different from the average, with lots of data to support that, GLMMs will learn large random effect estimates.
Check it out. Part 2 also contains a link to the GitHub repo if you want to try it on your own.
The process for using Python in SQL Server is very similar to the previous process of installing R. Microsoft renamed R Services to Machine Learning Services, and now allows both R and Python to be installed, as shown in the screen. Microsoft’s version of Python uses Anaconda, which is an open source analytics platform created by Continuum. This is where Python differs from other open source languages, as Continuum is providing the version of Python as it contains data science components which are not included in the standard distribution of Python. Continuum also sells an enterprise version of Anaconda, with of course more features than come with the free version. It is important to remember the python environment as you will need select the same distribution when running Python code outside of SQL Server.
Read on to see how to install Python support in SQL Server 2017 and for a few links to tools.
To put the graph database to the test, I took bunch of emails from a particular MVP SQL Server distribution list (content will not be shown and all the names will be anonymized). On my gmail account, I have downloaded some 90MiB of emails in mbox file format. With some python scripting, only FROM and SUBJECTS were extracted:writer.writerow(['from','subject']) for index, message in enumerate(mailbox.mbox(infile)): content = get_content(message) row = [ message['from'].strip('>').split('<')[-1], decode_header(message['subject']),"|" ] writer.writerow(row)
This post walks you through loading data, mostly. But at the end, you can see how easy it is to find who replied to whose e-mails.
Python supports a limited number of data types in comparison to SQL Server. As a result, whenever you use data from SQL Server in Python scripts, the data might be implicitly converted to a type compatible with Python. However, often an exact conversion cannot be performed automatically, and an error is returned. This table lists the implicit conversions that are provided. Other data types are not supported.
This article will get you started, and from there, the wide world of Anaconda awaits you.
Unfortunately, although it gave me better results locally it got a worse score on the unseen data, which I figured meant I’d overfitted the model.
I wasn’t really sure how to work out if that theory was true or not, but by chance, I was reading Chris Albon’s blog and found a post where he explains how to inspect the importance of every feature in a random forest. Just what I needed!
There’s a nagging voice in my head saying “Principal Component Analysis” as I read this post.
Apparently, the data consists of 28 variables (V1, …, V28), an “Amount” field a “Class” field and the “Time” field. We do not know the exact meanings of the variables (due to privacy concerns). The Class field takes values 0 (when the transaction is not fraudulent) and value 1 (when a transaction is fraudulent). The data is unbalanced: the number of non-fraudulent transactions (where Class equals 0) is way more than the number of fraudulent transactions (where Class equals 1). Furthermore, there is a Time field. Further inspection shows that these are integers, starting from 0.
There is a small trick for getting more information than only the raw records. We can use the following code:print(df.describe())
This code will give a statistically summary of all the columns. It shows for example that the Amount field ranges between 0.00 and 25691.16. Thus, there are no negative transactions in the data.
The Kaggle competition data set is available, so you can follow along.
With variance score of 0.43 linear regression did not do a good job overall. When the x values are close to 0, linear regression is giving a good estimate of y, but we near end of x values the predicted y is far way from the actual values and hence becomes completely meaningless.
Here is where Quantile Regression comes to rescue. I have used the python package statsmodels 0.8.0 for Quantile Regression.
Let us begin with finding the regression coefficients for the conditioned median, 0.5 quantile.
The article doesn’t render the code very well at all, but Gopi does have the example code on Github, so you can follow along that way.
In the call to the
producemethod, both the
valueparameters need to be either a byte-like object (in Python 2.x this includes strings), a Unicode object, or
None. In Python 3.x, strings are Unicode and will be converted to a sequence of bytes using the UTF-8 encoding. In Python 2.x, objects of type
unicodewill be encoded using the default encoding. Often, you will want to serialize objects of a particular type before writing them to Kafka. A common pattern for doing this is to subclass
Producerand override the
producemethod with one that performs the required serialization.
The produce method returns immediately without waiting for confirmation that the message has been successfully produced to Kafka (or otherwise). The
flushmethod blocks until all outstanding produce commands have completed, or the optional timeout (specified as a number of seconds) has been exceeded. You can test to see whether all produce commands have completed by checking the value returned by the
flushmethod: if it is greater than zero, there are still produce commands that have yet to complete. Note that you should typically call
flushonly at application teardown, not during normal flow of execution, as it will prevent requests from being streamlined in a performant manner.
This is a fairly gentle introduction to the topic if you’re already familiar with Python and have a familiarity with message broker systems.