R Versus Python

Vincent Granville believes that Python is overtaking R in the realm of data science:

We use the app in question to compare search interest for R data Science versus Python Data Science, see above chart.  It looks like until December 2016, R dominated, but fell below Python by early 2017. The above chart displays an interest index, 100 being maximum and 0 being minimum. Click here to access this interactive chart on Google, and check the results for countries other than US, or even for specific regions such as California or New York.

Note that Python always dominated R by a long shot, because it is a general-purpose language, while R is a specialized language. But here, we compare R and Python in the niche context of data science. The map below shows interest for Python (general purpose) per region, using the same Google index in question.

It’s an interesting look at the relative shift between R and Python as a primary language for statistical analysis.

The SQL Has Landed

Kevin Feasel



Drew Furgiuele’s SQL Server in Space project was a success:

We took quite a few readings to try and account for any variations, and when we were done we ended up with a little more lift than we calculated, but not by much. We then proceeded to tie off the balloon. This is by far the most stressful part of the entire operation. Weather balloons have what amounts to a very long, thick neck where you inflate them from. Unlike, say, a party balloon, you don’t just knot the end and call it a day. Instead, you need to tie one end of the neck off to prevent the escape of the helium, then bend the neck back. Then, you tie that off. You’re left with a “U” shaped bend at the bottom of the balloon that you then attach your payload train to: the parachute, then the reflector, and ultimately the payload. Oh, and you have to do this while, you know, trying to wrangle a 7 foot diameter balloon.

Tagged “cloud” even though it was a bright and sunny day.

Lambda And Kappa Architectures

Michael Verrilli has a post contrasting the Lambda and Kappa data architectures:

Any query may get a complete picture by retrieving data from both the batch views and the real-time views. The queries will get the best of both worlds. The batch views may be processed with more complex or expensive rules and may have better data quality and less skew, while the real-time views give you up to the moment access to the latest possible data. As time goes on, real-time data expires and is replaced with data in the batch views.

One additional benefit to this architecture is that you can replay the same incoming data and produce new views in case code or formula changes.

The biggest detraction to this architecture has been the need to maintain two distinct (and possibly complex) systems to generate both batch and speed layers. Luckily with Spark Streaming (abstraction layer) or Talend (Spark Batch and Streaming code generator), this has become far less of an issue… although the operational burden still exists.

I haven’t seen much on the topic of Big Data architectures this year; it seems like it was a much more popular topic last year.

Monitoring SQL Agent Job Failures

Mark Wilkinson shows how to set up a SQL Agent job failure monitoring solution:

Since we are storing the date the records are added to the table, this query will always return the latest set of failures. This is a simple example, but the possibilities are endless:

  • Send the results of this query via database mail

  • Join with dbo.sysjobs and dbo.syscategories, alerting on different thresholds per job category

  • Extend the TOP (1) to include multiple capture periods and alert on average failures per capture

Check it out.  This is particularly helpful if you get blasted with thousands of error messages per minute because somebody made a bunch of untested changes and broke every job in your environment and caused the mail server to throttle your account for a multi-day period.  Not that this has ever happened to me, of course…

Columnstore Segment Elimination Pains

Niko Neugebauer shows us two pain points with row group (segment) elimination on clustered columnstore indexes:

Still this gives us processing of just 1 Segment with 253 lob logical reads vs 6 Segments and 2669 lob logical reads in the original query.

But wait, that’s not all! It can get significantly worse!
Let’s use the inequality search with the OR condition, to see what happens – should be safe, right ?

Spoilers:  it’s not safe.  I’ve been burned on the Min + Max issue that Niko shows, where you’d think that it could eliminate all segments except for the first and last, but that’s not how the process works today.

SSIS Out Of Memory

Kevin Hill diagnoses an out-of-memory error in SQL Server Integration Services:

Notice the red sections above:

  • Not enough storage – was this physical memory, virtual memory or disk?

  • Buffer failed to allocate (10MB) – Why?

  • 2% memory load – yes, this box has 1 TB of RAM…plenty for what we are doing you would think.

  • Attempt to add a row – well, yeah…if SSIS uses buffers and cannot allocate one, there are either none, or they are full.  Makes logical sense to me

I won’t spoil the ending; you’ll have to read it over there.

Using Azure Active Directory With Azure SQL Database

Shannon Lowder shows how to integrate Azure Active Directory with Azure SQL Database:

Now that our instance is set up to use AAD, let’s connect to our instance from SSMS. If you’re running Management Studio 2016 and SQL Server Data Tools for 2015 (14.0.60311.1) or later you should have noticed there are some extra authentication methods available in SSMS now.  We’re going to cover these out of order, since some of these options take more work than others.

Active Directory Password Authentication looks similar to a SQL authentication, but it accepts AAD User names and passwords.  When you choose this method, your credentials are sent over to Azure and end at your AAD instance.  Once your username and password are validated, AAD will check to see if you have rights to connect to the instance.  If so, you will connect.  If not, you will get an error message that you’ll need to google (bing) to find out what it really means.

With the steps we took in the last section, you should be able to log in to your Azure SQL Server as an administrator by using Active Directory Password Authentication.

Click through for the process, as well as links to additional resources.

SQL Server In Kubernetes

Andrew Pruski shows us how to spin up SQL Server containers within Kubernetes running on Azure Container Services:

Looks pretty good to me! SQL is up and has accepted the config value within our yaml file to change the SA password. But how are we going to connect to it?

What we need to do now is define a Kubernetes service. A service is a level of abstraction over pods to allow connections to the containers within the pods regardless of where the pod is located in the cluster. So let’s setup a service.

Andrew does a good job of taking us through the process step by step.

Explaining Max Server Memory

Randolph West explains SQL Server memory limits in various editions:

For the sake of this post, I am talking specifically about SQL Server 2012 and higher.

Let’s break this down.

On all editions of SQL Server (Enterprise, Standard, Web, Express), there is a configuration setting called Max Server Memory. This value should not be left at the default 2.1 petabytes, because at that value SQL Server will use all the memory you allow it (plus a little bit more if it needs to), and once it has it, it doesn’t release it back to the operating system without a fight.

Read on to see what Max Server Memory actually includes, as well as additional limitations on editions other than Enterprise.

Tokenizing Text With R

Rachael Tatman shows how to tokenize a set of text as the first step in a natural language processing experiment:

In this tutorial you’ll learn how to:

  • Read text into R
  • Select only certain lines
  • Tokenize text using the tidytext package
  • Calculate token frequency (how often each token shows up in the dataset)
  • Write reusable functions to do all of the above and make your work reproducible

For this tutorial we’ll be using a corpus of transcribed speech from bilingual children speaking in English.  You can find more information on this dataset and download it here.

It’s a nice tutorial, especially because the data set is a bit of a mess.


August 2017
« Jul Sep »