The Assumptive Nature Of R

Tim Sweetser and Kyle Schmaus explain some of the less-obvious bits of R that make it harder to use as a production language:

For us, the biggest surprise when using an R data.frame is what happens when you try to access a nonexistent column. Suppose we wanted to do something with the prices of our diamonds. price is a valid column of diamonds, but say we forgot the name and thought it was title case. When we ask for diamonds[["Price"]], R returns NULL rather than throwing an error! This is the behavior not just for tibble, but for data.tableand data.frame as well. For production jobs, we need things to fail loudly, i.e. throw errors, in order to get our attention. We’d like this loud failure to occur when, for example, some upstream data change breaks our script’s assumptions. Otherwise, we assume everything ran smoothly and as intended. This highlights the difference between interactive use, where R shines, and production use.

Read on for several good points along these lines.

Great Circles In R

Yan Holtz shows how to draw great circles using an R package called geosphere:

This post explains how to draw connection lines between several localizations on a map, using R. The method proposed here relies on the use of the gcIntermediate function from the geosphere package. Instead of making straight lines, it offers to draw the shortest routes, using great circles. A special care is given for situations where cities are very far from each other and where the shortest connection thus passes behind the map.

Now we know how to make pretty-looking global route charts.

Modular, Production-Ready R

David Smith highlights Syberia, a development framework for productionalizing R code:

Syberia also encourages you to break up your process into a series of distinct steps, each of which can be run (and tested) independently. It also has a make-like feature, in that results from intermediate steps are cached, and do not need to be re-run each time unless their dependencies have been modified.

Syberia can also be used to associate specific R versions with scripts, or even other R engines like Microsoft R. I was extremely impressed when during a 30-minute-break at the R/Finance conference last month, Robert was able to sketch out a Syberia implementation of a modeling process using the RevoScaleR library. In fact Robert’s talk from the conference, embedded below, provides a nice introduction to Syberia.

Interesting stuff.  If you’re working with models in R today, this could be up your alley.

Finding Failed Agent Jobs

Adrian Buckman has a stored procedure which retrieves failed SQL Agent jobs over a given timeframe:

So here is what it does:

  1. Check for failed agent jobs within the dates you specify (provided the agent log has data for this period) @FromDate will default to the past 12 hours if  no value is passed in, @ToDate will default to GetDate() if no value is passed in.

  2. Check that any failed jobs that have occurred within the date range have not subsequently succeeded, if the job has since succeeded since its last failure then this job will not be included in the results.

  3. Check that the job is not currently running, if the job is still running at the time of executing the procedure then this job will be excluded from the results.

  4. If a failed agent job exists that passes through the above checks then the Last Failure message will be obtained for the job and shown in the results along with the Job name, Failed Step and Failed Datetime.

Read on for the script.

Performance Effects Of Constraints

Shane O’Neill discusses how constraints can in some cases improve database performance:

David Morrison ( blog | twitter ) followed up with his presentation on “Query Plan Deep Dives” (I had seen this at SQL Bits, but it’s a great session so I had no problems watching it again) and, as an aside, through his presentation he showed us the compound effects that can happen from not specifying a columns NULLABILITY(it’s got letters so it’s word-like…)

Now his slides and scripts are up on his blog and they do a great job of walking you through them so check them out and you’ll see the compound effects they create!

Shane’s overview of two presentations was interesting, but despite all his rage, he is still just a Nicholas Cage.

Facial Recognition With Amazon Rekognition

Chris Adzima describes how his law enforcement agency uses Amazon Rekognition for facial recognition:

Setup was fairly straightforward. In the Washington County jail management system (JMS), we have an archive of mugshots going back to 2001. We needed to get the mugshots (all 300,000 of them) into Amazon S3. Then we need to index them all in Amazon Rekognition, which took about 3 days.

Our JMS allows us to tag the shots with the following information: front view or side view, scars, marks, or tattoos. We only wanted the front view, so we used those tags to get a list of just those.

Read on for sample implementation details, including moving images to S3, building the facial recognition “database,” and using it.

Foreign Key Batch Drop And Re-Create

Louis Davidson has a utility to batch drop and re-create foreign key constraints:

I was working on a database a few weeks back (maybe longer, I am kind of busy with several projects right now), and I wanted to truncate a set of tables. There was one central table, and 8 tables related by a FOREIGN KEY constraint. The central table had 6 million rows, and a few of the other tables 20+ million. TRUNCATE is amazingly fast,and when I tried to just use DELETE, it took almost as long as it took me to load the tables. (Exaggeration? A little bit, but not as much as you might imagine, since I use SSIS to BULK LOAD these tables, and the source data is very simple.)

I could just get rid of the constraints, but as a relational db lover, I love constraints. When enforced, they protect me. When not enforced (like when SSIS disables them for a BULK LOAD operation), they are still good documentation. So as any good nerd programmer type would, I started coding a tool to deal with the constraints for me. And as a decent blogger, as soon as it started to get interesting, I realized I could blog about it and upload the code to my website. This in the end makes the code better, because I have to test more, and I learn stuff from readers reminding me things (like a FK script needs to honor CASCADE and NOT FOR REPLICATION, oops.)

Read on for a link to the utility, as well as an example.

The Costs Of Statistics Updates With FULLSCAN

Kendra Little explains what happens when you update a table’s statistics with FULLSCAN:

On my test instance, the command that uses the default sampling takes 6 seconds to complete.

The command which adds “WITH FULLSCAN” takes just over five minutes to complete.

The reason is that those two little words can add a whole lot of extra IO to the work of updating statistics.

Kendra shows the query plans for each statistics update in some detail.  It’s a very interesting post, well worth taking the time to read.

DevOps T-SQL Tuesday Roundup

Grant Fritchey puts a bow on the latest T-SQL Tuesday:

Thank you to everyone who participated in T-SQL Tuesday #091 which was on databases and DevOps. As I anticipated, this brought out quite a bit of variety on the posts. This is because DevOps is still… not quite cooked…(?) in many peoples minds. I think with the range of posts we saw here, it’ll be a lot more clear to those who are just getting an introduction to it. Here are the posts (in no particular order) and a few comments on each:

Click through to see all the entrants.


June 2017
« May