Writing Vectorized Code In R

John Mount helps us understand writing R code like a native:

This sort of difference, scalar oriented C++ being so much faster than scalar oriented R, is often distorted into “R is slow.”
This is just not the case. If we adapt the algorithm to be vectorized we get an R algorithm with performance comparable to the C++implementation!
Not all algorithms can be vectorized, but this one can, and in an incredibly simple way. The original algorithm itself (xlin_fits_R()) is a bit complicated, but the vectorized version (xlin_fits_V()) is literally derived from the earlier one by crossing out the indices. That is: in this case we can move from working over very many scalars (slow in R) to working over a small number of vectors (fast in R).

This is akin to writing set-based SQL instead of cursor-based SQL: you’re thinking in terms which make it easier for the interpreter (or optimizer, in the case of a database engine) to operate quickly over your inputs. It’s also one of a few reasons why I think learning R makes a lot of sense when you have a SQL background.

gganimate Now On CRAN

Thomas Lin Pedersen announces that gganimate is now available on CRAN:

While this commit was done in the autumn 2017, nothing further happened until I decided to make gganimate the center of my useR 2018 keynote, at which point I was forced (by myself) to have some sort of package ready by the summer of 2018.
A fair amount of users have shown displeasure in the breaking changes this history has resulted in. Many blog posts have already been written focusing on the old API, as well as code on numerous computers that will no longer work. I understand this frustration, of course, but both me and David agreed that doing it this way was for the best in the end. I’m positive that the new API has already greatly exceeded the mind-share of the old API and given a year the old API will be all but a distant memory…

Read on for information on these breaking changes, and how the changes will make life easier in the long run. And stay for the fireworks. H/T R-Bloggers

Practical AI Workshop Notebooks

David Smith has published a set of notebooks from the Practical AI for the Working Software Engineer workshop:

Last month, I delivered the one-day workshop Practical AI for the Working Software Engineer at the Artificial Intelligence Live conference in Orlando. As the title suggests, the workshop was aimed at developers, bu I didn’t assume any particular programming language background. In addition to the lecture slides, the workshop was delivered as a series of Jupyter notebooks. I ran them using Azure Notebooks (which meant the participants had nothing to install and very little to set up), but you can run them in any Jupyter environment you like, as long as it has access to R and Python. You can download the notebooks and slides from this Github repository (and feedback is welcome there, too). 

Read on for details about those notebooks and to get your own copies.

Finding Max Concurrent Operations With T-SQL

Kevin Feasel

2019-01-04

T-SQL

I have a post up showing how to calculate the maximum number of concurrent operations using T-SQL:

You can probably see by this point how the pieces are coming together:  each time frame has a starting point and an ending point.  If there were no overlap at all, we’d see in the fourth column a number followed by a NULL, followed by a number followed by a NULL, etc.  But we clearly don’t see that:  we see work item ordinals 3 and 4 share some overlap:  item 3 started at 3:06:15 PM and ended after item 4’s start of 3:07:20 PM.  This means that those two overlapped to some extent.  Then we see two NULL values, which means they both ended before 5 began.  So far so good for our developers!

Click through for a bunch of T-SQL scripts, images, and important advice about always having interns around to take the blame.

Visualizing Hash Match Joins

Bert Wagner continues his series on visualizing physical join operators:

Hash Match joins are the dependable workhorses of physical join operators.
While Nested Loops joins will fail if the data is too large to fit into memory, and Merge Joins require that the input data are sorted, a Hash Match will join any two data inputs you throw at it (as long as the join has an equality predicate and you have enough space in tempdb).  

Bert has some great animated GIFs too.

Deep Dive On The Adaptive Join Operator

Hugo Kornelis covers the Adaptive Join operator:

The Adaptive Join operator was added in SQL Server 2017 as an alternative to the other join operators: Nested Loops (ideal for joining a small data stream with a cheap input), Hash Match (most effective for joining large unsorted sets) and Merge Join (ideal for joining data streams that are sorted by the join key). It is intended to be used when there is no efficient way to fulfill the order requirement of the Merge Join, and the optimizer cannot reliably predict which of the remaining algorithms (Hash Match or Nested Loops) would perform best.
Because it has to be able to join the data using either the Nested Loops or the Hash Match algorithm, Adaptive Join suffers from the combined restrictions of these operators. As such, Adaptive Join supports only four logical join operations: inner join, left outer join (but not the probed version), left semi join, and left anti semi join;  it requires at least one equality-based join predicate, it uses lots of memory, and it is semi-blocking.

Read on for a detailed look at this operator.

Splatting In Powershell

Chrissy LeMaire explains how you can simplify Powershell cmdlet calls using hashtables:

Splatting in PowerShell makes code easier to read. Instead of typing a bunch of parameters allllll across the screen, you can use an easy-to-read hashtable or array. Argument splatting was introduced in PowerShell v3 and works with all PowerShell commands, not just dbatools.
Note: I’ve only used splatting with hashtables, as they allow me to be explicit about which parameters I’m passing. It appears that arrays would employ positional parameters, which is less wordy but leaves room for error.

Whenever I hear the word “splatting” I think of the Naked Gun series of movies and OJ Simpson getting run over by a steamroller. Those were some funny movies, so that’s a good connotation.

Design Tips For Scaling Systems

Erik Darling has a few ideas for how you can design that SQL Server instance and database for future growth:

I can’t begin to tell you how many terrible things you can avoid by starting your apps out using an optimistic isolation level. Read queries and write queries can magically exist together, at the expense of some tempdb.
Yes, that means you can’t leave transactions open for a very long time, but hey, you shouldn’t do that anyway.
Yes, that means you’ll suffer a bit more if you perform large modifications, but you should be batching them anyway.

Optimistic concurrency is huge—definitely worth the top slot in Erik’s list.

Query Store & Max Storage Size

Arthur Daniels points out that Query Store’s max size is a suggestion rather than a requirement:

But the MAX_STORAGE_SIZE_MB is not a hard limit
I’ve seen this occur in production systems with heavy ad-hoc workloads. As the Query Store gathers data, there seems to be a point where the writes are queued up. When they are written to disk, the max size can be exceeded.
Anecdotally, I’ve seen a production systems where the max size has been vastly exceeded. But let me show you an example on a small scale on my demo machine.

Click through for a demonstration on the small. I’d much rather go over my self-imposed limit than lose useful data, so I’m okay with that decision. If you’re extremely tight on disk space, however, it’s good to know this detail.

Dynamic Programming In R With RCppDynProg

John Mount has a new package available in R:

In the above we have an input (or independent variable) x and an observed outcome (or dependent variable) y_observed (portrayed as points). y_observed is the unobserved idea value y_ideal (portrayed by the dashed curve) plus independent noise. The modeling goal is to get close the y_ideal curve using the y_observed observations. Obviously this can be done with a smoothing spline, but let’s use RcppDynProg to find a piecewise linear fit.
To encode this as a dynamic programming problem we need to build a cost matrix that for every consecutive interval of x-values we have estimated the out-of sample quality of fit. This is supplied by the function RcppDynProg::lin_costs() (using the PRESS statistic), but lets take a quick look at the idea.

It’s an interesting package whose purpose is to turn an input data stream into a set of linear functions which approximate the stream. I’m not sure I’ll ever have a chance to use it, but it’s good to know that it’s there if I do ever need it.

Categories

January 2019
MTWTFSS
« Dec Feb »
 123456
78910111213
14151617181920
21222324252627
28293031