A Cheat Sheet For Purrr

Kevin Feasel

2017-09-27

R

Mara Averick has a list of resources for learning more about purrr:

Purrr royal decree (ok, I’ll stop with the 🐱  puns now), the purrr 📦  now has its very own official RStudio cheat sheetApply Functions Cheat Sheet

The purrr package makes it easy to work with lists and functions. This cheatsheet will remind you how to manipulate lists with purrr as well as how to apply functions iteratively to each element of a list or vector. The back of the cheatsheet explains how to work with list-columns. With list columns, you can use a simple data frame to organize any collection of objects in R.

So, I thought we’d celebrate with a bit of a purrr 🐦  tweet roundup:

Purrr is one of those libraries I know I need to learn more about, and this looks like a good starting point.

Using Spark Streaming On Kafka

Ayush Tiwari has an introductory tutorial on using Spark Streaming on top of Kafka:

The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata. However, because the newer integration uses the new Kafka consumer API instead of the simple API, there are notable differences in usage. This version of the integration is marked as experimental, so the API is potentially subject to change.

In this blog, I am going to implement the basic example on Spark Structured Streaming & Kafka Integration.

This is a code-heavy tutorial, so check it out.

Fun With The Beta Distribution

John D. Cook shows how one chatoic equation just happens to follow a beta distribution:

Indeed the points do bounce all over the unit interval, though they more often bounce near one of the ends.

Does that distribution look familiar? You might recognize it from Bayesian statistics. It’s a beta distribution. It’s symmetric, so the two beta distribution parameters are equal. There’s a vertical asymptote on each end, so the parameters are less than 1. In fact, it’s a beta(1/2, 1/2) distribution. It comes up, for example, as the Jeffreys prior for Bernoulli trials.

The graph below adds the beta(1/2, 1/2) density to the histogram to show how well it fits.

It’s an interesting bit of math and statistics, and John provides some Python demo code at the end.

Building A Disaster Recovery Checklist

Robert Davis has a checklist for you:

If a disaster struck tomorrow, are you ready for it? Are you sure you thought of everything? I cannot count the number of times something happened, and I thought that I should have been able to anticipate that. It can be a big benefit to get another viewpoint on your disaster plan. No matter how well you think you’ve covered yourself, someone who thinks about things differently than you do can often see something you don’t.

Here’s a checklist of things you might not have considered. This can be your other set of eyes.

This is not a to-do list or a process guide when disaster strikes; rather, it’s a checklist of things you should have in place before a disaster occurs.

Minimizing Problem Reproductions

Lonny Niederstadt explains the value of a minimalistic repro:

Often among the hardest of my decisions is whether I should spend more time trying to simplify a given problem by eliminating additional factors, shrinking the data needed for a repro, etc… or just put all that effort into investigation purely aimed at understanding the behavior. I expect that to be a long-term challenge 🙂

I was quite happy with the way this particular one worked out, though. It started as a maze… access violations generated on a SQL Server 2016 instance with dozens of databases. The access violation came from an insert query using a synonym – with the underlying table in another database! (I didn’t know that was possible – or *ever* desirable – until I saw it in this context.) The AV was occurring during a multi-step data transfer process and it was hard to isolate the data flowing into and out of the two databases in question. But after some finagling, I got the problem repro pretty doggone small. Reproduced the AVs on several CUs of SQL Server 2016 and on SQL Server 2017 RC2.

If you think you’re going to enlist the help of someone outside your organization, then you definitely want a minimalistic repro.  That will reduce the risk of red herrings, reduce the burden of assistance, and make it much more likely that some poor sap in support can actually fix your problem if it turns out to be a bug.

Basics Of Powershell Modules

Tracy Boggiano shows you how to create a Powershell module:

The first file you need to create is a manifest file.  This file tells who wrote the module what version PowerShell should be used and module version.  You will need to open either PowerShell ISE or PowerShell under Run as administrator to run this command. You change the path to match the name you want the name you want your module to have along with the name of the module and author.  You may have to create the folder in your Modules folder (DBAFunctions) first depending on the security on your machine.

There isn’t too much to the process, but there is a big benefit:  code portability.

Working With Rowversion Data Types

Louis Davidson walks through some of the properties of rowversion data types:

For years, I had thought (and was probably taught in SQL.AlongTimeAgoInAPlaceFarFarAway) that the timestamp column (well before rowversion was a thing,) was not guaranteed to be an ever increasing value. But this is not the case.

In BOL (https://docs.microsoft.com/en-us/sql/t-sql/data-types/rowversion-transact-sql)  it states:

“The rowversion data type is just an incrementing number…”

This makes it useful for determining rows that have changed, because it it automatic and the user cannot override the value in the column. However, there is a major concern when you use rowversions, and that is what happens when you change the structure of the table, and expect the consumer to see that change. The problem is that when you change the structure of the table, the rowversion will not be set (except when adding a new rowversion column.)

I’m not much of a fan of rowversion and tend not to use it, but my biases don’t have to be yours.

Setting Data Types When Building Table Values In Power BI

Chris Webb shows how to use a parameter on the Table.AddColumn function in M to define specific data types:

In the last two screenshots the ABC123 icon in the column headers show that they are set to use the Any data type; the columns returned by calling the function have lost their data types.

The key to solving this problem is using the optional fourth parameter of the Table.AddColumn() function, which allows you to set a data type for the column that function adds to a table. Altering the Invoked Custom Function step of the previous query to do this, setting the new column to be a table type like so:

Worth reading in its entirety.

Live Query Doesn’t Show Blocking

Erik Darling points out that the Live Query Plan view does not show wait events:

What the live query plan doesn’t tell you about your query.

  1. That it’s being blocked
  2. What it’s waiting on

No seriously, nothing turns red or beeps or starts checking its watch, and there’s no Clippy asking you to hang on a moment. It just sits there.

And since the query hasn’t finished running yet, we don’t have any wait stats in the plan XML yet.

Assuming that’s fairly easy for Management Studio to get at the operator level—Erik does note that you can get this blocking information from DMVs at the session/query level—it’d be nice to add this information.

Categories

September 2017
MTWTFSS
« Aug Oct »
 123
45678910
11121314151617
18192021222324
252627282930