dplyr Mutate Quirks

Kevin Feasel

2017-09-25

R

John Mount explains a quirk in dplyr’s mutate function:

It is hard for experts to understand how frustrating the above is to a new R user or to a part time R user. It feels like any variation on the original code causes it to fail. None of the rules they have been taught anticipate this, or tell them how to get out of this situation.

This quickly leads to strong feelings of learned helplessness and anxiety.

Our rule for dplyr::mutate() has been for some time:

Each column name used in a single mutate must appear only on the left-hand-side of a single assignment, or otherwise on the right-hand-side of any number of assignments (but never both sides, even if it is different assignments).

If you do data analysis with R, you’ve probably run into this before.  I certainly have, and it’s nice to understand why this is the case.

Distributed Database Writes

James Serra provides a number of options around distributed writes:

In SQL Server, scaling out reads (i.e. using Active secondary replicas via AlwaysOn Availability Groups) is a lot easier than scaling out writes.  So what are your options when you have a tremendous amount of writes that scaling up will not handle, no matter how big your server is?  There are a number of options that allow you to write to many servers (instead of writing to one master server) that I’ll call distributed writes.  Here are some ideas:

Read on for more options and some additional thoughts around Cosmos DB.  My first inclination would be to put Kafka in front of a distributed write system, but that’s my bias.

Instant File Initialization On Linux

Anthony Nocentino explains how instant file initialization works on SQL Server on Linux:

With strace up and running let’s turn on the trace flags to enable output for Instant File Initialization and create database that has a 100MB data file and a 100MB log file. Check out this post from Microsoft for more details on the trace flags. This database create code is straight from their post. I changed the model database’s data and log file sizes to 100MB each. Also, it’s important to note Instance File Initialization is only for data files, log files are zeroed out due to requirements for crash recovery. We’re going to see that in action in a bit…

Read the whole thing.

What Prevents Columnstore Row Group Merge

Niko Neugebauer shows two major scenarios in which the Row Group Merge operator will not work as expected, leaving you with an unexpectedly large number of row groups:

Let’s take a look at the most evil and uncorrectable pressure within SQL Server for the Columnstore Indexes – the Dictionary Pressure. For any final user without a bit of the internal knowledge of the Row Group sizes, Dictionaries & Pressures, it is always a huge surprise to find out that they can’t have those perfectly sized Row Groups with 1.048.576 rows.

Let us set up a table producing a good Dictionary Pressure and load 1.1 Million Rows into it, expecting to get just 2 Row Groups, while perfectly being aware that the result shall be quite different:

Niko’s dictionary pressure example is a good reason not to include textual columns on columnstore indexes.

Automated Edition Downgrade For SQL Server

Jana Sattainathan shows how to automate downgrade of SQL Server Enterprise to Standard:

Standard edition is limited to lesser of 4 sockets or 24 cores with a maximum memory of 128 GB plus a few truly Enterprise level features like Compression, Availability Groups, Partitioning etc are off limits. I would say most places would fall under this threshold for “Standard” but feel inferior to say they run “Standard”! I don’t, especially when money matters.

But, all kidding aside, most shops don’t even realize that they do not use any Enterprise features on 90% of their instances but pay Enterprise price anyway! If you don’t trust me, go check for yourself at your place – we did, on hundred’s of SQL Server instances! I painfully built the infrastructure to do this type of thing using PowerShell in seconds  if not a few minutes, for scanning hundreds of servers/instances.

There’s a lot here, so if you’re thinking about downgrading in a post-2016 SP1 world, Jana’s post is a must-read.  But even with the new features, there are still quite a few enterprise-level features that make it so I don’t want to live without Enterprise Edition.

Bugs With Backup Compression And TDE

Parikshit Savjani provides recommendations on combining backup compression with Transparent Data Encryption:

In past months, we discovered some edge scenarios related to backup compression for TDE databases causing backups or restores to fail, hence our recommendations have been

  • Avoid using striped backups with TDE and backup compression.

  • If your database has virtual log files (VLFs) larger than 4GB then do not use backup compression with TDE for your log backups. If you don’t know what a VLF is, start here.

  • Avoid using WITH INIT for now when working with TDE and backup compression. Instead, use WITH FORMAT.

  • Avoid using backup checksum with TDE and backup compression

Brent Ozar explains the risk:

When you install a new version of SQL Server, you get new features – and sometimes, you’re not told about them. For example, when 2016’s TDE compression came out, nobody told you, “If you back up across multiple files, your backups might suddenly be compressed.” You didn’t know that you had a new thing to test – after all, I don’t know a lot of DBAs who have the time to test that the new version of SQL Server successfully performs restores. They restore their production databases into the new version, test a few things, and declare victory – but testing restores FROM the new version’s backups isn’t usually on that list.

Keep up to date on those patches.

The Predicate For SQL On Linux: SQLPAL

Kevin Feasel

2017-09-25

Linux

Slava Oks explains the importance of the SQL Platform Abstraction Layer:

In addition to business matters, Hal’s article highlights key drawbacks which need to be addressed around engineering as well as  product supportability for the endeavor to be successful.   The article affirms that bringing SQL Server to *nix platform is not the hardest task compare to the additional work that the journey would require.  Indeed, in order for the project to be successful the team would have to:

  • Bring other SQL Server inbox products such as SSAS along
  • Implement platform specific features such as CLR, AGs, and much more
  • Guarantee adequate performance and scalability
  • Create new ecosystem around product support, engineering systems  and more.

So throughout the course of the work, I have continued to question the path we have been on and if we succeed at the end or not.  Every time I would go back, reread the article and every time I would come to the same favorable conclusion.  But why?

Slava links to this article from December from the data platform team explaining what SQLPAL is.  In the end, I think the benefits of this model will be much larger than SQL on Linux (which is itself large).

Categories

September 2017
MTWTFSS
« Aug Oct »
 123
45678910
11121314151617
18192021222324
252627282930