Active Directory And ElasticMapReduce

Bruno Faria shows how to use AWS’s CloudFormation to extend Active Directory into an AWS ElasticMapReduce cluster and run jobs via Kerberos:

In this example, you build a solution that allows Active Directory users to seamlessly access Amazon EMR clusters and run big data jobs. Here’s what you need before setting up this solution:

  • An AWS account
  • An Amazon EC2 key pair
  • A possible limit increase for your account (Note: Usually a limit increase will not be necessary. See the AWS Service Limits documentation if you encounter a limit error while building the solution.)

To make it easier for you to get started, I created AWS CloudFormation templates that automatically configure and deploy the solution for you. The following steps and resources are involved in setting up the solution:

  1. Create and configure an Amazon Virtual Private Cloud (Amazon VPC).
  2. Launch an Amazon EC2 Windows instance (Active Directory domain controller).
  3. Create an Amazon EMR security configuration for Kerberos and cross-realm trust.
  4. Launch an Amazon EMR cluster with Kerberos enabled and a cross-realm trust configuration.

You can use the AWS CloudFormation templates to complete each step individually, or you can deploy the entire solution through a single step.

Read the whole thing.

Digging Into The Data Professional Survey

Melissa Connors looks at the 2018 Data Professionals Salary Survey:

This report is filtered to the United States, Private sector, full-time employees, Job Titles with more than 50 results, all primary databases, a salary between $15,000 and $200,000, and a survey year of 2018.

On the top are employees who said they work remotely 0 days per week, the middle is office employees who telecommute 1-4 days per week, and the bottom is the true remote employee who does this 5+ days per week.

The overall median salaries were $97,316 for office employees, $111,500 for part time telecommuters, and $114,163 for full time remote employees, which led to the click-bait title of this post. 🙂 It’s possible that this is because only more senior or highly-valued employees feel comfortable working from home, or are even allowed to, depending on the company culture.

Click through to see all of Melissa’s findings.

Online Database Object Changes

Michael J Swart continues his online deployment series:

Procedures are very easy to Blue-Green. Brand new procedures are added during the pre-migration phase. Obsolete procedures are dropped during the post-migration phase.

If the procedure is changing but is logically the same, then it can be altered during the pre-migration phase. This is common when the only change to a procedure is a performance improvement.

But if the procedure is changing in other ways. For instance, when a new parameter is added, or dropped, or the resultset is changing. Then use the Blue-Green method to replace it: During the pre-migration phase, create a new version of the procedure. It must be named differently and the green version of the application has to be updated to call the new procedure. The original blue version of the procedure is deleted during the post-migration phase. It’s not always elegant calling a procedure something like s_USERS_Create_v2 but it works.

This has been a great series so far, and the way he does deployments matches very closely to the way we do them.

Spark And NVMe

Alicja Luszczak, et al, introduce NVMe caching in the Databricks distribution of Spark:

A particularly important and widespread use case is caching the results of scan operations. This allows the users to eliminate the low throughput associated with reading remote data. For this reason, many users who intend to run the same or similar workload repeatedly decide to invest extra development time into manually optimizing their application, by instructing Spark exactly what files to cache and when to do it, and thus “explicit caching.”

For all its utility, Spark cache also has a number of shortcomings. First, when the data is cached in the main memory, it takes up space that could be better used for other purposes during query execution, for example, for shuffles or hash tables. Second, when the data is cached on the disk, it has to be deserialized when read — a process that is too slow to adequately utilize the high read bandwidths commonly offered by the NVMe SSDs. As a result, occasionally Spark applications actually find their performance regressing when turning on Spark caching.

Third, having to plan ahead and explicitly declare which data should be cached is challenging for the users who want to interactively explore the data or build reports. While Spark cache gives data engineers all the knobs to tune, data scientist often find it difficult to reason about the cache, especially in a multi-tenant setting, where engineers still require the results to be returned as quickly as possible in order to keep the iteration time short.

Read on for more details, as well as performance comparisons.

Filtering Alert Noise With A Leaky Bucket Algorithm

I have a post implementing a leaky bucket algorithm in T-SQL:

Now that we have a table, we want to do something with it.  The most naive solution would be to fire off an alert every time a row gets added to this table.  The problem with this solution is that it can flood our inbox with useless information.  Suppose the developers push out an API change that breaks everything.  The first 500 response will be important information.  The second one might be important because it’s confirmatory.  The six hundredth alert will not help.  And heaven help you if you’ve got this alert tied to your PagerDuty account…

So let’s only alert if we get to a certain threshold—we’ll call it 5 messages.  Fewer than 5 and we can assume it’s a blip or just somebody doing something weird.  The next part of the problem is, 5 messages per what?  We don’t want to alert for every fifth error message.  Let’s say we get one error message a year, so there was one in 2014, one in 2015, one in 2016, one in 2017, and now one in 2018.  If we simply set our threshold to report on every fifth error, we’ll get an alert in 2018, but most likely, the prior years’ errors are long gone by now, so that’s not helpful.

Read on for the solution.  I’ve been quite happy with the solution in practice, as it has cut down the number of spurious alert e-mails to practically nil.

Naming Indexes

Monica Rathbun hits one of my hobby horses:

As you can see from above, none of the names gave a complete indication of what the index encompassed. The first one did indicate it was a Non Clustered Index so that was good, but includes the date which to me is not needed. At least I knew it was not a Clustered Index. The second index did note it is a “Covering Index” which gave some indication that many columns could be included I also know it was created with the Data Tuning Advisor due to the dta prefix. The third index was also created with dta but it was left with the default dta naming convention. Like the first one I know the date it was created but instead of the word Cover2, I know there are 16 key columns are noted by the “K#” and regular numbers tell me those are included columns. However, I still have no idea exactly what these numbers denote without looking deeper into the index. The last index is closer to what I like however the name only tells me one column name when in fact it encompasses five key columns and two included columns.

I absolutely love seeing lots and lots of “_dta_” indexes; it’s a sign that I have a long day ahead of me.

Comparing Databases On SQL Server Instances

Andrew Pruski has written a quick comparison script to check if two instances have the same set of databases or the same configuration settings:

For instance, last week I was working with two instances that contained databases part of an Always On availability group and needed to ensure that all databases on the primary were on the secondary.

Now I know there’s a few ways to do this but I needed a quick and easy method as there were over 300 dbs involved.

The method I used to do this implemented the powershell cmdlet Compare-Object

What this does is pretty much what it says on the tin. The cmdlet takes two objects and compares them based on a input property (in this case it’ll be database name).

This can turn into a much more complicated comparison, but Andrew shows that the basic concept is quite straightforward.

More SSMS Tricks

Wayne Sheffield continues his SSMS tips & tricks series.  He first covers SQLCMD:

Notice that all of the SQLCMD commands start with a colon (:)

The complete reference for how all of these commands work is at I will cover just a few of them in this blog post.

SETVAR allows you to set a variable for use throughout the script. Its syntax is SETVAR . For instance, “:SETVAR SourceServer .\SQL2005” will set the variable SourceServer to “.\SQL2005”, which is the name of the SQL 2005 instance on my laptop.

Then he talks about multi-server queries, including local server groups and using a Central Management Server.  Having a CMS is critical when you have more than a couple of instances, and frankly, even then it can be helpful when bringing new people up to speed.

Wayne also covers the basics of regular expressions in SSMS.

All three of these are powerful tips.

Collecting Statistics Usage Info

Grant Fritchey shows us how (safely) to collect data on statistics usage:

Years ago I was of the opinion that it wasn’t really possible to see the statistics used in the generation of a query plan. If you read the comments here, I was corrected of that notion. However, I’ve never been a fan of using undocumented trace flags. Yeah, super heroes like Fabiano Amorim and Paul White use them, but for regular individuals like me, it seems like a recipe for disaster. Further, if you read about these trace flags, they cause problems on your system. Even Fabiano was getting the occasional crash.

So, what’s a safe way to get that information? First up, Extended Events. If you use the auto_stats event, you can see the statistics getting created and getting loaded and used. Even if they’re not created, you can see them getting loaded. It’s an easy way to quickly see which statistics were used to generate a plan. One note, you’ll have to compile or recompile a given query to see this in action.

Read on for more.

Altering Procedures And Object Definitions

Solomon Rutzky shows what metadata doesn’t get updated when you call sp_rename on a T-SQL procedure, function, view, or trigger:

This behavior is noted in the Microsoft documentation for sp_rename:

Renaming a stored procedure, function, view, or trigger will not change the name of the corresponding object name in the definition column of the sys.sql_modules catalog view. Therefore, we recommend that sp_rename not be used to rename these object types. Instead, drop and re-create the object with its new name.

Ok, so we have all been warned, at least when it comes to using sp_rename. But that is not the end of the story. There is, indeed, another way to change the object such that the definition does not reflect its current state. And that other way has to do with something missing from the examples shown thus far, something that wouldn’t be missing had I been following best-practices.

Click through to see what else doesn’t get updated.


January 2018
« Dec