2017-02-03 – Curated SQL

Encryption In ElasticMapReduce

Published 2017-02-03 by Kevin Feasel

Sai Sriparasa shows how to enable encryption in an ElasticMapReduce cluster:

In this post, I go through the process of setting up the encryption of data at multiple levels using security configurations with EMR. Before I dive deep into encryption, here are the different phases where data needs to be encrypted.

Data at rest

Data residing on Amazon S3—S3 client-side encryption with EMR

Data residing on disk—the Amazon EC2 instance store volumes (except boot volumes) and the attached Amazon EBS volumes of cluster instances are encrypted using Linux Unified Key System (LUKS)

Data in transit

Data in transit from EMR to S3, or vice versa—S3 client side encryption with EMR
Data in transit between nodes in a cluster—in-transit encryption via Secure Sockets Layer (SSL) for MapReduce and Simple Authentication and Security Layer (SASL) for Spark shuffle encryption
Data being spilled to disk or cached during a shuffle phase—Spark shuffle encryption or LUKS encryption

Turns out this is rather straightforward.

Comments closed

Data Frame Serialization In R

Published 2017-02-03 by Kevin Feasel

David Smith shows a new contender for serializing data frames in R, fst:

And now there’s a new package to add to the list: the fst package. Like the data.table package (the fast data.frame replacement for R), the primary focus of the fst package is speed. The chart below compares the speed of reading and writing data to/from CSV files (with fwrite/fread), feather, fts, and the native R RDS format. The vertical axis is throughput in megabytes per second — more is better. As you can see, fst outperforms the other options for both reading (orange) and writing (green).

These early numbers look great, so this is a project worth keeping an eye on.

Comments closed

JSON In Clustered Columnstore Indexes

Published 2017-02-03 by Kevin Feasel

Jovan Popovic gives a use case for JSON data being part of a clustered columnstore index:

This is equivalent to collections that you might find in classic NoSQL database because they store each JSON document as a single entity and optionally create indexes on these documents. The only difference is CLUSTERED COLUMNSTORE index on this table that provides the following benefits:

Data compression – CCI uses various techniques to analyze your data and choose optimal compression algorithms to compress data.
Batch mode analytic – queries executed on CCI process rows in the batches from 100 to 900 rows, which might be much faster than row-mode execution.

I think it’s worth reading this in conjunction with Niko Neugebauer’s comments regarding strings in columnstore.

Comments closed

Compatibility Level And Forced Plans

Published 2017-02-03 by Kevin Feasel

Erin Stellato has an experiment with Query Store plan forcing:

A question came up recently about plan guides and compatibility mode, and it got me thinking about forced plans in Query Store and compatibility mode. Imagine you upgraded to SQL Server 2016 and kept the compatibility mode for your database at 110 to use the legacy Cardinality Estimator. At some point, you have a plan that you force for a specific query, and that works great. As time goes on, you do testing with the new CE and eventually are ready to make the switch to compatibility mode 130. When you do that, does the forced plan continue to use compatibility mode 110? I had a guess at the answer but thought it was worth testing.

There are some interesting results here.

Comments closed

Thinking About The Gitlab Outage

Published 2017-02-03 by Kevin Feasel

Brent Ozar shares his thoughts on the recent Gitlab outage:

You can read more about the details in GitLab’s outage timeline doc, which they heroically shared while they worked on the outage. Oh, and they streamed the whole thing live on YouTube with over 5,000 viewers.

There are so many amazing lessons to learn from this outage: transparency, accountability, processes, checklists, you name it. I’m not sure that you, dear reader, can actually put a lot of those lessons to use, though. After all, your company probably isn’t going to let you live stream your outages. (I do pledge to you that I’m gonna do my damnedest to do that ourselves with our own services, though.)

There are some good pointers in here.

Comments closed

Understanding DBCC PAGE

Published 2017-02-03 by Kevin Feasel

David Alcock looks at the undocumented DBCC PAGE command:

This is an error that has been picked up on one of my test systems and indicates that SQL Server has detected a torn page, that is a page that has been incorrectly written by SQL Server and possibly indicates a problem in the IO subsystem.

The problem here is that whilst we know the database and the page where the error has occurred we don’t know the specific table the page belongs and importantly what type of page is in error. The reason why the page type is important is because this will drastically impact our recovery process but the first thing we will do is check a system table to see if any other page errors have been reported:

DBCC PAGE is just about the most-documented undocumented command around. Worth a read.

Comments closed

Container-Network Interactions

Published 2017-02-03 by Kevin Feasel

Andrew Pruski explains how Docker containers interact with the network:

Anyway, during the podcast one of the questions that came up was “How do containers interact with the network resources on the host server?”

To be honest, I wasn’t sure. So rather can try and give a half answer I said to the guys that I didn’t know and I’d have to come back to them.

Click through for the answer.

Comments closed

Excel Data Cleansing

Published 2017-02-03 by Kevin Feasel

Cedric Charlier continues his series on fixing up an Excel file. First up is turning results into an enumeration:

We’ve previously decided that a DON’T KNOW, shouldn’t influence our average of the answers. To apply this decision, we just need to filter the table Result and remove all the values equal to 0 (Enum value of DON’T KNOW). Then we calculate the average and subtract 1 to get a value between 0 and 4. Coooool, except that if we’ve no value non equal to 0, it will return -1 … not what we’re expecting. We’ll need to validate that the average is not null before subtracting 1.

The next post involves converting respondent information into a dimension:

In this table, only the results with a QuestionId equal to 111 really interest me for a merge with the existing table Respondent. If you’re familiar with the UI of Power BI Desktop then you’ll probably think to create a new table referencing this one then filter on QuestionId equals 111and finally merge. It’ll work but applying this strategy could result in many “temporary” tables. A lot of these small tables used only for a few steps before merging with other tables tend to be a nightmare for maintenance. You can use the “Advanced formula editor” to not display this kind of temporary tables and embed them in your main table.

Read on for more.

Comments closed

M	T	W	T	F	S	S
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28

Day: February 3, 2017

Data at rest

Data in transit