Press "Enter" to skip to content

Day: June 6, 2017

BDD In Spark

Aaron Colcord and Zachary Nanfelt explain how to use Cucumber to create behavior-driven development tests on Apache Spark:

Cucumber allows us to write a portion of our software in a simple, language-based approach that enables all team members to easily read the unit tests. Our focus is on detailing the results we want the system to return. Non-technical members of the team can easily create, read, and validate the testing of the system.

Often Apache Spark is one component among many in processing data and this can encourage multiple testing frameworks. Cucumber can help us provide a consistent unit testing strategy when the project may extend past Apache Spark for data processing. Instead of mixing different unit testing strategies between sub-projects, we create one readable agile acceptance framework. This creates a form of ‘Automated Acceptance Testing’.

Best of all, we are able to create ‘living documentation’ produced during development. Rather than a separate documentation process, the unit tests form a readable document that can be shared with external parties. Each time the code is updated, the documentation is updated. It is a true win-win.

It’s an interesting mix.  I’m not the biggest fan of BDD but I’m happy that this information is out there.
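
For a sense of what this looks like in practice, here is a minimal sketch using behave (a Cucumber-style BDD tool for Python) to drive a PySpark job. The feature wording, data, and column names are all invented for illustration; the article itself works with Cucumber on the JVM.

# features/steps/filter_steps.py -- step definitions for a hypothetical
# features/filter.feature scenario:
#
#   Scenario: Filter out inactive accounts
#     Given a dataset of 3 accounts, 1 of them inactive
#     When I filter to active accounts
#     Then I should see 2 accounts
#
from behave import given, when, then
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("bdd-demo").getOrCreate()

@given("a dataset of {total:d} accounts, {inactive:d} of them inactive")
def step_load(context, total, inactive):
    rows = [(i, i >= total - inactive) for i in range(total)]
    context.df = spark.createDataFrame(rows, ["account_id", "is_inactive"])

@when("I filter to active accounts")
def step_filter(context):
    context.result = context.df.filter(~context.df.is_inactive)

@then("I should see {expected:d} accounts")
def step_check(context, expected):
    assert context.result.count() == expected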

S3 And HDFS Data Migration

Ilya Yalovyy looks at S3DistCp, which allows you to efficiently migrate data back and forth between HDFS and S3:

Raw files often land in S3 or HDFS in an uncompressed text format. This format is suboptimal both for the cost of storage and for running analytics on that data. S3DistCp can help you efficiently store data and compress files on the fly with the --outputCodec option:

$ s3-dist-cp --src s3://my-tables/incoming/hourly_table_filtered --dest s3://my-tables/incoming/hourly_table_gz --outputCodec=gz

The current version of S3DistCp supports the codecs gzip, gz, lzo, lzop, and snappy, and the keywords none and keep (the default). These keywords have the following meaning:

  • “none” – Save files uncompressed. If the files are compressed, then S3DistCp decompresses them.

  • “keep” – Don’t change the compression of the files but copy them as-is.

This is an important article if you’ve got a Hadoop cluster running on EC2 nodes.
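
If you are orchestrating EMR from code rather than a shell, the same copy can be submitted as a cluster step. A sketch with boto3, where the cluster ID, region, and bucket paths are placeholders:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[{
        "Name": "compress-hourly-table",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # standard EMR command runner
            "Args": [
                "s3-dist-cp",
                "--src", "s3://my-tables/incoming/hourly_table_filtered",
                "--dest", "s3://my-tables/incoming/hourly_table_gz",
                "--outputCodec=gz",
            ],
        },
    }],
)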

Memory Grants Affecting Columnstore Load

Denzil Ribeiro explains how memory grant pressure can determine whether a columnstore bulk insert skips the deltastore or not:

We found that only at the beginning of the run, there was contention on memory grants (RESOURCE_SEMAPHORE waits), for a short period of time. After that and later into the process, we could see some latch contention on regular data pages, which we didn’t expect as each thread was supposed to insert into its own row group. You would also see this same data by querying sys.dm_exec_requests live, if you caught it within the first minute of execution, as displayed below.

This is useful in case you run into the issue, but also useful as a case study on effective troubleshooting.
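
If you want to catch the same pattern on your own system, here is a rough sketch that polls sys.dm_exec_requests from Python while a load runs; the server, database, and driver details are placeholders.

import time
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=mydb;Trusted_Connection=yes;"  # placeholder connection string
)

sql = """
SELECT session_id, wait_type, wait_time, granted_query_memory
FROM sys.dm_exec_requests
WHERE wait_type = 'RESOURCE_SEMAPHORE';
"""

# Sample every few seconds; memory grant contention shows up as rows here.
for _ in range(10):
    for row in conn.cursor().execute(sql):
        print(row.session_id, row.wait_time, row.granted_query_memory)
    time.sleep(5)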

Hash Tables In PowerShell

Adam Bertram explains what hash tables are and why they’re useful:

Notice that each of the keys is unique. This is required in a hash table. It’s not possible to add two keys with the same name. Below I’ve defined the SomeKey2 key twice.

We’ve just talked about creating a hash table and adding key/value pairs at creation time. It’s also possible to add keys to an already created hash table in a number of different ways, either through dot notation, enclosing the key in brackets, or by using the Add() method. Each approach has the same effect.

Hash tables are quite useful in PowerShell for storing key-value pairs, such as when you’re building a dictionary of configuration settings.
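
The concepts carry over to most languages. For comparison, a quick sketch of the same ideas with a Python dict (the SomeKey names just mirror Adam’s hypothetical examples):

# Keys must be unique in PowerShell; a duplicate key in the literal raises an
# error there. Python is more forgiving: the last value silently wins.
dupes = {"SomeKey2": "a", "SomeKey2": "b"}
print(dupes)  # {'SomeKey2': 'b'}

# Adding keys after creation, via bracket notation or update().
settings = {"SomeKey1": "value1"}
settings["SomeKey2"] = "value2"
settings.update(SomeKey3="value3")
print(settings)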

Watch Those Aggregates

Ronald Dameron warns us about graphs in the Azure portal:

I would expect to see the CPU spike with the same value no matter what time range I selected. But, to see the spike that fired the alert, I had to “Edit” the chart and select different time ranges to see the differences. It wasn’t until I selected a narrow custom time range that the CPU graph would display the CPU spike that corresponded to the alert firing. The alert fires if the CPU percentage exceeds 80% over 15 minutes. So, if you “know” something happened, try different time ranges but especially the custom range to find what you are looking for.

For your highbrow reference of the quarter, F.A. Hayek on John Maynard Keynes’s Treatise on Money:  “Mr. Keynes’s aggregates conceal the most fundamental mechanisms of change.”

This isn’t an Azure-specific problem; it’s something we have to think about whenever we aggregate data.
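
A toy example makes the point: average fifteen one-minute CPU samples that contain a single spike, and the spike disappears.

# Fifteen 1-minute CPU samples with one spike to 95%.
readings = [20] * 14 + [95]

print(f"15-minute average: {sum(readings) / len(readings):.0f}%")  # 25% -- looks fine
print(f"1-minute peak:     {max(readings)}%")                      # 95% -- what the alert saw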

Misleading Perfmon Counters

Kendra Little describes five Performance Monitor counters whose metrics aren’t quite as straightforward as you’d think:

SQLServer:Locks – Average Wait Time (ms); Lock Wait Time (ms); Lock Waits/sec

The idea: Alert if you have high lock waits

The problem: These counters update when the lock wait ends, not while it’s ongoing. Let’s say you have a query that’s blocked for five minutes. You’ll only see this in the perf monitor when it becomes un-blocked.

These counters can cause confusion and make you look for blocking queries at the wrong times.

Kendra provides better solutions for each of the five perfmon counters, so read the whole thing.
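
The general fix is to look at waits in flight rather than after the fact. As one illustration (not necessarily among Kendra’s recommendations), sys.dm_os_waiting_tasks shows blocking while it is still happening; connection details below are placeholders.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=master;Trusted_Connection=yes;"  # placeholder connection string
)

sql = """
SELECT session_id, blocking_session_id, wait_type, wait_duration_ms
FROM sys.dm_os_waiting_tasks
WHERE blocking_session_id IS NOT NULL;
"""

# Unlike the perfmon counters, these rows appear while the wait is ongoing.
for row in conn.cursor().execute(sql):
    print(f"session {row.session_id} blocked by {row.blocking_session_id} "
          f"({row.wait_type}, {row.wait_duration_ms} ms)")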

Build Your Own Perfmon Counter

Daniel Janik shows us two ways to create our own Perfmon counters, the good way and the not-as-good way:

Whether you’re just now finding out about these or you’ve been reminded that they exist, I’m sure you’re asking: “How can I use them?”

Many years ago, I used the procedures in SQL 2000 to monitor blocked processes. I was onsite at a customer in San Jose, Costa Rica, and I figured pictures would help bridge the gap since I didn’t speak Spanish.

The customer was experiencing outages that they could see with blocking. While investigating, I found that blocking was directly related to disk latency. More Latency = More Blocking.

I was unaware that this even existed, but it could be very useful in some circumstances.
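
For context, SQL Server ships ten user-settable counters, driven by sp_user_counter1 through sp_user_counter10 and surfaced under the SQLServer:User Settable Perfmon object; they are likely the SQL 2000-era procedures Daniel mentions. A sketch that feeds the current blocked-session count into one of them from Python (connection details are placeholders):

import time
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=master;Trusted_Connection=yes;",  # placeholder connection string
    autocommit=True,
)

while True:
    cur = conn.cursor()
    cur.execute(
        "SELECT COUNT(*) FROM sys.dm_exec_requests WHERE blocking_session_id <> 0"
    )
    blocked = cur.fetchone()[0]
    # Shows up in Perfmon as SQLServer:User Settable -> Query -> User counter 1.
    cur.execute("EXEC sp_user_counter1 ?", blocked)
    time.sleep(5)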

Endpoints In SQL Server

Jason Brimhall explains what SQL Server endpoints are:

You may have heard the term endpoints thrown around in technical discussion and wondered “what the heck is an endpoint?” Well, that is a good question. An endpoint in the simplest form is a connection or point of entry into SQL Server.

Another way of thinking about an endpoint is to look at the ends of a line. Generally speaking, there is a stop point at each end of the line. At this stopping point, the line may connect to something else and therefore be a point of entry into that “something else”.

Read on for more information, as well as a script to show who owns each endpoint.
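
Not the script from the post, but as a sketch of where that information lives: endpoint ownership is in sys.endpoints, which joins to sys.server_principals (connection details are placeholders).

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=master;Trusted_Connection=yes;"  # placeholder connection string
)

sql = """
SELECT e.name, e.protocol_desc, e.type_desc, e.state_desc,
       p.name AS owner_name
FROM sys.endpoints AS e
LEFT JOIN sys.server_principals AS p
    ON e.principal_id = p.principal_id;
"""

for row in conn.cursor().execute(sql):
    print(row.name, row.protocol_desc, row.type_desc, row.state_desc, row.owner_name)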

OLEDB Waits In CHECKTABLE

Lonny Niederstadt has dropped a boatload of information on us, with a four-part series looking at the OLEDB wait type when running DBCC CHECKTABLE.  Part 1 sets up the problem:

So… yeah.  Definitely a relationship between the number of pages in the heap and the number of OLEDB waits.  Something(s) increase the number of OLEDB waits beyond the number of data pages, the number of used pages and the number of reserved pages in a way that I haven’t deciphered yet, though.

Part 2 looks at what happens when logical checks run and not just physical checks:

On this very simple heap, with nearly the same number of rows as pages in the heap, the number of OLEDB wait events is nearly the same – only increased by one – when “upgrading” the checktable operation from physical_only to a checktable with logical checks.

Part 3 switches to using a clustered index, and things start getting wacky:

So how do the results match with expectations?  Whoa!!  25745 OLEDB wait events as the starting point for MAXDOP 1 and 2.  Then single-stepping up to 25752 OLEDB wait events for MAXDOP 8.  25752 is exactly 4 times the used page count of 6438 for the clustered index.  So the OLEDB wait events went from a small number more than the count of pages/rows with the heap… to almost 4 times as many with the clustered index.  *That* was a bit unexpected.

Part 4 looks at a heap with a non-clustered index:

All right.  What if we use a heap with a nonclustered include index?  Based on the table and index definition below, I expect the nonclustered include index nci_OLEDB_TEST_HEAP_NCI_INCLUDE to be remarkably similar to the clustered index we previously used.

Interesting reading so far in this series.
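
If you want to poke at this yourself, here is a sketch that runs the check and then reads that session’s OLEDB wait count from sys.dm_exec_session_wait_stats (which requires SQL Server 2016 or later; the table name and connection string are placeholders):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=mydb;Trusted_Connection=yes;",  # placeholder connection string
    autocommit=True,
)
cur = conn.cursor()

# NO_INFOMSGS keeps DBCC from returning informational result sets.
cur.execute("DBCC CHECKTABLE ('dbo.my_table') WITH PHYSICAL_ONLY, NO_INFOMSGS;")

# Wait stats for this session, accumulated since the connection opened.
cur.execute("""
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_exec_session_wait_stats
WHERE session_id = @@SPID AND wait_type = 'OLEDB';
""")
for row in cur.fetchall():
    print(row.wait_type, row.waiting_tasks_count, row.wait_time_ms)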
