Online HDFS Disk Balancer

Kevin Feasel

2016-10-19

Hadoop

Lei Xu demonstrates the intra-DataNode disk balancer in HDFS:

By default, the DataNode uses the round-robin-based policy to write new blocks. However, in a long-running cluster, it is still possible for the DataNode to have created significantly imbalanced volumes due to events like massive file deletion in HDFS or the addition of new DataNode disks via the disk hot-swap feature. Even if you use the available-space-based volume-choosing policy instead, volume imbalance can still lead to less efficient disk I/O: For example, every new write will go to the newly-added empty disk while the other disks are idle during the period, creating a bottleneck on the new disk.

Recently, the Apache Hadoop community developed server offline scripts (as discussed inHDFS-1312, the dev@ mailing list, and GitHub) to alleviate the data imbalance issue. However, due to being outside the HDFS codebase, these scripts require that the DataNode be offline before moving data between disks. As a result, HDFS-1312 also introduces an online disk balancer that is designed to re-balance the volumes on a running DataNode based on various metrics. Similar to the HDFS Balancer, the HDFS disk balancer runs as a thread in the DataNode to move the block files across volumes with the same storage types.

This is a good read and sounds like a very useful feature.

Sparklyr On EMR

Kevin Feasel

2016-10-19

R, Spark

Tom Zeng shows how to use sparklyr on Amazon ElasticMapReduce:

The recently released sparklyr package by RStudio has made processing big data in R a lot easier. sparklyr is an R interface to Spark that allows users to use Spark as the backend for dplyr, one of the most popular data manipulation packages. sparklyr provides interfaces to Spark packages and also allows users to query data in Spark using SQL and develop extensions for the full Spark API.

You can also install sparklyr locally and point to a Spark cluster.

Resetting Kafka Topics

Kevin Feasel

2016-10-19

Hadoop

I show two methods to clear out a Kafka topic:

The first method works fine for non-production scenarios where you can stop all of the producers and consumers, but let’s say that you want to flush the topic while leaving your producers and consumers up (but maybe you have a downtime window where you know the producers aren’t pushing anything).  In this case, we can change the retention period to something very short, let the queue flush, and bring it back to normal, all using the kafka-configs shell script.

Points deducted for slipping and writing “queue” there, but otherwise, I prefer the second method, as things are still online.  In less-extreme scenarios, you might drop the retention period to a few minutes, especially if your consumers are all caught up.

R Services Resource Utilization

Ginger Grant shows off some R Services reports to see how hard the developers are battering your poor servers with their R scripts:

R Services – Extended Events is also not a report but a list of all the extended events that are available for R Services. This is a handy bit of information, which can be a great reference tool for extended events monitoring. R Services – Packages lists the packages which are currently installed on SQL Server. When people write R, many lot of different packages are used within the script. Prior to running a package, check the information on this report to ensure the libraries used are installed on SQL Server. If the library is missing the code will not work. R Services – Resource Usage is a great way to see how R has been configured to run on the server. Notice I have created an External Pool for R. This is a configuration recommended by Microsoft to better monitor your R Services.

Click through for more information, and grab the reports from Microsoft’s Github repo.

Stretch Database Authentication Failures

Kevin Feasel

2016-10-19

Bugs, Cloud

Jack Li walks through a bug in Stretch database:

The message provided enough directions.  It says either you have a bad login or firewall setting on the Azure DB Server side is not configured correctly.     The very first thing is to ensure the Firewall was configured correctly.   We even tried 0.0.0.0. to 255.255.255.255. But it didn’t resolve the issue.

Next we created a brand new database on the same server and tried on that one.  It worked.  But customer just couldn’t get the old database to work even she made sure that she could use the login/password to log in using SSM on the same server to the Azure DB server.

On the same server, brand new database worked but the old database didn’t.   So that made me wonder what happens if I manually cause an failure and later retry.

Read on for the repo and solution.

Parallel PoshRSJob Template

Cody Konior walks through using PoshRSJob with a custom function:

Recently I migrated from my own runspace module to Boe Prox’s PoshRSJob which is pretty much perfect. But today I wanted to share how to integrate PoshRSJob cleanly into your functions through a default -Parallel parameter and using a template.

You can very easily modify this for your own purposes however it’s even more awesome as-is if you run parallelised tests for one major input (like a computer name) but where additional information might also be passed in through object properties on a pipeline (I’ll explain why you’d want to do that later in the post). Here’s what it looks like:

Read on for code and explanation.  Powershell parallelism is something that I’ve never been good at, so hopefully this makes it easier for me…

Linear Models

Kevin Feasel

2016-10-19

R

Andrea Spano, et al, are starting a new book:

This chapter is an introduction to the first section of the book, Linear Models, and contain some theoretical explanation and lots of examples. At the end of the chapter you will find two summary tables with Linear model formulae and functions in R and Common R functions for inference.

The book is just getting started, but you can get it from the Quantide website.  In the meantime, there are two other books on learning R and developing in R.  These books are licensed Creative Commons, so they’re free to read and share.

Debugging Biml

Kevin Feasel

2016-10-19

Biml

Bill Fellows shows how to write out your intermediate Biml for debugging purposes:

Using tooling is always a trade-off between time/frustration and monetary cost. BIDS Helper/BimlExpress are free so you’re prioritizing cost over all others. And that’s ok, there’s no judgement here. I know what it’s like to be in places where you can’t buy the tools you really need. One of the hard parts about debugging the expanded Biml from BimlScript is you can’t see the intermediate or flat Biml. You’ve got your Metadata, Biml and BimlScript and a lot of imagination to think through how the code is being generated and where it might be going wrong. That’s tough. Even at this point where I’ve been working with it for four years, I can still spend hours trying to track down just where the heck things went wrong. SPOILER ALERT It’s the metadata, it’s always the metadata (except when it’s not). I end up with NULLs where I don’t expect it or some goofball put the wrong values in a field. But how can you get to a place where you can see the result? That’s what this post is about.

It’s a trivial bit of code but it’s important. You need to add a single Biml file to your project and whenever you want to see the expanded Biml, prior to it being translated into SSIS packages, right click on the file and you’ll get all that Biml dumped to a file. This recipe calls for N steps.

This is a good tip and has helped me a few times in the past.

Compression On Temporal Tables

Daniel Janik notes that system-generated temporal tables automatically use page-level compression:

At first I was a bit puzzled. I noticed that the system generated table was consistently smaller than my user created table. It was not only smaller it was twice as small!

I did some further testing on my Surface this weekend and here’s what I found:

— Side note:  I use Person.Address a lot in demos, so I decided to create a new table to test with in hopes of not breaking any other demos I do regularly.

I think this is a good decision for a default, but if you are unable to support page-level compression for some reason, there’s a workaround:  create your history table beforehand.

Who Is Active Update

Adam Machanic has an update to sp_whoisactive:

Four and a half years have flown by since I released sp_whoisactive version 11.11.

It’s been a pretty solid and stable release, but a few bug reports and requests have trickled in. I’ve been thinking about sp_whoisactive v.Next — a version that will take advantage of some newer SQL Server DMVs and maybe programmability features, but in the meantime I decided to clear out the backlog on the current version.

Given that I have three keyboard shortcuts dedicated to sp_whoisactive, you know I’m excited.  Adam also has a new domain for the product.

Categories

October 2016
MTWTFSS
« Sep Nov »
 12
3456789
10111213141516
17181920212223
24252627282930
31