
Author: Kevin Feasel

Cluster Rebalancing

Peter Coates discusses cluster rebalancing in Hadoop:

After adding new racks to our 70 node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?

It didn’t take long to discover the cause: the configuration parameter dfs.datanode.balance.bandwidthPerSec controls how much bandwidth each node is allowed to use for rebalancing, and it defaults to a conservative value of 10Mb/sec/node, which is 1.25MB/sec. If you have 70 nodes (the number we started with before adding new ones), that’s 87.5MB/second. One terabyte, i.e., a million MB, divided by 87.5MB/sec, equals 11,428 sec, or 3.17 hours per TB. The more nodes in the original cluster, the faster the rebalance will go.

On the development side, “it’ll automatically rebalance without us having to worry” is a great thing.  On the administrative side, we’re paid to worry about these things…
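To make the arithmetic concrete, here is a quick sketch of that back-of-the-envelope calculation (just a restatement of the quoted numbers, nothing more):

    # Estimated time to rebalance 1 TB given the per-node balancer
    # bandwidth cap and the number of nodes doing the work.
    def rebalance_hours_per_tb(nodes, mb_per_sec_per_node=1.25):
        aggregate_mb_per_sec = nodes * mb_per_sec_per_node
        seconds = 1_000_000 / aggregate_mb_per_sec  # 1 TB ~ one million MB
        return seconds / 3600

    print(rebalance_hours_per_tb(70))   # ~3.17 hours, matching the quote
    print(rebalance_hours_per_tb(700))  # ten times the nodes, one-tenth the time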


Data Cleansing Outside Of Excel

Lee Baker shows some free alternatives to Excel for data cleansing:

Another issue is that some Excel functions operate on selected data, whereas others act on the whole worksheet. If you select a column of data and use Find to identify certain characters, it will identify only those characters in your chosen column. If you now use Replace, it will change all such characters in the entire worksheet – which is probably not what you wanted to do, and you may have unwittingly introduced new errors into your data.

The safest way to clean your data in Excel is to copy an individual column to a separate worksheet, perform all your cleaning operations in isolation until you’re happy with the result, then copy your cleaned data to your original sheet (or better still, to a new sheet that stores only clean data). The repeated use of Copy and Paste across multiple worksheets to clean your data can become extremely messy.

Lee recommends three free tools, and they look like they’re worth trying out.
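If you script the cleanup instead, the “clean one column in isolation” workflow becomes reproducible.  A minimal pandas sketch (the file and column names are hypothetical):

    import pandas as pd

    # Clean a single column in isolation, then write the result back;
    # unlike a worksheet-wide Replace, nothing else can be touched.
    # "survey.csv" and the "phone" column are hypothetical examples.
    df = pd.read_csv("survey.csv")

    cleaned = (df["phone"]
               .str.strip()
               .str.replace(r"[^0-9]", "", regex=True))  # keep digits only

    df["phone"] = cleaned
    df.to_csv("survey_clean.csv", index=False)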


StackLite Dataset

David Robinson reports on a new Stack Exchange data set available to the public:

For each Stack Overflow question asked since the beginning of the site, the dataset includes:

  • Question ID
  • Creation date
  • Closed date, if applicable
  • Deletion date, if applicable
  • Score
  • Owner user ID (except for deleted questions)
  • Number of answers
  • Tags

This is ideal for performing analyses such as:

  • The increase or decrease in questions in each tag over time
  • Correlations among tags on questions
  • Which tags tend to get higher or lower scores
  • Which tags tend to be asked on weekends vs weekdays
  • Rates of question closure or deletion over time
  • The speed at which questions are closed or deleted

This is pretty exciting.  Getting good, high-quality data sets for demonstration and pedagogical purposes is time-consuming, so the fact that the Stack Exchange people are tossing one our way could be a major time-saver.
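As a taste of the first analysis in that list, here is a hedged pandas sketch; the post itself works in R, and the file and column names are assumptions based on the field list above:

    import pandas as pd

    # Questions per tag per year: the "questions in each tag over time"
    # analysis. File and column names are assumptions from the field list.
    questions = pd.read_csv("questions.csv.gz")
    tags = pd.read_csv("question_tags.csv.gz")

    joined = questions.merge(tags, on="Id")
    joined["Year"] = pd.to_datetime(joined["CreationDate"]).dt.year

    per_year = (joined.groupby(["Tag", "Year"])
                      .size()
                      .rename("Questions")
                      .reset_index())
    print(per_year[per_year["Tag"] == "r"].sort_values("Year"))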


Simplified Order Of Operations

Michael J. Swart looks at how SQL Server implements order of operations:

I have a book on my shelf called Practical C Programming published by O’Reilly (the cow book) by Steve Oualline. I still love it today because although I don’t code in C any longer, the book remains a great example of good technical writing.

That book has some relevance to SQL today. Instead of memorizing the full list of operators and their precedence, Steve gives a practical subset:

    1. * (Multiply), / (Division)
    2. + (Add), - (Subtract)

Put parentheses around everything else.

Parentheses, even when unnecessary, are usually a good idea.  They help the reader understand what was going through your mind at the time.
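The rule of thumb carries over to any expression language.  A tiny Python illustration:

    # Memorize only * / + - precedence; parenthesize everything else.
    implicit = 2 + 3 * 4           # multiplication binds first: 14
    explicit = 2 + (3 * 4)         # same value, intent unmistakable

    # Outside the arithmetic subset, spell the grouping out:
    in_range = (implicit >= 10) and (implicit < 20)
    print(implicit, explicit, in_range)  # 14 14 True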


New DBATools Cmdlet

Rob Sewell describes his experience creating a new cmdlet:

The journey to Remove-SQLDatabaseSafely started with William Durkin (b | t), who presented to the SQL South West User Group (you can get his slides here).

Following that session, I wrote a PowerShell script to gather information about the last used date for databases, which I blogged about here, and then a T-SQL script to take a final backup and create a SQL Agent job to restore from that backup, which I also blogged about here. The team have used this solution (updated to load the DBA Database and a report instead of using Excel) ever since, and it proved invaluable when a read-only database was dropped and could quickly and easily be restored with no fuss.

This post describes what the cmdlet does as well as the circumstances behind its creation.  It’s a good read.


HBase Incremental Backup And Restore

Carter Shanklin and Vladimir Rodionov discuss incremental backup and restore coming to HBase & Phoenix:

If your tables are large, it may not be possible to restore them under a different name due to space constraints. The really powerful thing about HBase backups is that they are stored in WAL files that can be parsed using a simple interface that can be consumed either in Java or using the “hbase wal” utility.

Consider this scenario: A customer rep deleted some data because he thought it was unimportant. A week later the customer is upset because the data was important and you need to restore these few pieces of information. With HBase backups all you need to do is parse through the backups with a WAL reader and extract the historical values, which you can then add back in. With other databases you would have to bring another database instance online and load the backups into it. Having backups in open, well-understood formats unlocks many powerful opportunities and can bring recovery times down from days to minutes.

Read on if you manage a Hadoop cluster with HBase (or you’re likely to administer one soon).
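As a rough sketch of that recovery scenario, you could drive the “hbase wal” utility from a script and filter for the affected row.  The flags and paths below are assumptions, so verify them against the utility’s help on your distribution:

    import subprocess

    # Scan a backed-up WAL file for edits to one row. The -j (JSON output),
    # -p (print cell values), and -w (row filter) flags plus the file path
    # are assumptions; verify against `hbase wal` usage on your cluster.
    result = subprocess.run(
        ["hbase", "wal", "-j", "-p", "-w", "customer#1042",
         "/backups/full/WALs/wal.1473289200000"],
        capture_output=True, text=True, check=True)

    # Each record describes a WAL entry; once the deleted values are
    # identified, they can be re-inserted with ordinary Puts.
    print(result.stdout)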


Satellite Image Combination

David Yanofsky discusses how he pieced together satellite images for a report on mainland Chinese trash washing up on Hong Kong shores:

The USGS has a website called EarthExplorer that lets you search through decades of satellite data. I limited my search to Landsat 8 data with less than 10% cloud cover.

(You can do this search on the command line with landsat-util too, but I prefer the web interface. In the future I will probably use this online tool from Development Seed.)
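For reference, the command-line route looks roughly like the sketch below.  landsat-util’s exact flags vary by version, so treat these as assumptions and check landsat search --help:

    import subprocess

    # Mirror the EarthExplorer filter (Landsat 8, under 10% cloud cover)
    # from the command line. Flags, date formats, and coordinates are
    # assumptions based on landsat-util's documented search options.
    subprocess.run(
        ["landsat", "search",
         "--cloud", "10",
         "--start", "01/01/2016",
         "--end", "08/31/2016",
         "--lat", "22.3", "--lon", "114.2"],  # roughly Hong Kong
        check=True)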

Hat tip to Nathan Yau.


SparkR + Zeppelin

I take a look at using SparkR and Zeppelin:

My goal is to do some of the things that I did in my Touching on Advanced Topics post.  Originally, I wanted to replicate that analysis in its entirety using Zeppelin, but this proved to be pretty difficult, for reasons that I mention below.  As a result, I was only able to do some—but not all—of the anticipated work.  I think a more seasoned R / SparkR practitioner could do what I wanted, but that’s not me, at least not today.

With that in mind, let’s start messing around.

SparkR is a bit of a mindset change from traditional R.
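To give a flavor of that mindset change, here is a minimal sketch in PySpark rather than SparkR (a deliberate swap; the lazy, distributed DataFrame model is the same, and the file path and column names are hypothetical):

    from pyspark.sql import SparkSession, functions as F

    # Spark DataFrames are not local data frames: transformations only
    # build a plan, and nothing runs until an action fires.
    spark = SparkSession.builder.appName("sketch").getOrCreate()

    # Hypothetical path and columns, for illustration only.
    df = spark.read.csv("/data/flights.csv", header=True, inferSchema=True)

    by_carrier = (df.groupBy("carrier")
                    .agg(F.avg("dep_delay").alias("avg_delay")))  # still lazy

    by_carrier.show()  # the action triggers distributed execution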


Biml Transformers

Bill Fellows uses a Biml transformer to change a variable’s value:

Our transformer is simple. We’ve specified “LocalMerge” as we only want to fix the one thing. That one thing is an SSIS Variable named “ServerName”.

What is going to happen is that we will redeclare the variable, this time specifying a value of “SQLQA” for our Value. Additionally, I’m going to add a Description to my Variable and preserve the original value. The TargetNode object has a lot of power to it, as we’ll see over this series of posts on Biml Transformers.

I’ve not used transformers before; this was interesting.


Listing Enterprise-Only Features

Patrick LeBlanc has an embedded report which shows Enterprise Edition-only features:

Someone recently asked me if there was a list of all the SQL Server “Enterprise Only” features available on the web.  I pointed them to the Features Supported by the Editions of SQL Server web page and thought I was done.  He stated that this site was good, but did not provide a simple list of Enterprise-only features.  I thought for a second, and my thoughts went straight to Power BI.  Why?  Simple: there are tables on the web page, and Power BI can easily extract that data into a data model.  I am not going to go into all those details in this blog post.  Maybe one day, but for now take a look at this interactive Power BI report and let me know what you think.

I think this layout is a bit easier to read and follow than the features website, although I’d love to be able to click on an item and get more information on the feature.
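The extraction step itself generalizes beyond Power BI.  As a rough analog to its web connector, pandas can pull every HTML table from the features page (the URL is the page linked in the post and may have moved since; read_html also needs lxml installed):

    import pandas as pd

    # Pull every <table> from the features-by-edition page, one DataFrame
    # per table: the same scrape Power BI's web connector performs.
    url = "https://msdn.microsoft.com/en-us/library/cc645993.aspx"
    tables = pd.read_html(url)

    print(len(tables))       # number of feature matrices on the page
    print(tables[0].head())  # peek at the first one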
