Choosing The Right HDInsight Cluster

Josh Fennessy discusses things to consider before you create an HDInsight cluster:

Now for the big question, Windows or Linux?


That’s absolutely correct.

Help With Extended Events

Jason Brimhall has two recent blog posts on figuring out Extended Events information.  First is a republication of an older article:

First let’s tackle the problem of discovery.  When we want to use extended events to try and troubleshoot a problem or to capture more information, it is really good to know if such an event exists.  There are many events that capture data for various different things within SQL Server.  More and more events are being added with each release.  More and more data is being made available to the DBA to help perform a better job and to help the DBA better understand what is really happening within the database environment.

In order to determine if there might be an event, that can provide the data for that one “thing” that may be happening within your environment, we could start by querying the SQL Server Internals.  This next query will do just that for us.

After you read that article and check out the queries there, Jason has another post on finding the right event:

In my previous article I demonstrated how to find an event based solely on the name or description of the event. This is fantastic if the event name (or description) contains one of the magical words you have used. What if the event name or description has nothing to do with the terms you selected? Or, what if the data you seek may be attached to the event but wouldn’t necessarily stand out as a description for that event (by name or description details for that event)?

Now comes the more difficult task right? If the name or description of the event doesn’t relate to the search terms then you just might overlook a few events and be stuck trying to troubleshoot a problem. An equally big problem this could cause is yet another invisible barrier to using Extended Events. It would be easy to slide down the slippery slope and not transition to Extended Events just because an event, applicable to the problem at hand, could not be found.

Check out both of these posts.

Clustering With Spark

Konur Unyelioglu shows how to implement k-means and Gaussian clustering techniques in Apache Spark using MLlib:

Clustering is the task of assigning entities into groups based on similarities among those entities. The goal is to construct clusters in such a way that entities in one cluster are more closely related, i.e. similar to each other than entities in other clusters. As opposed to classification problems where the goal is to learn based on examples, clustering involves learning based on observation. For this reason, it is a form of unsupervised learning task.

There are many different clustering algorithms and a central notion in all of those is the definition of ’similarity’ between the entities that are being grouped. Different clustering algorithms may have different ways of measuring the similarity. In many clustering algorithms, another common notion is the so-called cluster center, which is a basis to represent the cluster. For example, in K-means clustering algorithm, the cluster center is the arithmetic mean position of all the points in that cluster.

This is a fairly lengthy article but if you want to get into machine learning with Spark, it’s a good one.
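To make the "cluster center" idea from the quote concrete, here is a minimal sketch in plain Python. It is not Spark or MLlib code; the function names `centroid` and `nearest_center` and the sample points are made up for illustration. It computes a center as the arithmetic mean position of the points in a cluster, then assigns a point to the nearest center by squared Euclidean distance, which is the assignment step k-means repeats.

```python
from statistics import fmean


def centroid(points):
    """Arithmetic mean position of a non-empty list of equal-length points."""
    # zip(*points) regroups the coordinates axis by axis.
    return tuple(fmean(axis) for axis in zip(*points))


def nearest_center(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


# Illustrative points only; not taken from the article.
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 0.0)]
center = centroid(cluster)
print(center)  # (3.0, 2.0)
print(nearest_center((0.0, 0.0), [center, (10.0, 10.0)]))  # 0
```

A full k-means run alternates these two steps (assign every point to its nearest center, then recompute each center) until the assignments stop changing.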

Rename A Primary Key Constraint

Steve Jones shows how to rename a primary key constraint:

When you compare that with the same table in another database, what’s the likelihood that you’ll have the PK named PK__OrderDet__D3B9D30C7D677BB4? Probably pretty low.

This means that if you are looking to deploy changes, and perhaps compare the deployment from one database to the next, you’ll think you have different indexes. Most comparison tools will then want to change the index on your target server, which might be using this technique. Or the choice might be something that performs much worse.

What we want to do is get this named the same on all databases. In this case, the easiest thing to do is rename the constraint on all systems. This is easy to do with sp_rename, which is better than dropping and rebuilding the index.

Do read this and avoid renaming a constraint the bad way.

Statistic Column Sort Order

Shaun J. Stuart points out an inconsistency in display order for columns on a statistic:

What’s going on? Why are the columns in the statistic not in the same order as the columns in the index? Well, it turns out, they are. If we look on the Details page, we see the density vector is, in fact, created as Col2, Col1, Col3, which is the order of the columns in the index:

Read the whole thing to avoid confusion next time you look at the statistics GUI.

Cluster Rebalancing

Kevin Feasel



Peter Coates discusses cluster rebalancing in Hadoop:

After adding new racks to our 70 node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?

It didn’t take long to discover the cause: the configuration parameter dfs.datanode.balance.bandwidthPerSecond controls how much bandwidth each node is allowed to use for rebalancing, and it defaults to a conservative value of 10Mb/sec/node, which is 1.25MB/sec. If you have 70 nodes (the number we started with before adding new ones), that’s 87.5MB/second. One terabyte, i.e., a million MB, divided by 87.5MB/sec, equals 11,428 sec, or 3.17 hours per TB. The more nodes in the original cluster, the faster it will write.

On the development side, “it’ll automatically rebalance without us having to worry” is a great thing.  On the administrative side, we’re paid to worry about these things…
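The arithmetic in Peter's excerpt is easy to reproduce. The sketch below is only a back-of-the-envelope calculator under the same assumptions he states (a per-node limit of 10 megabits per second, a terabyte treated as a million megabytes, and every original node sending at its full limit); the function name `rebalance_hours` is made up, and the code does not read any Hadoop configuration.

```python
def rebalance_hours(nodes, mbit_per_sec_per_node=10.0, data_mb=1_000_000.0):
    """Estimated hours to move `data_mb` megabytes during rebalancing."""
    mbytes_per_sec_per_node = mbit_per_sec_per_node / 8.0  # 10 Mb/s -> 1.25 MB/s
    aggregate_mb_per_sec = nodes * mbytes_per_sec_per_node  # 70 nodes -> 87.5 MB/s
    seconds = data_mb / aggregate_mb_per_sec
    return seconds / 3600.0


print(f"{rebalance_hours(70):.2f} hours per TB")  # 3.17 hours per TB
```

Raising the per-node limit shrinks the estimate proportionally, which is why this setting is worth reviewing before adding racks.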

Data Cleansing Outside Of Excel

Lee Baker shows some free alternatives to Excel for data cleansing:

Another issue is that some Excel functions operate on selected data, whereas others act on the whole worksheet. If you select a column of data and use Find to identify certain characters, it will identify only those characters in your chosen column. If you now use Replace it will change all such characters in the entire worksheet – which is probably not what you wanted to do, and you may have unwittingly introduced new errors into your data without being aware of it.

The safest way to clean your data in Excel is to copy an individual column to a separate worksheet, perform all your cleaning operations in isolation until you’re happy with the result, then copy your cleaned data to your original sheet (or better still, to a new sheet that stores only clean data). The repeated use of Copy, Paste and using multiple worksheets to clean your data can become extremely messy.

Lee recommends three free tools, and they look like they’re worth trying out.

StackLite Dataset

David Robinson reports on a new Stack Exchange data set available to the public:

For each Stack Overflow question asked since the beginning of the site, the dataset includes:

  • Question ID
  • Creation date
  • Closed date, if applicable
  • Deletion date, if applicable
  • Score
  • Owner user ID (except for deleted questions)
  • Number of answers
  • Tags

This is ideal for performing analyses such as:

  • The increase or decrease in questions in each tag over time

  • Correlations among tags on questions

  • Which tags tend to get higher or lower scores

  • Which tags tend to be asked on weekends vs weekdays

  • Rates of question closure or deletion over time

  • The speed at which questions are closed or deleted

This is pretty exciting.  Getting good, high-quality data sets for demonstration and pedagogical purposes is time-consuming, so the fact that the Stack Exchange people are tossing one out our way could be a major time-saver.
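As a taste of the "weekends vs weekdays" analysis from the list, here is a small sketch using only the Python standard library. The rows are invented toy data, not real StackLite records, and the field names (`Id`, `CreationDate`, `Tag`) and timestamp format are assumptions about how the question and tag files are laid out; adjust them to match the actual download.

```python
from collections import Counter
from datetime import datetime

# Toy rows for illustration only (assumed layout, made-up values).
questions = [
    {"Id": 1, "CreationDate": "2016-07-02T10:00:00Z"},  # Saturday
    {"Id": 2, "CreationDate": "2016-07-04T09:30:00Z"},  # Monday
    {"Id": 3, "CreationDate": "2016-07-03T18:15:00Z"},  # Sunday
    {"Id": 4, "CreationDate": "2016-07-05T12:00:00Z"},  # Tuesday
]
question_tags = [(1, "r"), (1, "ggplot2"), (2, "sql-server"), (3, "r"), (4, "sql-server")]


def weekend_share(questions, question_tags):
    """Fraction of each tag's questions that were created on a weekend."""
    is_weekend = {}
    for q in questions:
        created = datetime.strptime(q["CreationDate"], "%Y-%m-%dT%H:%M:%SZ")
        is_weekend[q["Id"]] = created.weekday() >= 5  # 5 = Saturday, 6 = Sunday

    total, weekend = Counter(), Counter()
    for question_id, tag in question_tags:
        total[tag] += 1
        if is_weekend[question_id]:
            weekend[tag] += 1
    return {tag: weekend[tag] / total[tag] for tag in total}


print(weekend_share(questions, question_tags))
```

With the real data set, the same join-then-count shape applies; a data frame library would simply do it faster.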

Simplified Order Of Operations

Michael J. Swart looks at how SQL Server implements order of operations:

I have a book on my shelf called Practical C Programming published by O’Reilly (the cow book) by Steve Oualline. I still love it today because although I don’t code in C any longer, the book remains a great example of good technical writing.

That book has some relevance to SQL today. Instead of memorizing the full list of operators and their precedence, Steve gives a practical subset:

    1. * (Multiply), / (Division)
    2. + (Add), – (Subtract)

Put parentheses around everything else.

Parentheses, even when unnecessary, are usually a good idea.  They help the reader understand what was going through your mind at the time.
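A short illustration of why the extra parentheses pay off, written in Python rather than T-SQL; the variable names are made up. Python evaluates `and` before `or`, the same relative ordering T-SQL uses for AND and OR, so an unparenthesized mix of the two can quietly mean something other than what the author intended.

```python
a, b, c = 2, 3, 4

# Multiplication binds tighter than addition, so these two agree.
implicit = a + b * c      # 14
explicit = a + (b * c)    # 14
grouped = (a + b) * c     # 20

is_admin, is_owner, is_active = True, False, False

# `and` is evaluated before `or`: this reads as is_admin or (is_owner and is_active).
unparenthesized = is_admin or is_owner and is_active   # True
# Parentheses state the intended grouping and change the result.
intended = (is_admin or is_owner) and is_active        # False

print(implicit, explicit, grouped)   # 14 14 20
print(unparenthesized, intended)     # True False
```

The first group shows parentheses that are redundant but harmless; the second shows parentheses that change the answer.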

New DBATools Cmdlet

Rob Sewell describes his experience creating a new cmdlet:

The journey to Remove-SQLDatabaseSafely started with William Durkin, who presented to the SQL South West User Group. (You can get his slides here.)

Following that session, I wrote a PowerShell script to gather information about the last used date for databases, which I blogged about here, and then a T-SQL script to take a final backup and create a SQL Agent Job to restore from that backup, which I blogged about here. The team have used this solution (updated to load the DBA Database and a report instead of using Excel) ever since, and it proved invaluable when a read-only database was dropped and could quickly and easily be restored with no fuss.

This is a combination of describing what the cmdlet does as well as the circumstances behind its creation.  It’s a good read.

