Hadoop And Active Directory

RK Kuppala explains how to integrate a Hadoop cluster with Active Directory:

This post explains kerberizing an existing Hadoop cluster using Ambari. Kerberos helps with the authentication part of enterprise security (with authorization, auditing and data protection being the remaining parts).

HDP uses Kerberos, which is an industry standard for authenticating users and resources and providing strong identity for users. Apache Ambari can kerberize an existing cluster by using an existing MIT key distribution center (KDC) or Microsoft’s Active Directory.

This was a lot easier than I expected.

Understanding Naive Bayes

Ahmet Taspinar explains the Naive Bayes classification algorithm and writes Python code to implement it:

Within Machine Learning many tasks are – or can be reformulated as – classification tasks.

In classification tasks we are trying to produce a model which can give the correlation between the input data $X$ and the class $C$ each input belongs to. This model is formed with the feature-values of the input-data. For example, the dataset contains datapoints belonging to the classes Apples, Pears and Oranges, and based on the features of the datapoints (weight, color, size, etc.) we are trying to predict the class.

Ahmet has his entire post saved as a Jupyter notebook.
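
To make the fruit example concrete, here is a minimal sketch of a Gaussian Naive Bayes classifier using only the standard library. It is not Ahmet's code: the measurements are made up, the function names (`fit`, `predict`, `log_gaussian`) are hypothetical, and it assumes two numeric features (weight and size) that are each normally distributed within a class.

```python
import math
from collections import defaultdict


def fit(samples, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for x, c in zip(samples, labels):
        grouped[c].append(x)

    model = {}
    for c, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[c] = (n / len(samples), means, variances)
    return model


def log_gaussian(v, mean, var):
    """Log of the normal density, used to avoid underflow from tiny products."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)


def predict(model, x):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {
        c: math.log(prior)
        + sum(log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
        for c, (prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)


# Made-up (weight in grams, diameter in cm) measurements, for illustration only.
X = [
    (150, 7.0), (160, 7.4), (170, 7.2),  # Apples
    (180, 6.0), (190, 6.3), (200, 6.1),  # Pears
    (130, 8.0), (140, 8.3), (135, 8.1),  # Oranges
]
y = ["Apples"] * 3 + ["Pears"] * 3 + ["Oranges"] * 3

model = fit(X, y)
print(predict(model, (155, 7.1)))
```

The "naive" part is the sum inside `predict`: each feature contributes its log likelihood independently of the others, given the class.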

Taxi Rides And Amazon Athena

Mark Litwintschik looks at using Amazon Athena to process the New York City taxi rides data set:

It’s important to note that Athena is not a general purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena’s purpose is to ask questions rather than insert records quickly or update random records with low latency.

That being said, Presto’s performance, given it can work on some of the world’s largest datasets, is impressive. Presto is used daily by analysts at Facebook on their multi-petabyte data warehouse so the fact that such a powerful tool is available via a simple web interface with no servers to manage is pretty amazing to say the least.

Athena is Amazon’s response to Azure Data Lake Analytics.  Check out Mark’s blog post for a good way of getting started with Athena.

Locking Azure Resources

Kevin Feasel



Arun Sirpal explains how to lock resources in Azure:

There are 2 types of lock resources in Azure.

  • Delete – Authorised users can read and modify a resource but they cannot delete it.
  • ReadOnly – Authorised users can read a resource but they cannot edit or delete it.

For this blog post I create a delete lock on one of my SQL Databases.

My overly simplistic advice:  lock any production resource which you wouldn’t want accidentally deleted.  It won’t prevent a malicious user from doing something catastrophic, but it can prevent the “Oops, I meant to click the thing above this” class of mistake.
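
As a rough way to keep the two lock levels straight, the sketch below models them as a lookup of permitted operations. This is a toy model of the semantics described above, not the Azure API: the `Lock` enum, the `ALLOWED` table, and `is_allowed` are all hypothetical names made up for illustration.

```python
from enum import Enum


class Lock(Enum):
    NONE = "None"
    DELETE = "Delete"        # read and modify allowed, delete blocked
    READ_ONLY = "ReadOnly"   # read allowed, modify and delete blocked


# Operations an authorised user can still perform under each lock level.
ALLOWED = {
    Lock.NONE: {"read", "modify", "delete"},
    Lock.DELETE: {"read", "modify"},
    Lock.READ_ONLY: {"read"},
}


def is_allowed(lock, operation):
    """Return True if the operation is permitted under the given lock."""
    return operation in ALLOWED[lock]


for lock in Lock:
    print(lock.value, sorted(ALLOWED[lock]))
```

Under this model a delete lock only removes the delete operation, which matches the "oops" scenario: day-to-day changes still go through, but an accidental delete does not.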

R Links

Ginger Grant has some links on learning R in the context of Power BI:

The Comprehensive R Archive Network [CRAN] is where one can download open source R and its packages; it also contains lots of information about R.

Microsoft R Open, which is a fully CRAN-compatible version created using the Intel MKL for improved performance, can be downloaded here.

One item on that list I would push a little is R Tools for Visual Studio.  My default R IDE is still RStudio, but RTVS has made some nice improvements, and it’s worth checking out.

Understanding HTAP

James Serra explains what Hybrid Transactional and Analytical Processing means:

HTAP is used to describe the capability of a single database that can perform both online transaction processing (OLTP) and online analytical processing (OLAP) for the purpose of real-time operational intelligence processing.  The term was created by Gartner in 2014.

In the SQL Server world you can think of it as: In-memory analytics (columnstore) + in-memory OLTP = real-time operational analytics.  Microsoft supports this in SQL Server 2016 (see SQL Server 2016 real-time operational analytics).

I’m not completely sold on HTAP yet, particularly once you get to high-scale OLTP systems doing hundreds of thousands of transactions per second.  That said, there’s always more and more pressure to get data available for analytics faster and faster.

Backing Up Analysis Services

Jens Vestergaard shows how to take backups of Analysis Services cubes:

I have not met a setup where applying compression was not an option, yet. Obviously this has a penalty cost on CPU while executing the backup, and will affect the rest of the tasks running on the server (even if you have your data and backup dir on different drives). But in my experience, the impact is negligible.

This may not be the case with the encryption option, as this has a much larger footprint on the server. You should be using this with some caution in production. Test on smaller subsets of the data if in doubt.

Another thing to keep in mind, as always when dealing with encryption: do remember the password. There is no way of retrieving the data other than with the proper password.

My goal is to be able to rebuild any cube from the relational database, but even with that goal in mind, it is smart to have backups.

When Was That Index Modified?

Kendra Little looks at index creation and modification dates:

SQL Server doesn’t really track index create or modification date by default

I say “really”, because SQL Server’s default trace captures things like index create and alter commands. However, the default trace rolls over pretty quickly on most active servers, and it’s rare that you’re looking up the creation date for an index you created five minutes ago.

I think it’s fine that SQL Server doesn’t permanently store the creation date and modification date for most indexes, because not everyone wants this information — so why not make the default as lightweight as possible?

That said, Kendra has several methods for answering the question of when a particular index was created.


December 2016