HDP 3.0 Updates To Hive And Druid

Nishant Bangarwa has some updates to Apache Druid in HDP 3.0:

There are numerous improvements that went into HDP 3.0 and the performance improvements shown are an aggregate result of all of them. Here are some of the more noteworthy improvements related to Druid-Hive integration :

    1. Druid Expressions Support – HIVE-18893CALCITE-2170   added support for Druid expressions in Hive. In HDP 3.0, Hive can push the computation of SQL expressions as part of a Druid query and they can be evaluated by Druid.

    2. Use of Scan Query instead of Select Query – In HDP 3.0 we use Druid Scan query instead of Select Query. Scan Query is a streaming version of Select Query which returns the results in a compact streaming format. Scan query also does not need all the results to be retained in memory before they can be returned to Hive. This improves the memory usage of the historical nodes too.

    3. GroupBy Query Improvements – Many optimizations are done in order to address the performance of GroupBy queries on Druid side. Main ones are –

      1. #4660 Parallel sort for ConcurrentGrouper
      2. #4576 Array-based aggregation for GroupBy query
      3. #4668 Add IntGrouper to avoid unnecessary boxing/unboxing in array-based aggregation
      4. #4704 Parallel merging of intermediate groupby results on the broker nodes.
    4. Better column pruning – In some cases when hive cannot push any operator to druid, hive ended up pulling all the columns from druid. This led to lots of unnecessary data transfer between druid and hive. HIVE-15619 improved the column pruner logic to only fetch the columns from druid which are required to answer a query.

    5. Druid Version upgrade from 0.10.0 to 0.12.2 – HDP 3.0 comes with latest version of Druid i.e 0.12.2 which has many new features, performance enhancements and bug-fixes over the previous version.

Druid is still a specialty technology which doesn’t fit every use case, but if it does fit your use, you’ll get a lot of performance benefit out of it.

Analyzing Day-Over-Day Changes With Power Query

Dany Hoter shows how to use Power Query to join one row to the next in a data set (given similar criteria) and do day-by-day comparisons:

I want to analyze the daily prices of certain commodities and be able to show the patterns of daily changes side by side. I want to calculate predictably the differences between each row and the row before. Each row represents data for a day, so the difference between rows is the daily change or in some cases, several days change.

I downloaded from Quandl 50 years of daily prices of gold and silver, and my goal is to calculate the daily changes in terms of dollars and percentage from day to day. Not all days are represented, so in case of a gap I calculate the number of days in the gap, and I divide the growth % by the number of days. I already imported and appended the data for both metals into a single table in Excel and we’ll start the process from this table.

Read on for the solution.  I’d just as soon LAG() the data in SQL Server, but if that’s not an option, this certainly works.

Distance Between Strings: Levenshtein Distance

Nikhil Babar has an introduction to the Levenshtein distance algorithm:

The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions, or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who discovered this equation in 1965.

Levenshtein distance may also be referred to as edit distance, although it may also denote a larger family of distance metrics. It is closely related to pairwise string alignments.

Read on for an explanation and example.  Levenshtein is a great way of calculating string similarities, possibly helping you with tasks like data cleansing by finding typos or alternate spellings, or matching down parts of street addresses.

Unused Indexes Might Not Be

Tara Kizer has a warning for people eager to drop “unused” indexes:

About 10 years ago, I decided to drop an unused index on a table that had 2 billion rows. The database was around 7TB in size. We were having storage and performance issues. I thought I could help the system out if I were to drop the index.

4 days after I dropped the index, I got a call from our NOC that CPU utilization was pegged at 100% for an hour so they were reaching out to the on-call DBA to check it out. I logged in and saw a query using A LOT of CPU. Its WHERE clause matched the index I had dropped. The execution plan showed it was scanning the table.

It turned out that I only had 2 weeks of uptime, which didn’t include the 1st of the month. The query that was causing the high CPU utilization was a report that ran on the 1st of each month.

Tara has also provided us with a script to track these details over time, so check that out.

Azure SQL Managed Instance Inventory Analysis

Jovan Popovic shows us how to use the Azure CLI plus JSON support in SQL Server to manage a list of Azure SQL Managed Instances:

Sometime you would need to know how many Managed Instance you have created in Azure cloud. Although you can find all information about the Azure SQL Managed Instances in Azure portal or API (ARM, PowerShell, Azure CLI), sometime it is hard to list all instances and search them using some criteria. In this post you will see how easily you can load list of your Managed Instances and build inventory of your resources.


Imagine that you have a large number of Managed Instances and you need to know how many instances you have, in what regions, subnets, and virtual networks they are placed, how much compute and storage is allocated to each of them, etc. Analyzing inventory of Managed Instances might be hard if you just use PowerShell.

Click through for the solution.

Testing Azure SQL Database Failover

Arun Sirpal shows us how easy it is to fail over Azure SQL Database to another region:

So what happens now if I connect to the read/write endpoint? (I test this via SSMS)

The dreaded IP address / create a new firewall rule message. Why? Well this setup utilized a “server” level firewall rule and the server in the US did NOT have the IP address mapped in, you can see from the below screen shot that there are no firewall rules configured.

Fixing this is easy, you could just add the IP address on the secondary server as another server level rule but you should seriously consider using a database level firewall rule, the setup will get replicated to the secondary server making failover experience smoother.

Read on to see how to set this.

Checking AD Group Membership

Matthew McGiffen shows us a way to test if an account is a member of an Active Directory group:

As part of my job I manage a bunch of SQL instances for Development and Test.

Access is managed though Active Directory groups, so I rarely have to do anything regards managing permissions. Nonetheless I often get requests from people to give them access. This is usually for a new starter or someone who has moved from one team to another.

Of course, the answer is usually that they just need adding to the right AD group. Rather than assume though, I always get them to check before I pass the request on to the AD team. You never know, there could be something else wrong.

You can also use xp_logininfo if you want to go the other way and get all of the members of a particular group.


October 2018
« Sep