Hortonworks Released HDP 2.6.5

Mitra Mohsenian and Roni Fontaine announce Hortonworks Data Platform 2.6.5:

We are excited to make several product announcements including the general availability of:

  • HDP 2.6.5
    • Apache Kafka 1.0
    • Apache Spark 2.3
  • Apache Ambari 2.6.2
  • SmartSense 1.4.5

HDP 2.6.5 is an important release for Hortonworks given it is the first release that enables Apache Kafka 1.0 and Apache Spark 2.3.

It looks like Ubuntu 18.04 isn’t supported just yet, but I imagine that’s coming.

Using Burrow To Monitor Kafka

Gaurav Garg shows us how to install and configure Burrow, a tool for monitoring Apache Kafka clusters:

According to Burrow’s GitHub page: Burrow is a Kafka monitoring tool that keeps track of consumer lag. It does not provide any user interface to monitor. It provides several HTTP request endpoints to get information about Kafka clusters and consumer groups. Burrow also has a notifier system that can notify you (via email or at an HTTP endpoint) if a consumer group has met certain criteria.

Burrow is designed in a modular way that separates the work done into multiple subsystems. Below are the subsystems in Burrow.

  • Clusters: This component periodically updates the topic list and the last committed offset for each partition.
  • Consumers: This component fetches the information about consumer groups like consumer lag, etc.
  • Storage: This component stores all the information in a system.
  • Evaluator: Gets information from storage and checks the status of consumer groups, like if it’s consuming messages at a slow rate using consumer lag evaluation rules.
  • Notifier: Requests the status of a consumer group and sends a notification if certain criteria are met via email, etc.
  • HTTP server: Provides HTTP endpoints to fetch information about a cluster and consumers.

This looks like a good tool to hook into an existing monitoring solution.
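
As a rough idea of what hooking Burrow into another monitoring system might look like, here is a minimal Python sketch that builds a consumer lag URL and pulls a status and total lag out of the JSON response. The `/v3/kafka/<cluster>/consumer/<group>/lag` path and the `status` and `totallag` field names are assumptions based on Burrow's v3 HTTP API, and the host, port, cluster, and group names are hypothetical; check them against your own Burrow version before relying on this.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def lag_url(base_url, cluster, group):
    """Build the consumer-group lag URL (path layout is an assumption, see above)."""
    base = base_url.rstrip("/")
    return f"{base}/v3/kafka/{quote(cluster, safe='')}/consumer/{quote(group, safe='')}/lag"


def summarize(payload):
    """Return (status, total lag) from a decoded response body.

    The "status" and "totallag" keys are assumed field names; missing
    keys fall back to "UNKNOWN" and 0 instead of raising.
    """
    status = payload.get("status", {})
    return status.get("status", "UNKNOWN"), int(status.get("totallag", 0))


def fetch_lag(base_url, cluster, group, timeout=5.0):
    """Call the Burrow HTTP server and summarize the response."""
    with urlopen(lag_url(base_url, cluster, group), timeout=timeout) as resp:
        return summarize(json.load(resp))
```

A monitoring job could call something like `fetch_lag("http://localhost:8000", "local", "my-group")` on a schedule and forward the result to whatever alerting system is already in place.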

The Value Of Statistics In SQL Server

Monica Rathbun walks us through the benefits of having statistics on tables in SQL Server:

Statistics are made up of three parts. Each part tells the optimizer important information regarding the makeup of the table’s data distribution.

Header – Last Time Stats were updated and number of sample rows

Density Vector – Uniqueness of the columns or set of columns

Histogram – Data’s distribution and frequency of distinct values

Let’s look at a Header, Density and Histogram example.

You can read what the statistics are broken down into using DBCC SHOW_STATISTICS. All field definitions are taken from MSDN.

This is from the AdventureWorks2016CTP3 sample database, if you want to follow along. Using the Sales.SalesOrderDetail table, let’s look at the stats and see what they show us.
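
To make the density and histogram ideas a little more concrete, here is a toy Python sketch. It is not how SQL Server builds statistics and it does not reproduce the output of DBCC SHOW_STATISTICS; it only assumes that density can be read as 1 divided by the number of distinct values, and it uses a plain per-value frequency count as a simplified stand-in for a stepped histogram. The sample values are made up.

```python
from collections import Counter


def density(values):
    """Density as 1 / (number of distinct values); lower means more unique."""
    distinct = len(set(values))
    return 1.0 / distinct if distinct else 0.0


def frequency_histogram(values):
    """Sorted (value, row count) pairs, a simplification of a real histogram."""
    return sorted(Counter(values).items())


# Made-up column values, standing in for something like a product ID column.
product_ids = [707, 707, 708, 711, 711, 711]

print(density(product_ids))              # three distinct values, so 1/3
print(frequency_histogram(product_ids))  # row counts per distinct value
```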

Read the whole thing.

Positive And Negative Value Validation In PowerShell 6

Thomas Rayner points out a cool addition to parameter validation as of PowerShell 6:

If you’ve written at least a couple of advanced PowerShell functions, you’re probably no stranger to parameter validation. These are the attributes you attach to parameters to make sure that they match a certain regular expression using [ValidatePattern()], or that a certain script evaluates to true when they are plugged into it using [ValidateScript({})]. You’ve probably also used [ValidateRange()] to make sure a number falls between a min and a max value that you specified.

In PowerShell 6, though, there’s something new and cool you can do with ValidateRange. You can specify in a convenient new syntax that the value must be positive or negative.

Read on to see a few examples.

Continuous Integration And Building SSIS Projects

Koos van Strien gives us three methods for building SSIS projects:

First things first

First, set your expectations: you won’t create a one-size-fits-all build task that will build all your project types. Instead, you will split up your builds by project type – essentially just as described in Continuous Integration for BI in VSTS: Splitting Build Steps by Project Type.

Building SSIS projects

With folder and solution structure in place, we’ll explore three ways to build SSIS projects:

  • SSISBuild / SSISDeploy

  • Just-for-build SSIS projects

  • “Build” inside PowerShell

It’s a good post, so check it out if you’re looking at automating SSIS project deployments.

Spreading Out Multi-Server Agent Runs

Tracy Boggiano shows how to distribute SQL Agent job runtimes for multi-server jobs using MSX/TSX:

First, you need to decide how many time blocks or hours you want the jobs to run in. So let’s start with scenario one where you pick, for example, four time blocks. First, you declare a variable with the time block in it, and we feed in @@SERVERNAME to determine the time block in which that server will run. Then we wrap our code around our time block; in our example we run index maintenance over a 12-hour period, spread out in three-hour blocks. Mind you, my index process (which I probably should blog about as well) processes one index at a time and has something that BREAKs out of the procedure when it exceeds its time block. So below we start the index maintenance job on a server between the hours of 6 PM and 5 AM, based on the time block value we got back.
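
The general idea of turning a server name into a time block can be sketched outside of T-SQL as well. The Python below is not the author's code: it assumes a CRC32 hash of the server name, four blocks of three hours each, and a 6 PM start, all of which are illustrative choices, and the server name is hypothetical.

```python
import zlib


def time_block(server_name, blocks=4):
    """Map a server name to a stable block number in the range [0, blocks)."""
    # Upper-casing makes the result insensitive to how the name is typed.
    return zlib.crc32(server_name.upper().encode("utf-8")) % blocks


def start_hour(server_name, first_hour=18, block_hours=3, blocks=4):
    """Hour of day (0-23) at which this server's block begins."""
    return (first_hour + time_block(server_name, blocks) * block_hours) % 24


# Hypothetical server name; the same name always lands in the same block.
print(time_block("SQLPROD01"), start_hour("SQLPROD01"))
```

Because the block comes from a hash of the name rather than a lookup table, new servers pick up a block automatically, at the cost of a spread that is only roughly even.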

Click through for a sample.

There’s Only One Way To Order

Matthew McGiffen notes that there is only one way to order, and that is to use the ORDER BY clause:

Everyone, at the beginning of their SQL career, gets told that it is important to include an ORDER BY if they want the results ordered. Otherwise the order in which they are returned is not guaranteed.

But then you run queries a lot of times that don’t need a specific order – and you see that they (at least seem to) come out in the same order every time. You could (almost) be forgiven for thinking you can rely on that.

There was even a question on a Microsoft SQL certification exam a few years ago that asked what the default order was for records returned by a simple SELECT – the answer it was looking for was that it would be according to the order of the clustered index. So you can rely on that – right?

Wrong. The question was a bad question, and the answer was incorrect. Let’s look at this in action.

Order is never guaranteed to be stable unless you specify a unique ordering using ORDER BY.
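
Here is a small sketch of what a unique ordering looks like in practice. It uses Python's built-in sqlite3 module as a stand-in rather than SQL Server, and the table and data are made up. Sorting by customer alone leaves the two "Ann" rows tied, so the query adds the unique id column as a tie-breaker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.executemany(
    "INSERT INTO orders (id, customer) VALUES (?, ?)",
    [(3, "Ann"), (1, "Bob"), (2, "Ann")],
)

# customer is not unique, so ORDER BY customer alone would leave the
# relative order of the two "Ann" rows up to the engine. Adding the
# unique id column makes the requested ordering fully determined.
rows = conn.execute(
    "SELECT customer, id FROM orders ORDER BY customer, id"
).fetchall()

print(rows)  # [('Ann', 2), ('Ann', 3), ('Bob', 1)]
```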

Hidden Extended Events: The Debug Events

Jess Pomfret goes looking for Extended Events relating to the transaction log:

I was troubleshooting an issue last week which led to me firing up extended events to look at records being written to the transaction log. I typed ‘Transaction’ into the search bar, hoping to find something that would do the trick, and didn’t quite find what I was looking for.

After a few more failed attempts I headed to the internet and found a post by Paul Randal describing exactly what I needed for this situation, using the [sqlserver].[transaction_log] event. Hold on, that’s exactly what I searched for.  I ran the T-SQL within his blog post, the event was successfully created and gave me the information I was looking for.

But, as Jess points out, you can still get to it from the GUI.  Read on to learn how.

A SQL Client For Apache Flink

Alex Woodie points out that Apache Flink now has a SQL client built in:

Apache Flink has contained SQL functionality since Flink version 1.1, which introduced a SQL API based on Apache Calcite and a table API, too. While the combined SQL and Table API today provides valuable ways for developers to apply well-understood relational data and SQL constructs to the world of stream data processing, its usefulness is somewhat limited.

For starters, only Scala and Java experts can avail themselves of the API, according to the description of the new SQL client, which is codenamed FLIP-24. What’s more, any table program that was written with the SQL and Table API had to be packaged with Apache Maven, a Java-based project management tool, and submitted to the Flink cluster before running.

With the launch of the SQL CLI Client in Flink version 1.5, the Flink community is taking its support for SQL in a new direction. According to the FLIP-24 project page, providing an interactive shell will not only make Flink accessible to non-programmers, including data scientists, but it will also eliminate the need for a full IDE to program Flink apps. With millions of SQL-loving data analysts out there, the benefits could certainly be vast.

Good stuff.  Feasel’s Law in action.

Grouping By Nothing In SQL

Lukas Eder points out a subtlety of the GROUP BY clause:

SELECT count(*)
FROM film
GROUP BY ()

This will yield:

count |
1000 |

What’s the point, you’re asking? Can’t we just omit the GROUP BY clause? Of course, this will yield the same result:

SELECT count(*)
FROM film

Yet, the two versions of the query are subtly different.

Great post and also shows a case when GROUP BY () isn’t supported.


May 2018