Hortonworks Data Platform 3.0 Released

Kevin Feasel



Saumitra Buragohain, et al, announce the newest version of the Hortonworks Data Platform:

Highlighted Apache Hive features include:

  • Workload management for LLAP:  You can assign resource pools within LLAP pool and allocate resources on a per user or per group basis. This enables support for large multi-tenant deployments.

  • ACID v2 and ACID on by default:  We are releasing ACID v2. With the performance improvements in both storage format and execution engine we are seeing equal or better performance when comparing to non-ACID tables. Thus we are turning ACID on by default and enable full support for data updates.

  • Hive Warehouse Connector for Spark:  Hive Warehouse Connector allows you to connect Spark application with Hive data warehouses. The connector automatically handles ACID tables. This enables data science workloads to work well with data in Hive.

  • Materialized view navigation:  Materialized view allows you to pre-aggregate and pre-compute tables used in queries. Typically works best on sub-queries or intermediate tables. The cost based optimizer will automatically plan a query if those intermediate results are available, drastically speed up your queries.

  • Information schema:  Hive now exposes the metadata of the database (tables, columns etc.) via Hive SQL interface directly.

  • JDBC storage connector:  You can now map any JDBC databases into Hive’s catalog. This means you can join data across Hive and other databases using Hive query engine

This looks pretty good.  So of course I learn about it two days after I rebuild my demo Hadoop cluster.

Executing ML Services Scripts From Jupyter Notebooks

Kevin Feasel



Kyle Weller has an inception moment with Python and SQL Server Machine Learning Services:

While this example is trivial with the Iris dataset, imagine the additional scale, performance, and security capabilities that you now unlocked. You can use any of the latest open source R/Python packages to build Deep Learning and AI applications on large amounts of data in SQL Server. We also offer leading edge, high-performance algorithms in Microsoft’s RevoScaleR and RevoScalePy APIs. Using these with the latest innovations in the open source world allows you to bring unparalleled selection, performance, and scale to your applications.

Normally I see examples come straight from SQL Server or maybe C#, but it’s a bit fun to see one originate in Python on order to execute Python in SQL Server.

Using ggpairs To Find Correlations Between Variables In R

Kevin Feasel



Akshay Mahale shows how to use the ggpairs function in R to see the correlation between different pairs of variables:

From the above matrix for iris we can deduce the following insights:

  • Correlation between Sepal.Length and Petal.Length is strong and dense.
  • Sepal.Length and Sepal.Width seems to show very little correlation as datapoints are spreaded through out the plot area.
  • Petal.Length and Petal.Width also shows strong correlation.

Note: The insights are made from the interpretation of scatterplots(with no absolute value of the coefficient of correlation calculated). Some more examination will be required to be done once significant variables are obtained for linear regression modeling. (with help of residual plots, the coefficient of determination i.e Multiplied R square we can reach closer to our results)

Click through to read the whole thing.

Changing SQL Agent Job Owners

Jason Brimhall shows us a quick method for changing SQL Agent job owners:

It is not unusual to find a server where some random user created a bunch of jobs to be run by SQL Agent. Sometimes, the user creating the job(s) sets themself as the owner of the job. There are certain cases where this behavior is hard to avoid like when creating a maintenance plan.

And of course, there are times when the user just doesn’t know any better. There is of course, the rare occasion when setting the job owner to be ones self makes the most sense -but that is few and far between in the grand scheme. Usually, you will want a non-expiring account such as a service account or a principal without “logon” permissions to be the owner.

The primary reason being simple – humans have an expiration date for every job they will ever have. When that expiration occurs, you may end up with any number of unwanted side effects. Unwanted side effects is exactly what we try to avoid in our jobs run via SQL Agent.

Click through for the script.  It’s also a good rule of thumb not to have your name attached to too many things; that way, your successor has no way of knowing you’re to blame for whatever awful hack is in place…

Updating Tables With Faked Data

Kevin Feasel



Phil Factor continues his data obfuscation series:

We are taking a slow-but-steady approach. We rewrite our code from the previous blog post that assembles the string; it now uses a view to get its random numbers, and we’ll speed it up slightly by putting a bit more intelligence into the markov table. We then put it in a slow User-defined Scalar function. We  want a scalar function that isn’t schema verified and is not considered to be deterministic. The reason for this is that it has to be executed every row despite having the same parameter.

There are many ways to store the information permanently in a Markov table but we’ll be using Table-valued parameters for our function. I’ll show how they are generated from the original information in AdventureWorks, but they could be so easily fetched from a table of markov entries with each markov set identified by a name. This could be delivered to you by the production DBA so that you wouldn’t need any access to the production server.

Next time, Phil promises to tackle dates.

Nested Loops, Hash, Or Merge: Which Is Best?

Grant Fritchey dodges the important questions:

First response, also a joke, was the question at the title of this post:

What is the preferred operator when joining tables: Hash Match, Nested Loops or Merge?

While my immediate response to this question is, yes. Meaning, they’re all preferred, situationally. I decided to expand on that a bit.

I completely agree with Grant:  there is no single best operator.  If there were, database companies wouldn’t have multiple options.  That said, in an ideal world, all joins would be merge joins; in our fallen world, nested loops and hash matches often prove superior second-best alternatives.


Kenneth Fisher digs into the COUNT() function and sees how it deals with NULL values:

Count the number of values
SELECT COUNT(FieldName) FROM TableName;
-- or

The ALL argument is the default and is unnecessary (I didn’t even know it existed until I started this post). Here you are counting the number of non NULL values in FieldName. So in a column with (1, NULL, 1, 2, 3, NULL, 1) you’ll get a count of 5. You do get a nice warning (depending on your ANSI_WARNINGS setting) if there was a NULL value though.

By the way, the ALL operator isn’t useful there, but can be useful along with its counterparts SOME and ANY.  I rarely keep them in my mind, so I instead tend to write EXISTS and NOT EXISTS statements which have operate on an equivalent function.

New Features In Public Preview On Azure SQL Database

Microsoft has a round of announcements for public previews on Azure SQL Database.  First up is Kevin Farlee announcing approximate count distinct:

The new APPROX_COUNT_DISTINCT aggregate function returns the approximate number of unique non-null values in a group.

This function is designed for use in big data scenarios and is optimized for the following conditions:

  • Access of data sets that are millions of rows or higher AND
  • Aggregation of a column or columns that have a large number of distinct values

Assuming these conditions, the accuracy will be within 2% of the precise result for a majority of workloads.

I’m liking this change.  Sometimes I simply need an approximate number  but I want it fast.

Shreya Verma announces MATCH support in the MERGE operator:

We will be further expanding the graph database capabilities with several new features. In this blog we will discuss one of these features that is now available for public preview in Azure SQL Database, MATCH support in MERGE DML for graph tables.

The MERGE statement performs insert, update, or delete operations on a target table based on the results of a join with a source table. For example, you can synchronize two tables by inserting, updating, or deleting rows in a target table based on differences between the target table and the source table. Using MATCH predicates in a MERGE statement is now supported on Azure SQL Database. That is, it is now possible to merge your current graph data (node or edge tables) with new data using the MATCH predicates to specify graph relationships in a single statement, instead of separate INSERT/UPDATE/DELETE statements.

I’ll use that approximately the day they fix all of the bugs with the MERGE operator.

Joe Sack announces row mode memory grant feedback:

In Azure SQL Database, we are further expanding query processing capabilities with several new features under the Intelligent Query Processing (QP) feature family.  In this blog post we’ll discuss one of these Intelligent QP features that is now available in public preview, row mode memory grant feedback.  Row mode memory grant feedback expands on the memory grant feedback feature by adjusting memory grant sizes for both batch and row mode operators.

Key feature benefits:

  • Reduce wasted memory. For an excessive memory grant condition, if the granted memory is more than two times the size of the actual used memory, memory grant feedback will recalculate the memory grant. Consecutive executions will then request less memory.

  • Decrease spills to disk. For an insufficiently sized memory grant that results in a spill to disk, memory grant feedback will trigger a recalculation of the memory grant. Consecutive executions will then request more memory.

This was big for batch mode operators, and I’m happy to see it move to row mode operators as well.

Finally, Joe also announces table variable deferred compilation:

In Azure SQL Database, we will be further expanding query processing capabilities with several new features under the Intelligent Query Processing (QP) feature family.  In this blog post we’ll discuss one of these Intelligent QP features that is now available in public preview in Azure SQL Database, table variable deferred compilation.

Table variable deferred compilation improves plan quality and overall performance for queries referencing table variables. During optimization and initial compilation, this feature will propagate cardinality estimates that are based on actual table variable row counts.  This accurate row count information will be used for optimizing downstream plan operations.

This one has the potential to be a pretty big performance improvement as well.

Creating The Ultimate Calendar Table

Daniel Hutmacher has started to create a comprehensive calendar table:

It’s a collection of inline table value functions that generate different types of calendars, with a number of properties that could be relevant for a calendar dimension. Each function has a unique date column, so you can join the functions you need together in a view or a procedure. The functions are:

  • Dates: a plain gregorian calendar.

  • Fiscal, annual: a gregorian, year-based calendar where you can define the start of a year, like a corporate fiscal calendar.

  • Fiscal, 4-4-5 or 52/53: a week-based calendar where years comprise four quarters of 4+4+5 weeks respectively.

  • Indian national calendar

  • Persian calendar

  • Thai calendar

  • Dates of Catholic and Orthodox easter

  • Lunar cycle

I was going to jokingly be shocked that this list didn’t include the Hebrew or Islamic calendars, but then Daniel had to ruin my fun by explaining why not.  Check it out and when you’re ready to give it a try, head over to his downloads page.


July 2018
« Jun Aug »