Breaking Changes Coming To dbatools

Chrissy LeMaire warns us about breaking changes coming to dbatools with release 1.0:

Sometime in the next month, I’ll also be updating Start-DbaMigration to more closely match the parameters of Export-DbaInstance. Parameters like NoDatabases and NoLogins will be replaced by -Exclude Databases, Logins.

So the functionality won’t necessarily change, but if you have scheduled tasks or scripts that perform migrations, you will need to update your parameters once you update dbatools once these changes are made.

Keep an eye out for all of these changes if you’re a regular dbatools user or have processes scripted.

Hadoop + SQL Server In 2019

Travis Wright shows off a big part of what the SQL Server team has been working on the last couple of years:

SQL Server 2019 big data clusters provide a complete AI platform. Data can be easily ingested via Spark Streaming or traditional SQL inserts and stored in HDFS, relational tables, graph, or JSON/XML. Data can be prepared by using either Spark jobs or Transact-SQL (T-SQL) queries and fed into machine learning model training routines in either Spark or the SQL Server master instance using a variety of programming languages, including Java, Python, R, and Scala. The resulting models can then be operationalized in batch scoring jobs in Spark, in T-SQL stored procedures for real-time scoring, or encapsulated in REST API containers hosted in the big data cluster.

SQL Server big data clusters provide all the tools and systems to ingest, store, and prepare data for analysis as well as to train the machine learning models, store the models, and operationalize them.
Data can be ingested using Spark Streaming, by inserting data directly to HDFS through the HDFS API, or by inserting data into SQL Server through standard T-SQL insert queries. The data can be stored in files in HDFS, or partitioned and stored in data pools, or stored in the SQL Server master instance in tables, graph, or JSON/XML. Either T-SQL or Spark can be used to prepare data by running batch jobs to transform the data, aggregate it, or perform other data wrangling tasks.

Data scientists can choose either to use SQL Server Machine Learning Services in the master instance to run R, Python, or Java model training scripts or to use Spark. In either case, the full library of open-source machine learning libraries, such as TensorFlow or Caffe, can be used to train models.

Lastly, once the models are trained, they can be operationalized in the SQL Server master instance using real-time, native scoring via the PREDICT function in a stored procedure in the SQL Server master instance; or you can use batch scoring over the data in HDFS with Spark. Alternatively, using tools provided with the big data cluster, data engineers can easily wrap the model in a REST API and provision the API + model as a container on the big data cluster as a scoring microservice for easy integration into any application.

I’ve wanted Spark integration ever since 2016 and we’re going to get it.

Improvements In Table Variable Performance In SQL Server 2019

Matthew McGiffen tries out SQL Server 2019 to test a scenario where table variables were giving poor estimates in prior versions:

One of the most popular posts on my blog last year was where I pretty much suggested that people not use table variables:

Think twice before using table variables

This wasn’t new information when I wrote it, but bad performance due to the use of table variables remained such a common anti-pattern that I thought it was worth stressing again.

So, when I saw the above 2019 feature I thought I’d better investigate and update what I’m telling people.

TL;DR It looks like table variables are no longer a problem.

Read the whole thing.  This has the potential of changing long-standing advice going back a decade regarding table variables.

What’s In SQL Server 2019 CTP 2.0?

Aaron Bertrand gives us the highlights:

  • Certificate Management in Config Manager View and validate all of your certificates from a single interface, and manage and deploy certificate changes across all of the replicas in an Availability Group or all of the nodes in a Failover Cluster Instance.

  • Built-in data classification A new ADD SENSITIVITY CLASSIFICATION statement helps you identify and automatically audit sensitive data, a huge step up from the previous SSMS wizard (which just used extended properties).

Aaron also digs into the engine a bit:


This new aggregate function is designed for data warehouse scenarios, and is an equivalent for COUNT(DISTINCT()). Instead of performing expensive distinct sort operations to determine actual counts, it relies instead on statistics to get something relatively accurate. You should find that the margin of error is within 2% of the precise count, 97% of the time, which is usually fine for high-level analytics, values that populate a dashboard, or quick estimates.

On my system I created a table with integer columns ranging from 100 to 1,000,000 unique values, and string columns ranging from 100 to 100,000 unique values. There were no indexes other than a clustered primary key on the leading integer column. Here are the results of COUNT(DISTINCT()) vs. APPROX_COUNT_DISTINCT() against those columns, so you can see where it is off by a bit (but always well within 2%):

By the way, APPROX_COUNT_DISTINCT() is a really good idea, and I’m glad it’s here.

SSMS 17.9 Released

Alan Yu announces a new version of SQL Server Management Studio:

SSMS 17.9 provides support for almost all feature areas on SQL Server 2008 through the latest SQL Server 2017, which is now generally available.

In addition to enhancements and bug fixes, SSMS 17.9 comes with several new features:

  • ShowPlan improvements
  • Azure SQL support for vCore SKUs
  • Bug Fixes

View the Release Notes for more information.

It looks like the big push for this release was bug fixes, and there are quite a few of them.

Databricks Runtime 4.3 Released

Todd Greenstein announces Databricks Runtime 4.3:

In addition to the performance improvements, we’ve also added new functionality to Databricks Delta:

  • Truncate Table: with Delta you can delete all rows in a table using truncate.  It’s important to note we do not support deleting specific partitions.  Refer to the documentation for more information: Truncate Table

  • Alter Table Replace columns: Replace columns in a Databricks Delta table, including changing the comment of a column, and we support reordering of multiple columns.   Refer to the documentation for more information: Alter Table

  • FSCK Repair Table: This command allows you to Remove the file entries from the transaction log of a Databricks Delta table that can no longer be found in the underlying file system. This can happen when these files have been manually deleted.  Refer to the documentation for more information: Repair Table

  • Scaling “Merge” Operations: This release comes with experimental support for larger source tables with “Merge” operations. Please contact support if you would like to try out this feature.

Looks like a nice set of reasons to upgrade.

HDF 3.2 Updates

Dinesh Chandrasekhar walks us through some of the updates to Hortonworks Data Flow version 3.2:

Kerberos keytab isolation
Kerberos keytabs can now be isolated at a per principal level. This allows for users in a multi-tenant environment to safely be able to reference specific keytabs and principals. This ensures that just because a user has access to a HDFS keytab they will not have access to all of the HDFS principals. This provides a more granular control so that users are limited to only the principals they require.

Kafka 1.1.1 Support
In HDF 3.2, Kafka has been upgraded from 1.0.0 to 1.1.1. Key features and improvements have been added with respect to security and governance. In addition to these bug fixes, an important new feature was added to capture producer and topic metrics at partition level without instrumenting or configuring interceptors on the clients. This provides a non-invasive approach to capture important metrics for producers without refactoring/modifying your existing Kafka clients

Hive 3 support
Apache NiFi now supports Hive 3 running on HDP 3.0. This support ensures better performance for Hive streaming to HDP, Hive streaming to S3, and the ability to write directly to ORC from NiFi without first converting your datasets to Avro. Writing directly to ORC for better Hive query performance is accomplished by using the NiFi PutORC processor. With HDF 3.2, a few other processors related to HBase and HDFS have also been updated and enhanced.

Looks like there are some good updates to this version.

Confluent Platform 5.0 Released

Raj Jain and Michael Noll walk through the latest version of Confluent Platform, Confluent’s Kafka solution:

With Confluent Platform 5.0, operators can secure infrastructure using the new, easy-to-use LDAP authorizer plugin and can deliver faster disaster recovery (DR) thanks to automatic offset translation in Confluent Replicator. In Confluent Control Center, operators can now view broker configurations and inspect consumer lag to ensure that they are getting the most out of Kafka and that applications are performing as expected.

We have also introduced advanced capabilities for developers. In Confluent Control Center, developers can now better understand the data in Kafka topics due to the new topic inspection feature and Confluent Schema Registry integration. Control Center presents a new graphical user interface (GUI) for writing KSQL, making stream processing more effortless and intuitive as well. The latest version of KSQL itself introduces exciting additions, such as support for nested data, user-defined functions (UDFs), new types of joins and an enhanced REST API. Furthermore, Confluent Platform 5.0 includes the new Confluent MQTT Proxy for easier Internet of Things (IoT) integration with Kafka. The latest release is built on Apache Kafka 2.0, which features several new functionalities and performance improvements.

Looks like there have been some nice incremental improvements here.

SQL Operations Studio July Edition

Alan Yu announces a new version of SQL Operations Studio:

Highlights for this release include the following.

  • SQL Server Agent preview extension Job configuration support
  • SQL Server Profiler preview extension Improvements
  • Combine Scripts Extension
  • Wizard and Dialog Extensibility
  • Social content
  • Fix GitHub Issues

For complete updates, refer to the Release Notes.

Alan also has demos for each of these.  I still wish that they wouldn’t call their Extended Events viewer “Profiler” because that makes it harder for us to explain the difference between “good Profiler” and “bad Profiler.”

New Features In Public Preview On Azure SQL Database

Microsoft has a round of announcements for public previews on Azure SQL Database.  First up is Kevin Farlee announcing approximate count distinct:

The new APPROX_COUNT_DISTINCT aggregate function returns the approximate number of unique non-null values in a group.

This function is designed for use in big data scenarios and is optimized for the following conditions:

  • Access of data sets that are millions of rows or higher AND
  • Aggregation of a column or columns that have a large number of distinct values

Assuming these conditions, the accuracy will be within 2% of the precise result for a majority of workloads.

I’m liking this change.  Sometimes I simply need an approximate number  but I want it fast.

Shreya Verma announces MATCH support in the MERGE operator:

We will be further expanding the graph database capabilities with several new features. In this blog we will discuss one of these features that is now available for public preview in Azure SQL Database, MATCH support in MERGE DML for graph tables.

The MERGE statement performs insert, update, or delete operations on a target table based on the results of a join with a source table. For example, you can synchronize two tables by inserting, updating, or deleting rows in a target table based on differences between the target table and the source table. Using MATCH predicates in a MERGE statement is now supported on Azure SQL Database. That is, it is now possible to merge your current graph data (node or edge tables) with new data using the MATCH predicates to specify graph relationships in a single statement, instead of separate INSERT/UPDATE/DELETE statements.

I’ll use that approximately the day they fix all of the bugs with the MERGE operator.

Joe Sack announces row mode memory grant feedback:

In Azure SQL Database, we are further expanding query processing capabilities with several new features under the Intelligent Query Processing (QP) feature family.  In this blog post we’ll discuss one of these Intelligent QP features that is now available in public preview, row mode memory grant feedback.  Row mode memory grant feedback expands on the memory grant feedback feature by adjusting memory grant sizes for both batch and row mode operators.

Key feature benefits:

  • Reduce wasted memory. For an excessive memory grant condition, if the granted memory is more than two times the size of the actual used memory, memory grant feedback will recalculate the memory grant. Consecutive executions will then request less memory.

  • Decrease spills to disk. For an insufficiently sized memory grant that results in a spill to disk, memory grant feedback will trigger a recalculation of the memory grant. Consecutive executions will then request more memory.

This was big for batch mode operators, and I’m happy to see it move to row mode operators as well.

Finally, Joe also announces table variable deferred compilation:

In Azure SQL Database, we will be further expanding query processing capabilities with several new features under the Intelligent Query Processing (QP) feature family.  In this blog post we’ll discuss one of these Intelligent QP features that is now available in public preview in Azure SQL Database, table variable deferred compilation.

Table variable deferred compilation improves plan quality and overall performance for queries referencing table variables. During optimization and initial compilation, this feature will propagate cardinality estimates that are based on actual table variable row counts.  This accurate row count information will be used for optimizing downstream plan operations.

This one has the potential to be a pretty big performance improvement as well.


November 2018
« Oct