Press "Enter" to skip to content

Month: December 2018

What’s New With Kafka 2.1

Stephane Maarek updates us on the goings-on with Apache Kafka:

Kafka 2.1 is quite a special upgrade because you cannot downgrade due to a schema change in the consumer offsets topic. Otherwise, the procedure to upgrade Kafka is the same as before; see: https://kafka.apache.org/documentation/#upgrade

One of the big changes is support for Java 11. It’s a shame that Spark currently doesn’t support Java versions past 8.
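For reference, the rolling upgrade procedure in the Kafka documentation hinges on two broker settings. Here is a minimal sketch of the interim server.properties values, assuming an upgrade from 2.0; treat the official upgrade guide as authoritative for the exact steps:

    # Step 1: before swapping binaries, pin both settings to the
    # version you are upgrading FROM (2.0 in this sketch).
    inter.broker.protocol.version=2.0
    log.message.format.version=2.0

    # Step 2: upgrade and restart the brokers one at a time.

    # Step 3: once every broker runs 2.1, bump the protocol version
    # and perform a second rolling restart.
    inter.broker.protocol.version=2.1

    # Step 4: after clients are upgraded, bump the message format and
    # roll once more. Remember that once the new consumer offsets
    # schema is written, there is no downgrade path.
    log.message.format.version=2.1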


Static Data Masking In SSMS 18.0

Monica Rathbun introduces a new feature in SQL Server Management Studio:

Ever need to have a test database on hand that you can allow others to query “real like” data without actually giving them actual production data values? In SQL Server Management Studio (SSMS) 18.0 preview, Microsoft introduces us to Static Data Masking. Static Data Masking is a new feature that allows you to create a cloned copy of your database and replace sensitive data with new data (fake data, referred to as masked). You can use this for things like development of business reports and analytics, troubleshooting, database development, and even sharing data with outside teams or third parties. Unlike Dynamic Data Masking, added in SQL Server 2016, this feature does not hide the data with characters; rather, it replaces the entire value. For example, with Dynamic Data Masking the name Peter = Pxxxx, whereas Static Data Masking changes Peter to Paul. This makes it very easy to use in place of production. Let’s see it in action. If you are not on a newer version of SSMS, don’t worry, you can download it.

It looks like there are a few limitations to keep in mind, so click through to read about those.
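For contrast with the static approach, the Dynamic Data Masking feature Monica references is declared in T-SQL. A minimal sketch, with an invented table and column:

    -- Dynamic Data Masking (SQL Server 2016+) hides values at query time;
    -- the underlying data is unchanged. Static Data Masking instead
    -- rewrites the values in a cloned copy of the database.
    ALTER TABLE dbo.Customer
        ALTER COLUMN FirstName ADD MASKED WITH (FUNCTION = 'partial(1, "xxxx", 0)');

    -- A user without the UNMASK permission now sees 'Pxxxx' instead of 'Peter'.
    SELECT FirstName FROM dbo.Customer;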


SSMS Keyboard Shortcuts In Azure Data Studio

Bob Pusateri reduces a bit of the mental burden of shifting to Azure Data Studio:

One thing about Azure Data Studio I’m not too keen on, though, is that many of the keyboard shortcuts are different. One keyboard shortcut that’s particularly helpful to me is using Ctrl + E to execute queries. I realize that F5 is the most common key to execute a query; however, on most laptop keyboards you now need to hold an additional key to make the function keys behave like function keys. For this reason, Ctrl + E is a wonderful and quick alternative, but it doesn’t work in Azure Data Studio. Or didn’t, until now.
Fortunately, Azure Data Studio is designed to be expanded upon with extensions from both Microsoft and the community. In the case of keyboard shortcuts, a particularly helpful one is called SSMS Keymap, which ports many popular SSMS keyboard shortcuts into Azure Data Studio. With this extension, Ctrl + E is once again an option, and I no longer have to click “Execute” with a mouse, or fumble to find my laptop’s F5 equivalent.

Click through for the demo and grab that extension.
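If you would rather not install an extension for a single shortcut, Azure Data Studio also supports VS Code-style custom keybindings. A sketch of a keybindings.json entry is below; the command ID is my assumption of the run-query action, so confirm the exact name in the Keyboard Shortcuts editor before relying on it:

    // keybindings.json: remap Ctrl+E to execute the current query.
    // NOTE: "runQueryKeyboardAction" is an assumed command ID; verify
    // it via the Keyboard Shortcuts editor in Azure Data Studio.
    [
        {
            "key": "ctrl+e",
            "command": "runQueryKeyboardAction",
            "when": "editorTextFocus"
        }
    ]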


Using IDENTITY In A SELECT Statement

Kenneth Fisher shares something he learned recently about the IDENTITY function:

Now, looking at it a bit more closely, you’ll see that this is a function call, not just a property. In my research for this post I did find where I’d mentioned this function briefly in my somewhat comprehensive identity post. Technically I didn’t so much mention it as it was mentioned to me in the comments, so I added it to the list. I guess I either didn’t look at it closely enough at the time or it’s just one of those cases where I forgot. Either way, it’s worth highlighting now.

Click through to learn more.
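For a quick sense of what that function call looks like, here is a minimal sketch; the table names are invented, and note that the IDENTITY() function is only legal in a SELECT ... INTO:

    -- IDENTITY(data type, seed, increment) stamps a brand-new identity
    -- column onto the table created by SELECT ... INTO.
    SELECT IDENTITY(INT, 1, 1) AS RowID,
           p.FirstName,
           p.LastName
    INTO   #NumberedPeople      -- RowID becomes an identity column here
    FROM   dbo.People AS p;

    SELECT * FROM #NumberedPeople;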


Comparing Data With CHECKSUM

David Fowler shows how to use CHECKSUM and CHECKSUM_AGG to compare data:

There are times when we need to compare two tables and figure out if the data matches. I often see a number of ways of doing this suggested; most are quite slow and inefficient. I’d quite like to share a quick and slightly dirty way of doing this using the CHECKSUM and CHECKSUM_AGG functions.

CHECKSUM()
Just a reminder that CHECKSUM() will generate a checksum for an entire row or selection of columns in the row.

CHECKSUM_AGG()
CHECKSUM_AGG() will generate a checksum for an entire dataset.

David then has a couple of examples showing these in action.
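For flavor, a minimal sketch of the pattern with invented table names; David’s post has the real examples:

    -- Roll each row's checksum up into a single value per table; if the
    -- two aggregates differ, the data differs somewhere.
    SELECT CHECKSUM_AGG(CHECKSUM(*)) AS TableChecksum FROM dbo.CustomersA;
    SELECT CHECKSUM_AGG(CHECKSUM(*)) AS TableChecksum FROM dbo.CustomersB;

    -- Caveat: checksums can collide, so matching values strongly suggest,
    -- but do not absolutely guarantee, identical data.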


When Using SSRS Makes Sense

Eugene Meidinger lays out the scenarios in which it makes sense to use SQL Server Reporting Services over Power BI, Excel, or other tooling:

SSRS makes it easy to control who has access to your reports and data. It is possible to specify permissions on the whole server, specific folders of reports or on a single report. Permissions inherit down, like a regular file system, unless you explicitly break inheritance to specify custom permissions.
In addition to permissions, you have a central server to house and control your reports. This is critical when you need an authoritative source of truth for your reporting. Users can trust that they are reading the latest version of any given report.
In addition to the administrative side of things, SSRS provides a powerful development environment with SSDT. SQL Server Data Tools (SSDT) is based on Visual Studio, a very popular Integrated Development Environment, or IDE. SSDT makes it incredibly easy to store your reports in source control since your reporting artefacts are just XML files. Source control makes it possible to collaborate on a team or roll back to earlier versions of a report. This is a capability that is not available with Excel or Power BI reports.

Read the whole thing.


Keeping Polybase Tables In Sync With Biml

Ben Weissman combines two of my favorite things:

If you have started playing with Polybase, you have probably figured out by now that – as awesome as it is – it’s still a bit of a pain to set it up and maintain external tables. There is a wizard in Azure Data Studio, but it’s still under development, especially from a usability standpoint.
So what can be done about that? Well, we’re effectively looking for an easy way to read metadata from a relational database and automate T-SQL to mirror that metadata. HELLO?! Perfect use case for Biml – which is NOT just for SSIS.
Let’s take a look at how that can be done…
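To give a sense of the T-SQL such a Biml script would emit, here is a hedged sketch of one external table; the data source and table names are invented, and the exact WITH options vary by source type:

    -- One external table mirroring a remote table's metadata; this is
    -- the kind of DDL you would generate per source table.
    CREATE EXTERNAL TABLE dbo.SalesOrders
    (
        OrderID   INT           NOT NULL,
        OrderDate DATE          NOT NULL,
        Amount    DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        LOCATION    = 'Sales.dbo.SalesOrders',  -- remote object
        DATA_SOURCE = RemoteSqlServer           -- pre-created EXTERNAL DATA SOURCE
    );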

If only Ben could have used F# instead of VB and VB with curly braces…


What’s New With Cloudera Enterprise 6.1.0

Krishna Maheshwari announces Cloudera Enterprise 6.1.0:

Platform Support & Security
Cloudera now supports deploying with OpenJDK 8 in addition to Oracle’s JDK. With this release, we also support AWS CloudHSM for HDFS encryption-at-rest.
As customers are increasingly implementing security, we are changing defaults to be secure in order to reduce setup complexity and configuration misses. As a part of this release, several defaults in Kafka, Impala, Sqoop, and Flume have been changed to be more secure, and we have added BDR replication from insecure to secure (Kerberized) clusters to ease the transition to secure clusters.

There are several improvements and new features here worth checking out.


TPC-DS Testing With HDP 3.0

Nita Dembla and Gopal Vijayaraghavan compare HDP 3.0 versus HDP 2.6.5 when running the TPC-DS query set and note performance improvements in Hive LLAP:

Hortonworks announced the general availability of HDP 3.0 this year. You may read more about it here. Bundled with HDP 3.0, Apache Hive 3 with LLAP took a significant leap as an enterprise-ready, real-time data warehouse with transactional capabilities that continues to serve BI workloads with lower latencies. HDP 3.0 comes with exciting new capabilities – ACID support, materialized views, SQL constraints, and a query result cache, to name a few. Additionally, we continued to build and improve on the performance enhancements introduced in earlier releases.
In this blog, we will provide an update on our performance benchmark blog, comparing the performance of HDP 3.0 to HDP 2.6.5. The noteworthy difference in this benchmark is that all tables are by default transactional and written in ACID format, which means there are additional metadata (ROW_ID) columns to uniquely identify each row and support transactional semantics. Another key database capability used and tested here is SQL constraints. The hive-testbench schema has been enhanced to declare primary/foreign key, not-null, and unique constraints.

Their headline is that Hive 3 is up to 2x faster than Hive 2, with huge gains in a few of the queries.
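For flavor, an informational constraint in Hive 3 looks roughly like this; the table and key follow the TPC-DS schema, but treat the exact syntax as a sketch:

    -- Hive's constraints are informational: DISABLE NOVALIDATE means
    -- they are not enforced on write, and RELY lets the optimizer use
    -- them when planning queries.
    ALTER TABLE store_sales
        ADD CONSTRAINT pk_store_sales
        PRIMARY KEY (ss_item_sk, ss_ticket_number)
        DISABLE NOVALIDATE RELY;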


Variable Screening With vtreat

John Mount explains how you can use vtreat for determining variable importance:

Part of the vtreat philosophy is to assume that after the vtreat variable processing, the next step is a sophisticated supervised machine learning method. Under this assumption, we expect the machine learning methodology (be it regression, tree methods, random forests, boosting, or neural nets) to handle issues of redundant variables, joint distributions of variables, overall regularization, and joint dimension reduction.
However, an important exception is variable screening. In practice we have seen wide data warehouses with hundreds of columns overwhelm and defeat state-of-the-art machine learning algorithms due to over-fitting. We have some synthetic examples of this (here and here).
The upshot is: even in 2018 you cannot treat every column you find in a data warehouse as a variable. You must at least perform some basic screening.

Read on to see a couple quick functions which help with this screening.
