Press "Enter" to skip to content

Curated SQL Posts

Using IDENTITY In A SELECT Statement

Kenneth Fisher shares something he learned recently about the IDENTITY function:

Now, looking at it a bit more closely, you’ll see that this is a function call, not just a property. In my research for this post, I did find where I’d mentioned this function briefly in my somewhat comprehensive identity post. Technically, I didn’t so much mention it as it was mentioned to me in the comments, so I added it to the list. I guess I either didn’t look at it closely enough at the time, or it’s just one of those cases where I forgot. Either way, it’s worth highlighting now.

Click through to learn more.
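
As a quick illustration of the function form (a minimal sketch; the temp table and column names here are illustrative, not from Kenneth’s post), IDENTITY() takes a data type, seed, and increment, and is only legal in a SELECT … INTO statement:

-- IDENTITY() as a function is only valid in SELECT ... INTO;
-- it numbers the rows as they are inserted into the new table.
SELECT IDENTITY(INT, 1, 1) AS RowID,
       name
INTO   #NumberedObjects
FROM   sys.objects;

SELECT RowID, name
FROM   #NumberedObjects
ORDER BY RowID;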

Comments closed

Comparing Data With CHECKSUM

David Fowler shows how to use CHECKSUM and CHECKSUM_AGG to compare data:

There are times when we need to compare two tables and figure out if the data matches. I often see a number of ways of doing this suggested, most of which are quite slow and inefficient. I’d quite like to share a quick and slightly dirty way of doing this using the CHECKSUM and CHECKSUM_AGG functions.

CHECKSUM()
Just a reminder that CHECKSUM() will generate a checksum for an entire row or selection of columns in the row.

CHECKSUM_AGG()
Will generate a checksum for a dataset.

David then has a couple of examples showing these in action.
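
The core of the trick looks something like this (a hedged sketch with illustrative table names; keep in mind that CHECKSUM can collide, so a matching value strongly suggests, but does not prove, identical data):

-- Generate one aggregate checksum per table and compare the results.
SELECT CHECKSUM_AGG(CHECKSUM(*)) AS TableChecksum
FROM   dbo.CustomersA;

SELECT CHECKSUM_AGG(CHECKSUM(*)) AS TableChecksum
FROM   dbo.CustomersB;

-- Different values mean the data differs; equal values mean it very likely matches.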

Comments closed

When Using SSRS Makes Sense

Eugene Meidinger lays out the scenarios in which it makes sense to use SQL Server Reporting Services over Power BI, Excel, or other tooling:

SSRS makes it easy to control who has access to your reports and data. It is possible to specify permissions on the whole server, specific folders of reports or on a single report. Permissions inherit down, like a regular file system, unless you explicitly break inheritance to specify custom permissions.
In addition to permissions, you have a central server to house and control your reports. This is critical when you need an authoritative source of truth for your reporting. Users can trust that they are reading the latest version of any given report.
In addition to the administrative side of things, SSRS provides a powerful development environment with SSDT. SQL Server Data Tools (SSDT) is based on Visual Studio, a very popular Integrated Development Environment (IDE). SSDT makes it incredibly easy to store your reports in source control, since your reporting artefacts are just XML files. Source control makes it possible to collaborate on a team or roll back to earlier versions of a report. This is a capability that is not available with Excel or Power BI reports.

Read the whole thing.

Comments closed

Keeping Polybase Tables In Sync With Biml

Ben Weissman combines two of my favorite things:

If you have started playing with Polybase, you have probably figured out by now that – as awesome as it is – it’s still a bit of a pain to set up and maintain external tables. There is a wizard in Azure Data Studio, but it’s still under development, especially from a usability standpoint.
So what can be done about that? Well, we’re effectively looking for an easy way to read metadata from a relational database and generate T-SQL to mirror that metadata. HELLO?! Perfect use case for Biml – which is NOT just for SSIS.
Let’s take a look at how that can be done…

If only Ben could have used F# instead of VB and VB with curly braces…
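
For a sense of the T-SQL that has to be generated per table, here’s a rough sketch (the data source, credential, and table definition are illustrative assumptions, not Ben’s actual output):

-- One external data source pointing at the remote server...
CREATE EXTERNAL DATA SOURCE SrcSqlServer
WITH (
    LOCATION   = 'sqlserver://SourceServer:1433',
    CREDENTIAL = SrcCredential
);

-- ...and one external table per source table, mirroring its metadata.
CREATE EXTERNAL TABLE dbo.SalesOrders
(
    OrderID    INT          NOT NULL,
    CustomerID INT          NOT NULL,
    OrderDate  DATETIME2(3) NOT NULL
)
WITH (
    LOCATION    = 'SourceDb.dbo.SalesOrders',
    DATA_SOURCE = SrcSqlServer
);

Biml’s job here is to loop over the source catalog’s metadata and emit a statement like that for every table, keeping the definitions in sync.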

Comments closed

What’s New With Cloudera Enterprise 6.1.0

Krishna Maheshwari announces Cloudera Enterprise 6.1.0:

Platform Support & Security
Cloudera now supports deploying with OpenJDK 8 in addition to Oracle’s JDK. With this release, we also support AWS CloudHSM for HDFS encryption-at-rest.
As customers are increasingly implementing security, we are changing defaults to be secure in order to reduce setup complexity and configuration misses. As a part of this release, several defaults in Kafka, Impala, Sqoop, and Flume have been changed to be more secure, and we have added BDR replication from insecure to secure (Kerberized) clusters to ease the transition to secure clusters.

There are several improvements and new features here worth checking out.

Comments closed

TPC-DS Testing With HDP 3.0

Nita Dembla and Gopal Vijayaraghavan compare HDP 3.0 versus HDP 2.6.5 when running the TPC-DS query set and note performance improvements in Hive LLAP:

Hortonworks announced the general availability of HDP 3.0 this year. You may read more about it here. Bundled with HDP 3.0, Apache Hive 3 with LLAP took a significant leap as an enterprise-ready, real-time data warehouse with transactional capabilities that continues to serve BI workloads with lower latencies. HDP 3.0 comes with exciting new capabilities: ACID support, materialized views, SQL constraints, and a query result cache, to name a few. Additionally, we continued to build on and improve the performance enhancements introduced in earlier releases.
In this blog, we will provide an update to our performance benchmark blog, comparing the performance of HDP 3.0 to HDP 2.6.5. The noteworthy difference in this benchmark is that all tables are transactional by default and written in ACID format, which means there are additional metadata (ROW_ID) columns to uniquely identify each row and support transactional semantics. Another key database capability used and tested here is SQL constraints. The hive-testbench schema has been enhanced to declare primary key/foreign key, not-null, and unique constraints.

Their headline is that Hive 3 is up to 2x faster than Hive 2, with huge gains in a few of the queries.
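
To make the “transactional by default, with declared constraints” difference concrete, here is a rough Hive 3 sketch (table, column, and constraint names are illustrative; DISABLE NOVALIDATE RELY marks the constraint as informational, so the optimizer can use it without enforcing it at write time):

-- In Hive 3 / HDP 3.0, managed ORC tables default to transactional;
-- spelling it out explicitly:
CREATE TABLE store_sales
(
    ss_item_sk       BIGINT,
    ss_ticket_number BIGINT,
    ss_quantity      INT,
    PRIMARY KEY (ss_item_sk, ss_ticket_number) DISABLE NOVALIDATE RELY
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');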

Comments closed

Variable Screening With vtreat

John Mount explains how you can use vtreat for determining variable importance:

Part of the vtreat philosophy is to assume that after the vtreat variable processing, the next step is a sophisticated supervised machine learning method. Under this assumption, we assume the machine learning methodology (be it regression, tree methods, random forests, boosting, or neural nets) will handle issues of redundant variables, joint distributions of variables, overall regularization, and joint dimension reduction.
However, an important exception is: variable screening. In practice we have seen wide data-warehouses with hundreds of columns overwhelm and defeat state of the art machine learning algorithms due to over-fitting. We have some synthetic examples of this (here and here).
The upshot is: even in 2018, you cannot treat every column you find in a data warehouse as a variable. You must at least perform some basic screening.

Read on to see a couple of quick functions which help with this screening.

Comments closed

Using Sqoop’s Logic To Improve Spark JDBC Performance

Avi Yehuda analyzes how Sqoop works to make relational database access from Spark faster:

Sqoop performed so much better almost instantly; all you needed to do was set the number of mappers according to the size of the data, and it worked perfectly.
Since both Spark and Sqoop are based on the Hadoop map-reduce framework, it’s clear that Spark can work at least as well as Sqoop; I only needed to find out how to do it. I decided to look closer at what Sqoop does to see if I could imitate that with Spark.
By turning on Sqoop’s verbose flag, you can get a lot more detail. What I found was that Sqoop splits the input across the different mappers, which makes sense: this is map-reduce, after all, and Spark does the same thing. But before doing that, Sqoop does something smart that Spark doesn’t do.

Read on to see what in particular Sqoop does, and how you can use that in your Spark code.
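
The splitting itself boils down to SQL you can reason about (an illustrative sketch; the table, column, and range values are made up): first a bounding query, then one non-overlapping range query per mapper. Spark’s JDBC reader behaves the same way once you supply partitionColumn, lowerBound, upperBound, and numPartitions.

-- Step 1: find the bounds of the split column.
SELECT MIN(order_id), MAX(order_id) FROM orders;

-- Step 2: each mapper (or Spark partition) reads one slice of that range.
SELECT * FROM orders WHERE order_id >= 1      AND order_id < 500001;   -- split 1
SELECT * FROM orders WHERE order_id >= 500001 AND order_id < 1000001;  -- split 2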

Comments closed

Powershell Core Preview 3 Fails To Start

Max Trinidad walks us through an issue with Powershell Core Preview 3:

Just in case you haven’t tried to install PowerShell Core Preview.3 in Windows, which became available on the evening of the 10th of December: if you are doing a clean installation, meaning that the previous version was manually uninstalled or that this is your first installation, then you are fine.
The issue with installing PowerShell Core Preview.3 comes when you upgrade over a previous version: either Preview.1 or Preview.2.

That’s a common occupational hazard when working with preview tools.  But Max has your solution.

Comments closed

Deploying SQL Server 2019 Big Data Clusters With Kubernetes

Chris Adkin has the start of a new series:

Minikube is a good learning tool, and Microsoft provides instructions for deploying a big data cluster to this ‘Platform’. However, its single-node nature, and the fact that application pods run on the master node, means that this does not reflect a cluster that anyone would run in production. Kubernetes-as-a-service is probably by far the easiest option for spinning up a cluster; however, it relies on an AWS, Azure, or Google Cloud Platform account, hence there is a $ cost associated with this. This leaves a vanilla deployment of Kubernetes on premises. Based on the assumption that most people will have access to Windows Server 2008 or above, a relatively cheap and easy way of deploying a Kubernetes cluster is via Linux virtual machines running on Hyper-V. This blog post will provide step-by-step instructions for creating the virtual machines to act as the master and worker nodes in the cluster.

This is going on my “try this out when I have time” list.  So expect a full report sometime in the year 2023.

Comments closed