HBase Updates in CDH 6.2

Krishna Maheshwari announces updates to the Cloudera Distribution of Hadoop:

Starting with CDH 6.2, Cloudera now includes the ability to use Intel’s newly released Optane Memory as an alternate destination for the 2nd tier of the bucket cache.  This deployment configuration enables you to have ~3x the size of the cache for constant cost (as compared to off-heap cache on DRAM). It does incur some additional latency compared to the traditional off-heap configuration, but our testing indicates that by allowing more (if not all) of the data’s working set to fit in the cache the set up results in a net performance improvement when the data is ultimately stored on HDFS (using HDDs).

When deploying to the cloud or using on-prem object storage, the performance improvement will be even better as object storage tends to be very expensive for random reads of small amounts of data.

There aren’t too many changes to HBase in the blog post, but the two mentioned are pretty good ones.

Connecting PolyBase to Spark

I have a blog post connecting PolyBase to a Spark cluster:

If you do define your Spark DataFrames well, you get a much happier result. Here’s me creating a better-looking DataFrame in Spark:

import org.apache.spark.sql.functions._
INT(SUMLEV) AS SummaryLevel,
POPTYPE AS PopulationType,
FROM NorthCarolinaPopulation

It’s not all perfect, though: I also cover driver problems that I ran into here with Spark and Hive.

Finding High-Variance Memory Grants

Erin Stellato shows how you can use Query Store to track the variance of memory grant requests:

One of the more perplexing problems to troubleshoot in SQL Server can be those related to memory grants. Some queries need more memory than others to execute, based on what operations need to be performed (e.g. sort, hash). SQL Server’s optimizer estimates how much memory is needed, and the query must obtain the memory grant in order to start executing. It holds that grant for the duration of query execution – which means if the optimizer overestimates memory you can run into concurrency issues. If it underestimates memory, then you can see spills in tempdb. Neither is ideal, and when you simply have too many queries asking for more memory than is available to grant, you’ll see RESOURCE_SEMAPHORE waits. There are multiple ways to attack this issue, and one of my new favorite methods is to use Query Store.

Click through for a demonstration.

CAST and CONVERT Make Expressions Nullable

Daniel Hutmacher points out a side effect of using CAST() and CONVERT():

Suppose we want to set up a view in the new solution that mirrors the names and definitions of the old table, so the legacy integration can use that view going forward:

CREATE OR ALTER VIEW new.the_table_like_before
SELECT CAST(id AS varchar(32)) AS id,
CAST([row] AS int) AS [row],
CAST(date_loaded AS datetime) AS dt
FROM new.the_table;

Now, if you check out the resulting datatypes of the view, you’ll notice that all the columns are marked nullable, even though they’re all based on non-nullable columns, so the values in the view could never be null.

Read on for a couple possible solutions.

Business Cases for SQL Server 2019

Kevin Chant gives you business cases for why you should upgrade to SQL Server 2019:

1. Polybase
Currently, you might import large amounts of data from another relational database system like Oracle or MongoDB into your SQL Server database. If so, you can use Polybase instead.

Polybase has certainly been improved in SQL Server 2019. For instance, more data sources are now available. Plus, you can even install Polybase on SQL Server 2019 on Linux as well.

Instead of importing the data from the source you can use Polybase to connect to it remotely. Which means that you can run your queries against the source directly instead.

Read on for more in this vein. Here it’s less about the technical capabilities and more about making life easier for other people in the business.

Wireshark and the SQL Server Browser Service

Joy George Kunjikkur takes a look at the SQL Server Browser service using Wireshark:

Normally on development environment the 1434 port would be open and SQL Server Browser running in SQL Server machine. If developers code based on that assumption, it may fail in higher environments. Most of the developers know what might have happened. But some may not and ends up in complete darkness. The symptom of this issue will be connection failure from application to SQL Server database.

This post is mainly to demonstrate how to check network failures on SQL Server Browser operations. The prerequisite is the availability of Wireshark in the machine where application runs and basic knowledge how to run the same. The above link has some screenshots which would help to run Wireshark for UDP connections.

This is usually the first service I disable post-installation.

Antivirus and SQL Server

Randolph West proffers advice should your IT team require installing antivirus software on a server with SQL Server running:

This is why it is documented that we should exclude SQL Server from any AV (anti-malware) detection products, so that it can get on with doing what it does best.

Yes, it’s formally documented. This is why we should read documentation when installing things. While it’s super-easy to click “Next,” “Next,” “Next,” that should not be the case with a complex product like SQL Server.

Read on for the list of exceptions you should add and processes to avoid scanning.


June 2019
« May