2018-10-26 – Curated SQL

To access data to display in our dashboard we will use some Spring Boot 2.06 Java 8 microservices to call Apache Hive 3.1.0 tables in HDP 3.0 on Hadoop 3.1.

We will have our website hosted and make REST Calls to Apache NiFi, our microservices, YARN, and other APIs.

As you can see we can easily incorporate data from HDP 3 — Apache Hive 3.1.0 in Spring Boot Java applications with not much trouble. You can see the Maven build script (all code is in GitHub).

Our motivation is to put all this data somewhere and show it on a dashboard that can use REST APIs for data access and updates. We may choose to use Apache NiFi for all REST APIs or we can do some in Apache NiFi. We are still exploring. We can also decide to change the backend to HBase 2.0, Phoenix, Druid or a combination of these. We will see.

Read on for a series of screenshots and config files showing you how to set this up.

Comments closed

What’s New In Cloudera Enterprise 6.0

Published 2018-10-26 by Kevin Feasel

The Cloudera Hive team looks at the introduction of Apache Hive 2.1 into Cloudera Enterprise 6:

We are also focusing on efficiency across our platform. While on-premises platform efficiency helps manage costs in the long run, the immediate benefits of in-cloud deployments are realized by reducing total cost of ownership (TCO). We introduced Hive-on-Spark two years ago to meet this goal in collaboration with Intel which is our strategic partner. We have a longstanding collaboration with Intel to optimize Cloudera’s stack on Intel architecture for our customers’ benefit.

In Enterprise 6.0, taking our strategic partnership with Intel ahead for further efficiency gains in Hive, we introduce a major performance and efficiency enhancement in HoS called Parquet Vectorization. This feature enables the HoS engine to process a vector of columns instead of one row at a time by batching data rows together into column vectors and making each operator work on such column vectors. This leads to better utilization of CPU caches and achieves high instructions per cycle by efficiently using the CPU instruction pipeline. In addition, we include numerous other performance improvements. For example, Hive often scans a given table multiple times during self joins, self-unions, or shared sub-queries. To address this, Dynamic RDD caching in HoS reuses a single scan across all these operations. Similarly, when the same subquery is used repeatedly, HoS executes this only once instead of separately for each subquery invocation. Overall, with all these enhancements, in Enterprise 6.0 Hive can be up to 2.2X faster than Hive on the latest Enterprise 5.x release. The majority of these gains can be attributed to Parquet Vectorization for Hive-on-Spark.

This is another case where the Cloudera-Hortonworks merger will get interesting: Cloudera seemed to hitch its wagon to Impala and Hortonworks to Hive; will they support both as much as they each did independently, or will the new corporate overlords settle on one of the two?

Comments closed

Digging Into Batch Mode And Parameter Sniffing

Published 2018-10-26 by Kevin Feasel

Erik Darling has mixed news on the efficacy of using batch mode for rowstore as a way of eliminating problems arising from parameter sniffing:

SQL Server 2019 introduced batch mode over row store, which allows for batch mode processing to kick in on queries when the optimizer deems it cost effective to do so, and also to open up row store queries to the possibility of Adaptive Joins, and Memory Grant Feedback.

These optimizer tricks have the potential to help with parameter sniffing, since the optimizer can change its mind about join strategies at run time, and adjust memory grant issues between query executions.

But of course, the plan that compiles initially has to qualify to begin with. In a way, that just makes parameter sniffing even more frustrating.

Read on for both the good and the bad.

Comments closed

Creating Firewall Rules With Azure Cloud Shell

Published 2018-10-26 by Kevin Feasel

Kellyn Pot’vin-Gorman shows how you can add a firewall rule for Azure SQL Database from the Azure Cloud Shell:

With my use of scripting and Azure Cloud Shell, I’m automating and building my environment, including SQL Database resources and then have a requirement to access and build the logical objects. This means that I need a firewall rule build for the Azure Cloud Shell I’m working from. The IP for this cloud shell is unique to the session I’m running at that moment.

The requirement to add this enhancement to my script is:

Capture and read the IP Address for the Azure Cloud shell session.
Populate the IP Address to a Firewall rule
Log into the new SQL Server database that was created as part of the bash script and then execute SQL scripts.

Click through for instructions.

Comments closed

Optimizing DAX SWITCH Statements With Variables

Published 2018-10-26 by Kevin Feasel

Marco Russo gives us some advice on optimizing IF and SWITCH statements in DAX:

Though the DAX engine might reuse the result obtained for the same measures in the same filter context (Sales Amount and Sales LY), this is not always the case. In this scenario, variables are a good way to ensure a better optimized code execution.

However, variables should only be used within their respective scope. For example, if a variable is defined before a conditional statement, then the variable will be evaluated regardless of the condition. This has a strong performance impact in case there are disconnected slicers in the report. To elaborate on this, consider the following report where a Time Selection table is used to define a slicer that controls which columns of the matrix should be visible. The matrix contains a single measure called Sales, whose content depends on the period selected in the column.

Read on for more.

Comments closed

Azure SQL Database Hyperscale Tier

Published 2018-10-26 by Kevin Feasel

Chris Seferlis looks at a new service tier offering for Azure SQL Database:

The Hyperscale service tier provides the following capabilities:

Support for up to 100 terabytes of database size (and this will grow over time)
Faster large database backups which are based on file snapshots
Faster database restores (also based on file snapshots)
Higher overall performance due to higher log throughput and faster transaction commit time regardless of the data volumes
The ability to rapidly scale out. You can provision one or more read only nodes for offloading your read workload for use as hot standbys.
You can rapidly scale up your compute resources (in constant time) to accommodate heavy workloads, so you can scale compute up and down as needed just like Azure Data Warehouse

At what cost? I like Chris’s “not inexpensive” understatement here.

Comments closed

Slipstream Installation Of SQL Server

Published 2018-10-26 by Kevin Feasel

Randolph West shows how to install a pre-patched version of SQL Server:

For this example, we will be using the SQL Server 2017 Developer Edition RTM (called en_sql_server_2017_developer_x64_dvd_11296168.iso), and Cumulative Update 11 (called SQLServer2017-KB4462262-x64.exe), which was the latest CU available at the time of this writing.

Place the Cumulative Update in a folder that will contain the patch files. On older versions of SQL Server, this could comprise the latest Service Pack, Cumulative Updates, as well as additional hotfixes you may wish to apply. For instance, as of this writing, SQL Server 2016 requires Service Pack 2, Cumulative Update 3 for Service Pack 2, and two more hotfixes to bring it up to date. We would have to have all four files in this folder.

The actual path does not matter as long as we keep track of where they are. For the purposes of this post, we will assume they are stored in C:\Temp.

Definitely a good idea if you’re installing SQL Server regularly.

Comments closed

Using Table-Valued Parameters In SQL Server

Published 2018-10-26 by Kevin Feasel

Ben Richardson has a post showing how to create user-defined table types and pass them into stored procedures:

Table-valued parameters were introduced in SQL Server 2008. Before that, there were limited options to pass tabular data to stored procedures. Most developers used one of the following methods:

Data in multiple columns and rows was represented in the form of a series of parameters. However, the maximum number of parameters that can be passed to a SQL Server stored procedure is 2,100. Therefore, in the case of a large table, this method could not be used. Furthermore preprocessing is required on the server side in order to format the individual parameters into a tabular form.
Create multiple SQL statements that can affect multiple rows, such as UPDATE. The statements can be sent to the server individually or in the batched form. Even if they are sent in the batched form, the statements are executed individually on the server.
Another way is to use delimited strings or XML documents to bundle data from multiple rows and columns and then pass these text values to parameterized SQL statements or stored procedures. The drawback of this approach was that you needed to validate the data structure in order to unbundle the values.

The .NET framework then makes it easy to pass in an IEnumerable as a table-valued parameter.

Comments closed

Day: October 26, 2018

Using Spring Boot To Build A NiFi Operational Dashboard

What’s New In Cloudera Enterprise 6.0

Digging Into Batch Mode And Parameter Sniffing

Creating Firewall Rules With Azure Cloud Shell

Optimizing DAX SWITCH Statements With Variables

Azure SQL Database Hyperscale Tier

The Hyperscale service tier provides the following capabilities:

Slipstream Installation Of SQL Server

Using Table-Valued Parameters In SQL Server