Author: Kevin Feasel

We are also focusing on efficiency across our platform. While on-premises platform efficiency helps manage costs in the long run, the immediate benefits of in-cloud deployments are realized by reducing total cost of ownership (TCO). We introduced Hive-on-Spark two years ago to meet this goal in collaboration with Intel which is our strategic partner. We have a longstanding collaboration with Intel to optimize Cloudera’s stack on Intel architecture for our customers’ benefit.

In Enterprise 6.0, taking our strategic partnership with Intel ahead for further efficiency gains in Hive, we introduce a major performance and efficiency enhancement in HoS called Parquet Vectorization. This feature enables the HoS engine to process a vector of columns instead of one row at a time by batching data rows together into column vectors and making each operator work on such column vectors. This leads to better utilization of CPU caches and achieves high instructions per cycle by efficiently using the CPU instruction pipeline. In addition, we include numerous other performance improvements. For example, Hive often scans a given table multiple times during self joins, self-unions, or shared sub-queries. To address this, Dynamic RDD caching in HoS reuses a single scan across all these operations. Similarly, when the same subquery is used repeatedly, HoS executes this only once instead of separately for each subquery invocation. Overall, with all these enhancements, in Enterprise 6.0 Hive can be up to 2.2X faster than Hive on the latest Enterprise 5.x release. The majority of these gains can be attributed to Parquet Vectorization for Hive-on-Spark.

This is another case where the Cloudera-Hortonworks merger will get interesting: Cloudera seemed to hitch its wagon to Impala and Hortonworks to Hive; will they support both as much as they each did independently, or will the new corporate overlords settle on one of the two?

Comments closed

Digging Into Batch Mode And Parameter Sniffing

Published 2018-10-26 by Kevin Feasel

Erik Darling has mixed news on the efficacy of using batch mode for rowstore as a way of eliminating problems arising from parameter sniffing:

SQL Server 2019 introduced batch mode over row store, which allows for batch mode processing to kick in on queries when the optimizer deems it cost effective to do so, and also to open up row store queries to the possibility of Adaptive Joins, and Memory Grant Feedback.

These optimizer tricks have the potential to help with parameter sniffing, since the optimizer can change its mind about join strategies at run time, and adjust memory grant issues between query executions.

But of course, the plan that compiles initially has to qualify to begin with. In a way, that just makes parameter sniffing even more frustrating.

Read on for both the good and the bad.

Comments closed

Creating Firewall Rules With Azure Cloud Shell

Published 2018-10-26 by Kevin Feasel

Kellyn Pot’vin-Gorman shows how you can add a firewall rule for Azure SQL Database from the Azure Cloud Shell:

With my use of scripting and Azure Cloud Shell, I’m automating and building my environment, including SQL Database resources and then have a requirement to access and build the logical objects. This means that I need a firewall rule build for the Azure Cloud Shell I’m working from. The IP for this cloud shell is unique to the session I’m running at that moment.

The requirement to add this enhancement to my script is:

Capture and read the IP Address for the Azure Cloud shell session.
Populate the IP Address to a Firewall rule
Log into the new SQL Server database that was created as part of the bash script and then execute SQL scripts.

Click through for instructions.

Comments closed

Optimizing DAX SWITCH Statements With Variables

Published 2018-10-26 by Kevin Feasel

Marco Russo gives us some advice on optimizing IF and SWITCH statements in DAX:

Though the DAX engine might reuse the result obtained for the same measures in the same filter context (Sales Amount and Sales LY), this is not always the case. In this scenario, variables are a good way to ensure a better optimized code execution.

However, variables should only be used within their respective scope. For example, if a variable is defined before a conditional statement, then the variable will be evaluated regardless of the condition. This has a strong performance impact in case there are disconnected slicers in the report. To elaborate on this, consider the following report where a Time Selection table is used to define a slicer that controls which columns of the matrix should be visible. The matrix contains a single measure called Sales, whose content depends on the period selected in the column.

Read on for more.

Comments closed

Azure SQL Database Hyperscale Tier

Published 2018-10-26 by Kevin Feasel

Chris Seferlis looks at a new service tier offering for Azure SQL Database:

The Hyperscale service tier provides the following capabilities:

Support for up to 100 terabytes of database size (and this will grow over time)
Faster large database backups which are based on file snapshots
Faster database restores (also based on file snapshots)
Higher overall performance due to higher log throughput and faster transaction commit time regardless of the data volumes
The ability to rapidly scale out. You can provision one or more read only nodes for offloading your read workload for use as hot standbys.
You can rapidly scale up your compute resources (in constant time) to accommodate heavy workloads, so you can scale compute up and down as needed just like Azure Data Warehouse

At what cost? I like Chris’s “not inexpensive” understatement here.

Comments closed

Slipstream Installation Of SQL Server

Published 2018-10-26 by Kevin Feasel

Randolph West shows how to install a pre-patched version of SQL Server:

For this example, we will be using the SQL Server 2017 Developer Edition RTM (called en_sql_server_2017_developer_x64_dvd_11296168.iso), and Cumulative Update 11 (called SQLServer2017-KB4462262-x64.exe), which was the latest CU available at the time of this writing.

Place the Cumulative Update in a folder that will contain the patch files. On older versions of SQL Server, this could comprise the latest Service Pack, Cumulative Updates, as well as additional hotfixes you may wish to apply. For instance, as of this writing, SQL Server 2016 requires Service Pack 2, Cumulative Update 3 for Service Pack 2, and two more hotfixes to bring it up to date. We would have to have all four files in this folder.

The actual path does not matter as long as we keep track of where they are. For the purposes of this post, we will assume they are stored in C:\Temp.

Definitely a good idea if you’re installing SQL Server regularly.

Comments closed

Using Table-Valued Parameters In SQL Server

Published 2018-10-26 by Kevin Feasel

Ben Richardson has a post showing how to create user-defined table types and pass them into stored procedures:

Table-valued parameters were introduced in SQL Server 2008. Before that, there were limited options to pass tabular data to stored procedures. Most developers used one of the following methods:

Data in multiple columns and rows was represented in the form of a series of parameters. However, the maximum number of parameters that can be passed to a SQL Server stored procedure is 2,100. Therefore, in the case of a large table, this method could not be used. Furthermore preprocessing is required on the server side in order to format the individual parameters into a tabular form.
Create multiple SQL statements that can affect multiple rows, such as UPDATE. The statements can be sent to the server individually or in the batched form. Even if they are sent in the batched form, the statements are executed individually on the server.
Another way is to use delimited strings or XML documents to bundle data from multiple rows and columns and then pass these text values to parameterized SQL statements or stored procedures. The drawback of this approach was that you needed to validate the data structure in order to unbundle the values.

The .NET framework then makes it easy to pass in an IEnumerable as a table-valued parameter.

Comments closed

Whither Running Kafka On Kubernetes

Published 2018-10-25 by Kevin Feasel

Gwen Shapira walks through some of the costs and benefits of using Kubernetes to host your Apache Kafka brokers:

First, if you are running most of your other applications and microservices on Kubernetes, it becomes the organizational path of least resistance. This is just like how organizations who standardized on VMs have found it very difficult to allocate physical machines with local disks for Kafka.

I see situations with larger organizations where deploying Kafka outside of Kubernetes causes significant organizational headache that involves many approvals. When this is the case, I usually say that this isn’t a good hill to die on. It is possible to run Kafka on Kubernetes, so just do it. You’ll get your environment allocated faster and will be able to use your time to do productive work rather than fight an organizational battle.
And if things go wrong, you’ll get much better service from your internal infrastructure teams, because you’ll be running in an environment that is familiar to them.

Read on for more benefits as well as a few drawbacks.

Comments closed

Medium-Term Effects Of The Cloudera-Hortonworks Merger

Published 2018-10-25 by Kevin Feasel

Alex Woodie describes some of the ramifications of Cloudera’s merger with Hortonworks:

Whatever camp you sit in, the merger undoubtedly caught the attention of the 2,500 organizations that have adopted Cloudera’s Distribution of Hadoop (CDH) or the Hortonworks Data Platform (HDP) over the years — not to mention the thousands of other companies that have adopted open source Apache Hadoop platforms or Hadoop ecosystem components in the cloud. These Global 2000 companies have invested billions of dollars into building giant clusters to store and process many exabytes worth of data, and they’re not going to just turn them off overnight because the two biggest players suddenly decided to merge.

At the same time, these customers need to be reassured that Cloudera has a plan to maintain the investments they’ve already made in HDP and CDH platforms, both in a short-term or tactical sense, as well as in terms of Cloudera’s long-range strategy to evolve its platform to meet emerging future compute and storage needs.

Read on for more detail.

Comments closed

Spark Streaming On Azure Databricks

Published 2018-10-25 by Kevin Feasel

Tristan Robinson shows us how to run Spark Streaming within Azure Databricks:

Real-time stream processing is becoming more prevalent on modern day data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves the consumption of messages from either queue/files, doing some processing in the middle (querying, filtering, aggregation) and then forwarding the result to a sink – all with a minimal latency. This is in direct contrast to batch processing which usually occurs on an hourly or daily basis. Often is this the case, both of these will need to be combined to create a new data set.

In terms of options for real-time stream processing on Azure you have the following:

Azure Stream Analytics
Spark Streaming / Storm on HDInsight
Spark Streaming on Databricks
Azure Functions

Click through for more.

Comments closed