Updating Hive Tables

Kevin Feasel



Carter Shanklin gives us a few patterns for updating tables in Hive:

Historically, keeping data up-to-date in Apache Hive required custom application development that is complex, non-performant, and difficult to maintain. HDP 2.6 radically simplifies data maintenance with the introduction of SQL MERGE in Hive, complementing existing INSERT, UPDATE, and DELETE capabilities.

This article shows how to solve common data management problems, including:

  • Hive upserts, to synchronize Hive data with a source RDBMS.

  • Update the partition where data lives in Hive.

  • Selectively mask or purge data in Hive.

This isn’t the Hive of 2013; it’s much closer to a real-time warehouse.

Event Hub Performance Tips

Vincent-Philippe Lauzon has a few tips for improving Azure Event Hub performance:

Here are some recommendations in the light of the performance and throughput results:

  • If we send many events:  always reuse connections, i.e. do not create a connection only for one event.  This is valid for both AMQP and HTTP.  A simple Connection Pool pattern makes this easy.
  • If we send many events & throughput is a concern:  use AMQP.
  • If we send few events and latency is a concern:  use HTTP / REST.
  • If events naturally comes in batch of many events:  use batch API.
  • If events do not naturally comes in batch of many events:  simply stream events.  Do not try to batch them unless network IO is constrained.
  • If a latency of 0.1 seconds is a concern:  move the call to Event Hubs away from your critical performance path.

Let’s now look at the tests we did to come up with those recommendations.

Read the whole thing.

Privacy-Preserving Data Mining

Duncan Greaves gives us a few options for mining data while maintaining user anonymity:

In pseudonymisation, matching data sets at individual row level is done using key fields, which are then pseudonymised for consumption. Candidates for key fields include those combinations that are most often used to match the datasets, e.g. DoB/Gender/Postcode, credit card numbers, IP addresses or email identifiers. Allocation of persistent pseudonyms are used to build up profiles over time to allow data mining to happen in a privacy sensitive way.

All methods for privacy aware data mining carry additional complexity associated with creating pools of data from which secondary use can be made, without compromising the identity of the individuals who provided the data. Pseudonymisation can act as the best compromise between full anonymisation and identity in many scenarios where it is essential that the identity is preserved, whilst minimising the risks of re-identification beyond reasonable means.

Read the whole thing.

Aim For Database-First Designs

Lukas Eder explains why database-first design patterns tend to work better than code-first design:

The real “truth” of your database schema, and the “sovereignty” over it, resides with your database. The database is the only place where the schema is defined, and all clients have a copy of the database schema, not vice versa. The data is in your database, not in your client, so it makes perfect sense to enforce the schema and its integrity in the database, right where the data is.

This is old wisdom, nothing new. Primary and unique keys are good. Foreign keys are good. Check constraints are good. Assertions (when they’re finally implemented) are good.

And that’s not where it ends. For instance, if you’re using Oracle, you may want to specify:

  • In what tablespace your table resides
  • What PCTFREE value it has
  • What the cache size of your sequence (behind the identity) is

Maybe, all of this doesn’t matter in small systems, but you don’t have to go “big data” before you can profit from vendor-specific storage optimisations as the above. None of the ORMs I’ve ever seen (including jOOQ) will allow you to use the full set of DDL options that you may want to use on your database. ORMs offer some tools to help you write DDL.

But ultimately, a well-designed schema is hand written in DDL. All generated DDL is only an approximation of that.

It’s a great post.  Also check out Lukas’s responses in the comments section.

Starting And Stopping SQL Server From The Command Line

Marek Masko shows us how to start and stop SQL Server instances in Windows and Linux from the command line:

On Linux, we don’t have yet named instances, so all commands are executed against default instance. SQL Server can be managed using thesystemctl command.

To check the current state of SQL Server instance you can run this command:

  • sudo systemctl status mssql-server

Click through for more.


Paul Randal explains the CXCONSUMER wait type:

Back in September 2016 I created a Connect item (3102145) to split the CXPACKET wait into benign and actionable waits, so that when a CXPACKET wait happens, there’s really something to investigate. Unfortunately the Connect website has been shut down by Microsoft so I can’t point you to the original request I made (I also tried in the Internet Archive but couldn’t find it).

The reason I wanted this change is that CXPACKET waits have always been registered by both producer and consumer threads for query plan operators where some threads produce data (i.e. producer threads) and some threads consume the produced data (i.e. consumer threads). However, the waits for the consumer threads are not actionable, because it’s the *producer* threads that are the cause of the consumer thread waits, and so it’s the producer thread waits that are actionable. By splitting the consumer waits out, the number of CXPACKET waits should be reduced, and those that are left should be investigated.

During the PASS Summit in 2017, my friend Pedro Lopes (b|t) on the Tiger Team announced that they’d made the change. The new wait type, CXCONSUMER, was initially added to SQL Server 2017 RTM CU3 and was added to SQL Server 2016 SP2 in April. You can read Pedro’s blog post here. Note that there was initially a bug with the implementation which was fixed in SQL Server 2017 RTM CU4.

It’s nice to see this split, as CXPACKET has been one of the most ambiguous wait types for a while, due to it having multiple potential meanings.

Making Power BI Reports Accessible

Meagan Longoria has a checklist for ensuring that your Power BI reports are accessible to all users:


  • Don’t use tooltips to convey important information. Users with motor issues and users who do not use a mouse will have difficulties accessing them.

  • Do add tooltips to charts as ancillary information. It is included in the accessible Show Data table for each visual.

There’s a lot of good information here.


Kevin Feasel



Randolph West recommends using RAISERROR WITH NOWAIT rather than PRINT for printing messages:

Read that last line again. It’s saying that the PRINT command needs to fill some sort of memory buffer before you see any output. Sometimes you’re lucky, and sometimes you’re not. Erland Sommarskog says the same thing in the series Error and Transaction Handling in SQL Server (Part 1: Jumpstart Error Handling | Part 2: Commands and Mechanisms | Part 3: Implementation).

If you want immediate output from your long-running process, use RAISERROR instead of PRINT, and make use of the WITH NOWAIT parameter.

Naturally I can’t leave you hanging with a 150-word blog post, most of which is quoting Erik, so I decided to do some digging. The question I posed myself was “How big is the PRINT buffer?”

I always use error level 10, as that’s the minimum level guaranteed to print to the console immediately and it does not write to error logs or stop operations.


June 2018
« May Jul »