Hive With LLAP

Kevin Feasel



Carter Shanklin looks at Hive 2’s performance improvements:


  • LLAP enables as fast as sub-second query in Hive by keeping all data and servers running and in-memory all the time, while retaining the ability to scale elastically within a YARN cluster.

  • LLAP, along with Apache Ranger enables fine-grained security for the Hadoop ecosystem, including data masking and filtering, by providing interfaces for external clients like Spark to read.

  • LLAP is great for cloud because it caches data in memory and keeps it compressed, overcoming long cloud storage access times and stretching the amount of data you can fit in RAM.

This sounds very much like a response to Spark.

Biml Text Nuggets

Kevin Feasel



Meagan Longoria has started a series on Biml code nuggets and has started with text nuggets:

Text nuggets evaluate the expression they contain and then replace the text nugget with the string representation of the value of the expression. I use them often to switch out names of packages, tasks, and components as well as source and destination tables in SSIS development when creating packages based upon a design pattern.

Text nuggets start with <#= and end with #>. Notice there is an equals sign at the beginning of the text nugget but not at the end.

Text nuggets are very useful. You can include complex business logic in the expressions. And the expression result can be any data type. The BimlScript compiler will automatically convert it to text before replacing the code nugget with the result. Like all code nuggets, text nuggets can be a single line or multiple lines.

Read the whole thing.

Dealing With Foreign Keys

Kenneth Fisher discusses foreign key constraints:

I have to be certain to delete from OrderDetail first, then Order (to maintain the RI) and then load Order first then OrderDetail. No big deal in this simple example, but what if I’m dealing with a dozen tables? Or I’m only re-loading the parent (Order)?

The easiest thing to do is to disable the foreign key, load your data, and then re-enable the foreign key. You might be tempted to skip that last step but don’t. RI is very important and in fact a trusted foreign key can be used by the optimizer to improve your query plan. It’s easy to say that the application doesn’t have bugs that cause problems with referential integrity. It’s not like you’re ever going insert an OrderDetail without an Order right? Unfortunately it’s far to easy for mistakes to happen. Maybe not in the application, it might be a mistake in an update meant to fix something else. Our job is to protect the data, and RI is an important part of that. So by all means disable a foreign key to help with a load but make sure you turn it back on when you are done.

If a foreign key constraint isn’t trusted, the optimizer won’t be able to assume relational integrity, and so it’s possible that the optimizer could make sub-par choices when joining tables with a foreign key constraint.

Getting Started With Spark

I discuss getting up and running with Databricks Community Edition:

There are a couple of notes with these clusters:

  1. These are not powerful clusters.  Don’t expect to crunch huge data sets with them.  Notice that the cluster has only 6 GB of RAM, so you can expect to get maybe a few GB of data max.

  2. The cluster will automatically terminate after one hour without activity.  The paid version does not have this limitation.

  3. You interact with the cluster using notebooks rather than opening a command prompt.  In practice, this makes interacting with the cluster a little more difficult, as a good command prompt can provide features such as auto-complete.

Databricks Community Edition has a nice interface, is very easy to get up and running and—most importantly—is free.  Read the whole thing.

Too Much NUMA

Denny Cherry discusses the performance downsides of having too many NUMA nodes on a server:

At one client I was working with in 2015 they had a server which was configured very similarly to what I’ve described above. They had 32 cores, with 11 NUMA nodes. One NUMA node has 4 cores, the next 9 had three cores each, while the 11th NUMA node had one core in it.

When monitoring the performance on the server what we saw was that three cores were running at 100% (or close to it) while the other cores were running at about 4%.

The reason for this, is that SQL Server is NUMA aware, and it was trying to get the best performance possible out of the configuration that it had access to.

Moderation in everything.  Also, it’s important to plan growth and check every once in a while for oddities like this.

Visiting Production

Randolph West discusses production access:

During a recent client meeting about a database migration, I realised that I have never logged into a SQL Server on their production environment. My involvement has been strictly dealing with setting up the new environment and log shipping the backups.

I get that I’m not a full-service DBA for this client, but it got me wondering about the many security discussions I’ve seen and participated in, in the past: that not even a junior DBA might need access to production database systems, if it’s not within the scope of his or her work.

Limiting production access is a smart move, but it’s important to realize the downstream consequences:  the people who still have access to production will (at least in the short term) have to perform a lot of the tasks that others were doing previously, including data fixes, research, etc.  It’s important to be prepared for that.

More Power BI Tips

Dustin Ryan has a few more Power BI tips for us:

But what if I want to show a descriptive field as a tooltip? As of now, only measures can be displayed in the tooltip. But have no fear! One way to work around this is to create a custom calculation to display my descriptive field.

If I want to show a description of the classes within the subjects seen in the chart above, I can create a calculation like so:

Classes = CALCULATE(CONCATENATEX(VALUES(‘Grade data'[Class Name]),’Grade data'[Class Name],”, “))

So now I can show the classes within each subject as a tooltip

I really like his synonyms example.  Read the whole thing.

Zeppelin Road Map

Vinay Shulka, et al, discuss recent and forthcoming improvements to Apache Zeppelin:

Over the last year, there have been several key improvements to Apache Zeppelin that have been contributed by a diverse group of developers. Some of the highlights are:

  • Security Features-Authentication, Access Control, LDAP Support
  • Sharing Collaboration- Notebook import/export
  • Interpreters-Noteable R interpreter, and others too numerous to list

The pluggable nature of the Apache Zeppelin interpreter architecture has made it easy to add support for interpreters. Now there are over 30 interpreters supporting everything from Spark, Hive, MySql, and R to things like Geode and HBase.

It’s an exciting time to be in the world of data analysis.

Jepsen: Crate

Kyle Kingsbury checks out Crate, a SQL database built on Elasticsearch:

Building a database on Elasticsearch is something of a double-edged sword. Crate has been able to focus on hard problems like query planning, joins, aggregations, and so on–without having to take on the tough work of building a storage layer, cluster membership, replication algorithm, etc. However, Crate is tightly coupled to Elasticsearch, and is dependent on the Elastic team for improvements to that technology. Elasticsearch’s consistency issues have been well-known for years, and the process to fix them is still ongoing. It’s not clear what Crate can do to get out of this situation: a rewrite would be complex and expensive (and introduce new and unknown failure modes), whereas fixing Elasticsearch’s consistency problems could easily consume person-years of engineering time that a small company can ill-afford.

There are good reasons to use Crate: distributed SQL stores, especially with Crate’s capacity for aggregations and joins, are hard to come by. Moreover, Crate introduces several helpful features not present in Elasticsearch. That said, the risk of data loss is real, and is unlikely to be resolved at any point in the near future. I recommend that Crate users avoid using Crate as their system of record–at least, where each record matters. Like Elasticsearch itself, you should use a safer database as your primary store, and continuously backfill data from that primary store into Crate for querying. Crate may also be suitable for cases where occasional data loss or corruption does is mostly harmless, e.g. high-volume sensor data, observability, analytics, etc.

Every time the Jepsen series gets updated, I make time to read.

Moving SSRS Reports

Andy Mallon fills a need in the community:

We’re just interacting with the SSRS Web Service. I’m using PowerShell, but you could also write a little .NET app to do something similar.

I’m not using any magical SSRS-specific cmdlets. I’m using PowerShell to interact with a web service….that web service just happens to be SSRS.

If you’re still trying to get along with RSScripter (if you can even find a copy any longer), Andy’s code might help you out.


June 2016
« May Jul »