EMR Now Supports Phoenix

Kevin Feasel



Jonathan Fritz notes that Amazon’s ElasticMapReduce now offers easy support for Apache Phoenix:

Once your HBase table is ready, it’s time to map a table in Phoenix to your data in HBase. You use a JDBC connection to access Phoenix, and there are two drivers included on your cluster under /usr/lib/phoenix/bin. First, the Phoenix client connects directly to HBase processes to execute queries, which requires several ports to be open in your Amazon EC2 Security Group (for ZooKeeper, HBase Master, and RegionServers on your cluster) if your client is off-cluster.

Second, the Phoenix thin client connects to the Phoenix Query Server, which runs on port 8765 on the master node of your EMR cluster. This allows you to use a local client without adjusting your Amazon EC2 Security Groups by creating a SSH tunnel to the master node and using port forwarding for port 8765. The Phoenix Query Server is still a new component, and not all SQL clients can support the Phoenix thin client.

I am not HBase’s biggest fan, but I do think that Phoenix fixes one of the biggest problems HBase has:  its being completely foreign to most data professionals.  It’s not an accident that as a data platform matures, its development language looks more and more like SQL.

How Do You Trace?

Erin Stellato wants to know how you use Profiler and server-side traces:

Back in April I wrote a post asking why people tend to avoid Extended Events.  Many of you provided feedback, which I greatly appreciated.  A common theme in the responses was time.  Many people found that creating an event session in Extended Events that was comparable to one they would create in Trace took longer.  Ok, I get that…so as a follow up I am interested in knowing how you typically use Profiler and Trace.  If you would be willing to share that information, I would be extremely grateful.

Head over there and let her know.  For science!

Tying Meetup To Power BI

Reza Rad shows how to integrate Meetup data into Power BI:

In the documentation of this service in Meetup mentioned that Time column is:

time = UTC start time of the event, in milliseconds since the epoch

That means it is timestamp formatted. Timestamp value is number of seconds from epoch which is 1970-01-01 00:00:00. I have previously written about how to change timestamp value to date time and it is fairly easy with adding seconds to it. However for this case our value is not seconds, it is milliseconds so I have to first divide it by 1000.

This is pretty cool.  We’re starting up a Power BI meetup in the Triangle area, so it’ll be fun hosting a Power BI Meetup where we use Power BI to read Meetup data.


I walk through a scenario which underscores the importance of normalization:

This joins my records to a tally table, which gives one row for each character in RemovedValue (that is, the numbers without recordset separators).  I then retain only the values which start a sequence, and use SUBSTRING to snatch up four digits. What I’m left with is a column named SplitVersion, which has one row for each customer, campaign, and 4-digit value (which is equivalent to my normalized table’s structure).

If that wasn’t exciting enough, we now need to slam this back together into our denormalized format, and that’s what tallyjoin does. It uses the FOR XML PATH trick to concatenate my four-digit values into one string, separated by commas. You might be wondering why I use comma instead of CHAR(30), and the answer is that converting CHAR(30) to XML returns a nasty result, so instead of trying to handle that, I use a character which is copacetic and translate it back using the REPLACE function after casting my “XML” result to varchar.

The implicit story here is, you can find someone who knows how to use tally tables, how to concatenate strings of data (quickly!), who knows how to tie various pieces of the puzzle together, and so on…or design the database the right way and avoid this pain later.

Key Lookup Without Output Column

Daniel Hutmacher gives us a head-scratcher:

Performance tuning the other day, I was stumped by a query plan I was looking at. Even though I had constructed a covering index, I was still getting a Key Lookup operator in my query plan. What I usually do when that happens is to check the operator’s properties to see what its output columns are, so I can include those columns in my covering index.

Here’s the interesting thing: there weren’t any output columns. What happened?

The answer makes perfect sense, and shows that looking at the SELECT clause isn’t enough.

Getting Physical

Denny Cherry explains the value of his having physical hardware on hand:

Now you may be asking why don’t we just do all this in Azure? And we could, but the reason we didn’t is pretty straight forward. Cost. Building tons of VMs in Azure and leaving them running for a few weeks for customers can cost a decent amount pretty quickly, even with smaller VMs. Here our cost is fixed. As long as we don’t need another power circuit (we can probably triple the number of servers before that becomes an issue) the cost is fixed. And if we need more power that’s not all that much per month to add on.

All and all, this will make a really nice resource for our customers to take advantage of, and give us a place to play with whatever we want without spending anything.

Ah, the life-long struggle between cap-x and op-x…

R For The Microsoft Developer

Kevin Feasel



Ginger Grant has a great intro on setting up R Tools for Visual Studio:

For those people who haven’t made the decision as far as which tool to use, let me offer two compelling reasons to pick Visual Studio [VS] instead of R Studio: Intellisense and Improved Debugging Tools. R studio does not have intellisense and it is not possible to debug your code by stepping through it in the manner that many developers of VS are already quite familiar. You will need to configure VS to use R tools, which are detailed below.

Those are nice features.  I’m still a big fan of R Studio, but have seen big improvements in R Tools for Visual Studio, so I imagine I’ll make the switch by the end of the year.

Kitchen Sink Procedures

Aaron Bertrand discusses procedures which satisfy all potential search conditions:

This table has a lot of columns, obviously. Now envision an interface where employees at World Wide Importers could perform searches for specific customers. They may want to perform different searches based on unique combinations of criteria, including (and this is not an exhaustive list by any means, but enough for my purposes today):

  • CustomerID (e.g. pick the customer from a list);
  • actual “bill to” CustomerID;
  • name;
  • category;
  • buying group;
  • when the account was opened;
  • city; or,
  • whether the account is on credit hold.

And any combination of the above. I know you’ve seen and probably built interfaces like this before, but just as a quick mockup, the employees would have a screen something like this:

There are plenty of ways to solve this problem, and Aaron shows a couple methods (including one which has major problems).


June 2016
« May Jul »