To measure the improvement LLAP brings we ran 15 queries that were taken from the TPC-DS benchmark, similar to what we have done in the past. The entire process was run using the hive-testbench repository and data generation tools. The queries there are adapted to Hive SQL but are otherwise not modified from the standard TPC-DS queries using any of the tricks that some big data vendors routinely use to show better performance for their tools. This blog only covers 15 queries but a more comprehensive performance test is underway.
The full test environment is explored below but at a high level the tests run using 10 powerful VMs with a 1TB dataset that is intended to show performance at data scales commonly used with BI tools. The same VMs and the same data are used both for Hive 1 and for Hive 2. All reported times represent the average across 3 runs in the respective Hive version.
Hive 2.1 looks like a big step forward for Hadoop performance.
First, which security features are available in Apache Kafka, and thus in Kafka Streams? Kafka Streams supports all the client-side security features in Apache Kafka. In this short blog post we cannot cover these client-side security features in full detail, so I recommend reading the Kafka Security chapter in the Confluent Platform documentation and our previous blog post Apache Kafka Security 101 to familiarize yourself with the security features that are currently available in Apache Kafka.
That said, let me highlight a couple of important Kafka security features that are essential for implementing robust data infrastructures, whether these are used for building horizontal services at larger companies, for multi-tenant infrastructures (e.g. microservices), or for shared platforms such as in the Internet of Things. Later on I will then demonstrate an example application where we use some of these security features in Kafka Streams.
It’s important to secure sensitive data, even in “transient” media like Kafka (though the transience of Kafka is user-definable, so “It’ll go away soon” isn’t really a good argument).
The Spark-HBase Connector provides an easy way to store and access data in HBase clusters from Spark jobs. HBase has proven successful at the largest data scales, so existing Spark customers should definitely explore this storage option. Similarly, customers who already have HDInsight HBase clusters and want to access their data from Spark jobs do not need to move that data to any other storage medium. In both cases, the connector will be extremely useful.
I’m not the biggest fan of HBase, but if it’s part of your environment, you should definitely look at this Spark connector.
Now for the big question, Windows or Linux?
That’s absolutely correct.
After adding new racks to our 70 node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?
It didn’t take long to discover the cause: the configuration parameter dfs.datanode.balance.bandwidthPerSec controls how much bandwidth each node is allowed to use for rebalancing, and it defaults to a conservative value of 10Mb/sec/node, which is 1.25MB/sec. If you have 70 nodes (the number we started with before adding new ones), that’s 87.5MB/sec. One terabyte, i.e., a million MB, divided by 87.5MB/sec, equals 11,428 sec, or 3.17 hours per TB. The more nodes in the original cluster, the faster it will write.
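The arithmetic in that excerpt can be reproduced with a short Python sketch. The function name and parameters here are made up for illustration; it only restates the calculation quoted above (per-node limit in megabits, converted to megabytes, multiplied by the number of source nodes) and is not a model of how the HDFS balancer actually schedules transfers.

```python
def rebalance_hours(nodes: int, mbit_per_sec_per_node: float, data_mb: float) -> float:
    """Estimate hours needed to move data_mb when every node is throttled.

    Hypothetical helper: assumes all nodes send at their full allowed rate.
    """
    mb_per_sec_per_node = mbit_per_sec_per_node / 8   # megabits -> megabytes
    aggregate_mb_per_sec = nodes * mb_per_sec_per_node
    seconds = data_mb / aggregate_mb_per_sec
    return seconds / 3600


# Figures from the excerpt: 70 nodes, 10Mb/sec/node, 1TB taken as 1,000,000MB.
hours = rebalance_hours(nodes=70, mbit_per_sec_per_node=10, data_mb=1_000_000)
print(f"{hours:.2f} hours per TB")  # prints "3.17 hours per TB"
```

Plugging in a higher per-node limit shows how quickly the estimate drops, which is a handy sanity check before changing the setting on a live cluster.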
On the development side, “it’ll automatically rebalance without us having to worry” is a great thing. On the administrative side, we’re paid to worry about these things…
If your tables are large it may not be possible to restore them under a different name due to space constraints. The really powerful thing about HBase backups is they are stored in WAL files that can be parsed using a simple interface that can be consumed either in Java or using the “hbase wal” utility.
Consider this scenario: A customer rep deleted some data because he thought it was unimportant. A week later the customer is upset because the data was important and you need to restore these few pieces of information. With HBase backups all you need to do is parse through the backups with a WAL reader and extract the historical values, which you can then add back in. With other databases you would have to bring another database instance online and load the backups into it. Having backups in open, well-understood formats unlocks many powerful opportunities and can bring recovery times down from days to minutes.
Read on if you manage a Hadoop cluster with HBase (or you’re likely to administer one soon).
Numbers suggest that Storm has come a long way in terms of performance, but it still has room to go faster. Here are some of the broad areas that should improve performance in the future:
An effort to rewrite much of Storm’s Clojure code in Java is underway. Profiling has shown many hotspots in Clojure code.
Better scheduling of workers. Yahoo is experimenting with a Load Aware Scheduler for Storm to be smarter about the way in which topologies are scheduled on the cluster.
Based on microbenchmarking and discussions with other Storm developers there appears potential for streamlining the internal queueing for faster message transfer.
Operator coalescing (executing consecutive spouts/bolts in a single thread when possible) is another area that could reduce intertask messaging and improve throughput.
Even with these potential improvements, Storm has come a long way—their benchmarks show around 5x throughput and a tiny fraction of the latency of Storm 0.9.1.
Table-Valued Parameters (TVPs)
TVP support allows a client application to send parameterized data to the server more efficiently by sending multiple rows to the server with a single call. You can use the JDBC Driver 6.0 to encapsulate rows of data in a client application and send the data to the server in a single parameterized command.
There are a couple of interesting features in this driver which could help your Hadoop cluster play nice with SQL Server.
Third, how many instances can or should you run for your application? Is there an upper limit for the number of instances and, similarly, for the parallelism of your application? In a nutshell, the parallelism of a Kafka Streams application, like the parallelism of Kafka itself, is primarily determined by the number of partitions of the input topic(s) from which your application is reading. For example, if your application reads from a single topic that has 10 partitions, then you can run up to 10 instances of your application (note that you can run further instances but these will be idle). In summary, the number of topic partitions is the upper limit for the parallelism of your Kafka Streams application and thus for the number of running instances of your application. Note: A scaling/parallelism caveat here is that the balance of the processing work between application instances depends on how well data messages are balanced between partitions.
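As a rough illustration of that rule, the Python sketch below counts how many instances would do work and how many would sit idle for a given partition count. The function plan_instances is hypothetical, and the even spread of partitions is a simplifying assumption; it is not Kafka's actual partition assignment logic.

```python
def plan_instances(partitions: int, instances: int):
    """Return (active, idle, partitions_per_active_instance).

    Hypothetical helper: assumes a single input topic and an even spread.
    """
    active = min(partitions, instances)   # partitions cap the parallelism
    idle = instances - active             # extra instances get no work
    if active == 0:
        return 0, idle, []
    base, extra = divmod(partitions, active)
    # The first `extra` instances each take one additional partition.
    loads = [base + 1 if i < extra else base for i in range(active)]
    return active, idle, loads


print(plan_instances(10, 4))   # (4, 0, [3, 3, 2, 2])
print(plan_instances(10, 12))  # (10, 2, [1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
```

The second call matches the example in the excerpt: with 10 partitions, instances 11 and 12 have nothing to read.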
Check it out. Kafka Streams is a potential alternative to Spark Streaming and Storm for real-time (for some definition of “real-time”) distributed computing.
I like to think of Pig as a high-level Map/Reduce commands pipeline. As a former SQL programmer, I find it quite intuitive, and at my organization our Hadoop jobs are still mostly developed in Pig.
Pig has a lot of qualities: it is stable, scales very well, and integrates natively with the Hive metastore HCatalog. By describing each step atomically, it minimizes conceptual bugs that you often find in complicated SQL code.
But sometimes Pig has limitations that make it a poor programming paradigm for your needs.
Philippe includes a couple of examples in Pig, PySpark, and SparkSQL. Even if you aren’t familiar with Pig, this is a good article to help familiarize yourself with Spark.