Press "Enter" to skip to content

Category: Hadoop

Sparklyr On HDInsight

Ali Zaidi has a walkthrough on using sparklyr on HDInsight:

The majority of Spark is written in Scala (~80% of Spark core), which is a functional programming language. Functional programming languages emphasize functional purity (the output only depends on the inputs) and strive to avoid side-effects. One important component of most functional programming languages is their lazy evaluation. While it might seem odd that we would appreciate laziness from our computing tools, lazy evaluation is an effective way of ensuring computations are evaluated in the most efficient manner possible.

Lazy evaluation allows Spark SQL to highly optimize the queries. When a user submits a query to Spark SQL, Spark composes the components of the SQL query into a logical plan. The logical plan is basically a recipe Spark SQL creates in order to evaluate the desired query. Spark SQL then submits the logical plan to its highly optimized engine called Catalyst, which optimizes this plan into a physical plan of action that is executed inside Spark computation engine (a series of coordinating JVMs).

Read on for more description and code.

Comments closed

Polybase External Data Source To Hadoop

I take a look at connecting to a Hadoop cluster for Polybase:

There are a couple of things I want to point out here.  First, the Type is HADOOP, one of the three types currently available:  HADOOP (for Hadoop, Azure SQL Data Warehouse, and Azure Blob Storage), SHARD_MAP_MANAGER (for sharded Azure SQL Database Elastic Database queries), and RDBMS (for cross-database Elastic Database queries on Azure SQL Database).

Second, the Location is my name node on port 8020.  If you’re curious about how we figure that one out, go to Ambari (which, for me, is http://sandbox.hortonworks.com:8080) and go to HDFS and then Configs.  In the Advanced tab, you can see the name node:

There are different options available for different sources, but this post is focused on Hadoop.

Comments closed

Test Connection With HDInsight

I have a post trying to test a connection using HDInsight:

WebHCat is a web-based REST API for HCatalog, a management layer for dealing with files in HDFS.  If you’re looking for configuration settings for WebHCat, you’ll want generally to look for “templeton” in config files, as Templeton was the project name before WebHCat.  In Ambari, you can go to the Hive configs and look at webhcat-site.xml for configuration settings.  For WebHCat, the default port in HDInsight is 30111, which you should find in the templeton.port configuration setting.

I don’t like the fact that WebHDFS is blocked, but at least WebHCat is functional.

Comments closed

OLAP On Hadoop

Tim Spann discusses OLAP options on the Hadoop stack:

Apache Kylin

For an introduction to this interesting Hadoop project, check out this article.   Apache Kylin originally from eBay, is a Distributed Analytics Engine that provides SQL and OLAP access to Hadoop datasets utilizing Hive and HBase.   It can use called through SparkSQL as well making for a very useful project.   This project let’s you work with PowerBI, Tableau and Excel with more tool support coming soon.    You can doMOLAP cubes and support many users with fast queries over billions of rows.   Apache Kylin provides JDBC and ODBC drivers.

There are a few interesting options here.

Comments closed

Cloudera, Polybase, And Active Directory

Ajay Jagannathan shows how to integrate a SQL Server instance + Polybase with a Cloudera Hadoop cluster, all using Active Directory for accounts:

For all usernames and principals, we will use the suffixes like Cluster14 for name-scalability.

  1. Active Directory setup:
  1. Install OpenLDAP utilities (openldap-clients on RHEL/Centos) on the host of Cloudera Manager server. Install Kerberos client (krb5-workstation on RHEL/Centos) on all hosts of the cluster. This step requires internet connection in Hadoop server. If there is no internet connection in the server, you can download the rpm and install.

This is absolutely worth the read.

Comments closed

Online HDFS Disk Balancer

Lei Xu demonstrates the intra-DataNode disk balancer in HDFS:

By default, the DataNode uses the round-robin-based policy to write new blocks. However, in a long-running cluster, it is still possible for the DataNode to have created significantly imbalanced volumes due to events like massive file deletion in HDFS or the addition of new DataNode disks via the disk hot-swap feature. Even if you use the available-space-based volume-choosing policy instead, volume imbalance can still lead to less efficient disk I/O: For example, every new write will go to the newly-added empty disk while the other disks are idle during the period, creating a bottleneck on the new disk.

Recently, the Apache Hadoop community developed server offline scripts (as discussed inHDFS-1312, the dev@ mailing list, and GitHub) to alleviate the data imbalance issue. However, due to being outside the HDFS codebase, these scripts require that the DataNode be offline before moving data between disks. As a result, HDFS-1312 also introduces an online disk balancer that is designed to re-balance the volumes on a running DataNode based on various metrics. Similar to the HDFS Balancer, the HDFS disk balancer runs as a thread in the DataNode to move the block files across volumes with the same storage types.

This is a good read and sounds like a very useful feature.

Comments closed

Resetting Kafka Topics

I show two methods to clear out a Kafka topic:

The first method works fine for non-production scenarios where you can stop all of the producers and consumers, but let’s say that you want to flush the topic while leaving your producers and consumers up (but maybe you have a downtime window where you know the producers aren’t pushing anything).  In this case, we can change the retention period to something very short, let the queue flush, and bring it back to normal, all using the kafka-configs shell script.

Points deducted for slipping and writing “queue” there, but otherwise, I prefer the second method, as things are still online.  In less-extreme scenarios, you might drop the retention period to a few minutes, especially if your consumers are all caught up.

Comments closed

Kafka Consumer

I build a consumer and aggregator of Kafka data:

From here, I hook into the OnMessage event just like before, and like before we decode the Kafka payload and turn it into a string.  Unlike before, however, I call Newtonsoft’s DeserializeObject method and return a Flight type, which I’ve defined above.  This is the same definition as in the Producer, so in a production-quality environment, I’d pull that out to a single location rather than duplicating it.

Going back to the main function, I call the consumer.Start() method and let ‘er rip.  When I’m ready to aggregate, I’ll hit the enter key and that’ll call consumer.Stop().  When that happens, I’m going to have up to 7 million records in a list called flights.  Out of all of this information, I only need two attributes:  the destination state and the arrival delay in minutes.  I get those by using the map function on my sequence of flights, taking advantage of F#’s match syntax to get all relevant scenarios safely and put the result into a tuple.  The resulting sequence of tuples is called flightTuple.  I pass that into the delaysByState function.

By the time I give this presentation, I’m going to change the way I aggregate just a little bit to cut down on the gigs of RAM necessary to do this operation.  But hey, at least it works…

Comments closed

Spark Overview

Jen Underwood provides an overview of the Apache Spark project:

Spark provides a comprehensive framework to manage big data processing with a variety of data set types including text and graph data. It can also handle batch pipelines and real-time streaming data. Using Spark libraries, you can create big data analytics apps in Java, Scala, Clojure, and popular R and Python languages.

Spark brings analytics pros an improved MapReduce type query capability with more performant data processing in memory or on disk. It can be used with datasets that are larger than the aggregate memory in a cluster. Spark also has savvy lazy evaluation of big data queries which helps with workflow optimization and reuse of intermediate results in memory. TheSpark API is easy to learn.

One of my taglines is, Spark is not the future of Hadoop; Spark is the present of Hadoop.  If you want to get into this space, learn how to work with Spark.

Comments closed