Hadoop – Page 129 – Curated SQL

Reason 3: Poor transparency for teammates

Which jobs are running right now? Which are going to run today? How long do these jobs take? How do I schedule my job? What machine should I schedule it on? These are all questions that are impossible to answer without building custom orchestration around your Cron process – time you’d be better off spending on building a better system.

Matthew then gives us four alternative products.

Comments closed

Securing The Data Plane

Published 2016-08-01 by Kevin Feasel

Michael Schiebel gives an overview of security architecture inside a data lake:

Existing platform based Hadoop architectures make several implicit assumptions on how users interact with the platform such as developmental research versus production applications. While this was perfectly good in a research mode, as we move to a modern data application architecture we need to bring back modern application concepts to the Hadoop ecosystem. For example, existing Hadoop architectures tightly couple the user interface with the source of data. This is done for good reasons that apply in a data discovery research context, but cause significant issues in developing and maintaining a production application. We see this in some of the popular user interfaces such as Kibana, Banana, Grafana, etc. Each user interface is directly tied to a specific type of data lake and imposes schema choices on that data.

Read the whole thing. Also, “Securing the data plane” sounds like a terrible ’90s action film.

Comments closed

Syncing LDAP With Ranger

Published 2016-08-01 by Kevin Feasel

Colm O hEigeartaigh shows how to load users and groups into Apache Ranger from LDAP:

For the purposes of this tutorial, we will use OpenDS as the LDAP server. It contains a domain called “dc=example,dc=com”, and 5 users (alice/bob/dave/oscar/victor) and 2 groups (employee/manager). Victor, Oscar and Bob are employees, Alice and Dave are managers. Here is a screenshot using Apache Directory Studio:

Colm’s scenario uses OpenDS, but you can integrate with Active Directory as well.

Comments closed

Kafka On AWS

Published 2016-07-29 by Kevin Feasel

Alex Loddengaard explains a few things you should think about when deploying Apache Kafka to AWS:

Kafka has built-in fault tolerance by replicating partitions across a configurable number of brokers. However, when a broker fails and a new replacement broker is added, the replacement broker fetches all data the original broker previously stored from other brokers in the cluster that host the other replicas. Depending on your application, this could involve copying tens of gigabytes or terabytes of data. Fetching this data takes time and increases network traffic, which could impact the performance of the Kafka cluster for the period the data transfer is happening.

EBS volumes are persisted when an instance fails or is terminated. When an EC2 instance running a Kafka broker fails or is terminated, the broker’s on-disk partition replicas remain intact and can be mounted by a new EC2 instance. By using EBS, most of the replica data for the replacement broker will already be in the EBS volume and hence won’t need to be transferred over the network. Only data produced since the original broker failed or was terminated will need to be fetched across the network.

There are some good insights here; read the whole thing if you’re thinking about running Kafka.

Comments closed

NodeGroup Performance Issues

Published 2016-07-27 by Kevin Feasel

Babak Behzad explains potential Hadoop NodeGroup performance bottlenecks:

As can be seen in the logs, the localityWaitFactor value is 1, but the delay that this code causes grows linearly with the number of required containers. Since our DFSIO-large benchmark creates 1,024 files, each 1 GB in size, it requests 1,024 YARN containers. Therefore, the code has to miss at least 1,024 scheduling opportunities until it schedules containers on this (wrongly assumed) OFF_SWITCH node.

But why is this delay enforced? This idea falls into a big area of scheduling research. The Delay Scheduling algorithm was introduced by Matei Zaharia’s EuroSys ’10 paper titled “Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling”.

That post is a bit deeper than my Hadoop administration comfort level, but if you’re given the task of performance tuning a cluster, this might be one place to look.

Comments closed

Hive 2.1 Benchmarks

Published 2016-07-26 by Kevin Feasel

Nita Dembla and Gopal Vijayaraghavan compare Hive 2.1 versus Hive 1:

To measure the improvement LLAP brings we ran 15 queries that were taken from the TPC-DS benchmark, similar to what we have done in the past. The entire process was run using the hive-testbench repository and data generation tools. The queries there are adapted to Hive SQL but are otherwise not modified from the standard TPC-DS queries using any of the tricks that some big data vendors routinely use to show better performance for their tools. This blog only covers 15 queries but a more comprehensive performance test is underway.

The full test environment is explored below but at a high level the tests run using 10 powerful VMs with a 1TB dataset that is intended to show performance at data scales commonly used with BI tools. The same VMs and the same data are used both for Hive 1 and for Hive 2. All reported times represent the average across 3 runs in the respective Hive version.

Hive 2.1 looks like a big step forward for Hadoop performance.

Comments closed

Securing Kafka Streams

Published 2016-07-25 by Kevin Feasel

Michael Noll shows security features of Kafka Streams:

First, which security features are available in Apache Kafka, and thus in Kafka Streams? Kafka Streams supports all the client-side security features in Apache Kafka. In this short blog post we cannot cover these client-side security features in full detail, so I recommend reading the Kafka Security chapter in the Confluent Platform documentation and our previous blog post Apache Kafka Security 101 to familiarize yourself with the security features that are currently available in Apache Kafka.

That said, let me highlight a couple of important Kafka security features that are essential for implementing robust data infrastructures, whether these are used for building horizontal services at larger companies, for multi-tenant infrastructures (e.g. microservices), or for shared platforms such as in the Internet of Things. Later on I will then demonstrate an example application where we use some of these security features in Kafka Streams.

It’s important to secure sensitive data, even in “transient” media like Kafka (though the transience of Kafka is user-definable, so “It’ll go away soon” isn’t really a good argument).

Comments closed

Using The Spark-HBase Connector

Published 2016-07-25 by Kevin Feasel

Anunay Tiwari shows how to use the Spark-HBase connector in HDInsight:

The Spark-Hbase Connector provides an easy way to store and access data from HBase clusters with Spark jobs. HBase is really successful for highest level of data scale needs. Thus, existing Spark customers should definitely explore this storage option. Similarly, if the customers are already having HDinsight HBase clusters and they want to access their data by Spark jobs then there is no need to move data to any other storage medium. In both the cases, the connector will be extremely useful.

I’m not the biggest fan of HBase, but if it’s part of your environment, you should definitely look at this Spark connector.

Comments closed

Choosing The Right HDInsight Cluster

Published 2016-07-22 by Kevin Feasel

Josh Fennessy discusses things to consider before you create an HDInsight cluster:

Now for the big question, Windows or Linux?

Linux.

That’s absolutely correct.

Comments closed

Cluster Rebalancing

Published 2016-07-22 by Kevin Feasel

Peter Coates discusses cluster rebalancing in Hadoop:

After adding new racks to our 70 node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?

It didn’t take long to discover the cause—the configuration parameterdfs.datanode.balance.bandwidthPerSecond controls how much bandwidth each node is allowed to use for rebalancing, and it defaults to a conservative value of 10Mb/sec/node, which is 1.25MB/sec. If you have 70 nodes (the number we started with before adding new ones), that’s 87.5MB/second. One terabyte, i.e., a million MB, divided 87.5MB/sec, equals 11,428 sec, or 3.17 hours per TB. The more nodes in the original cluster, the faster it will write.

On the development side, “it’ll automatically rebalance without us having to worry” is a great thing. On the administrative side, we’re paid to worry about these things…

Comments closed

Category: Hadoop

Don’t Use Cron For Scheduling Hadoop Jobs

Reason 3: Poor transparency for teammates

Securing The Data Plane

Syncing LDAP With Ranger

Kafka On AWS

NodeGroup Performance Issues

Hive 2.1 Benchmarks

Securing Kafka Streams

Using The Spark-HBase Connector

Choosing The Right HDInsight Cluster

Now for the big question, Windows or Linux?

Cluster Rebalancing