Press "Enter" to skip to content

Category: Hadoop

Explaining Yarn Container Memory Allocations

Skumar T explains container sizes in Yarn:

So jobs on yarn cluster runs in individual containers which is allocated by Node Manager which in turn gets permissions from Resource Manager.

So few configuration parameters of node manager those are important in context of jobs running in the containers.

–>yarn.nodemanager.resource.memory-mb  8192(value)

Amount of physical memory, in MB, that can be allocated for containers.

–>yarn.nodemanager.pmem-check-enabled  true(value)

Whether physical memory limits will be enforced for containers.

The bottom half of the article goes into an extended example.

Comments closed

Developing Spark Applications In .NET

Kaarthik Sivashanmugam talks about Mobius, a Microsoft-driven .NET wrapper for Spark:

The C# language binding to Spark is similar to the Python and R bindings. In fact, Mobius follows the same design pattern and leverages the existing implementation of language binding components in Spark where applicable for consistency and reuse. The following picture shows the dependency between the .NET application and the C# API in Mobius, which internally depends on Spark’s public API in Scala and Java and extends PythonRDD from PySpark to implement CSharpRDD.

Looks like there’s some fuzziness on just how well F# is supported.  Still, this is very exciting as a way of bridging the gap for .NET developers.

Comments closed

Optimizing HBase In HDInsight

Ashish Thapliyal links to a 30-minute presentation on HBase optimization:

This session was presented by Nitin Verma (Sr. Software Engineer) and Pravin Mittal (Principal Engineering Manager) @ HBaseCon 2016. The session goes deeper into success story of enabling a big internal customer on HDInsight HBase.

HBase design is a totally different mindset from relational design, so you have to unlearn a lot of habits when moving over to it.

Comments closed

Spatial Functions In Hive

Constantin Stanca has a couple of posts on using Hive to implement geospatial queries.  First, an overview:

The Esri Geometry API for Java includes geometry objects (e.g. points, lines, and polygons), spatial operations (e.g. intersects, buffer), and spatial indexing. By deploying the library (as a jar) within Hadoop, you are able to build custom MapReduce applications using Java to complete analysis on your spatial data. This can be used as a standalone library, or combined with Spatial Framework for Hadoop to create a SQL like experience.

The Spatial Framework for Hadoop includes among others, the Hive Spatial library with User-Defined Functions and SerDes for spatial analysis in Hive. By enabling this library in Hive, you are able to construct queries using Hive Query Language (HQL), which is very similar to SQL. This allows you to avoid complicated MapReduce algorithms and stick to a more familiar workflow. The API used by the Hive UDF’s could be used by developers building geometry functions for 3rd-party applications using Storm, Spark, HBase etc.

He follows that up with some pieces hive misses compared to SQL Server, Oracle, etc.:

As discussed with ESRI recently, there are no plans to open source all spatial functions currently available for traditional RDBMS like Oracle, SQL Server, or Netezza, as those are commercially licensed packages. The best option to compensate for the 5-10% missing functions is to contribute to ESRI’s open source repository: ESRI does not provide a commercial library for Hive including all spatial functions.

Be sure to check out that second link to get an understanding of exactly what’s missing.  Via Mark Herring.

Comments closed

Ingesting E-Mail Into Hadoop

Jordan Volz and Stefan Salandy show how to feed e-mails into Hadoop for almost-immediate analysis:

In particular, compliance-related use cases centered on electronic forms of communication, such as archiving, supervision, and e-discovery, are extremely important in financial services and related industries where being “out of compliance” can result in hefty fines. For example, financial institutions are under regulatory pressure to archive all forms of e-communication (email, IM, social media, proprietary communication tools, and so on) for a set period of time. Once data has grown past its retention period, it can then be permanently removed; in the meantime, such data is subject to e-discovery requests and legal holds. Even outside of compliance use cases, most large organizations that are subject to litigation have some form of archive in place for purposes of e-discovery.

Traditional solutions in this area comprise various moving parts and can be quite costly and complex to implement, maintain, and upgrade. By using the Hadoop stack to take advantage of cost-efficient distributed computing, companies can expect significant cost savings and performance benefits.

In this post, as a simple example of this use case, I’ll describe how to set up an open source, real-time ingestion pipeline from the leading source of electronic communication, Microsoft Exchange.

Most of this post is about setting up the interconnections between Exchange and Apache James, and feeding data in.  It looks like this will be part 1 of a multi-part series.

Comments closed

Don’t Use Cron For Scheduling Hadoop Jobs

Matthew Rathbone explains why cron is not a great choice for scheduling Hadoop and Spark jobs:

Reason 3: Poor transparency for teammates

Which jobs are running right now? Which are going to run today? How long do these jobs take? How do I schedule my job? What machine should I schedule it on? These are all questions that are impossible to answer without building custom orchestration around your Cron process – time you’d be better off spending on building a better system.

Matthew then gives us four alternative products.

Comments closed

Securing The Data Plane

Michael Schiebel gives an overview of security architecture inside a data lake:

Existing platform based Hadoop architectures make several implicit assumptions on how users interact with the platform such as developmental research versus production applications.  While this was perfectly good in a research mode, as we move to a modern data application architecture we need to bring back modern application concepts to the Hadoop ecosystem.  For example, existing Hadoop architectures tightly couple the user interface with the source of data.  This is done for good reasons that apply in a data discovery research context, but cause significant issues in developing and maintaining a production application.  We see this in some of the popular user interfaces such as Kibana, Banana, Grafana, etc.  Each user interface is directly tied to a specific type of data lake and imposes schema choices on that data.

Read the whole thing.  Also, “Securing the data plane” sounds like a terrible ’90s action film.

Comments closed

Syncing LDAP With Ranger

Colm O hEigeartaigh shows how to load users and groups into Apache Ranger from LDAP:

For the purposes of this tutorial, we will use OpenDS as the LDAP server. It contains a domain called “dc=example,dc=com”, and 5 users (alice/bob/dave/oscar/victor) and 2 groups (employee/manager). Victor, Oscar and Bob are employees, Alice and Dave are managers. Here is a screenshot using Apache Directory Studio:

Colm’s scenario uses OpenDS, but you can integrate with Active Directory as well.

Comments closed

Kafka On AWS

Alex Loddengaard explains a few things you should think about when deploying Apache Kafka to AWS:

Kafka has built-in fault tolerance by replicating partitions across a configurable number of brokers. However, when a broker fails and a new replacement broker is added, the replacement broker fetches all data the original broker previously stored from other brokers in the cluster that host the other replicas. Depending on your application, this could involve copying tens of gigabytes or terabytes of data. Fetching this data takes time and increases network traffic, which could impact the performance of the Kafka cluster for the period the data transfer is happening.

EBS volumes are persisted when an instance fails or is terminated. When an EC2 instance running a Kafka broker fails or is terminated, the broker’s on-disk partition replicas remain intact and can be mounted by a new EC2 instance. By using EBS, most of the replica data for the replacement broker will already be in the EBS volume and hence won’t need to be transferred over the network. Only data produced since the original broker failed or was terminated will need to be fetched across the network.

There are some good insights here; read the whole thing if you’re thinking about running Kafka.

Comments closed

NodeGroup Performance Issues

Babak Behzad explains potential Hadoop NodeGroup performance bottlenecks:

As can be seen in the logs, the localityWaitFactor value is 1, but the delay that this code causes grows linearly with the number of required containers. Since our DFSIO-large benchmark creates 1,024 files, each 1 GB in size, it requests 1,024 YARN containers. Therefore, the code has to miss at least 1,024 scheduling opportunities until it schedules containers on this (wrongly assumed) OFF_SWITCH node.

But why is this delay enforced? This idea falls into a big area of scheduling research. The Delay Scheduling algorithm was introduced by Matei Zaharia’s EuroSys ’10 paper titled “Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling”.

That post is a bit deeper than my Hadoop administration comfort level, but if you’re given the task of performance tuning a cluster, this might be one place to look.

Comments closed